{{howto_header}}


{{note|1=This is the second edition of the original [[Red Hat Cluster Service 2 Tutorial]]. This version is updated to use the Red Hat Cluster Suite, Stable version 3. It replaces [[Xen]] in favour of [[KVM]] to stay in-line with [[Red Hat]]'s supported configuration. It also uses [[corosync]], replacing [[openais]], as the core cluster communication stack.}}


This paper has one goal;

* Creating a 2-node, high-availability cluster hosting [[KVM]] virtual machines using [[RHCS]] "stable 3" with [[DRBD]] and clustered [[LVM]] for synchronizing storage data. This is an updated version of the earlier [[Red Hat Cluster Service 2 Tutorial]]. You will find much in common with that tutorial if you've previously followed that document. Please don't skip large sections though. There are some differences that are subtle but important.

Grab a coffee, put on some nice music and settle in for some geekly fun.
 
= The Task Ahead =
 
Before we start, let's take a few minutes to discuss clustering and its complexities.
 
== Technologies We Will Use ==
 
* ''Red Hat Enterprise Linux 6'' ([[EL6]]); You can use a derivative like [[CentOS]] v6.
* ''Red Hat Cluster Services'' "Stable" version 3. This describes the following core components:
** ''Corosync''; Provides cluster communications using the [[totem]] protocol.
** ''Cluster Manager'' (<span class="code">[[cman]]</span>); Manages the starting and stopping of the cluster and its core components.
** ''Resource Manager'' (<span class="code">[[rgmanager]]</span>); Manages cluster resources and services. Handles service recovery during failures.
** ''Clustered Logical Volume Manager'' (<span class="code">[[clvm]]</span>); Cluster-aware (disk) volume manager. Backs [[GFS2]] [[filesystem]]s and [[KVM]] virtual machines.
** ''Global File Systems'' version 2 (<span class="code">[[gfs2]]</span>); Cluster-aware, concurrently mountable file system.
* ''Distributed Redundant Block Device'' ([[DRBD]]); Keeps shared data synchronized across cluster nodes.
* ''KVM''; [[Hypervisor]] that controls and supports virtual machines.
 
== A Note on Patience ==
 
There is nothing inherently hard about clustering. However, there are many components that you need to understand before you can begin. The result is that clustering has an inherently steep learning curve.
 
You '''must''' have patience. Lots of it.
 
Many technologies can be learned by creating a very simple base and then building on it. The classic "Hello, World!" script created when first learning a programming language is an example of this. Unfortunately, there is no real analogue to this in clustering. Even the most basic cluster requires several pieces be in place and working together. If you try to rush by ignoring pieces you think are not important, you will almost certainly waste time. A good example is setting aside [[fencing]], thinking that your test cluster's data isn't important. The cluster software has no concept of "test". It treats everything as critical all the time and ''will'' shut down if anything goes wrong.
 
Take your time, work through these steps, and you will have the foundation cluster sooner than you realize. Clustering is fun '''because''' it is a challenge.
 
== Prerequisites ==
 
It is assumed that you are familiar with Linux systems administration, specifically [[Red Hat]] [[Enterprise Linux]] and its derivatives. You will need to have somewhat advanced networking experience as well. You should be comfortable working in a terminal (directly or over <span class="code">[[ssh]]</span>). Familiarity with [[XML]] will help, but is not strictly required as its use here is pretty self-evident.
 
If you feel a little out of depth at times, don't hesitate to set this tutorial aside. Branch over to the components you feel the need to study more, then return and continue on. Finally, and perhaps most importantly, you '''must''' have patience! If you have a manager asking you to "go live" with a cluster in a month, tell him or her that it simply '''won't happen'''. If you rush, you will skip important points and '''you will fail'''.
 
Patience is vastly more important than any pre-existing skill.
 
== Focus and Goal ==
 
There is a different cluster for every problem. Generally speaking though, there are two main problems that clusters try to resolve; Performance and High Availability. Performance clusters are generally tailored to the application requiring the performance increase. There are some general tools for performance clustering, like [[Red Hat]]'s [[LVS]] (Linux Virtual Server) for load-balancing common applications like the [[Apache]] web-server.
 
This tutorial will focus on High Availability clustering, often shortened to simply '''HA''' and not to be confused with the [[Linux-HA]] "heartbeat" cluster suite, which we will not be using here. The cluster will provide shared file systems and will provide high availability for [[KVM]]-based virtual servers. The goal will be to have the virtual servers live-migrate during planned node outages and automatically restart on a surviving node when the original host node fails.
 
Below is a ''very'' brief overview;
 
High Availability clusters like ours have two main parts; Cluster management and resource management.
 
The cluster itself is responsible for maintaining the cluster nodes in a group. This group is part of a "Closed Process Group", or [[CPG]]. When a node fails, the cluster manager must detect the failure, reliably eject the node from the cluster using fencing and then reform the CPG. Each time the cluster changes, or "re-forms", the resource manager is called. The resource manager checks to see how the cluster changed, consults its configuration and determines what to do, if anything.
 
The details of all this will be discussed in detail a little later on. For now, it's sufficient to have in mind these two major roles and understand that they are somewhat independent entities.
 
== Platform ==
 
This tutorial was written using [[RHEL]] version 6.1 and [[CentOS]] version 6.0 [[x86_64]]. No attempt was made to test on [[i686]] or other [[EL6]] derivatives. That said, there is no reason to believe that this tutorial will not apply to any variant. As much as possible, the language will be distro-agnostic. It is advised that you use an [[x86_64]] (64-[[bit]]) platform if at all possible.
 
== A Word On Complexity ==
 
Introducing the <span class="code">Fabimer Principle</span>:
 
Clustering is not inherently hard, but it is inherently complex. Consider;
 
* Any given program has <span class="code">N</span> bugs.
** [[RHCS]] uses; <span class="code">cman</span>, <span class="code">corosync</span>, <span class="code">dlm</span>, <span class="code">fenced</span>, <span class="code">rgmanager</span>, and many more smaller apps.
** We will be adding <span class="code">DRBD</span>, <span class="code">GFS2</span>, <span class="code">clvmd</span>, <span class="code">libvirtd</span> and <span class="code">KVM</span>.
** Right there, we have <span class="code">N^10</span> possible bugs. We'll call this <span class="code">A</span>.
* A cluster has <span class="code">Y</span> nodes.
** In our case, <span class="code">2</span> nodes, each with <span class="code">3</span> networks across <span class="code">6</span> interfaces bonded into pairs.
** The network infrastructure (Switches, routers, etc). We will be using two managed switches, adding another layer of complexity.
** This gives us another <span class="code">Y^(2*(3*2))+2</span>, the <span class="code">+2</span> for managed switches. We'll call this <span class="code">B</span>.
* Let's add the human factor. Let's say that a person needs roughly 5 years of cluster experience to be considered proficient. For each year less than this, add a <span class="code">Z</span> "oops" factor, <span class="code">(5-Z)^2</span>. We'll call this <span class="code">C</span>.
* So, finally, add up the complexity, using this tutorial's layout, 0-years of experience and managed switches.
** <span class="code">(N^10) * (Y^(2*(3*2))+2) * ((5-0)^2) == (A * B * C)</span> == an-unknown-but-big-number.
 
This isn't meant to scare you away, but it is meant to be a sobering statement. Obviously, those numbers are somewhat artificial, but the point remains.
 
Any one piece is easy to understand, thus, clustering is inherently easy. However, given the large number of variables, you must really understand all the pieces and how they work together. '''''DO NOT''''' think that you will have this mastered and working in a month. Certainly don't try to sell clusters as a service without a ''lot'' of internal testing.
 
Clustering is kind of like chess. The rules are pretty straightforward, but the complexity can take some time to master.
 
= Overview of Components =
 
When looking at a cluster, there is a tendency to want to dive right into the configuration file. That is not very useful in clustering.
 
* When you look at the configuration file, it is quite short.
 
It isn't like most applications or technologies though. Most of us learn by taking something, like a configuration file, and tweaking it this way and that to see what happens. I tried that with clustering and learned only what it was like to bang my head against the wall.
 
* Understanding the parts and how they work together is critical.
 
You will find that the discussion on the components of clustering, and how those components and concepts interact, will be much longer than the initial configuration. It is true that we could talk very briefly about the actual syntax, but it would be a disservice. Please, don't rush through the next section or, worse, skip it and go right to the configuration. You will waste far more time than you will save.
 
* Clustering is easy, but it has a complex web of inter-connectivity. You must grasp this network if you want to be an effective cluster administrator!
 
== Component; cman ==
 
This was, traditionally, the <span class="code">c</span>luster <span class="code">man</span>ager. In the 3.0 series, which is what all versions of [[EL6]] will use, <span class="code">cman</span> acts mainly as a [[quorum]] provider, tallying votes and deciding on a critical property of the cluster: quorum. As of the 3.1 series, which future [[EL]] releases will use, <span class="code">cman</span> will be removed entirely.
 
The <span class="code">cman</span> service is used to start and stop the cluster communication, membership, locking, fencing and other cluster foundation applications.
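
For reference, once <span class="code">cman</span> is running you can inspect its view of the cluster with the <span class="code">cman_tool</span> utility. This is only a quick sketch of the sort of checks we will rely on later; the exact output depends on your configuration.

<source lang="bash">
# Show quorum state, vote counts and the cluster name as cman sees them.
cman_tool status

# List the known nodes and whether they are currently members.
cman_tool nodes
</source>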
 
== Component; corosync ==
 
Corosync is the heart of the cluster. Almost all other cluster components operate through it.
 
In Red Hat clusters, <span class="code">corosync</span> is configured via the central <span class="code">cluster.conf</span> file. It can be configured directly in <span class="code">corosync.conf</span>, but given that we will be building an RHCS cluster, we will only use <span class="code">cluster.conf</span>. That said, almost all <span class="code">corosync.conf</span> options are available in <span class="code">cluster.conf</span>. This is important to note as you will see references to both configuration files when searching the Internet.
 
Corosync sends messages using [[multicast]] messaging by default. Recently, [[unicast]] support has been added, but due to network latency, it is only recommended for use with small clusters of two to four nodes. We will be using [[multicast]] in this tutorial.
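
As a quick aside, once the cluster is formed you can check the state of the [[totem]] ring(s) with <span class="code">corosync-cfgtool</span>. This is only a sketch of a diagnostic we will not need until the cluster is actually running.

<source lang="bash">
# Print the status of each configured totem ring, including its ring ID
# and whether any faults have been recorded.
corosync-cfgtool -s
</source>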
 
=== A Little History ===
 
There were significant changes between [[RHCS]] version 2, which the original tutorial used, and version 3, which we are using here and which is available on [[EL6]] and recent [[Fedora]]s.
 
In RHCS version 2, there was a component called <span class="code">openais</span> which provided <span class="code">totem</span>. The OpenAIS project was designed to be the heart of the cluster and was based around the [http://www.saforum.org/ Service Availability Forum]'s [http://www.saforum.org/Application-Interface-Specification~217404~16627.htm Application Interface Specification]. AIS is an open [[API]] designed to provide inter-operable high availability services.
 
In 2008, it was decided that the AIS specification was overkill for most clustered applications being developed in the open source community.  At that point, OpenAIS was split into two projects: Corosync and OpenAIS. The former, Corosync, provides totem, cluster membership, messaging, and basic APIs for use by clustered applications, while the OpenAIS project became an optional add-on to corosync for users who want the full AIS API.
 
You will see a lot of references to OpenAIS while searching the web for information on clustering. Understanding its evolution will hopefully help you avoid confusion.
 
== Concept; quorum ==
 
[[Quorum]] is defined as the minimum set of hosts required in order to provide clustered services and is used to prevent [[split-brain]] situations.
 
The quorum algorithm used by the RHCS cluster is called "simple majority quorum", which means that more than half of the hosts must be online and communicating in order to provide service. While simple majority quorum is a very common quorum algorithm, other quorum algorithms exist ([[grid quorum]], [[YKD Dynamic Linear Voting]], etc.).
 
The idea behind quorum is that, when a cluster splits into two or more partitions, whichever group of machines has quorum can safely start clustered services, knowing that the nodes it lost will not try to do the same.
 
Take this scenario;
 
* You have a cluster of four nodes, each with one vote.
** The cluster's <span class="code">expected_votes</span> is <span class="code">4</span>. A clear majority, in this case, is <span class="code">3</span>, because <span class="code">(4/2)+1</span> is <span class="code">3</span> (the division is rounded down when the total vote count is odd).
** Now imagine that there is a failure in the network equipment and one of the nodes disconnects from the rest of the cluster.
** You now have two partitions; One partition contains three machines and the other partition has one.
** The three machines will have quorum, and the other machine will lose quorum.
** The partition with quorum will reconfigure and continue to provide cluster services.
** The partition without quorum will withdraw from the cluster and shut down all cluster services.
 
When the cluster reconfigures and a partition wins quorum, it will fence the node(s) in the partition without quorum. Once the fencing has been confirmed successful, the partition with quorum will begin accessing clustered resources, like shared filesystems.
 
This also helps explain why an even <span class="code">50%</span> is not enough to have quorum, a common question for people new to clustering. Using the above scenario, imagine if the split were 2 and 2 nodes. Because neither side can be sure what the other will do, neither can safely proceed. If we allowed an even 50% to have quorum, both partitions might try to take over the clustered services and disaster would soon follow.
 
There is one, and '''only''' one, exception to this rule.
 
In the case of a two node cluster, as we will be building here, any failure results in a 50/50 split. If we enforced quorum in a two-node cluster, there would never be high availability because any failure would cause both nodes to withdraw. The risk with this exception is that we now place the entire safety of the cluster on [[fencing]], a concept we will cover in a moment. Fencing is a second line of defense and something we are loath to rely on alone.
 
Even in a two-node cluster though, proper quorum can be maintained by using a quorum disk, called a [[qdisk]]. Unfortunately, <span class="code">qdisk</span> on a [[DRBD]] resource comes with its own problems, so we will not be able to use it here.
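
For reference, the two-node exception is declared in <span class="code">cluster.conf</span> with the <span class="code">two_node</span> attribute. The snippet below is only a sketch of that one setting; the full configuration file is built up later in this tutorial.

<source lang="xml">
<!-- Tell cman this is a two-node cluster; a single vote is enough to hold quorum. -->
<cman expected_votes="1" two_node="1"/>
</source>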
 
== Concept; Virtual Synchrony ==
 
Many cluster operations, like fencing, distributed locking and so on, have to occur in the same order across all nodes. This concept is called "virtual synchrony".
 
This is provided by <span class="code">corosync</span> using "closed process groups", <span class="code">[[CPG]]</span>. A closed process group is simply a private group of processes in a cluster. Within this closed group, all messages between members are ordered. Delivery, however, is not guaranteed. If a member misses messages, it is up to the member's application to decide what action to take.
 
Let's look at two scenarios showing how locks are handled using CPG;
 
* The cluster starts up cleanly with two members.
* Both members are able to start <span class="code">service:foo</span>.
* Both want to start it, but need a lock from [[DLM]] to do so.
** The <span class="code">an-node01</span> member has its totem token, and sends its request for the lock.
** DLM issues a lock for that service to <span class="code">an-node01</span>.
** The <span class="code">an-node02</span> member requests a lock for the same service.
** DLM rejects the lock request.
* The <span class="code">an-node01</span> member successfully starts <span class="code">service:foo</span> and announces this to the CPG members.
* The <span class="code">an-node02</span> member sees that <span class="code">service:foo</span> is now running on <span class="code">an-node01</span> and no longer tries to start the service.
 
* The two members want to write to a common area of the <span class="code">/shared</span> GFS2 partition.
** The <span class="code">an-node02</span> member sends a request for a DLM lock against the file system and gets it.
** The <span class="code">an-node01</span> sends a request for the same lock, but DLM sees that a lock is pending and rejects the request.
** The <span class="code">an-node02</span> member finishes altering the file system, announces the changes over CPG and releases the lock.
** The <span class="code">an-node01</span> member updates its view of the filesystem, requests a lock, receives it and proceeds to update the filesystem.
** It completes the changes, announces the changes over CPG and releases the lock.
 
Messages can only be sent to the members of the CPG while the node has a totem token from corosync.
 
== Concept; Fencing ==
 
Fencing is an '''absolutely critical''' part of clustering. Without '''fully''' working fence devices, '''''your cluster will fail'''''.
 
Was that strong enough, or should I say that again? Let's be safe:
 
'''''DO NOT BUILD A CLUSTER WITHOUT PROPER, WORKING AND TESTED FENCING'''''.
 
Sorry, I promise that this will be the only time that I speak so strongly. Fencing really is critical, and explaining the need for fencing is nearly a weekly event.
 
So then, let's discuss fencing.
 
When a node stops responding, an internal timeout and counter start ticking away. During this time, no [[DLM]] locks are allowed to be issued. Anything using DLM, including <span class="code">rgmanager</span>, <span class="code">clvmd</span> and <span class="code">gfs2</span>, is effectively hung. The hung node is detected using a totem token timeout. That is, if a token is not received from a node within a period of time, it is considered lost and a new token is sent. After a certain number of lost tokens, the cluster declares the node dead. The remaining nodes reconfigure into a new cluster and, if they have quorum (or if quorum is ignored), a fence call against the silent node is made.
 
The fence daemon will look at the cluster configuration and get the fence devices configured for the dead node. Then, one at a time and in the order that they appear in the configuration, the fence daemon will call those fence devices, via their fence agents, passing to the fence agent any configured arguments like username, password, port number and so on. If the first fence agent returns a failure, the next fence agent will be called. If the second fails, the third will be called, then the fourth and so on. Once the last (or perhaps only) fence device fails, the fence daemon will retry again, starting back at the start of the list. It will do this indefinitely until one of the fence devices succeeds.
 
Here's the flow, in point form:
 
* The totem token moves around the cluster members. As each member gets the token, it sends sequenced messages to the CPG members.
* The token is passed from one node to the next, in order and continuously during normal operation.
* Suddenly, one node stops responding.
** A timeout starts (~<span class="code">238</span>ms by default), and each time the timeout is hit, an error counter increments and a replacement token is created.
** The silent node responds before the failure counter reaches the limit.
*** The failure counter is reset to <span class="code">0</span>
*** The cluster operates normally again.
* Again, one node stops responding.
** Again, the timeout begins. As each totem token times out, a new packet is sent and the error count increments.
** The error count exceeds the limit (<span class="code">4</span> errors is the default); roughly one second has passed (<span class="code">238ms * 4</span> plus some overhead).
** The node is declared dead.
** The cluster checks which members it still has, and if that provides enough votes for quorum.
*** If there are too few votes for quorum, the cluster software freezes and the node(s) withdraw from the cluster.
*** If there are enough votes for quorum, the silent node is declared dead.
**** <span class="code">corosync</span> calls <span class="code">fenced</span>, telling it to fence the node.
**** The <span class="code">fenced</span> daemon notifies [[DLM]] and locks are blocked.
**** Which fence device(s) to use, that is, what <span class="code">fence_agent</span> to call and what arguments to pass, is gathered.
**** For each configured fence device:
***** The agent is called and <span class="code">fenced</span> waits for the <span class="code">fence_agent</span> to exit.
***** The <span class="code">fence_agent</span>'s exit code is examined. If it's a success, recovery starts. If it failed, the next configured fence agent is called.
**** If all (or the only) configured fence devices fail, <span class="code">fenced</span> will start over.
**** <span class="code">fenced</span> will wait and loop forever until a fence agent succeeds. During this time, '''the cluster is effectively hung'''.
*** Once a <span class="code">fence_agent</span> succeeds, <span class="code">fenced</span> notifies DLM and lost locks are recovered.
**** [[GFS2]] partitions recover using their journal.
**** Lost cluster resources are recovered as per <span class="code">rgmanager</span>'s configuration (including file system recovery as needed).
* Normal cluster operation is restored, minus the lost node.
 
This skipped a few key things, but the general flow of logic should be there.
 
This is why fencing is so important. Without a properly configured and tested fence device or devices, the cluster will never successfully fence and the cluster will remain hung until a human can intervene.
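
When the time comes to test fencing, it is worth calling a fence agent directly from the command line before trusting it in the cluster. The example below is only a sketch; the agent, address and credentials are hypothetical placeholders and must match your actual hardware.

<source lang="bash">
# Ask an IPMI-based fence device for the power state of a node.
# The address, user and password here are illustrative examples only.
fence_ipmilan -a 10.20.1.1 -l admin -p secret -o status

# Once the cluster is configured, fence_node runs the full fence
# configuration for the named node exactly as the cluster would.
# Note that this really does fence (power-cycle) the target node.
fence_node an-node02
</source>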
 
== Component; totem ==
 
The <span class="code">[[totem]]</span> protocol defines message passing within the cluster and it is used by <span class="code">corosync</span>. A token is passed around all the nodes in the cluster, and nodes can only send messages while they have the token. A node will keep its messages in memory until it gets the token back with no "not ack" messages. This way, if a node missed a message, it can request it be resent when it gets its token. If a node isn't up, it will simply miss the messages.
 
The <span class="code">totem</span> protocol supports something called '<span class="code">rrp</span>', '''R'''edundant '''R'''ing '''P'''rotocol. Through <span class="code">rrp</span>, you can add a second backup ring on a separate network to take over in the event of a failure in the first ring. In RHCS, these rings are known as "<span class="code">ring 0</span>" and "<span class="code">ring 1</span>". The RRP is being re-introduced in RHCS version 3. Its use is experimental and should only be used with plenty of testing.
 
== Component; rgmanager ==
 
When the cluster membership changes, <span class="code">corosync</span> tells the <span class="code">rgmanager</span> that it needs to recheck its services. It will examine what changed and then will start, stop, migrate or recover cluster resources as needed.
 
Within <span class="code">rgmanager</span>, one or more ''resources'' are brought together as a ''service''. This service is then optionally assigned to a ''failover domain'', a subset of nodes that can have preferential ordering.
 
The <span class="code">rgmanager</span> daemon runs separately from the cluster manager, <span class="code">cman</span>. This means that, to fully start the cluster, we need to start both <span class="code">cman</span> and then <span class="code">rgmanager</span>.
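
In practice, that start order looks like the following. The <span class="code">clustat</span> call at the end is simply a convenient way to confirm that both daemons are up; we will use it throughout the tutorial.

<source lang="bash">
# Start the cluster foundation first, then the resource manager.
/etc/init.d/cman start
/etc/init.d/rgmanager start

# Show cluster membership and the state of any defined services.
clustat
</source>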
 
== Component; qdisk ==
 
{{note|1=<span class="code">qdisk</span> does not work reliably on a DRBD resource, so we will not be using it in this tutorial.}}
 
A quorum disk, known as a <span class="code">qdisk</span>, is a small partition on [[SAN]] storage used to enhance quorum. It generally carries enough votes to allow even a single node to take quorum during a cluster partition. It does this by using configured heuristics, that is, custom tests, to decide which node or partition is best suited for providing clustered services during a cluster reconfiguration. These heuristics can be simple, like testing which partition has access to a given router, or they can be as complex as the administrator wishes using custom scripts.
 
Though we won't be using it here, it is well worth knowing about when you move to a cluster with [[SAN]] storage.
 
== Component; DRBD ==
 
[[DRBD]]; Distributed Replicated Block Device, is a technology that takes raw storage from two or more nodes and keeps their data synchronized in real time. It is sometimes described as "RAID 1 over Cluster Nodes", and that is conceptually accurate. In this tutorial's cluster, DRBD will be used to provide the back-end storage as a cost-effective alternative to a traditional [[SAN]] device.
 
To help visualize DRBD's use and role, take a look at how we will implement our cluster's storage.
 
This shows;
* Each node having four physical disks tied together in a [[RAID_level_5#Level_5|RAID Level 5]] array and presented to the Node's OS as a single drive which is found at <span class="code">/dev/sda</span>.
* Each node's OS uses three primary partitions for <span class="code">/boot</span>, <span class="code"><swap></span> and <span class="code">/</span>.
* Three extended partitions are created;
** <span class="code">/dev/sda5</span> backs a small partition used as a [[GFS2]]-formatted shared mount point.
** <span class="code">/dev/sda6</span> backs the [[VM]]s designed to run primarily on <span class="code">an-node01</span>.
** <span class="code">/dev/sda7</span> backs the [[VM]]s designed to run primarily on <span class="code">an-node02</span>.
* All three extended partitions are combined using DRBD to create three DRBD resources;
** <span class="code">/dev/drbd0</span> is backed by <span class="code">/dev/sda5</span>.
** <span class="code">/dev/drbd1</span> is backed by <span class="code">/dev/sda6</span>.
** <span class="code">/dev/drbd2</span> is backed by <span class="code">/dev/sda7</span>.
* All three DRBD resources are managed by clustered LVM.
* The GFS2-formatted [[LV]] is mounted on <span class="code">/shared</span> on both nodes.
* Each [[VM]] gets its own [[LV]].
* All three DRBD resources sync over the [[Storage Network]], which uses the bonded <span class="code">bond1</span> (backed by <span class="code">eth1</span> and <span class="code">eth4</span>).
 
Don't worry if this seems illogical at this stage. The main thing to look at are the <span class="code">drbdX</span> devices and how they each tie back to a corresponding <span class="code">sdaY</span> device on either node.
 
<source lang="text">
_________________________________________________                _________________________________________________
| [ an-node01 ]                                  |              |                                  [ an-node02 ] |
|  ________      __________                      |              |                      __________      ________  |
| [_disk_1_]--+--[_/dev/sda_]                    |              |                    [_/dev/sda_]--+--[_disk_1_] |
|  ________  |    |  ___________    _______    |              |    _______    ___________  |    |  ________  |
| [_disk_2_]--+    +--[_/dev/sda1_]--[_/boot_]    |              |    [_/boot_]--[_/dev/sda1_]--+    +--[_disk_2_] |
|  ________  |    |  ___________    ________    |              |    ________    ___________  |    |  ________  |
| [_disk_3_]--+    +--[_/dev/sda2_]--[_<swap>_]  |              |  [_<swap>_]--[_/dev/sda2_]--+    +--[_disk_3_] |
|  ________  |    |  ___________    ___        |              |        ___    ___________  |    |  ________  |
| [_disk_4_]--/    +--[_/dev/sda3_]--[_/_]        |              |        [_/_]--[_/dev/sda3_]--+    \--[_disk_4_] |
|                  |  ___________                |              |                ___________  |                  |
|                  +--[_/dev/sda5_]------------\  |              |  /------------[_/dev/sda5_]--+                  |
|                  |  ___________            |  |              |  |            ___________  |                  |
|                  +--[_/dev/sda6_]----------\ |  |              |  | /----------[_/dev/sda6_]--+                  |
|                  |  ___________          | |  |              |  | |          ___________  |                  |
|                  \--[_/dev/sda7_]--------\ | |  |              |  | | /--------[_/dev/sda7_]--/                  |
|        _______________    ____________  | | |  |              |  | | |  ____________    _______________        |
|    /--[_Clustered_LVM_]--[_/dev/drbd2_]--/ | |  |              |  | | \--[_/dev/drbd2_]--[_Clustered_LVM_]--\    |
|  _|__                    |  _______    | |  |              |  | |      |  _______                    __|_  |
|  [_PV_]                    \--{_bond1_}    | |  |              |  | |      \--{_bond1_}                  [_PV_]  |
|  _|_______                                | |  |              |  | |                                _______|_  |
|  [_an2-vg0_]                              | |  |              |  | |                              [_an2-vg0_]  |
|    |  _______________________    .......  | |  |              |  | |  _____    _______________________  |    |
|    +--[_/dev/an2-vg0/vm0003_1_]---:.vm3.:  | |  |              |  | |  [_vm3_]---[_/dev/an2-vg0/vm0003_1_]--+    |
|    |  _______________________    .......  | |  |              |  | |  _____    _______________________  |    |
|    \--[_/dev/an2-vg0/vm0004_1_]---:.vm4.:  | |  |              |  | |  [_vm4_]---[_/dev/an2-vg0/vm0004_1_]--/    |
|          _______________    ____________  | |  |              |  | |  ____________    _______________          |
|      /--[_Clustered_LVM_]--[_/dev/drbd1_]--/ |  |              |  | \--[_/dev/drbd1_]--[_Clustered_LVM_]--\      |
|    _|__                    |  _______    |  |              |  |      |  _______                    __|_    |
|    [_PV_]                    \--{_bond1_}    |  |              |  |      \--{_bond1_}                  [_PV_]    |
|    _|_______                                |  |              |  |                                ___ ___|_    |
|    [_an1-vg0_]                              |  |              |  |                              [_an1-vg0_]    |
|      |  _______________________    _____  |  |              |  |      .......    ___________________  |      |
|      +--[_/dev/an1-vg0/vm0001_1_]---[_vm1_]  |  |              |  |      :.vm1.:---[_/dev/vg0/vm0001_1_]--+      |
|      |  _______________________    _____  |  |              |  |      .......    ___________________  |      |
|      \--[_/dev/an1-vg0/vm0002_1_]---[_vm2_]  |  |              |  |      :.vm2.:---[_/dev/vg0/vm0002_1_]--/      |
|            _______________    ____________  |  |              |  |  ____________    _______________            |
|        /--[_Clustered_LVM_]--[_/dev/drbd0_]--/  |              |  \--[_/dev/drbd0_]--[_Clustered_LVM_]--\        |
|      _|__                    |  _______      |              |      |  _______                    __|_      |
|      [_PV_]                    \--{_bond1_}    |              |      \--{_bond1_}                  [_PV_]      |
|      _|__________                              |              |                              __________|_      |
|      [_shared-vg0_]                            |              |                            [_shared-vg0_]      |
|      _|_________________________              |              |              _________________________|_      |
|      [_/dev/shared-vg0/lv_shared_]              |              |              [_/dev/shared-vg0/lv_shared_]      |
|        |  ______    _________                  |              |                  _________    ______  |        |
|        \--[_GFS2_]--[_/shared_]                |              |                [_/shared_]--[_GFS2_]--/        |
|                                          _______|  _________  |_______                                          |
|                                        | bond1 =--| Storage |--= bond1 |                                        |
|                                        |______||  | Network |  ||______|                                        |
|_________________________________________________|  |_________|  |_________________________________________________|
.
</source>
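
To make the mapping above a little more concrete, a DRBD resource definition ties a local backing device to its peer on the other node. The snippet below is only a rough sketch of what resource <span class="code">r0</span> (backing <span class="code">/dev/drbd0</span>) might look like; the port and layout here are illustrative only and this is not the exact configuration we will ultimately write.

<source lang="text">
# Hypothetical sketch of a DRBD resource; do not copy this verbatim.
resource r0 {
        device    /dev/drbd0;
        disk      /dev/sda5;
        meta-disk internal;

        on an-node01 {
                address 10.10.0.1:7788;
        }
        on an-node02 {
                address 10.10.0.2:7788;
        }
}
</source>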
 
== Component; Clustered LVM ==
 
With [[DRBD]] providing the raw storage for the cluster, we must next consider partitions. This is where Clustered [[LVM]], known as [[CLVM]], comes into play.
 
CLVM is ideal in that it uses [[DLM]], the distributed lock manager, and will not allow access to cluster members outside of <span class="code">corosync</span>'s closed process group, which, in turn, requires quorum.
 
It is ideal because it can take one or more raw devices, known as "physical volumes", or simply as [[PV]]s, and combine their raw space into one or more "volume groups", known as [[VG]]s. These volume groups then act just like a typical hard drive and can be "partitioned" into one or more "logical volumes", known as [[LV]]s. These LVs are where [[KVM]]'s virtual machine guests will exist and where we will create our [[GFS2]] clustered file system.
 
LVM is particularly attractive because of how flexible it is. We can easily add new physical volumes later, and then grow an existing volume group to use the new space. This new space can then be given to existing logical volumes, or entirely new logical volumes can be created. This can all be done while the cluster is online offering an upgrade path with no down time.
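
As a rough sketch of how this will look once DRBD is in place (the device, volume group and size below simply follow the layout described above and are examples only), clustered LVM is used much like regular LVM, provided <span class="code">locking_type = 3</span> is set in <span class="code">/etc/lvm/lvm.conf</span> and <span class="code">clvmd</span> is running:

<source lang="bash">
# Turn the DRBD resource into a physical volume.
pvcreate /dev/drbd0

# Create a clustered volume group on it; -c y makes the clustered
# flag explicit (it is the default when clvmd is running).
vgcreate -c y shared-vg0 /dev/drbd0

# Carve out a logical volume for the shared GFS2 file system.
lvcreate -L 20G -n lv_shared shared-vg0
</source>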
 
== Component; GFS2 ==
 
With [[DRBD]] providing the cluster's raw storage space, and [[Clustered LVM]] providing the logical partitions, we can now look at the clustered file system. This is the role of the Global File System version 2, known simply as [[GFS2]].
 
It works much like a standard filesystem, with user-land tools like <span class="code">mkfs.gfs2</span>, <span class="code">fsck.gfs2</span> and so on. The major difference is that it and <span class="code">clvmd</span> use the cluster's [[DLM|distributed locking mechanism]] provided by the <span class="code">dlm_controld</span> daemon. Once formatted, the GFS2-formatted partition can be mounted and used by any node in the cluster's [[CPG|closed process group]]. All nodes can then safely read from and write to the data on the partition simultaneously.
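
For illustration, formatting and mounting a GFS2 partition looks roughly like the following. The cluster name must match the one set in <span class="code">cluster.conf</span>, and <span class="code">-j 2</span> creates one journal per node; the names below are examples only.

<source lang="bash">
# Format the shared LV with the DLM lock manager and two journals.
mkfs.gfs2 -p lock_dlm -j 2 -t an-cluster01:shared /dev/shared-vg0/lv_shared

# Mount it on both nodes.
mkdir -p /shared
mount /dev/shared-vg0/lv_shared /shared
</source>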
 
== Component; DLM ==
 
One of the major roles of a cluster is to provide [[DLM|distributed locking]] for clustered storage and resource management.
 
Whenever a resource, GFS2 filesystem or clustered LVM LV needs a lock, it sends a request to <span class="code">dlm_controld</span> which runs in userspace. This communicates with DLM in the kernel. If the lock group does not yet exist, DLM will create it and then give the lock to the requester. Should a subsequent lock request come in for the same lock group, it will be rejected. Once the application using the lock is finished with it, it will release the lock. After this, another node may request and receive a lock for the lock group.
 
If a node fails, <span class="code">fenced</span> will alert <span class="code">dlm_controld</span> that a fence is pending and new lock requests will block. After a successful fence, <span class="code">fenced</span> will alert DLM that the node is gone and any locks the victim node held are released. At this time, other nodes may request a lock on the lock groups the lost node held and can perform recovery, like replaying a GFS2 filesystem journal, prior to resuming normal operation.
 
Note that DLM locks are not used for actually locking the file system. That job is still handled by <span class="code">plock()</span> calls ([[POSIX]] locks).
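
If you ever need to see which lock spaces exist on a node, the <span class="code">dlm_tool</span> utility can list them. This is purely a diagnostic aid and only works once the cluster and its DLM-using services are running.

<source lang="bash">
# List the DLM lock spaces currently held on this node
# (for example clvmd and any mounted GFS2 file systems).
dlm_tool ls
</source>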
 
== Component; KVM ==
 
Two of the most popular open-source virtualization platforms available in the Linux world today are [[Xen]] and [[KVM]]. The former is maintained by [http://www.citrix.com/xenserver Citrix] and the latter by [http://www.redhat.com/solutions/virtualization/ Red Hat]. It would be difficult to say which is "better", as they're both very good. Xen can be argued to be more mature, whereas KVM is the "official" solution supported by Red Hat in [[EL6]].
 
We will be using the KVM [[hypervisor]], within which our highly-available virtual machine guests will reside. KVM is often described as a type-2 hypervisor because it runs within a host operating system that itself sits directly on the bare hardware. Contrast this with Xen, a type-1 hypervisor where even the installed OS is itself just another virtual machine.
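
Before going further, it is worth confirming that the hardware actually supports KVM. A quick, common check:

<source lang="bash">
# A non-zero count means the CPU advertises Intel VT-x (vmx) or AMD-V (svm).
egrep -c '(vmx|svm)' /proc/cpuinfo

# Once libvirtd and the KVM modules are in place, this should list
# zero or more guests rather than returning an error.
virsh list --all
</source>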
 
= Network =
 
The cluster will use three separate Class B networks;
 
{|class="wikitable"
!Purpose
!Subnet
!Notes
|-
|Internet-Facing Network ([[IFN]])
|<span class="code">10.255.0.0/16</span>
|
* Each node will use <span class="code">10.255.0.x</span> where <span class="code">x</span> matches the node ID.<br />
* Virtual Machines in the cluster that need to be connected to the Internet will use <span class="code">10.255.y.z</span> where <span class="code">y</span> corresponds to the cluster and <span class="code">z</span> is a simple sequence number matching the VM ID.
|-
|Storage Network ([[SN]])
|<span class="code">10.10.0.0/16</span>
|
* Each node will use <span class="code">10.10.0.x</span> where <span class="code">x</span> matches the node ID.
|-
|Back-Channel Network ([[BCN]])
|<span class="code">10.20.0.0/16</span>
|
* Each node will use <span class="code">10.20.0.x</span> where <span class="code">x</span> matches the node ID.<br />
* Node-specific [[IPMI]] or other out-of-band management devices will use <span class="code">10.20.1.x</span> where <span class="code">x</span> matches the node ID.<br />
* Multi-port fence devices will use <span class="code">10.20.2.z</span> where <span class="code">z</span> is a simple sequence.<br />
* Miscellaneous equipment in the cluster, like managed switches, will use <span class="code">10.20.3.z</span> where <span class="code">z</span> is a simple sequence.<br />
|-
|''Optional'' OpenVPN Network
|<span class="code">10.30.0.0/16</span>
* For clients behind firewalls, I like to create a [[OpenVPN Server on EL6|VPN]] server for the cluster nodes to log into when support is needed. This way, the client retains control over when remote access is available simply by starting and stopping the <span class="code">openvpn</span> daemon. This will not be discussed any further in this tutorial.
|}
 
We will be using six interfaces, bonded into three pairs of two NICs in Active/Passive (mode 1) configuration. Each link of each bond will be on alternate, unstacked switches. Mode 1 is the only bonding configuration supported by [[Red Hat]] for use in clusters. We will also configure affinity by specifying interfaces <span class="code">eth0</span>, <span class="code">eth1</span> and <span class="code">eth2</span> as primary for the <span class="code">bond0</span>, <span class="code">bond1</span> and <span class="code">bond2</span> interfaces, respectively. This way, when everything is working fine, all traffic is routed through the same switch for maximum performance.
 
{{note|1=Only the bonded interface used by corosync must be in Active/Passive configuration (<span class="code">bond0</span> in this tutorial). If you want to experiment with other bonding modes for <span class="code">bond1</span> or <span class="code">bond2</span>, please feel free to do so. That is outside the scope of this tutorial, however.}}
 
If you cannot install six interfaces in your server, then four interfaces will do, with the [[SN]] and [[BCN]] networks merged.
 
{{warning|1=If you wish to merge the [[SN]] and [[BCN]] onto one interface, test to ensure that the storage traffic will not block cluster communication. Test by forming your cluster and then pushing your storage to maximum read and write performance for an extended period of time (minimum of several seconds). If the cluster partitions, you will need to do some advanced quality-of-service or other network configuration to ensure reliable delivery of cluster network traffic.}}
 
In this tutorial, we will use two [http://dlink.ca/products/?pid=DGS-3100-24 D-Link DGS-3100-24], unstacked, using three [[VLAN]]s to isolate the three networks.
* [[IFN]] will have VLAN ID number 100.
* [[SN]] will have VLAN ID number 101.
* [[BCN]] will have VLAN ID number 102.
 
You could just as easily use four or six unmanaged [http://dlink.ca/products/?pid=DGS-1005G 5 port] or [http://dlink.ca/products/?pid=DGS-1008G 8 port] switches. What matters is that the three subnets are isolated and that each link of each bond is on a separate switch. Lastly, only connect the [[IFN]] switches or VLANs to the Internet for security reasons.
 
The actual mapping of interfaces to bonds to networks will be:
 
{|class="wikitable"
!Subnet
!Link 1
!Link 2
!Bond
!IP
|-
|[[BCN]]
|<span class="code">eth0</span>
|<span class="code">eth3</span>
|<span class="code">bond0</span>
|<span class="code">10.20.0.x</span>
|-
|[[SN]]
|<span class="code">eth1</span>
|<span class="code">eth4</span>
|<span class="code">bond1</span>
|<span class="code">10.10.0.x</span>
|-
|[[IFN]]
|<span class="code">eth2</span>
|<span class="code">eth5</span>
|<span class="code">bond2</span>
|<span class="code">10.255.0.x</span>
|}
 
== Setting Up the Network ==
 
{{warning|1=The following steps can easily get confusing, given how many files we need to edit. Losing access to your server's network is a very real possibility! '''Do not continue without direct access to your servers!''' If you have out-of-band access via [[iKVM]], console redirection or similar, be sure to test that it is working before proceeding.}}
 
=== Managed and Stacking Switch Notes ===
 
{{note|1=If you have two switches capable of stacking, do not stack them!}}
 
There are two things you need to be wary of with managed switches.
 
* Don't stack them. It may seem like it makes sense to stack them and create Link Aggregation Groups, but this is not supported. Leave the two switches as independent units.
* Disable Spanning Tree Protocol on all ports used by the cluster. Otherwise, when a lost switch is recovered, STP negotiation will cause traffic to stop on the ports for upwards of thirty seconds. This is more than enough time to partition a cluster.
 
Enable STP on the ports you use for uplinking the two switches and disable it on all other ports.
 
=== Making Sure We Know Our Interfaces ===
 
When you installed the operating system, the network interface names were somewhat randomly assigned to the physical network interfaces. It is more than likely that you will want to re-order them.
 
Before you start moving interface names around, you will want to consider which physical interfaces you will want to use on which networks. At the end of the day, the names themselves have no meaning. At the very least though, make them consistent across nodes.
 
Some things to consider, in order of importance:
 
* If you have a shared interface for your out-of-band management interface, like [[IPMI]] or [[iLO]], you will want that interface to be on the [[Back-Channel Network]].
* For redundancy, you want to spread out which interfaces are paired up. In my case, I have three interfaces on my mainboard and three additional add-in cards. I will pair each onboard interface with an add-in interface. In my case, my IPMI interface physically piggy-backs on one of the onboard interfaces so this interface will need to be part of the [[BCN]] bond.
* Your interfaces with the lowest latency should be used for the back-channel network.
* Your two fastest interfaces should be used for your storage network.
* The remaining two slowest interfaces should be used for the [[Internet-Facing Network]] bond.
 
In my case, all six interfaces are identical, so there is little to consider. The left-most interface on my system has IPMI, so its paired network interface will be <span class="code">eth0</span>. I simply work my way left, incrementing as I go. What you do will be whatever makes most sense to you.
 
There is a separate, short tutorial on re-ordering network interfaces;
 
* '''[[Changing the ethX to Ethernet Device Mapping in EL6 and Fedora 12+]]'''
 
Once you have the physical interfaces named the way you like, proceed to the next step.
 
=== Planning Our Network ===
 
To setup our network, we will need to edit the <span class="code">ifcfg-ethX</span>, <span class="code">ifcfg-bondX</span> and <span class="code">ifcfg-vbrX</span> scripts. The last one will create bridges which will be used to route network connections to the virtual machines. We '''won't''' be creating a <span class="code">vbr1</span> bridge though, as <span class="code">bond1</span> will be dedicated to the storage and never used by a VM. The bridges will have the [[IP]] addresses, not the bonded interfaces; the bonds will instead be slaved to their respective bridges.
 
We're going to be editing a lot of files. It's best to lay out what we'll be doing in a chart. So our setup will be:
 
{|class="wikitable"
!Node
!BCN IP and Device
!SN IP and Device
!IFN IP and Device
|-
|<span class="code">an-node01</span>
|<span class="code">10.20.0.1</span> on <span class="code">vbr0</span> (<span class="code">bond0</span> slaved)
|<span class="code">10.10.0.1</span> on <span class="code">bond1</span>
|<span class="code">10.255.0.1</span> on <span class="code">vbr2</span> (<span class="code">bond2</span> slaved)
|-
|<span class="code">an-node02</span>
|<span class="code">10.20.0.2</span> on <span class="code">vbr0</span> (<span class="code">bond0</span> slaved)
|<span class="code">10.10.0.2</span> on <span class="code">bond1</span>
|<span class="code">10.255.0.2</span> on <span class="code">vbr2</span> (<span class="code">bond2</span> slaved)
|}
 
=== Creating Some Network Configuration Files ===
 
Bridge configuration files '''must''' have a file name that sorts '''after''' the interface and bond configuration files. The actual device name can be whatever you want though. If the system tries to start the bridge before its interface is up, it will fail. I personally like to use the name <span class="code">vbrX</span> for "virtual machine bridge". You can use whatever makes sense to you, with the above concern in mind.
 
Start by <span class="code">touch</span>ing the configuration files we will need.
 
<source lang="bash">
touch /etc/sysconfig/network-scripts/ifcfg-bond{0,1,2}
touch /etc/sysconfig/network-scripts/ifcfg-vbr{0,2}
</source>
 
Now make a backup of your configuration files, in case something goes wrong and you want to start over.


<source lang="bash">
mkdir /root/backups/
rsync -av /etc/sysconfig/network-scripts/ifcfg-eth* /root/backups/
</source>
<source lang="text">
sending incremental file list
ifcfg-eth0
ifcfg-eth1
ifcfg-eth2
ifcfg-eth3
ifcfg-eth4
ifcfg-eth5

sent 1467 bytes  received 126 bytes  3186.00 bytes/sec
total size is 1119  speedup is 0.70
</source>
 
=== Configuring Our Bridges ===
 
Now let's start in reverse order. We'll write the bridge configuration, then the bond interfaces and finally alter the interface configuration files. The reason for doing this in reverse is to minimize the amount of time where a sudden restart would leave us without network access.
 
{{note|1=If you know now that none of your VMs will ever need access to the [[BCN]], as might be the case if all [[VM]]s will be web-facing, then you can skip the creation of <span class="code">vbr0</span>. In this case, move the [[IP]] address and related values to the <span class="code">ifcfg-bond0</span> configuration file.}}
 
'''<span class="code">an-node01</span>''' BCN Bridge:
<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-vbr0
</source>
<source lang="bash">
# Back-Channel Network - Bridge
DEVICE="vbr0"
TYPE="Bridge"
BOOTPROTO="static"
IPADDR="10.20.0.1"
NETMASK="255.255.0.0"
</source>
 
'''<span class="code">an-node01</span>''' IFN Bridge:
<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-vbr2
</source>
<source lang="bash">
# Internet-Facing Network - Bridge
DEVICE="vbr2"
TYPE="Bridge"
BOOTPROTO="static"
IPADDR="10.255.0.1"
NETMASK="255.255.0.0"
GATEWAY="10.255.255.254"
DNS1="192.139.81.117"
DNS2="192.139.81.1"
DEFROUTE="yes"
</source>


=== Creating the Bonded Interfaces ===
 
Now we can create the actual bond configuration files.


To explain the <span class="code">[http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/sec-Using_Channel_Bonding.html BONDING_OPTS]</span> options;
* <span class="code">mode=1</span> sets the bonding mode to <span class="code">active-backup</span>.
* The <span class="code">miimon=100</span> tells the bonding driver to check if the network cable has been unplugged or plugged in every 100 milliseconds.
* The <span class="code">use_carrier=1</span> tells the bonding driver to rely on the interface driver's reported carrier state to determine whether a link is up. Some drivers don't support that. If you run into trouble, try changing this to <span class="code">0</span>.
* The <span class="code">updelay=120000</span> tells the driver to delay switching back to the primary interface for 120,000 milliseconds (2 minutes). This is designed to give the switch connected to the primary interface time to finish booting. Setting this too low may cause the bonding driver to switch back before the network switch is ready to actually move data.
* The <span class="code">downdelay=0</span> tells the driver not to wait before changing the state of an interface when the link goes down. That is, when the driver detects a fault, it will switch to the backup interface immediately.
'''<span class="code">an-node01</span>''' BCN Bond:
<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-bond0
</source>
<source lang="bash">
# Back-Channel Network - Bond
DEVICE="bond0"
BRIDGE="vbr0"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=eth0"
</source>


'''<span class="code">an-node01</span>''' SN Bond:
<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-bond1
</source>
<source lang="bash">
# Storage Network - Bond
DEVICE="bond1"
BOOTPROTO="static"
NM_CONTROLLED="no"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=eth1"
IPADDR="10.10.0.1"
NETMASK="255.255.0.0"
</source>


'''<span class="code">an-node01</span>''' IFN Bond:
<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-bond2
</source>
<source lang="bash">
# Internet-Facing Network - Bond
DEVICE="bond2"
BRIDGE="vbr2"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=eth2"
</source>
 
=== Alter The Interface Configurations ===
 
Now, finally, alter the interfaces themselves to join their respective bonds.


'''<span class="code">an-node01</span>''''s <span class="code">eth0</span>, the BCN <span class="code">bond0</span>, Link 1:
<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-eth0
</source>
<source lang="bash">
# Back-Channel Network - Link 1
HWADDR="00:E0:81:C7:EC:49"
DEVICE="eth0"
NM_CONTROLLED="no"
ONBOOT="yes"
BOOTPROTO="none"
MASTER="bond0"
SLAVE="yes"
</source>


'''<span class="code">an-node01</span>''''s <span class="code">eth1</span>, the SN <span class="code">bond1</span>, Link 1:
<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-eth1
</source>
<source lang="bash">
# Storage Network - Link 1
HWADDR="00:E0:81:C7:EC:48"
DEVICE="eth1"
NM_CONTROLLED="no"
ONBOOT="yes"
BOOTPROTO="none"
MASTER="bond1"
SLAVE="yes"
</source>


'''<span class="code">an-node01</span>''''s <span class="code">eth2</span>, the IFN <span class="code">bond2</span>, Link 1:
<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-eth2
</source>
<source lang="bash">
# Internet-Facing Network - Link 1
HWADDR="00:E0:81:C7:EC:47"
DEVICE="eth2"
NM_CONTROLLED="no"
ONBOOT="yes"
BOOTPROTO="none"
MASTER="bond2"
SLAVE="yes"
</source>


'''<span class="code">an-node01</span>''''s <span class="code">eth3</span>, the BCN <span class="code">bond0</span>, Link 2:
<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-eth3
</source>
<source lang="bash">
# Back-Channel Network - Link 2
DEVICE="eth3"
NM_CONTROLLED="no"
ONBOOT="yes"
BOOTPROTO="none"
MASTER="bond0"
SLAVE="yes"
</source>


'''<span class="code">an-node01</span>''''s <span class="code">eth4</span>, the SN <span class="code">bond1</span>, Link 2:
<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-eth4
</source>
<source lang="bash">
# Storage Network - Link 2
HWADDR="00:1B:21:BF:6F:FE"
DEVICE="eth4"
NM_CONTROLLED="no"
ONBOOT="yes"
BOOTPROTO="none"
MASTER="bond1"
SLAVE="yes"
</source>


'''<span class="code">an-node01</span>''''s <span class="code">eth5</span>, the IFN <span class="code">bond2</span>, Link 2:
<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-eth5
</source>
<source lang="bash">
# Internet-Facing Network - Link 2
HWADDR="00:1B:21:BF:70:02"
DEVICE="eth5"
NM_CONTROLLED="no"
ONBOOT="yes"
BOOTPROTO="none"
MASTER="bond2"
SLAVE="yes"
</source>


== Loading The New Network Configuration ==

Simply restart the <span class="code">network</span> service.


<source lang="bash">
/etc/init.d/network restart
</source>
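
If you want to confirm that the bonds actually picked up their slaves, the bonding driver reports its state under <span class="code">/proc/net/bonding/</span>. This is just a quick sanity check, assuming the <span class="code">bond0</span> through <span class="code">bond2</span> names used above:

<source lang="bash">
# Show each bond's active slave, link states and bonding mode.
for bond in bond0 bond1 bond2
do
    echo "== ${bond} =="
    cat /proc/net/bonding/${bond}
done

# Confirm that the IP addresses ended up on the bonds, not on the slave interfaces.
ip addr show dev bond0
</source>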


== Updating /etc/hosts ==


On both nodes, update the <span class="code">/etc/hosts</span> file to reflect your network configuration. Remember to add entries for your [[IPMI]], switched PDUs and other devices.


<source lang="bash">
<source lang="bash">
cd /etc/openvpn/
vim /etc/hosts
rsync -av /usr/share/doc/openvpn-2.2.0/easy-rsa/2.0/* /etc/openvpn/
</source>
</source>
<source lang="text">
<source lang="text">
sending incremental file list
127.0.0.1  localhost localhost.localdomain localhost4 localhost4.localdomain4
Makefile
::1        localhost localhost.localdomain localhost6 localhost6.localdomain6
README
 
build-ca
# an-node01
build-dh
10.20.0.1 an-node01 an-node01.bcn an-node01.alteeve.com
build-inter
10.20.1.1 an-node01.ipmi
build-key
10.10.0.1 an-node01.sn
build-key-pass
10.255.0.1 an-node01.ifn
build-key-pkcs12
 
build-key-server
# an-node01
build-req
10.20.0.2 an-node02 an-node02.bcn an-node02.alteeve.com
build-req-pass
10.20.1.2 an-node02.ipmi
clean-all
10.10.0.2 an-node02.sn
inherit-inter
10.255.0.2 an-node02.ifn
list-crl
 
openssl-0.9.6.cnf
# Fence devices
openssl.cnf
10.20.2.1      pdu1 pdu1.alteeve.com
pkitool
10.20.2.2      pdu2 pdu2.alteeve.com
revoke-full
 
sign-req
# VPN interfaces, if used.
vars
10.30.0.1 an-node01.vpn
whichopensslcnf
10.30.0.2 an-node02.vpn
</source>
 
{{warning|1=Whichever switch you have the IPMI interfaces connected to, be sure to connect the PDU to the '''opposite''' switch! If both fence types are on one switch, then that switch becomes a single point of failure!}}
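
A quick way to confirm that the new names are being honoured is to resolve a few of them on each node. This is only a sanity check; <span class="code">getent</span> reads <span class="code">/etc/hosts</span> the same way the cluster tools will:

<source lang="bash">
# Each of these should print the addresses defined above.
getent hosts an-node01 an-node02
getent hosts an-node01.ipmi an-node02.ipmi
getent hosts pdu1 pdu2
</source>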
 
= Configuring The Cluster Foundation =
 
We need to configure the cluster in two stages. This is because we have something of a chicken-and-egg problem.
 
* We need clustered storage for our virtual machines.
* Our clustered storage needs the cluster for fencing.
 
Conveniently, clustering has two logical parts;
* Cluster communication and membership.
* Cluster resource management.
 
The first part, communication and membership, covers tracking which nodes are part of the cluster and ejecting faulty nodes, among other tasks. The second part, resource management, is provided by a second tool called <span class="code">rgmanager</span>. It's this second part that we will set aside for later.
 
== Installing Required Programs ==
 
Installing the cluster software is pretty simple;


<source lang="bash">
yum install cman corosync rgmanager ricci gfs2-utils
</source>
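
If you would like to double-check that everything landed, <span class="code">rpm</span> can confirm the packages and their versions. A trivial check, nothing more:

<source lang="bash">
# All five packages should report an installed version.
rpm -q cman corosync rgmanager ricci gfs2-utils
</source>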


== Configuration Methods ==
 
In [[Red Hat]] Cluster Services, the heart of the cluster is found in the <span class="code">[[RHCS v3 cluster.conf|/etc/cluster/cluster.conf]]</span> [[XML]] configuration file.
 
There are three main ways of editing this file. Two are already well documented, so I won't bother discussing them, beyond introducing them. The third way is by directly hand-crafting the <span class="code">cluster.conf</span> file. This method is not very well documented, and directly manipulating configuration files is my preferred method. As my boss loves to say; "''The more computers do for you, the more they do to you''".
 
The first two, well documented, graphical tools are:
 
* <span class="code">[http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Administration/ch-config-scc-CA.html system-config-cluster]</span>, older GUI tool run directly from one of the cluster nodes.
* [http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Administration/ch-config-conga-CA.html Conga], comprised of the <span class="code">ricci</span> node-side client and the <span class="code">luci</span> web-based server (can be run on machines outside the cluster).
 
I do like the tools above, but I often find issues that send me back to the command line. I'd recommend setting them aside for now as well. Once you feel comfortable with <span class="code">cluster.conf</span> syntax, then by all means, go back and use them. I'd recommend not becoming reliant on them though, which can happen if you turn to them too early in your studies.
 
== The First cluster.conf Foundation Configuration ==
 
The very first stage of building the cluster is to create a configuration file that is as minimal as possible. To do that, we need to define a few things;
 
* The name of the cluster and the cluster file version.
** Define <span class="code">cman</span> options
** The nodes in the cluster
*** The fence method for each node
** Define fence devices
** Define <span class="code">fenced</span> options
 
That's it. Once we've defined this minimal amount, we will be able to start the cluster for the first time! So let's get to it, finally.
 
=== Name the Cluster and Set The Configuration Version ===
 
The <span class="code">[[RHCS_v3_cluster.conf#cluster.3B_The_Parent_Tag|cluster]]</span> tag is the parent tag for the entire cluster configuration file.


<source lang="bash">
<source lang="bash">
cp /etc/openvpn/vars /etc/openvpn/vars.orig
vim /etc/cluster/cluster.conf
vim /etc/openvpn/vars
</source>
diff -u /etc/openvpn/vars.orig /etc/openvpn/vars
<source lang="xml">
<?xml version="1.0"?>
<cluster name="an-clusterA" config_version="1">
</cluster>
</source>
 
This tag has two attributes that we need to set; <span class="code">name=""</span> and <span class="code">config_version=""</span>.
 
The <span class="code">[[RHCS v3 cluster.conf#name|name]]=""</span> attribute defines the name of the cluster. It must be unique amongst the clusters on your network. It should be descriptive, but you will not want to make it too long, either. You will see this name in the various cluster tools and you will enter it, for example, when creating a [[GFS2]] partition later on. This tutorial uses the cluster name <span class="code">an-clusterA</span>. The reason for the <span class="code">A</span> is to help differentiate it from the nodes, which use sequence numbers.
 
The <span class="code">[[RHCS v3 cluster.conf#config_version|config_version]]=""</span> attribute is an integer marking the version of the configuration file. Whenever you make a change to the <span class="code">cluster.conf</span> file, you will need to increment this version number by 1. If you don't increment this number, then the cluster tools will not know that the file needs to be reloaded. As this is the first version of this configuration file, it will start with <span class="code">1</span>. Note that this tutorial will increment the version after every change, regardless of whether it is explicitly pushed out to the other nodes and reloaded. The reason is to help get into the habit of always increasing this value.
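
Once the cluster is up and running, you will be able to confirm which configuration version is actually loaded. We can't run this yet, but for reference, the check looks like this:

<source lang="bash">
# Reports the cluster software version along with the loaded config_version.
cman_tool version
</source>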
 
=== Configuring cman Options ===
 
We are going to setup a special case for our cluster; A 2-Node cluster.
 
This is a special case because traditional quorum will not be useful. With only two nodes, each having a vote of <span class="code">1</span>, the total number of votes is <span class="code">2</span>. Quorum needs <span class="code">50% + 1</span>, which means that a single node failure would shut down the cluster, as the remaining node's vote is <span class="code">50%</span> exactly. That kind of defeats the purpose of having a cluster at all.
 
So to account for this special case, there is a special attribute called <span class="code">[[RHCS_v3_cluster.conf#two_node|two_node]]="1"</span>. This tells the cluster manager to continue operating with only one vote. This option requires that the <span class="code">[[RHCS_v3_cluster.conf#expected_votes|expected_votes]]=""</span> attribute be set to <span class="code">1</span>. Normally, <span class="code">expected_votes</span> is set automatically to the total sum of the defined cluster nodes' votes (which itself is a default of <span class="code">1</span>). This is the other half of the "trick", as a single node's vote of <span class="code">1</span> now always provides quorum (that is, <span class="code">1</span> meets the <span class="code">50% + 1</span> requirement).
 
<source lang="xml">
<?xml version="1.0"?>
<cluster name="an-clusterA" config_version="2">
<cman expected_votes="1" two_node="1"/>
</cluster>
</source>
 
Take note of the self-closing <span class="code"><... /></span> tag. This is an [[XML]] syntax that tells the parser not to look for any child tags or a closing tag.
 
=== Defining Cluster Nodes ===
 
This example is a little artificial; please don't load it into your cluster yet, as we will need to add a few child tags first. One thing at a time.
 
This actually introduces two tags.
 
The first is the parent <span class="code">[[RHCS_v3_cluster.conf#clusternodes.3B_Defining_Cluster_Nodes|clusternodes]]</span> tag, which takes no attributes of its own. Its sole purpose is to contain the <span class="code">[[RHCS_v3_cluster.conf#Tag.3B_clusternode|clusternode]]</span> child tags.
 
<source lang="xml">
<?xml version="1.0"?>
<cluster name="an-clusterA" config_version="3">
<cman expected_votes="1" two_node="1"/>
<clusternodes>
<clusternode name="an-node01.alteeve.com" nodeid="1" />
<clusternode name="an-node02.alteeve.com" nodeid="2" />
</clusternodes>
</cluster>
</source>
 
The <span class="code">[[RHCS_v3_cluster.conf#Tag.3B_clusternode|clusternode]]</span> tag defines each cluster node. There are many attributes available, but we will look at just the two required ones.
 
The first is the <span class="code">[[RHCS_v3_cluster.conf#clusternode.27s_name_attribute|name]]=""</span> attribute. This '''should''' match the name given by <span class="code">uname -n</span> (<span class="code">$HOSTNAME</span>) when run on each node. The [[IP]] address that the <span class="code">name</span> resolves to also sets the interface and subnet that the [[totem]] ring will run on. That is, the main cluster communications, which we are calling the '''Back-Channel Network'''. This is why it is so important to set up our <span class="code">[[2-Node_Red_Hat_KVM_Cluster_Tutorial#Setup_.2Fetc.2Fhosts|/etc/hosts]]</span> file correctly. Please see the [[RHCS_v3_cluster.conf#clusternode.27s_name_attribute|clusternode's name]] attribute document for details on how name-to-interface mapping is resolved.
 
The second attribute is <span class="code">[[RHCS_v3_cluster.conf#clusternode.27s_nodeid_attribute|nodeid]]=""</span>. This must be a unique integer amongst the <span class="code"><clusternode ...></span> tags. It is used by the cluster to identify the node.
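
Given how much hinges on the <span class="code">name</span> attribute, it is worth a quick sanity check on each node before going further. A minimal sketch:

<source lang="bash">
# This must match the clusternode name="..." value exactly.
uname -n

# The name must resolve to the node's Back-Channel Network address (10.20.0.x here).
getent hosts $(uname -n)
</source>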
 
=== Defining Fence Devices ===
 
[[2-Node_Red_Hat_KVM_Cluster_Tutorial#Concept.3B_Fencing|Fencing]] devices are designed to forcibly eject a node from a cluster. This is generally done by forcing it to power off or reboot. Some [[SAN]] switches can logically disconnect a node from the shared storage device, which has the same effect of guaranteeing that the defective node cannot alter the shared storage. A common, third type of fence device is one that cuts the mains power to the server.
 
In this tutorial, our nodes support [[IPMI]], which we will use as the primary fence device. We also have an [http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900 APC] brand switched PDU which will act as a backup in case a fault in the node disables the IPMI [[BMC]].
 
{{note|1=Not all brands of switched PDUs are supported as fence devices. Before you purchase a fence device, confirm that it is supported.}}
 
All fence devices are contained within the parent <span class="code">[[RHCS_v3_cluster.conf#fencedevices.3B_Defining_Fence_Devices|fencedevices]]</span> tag. This parent tag has no attributes. Within this parent tag are one or more <span class="code">[[RHCS_v3_cluster.conf#Tag.3B_fencedevice|fencedevice]]</span> child tags.
 
<source lang="xml">
<?xml version="1.0"?>
<cluster name="an-clusterA" config_version="4">
<cman expected_votes="1" two_node="1"/>
<clusternodes>
<clusternode name="an-node01.alteeve.com" nodeid="1" />
<clusternode name="an-node02.alteeve.com" nodeid="2" />
</clusternodes>
<fencedevices>
<fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="an-node01.ipmi" login="root" passwd="secret" />
<fencedevice name="ipmi_an02" agent="fence_ipmilan" ipaddr="an-node02.ipmi" login="root" passwd="secret" />
<fencedevice name="pdu2" agent="fence_apc" ipaddr="pdu2.alteeve.com" login="apc" passwd="secret" />
</fencedevices>
</cluster>
</source>

Every fence device used in your cluster will have its own <span class="code">[[RHCS_v3_cluster.conf#Tag.3B_fencedevice|fencedevice]]</span> tag. If you are using [[IPMI]], this means you will have a <span class="code">fencedevice</span> entry for each node, as each physical IPMI [[BMC]] is a unique fence device. On the other hand, fence devices that support multiple nodes, like switched PDUs, will have just one entry. In our case, we're using both types, so we have three fence devices; the two IPMI BMCs plus the switched PDU.

All <span class="code">fencedevice</span> tags share two basic attributes; <span class="code">[[RHCS_v3_cluster.conf#fencedevice.27s_name_attribute|name]]=""</span> and <span class="code">[[RHCS_v3_cluster.conf#fencedevice.27s_agent_attribute|agent]]=""</span>.

* The <span class="code">name</span> attribute must be unique among all the fence devices in your cluster. As we will see in the next step, this name will be used within the <span class="code"><clusternode...></span> tag.
* The <span class="code">agent</span> attribute tells the cluster which [[fence agent]] to use when the <span class="code">[[fenced]]</span> daemon needs to communicate with the physical fence device. A fence agent is simply a shell script that acts as a glue layer between the <span class="code">fenced</span> daemon and the fence hardware. The agent takes the arguments from the daemon, like what port to act on and what action to take, and executes the requested action against the node. The agent is responsible for ensuring that the execution succeeded and returning an appropriate success or failure exit code. For those curious, the full details are described in the <span class="code">[https://fedorahosted.org/cluster/wiki/FenceAgentAPI FenceAgentAPI]</span>. If you have two or more of the same fence device, like IPMI, then you will use the same fence <span class="code">agent</span> value a corresponding number of times.

Beyond these two attributes, each fence agent will have its own subset of attributes, the scope of which is outside this tutorial, though we will see examples for IPMI and a switched PDU. Most, if not all, fence agents have a corresponding man page that will show you what attributes they accept and how they are used. The two fence agents we will see here have their attributes defined in the following <span class="code">[[man]]</span> pages.

* <span class="code">man fence_ipmilan</span> - IPMI fence agent.
* <span class="code">man fence_apc</span> - APC-brand switched PDU.

The example above is what this tutorial will use.

==== Example <fencedevice...> Tag For IPMI ====

Here we will show what [[IPMI]] <span class="code"><fencedevice...></span> tags look like. We are using IPMI as the primary fence device in this tutorial, and it is a very popular fence device in general, so here is an example of its use.
 
<source lang="xml">
                ...
                        <fence>
                                <method name="ipmi">
                                        <device name="ipmi_an01" action="reboot"/>
                                </method>
                        </fence>
                ...
<fencedevices>
<fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="an-node01.ipmi" login="root" passwd="secret" />
<fencedevice name="ipmi_an02" agent="fence_ipmilan" ipaddr="an-node02.ipmi" login="root" passwd="secret" />
</fencedevices>
</source>
 
* <span class="code">ipaddr</span>; This is the resolvable name or [[IP]] address of the device. If you use a resolvable name, it is strongly advised that you put the name in <span class="code">/etc/hosts</span> as [[DNS]] is another layer of abstraction which could fail.
* <span class="code">login</span>; This is the login name to use when the <span class="code">fenced</span> daemon connects to the device.
* <span class="code">passwd</span>; This is the login password to use when the <span class="code">fenced</span> daemon connects to the device.
* <span class="code">name</span>; This is the name of this particular fence device within the cluster which, as we will see shortly, is matched in the <span class="code"><clusternode...></span> element where appropriate.
 
{{note|1=We will see shortly that, unlike switched PDUs or other network fence devices, [[IPMI]] does not have ports. This is because each [[IPMI]] BMC supports just its host system. More on that later.}}
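
Before trusting a BMC as a fence device, it doesn't hurt to confirm that it answers over the network at all. Assuming the <span class="code">ipmitool</span> package is installed and the example credentials above are in use (swap in <span class="code">-I lan</span> if your BMC doesn't speak <span class="code">lanplus</span>), a check from one node against the other's BMC might look like this:

<source lang="bash">
# A responsive BMC will report the chassis power state.
ipmitool -I lanplus -H an-node02.ipmi -U root -P secret chassis power status
</source>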
 
==== Example <fencedevice...> Tag For APC Switched PDUs ====
 
Here we will show how to configure APC switched [[PDU]] <span class="code"><fencedevice...></span> tags. In this tutorial it serves as the backup fence device, and in the real world a switched PDU is '''highly''' recommended as a backup for [[IPMI]] and similar primary fence devices.
 
<source lang="xml">
...
<fence>
<method name="pdu">
<device name="pdu2" port="1" action="reboot"/>
</method>
</fence>
...
<fencedevices>
<fencedevice name="pdu2" agent="fence_apc" ipaddr="pdu2.alteeve.com" login="apc" passwd="secret" />
</fencedevices>
</source>
 
* <span class="code">ipaddr</span>; This is the resolvable name or [[IP]] address of the device. If you use a resolvable name, it is strongly advised that you put the name in <span class="code">/etc/hosts</span> as [[DNS]] is another layer of abstraction which could fail.
* <span class="code">login</span>; This is the login name to use when the <span class="code">fenced</span> daemon connects to the device.
* <span class="code">passwd</span>; This is the login password to use when the <span class="code">fenced</span> daemon connects to the device.
* <span class="code">name</span>; This is the name of this particular fence device within the cluster which, as we will see shortly, is matched in the <span class="code"><clusternode...></span> element where appropriate.
 
=== Using the Fence Devices ===
 
Now that we have nodes and fence devices defined, we will go back and tie them together. This is done by:
* Defining a <span class="code">[[RHCS_v3_cluster.conf#Tag.3B_fence|fence]]</span> tag containing all fence methods and devices.
** Defining one or more <span class="code">[[RHCS_v3_cluster.conf#Tag.3B_method|method]]</span> tag(s) containing the device call(s) needed for each fence attempt.
*** Defining one or more <span class="code">[[RHCS_v3_cluster.conf#Tag.3B_device|device]]</span> tag(s) containing attributes describing how to call the fence device to kill this node.
 
Here is how we implement [[IPMI]] as the primary fence device with the switched PDU as the backup method.
 
<source lang="xml">
<?xml version="1.0"?>
<cluster name="an-clusterA" config_version="5">
        <cman expected_votes="1" two_node="1"/>
        <clusternodes>
                <clusternode name="an-node01.alteeve.com" nodeid="1">
                        <fence>
                                <method name="ipmi">
                                        <device name="ipmi_an01" action="reboot"/>
                                </method>
                                <method name="pdu">
                                        <device name="pdu2" port="1" action="reboot"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node02.alteeve.com" nodeid="2">
                        <fence>
                                <method name="ipmi">
                                        <device name="ipmi_an02" action="reboot"/>
                                </method>
                                <method name="pdu">
                                        <device name="pdu2" port="2" action="reboot"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="an-node01.ipmi" login="root" passwd="secret" />
                <fencedevice name="ipmi_an02" agent="fence_ipmilan" ipaddr="an-node02.ipmi" login="root" passwd="secret" />
                <fencedevice name="pdu2" agent="fence_apc" ipaddr="pdu2.alteeve.com" login="apc" passwd="secret" />
        </fencedevices>
</cluster>
</source>


First, notice that the <span class="code">[[RHCS_v3_cluster.conf#Tag.3B_fence|fence]]</span> tag has no attributes. It's merely a container for the <span class="code">[[RHCS_v3_cluster.conf#Tag.3B_method|method]](s)</span>.


The next level is the <span class="code">[[RHCS_v3_cluster.conf#Tag.3B_method|method]]</span> named <span class="code">ipmi</span>. This name is merely a description and can be whatever you feel is most appropriate. Its purpose is simply to help you distinguish this method from other methods. The reason for <span class="code">method</span> tags is that some fence device calls will have two or more steps. A classic example would be a node with a redundant power supply on a switched PDU acting as the fence device. In that case, you will need to define multiple <span class="code">[[RHCS_v3_cluster.conf#Tag.3B_device|device]]</span> tags, one for each power cable feeding the node, and the cluster will not consider the fence a success unless and until all contained <span class="code">[[RHCS_v3_cluster.conf#Tag.3B_device|device]]</span> calls execute successfully.
 
A second pair of <span class="code">method</span> and <span class="code">device</span> tags is supplied for each node. The first pair defines the IPMI interface, and the second pair defines the switched PDU. Note that the PDU definition needs a <span class="code">port=""</span> attribute where the IPMI fence device does not. When a fence call is needed, the fence devices will be called in the order they are found here. If both devices fail, the cluster will go back to the start and try again, looping indefinitely until one device succeeds.
 
{{note|1=It's important to understand why we use IPMI as the primary fence device. It is suggested, but not required, that the fence device confirm that the node is off. IPMI can do this; the switched PDU can not. Thus, IPMI won't return a success unless the node is truly off. The PDU, though, will return a success once the power is cut to the requested port. However, a misconfigured node with redundant power supplies may in fact still be running, leading to disastrous consequences.}}
 
The actual fence <span class="code">[[RHCS_v3_cluster.conf#Tag.3B_device|device]]</span> configuration is the final piece of the puzzle. It is here that you specify per-node configuration options and link these attributes to a given <span class="code">[[RHCS_v3_cluster.conf#Tag.3B_fencedevice|fencedevice]]</span>. Here, we see the link to the <span class="code">fencedevice</span> via the <span class="code">[[RHCS_v3_cluster.conf#device.27s_name_attribute|name]]</span>, <span class="code">ipmi_an01</span> in this example.
 
Let's step through an example fence call to help show how the per-cluster and fence device attributes are combined during a fence call.
 
* The cluster manager decides that a node needs to be fenced. Let's say that the victim is <span class="code">an-node02</span>.
* The <span class="code">fence</span> section under <span class="code">an-node02</span> is consulted. Within it there are two <span class="code">method</span> entries, named <span class="code">ipmi</span> and <span class="code">pdu</span>. The IPMI method's <span class="code">device</span> has one attribute beyond its name, while the PDU's <span class="code">device</span> has two;
** <span class="code">port</span>; only found in the PDU <span class="code">method</span>, this tells the cluster that <span class="code">an-node02</span> is connected to switched PDU's port number <span class="code">2</span>.
** <span class="code">action</span>; Found on both devices, this tells the cluster that the fence action to take is <span class="code">reboot</span>. How this action is actually interpreted depends on the fence device in use, though the name certainly implies that the node will be forced off and then restarted.
* The cluster searches in <span class="code">fencedevices</span> for a <span class="code">fencedevice</span> matching the name <span class="code">ipmi_an02</span>. This fence device has four attributes;
** <span class="code">agent</span>; This tells the cluster to call the <span class="code">fence_ipmilan</span> fence agent script, as we discussed earlier.
** <span class="code">ipaddr</span>; This tells the fence agent where on the network to find this particular IPMI BMC. This is how multiple fence devices of the same type can be used in the cluster.
** <span class="code">login</span>; This is the login user name to use when authenticating against the fence device.
** <span class="code">passwd</span>; This is the password to supply along with the <span class="code">login</span> name when authenticating against the fence device.
* Should the IPMI fence call fail for some reason, the cluster will move on to the second <span class="code">pdu</span> method, repeating the steps above but using the PDU values.
 
When the cluster calls the fence agent, it does so by running the fence agent script with no command-line arguments and then passing the arguments to it on standard input, one <span class="code">key=value</span> pair per line.


<source lang="bash">
<source lang="bash">
chmod 755 whichopensslcnf clean-all build-ca pkitool build-key-server build-key build-dh
/usr/sbin/fence_ipmilan
</source>
</source>
Then it will pass to that agent the following arguments:
<source lang="text">
<source lang="text">
total 136K
ipaddr=an-node02.ipmi
drwxr-xr-x    2 root root 4.0K Sep 29 00:18 .
login=root
drwxr-xr-x. 116 root root  12K Sep 28 23:55 ..
passwd=secret
-rwxr-xr-x    1 root root  119 Apr  6 12:05 build-ca
action=reboot
-rwxr-xr-x    1 root root  352 Apr  6 12:05 build-dh
-rw-r--r--    1 root root  188 Apr  6 12:05 build-inter
-rwxr-xr-x    1 root root  163 Apr  6 12:05 build-key
-rw-r--r--    1 root root  157 Apr  6 12:05 build-key-pass
-rw-r--r--    1 root root  249 Apr  6 12:05 build-key-pkcs12
-rwxr-xr-x    1 root root  268 Apr  6 12:05 build-key-server
-rw-r--r--    1 root root  213 Apr  6 12:05 build-req
-rw-r--r--    1 root root  158 Apr  6 12:05 build-req-pass
-rwxr-xr-x    1 root root  428 Apr  6 12:05 clean-all
-rw-r--r--    1 root root 1.5K Apr  6 12:05 inherit-inter
-rw-r--r--    1 root root  295 Apr  6 12:05 list-crl
-rw-r--r--    1 root root  389 Oct 21  2010 Makefile
-rw-r--r--    1 root root 7.6K Oct 21  2010 openssl-0.9.6.cnf
-rw-r--r--    1 root root 8.2K Oct 21  2010 openssl.cnf
-rwxr-xr-x    1 root root  13K Apr  6 12:05 pkitool
-rw-r--r--   1 root root 9.1K Oct 21  2010 README
-rw-r--r--    1 root root  918 Apr  6 12:05 revoke-full
-rw-r--r--    1 root root  178 Apr  6 12:05 sign-req
-rw-r--r--    1 root root 1.7K Sep 29 00:18 vars
-rw-r--r--    1 root root 1.7K Sep 29 00:13 vars.orig
-rwxr-xr-x    1 root root  190 Oct 21  2010 whichopensslcnf
</source>
</source>


As you can see then, the first three arguments are from the <span class="code">fencedevice</span> attributes and the last one is from the <span class="code">device</span> attributes under <span class="code">an-node02</span>'s <span class="code">clusternode</span>'s <span class="code">fence</span> tag.  
 
If this method fails, then the PDU will be called in a very similar way, but with an extra argument from the <span class="code">device</span> attributes.


<source lang="bash">
<source lang="bash">
. ./vars
/usr/sbin/fence_apc
</source>
</source>
Then it will pass to that agent the following arguments:
<source lang="text">
<source lang="text">
NOTE: If you run ./clean-all, I will be doing a rm -rf on /etc/openvpn/keys
ipaddr=pdu2.alteeve.com
login=root
passwd=secret
port=2
action=reboot
</source>
</source>


Should this fail, the cluster will go back and try the IPMI interface again. It will loop through the fence device methods forever until one of the methods succeeds.
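
You can also exercise a fence agent by hand, which is a handy way to confirm the attribute values before the cluster ever needs them. Most fence agents accept the same parameters as command-line switches; for <span class="code">fence_ipmilan</span>, a non-destructive status query using the example values above might look like this (check <span class="code">man fence_ipmilan</span> for the exact switches shipped with your version):

<source lang="bash">
# Ask an-node02's BMC for its power state without actually fencing it.
fence_ipmilan -a an-node02.ipmi -l root -p secret -o status
</source>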
 
=== Give Nodes More Time To Start ===
 
Clusters of three or more nodes must gain quorum before they can fence other nodes. As we saw earlier though, this is not really the case when using the <span class="code">[[RHCS_v3_cluster.conf#two_node|two_node]]="1"</span> attribute in the <span class="code">[[RHCS_v3_cluster.conf#cman.3B_The_Cluster_Manager|cman]]</span> tag. What this means in practice is that if you start the cluster on one node and then wait too long to start the cluster on the second node, the first will fence the second.
 
The logic behind this is; when the cluster starts, it will try to talk to its fellow node and then fail. With the special <span class="code">two_node="1"</span> attribute set, the cluster knows that it is allowed to start clustered services, but it has no way to say for sure what state the other node is in. It could well be online and hosting services for all it knows. So it has to proceed on the assumption that the other node is alive and using shared resources. Given that, and given that it can not talk to the other node, its only safe option is to fence the other node. Only then can it be confident that it is safe to start providing clustered services.
 
<source lang="xml">
<?xml version="1.0"?>
<cluster name="an-clusterA" config_version="7">
        <cman expected_votes="1" two_node="1"/>
        <clusternodes>
                <clusternode name="an-node01.alteeve.com" nodeid="1">
                        <fence>
                                <method name="ipmi">
                                        <device name="ipmi_an01" action="reboot"/>
                                </method>
                                <method name="pdu">
                                        <device name="pdu2" port="1" action="reboot"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node02.alteeve.com" nodeid="2">
                        <fence>
                                <method name="ipmi">
                                        <device name="ipmi_an02" action="reboot"/>
                                </method>
                                <method name="pdu">
                                        <device name="pdu2" port="2" action="reboot"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="an-node01.ipmi" login="root" passwd="secret" />
                <fencedevice name="ipmi_an02" agent="fence_ipmilan" ipaddr="an-node02.ipmi" login="root" passwd="secret" />
                <fencedevice name="pdu2" agent="fence_apc" ipaddr="pdu2.alteeve.com" login="apc" passwd="secret" />
        </fencedevices>
        <fence_daemon post_join_delay="30"/>
</cluster>
</source>
 
The new tag is <span class="code">[[RHCS_v3_cluster.conf#fence_daemon.3B_Fencing|fence_daemon]]</span>, seen near the bottom of the file above. The change is made using the <span class="code">[[RHCS_v3_cluster.conf#post_join_delay|post_join_delay]]="30"</span> attribute. By default, the cluster will declare the other node dead after just <span class="code">6</span> seconds. The default is kept low because the larger this value, the slower the start-up of the cluster services will be. During testing and development though, I find the default to be far too short; it frequently led to unnecessary fencing. Once your cluster is set up and working, it's not a bad idea to reduce this value to the lowest value that you are comfortable with.
 
=== Configuring Totem ===
 
This is almost a misnomer, as we're more or less ''not'' configuring the [[totem]] protocol in this cluster.
 
<source lang="xml">
<?xml version="1.0"?>
<cluster name="an-clusterA" config_version="8">
        <cman expected_votes="1" two_node="1"/>
        <clusternodes>
                <clusternode name="an-node01.alteeve.com" nodeid="1">
                        <fence>
                                <method name="ipmi">
                                        <device name="ipmi_an01" action="reboot"/>
                                </method>
                                <method name="pdu">
                                        <device name="pdu2" port="1" action="reboot"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node02.alteeve.com" nodeid="2">
                        <fence>
                                <method name="ipmi">
                                        <device name="ipmi_an02" action="reboot"/>
                                </method>
                                <method name="pdu">
                                        <device name="pdu2" port="2" action="reboot"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="an-node01.ipmi" login="root" passwd="secret" />
                <fencedevice name="ipmi_an02" agent="fence_ipmilan" ipaddr="an-node02.ipmi" login="root" passwd="secret" />
                <fencedevice name="pdu2" agent="fence_apc" ipaddr="pdu2.alteeve.com" login="apc" passwd="secret" />
        </fencedevices>
        <fence_daemon post_join_delay="30"/>
        <totem rrp_mode="none" secauth="off"/>
</cluster>
</source>
 
{{note|1=At this time, [[redundant ring protocol]] is not supported ([[RHEL6]].1 and lower) and will be in technology preview mode ([[RHEL6]].2 and above). For this reason, we will not be using it. However, we are using bonding, so we still have removed a single point of failure.}}
 
[[RRP]] is an optional second ring that can be used for cluster communication in the case of a break down in the first ring. However, if you wish to explore it further, please take a look at the <span class="code">clusternode</span> element tag called <span class="code"><[[RHCS_v3_cluster.conf#Tag.3B_altname|altname]]...></span>. When <span class="code">altname</span> is used though, then the <span class="code">[[RHCS_v3_cluster.conf#rrp_mode|rrp_mode]]</span> attribute will need to be changed to either <span class="code">active</span> or <span class="code">passive</span> (the details of which are outside the scope of this tutorial).
 
The second option we're looking at here is the <span class="code">[[RHCS_v3_cluster.conf#secauth|secauth]]="off"</span> attribute. This controls whether the cluster communications are encrypted or not. We can safely disable this because we're working on a known-private network, which yields two benefits; it's simpler to set up and it's a lot faster. If you must encrypt the cluster communications, then you can do so here. The details are also outside the scope of this tutorial though.
 
=== Validating and Pushing the /etc/cluster/cluster.conf File ===
 
One of the most noticeable changes in [[RHCS]] cluster stable 3 is that we no longer have to make a long, cryptic <span class="code">xmllint</span> call to validate our cluster configuration. Now we can simply call <span class="code">ccs_config_validate</span>.
 
<source lang="bash">
<source lang="bash">
./clean-all
ccs_config_validate
</source>
<source lang="text">
Configuration validates
</source>
</source>


If there was a problem, you need to go back and fix it. '''DO NOT''' proceed until your configuration validates. Once it does, we're ready to move on!
 
With it validated, we need to push it to the other node. As the cluster is not running yet, we will push it out using <span class="code">rsync</span>.


<source lang="bash">
<source lang="bash">
./build-ca
rsync -av /etc/cluster/cluster.conf root@an-node02:/etc/cluster/
</source>
</source>
<source lang="text">
<source lang="text">
Generating a 1024 bit RSA private key
sending incremental file list
....++++++
cluster.conf
.....................................................++++++
 
writing new private key to 'ca.key'
sent 1228 bytes  received 31 bytes  2518.00 bytes/sec
-----
total size is 1148  speedup is 0.91
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [CA]:
State or Province Name (full name) [ON]:
Locality Name (eg, city) [Toronto]:
Organization Name (eg, company) [Alteeve's Niche!]:
Organizational Unit Name (eg, section) []:
Common Name (eg, your name or your server's hostname) [Alteeve's Niche! CA]:
Name []:
Email Address [admin@alteeve.com]:
</source>
</source>
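
If you want to be certain both nodes are now working from the same file, comparing checksums is a quick way to do it:

<source lang="bash">
# Run on both nodes; the sums should be identical.
md5sum /etc/cluster/cluster.conf
</source>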


=== Setting Up ricci ===


Another change from [[RHCS]] stable 2 is how configuration changes are propagated. Before, after a change, we'd push out the updated cluster configuration by calling <span class="code">ccs_tool update /etc/cluster/cluster.conf</span>. Now this is done with <span class="code">cman_tool version -r</span>. More fundamentally though, the cluster needs to authenticate against each node and does this using the local <span class="code">ricci</span> system user. The user has no password initially, so we need to set one.
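
For reference, once the cluster is up and <span class="code">ricci</span> is configured, a configuration update will roughly follow this pattern. This is just a sketch of what's coming; we are not ready to run it yet:

<source lang="bash">
vim /etc/cluster/cluster.conf   # make the change and increment config_version
ccs_config_validate             # confirm the new file is still valid
cman_tool version -r            # push the new version out to the other node
</source>

For now though, we just need to give the <span class="code">ricci</span> user a password.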
 
On '''both''' nodes:


<source lang="bash">
<source lang="bash">
./build-key-server daimon.alteeve.com
passwd ricci
</source>
<source lang="text">
Changing password for user ricci.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
</source>
</source>


You will need to enter this password once from each node against the other node. We will see this later.


Now make sure that the <span class="code">ricci</span> daemon is set to start on boot and is running now.


<source lang="bash">
<source lang="bash">
./build-key-server daimon.alteeve.com
chkconfig ricci on
/etc/init.d/ricci start
</source>
</source>
<source lang="text">
<source lang="text">
Generating a 1024 bit RSA private key
Starting ricci:                                            [  OK  ]
..................................++++++
</source>
..................................................++++++
 
writing new private key to 'daimon.alteeve.com.key'
{{note|1=If you don't see <span class="code">[  OK  ]</span>, don't worry, it is probably because it was already running.}}
 
=== Starting the Cluster for the First Time ===
 
It's a good idea to open a second terminal on each node and <span class="code">tail</span> the <span class="code">/var/log/messages</span> [[syslog]] file. All cluster messages will be recorded here and it will help to debug problems if you can watch the logs. To do this, in the new terminal windows run;
 
<source lang="bash">
clear; tail -f -n 0 /var/log/messages
</source>
 
This will clear the screen and start watching for new lines to be written to syslog. When you are done watching syslog, press the <span class="code"><ctrl></span> + <span class="code">c</span> key combination.
 
How you lay out your terminal windows is, obviously, up to your own preferences. Below is a configuration I have found very useful.
 
[[Image:2-node-rhcs3_terminal-window-layout_01.png|thumb|center|700px|Terminal window layout for watching 2 nodes. Left windows are used for entering commands and the right windows are used for tailing syslog.]]
 
With the terminals set up, let's start the cluster!
 
{{warning|1=If you don't start <span class="code">cman</span> on both nodes within 30 seconds, the slower node will be fenced.}}


On '''both''' nodes, run:
 
<source lang="bash">
/etc/init.d/cman start
</source>
<source lang="text">
Starting cluster:
  Checking Network Manager...                            [  OK  ]
  Global setup...                                        [  OK  ]
  Loading kernel modules...                              [  OK  ]
  Mounting configfs...                                    [  OK  ]
  Starting cman...                                        [  OK  ]
  Waiting for quorum...                                  [  OK  ]
  Starting fenced...                                      [  OK  ]
  Starting dlm_controld...                                [  OK  ]
  Starting gfs_controld...                                [  OK  ]
  Unfencing self...                                      [  OK  ]
  Joining fence domain...                                [  OK  ]
</source>
</source>
Here is what you should see in syslog:
<source lang="text">
<source lang="text">
1 out of 1 certificate requests certified, commit? [y/n]y
Sep 14 13:33:58 an-node01 kernel: DLM (built Jun 27 2011 19:51:46) installed
Sep 14 13:33:58 an-node01 corosync[18897]:  [MAIN  ] Corosync Cluster Engine ('1.2.3'): started and ready to provide service.
Sep 14 13:33:58 an-node01 corosync[18897]:  [MAIN  ] Corosync built-in features: nss rdma
Sep 14 13:33:58 an-node01 corosync[18897]:  [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
Sep 14 13:33:58 an-node01 corosync[18897]:  [MAIN  ] Successfully parsed cman config
Sep 14 13:33:58 an-node01 corosync[18897]:  [TOTEM ] Initializing transport (UDP/IP).
Sep 14 13:33:58 an-node01 corosync[18897]:  [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Sep 14 13:33:58 an-node01 corosync[18897]:  [TOTEM ] The network interface [10.20.0.1] is now up.
Sep 14 13:33:58 an-node01 corosync[18897]:  [QUORUM] Using quorum provider quorum_cman
Sep 14 13:33:58 an-node01 corosync[18897]:  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Sep 14 13:33:58 an-node01 corosync[18897]:  [CMAN  ] CMAN 3.0.12 (built Jul  4 2011 22:35:06) started
Sep 14 13:33:58 an-node01 corosync[18897]:  [SERV  ] Service engine loaded: corosync CMAN membership service 2.90
Sep 14 13:33:58 an-node01 corosync[18897]:  [SERV  ] Service engine loaded: openais checkpoint service B.01.01
Sep 14 13:33:58 an-node01 corosync[18897]:  [SERV  ] Service engine loaded: corosync extended virtual synchrony service
Sep 14 13:33:58 an-node01 corosync[18897]:  [SERV  ] Service engine loaded: corosync configuration service
Sep 14 13:33:58 an-node01 corosync[18897]:  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
Sep 14 13:33:58 an-node01 corosync[18897]:  [SERV  ] Service engine loaded: corosync cluster config database access v1.01
Sep 14 13:33:58 an-node01 corosync[18897]:  [SERV  ] Service engine loaded: corosync profile loading service
Sep 14 13:33:58 an-node01 corosync[18897]:  [QUORUM] Using quorum provider quorum_cman
Sep 14 13:33:58 an-node01 corosync[18897]:  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Sep 14 13:33:58 an-node01 corosync[18897]:  [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
Sep 14 13:33:58 an-node01 corosync[18897]:  [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 14 13:33:58 an-node01 corosync[18897]:  [CMAN  ] quorum regained, resuming activity
Sep 14 13:33:58 an-node01 corosync[18897]:  [QUORUM] This node is within the primary component and will provide service.
Sep 14 13:33:58 an-node01 corosync[18897]:  [QUORUM] Members[1]: 1
Sep 14 13:33:58 an-node01 corosync[18897]:  [QUORUM] Members[1]: 1
Sep 14 13:33:58 an-node01 corosync[18897]:  [CPG  ] downlist received left_list: 0
Sep 14 13:33:58 an-node01 corosync[18897]:  [CPG  ] chosen downlist from node r(0) ip(10.20.0.1)
Sep 14 13:33:58 an-node01 corosync[18897]:  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 14 13:34:02 an-node01 corosync[18897]:  [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 14 13:34:02 an-node01 corosync[18897]:  [QUORUM] Members[2]: 1 2
Sep 14 13:34:02 an-node01 corosync[18897]:  [QUORUM] Members[2]: 1 2
Sep 14 13:34:02 an-node01 corosync[18897]:  [CPG  ] downlist received left_list: 0
Sep 14 13:34:02 an-node01 corosync[18897]:  [CPG  ] downlist received left_list: 0
Sep 14 13:34:02 an-node01 corosync[18897]:  [CPG  ] chosen downlist from node r(0) ip(10.20.0.1)
Sep 14 13:34:02 an-node01 corosync[18897]:  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 14 13:34:02 an-node01 fenced[18954]: fenced 3.0.12 started
Sep 14 13:34:02 an-node01 dlm_controld[18978]: dlm_controld 3.0.12 started
Sep 14 13:34:02 an-node01 gfs_controld[19000]: gfs_controld 3.0.12 started
</source>
 
Now to confirm that the cluster is operating properly, run <span class="code">cman_tool status</span>;
 
<source lang="bash">
cman_tool status
</source>
<source lang="bash">
Version: 6.2.0
Config Version: 8
Cluster Name: an-clusterA
Cluster Id: 29382
Cluster Member: Yes
Cluster Generation: 8
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Node votes: 1
Quorum: 1 
Active subsystems: 7
Flags: 2node
Ports Bound: 0 
Node name: an-node01.alteeve.com
Node ID: 1
Multicast addresses: 239.192.114.57
Node addresses: 10.20.0.1
</source>
 
We can see that both nodes are talking because of the <span class="code">Nodes: 2</span> entry.
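
For a more compact, per-node view of membership, <span class="code">cman_tool nodes</span> is also handy:

<source lang="bash">
# Lists each node's ID, join time, membership status and name.
cman_tool nodes
</source>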
 
If you ever want to see the nitty-gritty configuration, you can run <span class="code">corosync-objctl</span>.
 
<source lang="bash">
corosync-objctl
</source>
<source lang="text">
<source lang="text">
Write out database with 1 new entries
cluster.name=an-clusterA
Data Base Updated
cluster.config_version=8
cluster.cman.expected_votes=1
cluster.cman.two_node=1
cluster.cman.nodename=an-node01.alteeve.com
cluster.cman.cluster_id=29382
cluster.clusternodes.clusternode.name=an-node01.alteeve.com
cluster.clusternodes.clusternode.nodeid=1
cluster.clusternodes.clusternode.fence.method.name=ipmi
cluster.clusternodes.clusternode.fence.method.device.name=ipmi_an01
cluster.clusternodes.clusternode.fence.method.device.action=reboot
cluster.clusternodes.clusternode.fence.method.name=pdu
cluster.clusternodes.clusternode.fence.method.device.name=pdu2
cluster.clusternodes.clusternode.fence.method.device.port=1
cluster.clusternodes.clusternode.fence.method.device.action=reboot
cluster.clusternodes.clusternode.name=an-node02.alteeve.com
cluster.clusternodes.clusternode.nodeid=2
cluster.clusternodes.clusternode.fence.method.name=ipmi
cluster.clusternodes.clusternode.fence.method.device.name=ipmi_an02
cluster.clusternodes.clusternode.fence.method.device.action=reboot
cluster.clusternodes.clusternode.fence.method.name=pdu
cluster.clusternodes.clusternode.fence.method.device.name=pdu2
cluster.clusternodes.clusternode.fence.method.device.port=2
cluster.clusternodes.clusternode.fence.method.device.action=reboot
cluster.fencedevices.fencedevice.name=ipmi_an01
cluster.fencedevices.fencedevice.agent=fence_ipmilan
cluster.fencedevices.fencedevice.ipaddr=an-node01.ipmi
cluster.fencedevices.fencedevice.login=root
cluster.fencedevices.fencedevice.passwd=secret
cluster.fencedevices.fencedevice.name=ipmi_an02
cluster.fencedevices.fencedevice.agent=fence_ipmilan
cluster.fencedevices.fencedevice.ipaddr=an-node02.ipmi
cluster.fencedevices.fencedevice.login=root
cluster.fencedevices.fencedevice.passwd=secret
cluster.fencedevices.fencedevice.name=pdu2
cluster.fencedevices.fencedevice.agent=fence_apc
cluster.fencedevices.fencedevice.ipaddr=pdu2.alteeve.com
cluster.fencedevices.fencedevice.login=apc
cluster.fencedevices.fencedevice.passwd=secret
cluster.fence_daemon.post_join_delay=30
cluster.totem.rrp_mode=none
cluster.totem.secauth=off
totem.rrp_mode=none
totem.secauth=off
totem.version=2
totem.nodeid=1
totem.vsftype=none
totem.token=10000
totem.join=60
totem.fail_recv_const=2500
totem.consensus=2000
totem.key=an-clusterA
totem.interface.ringnumber=0
totem.interface.bindnetaddr=10.20.0.1
totem.interface.mcastaddr=239.192.114.57
totem.interface.mcastport=5405
libccs.next_handle=7
libccs.connection.ccs_handle=3
libccs.connection.config_version=8
libccs.connection.fullxpath=0
libccs.connection.ccs_handle=4
libccs.connection.config_version=8
libccs.connection.fullxpath=0
libccs.connection.ccs_handle=5
libccs.connection.config_version=8
libccs.connection.fullxpath=0
logging.timestamp=on
logging.to_logfile=yes
logging.logfile=/var/log/cluster/corosync.log
logging.logfile_priority=info
logging.to_syslog=yes
logging.syslog_facility=local4
logging.syslog_priority=info
aisexec.user=ais
aisexec.group=ais
service.name=corosync_quorum
service.ver=0
service.name=corosync_cman
service.ver=0
quorum.provider=quorum_cman
service.name=openais_ckpt
service.ver=0
runtime.services.quorum.service_id=12
runtime.services.cman.service_id=9
runtime.services.ckpt.service_id=3
runtime.services.ckpt.0.tx=0
runtime.services.ckpt.0.rx=0
runtime.services.ckpt.1.tx=0
runtime.services.ckpt.1.rx=0
runtime.services.ckpt.2.tx=0
runtime.services.ckpt.2.rx=0
runtime.services.ckpt.3.tx=0
runtime.services.ckpt.3.rx=0
runtime.services.ckpt.4.tx=0
runtime.services.ckpt.4.rx=0
runtime.services.ckpt.5.tx=0
runtime.services.ckpt.5.rx=0
runtime.services.ckpt.6.tx=0
runtime.services.ckpt.6.rx=0
runtime.services.ckpt.7.tx=0
runtime.services.ckpt.7.rx=0
runtime.services.ckpt.8.tx=0
runtime.services.ckpt.8.rx=0
runtime.services.ckpt.9.tx=0
runtime.services.ckpt.9.rx=0
runtime.services.ckpt.10.tx=0
runtime.services.ckpt.10.rx=0
runtime.services.ckpt.11.tx=2
runtime.services.ckpt.11.rx=3
runtime.services.ckpt.12.tx=0
runtime.services.ckpt.12.rx=0
runtime.services.ckpt.13.tx=0
runtime.services.ckpt.13.rx=0
runtime.services.evs.service_id=0
runtime.services.evs.0.tx=0
runtime.services.evs.0.rx=0
runtime.services.cfg.service_id=7
runtime.services.cfg.0.tx=0
runtime.services.cfg.0.rx=0
runtime.services.cfg.1.tx=0
runtime.services.cfg.1.rx=0
runtime.services.cfg.2.tx=0
runtime.services.cfg.2.rx=0
runtime.services.cfg.3.tx=0
runtime.services.cfg.3.rx=0
runtime.services.cpg.service_id=8
runtime.services.cpg.0.tx=4
runtime.services.cpg.0.rx=8
runtime.services.cpg.1.tx=0
runtime.services.cpg.1.rx=0
runtime.services.cpg.2.tx=0
runtime.services.cpg.2.rx=0
runtime.services.cpg.3.tx=16
runtime.services.cpg.3.rx=23
runtime.services.cpg.4.tx=0
runtime.services.cpg.4.rx=0
runtime.services.cpg.5.tx=2
runtime.services.cpg.5.rx=3
runtime.services.confdb.service_id=11
runtime.services.pload.service_id=13
runtime.services.pload.0.tx=0
runtime.services.pload.0.rx=0
runtime.services.pload.1.tx=0
runtime.services.pload.1.rx=0
runtime.services.quorum.service_id=12
runtime.connections.active=6
runtime.connections.closed=110
runtime.connections.fenced:18954:16.service_id=8
runtime.connections.fenced:18954:16.client_pid=18954
runtime.connections.fenced:18954:16.responses=5
runtime.connections.fenced:18954:16.dispatched=9
runtime.connections.fenced:18954:16.requests=5
runtime.connections.fenced:18954:16.sem_retry_count=0
runtime.connections.fenced:18954:16.send_retry_count=0
runtime.connections.fenced:18954:16.recv_retry_count=0
runtime.connections.fenced:18954:16.flow_control=0
runtime.connections.fenced:18954:16.flow_control_count=0
runtime.connections.fenced:18954:16.queue_size=0
runtime.connections.dlm_controld:18978:24.service_id=8
runtime.connections.dlm_controld:18978:24.client_pid=18978
runtime.connections.dlm_controld:18978:24.responses=5
runtime.connections.dlm_controld:18978:24.dispatched=8
runtime.connections.dlm_controld:18978:24.requests=5
runtime.connections.dlm_controld:18978:24.sem_retry_count=0
runtime.connections.dlm_controld:18978:24.send_retry_count=0
runtime.connections.dlm_controld:18978:24.recv_retry_count=0
runtime.connections.dlm_controld:18978:24.flow_control=0
runtime.connections.dlm_controld:18978:24.flow_control_count=0
runtime.connections.dlm_controld:18978:24.queue_size=0
runtime.connections.dlm_controld:18978:19.service_id=3
runtime.connections.dlm_controld:18978:19.client_pid=18978
runtime.connections.dlm_controld:18978:19.responses=0
runtime.connections.dlm_controld:18978:19.dispatched=0
runtime.connections.dlm_controld:18978:19.requests=0
runtime.connections.dlm_controld:18978:19.sem_retry_count=0
runtime.connections.dlm_controld:18978:19.send_retry_count=0
runtime.connections.dlm_controld:18978:19.recv_retry_count=0
runtime.connections.dlm_controld:18978:19.flow_control=0
runtime.connections.dlm_controld:18978:19.flow_control_count=0
runtime.connections.dlm_controld:18978:19.queue_size=0
runtime.connections.gfs_controld:19000:22.service_id=8
runtime.connections.gfs_controld:19000:22.client_pid=19000
runtime.connections.gfs_controld:19000:22.responses=5
runtime.connections.gfs_controld:19000:22.dispatched=8
runtime.connections.gfs_controld:19000:22.requests=5
runtime.connections.gfs_controld:19000:22.sem_retry_count=0
runtime.connections.gfs_controld:19000:22.send_retry_count=0
runtime.connections.gfs_controld:19000:22.recv_retry_count=0
runtime.connections.gfs_controld:19000:22.flow_control=0
runtime.connections.gfs_controld:19000:22.flow_control_count=0
runtime.connections.gfs_controld:19000:22.queue_size=0
runtime.connections.fenced:18954:25.service_id=8
runtime.connections.fenced:18954:25.client_pid=18954
runtime.connections.fenced:18954:25.responses=5
runtime.connections.fenced:18954:25.dispatched=8
runtime.connections.fenced:18954:25.requests=5
runtime.connections.fenced:18954:25.sem_retry_count=0
runtime.connections.fenced:18954:25.send_retry_count=0
runtime.connections.fenced:18954:25.recv_retry_count=0
runtime.connections.fenced:18954:25.flow_control=0
runtime.connections.fenced:18954:25.flow_control_count=0
runtime.connections.fenced:18954:25.queue_size=0
runtime.connections.corosync-objctl:19188:23.service_id=11
runtime.connections.corosync-objctl:19188:23.client_pid=19188
runtime.connections.corosync-objctl:19188:23.responses=435
runtime.connections.corosync-objctl:19188:23.dispatched=0
runtime.connections.corosync-objctl:19188:23.requests=438
runtime.connections.corosync-objctl:19188:23.sem_retry_count=0
runtime.connections.corosync-objctl:19188:23.send_retry_count=0
runtime.connections.corosync-objctl:19188:23.recv_retry_count=0
runtime.connections.corosync-objctl:19188:23.flow_control=0
runtime.connections.corosync-objctl:19188:23.flow_control_count=0
runtime.connections.corosync-objctl:19188:23.queue_size=0
runtime.totem.pg.mrp.srp.orf_token_tx=2
runtime.totem.pg.mrp.srp.orf_token_rx=744
runtime.totem.pg.mrp.srp.memb_merge_detect_tx=365
runtime.totem.pg.mrp.srp.memb_merge_detect_rx=365
runtime.totem.pg.mrp.srp.memb_join_tx=3
runtime.totem.pg.mrp.srp.memb_join_rx=5
runtime.totem.pg.mrp.srp.mcast_tx=46
runtime.totem.pg.mrp.srp.mcast_retx=0
runtime.totem.pg.mrp.srp.mcast_rx=57
runtime.totem.pg.mrp.srp.memb_commit_token_tx=4
runtime.totem.pg.mrp.srp.memb_commit_token_rx=4
runtime.totem.pg.mrp.srp.token_hold_cancel_tx=4
runtime.totem.pg.mrp.srp.token_hold_cancel_rx=7
runtime.totem.pg.mrp.srp.operational_entered=2
runtime.totem.pg.mrp.srp.operational_token_lost=0
runtime.totem.pg.mrp.srp.gather_entered=2
runtime.totem.pg.mrp.srp.gather_token_lost=0
runtime.totem.pg.mrp.srp.commit_entered=2
runtime.totem.pg.mrp.srp.commit_token_lost=0
runtime.totem.pg.mrp.srp.recovery_entered=2
runtime.totem.pg.mrp.srp.recovery_token_lost=0
runtime.totem.pg.mrp.srp.consensus_timeouts=0
runtime.totem.pg.mrp.srp.mtt_rx_token=1903
runtime.totem.pg.mrp.srp.avg_token_workload=0
runtime.totem.pg.mrp.srp.avg_backlog_calc=0
runtime.totem.pg.mrp.srp.rx_msg_dropped=0
runtime.blackbox.dump_flight_data=no
runtime.blackbox.dump_state=no
cman_private.COROSYNC_DEFAULT_CONFIG_IFACE=xmlconfig:cmanpreconfig
</source>


== Testing Fencing ==


We need to thoroughly test our fence configuration and devices before we proceed. If the cluster calls a fence and that fence call fails, the cluster will hang until the fence finally succeeds. There is no way to abort a fence, so a failed fence call can effectively hang the cluster. If we have problems, we need to find them now.


We need to run two tests from each node against the other node for a total of four tests.
* The first test will use <span class="code">fence_ipmilan</span>. To do this, we will hang the victim node by running <span class="code">echo c > /proc/sysrq-trigger</span> on it. This will immediately and completely hang the kernel. The other node should detect the failure and reboot the victim. You can confirm that IPMI was used by watching the fence PDU and '''not''' seeing it power-cycle the port.
* The second test will pull the power on the victim node. This is done to ensure that the IPMI BMC is also dead, and it simulates a failure of the power supply. You should see the other node try to fence the victim, fail initially, then try again using the second, switched PDU. If you watch the PDU, you should see the power indicator LED go off and then come back on.


{{note|1=To "pull the power", we can actually just log into the PDU and turn off the victim's power. In this case, we'll see the power restored when the PDU is used to fence the node. We can actually use the <span class="code">fence_apc</span> fence agent to pull the power, as we'll see.}}
 
{|class="wikitable"
!Test
!Victim
!Pass?
|-
|<span class="code">echo c > /proc/sysrq-trigger</span>
|<span class="code">an-node01</span>
|
|-
|<span class="code">fence_apc -a pdu2.alteeve.com -l apc -p secret -n 1 -o off</span>
|<span class="code">an-node01</span>
|
|-
|<span class="code">echo c > /proc/sysrq-trigger</span>
|<span class="code">an-node02</span>
|
|-
|<span class="code">fence_apc -a pdu2.alteeve.com -l apc -p secret -n 2 -o off</span>
|<span class="code">an-node02</span>
|
|}
 
After the lost node is recovered, remember to restart <span class="code">cman</span> before starting the next test.
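
Before hanging anything, it doesn't hurt to confirm that each fence agent can actually reach its device by asking for a simple status. The PDU values below are this tutorial's example values; the IPMI BMC address and credentials are placeholders you will need to substitute for your own hardware.

<source lang="bash">
# Sanity check only; '-o status' queries the device, it does not fence anything.
fence_apc -a pdu2.alteeve.com -l apc -p secret -n 1 -o status
# The IPMI BMC address, user and password below are placeholders.
fence_ipmilan -a an-node01.ipmi -l root -p secret -o status
</source>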
 
=== Hanging an-node01 ===
 
Be sure to be <span class="code">tail</span>ing the <span class="code">/var/log/messages</span> on <span class="code">an-node02</span>. Go to <span class="code">an-node01</span>'s first terminal and run the following command.
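
A minimal way to do that is to follow the log while skipping its existing contents:

<source lang="bash">
tail -f -n 0 /var/log/messages
</source>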
 
{{warning|1=This command will not return and you will lose all ability to talk to this node until it is rebooted.}}
 
On '''<span class="code">an-node01</span>''' run:


<source lang="bash">
echo c > /proc/sysrq-trigger
</source>
On '''<span class="code">an-node02</span>''''s syslog terminal, you should see the following entries in the log.
<source lang="text">
Sep 15 16:08:17 an-node02 corosync[12347]:  [TOTEM ] A processor failed, forming new configuration.
Sep 15 16:08:19 an-node02 corosync[12347]:  [QUORUM] Members[1]: 2
Sep 15 16:08:19 an-node02 corosync[12347]:  [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 15 16:08:19 an-node02 corosync[12347]:  [CPG  ] downlist received left_list: 1
Sep 15 16:08:19 an-node02 corosync[12347]:  [CPG  ] chosen downlist from node r(0) ip(10.20.0.2)
Sep 15 16:08:19 an-node02 corosync[12347]:  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 15 16:08:19 an-node02 kernel: dlm: closing connection to node 1
Sep 15 16:08:19 an-node02 fenced[12403]: fencing node an-node01.alteeve.com
Sep 15 16:08:33 an-node02 fenced[12403]: fence an-node01.alteeve.com success
</source>
Perfect!

If you are watching <span class="code">an-node01</span>'s display, you should now see it starting back up. Once it finishes booting, log into it and restart <span class="code">cman</span> before moving on to the next test.
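
Restarting the cluster manager is just a matter of calling its init script again; as a minimal example:

<source lang="bash">
/etc/init.d/cman start
</source>
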
=== Cutting the Power to an-node01 ===

As was discussed earlier, IPMI has a fatal flaw as a fence device. IPMI's [[BMC]] draws its power from the same power supply as the node itself. Thus, when the power supply itself fails (or the mains connection is pulled or tripped over), fencing via IPMI will fail. This makes the power supply a single point of failure, which is what the PDU protects us against.

So to simulate a failed power supply, we're going to use <span class="code">an-node02</span>'s <span class="code">fence_apc</span> fence agent to turn off the power to <span class="code">an-node01</span>.

Alternatively, you could simply unplug the power and the fence would still succeed. A fence call only needs to confirm that the node is off in order to succeed; whether or not the node restarts afterwards does not matter as far as the cluster is concerned.

From '''<span class="code">an-node02</span>''', pull the power on <span class="code">an-node01</span> with the following call;
<source lang="bash">
fence_apc -a pdu2.alteeve.com -l apc -p secret -n 1 -o off
</source>
<source lang="text">
Success: Powered OFF
</source>

Back on <span class="code">an-node02</span>'s syslog, we should see the following entries;

<source lang="text">
Sep 15 16:18:06 an-node02 corosync[12347]:  [TOTEM ] A processor failed, forming new configuration.
Sep 15 16:18:08 an-node02 corosync[12347]:  [QUORUM] Members[1]: 2
Sep 15 16:18:08 an-node02 corosync[12347]:  [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 15 16:18:08 an-node02 kernel: dlm: closing connection to node 1
Sep 15 16:18:08 an-node02 corosync[12347]:  [CPG  ] downlist received left_list: 1
Sep 15 16:18:08 an-node02 corosync[12347]:  [CPG  ] chosen downlist from node r(0) ip(10.20.0.2)
Sep 15 16:18:08 an-node02 corosync[12347]:  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 15 16:18:08 an-node02 fenced[12403]: fencing node an-node01.alteeve.com
Sep 15 16:18:31 an-node02 fenced[12403]: fence an-node01.alteeve.com dev 0.0 agent fence_ipmilan result: error from agent
Sep 15 16:18:31 an-node02 fenced[12403]: fence an-node01.alteeve.com success
</source>
Hoozah!

Notice that there is an error from the <span class="code">fence_ipmilan</span> agent. This is exactly what we expected, because the IPMI BMC lost power along with the node.

So now we know that <span class="code">an-node01</span> can be fenced successfully from both fence devices. Now we need to run the same tests against <span class="code">an-node02</span>.
=== Hanging an-node02 ===
{{warning|1='''DO NOT ASSUME THAT <span class="code">an-node02</span> WILL FENCE PROPERLY JUST BECAUSE <span class="code">an-node01</span> PASSED!'''. There are many ways that a fence could fail; Bad password, misconfigured device, plugged into the wrong port on the PDU and so on. Always test all nodes using all methods!}}
Be sure to be <span class="code">tail</span>ing the <span class="code">/var/log/messages</span> on <span class="code">an-node01</span>. Go to <span class="code">an-node02</span>'s first terminal and run the following command.
{{note|1=This command will not return and you will lose all ability to talk to this node until it is rebooted.}}
On '''<span class="code">an-node02</span>''' run:
<source lang="bash">
echo c > /proc/sysrq-trigger
</source>
On '''<span class="code">an-node01</span>''''s syslog terminal, you should see the following entries in the log.
<source lang="text">
Sep 15 15:26:19 an-node01 corosync[2223]:  [TOTEM ] A processor failed, forming new configuration.
Sep 15 15:26:21 an-node01 corosync[2223]:  [QUORUM] Members[1]: 1
Sep 15 15:26:21 an-node01 corosync[2223]:  [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 15 15:26:21 an-node01 corosync[2223]:  [CPG  ] downlist received left_list: 1
Sep 15 15:26:21 an-node01 corosync[2223]:  [CPG  ] chosen downlist from node r(0) ip(10.20.0.1)
Sep 15 15:26:21 an-node01 corosync[2223]:  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 15 15:26:21 an-node01 fenced[2280]: fencing node an-node02.alteeve.com
Sep 15 15:26:21 an-node01 kernel: dlm: closing connection to node 2
Sep 15 15:26:36 an-node01 fenced[2280]: fence an-node02.alteeve.com success
</source>
 
Again, perfect!
 
=== Cutting the Power to an-node02 ===
 
From '''<span class="code">an-node01</span>''', pull the power on <span class="code">an-node02</span> with the following call;
 
<source lang="bash">
fence_apc -a pdu2.alteeve.com -l apc -p secret -n 2 -o off
</source>
<source lang="text">
Success: Powered OFF
</source>
Back on <span class="code">an-node01</span>'s syslog, we should see the following entries;
<source lang="text">
Sep 15 15:36:30 an-node01 corosync[2223]:  [TOTEM ] A processor failed, forming new configuration.
Sep 15 15:36:32 an-node01 corosync[2223]:  [QUORUM] Members[1]: 1
Sep 15 15:36:32 an-node01 corosync[2223]:  [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 15 15:36:32 an-node01 kernel: dlm: closing connection to node 2
Sep 15 15:36:32 an-node01 corosync[2223]:  [CPG  ] downlist received left_list: 1
Sep 15 15:36:32 an-node01 corosync[2223]:  [CPG  ] chosen downlist from node r(0) ip(10.20.0.1)
Sep 15 15:36:32 an-node01 corosync[2223]:  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 15 15:36:32 an-node01 fenced[2280]: fencing node an-node02.alteeve.com
Sep 15 15:36:55 an-node01 fenced[2280]: fence an-node02.alteeve.com dev 0.0 agent fence_ipmilan result: error from agent
Sep 15 15:36:55 an-node01 fenced[2280]: fence an-node02.alteeve.com success
</source>


Woot!
 
We can now safely say that our fencing is set up and working properly.
 
== Testing Network Redundancy ==
 
Next up on the testing block is our network configuration. Seeing as we've built our bonds, we now need to test that they are working properly.
 
To run this test, we're going to do the following;
 
* Make sure that <span class="code">cman</span> has started on both nodes.
* On both nodes, start a ping flood against the opposing node in the first window (as shown below) and start <span class="code">tail</span>ing syslog in the second window.
* Look at the current state of the bonds to see which interfaces are active.
* Pull the power on the switch those interfaces are using. If the interfaces are spread across both switches, don't worry. Pick one and we will kill it again later.
* Check the state of the bonds again and see that they've switched to their backup links. If a node gets fenced, you know something went wrong.
* Wait about a minute, then restore power to the lost switch. Wait a good five minutes to ensure that it is in fact back up and that the network was not interrupted.
* Repeat the power off/on cycle for the second switch.
* If the initial state had the bonds spread across both switches, repeat the power off/on for the first switch.
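
For reference, the ping flood mentioned above is a single call. Run it as root and point it at the peer's back-channel address for the node you are on; <span class="code">10.20.0.2</span> is <span class="code">an-node02</span>'s back-channel IP in this tutorial, so adjust as needed.

<source lang="bash">
# From an-node01, flood-ping an-node02's back-channel address. Press ctrl+c to stop.
ping -f 10.20.0.2
</source>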
 
If all of these steps pass and the cluster doesn't partition, then you can be confident that your network is configured properly for full redundancy.


=== How to Know if the Tests Passed ===


Well, the most obvious answer to this question is whether the cluster is still working after a switch is powered off.


We can be a little more subtle than that though.


The state of each bond is viewable by looking in the special <span class="code">/proc/net/bonding/bondX</span> files, where <span class="code">X</span> is the bond number. Let's take a look at <span class="code">bond0</span> on <span class="code">an-node01</span>.


<source lang="bash">
cat /proc/net/bonding/bond0
</source>
<source lang="text">
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0


Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:e0:81:c7:ec:49

Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1b:21:9d:59:fc
</source>
We can see that the currently active interface is <span class="code">eth0</span>. This is the key bit we're going to be watching for in these tests. I know that <span class="code">eth0</span> on <span class="code">an-node01</span> is connected to the first switch, so when I pull the power to that switch, I should see <span class="code">eth3</span> take over.

We'll also be watching syslog. If things work right, we should not see any messages from the cluster during failure and recovery.

If you have the screen space for it, I'd recommend opening six more terminal windows, one for each bond. Run <span class="code">watch cat /proc/net/bonding/bondX</span> so that you can quickly see any change in the bond states. Below is an example of the layout I use for this test.
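
Each of those windows simply runs <span class="code">watch</span> against one bond's status file; for example, for <span class="code">bond0</span>:

<source lang="bash">
watch cat /proc/net/bonding/bond0
</source>
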
[[Image:2-node_el6-tutorial_network-test_01.png|thumb|center|700px|Terminal layout used for monitoring the bonded link status during HA network testing. The right window shows two columns of terminals, <span class="code">an-node01</span> on the left and <span class="code">an-node02</span> on the right, stacked into three rows, <span class="code">bond0</span> on the top, <span class="code">bond1</span> in the middle and <span class="code">bond2</span> at the bottom. The left window shows the standard <span class="code">tail</span> on syslog.]]
=== Failing The First Switch ===

In my case, all of the bonds on both nodes are using their first links as the currently active links. This means that all network traffic is going through the first switch, so I will power down that switch first. You need to sort out which switch to shut down first in your environment. If your network traffic is spread over both switches, just pick one to start with.

{{note|1=Make sure that <span class="code">cman</span> is running before beginning the test!}}

After killing the switch, I see the following messages in syslog:
<source lang="text">
Sep 16 13:12:46 an-node01 kernel: e1000e: eth2 NIC Link is Down
Sep 16 13:12:46 an-node01 kernel: e1000e: eth0 NIC Link is Down
Sep 16 13:12:46 an-node01 kernel: e1000e: eth1 NIC Link is Down
Sep 16 13:12:46 an-node01 kernel: bonding: bond1: link status definitely down for interface eth1, disabling it
Sep 16 13:12:46 an-node01 kernel: bonding: bond1: making interface eth4 the new active one.
Sep 16 13:12:46 an-node01 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Sep 16 13:12:46 an-node01 kernel: bonding: bond0: making interface eth3 the new active one.
Sep 16 13:12:46 an-node01 kernel: device eth0 left promiscuous mode
Sep 16 13:12:46 an-node01 kernel: device eth3 entered promiscuous mode
Sep 16 13:12:46 an-node01 kernel: bonding: bond2: link status definitely down for interface eth2, disabling it
Sep 16 13:12:46 an-node01 kernel: bonding: bond2: making interface eth5 the new active one.
Sep 16 13:12:46 an-node01 kernel: device eth2 left promiscuous mode
Sep 16 13:12:46 an-node01 kernel: device eth5 entered promiscuous mode
</source>
I can look at <span class="code">an-node01</span>'s <span class="code">/proc/net/bonding/bond0</span> file and see:
<source lang="bash">
cat /proc/net/bonding/bond0
</source>
<source lang="text">
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth3
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
 
Slave Interface: eth0
MII Status: down
Link Failure Count: 1
Permanent HW addr: 00:e0:81:c7:ec:49
 
Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1b:21:9d:59:fc
</source>


Notice <span class="code">Currently Active Slave</span> is now <span class="code">eth3</span>? You can also see now that <span class="code">eth0</span>'s link is down (<span class="code">MII Status: down</span>).


It's the same story for all of the other bonds on both nodes.
 
If we check the status of the cluster, we'll see that all is good.


<source lang="bash">
<source lang="bash">
cman_tool status
</source>
<source lang="text">
Version: 6.2.0
Config Version: 8
Cluster Name: an-clusterA
Cluster Id: 29382
Cluster Member: Yes
Cluster Generation: 72
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Node votes: 1
Quorum: 1 
Active subsystems: 7
Flags: 2node
Ports Bound: 0 
Node name: an-node01.alteeve.com
Node ID: 1
Multicast addresses: 239.192.114.57
Node addresses: 10.20.0.1
</source>
 
How cool is that?!
 
=== Restoring The First Switch ===
 
Now that we've confirmed all of the bonds are working on the backup switch, let's restore power to the first switch.
 
{{warning|1=Be sure to wait a solid five minutes after restoring power before declaring the recovery a success!}}
 
It is very important to wait for a while after restoring power to the switch. Some of the common problems that can break your cluster will not show up immediately. A good example is a misconfiguration of [[STP]]. In this case, the switch will come up, a short time will pass and then the switch will trigger an STP reconfiguration. Once this happens, both switches will block traffic for many seconds. This will partition your cluster.

So then, let's power it back up.
 
Within a few moments, you should see this in your syslog;
 
<source lang="text">
Sep 16 13:23:57 an-node01 kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:23:57 an-node01 kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:23:57 an-node01 kernel: bonding: bond0: link status definitely up for interface eth0.
Sep 16 13:23:57 an-node01 kernel: bonding: bond1: link status definitely up for interface eth1.
Sep 16 13:23:58 an-node01 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:23:58 an-node01 kernel: bonding: bond2: link status definitely up for interface eth2.
</source>
It looks like it's up, but let's keep waiting for another minute (note the time stamps).
<source lang="text">
Sep 16 13:24:52 an-node01 kernel: e1000e: eth0 NIC Link is Down
Sep 16 13:24:52 an-node01 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Sep 16 13:24:53 an-node01 kernel: e1000e: eth1 NIC Link is Down
Sep 16 13:24:53 an-node01 kernel: bonding: bond1: link status definitely down for interface eth1, disabling it
Sep 16 13:24:54 an-node01 kernel: e1000e: eth2 NIC Link is Down
Sep 16 13:24:54 an-node01 kernel: bonding: bond2: link status definitely down for interface eth2, disabling it
Sep 16 13:24:55 an-node01 kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:24:55 an-node01 kernel: bonding: bond0: link status definitely up for interface eth0.
Sep 16 13:24:56 an-node01 kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:24:56 an-node01 kernel: bonding: bond1: link status definitely up for interface eth1.
Sep 16 13:24:57 an-node01 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:24:57 an-node01 kernel: bonding: bond2: link status definitely up for interface eth2.
Sep 16 13:24:58 an-node01 kernel: e1000e: eth0 NIC Link is Down
Sep 16 13:24:58 an-node01 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Sep 16 13:24:59 an-node01 kernel: e1000e: eth1 NIC Link is Down
Sep 16 13:24:59 an-node01 kernel: bonding: bond1: link status definitely down for interface eth1, disabling it
Sep 16 13:25:00 an-node01 kernel: e1000e: eth2 NIC Link is Down
Sep 16 13:25:00 an-node01 kernel: bonding: bond2: link status definitely down for interface eth2, disabling it
Sep 16 13:25:00 an-node01 kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:25:00 an-node01 kernel: bonding: bond0: link status definitely up for interface eth0.
Sep 16 13:25:02 an-node01 kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:25:02 an-node01 kernel: bonding: bond1: link status definitely up for interface eth1.
Sep 16 13:25:02 an-node01 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:25:02 an-node01 kernel: bonding: bond2: link status definitely up for interface eth2.
</source>


See all that bouncing? That is caused by many switches showing a link (that is the [[MII]] status) without actually being able to push traffic. As part of the switch's boot sequence, the links will go down and come back up a couple of times.
 
This is partly why the <span class="code">updelay</span> option exists in the <span class="code">BONDING_OPTS</span>, but it is also why we don't set <span class="code">primary</span> or enable <span class="code">primary_reselect</span>. Every time the active slave interface changes, there is a small chance of a problem that could break the cluster.
 
You will notice after several minutes that the backup slave interfaces are still in use, despite the first switch being back online. This is just fine. We can check the cluster status again and we'll see that everything is still fine. The recovery test passed!
 
=== Failing The Second Switch ===
 
For the same reason that we need to test all fence devices from both nodes, we also need to test the failure and recovery of both switches. So now let's pull the plug on the second switch!
 
As before, we'll see messages showing the interfaces dropping.
 
<source lang="text">
Sep 16 13:35:36 an-node01 kernel: e1000e: eth3 NIC Link is Down
Sep 16 13:35:36 an-node01 kernel: bonding: bond0: link status definitely down for interface eth3, disabling it
Sep 16 13:35:36 an-node01 kernel: bonding: bond0: making interface eth0 the new active one.
Sep 16 13:35:36 an-node01 kernel: device eth3 left promiscuous mode
Sep 16 13:35:36 an-node01 kernel: device eth0 entered promiscuous mode
Sep 16 13:35:38 an-node01 kernel: e1000e: eth5 NIC Link is Down
Sep 16 13:35:38 an-node01 kernel: bonding: bond2: link status definitely down for interface eth5, disabling it
Sep 16 13:35:38 an-node01 kernel: bonding: bond2: making interface eth2 the new active one.
Sep 16 13:35:38 an-node01 kernel: device eth5 left promiscuous mode
Sep 16 13:35:38 an-node01 kernel: device eth2 entered promiscuous mode
Sep 16 13:35:39 an-node01 kernel: e1000e: eth4 NIC Link is Down
Sep 16 13:35:39 an-node01 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Sep 16 13:35:39 an-node01 kernel: bonding: bond1: making interface eth1 the new active one.
</source>
 
Let's take a look at <span class="code">an-node01</span>'s <span class="code">bond0</span> again.


<source lang="bash">
cat /proc/net/bonding/bond0
</source>
<source lang="text">
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
 
Slave Interface: eth0
MII Status: up
Link Failure Count: 3
Permanent HW addr: 00:e0:81:c7:ec:49
 
Slave Interface: eth3
MII Status: down
Link Failure Count: 1
Permanent HW addr: 00:1b:21:9d:59:fc
</source>


We can see that <span class="code">eth0</span> has returned as the active slave and that <span class="code">eth3</span> is now down. Again, it's the same story across the other bonds, and <span class="code">cman_tool status</span> shows that all is right in the world.
 
=== Restoring The Second Switch ===
 
Again, we're going to wait a good five minutes after restoring power before calling this test a success.
 
Checking out <span class="code">an-node01</span>'s syslog, we see the links came back and didn't bounce. These are two identical switches and should behave the same but didn't. This is a good example of why you need to test '''everything''', even when you have identical hardware. You just can't guess how things will behave until you test and see for yourself.
 
<source lang="text">
Sep 16 13:53:54 an-node01 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:53:54 an-node01 kernel: e1000e: eth5 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:53:54 an-node01 kernel: bonding: bond0: link status definitely up for interface eth3.
Sep 16 13:53:54 an-node01 kernel: bonding: bond2: link status definitely up for interface eth5.
Sep 16 13:53:55 an-node01 kernel: e1000e: eth4 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:53:55 an-node01 kernel: bonding: bond1: link status definitely up for interface eth4.
</source>
 
Now we're done! We can truly say we've got a full high-availability network configuration that is tested and trusted!
 
= Installing DRBD =
 
DRBD is an open-source application for real-time, block-level disk replication created and maintained by [http://linbit.com Linbit]. We will use this to keep the data on our cluster consistent between the two nodes.
 
To install it, we have two choices;
* Install from source files.
* Install from [http://elrepo.org/tiki/tiki-index.php ELRepo].
 
Installing from source ensures that you have full control over the installed software. However, you become solely responsible for installing future patches and bugfixes.
 
Installing from ELRepo means ceding some control to the ELRepo maintainers, but it also means that future patches and bugfixes are applied as part of a standard update.
 
Which you choose is, ultimately, a decision you need to make.
 
== Option A - Install From Source ==
 
On '''Both''' nodes run:
 
<source lang="bash">
# Obliterate peer - fence via cman
wget -c https://alteeve.com/files/an-cluster/sbin/obliterate-peer.sh -O /sbin/obliterate-peer.sh
chmod a+x /sbin/obliterate-peer.sh
ls -lah /sbin/obliterate-peer.sh
 
# Download, compile and install DRBD
wget -c http://oss.linbit.com/drbd/8.3/drbd-8.3.11.tar.gz
tar -xvzf drbd-8.3.11.tar.gz
cd drbd-8.3.11
./configure \
  --prefix=/usr \
  --localstatedir=/var \
  --sysconfdir=/etc \
  --with-utils \
  --with-km \
  --with-udev \
  --with-pacemaker \
  --with-rgmanager \
  --with-bashcompletion
make
make install
chkconfig --add drbd
chkconfig drbd off
</source>
 
== Option B - Install From ELRepo ==
 
On '''Both''' nodes run:
 
<source lang="bash">
# Obliterate peer - fence via cman
wget -c https://alteeve.com/files/an-cluster/sbin/obliterate-peer.sh -O /sbin/obliterate-peer.sh
chmod a+x /sbin/obliterate-peer.sh
ls -lah /sbin/obliterate-peer.sh
 
# Install the ELRepo GPG key, add the repo and install DRBD.
rpm --import http://elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://elrepo.org/elrepo-release-6-4.el6.elrepo.noarch.rpm
yum install drbd83-utils kmod-drbd83
</source>
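
Whichever option you chose, it doesn't hurt to confirm that the userland tools and the DRBD kernel module actually landed before moving on. This is only a sanity check; the paths and version strings you see will vary with your install method. If <span class="code">modinfo</span> cannot find the module immediately after a source build, run <span class="code">depmod -a</span> and check again.

<source lang="bash">
# Confirm the DRBD admin tool is in the path and the kernel module is available.
which drbdadm
modinfo drbd | grep -i ^version
</source>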
 
== Creating The DRBD Partitions ==
 
It is possible to use [[LVM]] on the hosts, and simply create [[LV]]s to back our DRBD resources. However, this causes confusion as LVM will see the [[PV]] signatures on both the DRBD backing devices and the DRBD device itself. Getting around this requires editing LVM's <span class="code">filter</span> option, which is somewhat complicated. Not overly so, mind you, but enough to be outside the scope of this document.
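
For the curious, the <span class="code">filter</span> change being referred to is a one-line edit in <span class="code">/etc/lvm/lvm.conf</span> telling LVM to only scan the DRBD devices. The line below is purely illustrative and is not part of this tutorial's configuration; we sidestep the issue entirely by not putting LVM underneath DRBD.

<source lang="text">
# Illustrative only; accept DRBD devices and reject everything else.
filter = [ "a|^/dev/drbd|", "r/.*/" ]
</source>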
 
Also, working with <span class="code">fdisk</span> directly gives us a chance to make sure that the DRBD partitions start on an even 64 [[KiB]] boundary. This is important for decent performance on Windows VMs, as we will see later. This is true for both traditional platter and modern solid-state drives.
 
On our nodes, we created three primary disk partitions;
* <span class="code">/dev/sda1</span>; The <span class="code">/boot</span> partition.
* <span class="code">/dev/sda2</span>; The root <span class="code">/</span> partition.
* <span class="code">/dev/sda3</span>; The swap partition.
 
We will create a new extended partition. Then within it we will create three new partitions;
* <span class="code">/dev/sda5</span>; a small partition we will later use for our shared [[GFS2]] partition.
* <span class="code">/dev/sda6</span>; a partition big enough to host the VMs that will normally run on <span class="code">an-node01</span>.
* <span class="code">/dev/sda7</span>; a partition big enough to host the VMs that will normally run on <span class="code">an-node02</span>.
 
As we create each partition, we will do a little math to ensure that the start sector is on a 64 [[KiB]] boundary.
 
=== Alignment Math ===
 
Before we can start the alignment math, we need to know how big each sector is on our hard drive. This is almost always 512 [[bytes]], but it's still best to be sure. To check, run;


<source lang="bash">
fdisk -l /dev/sda | grep Sector
</source>
<source lang="text">
Sector size (logical/physical): 512 bytes / 512 bytes
</source>
So now that we have confirmed our sector size, we can look at the math.
* Each 64 [[KiB]] block will use 128 sectors <span class="code">((64 * 1024) / 512) == 128</span>.
* As we create each partition, we will be asked to enter the starting sector (using <span class="code">fdisk -u</span>). Take the first free sector and divide it by <span class="code">128</span>. If it does not divide evenly, then;
** Add <span class="code">127</span> (one sector shy of another block) to guarantee we've gone past the start sector we want.
** Divide the new number by <span class="code">128</span>. This will give you a fractional number. Remove (do not round!) everything after the decimal place.
** Multiply by <span class="code">128</span> to get the sector number we want.
Let's look at an example using real numbers. Let's say we create a new partition and the first free sector is <span class="code">92807568</span>;
<source lang="text">
92807568 ÷ 128 = 725059.125
</source>


We have a remainder, so it's not on an even 64 KiB block boundary. Now we need to figure out what sector above <span class="code">92807568</span> is evenly divisible by 128. To do that, let's add 127 (one sector shy of the next 64 KiB block), divide by 128 to get the number of 64 KiB blocks (with a remainder), drop everything after the decimal place to get a bare integer (do not round), then finally multiply by 128 to get the sector number. This will give us the sector number we want our partition to start on.
 
<source lang="text">
92807568 + 127 = 92807695
</source>
<source lang="text">
92807695 ÷ 128 = 725060.1171875
</source>
<source lang="text">
int(725060.1171875) = 725060
</source>
<source lang="text">
725060 x 128 = 92807680
</source>
 
So now we know that sector number <span class="code">92807680</span> is the first sector above <span class="code">92807568</span> that falls on an even 64 KiB block. Now we need to alter our partition's starting sector. To do this, we will need to go into <span class="code">fdisk</span>'s extra functions.
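
If you would rather let the shell do this arithmetic for you, the same round-up can be done in one line of bash. This is just a convenience; the number below is the example start sector from above.

<source lang="bash">
# Round the example start sector (92807568) up to the next 128-sector (64 KiB) boundary.
echo $(( ((92807568 + 127) / 128) * 128 ))
# Prints: 92807680
</source>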
 
{{note|1=Pay attention to the last sector number of each partition you create. As you create partitions, <span class="code">fdisk</span> will see free space, as tiny as it is, and it will default to that as the first sector for the next partition. This is annoying. By noting the last sector of each partition you create, you can add 1 sector and do the math to find the first sector above that which sits on a 64 KiB boundary.}}
 
=== Creating the Three Partitions ===
 
Here I will show you the values I entered to create the three partitions I needed on my nodes.
 
'''DO NOT COPY THIS!'''
 
The values you enter will almost certainly be different.
 
Start <span class="code">fdisk</span> in sector mode on <span class="code">/dev/sda</span>.


{{note|1=If you are using software [[RAID]], you will need to do the following steps on all disks, then you can proceed to create the RAID partitions normally and they will be aligned.}}


<source lang="bash">
fdisk -u /dev/sda
</source>
<source lang="text">
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
         switch off the mode (command 'c').
</source>

Disable DOS compatibility because hey, it's not the 80s any more.

<source lang="text">
Command (m for help): c
</source>
<source lang="text">
DOS Compatibility flag is not set
</source>

Let's take a look at the current partition layout.

<source lang="text">
Command (m for help): p
</source>
<source lang="text">
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00056856

  Device Boot      Start        End      Blocks  Id  System
/dev/sda1  *        2048      526335      262144  83  Linux
/dev/sda2          526336    84412415    41943040  83  Linux
/dev/sda3        84412416    92801023    4194304  82  Linux swap / Solaris
</source>

Perfect. Now let's create a new extended partition that will use the rest of the disk. We don't care if this is aligned so we'll just accept the default start and end sectors.

<source lang="text">
Command (m for help): n
</source>
<source lang="text">
Command action
  e  extended
  p  primary partition (1-4)
</source>
<source lang="text">
e
</source>
<source lang="text">
Selected partition 4
First sector (92801024-976773167, default 92801024):
</source>


Just press <span class="code"><enter></span>.

<source lang="text">
Using default value 92801024
Last sector, +sectors or +size{K,M,G} (92801024-976773167, default 976773167):
</source>

Just press <span class="code"><enter></span> again.

<source lang="text">
Using default value 976773167
</source>

Now we'll create the first partition. This will be a 20GB partition used by the shared [[GFS2]] partition. As it will never host a VM, I don't care if it is aligned.

<source lang="text">
Command (m for help): n
</source>
<source lang="text">
First sector (92803072-976773167, default 92803072):
</source>

Just press <span class="code"><enter></span>.

<source lang="text">
Using default value 92803072
</source>
<source lang="text">
Last sector, +sectors or +size{K,M,G} (92803072-976773167, default 976773167): +20G
</source>

Now we will create the last two partitions that will host our VMs. I want to split the remaining space in half, so I need to do a little bit more math before I can proceed. I will need to see how many sectors are still free, divide that by two to get the number of sectors in half the remaining free space, then add the number of already-used sectors so that I know where the first of these two partitions should end. We'll do this math in just a moment.

So let's print the current partition layout:

<source lang="text">
Command (m for help): p
</source>
<source lang="text">
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00056856

  Device Boot      Start        End      Blocks  Id  System
/dev/sda1  *        2048      526335      262144  83  Linux
/dev/sda2          526336    84412415    41943040  83  Linux
/dev/sda3        84412416    92801023    4194304  82  Linux swap / Solaris
/dev/sda4        92801024  976773167  441986072    5  Extended
/dev/sda5        92803072  134746111    20971520  83  Linux
</source>


Start to create the new partition. Before we can sort out the last sector, we first need to find the first sector.

<source lang="text">
Command (m for help): n
</source>
<source lang="text">
First sector (134748160-976773167, default 134748160):
</source>

Now I see that the first free sector is <span class="code">134748160</span>. I divide this by <span class="code">128</span> and I get <span class="code">1052720</span>. It is an even number, so I don't need to do anything more; it is already on a 64 [[KiB]] boundary! So I can just press <span class="code"><enter></span> to accept it.

<source lang="text">
Using default value 134748160
Last sector, +sectors or +size{K,M,G} (134748160-976773167, default 976773167):
</source>

Now we need to do the math to find what sector marks half of the remaining free space. Let's gather some numbers;

* This partition started at sector <span class="code">134748160</span>
* The default end sector is <span class="code">976773167</span>
* That means that there are currently <span class="code">(976773167 - 134748160) == 842025007</span> sectors free.
* Half of that is <span class="code">(842025007 / 2) == int(421012503.5) == 421012503</span> sectors (<span class="code">int()</span> simply means to drop the fraction).
* So if we want a partition that is <span class="code">421012503</span> sectors long, we need to add the start sector to get our offset. That is, <span class="code">(421012503 + 134748160) == 555760663</span>. This is what we will enter now.


<source lang="text">
Last sector, +sectors or +size{K,M,G} (134748160-976773167, default 976773167): 555760663
</source>

Now to create the last partition, we will repeat the steps above.

<source lang="text">
Command (m for help): n
</source>
<source lang="text">
First sector (555762712-976773167, default 555762712):
</source>

Let's make sure that <span class="code">555762712</span> is on a 64 KiB boundary;
* <span class="code">(555762712 / 128) == 4341896.1875</span> is not an even number, so we need to find the next sector on an even boundary.
* Add <span class="code">127</span> sectors and divide by 128 again;
** <span class="code">(555762712 + 127) == 555762839</span>
** <span class="code">(555762839 / 128) == int(4341897.1796875) == 4341897</span>
** <span class="code">(4341897 * 128) == 555762816</span>
* Now we know that we want our start sector to be <span class="code">555762816</span>.
 
<source lang="text">
First sector (555762712-976773167, default 555762712): 555762816
</source>
<source lang="text">
Last sector, +sectors or +size{K,M,G} (555762816-976773167, default 976773167):
</source>
 
This is the last partition, so we can just press <span class="code"><enter></span> to get the last sector on the disk.
 
<source lang="text">
Using default value 976773167
</source>
 
Let's take a final look at the new partition layout before committing the changes to disk.
 
<source lang="text">
Command (m for help): p
</source>
<source lang="text">
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00056856
 
  Device Boot      Start        End      Blocks  Id  System
/dev/sda1  *        2048      526335      262144  83  Linux
/dev/sda2          526336    84412415    41943040  83  Linux
/dev/sda3        84412416    92801023    4194304  82  Linux swap / Solaris
/dev/sda4        92801024  976773167  441986072    5 Extended
/dev/sda5        92803072  134746111    20971520  83  Linux
/dev/sda6      134748160  555760663  210506252  83  Linux
/dev/sda7      555762816  976773167  210505176  83  Linux
</source>
 
Perfect. If you divide partition six or seven's start sector by <span class="code">128</span>, you will see that both have no remainder, which means that they are, in fact, aligned. This is the last time we need to worry about alignment, because LVM uses an even multiple of 64 [[KiB]] for its [[extent]] sizes, so all normal extent sizes will always produce [[LV]]s on even 64 KiB boundaries.
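
If you want to double-check this from the shell, the modulo is enough; a result of <span class="code">0</span> means the start sector sits on a 64 KiB boundary.

<source lang="bash">
# Both print 0, confirming /dev/sda6 and /dev/sda7 start on 64 KiB boundaries.
echo $(( 134748160 % 128 ))
echo $(( 555762816 % 128 ))
</source>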
 
So now write out the changes, re-probe the disk (or reboot) and then repeat all these steps on the other node.


<source lang="text">
Command (m for help): w
</source>
<source lang="text">
The partition table has been altered!

Calling ioctl() to re-read partition table.

WARNING: Re-reading the partition table failed with error 16: Device or resource busy.
The kernel still uses the old table. The new table will be used at
the next reboot or after you run partprobe(8) or kpartx(8)
Syncing disks.
</source>

Now re-probe the disk using <span class="code">partprobe</span>.


<source lang="bash">
partprobe /dev/sda
</source>
<source lang="text">
Warning: WARNING: the kernel failed to re-read the partition table on /dev/sda
(Device or resource busy).  As a result, it may not reflect all of your changes
until after reboot.
</source>
In my case, the probe failed so I will reboot. To do this most safely, stop the cluster before calling <span class="code">reboot</span>.
<source lang="bash">
/etc/init.d/cman stop
</source>
<source lang="text">
Stopping cluster:  
  Leaving fence domain...                                [  OK  ]
  Stopping gfs_controld...                                [  OK  ]
  Stopping dlm_controld...                                [  OK  ]
  Stopping fenced...                                      [  OK  ]
  Stopping cman...                                        [  OK  ]
  Waiting for corosync to shutdown:                      [  OK  ]
  Unloading kernel modules...                            [  OK  ]
  Unmounting configfs...                                  [  OK  ]
</source>


Now reboot.


<source lang="bash">
reboot
</source>
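Once the node is back up, a quick way to confirm that the kernel now sees the new partitions is to list them:

<source lang="bash">
# The new sda6 and sda7 entries should show up in this list after the reboot.
cat /proc/partitions
</source>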
 
== Configuring DRBD ==
 
DRBD is configured in two parts;
 
* Global and common configuration options
* Resource configurations
 
We will be creating three separate DRBD resources, so we will create three separate resource configuration files. More on that in a moment.


=== Configuring DRBD Global and Common Options ===
 
The first file to edit is <span class="code">/etc/drbd.d/global_common.conf</span>. In this file, we will set global configuration options and set default resource configuration options. These default resource options can be overwritten in the actual resource files which we'll create once we're done here.


<source lang="bash">
cp /etc/drbd.d/global_common.conf /etc/drbd.d/global_common.conf.orig
vim /etc/drbd.d/global_common.conf
diff -u  /etc/drbd.d/global_common.conf.orig /etc/drbd.d/global_common.conf
</source>
<source lang="diff">
--- /etc/drbd.d/global_common.conf.orig 2011-09-14 14:03:56.364566109 -0400
+++ /etc/drbd.d/global_common.conf 2011-09-14 14:23:37.287566400 -0400
@@ -15,24 +15,81 @@
# out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
# before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
# after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
+
+ # This script is a wrapper for RHCS's 'fence_node' command line
+ # tool. It will call a fence against the other node and return
+ # the appropriate exit code to DRBD.
+ fence-peer "/sbin/obliterate-peer.sh";
}
startup {
# wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
+
+ # This tells DRBD to promote both nodes to Primary on start.
+ become-primary-on both;
+
+ # This tells DRBD to wait five minutes for the other node to
+ # connect. This should be longer than it takes for cman to
+ # timeout and fence the other node *plus* the amount of time it
+ # takes the other node to reboot. If you set this too short,
+ # you could corrupt your data. If you want to be extra safe, do
+ # not use this at all and DRBD will wait for the other node
+ # forever.
+ wfc-timeout 300;
+
+ # This tells DRBD to wait for the other node for three minutes
+ # if the other node was degraded the last time it was seen by
+ # this node. This is a way to speed up the boot process when
+ # the other node is out of commission for an extended duration.
+ degr-wfc-timeout 120;
}
disk {
# on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
# no-disk-drain no-md-flushes max-bio-bvecs
+
+ # This tells DRBD to block IO and fence the remote node (using
+ # the 'fence-peer' helper) when connection with the other node
+ # is unexpectedly lost. This is what helps prevent split-brain
+ # condition and it is incredibly important in dual-primary
+ # setups!
+ fencing resource-and-stonith;
}
net {
# sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
# max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
# after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
+
+ # This tells DRBD to allow two nodes to be Primary at the same
+ # time. It is needed when 'become-primary-on both' is set.
+ allow-two-primaries;
+
+ # The following three commands tell DRBD how to react should
+ # our best efforts fail and a split brain occurs. You can learn
+ # more about these options by reading the drbd.conf man page.
+ # NOTE! It is not possible to safely recover from a split brain
+ # where both nodes were primary. This case requires human
+ # intervention, so 'disconnect' is the only safe policy.
+ after-sb-0pri discard-zero-changes;
+ after-sb-1pri discard-secondary;
+ after-sb-2pri disconnect;
}
syncer {
# rate after al-extents use-rle cpu-mask verify-alg csums-alg
+
+ # This alters DRBD's default syncer rate. Note that it is
+ # *very* important that you do *not* configure the syncer rate
+ # to be too fast. If it is too fast, it can significantly
+ # impact applications using the DRBD resource. If it's set to a
+ # rate higher than the underlying network and storage can
+ # handle, the sync can stall completely.
+ # This should be set to ~30% of the *tested* sustainable read
+ # or write speed of the raw /dev/drbdX device (whichever is
+ # slower). In this example, the underlying resource was tested
+ # as being able to sustain roughly 60 MB/sec, so this is set to
+ # one third of that rate, 20M.
+ rate 20M;
}
}
</source>
=== Configuring the DRBD Resources ===
As mentioned earlier, we are going to create three DRBD resources.
* Resource <span class="code">r0</span>, which will be device <span class="code">/dev/drbd0</span>, will be the shared GFS2 partition.
* Resource <span class="code">r1</span>, which will be device <span class="code">/dev/drbd1</span>, will provide disk space for VMs that will normally run on <span class="code">an-node01</span>.
* Resource <span class="code">r2</span>, which will be device <span class="code">/dev/drbd2</span>, will provide disk space for VMs that will normally run on <span class="code">an-node02</span>.
{{note|1=The reason for the two separate VM resources is to help protect against data loss in the off chance that a [[split-brain]] occurs, despite our counter-measures. As we will see later, recovering from a split brain requires discarding the changes on one side of the resource. If VMs are running on the same resource but on different nodes, this would lead to data loss. Using two resources helps prevent that scenario.}}
Each resource configuration will be in its own file saved as <span class="code">/etc/drbd.d/rX.res</span>. The three of them will be pretty much the same. So let's take a look at the first GFS2 resource <span class="code">r0.res</span>, then we'll just look at the changes for <span class="code">r1.res</span> and <span class="code">r2.res</span>. These files won't exist initially.
<source lang="bash">
vim /etc/drbd.d/r0.res
</source>
<source lang="text">
# This is the resource used for the shared GFS2 partition.
resource r0 {
# This is the block device path.
        device          /dev/drbd0;


# We'll use the normal internal metadisk (takes about 32MB/TB)
        meta-disk      internal;

# This is the `uname -n` of the first node
        on an-node01.alteeve.com {
# The 'address' has to be the IP, not a hostname. This is the
# node's SN (bond1) IP. The port number must be unique among
# resources.
                address        10.10.0.1:7789;

# This is the block device backing this resource on this node.
                disk            /dev/sda5;
        }
# Now the same information again for the second node.
        on an-node02.alteeve.com {
                address        10.10.0.2:7789;
                disk            /dev/sda5;
        }
}
</source>


Now copy this to <span class="code">r1.res</span> and edit for the <span class="code">an-node01</span> VM resource. The main differences are the resource name, <span class="code">r1</span>, the block device, <span class="code">/dev/drbd1</span>, the port, <span class="code">7790</span>, and the backing block device, <span class="code">/dev/sda6</span>.


<source lang="bash">
cp /etc/drbd.d/r0.res /etc/drbd.d/r1.res
vim /etc/drbd.d/r1.res
</source>
<source lang="text">
# This is the resource used for VMs that will normally run on an-node01.
resource r1 {
# This is the block device path.
        device          /dev/drbd1;
 
# We'll use the normal internal metadisk (takes about 32MB/TB)
        meta-disk      internal;
 
# This is the `uname -n` of the first node
        on an-node01.alteeve.com {
# The 'address' has to be the IP, not a hostname. This is the
# node's SN (bond1) IP. The port number must be unique among
# resources.
                address        10.10.0.1:7790;
 
# This is the block device backing this resource on this node.
                disk            /dev/sda6;
        }
# Now the same information again for the second node.
        on an-node02.alteeve.com {
                address        10.10.0.2:7790;
                disk            /dev/sda6;
        }
}
</source>
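If you want to confirm that only the expected fields differ between the two resource files, an optional check once both exist:

<source lang="bash">
# Only the resource name, device, port and backing partition should differ.
diff -u /etc/drbd.d/r0.res /etc/drbd.d/r1.res
</source>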
The last resource is again the same, with the same set of changes.
<source lang="bash">
cp /etc/drbd.d/r1.res /etc/drbd.d/r2.res
vim /etc/drbd.d/r2.res
</source>
<source lang="text">
# This is the resource used for VMs that will normally run on an-node02.
resource r2 {
# This is the block device path.
        device          /dev/drbd2;

# We'll use the normal internal metadisk (takes about 32MB/TB)
        meta-disk      internal;

# This is the `uname -n` of the first node
        on an-node01.alteeve.com {
# The 'address' has to be the IP, not a hostname. This is the
# node's SN (bond1) IP. The port number must be unique among
# resources.
                address        10.10.0.1:7791;

# This is the block device backing this resource on this node.
                disk            /dev/sda7;
        }
# Now the same information again for the second node.
        on an-node02.alteeve.com {
                address        10.10.0.2:7791;
                disk            /dev/sda7;
        }
}
</source>


The final step is to validate the configuration. This is done by running the following command;
 
<source lang="bash">
drbdadm dump
</source>
<source lang="text">
# /etc/drbd.conf
common {
    protocol              C;
    net {
        allow-two-primaries;
        after-sb-0pri    discard-zero-changes;
        after-sb-1pri    discard-secondary;
        after-sb-2pri    disconnect;
    }
    disk {
        fencing          resource-and-stonith;
    }
    syncer {
        rate            20M;
    }
    startup {
        wfc-timeout      300;
        degr-wfc-timeout 120;
        become-primary-on both;
    }
    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error  "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        fence-peer      /sbin/obliterate-peer.sh;
    }
}


# resource r0 on an-node01.alteeve.com: not ignored, not stacked
resource r0 {
    on an-node01.alteeve.com {
        device           /dev/drbd0 minor 0;
        disk            /dev/sda5;
        address          ipv4 10.10.0.1:7789;
        meta-disk        internal;
    }
    on an-node02.alteeve.com {
        device          /dev/drbd0 minor 0;
        disk            /dev/sda5;
        address          ipv4 10.10.0.2:7789;
        meta-disk        internal;
    }
}


# resource r1 on an-node01.alteeve.com: not ignored, not stacked
resource r1 {
    on an-node01.alteeve.com {
        device           /dev/drbd1 minor 1;
        disk            /dev/sda6;
        address          ipv4 10.10.0.1:7790;
        meta-disk        internal;
    }
    on an-node02.alteeve.com {
        device          /dev/drbd1 minor 1;
        disk            /dev/sda6;
        address          ipv4 10.10.0.2:7790;
        meta-disk        internal;
    }
}


# resource r2 on an-node01.alteeve.com: not ignored, not stacked
resource r2 {
    on an-node01.alteeve.com {
        device          /dev/drbd2 minor 2;
        disk            /dev/sda7;
        address          ipv4 10.10.0.1:7791;
        meta-disk        internal;
    }
    on an-node02.alteeve.com {
        device          /dev/drbd2 minor 2;
        disk            /dev/sda7;
        address          ipv4 10.10.0.2:7791;
        meta-disk        internal;
    }
}
</source>
 
You'll note that the output is formatted differently, but the values themselves are the same. If there had been errors, you would have seen them printed. Fix any problems before proceeding. Once you get a clean dump, copy the configuration over to the other node.


<source lang="bash">
rsync -av /etc/drbd.d root@an-node02:/etc/
</source>
<source lang="text">
sending incremental file list
drbd.d/
drbd.d/global_common.conf
drbd.d/global_common.conf.orig
drbd.d/r0.res
drbd.d/r1.res
drbd.d/r2.res

sent 7619 bytes  received 129 bytes  15496.00 bytes/sec
total size is 7946  speedup is 1.03
</source>


== Initializing The DRBD Resources ==

Now that we have DRBD configured, we need to initialize the DRBD backing devices and then bring up the resources for the first time.

{{note|1=To save a bit of time and typing, the following sections will use a little <span class="code">bash</span> magic. When commands need to be run on all three resources, rather than running the same command three times with the different resource names, we will use the short-hand form <span class="code">r{0,1,2}</span> or <span class="code">r{0..2}</span>.}}
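If the brace short-hand is new to you, it is plain <span class="code">bash</span> expansion, and you can preview exactly what a command will expand to before running anything destructive:

<source lang="bash">
# Prints the expanded command without running it.
echo drbdadm create-md r{0..2}
# drbdadm create-md r0 r1 r2
</source>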
On '''both''' nodes, create the new metadata on the backing devices. You may need to type <span class="code">yes</span> to confirm the action if any data is seen. If DRBD sees an actual file system, it will error and insist that you clear the partition. You can do this by running; <span class="code">dd if=/dev/zero of=/dev/sdaX bs=4M count=1000</span>, where <span class="code">X</span> is the partition you want to clear.

<source lang="bash">
drbdadm create-md r{0..2}
</source>
<source lang="text">
md_offset 21474832384
al_offset 21474799616
bm_offset 21474144256


Found some data

==> This might destroy existing data! <==

Do you want to proceed?
</source>
<source lang="text">
[need to type 'yes' to confirm] yes
</source>
<source lang="text">
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success
md_offset 215558397952
al_offset 215558365184
bm_offset 215551782912

Found some data

==> This might destroy existing data! <==

Do you want to proceed?
</source>
<source lang="text">
[need to type 'yes' to confirm] yes
</source>
<source lang="text">
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success
md_offset 215557296128
al_offset 215557263360
bm_offset 215550681088

Found some data

==> This might destroy existing data! <==

Do you want to proceed?
</source>
<source lang="text">
[need to type 'yes' to confirm] yes
</source>
<source lang="text">
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success
</source>


Before you go any further, we'll need to load the <span class="code">drbd</span> kernel module. Note that you won't normally need to do this. Later, after we get everything running the first time, we'll be able to start and stop the DRBD resources using the <span class="code">/etc/init.d/drbd</span> script, which loads and unloads the <span class="code">drbd</span> kernel module as needed.
 
<source lang="bash">
modprobe drbd
</source>


Now go back to the terminal windows we had used to watch the cluster start. We now want to watch the output of <span class="code">cat /proc/drbd</span> so we can keep tabs on the current state of the DRBD resources. We'll do this by using the <span class="code">watch</span> program, which will refresh the output of the <span class="code">cat</span> call every couple of seconds.


<source lang="bash">
watch cat /proc/drbd
</source>
<source lang="text">
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by dag@Build64R6, 2011-08-08 08:54:05
</source>


Back in the first terminal, we need to <span class="code">attach</span> the backing device, <span class="code">/dev/sda{5..7}</span> to their respective DRBD resources, <span class="code">r{0..2}</span>. After running the following command, you will see no output on the first terminal, but the second terminal's <span class="code">/proc/drbd</span> should update.


<source lang="bash">
drbdadm attach r{0..2}
</source>
<source lang="text">
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by dag@Build64R6, 2011-08-08 08:54:05
0: cs:StandAlone ro:Secondary/Unknown ds:Inconsistent/DUnknown  r----s
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:20970844
1: cs:StandAlone ro:Secondary/Unknown ds:Inconsistent/DUnknown  r----s
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:210499788
2: cs:StandAlone ro:Secondary/Unknown ds:Inconsistent/DUnknown  r----s
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:210498712
</source>


Take note of the connection state, <span class="code">cs:StandAlone</span>, the current role, <span class="code">ro:Secondary/Unknown</span> and the disk state, <span class="code">ds:Inconsistent/DUnknown</span>. This tells us that our resources are not talking to one another, are not usable because they are in the <span class="code">Secondary</span> state (you can't even read the <span class="code">/dev/drbdX</span> device) and that the backing device does not have an up to date view of the data.
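If you only want those three state fields without the transfer counters, a small optional filter over <span class="code">/proc/drbd</span>:

<source lang="bash">
# Show just the per-resource status lines; cs=connection, ro=role, ds=disk state.
grep -E '^ *[0-9]+:' /proc/drbd
</source>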


This all makes sense of course, as the resources are brand new.


So the next step is to <span class="code">connect</span> the two nodes together. As before, we won't see any output from the first terminal, but the second terminal will change.


{{note|1=After running the following command on the first node, its connection state will become <span class="code">cs:WFConnection</span> which means that it is '''w'''aiting '''f'''or a '''connection''' from the other node.}}


<source lang="bash">
drbdadm connect r{0..2}
</source>
<source lang="text">
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by dag@Build64R6, 2011-08-08 08:54:05
0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:20970844
1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:210499788
2: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:210498712
</source>


We can now see that the two nodes are talking to one another properly as the connection state has changed to <span class="code">cs:Connected</span>. They can see that their peer node is in the same state as they are; <span class="code">Secondary</span>/<span class="code">Inconsistent</span>.


Seeing as the resources are brand new, there is no data to synchronize or save. So we're going to issue a special command that will only ever be used this one time. It will tell DRBD to immediately consider the DRBD resources to be up to date.


On '''one''' node only, run;


<source lang="bash">
drbdadm -- --clear-bitmap new-current-uuid r{0..2}
</source>
 
As before, look to the second terminal to see the new state of affairs.
 
<source lang="text">
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by dag@Build64R6, 2011-08-08 08:54:05
0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
2: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
</source>


Voila!
 
We could promote both sides to <span class="code">Primary</span> by running <span class="code">drbdadm primary r{0..2}</span> on both nodes, but there is no purpose in doing that at this stage as we can safely say our DRBD is ready to go. So instead, let's just stop DRBD entirely. We'll also prevent it from starting on boot as <span class="code">drbd</span> will be managed by the cluster in a later step.
 
On '''both''' nodes run;


<source lang="bash">
chkconfig drbd off
/etc/init.d/drbd stop
</source>
<source lang="text">
Stopping all DRBD resources: .
</source>


The second terminal will start complaining that <span class="code">/proc/drbd</span> no longer exists. This is because the <span class="code">drbd</span> init script unloaded the <span class="code">drbd</span> kernel module.
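You can confirm that the module really was unloaded; no output from the following means <span class="code">drbd</span> is no longer loaded:

<source lang="bash">
lsmod | grep drbd
</source>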


= Configuring Clustered Storage =
 
Before we can provision the first virtual machine, we must first create the storage that will back them. This will take a few steps;
 
* Configuring [[LVM]]'s clustered locking and creating the [[PV]]s, [[VG]]s and [[LV]]s
* Formatting and configuring the shared [[GFS2]] partition.
* Adding storage to the cluster's resource management.
 
== Configuring Clustered LVM Locking ==
 
Before we create the clustered LVM, we need to first make a couple of changes to the LVM configuration.
* We need to filter out the DRBD backing devices so that LVM doesn't see the same signature twice.
* Switch and enforce the locking type from local locking to clustered locking.
 
The configuration option to filter out the DRBD backing device is, unsurprisingly, <span class="code">filter = [ ... ]</span>. By default, it is set to allow everything via the <span class="code">"a/.*/"</span> regular expression. We're only using DRBD in our LVM, so we're going to flip that to reject everything ''except'' DRBD by changing the regex to <span class="code">"a|/dev/drbd*|", "r/.*/"</span>.

For the locking, we're going to change the <span class="code">locking_type</span> from <span class="code">1</span> (local locking) to <span class="code">3</span>, clustered locking. We're also going to disallow fall-back to local locking.
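Once you have made the two edits to <span class="code">/etc/lvm/lvm.conf</span>, a quick way to review them is to pull the relevant lines back out of the file. This is only a sketch; the option names shown are the ones found in a stock EL6 <span class="code">lvm.conf</span>:

<source lang="bash">
# Expect to see the DRBD-only filter and 'locking_type = 3'.
grep -E '^[[:space:]]*(filter|locking_type|fallback_to_local_locking)[[:space:]]*=' /etc/lvm/lvm.conf
</source>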








== Creating The Shared GFS2 Partition ==


On '''both'''
<source lang="bash">
/etc/init.d/drbd start
/etc/init.d/clvmd start
</source>


On '''<span class="code">an-node01</span>'''


<source lang="bash">
pvcreate /dev/drbd0 && \
vgcreate -c y vg0 /dev/drbd0 && \
lvcreate -L 10G -n shared /dev/vg0
 
# Change this to match your cluster name
mkfs.gfs2 -p lock_dlm -j 2 -t ClusterA:shared /dev/vg0/shared
</source>
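The <span class="code">-t</span> argument to <span class="code">mkfs.gfs2</span> must be <span class="code">cluster_name:fs_name</span>, and the cluster name has to match the one set in <span class="code">cluster.conf</span>. If you are not sure what it is, one way to check it (the exact output format may vary slightly between versions):

<source lang="bash">
# The "Cluster Name" line must match the part before the colon in 'mkfs.gfs2 -t'.
cman_tool status | grep -i "cluster name"
</source>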
On '''both'''
<source lang="bash">
mkdir /shared
mount /dev/vg0/shared /shared/
echo `gfs2_edit -p sb /dev/vg0/shared | grep sb_uuid | sed -e "s/.*sb_uuid *\(.*\)/UUID=\L\1\E \/shared\t\tgfs2\trw,suid,dev,exec,nouser,async\t0 0/"` >> /etc/fstab
/etc/init.d/gfs2 status
</source>
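The long <span class="code">gfs2_edit</span>/<span class="code">sed</span> one-liner above simply appends a UUID-based entry to <span class="code">/etc/fstab</span>. If you prefer to build that line by hand, <span class="code">blkid</span> reports the same UUID:

<source lang="bash">
# Print the UUID of the new GFS2 file system, then write the fstab line yourself.
blkid /dev/vg0/shared
</source>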


On '''<span class="code">an-node01</span>'''


<source lang="bash">
mkdir /shared/definitions
mkdir /shared/provision
</source>
 
<source lang="bash">
# Change this to the proper 'vmXXXX-YY'
lvcreate -l 100%free -n vm0001-01 /dev/vg0
</source>




<source lang="bash">
/etc/init.d/gfs2 stop && /etc/init.d/clvmd stop && /etc/init.d/drbd stop
/etc/init.d/rgmanager start && watch clustat
</source>


= Thanks =


''Done!!''


{{footer}}


Take your time, work through these steps, and you will have the foundation cluster sooner than you realize. Clustering is fun because it is a challenge.

Prerequisites

It is assumed that you are familiar with Linux systems administration, specifically Red Hat Enterprise Linux and its derivatives. You will need to have somewhat advanced networking experience as well. You should be comfortable working in a terminal (directly or over ssh). Familiarity with XML will help, but is not strictly required as its use here is pretty self-evident.

If you feel a little out of depth at times, don't hesitate to set this tutorial aside. Branch over to the components you feel the need to study more, then return and continue on. Finally, and perhaps most importantly, you must have patience! If you have a manager asking you to "go live" with a cluster in a month, tell him or her that it simply won't happen. If you rush, you will skip important points and you will fail.

Patience is vastly more important than any pre-existing skill.

Focus and Goal

There is a different cluster for every problem. Generally speaking though, there are two main problems that clusters try to resolve; Performance and High Availability. Performance clusters are generally tailored to the application requiring the performance increase. There are some general tools for performance clustering, like Red Hat's LVS (Linux Virtual Server) for load-balancing common applications like the Apache web-server.

This tutorial will focus on High Availability clustering, often shortened to simply HA and not to be confused with the Linux-HA "heartbeat" cluster suite, which we will not be using here. The cluster will provide a shared file system and will provide high availability for KVM-based virtual servers. The goal will be to have the virtual servers live-migrate during planned node outages and automatically restart on a surviving node when the original host node fails.

Below is a very brief overview;

High Availability clusters like ours have two main parts; Cluster management and resource management.

The cluster itself is responsible for maintaining the cluster nodes in a group. This group is part of a "Closed Process Group", or CPG. When a node fails, the cluster manager must detect the failure, reliably eject the node from the cluster using fencing and then reform the CPG. Each time the cluster changes, or "re-forms", the resource manager is called. The resource manager checks to see how the cluster changed, consults its configuration and determines what to do, if anything.

The details of all this will be discussed in detail a little later on. For now, it's sufficient to have in mind these two major roles and understand that they are somewhat independent entities.

Platform

This tutorial was written using RHEL version 6.1 and CentOS version 6.0 x86_64. No attempt was made to test on i686 or other EL6 derivatives. That said, there is no reason to believe that this tutorial will not apply to any variant. As much as possible, the language will be distro-agnostic. It is advised that you use an x86_64 (64-bit) platform if at all possible.

A Word On Complexity

Introducing the Fabimer Principle:

Clustering is not inherently hard, but it is inherently complex. Consider;

  • Any given program has N bugs.
    • RHCS uses; cman, corosync, dlm, fenced, rgmanager, and many more smaller apps.
    • We will be adding DRBD, GFS2, clvmd, libvirtd and KVM.
    • Right there, we have N^10 possible bugs. We'll call this A.
  • A cluster has Y nodes.
    • In our case, 2 nodes, each with 3 networks across 6 interfaces bonded into pairs.
    • The network infrastructure (Switches, routers, etc). We will be using two managed switches, adding another layer of complexity.
    • This gives us another Y^(2*(3*2))+2, the +2 for managed switches. We'll call this B.
  • Let's add the human factor. Let's say that a person needs roughly 5 years of cluster experience to be considered proficient. For each year less than this, add a Z "oops" factor, (5-Z)^2. We'll call this C.
  • So, finally, add up the complexity, using this tutorial's layout, 0-years of experience and managed switches.
    • (N^10) * (Y^(2*(3*2))+2) * ((5-0)^2) == (A * B * C) == an-unknown-but-big-number.

This isn't meant to scare you away, but it is meant to be a sobering statement. Obviously, those numbers are somewhat artificial, but the point remains.

Any one piece is easy to understand, thus, clustering is inherently easy. However, given the large number of variables, you must really understand all the pieces and how they work together. DO NOT think that you will have this mastered and working in a month. Certainly don't try to sell clusters as a service without a lot of internal testing.

Clustering is kind of like chess. The rules are pretty straight forward, but the complexity can take some time to master.

Overview of Components

When looking at a cluster, there is a tendency to want to dive right into the configuration file. That is not very useful in clustering.

  • When you look at the configuration file, it is quite short.

It isn't like most applications or technologies though. Most of us learn by taking something, like a configuration file, and tweaking it this way and that to see what happens. I tried that with clustering and learned only what it was like to bang my head against the wall.

  • Understanding the parts and how they work together is critical.

You will find that the discussion on the components of clustering, and how those components and concepts interact, will be much longer than the initial configuration. It is true that we could talk very briefly about the actual syntax, but it would be a disservice. Please, don't rush through the next section or, worse, skip it and go right to the configuration. You will waste far more time than you will save.

  • Clustering is easy, but it has a complex web of inter-connectivity. You must grasp this network if you want to be an effective cluster administrator!

Component; cman

This was, traditionally, the cluster manager. In the 3.0 series, which is what all versions of EL6 will use, cman acts mainly as a quorum provider, tallying votes and deciding on a critical property of the cluster: quorum. As of the 3.1 series, which future EL releases will use, cman will be removed entirely.

The cman service is used to start and stop the cluster communication, membership, locking, fencing and other cluster foundation applications.

Component; corosync

Corosync is the heart of the cluster. Almost all other cluster components operate through it.

In Red Hat clusters, corosync is configured via the central cluster.conf file. It can be configured directly in corosync.conf, but given that we will be building an RHCS cluster, we will only use cluster.conf. That said, almost all corosync.conf options are available in cluster.conf. This is important to note as you will see references to both configuration files when searching the Internet.

Corosync sends messages using multicast messaging by default. Recently, unicast support has been added, but due to network latency, it is only recommended for use with small clusters of two to four nodes. We will be using multicast in this tutorial.

A Little History

There were significant changes between RHCS version 2 and version 3, which we are using and which is available on EL6 and recent Fedoras.

In the RHCS version 2, there was a component called openais which provided totem. The OpenAIS project was designed to be the heart of the cluster and was based around the Service Availability Forum's Application Interface Specification. AIS is an open API designed to provide inter-operable high availability services.

In 2008, it was decided that the AIS specification was overkill for most clustered applications being developed in the open source community. At that point, OpenAIS was split in to two projects: Corosync and OpenAIS. The former, Corosync, provides totem, cluster membership, messaging, and basic APIs for use by clustered applications, while the OpenAIS project became an optional add-on to corosync for users who want the full AIS API.

You will see a lot of references to OpenAIS while searching the web for information on clustering. Understanding its evolution will hopefully help you avoid confusion.

Concept; quorum

Quorum is defined as the minimum set of hosts required in order to provide clustered services and is used to prevent split-brain situations.

The quorum algorithm used by the RHCS cluster is called "simple majority quorum", which means that more than half of the hosts must be online and communicating in order to provide service. While simple majority quorum is a very common quorum algorithm, other quorum algorithms exist (grid quorum, YKD Dynamic Linear Voting, etc.).

The idea behind quorum is that, when a cluster splits into two or more partitions, whichever group of machines has quorum can safely start clustered services knowing that no other lost nodes will try to do the same.

Take this scenario;

  • You have a cluster of four nodes, each with one vote.
    • The cluster's expected_votes is 4. A clear majority, in this case, is 3 because (4/2)+1, rounded down, is 3.
    • Now imagine that there is a failure in the network equipment and one of the nodes disconnects from the rest of the cluster.
    • You now have two partitions; One partition contains three machines and the other partition has one.
    • The three machines will have quorum, and the other machine will lose quorum.
    • The partition with quorum will reconfigure and continue to provide cluster services.
    • The partition without quorum will withdraw from the cluster and shut down all cluster services.

When the cluster reconfigures and the partition wins quorum, it will fence the node(s) in the partition without quorum. Once the fencing has been confirmed successful, the partition with quorum will begin accessing clustered resources, like shared filesystems.

This also helps explain why an even 50% is not enough to have quorum, a common question for people new to clustering. Using the above scenario, imagine if the split were 2 and 2 nodes. Because neither can be sure what the other would do, neither can safely proceed. If we allowed an even 50% to have quorum, both partitions might try to take over the clustered services and disaster would soon follow.

There is one, and only one, exception to this rule.

In the case of a two node cluster, as we will be building here, any failure results in a 50/50 split. If we enforced quorum in a two-node cluster, there would never be high availability because any failure would cause both nodes to withdraw. The risk with this exception is that we now place the entire safety of the cluster on fencing, a concept we will cover in a second. Fencing is a second line of defense and something we are loath to rely on alone.

Even in a two-node cluster though, proper quorum can be maintained by using a quorum disk, called a qdisk. Unfortunately, qdisk on a DRBD resource comes with its own problems, so we will not be able to use it here.

Concept; Virtual Synchrony

Many cluster operations, like fencing, distributed locking and so on, have to occur in the same order across all nodes. This concept is called "virtual synchrony".

This is provided by corosync using "closed process groups", CPG. A closed process group is simply a private group of processes in a cluster. Within this closed group, all messages between members are ordered. Delivery, however, is not guaranteed. If a member misses messages, it is up to the member's application to decide what action to take.

Let's look at two scenarios showing how locks are handled using CPG;

  • The cluster starts up cleanly with two members.
  • Both members are able to start service:foo.
  • Both want to start it, but need a lock from DLM to do so.
    • The an-node01 member has its totem token, and sends its request for the lock.
    • DLM issues a lock for that service to an-node01.
    • The an-node02 member requests a lock for the same service.
    • DLM rejects the lock request.
  • The an-node01 member successfully starts service:foo and announces this to the CPG members.
  • The an-node02 sees that service:foo is now running on an-node01 and no longer tries to start the service.
  • The two members want to write to a common area of the /shared GFS2 partition.
    • The an-node02 sends a request for a DLM lock against the FS, gets it.
    • The an-node01 sends a request for the same lock, but DLM sees that a lock is pending and rejects the request.
    • The an-node02 member finishes altering the file system, announces the change over CPG and releases the lock.
    • The an-node01 member updates its view of the filesystem, requests a lock, receives it and proceeds to update the filesystem.
    • It completes the changes, announces them over CPG and releases the lock.

Messages can only be sent to the members of the CPG while the node has a totem token from corosync.

Concept; Fencing

Fencing is an absolutely critical part of clustering. Without fully working fence devices, your cluster will fail.

Was that strong enough, or should I say that again? Let's be safe:

DO NOT BUILD A CLUSTER WITHOUT PROPER, WORKING AND TESTED FENCING.

Sorry, I promise that this will be the only time that I speak so strongly. Fencing really is critical, and explaining the need for fencing is nearly a weekly event.

So then, let's discuss fencing.

When a node stops responding, an internal timeout and counter start ticking away. During this time, no DLM locks are allowed to be issued. Anything using DLM, including rgmanager, clvmd and gfs2, are effectively hung. The hung node is detected using a totem token timeout. That is, if a token is not received from a node within a period of time, it is considered lost and a new token is sent. After a certain number of lost tokens, the cluster declares the node dead. The remaining nodes reconfigure into a new cluster and, if they have quorum (or if quorum is ignored), a fence call against the silent node is made.

The fence daemon will look at the cluster configuration and get the fence devices configured for the dead node. Then, one at a time and in the order that they appear in the configuration, the fence daemon will call those fence devices, via their fence agents, passing to the fence agent any configured arguments like username, password, port number and so on. If the first fence agent returns a failure, the next fence agent will be called. If the second fails, the third will be called, then the fourth and so on. Once the last (or perhaps only) fence device fails, the fence daemon will retry again, starting back at the start of the list. It will do this indefinitely until one of the fence devices succeeds.

Here's the flow, in point form:

  • The totem token moves around the cluster members. As each member gets the token, it sends sequenced messages to the CPG members.
  • The token is passed from one node to the next, in order and continuously during normal operation.
  • Suddenly, one node stops responding.
    • A timeout starts (~238ms by default), and each time the timeout is hit, an error counter increments and a replacement token is created.
    • The silent node responds before the failure counter reaches the limit.
      • The failure counter is reset to 0
      • The cluster operates normally again.
  • Again, one node stops responding.
    • Again, the timeout begins. As each totem token times out, a new packet is sent and the error count increments.
    • The error counts exceed the limit (4 errors is the default); Roughly one second has passed (238ms * 4 plus some overhead).
    • The node is declared dead.
    • The cluster checks which members it still has, and if that provides enough votes for quorum.
      • If there are too few votes for quorum, the cluster software freezes and the node(s) withdraw from the cluster.
      • If there are enough votes for quorum, the silent node is declared dead.
        • corosync calls fenced, telling it to fence the node.
        • The fenced daemon notifies DLM and locks are blocked.
        • Which fence device(s) to use, that is, what fence_agent to call and what arguments to pass, is gathered.
        • For each configured fence device:
          • The agent is called and fenced waits for the fence_agent to exit.
          • The fence_agent's exit code is examined. If it's a success, recovery starts. If it failed, the next configured fence agent is called.
        • If all (or the only) configured fence fails, fenced will start over.
        • fenced will wait and loop forever until a fence agent succeeds. During this time, the cluster is effectively hung.
      • Once a fence_agent succeeds, fenced notifies DLM and lost locks are recovered.
        • GFS2 partitions recover using their journal.
        • Lost cluster resources are recovered as per rgmanager's configuration (including file system recovery as needed).
  • Normal cluster operation is restored, minus the lost node.

This skipped a few key things, but the general flow of logic should be there.

This is why fencing is so important. Without a properly configured and tested fence device or devices, the cluster will never successfully fence the silent node and will remain hung until a human intervenes.

Component; totem

The totem protocol defines message passing within the cluster and it is used by corosync. A token is passed around all the nodes in the cluster, and nodes can only send messages while they have the token. A node will keep its messages in memory until it gets the token back with no "not ack" messages. This way, if a node missed a message, it can request that it be resent when it gets its token. If a node isn't up, it will simply miss the messages.

The totem protocol supports something called RRP, the Redundant Ring Protocol. Through RRP, you can add a second backup ring on a separate network to take over in the event of a failure in the first ring. In RHCS, these rings are known as "ring 0" and "ring 1". RRP is being re-introduced in RHCS version 3. Its use is experimental and should only be attempted with plenty of testing.

Component; rgmanager

When the cluster membership changes, corosync tells rgmanager that it needs to recheck its services. It will examine what changed and then start, stop, migrate or recover cluster resources as needed.

Within rgmanager, one or more resources are brought together as a service. This service is then optionally assigned to a failover domain, a subset of nodes that can have preferential ordering.

The rgmanager daemon runs separately from the cluster manager, cman. This means that, to fully start the cluster, we need to start both cman and then rgmanager.
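In practice, that simply means starting the two init scripts, in order; a quick sketch (we will do this properly later in the tutorial):

/etc/init.d/cman start
/etc/init.d/rgmanager start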

Component; qdisk

Note: qdisk does not work reliably on a DRBD resource, so we will not be using it in this tutorial.

A quorum disk, known as a qdisk, is a small partition on SAN storage used to enhance quorum. It generally carries enough votes to allow even a single node to take quorum during a cluster partition. It does this by using configured heuristics, that is, custom tests, to decide which node or partition is best suited for providing clustered services during a cluster reconfiguration. These heuristics can be simple, like testing which partition has access to a given router, or they can be as complex as the administrator wishes, using custom scripts.

Though we won't be using it here, it is well worth knowing about when you move to a cluster with SAN storage.
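Just for reference, should you come back to it later; a quorum disk is defined inside the <cluster>...</cluster> element of cluster.conf using a <quorumd...> tag with one or more heuristics. The sketch below is illustrative only and is not used in this tutorial; the label, votes, timings, scores and the router IP used by the heuristic are all assumptions.

	<quorumd interval="1" tko="10" votes="1" label="an-qdisk">
		<heuristic program="ping -c1 10.255.255.254" score="1" interval="2" tko="3"/>
	</quorumd>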

Component; DRBD

DRBD, the Distributed Replicated Block Device, is a technology that takes raw storage from two or more nodes and keeps their data synchronized in real time. It is sometimes described as "RAID 1 over cluster nodes", and that is conceptually accurate. In this tutorial's cluster, DRBD will provide the back-end storage as a cost-effective alternative to a traditional SAN device.

To help visualize DRBD's use and role, take a look at how we will implement our cluster's storage.

This shows;

  • Each node having four physical disks tied together in a RAID Level 5 array and presented to the Node's OS as a single drive which is found at /dev/sda.
  • Each node's OS uses three primary partitions for /boot, <swap> and /.
  • Three extended partitions are created;
    • /dev/sda5 backs a small partition used as a GFS2-formatted shared mount point.
    • /dev/sda6 backs the VMs designed to run primarily on an-node01.
    • /dev/sda7 backs the VMs designed to run primarily on an-node02.
  • All three extended partitions are combined using DRBD to create three DRBD resources;
    • /dev/drbd0 is backed by /dev/sda5.
    • /dev/drbd1 is backed by /dev/sda6.
    • /dev/drbd2 is backed by /dev/sda7.
  • All three DRBD resources are managed by clustered LVM.
  • The GFS2-formatted LV is mounted on /shared on both nodes.
  • Each VM gets it's own LV.
  • All three DRBD resources sync over the Storage Network, which uses the bonded bond1 interface (backed by eth1 and eth4).

Don't worry if this seems illogical at this stage. The main things to look at are the drbdX devices and how each ties back to a corresponding sdaY device on either node.

 _________________________________________________                 _________________________________________________ 
| [ an-node01 ]                                   |               |                                   [ an-node02 ] |
|  ________       __________                      |               |                      __________       ________  |
| [_disk_1_]--+--[_/dev/sda_]                     |               |                     [_/dev/sda_]--+--[_disk_1_] |
|  ________   |    |   ___________    _______     |               |     _______    ___________   |    |   ________  |
| [_disk_2_]--+    +--[_/dev/sda1_]--[_/boot_]    |               |    [_/boot_]--[_/dev/sda1_]--+    +--[_disk_2_] |
|  ________   |    |   ___________    ________    |               |    ________    ___________   |    |   ________  |
| [_disk_3_]--+    +--[_/dev/sda2_]--[_<swap>_]   |               |   [_<swap>_]--[_/dev/sda2_]--+    +--[_disk_3_] |
|  ________   |    |   ___________    ___         |               |         ___    ___________   |    |   ________  |
| [_disk_4_]--/    +--[_/dev/sda3_]--[_/_]        |               |        [_/_]--[_/dev/sda3_]--+    \--[_disk_4_] |
|                  |   ___________                |               |                ___________   |                  |
|                  +--[_/dev/sda5_]------------\  |               |  /------------[_/dev/sda5_]--+                  |
|                  |   ___________             |  |               |  |             ___________   |                  |
|                  +--[_/dev/sda6_]----------\ |  |               |  | /----------[_/dev/sda6_]--+                  |
|                  |   ___________           | |  |               |  | |           ___________   |                  |
|                  \--[_/dev/sda7_]--------\ | |  |               |  | | /--------[_/dev/sda7_]--/                  |
|        _______________    ____________   | | |  |               |  | | |   ____________    _______________        |
|    /--[_Clustered_LVM_]--[_/dev/drbd2_]--/ | |  |               |  | | \--[_/dev/drbd2_]--[_Clustered_LVM_]--\    |
|   _|__                     |   _______     | |  |               |  | |      |   _______                    __|_   |
|  [_PV_]                    \--{_bond1_}    | |  |               |  | |      \--{_bond1_}                  [_PV_]  |
|   _|_______                                | |  |               |  | |                                _______|_   |
|  [_an2-vg0_]                               | |  |               |  | |                               [_an2-vg0_]  |
|    |   _______________________    .......  | |  |               |  | |   _____     _______________________   |    |
|    +--[_/dev/an2-vg0/vm0003_1_]---:.vm3.:  | |  |               |  | |  [_vm3_]---[_/dev/an2-vg0/vm0003_1_]--+    |
|    |   _______________________    .......  | |  |               |  | |   _____     _______________________   |    |
|    \--[_/dev/an2-vg0/vm0004_1_]---:.vm4.:  | |  |               |  | |  [_vm4_]---[_/dev/an2-vg0/vm0004_1_]--/    |
|          _______________    ____________   | |  |               |  | |   ____________    _______________          |
|      /--[_Clustered_LVM_]--[_/dev/drbd1_]--/ |  |               |  | \--[_/dev/drbd1_]--[_Clustered_LVM_]--\      |
|     _|__                     |   _______     |  |               |  |      |   _______                    __|_     |
|    [_PV_]                    \--{_bond1_}    |  |               |  |      \--{_bond1_}                  [_PV_]    |
|     _|_______                                |  |               |  |                                ___ ___|_     |
|    [_an1-vg0_]                               |  |               |  |                               [_an1-vg0_]    |
|      |   _______________________     _____   |  |               |  |      .......    ___________________   |      |
|      +--[_/dev/an1-vg0/vm0001_1_]---[_vm1_]  |  |               |  |      :.vm1.:---[_/dev/vg0/vm0001_1_]--+      |
|      |   _______________________     _____   |  |               |  |      .......    ___________________   |      |
|      \--[_/dev/an1-vg0/vm0002_1_]---[_vm2_]  |  |               |  |      :.vm2.:---[_/dev/vg0/vm0002_1_]--/      |
|            _______________    ____________   |  |               |  |   ____________    _______________            |
|        /--[_Clustered_LVM_]--[_/dev/drbd0_]--/  |               |  \--[_/dev/drbd0_]--[_Clustered_LVM_]--\        |
|       _|__                     |   _______      |               |       |   _______                    __|_       |
|      [_PV_]                    \--{_bond1_}     |               |       \--{_bond1_}                  [_PV_]      |
|       _|__________                              |               |                              __________|_       |
|      [_shared-vg0_]                             |               |                             [_shared-vg0_]      |
|       _|_________________________               |               |               _________________________|_       |
|      [_/dev/shared-vg0/lv_shared_]              |               |              [_/dev/shared-vg0/lv_shared_]      |
|        |   ______    _________                  |               |                  _________    ______   |        |
|        \--[_GFS2_]--[_/shared_]                 |               |                 [_/shared_]--[_GFS2_]--/        |
|                                          _______|   _________   |_______                                          |
|                                         | bond1 =--| Storage |--= bond1 |                                         |
|                                         |______||  | Network |  ||______|                                         |
|_________________________________________________|  |_________|  |_________________________________________________|
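To tie the diagram back to real configuration, here is a minimal sketch of what a DRBD resource definition for /dev/drbd0 could look like, saved as something like /etc/drbd.d/r0.res. The resource name, TCP port and use of internal meta-data are assumptions for illustration only; the node names and Storage Network IPs are the ones used throughout this tutorial.

resource r0 {
	device    /dev/drbd0;
	disk      /dev/sda5;
	meta-disk internal;
	on an-node01.alteeve.com {
		address 10.10.0.1:7788;
	}
	on an-node02.alteeve.com {
		address 10.10.0.2:7788;
	}
}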

Component; Clustered LVM

With DRBD providing the raw storage for the cluster, we must next consider partitions. This is where Clustered LVM, known as CLVM, comes into play.

CLVM is cluster-safe because it uses DLM, the distributed lock manager. It won't grant access to cluster members outside of corosync's closed process group, which, in turn, requires quorum.

It is ideal because it can take one or more raw devices, known as "physical volumes", or simply PVs, and combine their raw space into one or more "volume groups", known as VGs. These volume groups then act just like a typical hard drive and can be "partitioned" into one or more "logical volumes", known as LVs. These LVs are where KVM's virtual machine guests will exist and where we will create our GFS2 clustered file system.

LVM is particularly attractive because of how flexible it is. We can easily add new physical volumes later, and then grow an existing volume group to use the new space. This new space can then be given to existing logical volumes, or entirely new logical volumes can be created. This can all be done while the cluster is online, offering an upgrade path with no downtime.
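As a rough illustration of that flexibility, clustered LVM is driven with the same commands as regular LVM, plus the clustered flag when creating the volume group. The device and names below follow the storage diagram above; the size is an arbitrary example.

pvcreate /dev/drbd0
vgcreate -c y shared-vg0 /dev/drbd0
lvcreate -L 20G -n lv_shared shared-vg0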

Component; GFS2

With DRBD providing the clusters raw storage space, and Clustered LVM providing the logical partitions, we can now look at the clustered file system. This is the role of the Global File System version 2, known simply as GFS2.

It works much like a standard file system, with user-land tools like mkfs.gfs2, fsck.gfs2 and so on. The major difference is that it and clvmd use the cluster's distributed locking mechanism, provided by the dlm_controld daemon. Once formatted, the GFS2 partition can be mounted and used by any node in the cluster's closed process group. All nodes can then safely read from and write to the data on the partition simultaneously.
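As a sketch only, formatting and mounting a GFS2 partition looks like the following. The cluster name portion of the -t switch must match the name in cluster.conf; the file system name after the colon and the journal count of 2 (one per node) are assumptions for illustration.

mkfs.gfs2 -p lock_dlm -t an-clusterA:shared -j 2 /dev/shared-vg0/lv_shared
mount /dev/shared-vg0/lv_shared /shared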

Component; DLM

One of the major roles of a cluster is to provide distributed locking for clustered storage and resource management.

Whenever a resource, GFS2 filesystem or clustered LVM LV needs a lock, it sends a request to dlm_controld, which runs in userspace. This communicates with DLM in the kernel. If the lock group does not yet exist, DLM will create it and then give the lock to the requester. Should a subsequent lock request come in for the same lock group, it will be rejected. Once the application using the lock is finished with it, it will release the lock. After this, another node may request and receive a lock for the lock group.

If a node fails, fenced will alert dlm_controld that a fence is pending and new lock requests will block. After a successful fence, fenced will alert DLM that the node is gone and any locks the victim node held are released. At this time, other nodes may request a lock on the lock groups the lost node held and can perform recovery, like replaying a GFS2 filesystem journal, prior to resuming normal operation.

Note that DLM locks are not used for actually locking the file system. That job is still handled by plock() calls (POSIX locks).

Component; KVM

Two of the most popular open-source virtualization platforms available in the Linux world today are Xen and KVM. The former is maintained by Citrix and the latter by Red Hat. It would be difficult to say which is "better", as they're both very good. Xen can be argued to be more mature, while KVM is the "official" solution supported by Red Hat in EL6.

We will be using the KVM hypervisor, within which our highly-available virtual machine guests will reside. It is a type-2 (hosted) hypervisor, which means that the host operating system runs directly on the bare hardware and KVM runs within that operating system. Contrast this with Xen, a type-1 hypervisor, where the hypervisor itself runs on the bare hardware and even the installed OS is just another virtual machine.

Network

The cluster will use three separate Class B (/16) networks, plus an optional fourth;

Purpose Subnet Notes
Internet-Facing Network (IFN) 10.255.0.0/16
  • Each node will use 10.255.0.x where x matches the node ID.
  • Virtual Machines in the cluster that need to be connected to the Internet will use 10.255.y.z where y corresponds to the cluster and z is a simple sequence number matching the VM ID.
Storage Network (SN) 10.10.0.0/16
  • Each node will use 10.10.0.x where x matches the node ID.
Back-Channel Network (BCN) 10.20.0.0/16
  • Each node will use 10.20.0.x where x matches the node ID.
  • Node-specific IPMI or other out-of-band management devices will use 10.20.1.x where x matches the node ID.
  • Multi-port fence devices will use 10.20.2.z where z is a simple sequence.
  • Miscellaneous equipment in the cluster, like managed switches, will use 10.20.3.z where z is a simple sequence.

Optional OpenVPN Network 10.30.0.0/16
  • For clients behind firewalls, I like to create a VPN server for the cluster nodes to log into when support is needed. This way, the client retains control over when remote access is available simply by starting and stopping the openvpn daemon. This will not be discussed any further in this tutorial.

We will be using six interfaces, bonded into three pairs of two NICs in Active/Passive (mode 1) configuration. Each link of each bond will be on alternate, unstacked switches. This configuration is the only configuration supported by Red Hat in clusters. We will also configure affinity by specifying interfaces eth0, eth1 and eth2 as primary for the bond0, bond1 and bond2 interfaces, respectively. This way, when everything is working fine, all traffic is routed through the same switch for maximum performance.

Note: Only the bonded interface used by corosync must be in Active/Passive configuration (bond0 in this tutorial). If you want to experiment with other bonding modes for bond1 or bond2, please feel free to do so. That is outside the scope of this tutorial, however.

If you cannot install six interfaces in your server, then four interfaces will do, with the SN and BCN networks merged.

Warning: If you wish to merge the SN and BCN onto one interface, test to ensure that the storage traffic will not block cluster communication. Test by forming your cluster and then pushing your storage to maximum read and write performance for an extended period of time (minimum of several seconds). If the cluster partitions, you will need to do some advanced quality-of-service or other network configuration to ensure reliable delivery of cluster network traffic.
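One simple way to generate a sustained write load for such a test, purely as an illustration (the target file and size are arbitrary, but the file must live on the storage being tested):

dd if=/dev/zero of=/shared/load-test.img bs=1M count=10000 oflag=direct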

In this tutorial, we will use two D-Link DGS-3100-24, unstacked, using three VLANs to isolate the three networks.

  • IFN will have VLAN ID number 100.
  • SN will have VLAN ID number 101.
  • BCN will have VLAN ID number 102.

You could just as easily use four or six unmanaged 5-port or 8-port switches. What matters is that the three subnets are isolated and that each link of each bond is on a separate switch. Lastly, for security reasons, only connect the IFN switches or VLANs to the Internet.

The actual mapping of interfaces to bonds to networks will be:

Subnet Link 1 Link 2 Bond IP
BCN eth0 eth3 bond0 10.20.0.x
SN eth1 eth4 bond1 10.10.0.x
IFN eth2 eth5 bond2 10.255.0.x

Setting Up the Network

Warning: The following steps can easily get confusing, given how many files we need to edit. Losing access to your server's network is a very real possibility! Do not continue without direct access to your servers! If you have out-of-band access via iKVM, console redirection or similar, be sure to test that it is working before proceeding.

Managed and Stacking Switch Notes

Note: If you have two stackable switches, do not stack them!

There are two things you need to be wary of with managed switches.

  • Don't stack them. It may seem like it makes sense to stack them and create Link Aggregation Groups, but this is not supported. Leave the two switches as independent units.
  • Disable Spanning Tree Protocol on all ports used by the cluster. Otherwise, when a lost switch is recovered, STP negotiation will cause traffic to stop on the ports for upwards of thirty seconds. This is more than enough time to partition a cluster.

Enable STP on the ports you use for uplinking the two switches and disable it on all other ports.

Making Sure We Know Our Interfaces

When you installed the operating system, the network interface names were assigned to the physical network interfaces somewhat randomly. It is more than likely that you will want to re-order them.

Before you start moving interface names around, you will want to consider which physical interfaces you will want to use on which networks. At the end of the day, the names themselves have no meaning. At the very least though, make them consistent across nodes.

Some things to consider, in order of importance:

  • If you have a shared interface for your out-of-band management interface, like IPMI or iLO, you will want that interface to be on the Back-Channel Network.
  • For redundancy, you want to spread out which interfaces are paired up. In my case, I have three interfaces on my mainboard and three additional add-in cards. I will pair each onboard interface with an add-in interface. In my case, my IPMI interface physically piggy-backs on one of the onboard interfaces so this interface will need to be part of the BCN bond.
  • Your interfaces with the lowest latency should be used for the back-channel network.
  • Your two fastest interfaces should be used for your storage network.
  • The remaining two slowest interfaces should be used for the Internet-Facing Network bond.

In my case, all six interfaces are identical, so there is little to consider. The left-most interface on my system has IPMI, so its paired network interface will be eth0. I simply work my way left, incrementing as I go. What you do will be whatever makes most sense to you.

There is a separate, short tutorial on re-ordering network interfaces.
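For reference, on EL6 the name-to-MAC mapping lives in /etc/udev/rules.d/70-persistent-net.rules, with one line per interface along the lines of the sketch below; the MAC address shown is just an example.

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:e0:81:c7:ec:49", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"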

Once you have the physical interfaces named the way you like, proceed to the next step.

Planning Our Network

To set up our network, we will need to edit the ifcfg-ethX, ifcfg-bondX and ifcfg-vbrX scripts. The last of these creates the bridges that will be used to route network connections to the virtual machines. We won't be creating a vbr1 bridge though, as bond1 will be dedicated to storage and never used by a VM. The bridges, not the bonded interfaces, will hold the IP addresses; the bonds will instead be slaved to their respective bridges.

We're going to be editing a lot of files. It's best to lay out what we'll be doing in a chart. So our setup will be:

Node BCN IP and Device SN IP and Device IFN IP and Device
an-node01 10.20.0.1 on vbr0 (bond0 slaved) 10.10.0.1 on bond1 10.255.0.1 on vbr2 (bond2 slaved)
an-node02 10.20.0.2 on vbr0 (bond0 slaved) 10.10.0.2 on bond1 10.255.0.2 on vbr2 (bond2 slaved)

Creating Some Network Configuration Files

Bridge configuration files must have a file name that sorts after the interface and bond configuration files. The actual device name can be whatever you want, though. If the system tries to start the bridge before its interfaces are up, it will fail. I personally like to use the name vbrX for "virtual machine bridge". You can use whatever makes sense to you, with the above concern in mind.

Start by touching the configuration files we will need.

touch /etc/sysconfig/network-scripts/ifcfg-bond{0,1,2}
touch /etc/sysconfig/network-scripts/ifcfg-vbr{0,2}

Now make a backup of your configuration files, in case something goes wrong and you want to start over.

mkdir /root/backups/
rsync -av /etc/sysconfig/network-scripts/ifcfg-eth* /root/backups/
sending incremental file list
ifcfg-eth0
ifcfg-eth1
ifcfg-eth2
ifcfg-eth3
ifcfg-eth4
ifcfg-eth5

sent 1467 bytes  received 126 bytes  3186.00 bytes/sec
total size is 1119  speedup is 0.70

Configuring Our Bridges

Now let's start in reverse order. We'll write the bridge configurations, then the bond interfaces and finally alter the interface configuration files. The reason for working in reverse is to minimize the amount of time during which a sudden restart would leave us without network access.

Note: If you know now that none of your VMs will ever need access to the BCN, as might be the case if all VMs will be web-facing, then you can skip the creation of vbr0. In this case, move the IP address and related values to the ifcfg-bond0 configuration file.

an-node01 BCN Bridge:

vim /etc/sysconfig/network-scripts/ifcfg-vbr0
# Back-Channel Network - Bridge
DEVICE="vbr0"
TYPE="Bridge"
BOOTPROTO="static"
IPADDR="10.20.0.1"
NETMASK="255.255.0.0"

an-node01 IFN Bridge:

vim /etc/sysconfig/network-scripts/ifcfg-vbr2
# Internet-Facing Network - Bridge
DEVICE="vbr2"
TYPE="Bridge"
BOOTPROTO="static"
IPADDR="10.255.0.1"
NETMASK="255.255.0.0"
GATEWAY="10.255.255.254"
DNS1="192.139.81.117"
DNS2="192.139.81.1"
DEFROUTE="yes"

Creating the Bonded Interfaces

Now we can create the actual bond configuration files.

To explain the BONDING_OPTS values;

  • mode=1 sets the bonding mode to active-backup.
  • The miimon=100 tells the bonding driver to check if the network cable has been unplugged or plugged in every 100 milliseconds.
  • The use_carrier=1 tells the bonding driver to rely on the interface driver's reported carrier state to determine the link state. Some drivers don't support this. If you run into trouble, try changing this to 0.
  • The updelay=120000 tells the driver to delay switching back to the primary interface for 120,000 milliseconds (2 minutes) after its link is restored. This is designed to give the switch connected to the primary interface time to finish booting. Setting this too low may cause the bonding driver to switch back before the network switch is ready to actually move data.
  • The downdelay=0 tells the driver not to wait before changing the state of an interface when the link goes down. That is, when the driver detects a fault, it will switch to the backup interface immediately.

an-node01 BCN Bond:

vim /etc/sysconfig/network-scripts/ifcfg-bond0
# Back-Channel Network - Bond
DEVICE="bond0"
BRIDGE="vbr0"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=eth0"

an-node01 SN Bond:

vim /etc/sysconfig/network-scripts/ifcfg-bond1
# Storage Network - Bond
DEVICE="bond1"
BOOTPROTO="static"
NM_CONTROLLED="no"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=eth1"
IPADDR="10.10.0.1"
NETMASK="255.255.0.0"

an-node01 IFN Bond:

vim /etc/sysconfig/network-scripts/ifcfg-bond2
# Internet-Facing Network - Bond
DEVICE="bond2"
BRIDGE="vbr2"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=eth2"

Alter The Interface Configurations

Now, finally, alter the interfaces themselves to join their respective bonds.

an-node01's eth0, the BCN bond0, Link 1:

vim /etc/sysconfig/network-scripts/ifcfg-eth0
# Back-Channel Network - Link 1
HWADDR="00:E0:81:C7:EC:49"
DEVICE="eth0"
NM_CONTROLLED="no"
ONBOOT="yes"
BOOTPROTO="none"
MASTER="bond0"
SLAVE="yes"

an-node01's eth1, the SN bond1, Link 1:

vim /etc/sysconfig/network-scripts/ifcfg-eth1
# Storage Network - Link 1
HWADDR="00:E0:81:C7:EC:48"
DEVICE="eth1"
NM_CONTROLLED="no"
ONBOOT="yes"
BOOTPROTO="none"
MASTER="bond1"
SLAVE="yes"

an-node01's eth2, the IFN bond2, Link 1:

vim /etc/sysconfig/network-scripts/ifcfg-eth2
# Internet-Facing Network - Link 1
HWADDR="00:E0:81:C7:EC:47"
DEVICE="eth2"
NM_CONTROLLED="no"
ONBOOT="yes"
BOOTPROTO="none"
MASTER="bond2"
SLAVE="yes"

an-node01's eth3, the BCN bond0, Link 2:

vim /etc/sysconfig/network-scripts/ifcfg-eth3
# Back-Channel Network - Link 2
DEVICE="eth3"
NM_CONTROLLED="no"
ONBOOT="yes"
BOOTPROTO="none"
MASTER="bond0"
SLAVE="yes"

an-node01's eth4, the SN bond1, Link 2:

vim /etc/sysconfig/network-scripts/ifcfg-eth4
# Storage Network - Link 2
HWADDR="00:1B:21:BF:6F:FE"
DEVICE="eth4"
NM_CONTROLLED="no"
ONBOOT="yes"
BOOTPROTO="none"
MASTER="bond1"
SLAVE="yes"

an-node01's eth5, the IFN bond2, Link 2:

vim /etc/sysconfig/network-scripts/ifcfg-eth5
# Internet-Facing Network - Link 2
HWADDR="00:1B:21:BF:70:02"
DEVICE="eth5"
NM_CONTROLLED="no"
ONBOOT="yes"
BOOTPROTO="none"
MASTER="bond2"
SLAVE="yes"

Loading The New Network Configuration

Simply restart the network service.

/etc/init.d/network restart
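Once the network is back up, you can confirm that each bond formed and see which slave is currently active by reading the bonding driver's status file for each bond; for example:

cat /proc/net/bonding/bond0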

Updating /etc/hosts

On both nodes, update the /etc/hosts file to reflect your network configuration. Remember to add entries for your IPMI, switched PDUs and other devices.

vim /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

# an-node01
10.20.0.1	an-node01 an-node01.bcn an-node01.alteeve.com
10.20.1.1	an-node01.ipmi
10.10.0.1	an-node01.sn
10.255.0.1	an-node01.ifn

# an-node02
10.20.0.2	an-node02 an-node02.bcn an-node02.alteeve.com
10.20.1.2	an-node02.ipmi
10.10.0.2	an-node02.sn
10.255.0.2	an-node02.ifn

# Fence devices
10.20.2.1       pdu1 pdu1.alteeve.com
10.20.2.2       pdu2 pdu2.alteeve.com

# VPN interfaces, if used.
10.30.0.1	an-node01.vpn
10.30.0.2	an-node02.vpn
Warning: Whichever switch you have the IPMI interfaces connected to, be sure to connect the PDU to the opposite switch! If both fence types are on one switch, then that switch becomes a single point of failure!

Configuring The Cluster Foundation

We need to configure the cluster in two stages. This is because we have something of a chicken-and-egg problem.

  • We need clustered storage for our virtual machines.
  • Our clustered storage needs the cluster for fencing.

Conveniently, clustering has two logical parts;

  • Cluster communication and membership.
  • Cluster resource management.

The first part, communication and membership, covers which nodes are part of the cluster, ejecting faulty nodes and other such tasks. The second part, resource management, is provided by a second tool called rgmanager. It is this second part that we will set aside for later.

Installing Required Programs

Installing the cluster software is pretty simple;

yum install cman corosync rgmanager ricci gfs2-utils

Configuration Methods

In Red Hat Cluster Services, the heart of the cluster is found in the /etc/cluster/cluster.conf XML configuration file.

There are three main ways of editing this file. Two are already well documented, so I won't bother discussing them, beyond introducing them. The third way is by directly hand-crafting the cluster.conf file. This method is not very well documented, and directly manipulating configuration files is my preferred method. As my boss loves to say; "The more computers do for you, the more they do to you".

The first two, well documented, graphical tools are:

  • system-config-cluster, older GUI tool run directly from one of the cluster nodes.
  • Conga, comprised of the ricci node-side client and the luci web-based server (can be run on machines outside the cluster).

I do like the tools above, but I often find issues that send me back to the command line, so I'd recommend setting them aside for now as well. Once you feel comfortable with the cluster.conf syntax, then by all means, go back and use them. I'd recommend not becoming reliant on them though, which can happen if you start using them too early in your studies.

The First cluster.conf Foundation Configuration

The very first stage of building the cluster is to create a configuration file that is as minimal as possible. To do that, we need to define a few things;

  • The name of the cluster and the cluster file version.
    • Define cman options
    • The nodes in the cluster
      • The fence method for each node
    • Define fence devices
    • Define fenced options

That's it. Once we've defined this minimal amount, we will be able to start the cluster for the first time! So let's get to it, finally.

Name the Cluster and Set The Configuration Version

The cluster tag is the parent tag for the entire cluster configuration file.

vim /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="an-clusterA" config_version="1">
</cluster>

This tag has two attributes that we need to set: name="" and config_version="".

The name="" attribute defines the name of the cluster. It must be unique amongst the clusters on your network. It should be descriptive, but you will not want to make it too long, either. You will see this name in the various cluster tools and you will enter in, for example, when creating a GFS2 partition later on. This tutorial uses the cluster name an_clusterA. The reason for the A is to help differentiate it from the nodes which use sequence numbers.

The config_version="" attribute is an integer marking the version of the configuration file. Whenever you make a change to the cluster.conf file, you will need to increment this version number by 1. If you don't increment this number, then the cluster tools will not know that the file needs to be reloaded. As this is the first version of this configuration file, it will start with 1. Note that this tutorial will increment the version after every change, regardless of whether it is explicitly pushed out to the other nodes and reloaded. The reason is to help get into the habit of always increasing this value.

Configuring cman Options

We are going to setup a special case for our cluster; A 2-Node cluster.

This is a special case because traditional quorum will not be useful. With only two nodes, each having a vote of 1, the total votes is 2. Quorum needs 50% + 1, which means that a single node failure would shut down the cluster, as the remaining node's vote is 50% exactly. That kind of defeats the purpose of having a cluster at all.

So to account for this special case, there is a special attribute called two_node="1". This tells the cluster manager to continue operating with only one vote. This option requires that the expected_votes="" attribute be set to 1. Normally, expected_votes is set automatically to the total sum of the defined cluster nodes' votes (which itself is a default of 1). This is the other half of the "trick", as a single node's vote of 1 now always provides quorum (that is, 1 meets the 50% + 1 requirement).

<?xml version="1.0"?>
<cluster name="an-clusterA" config_version="2">
	<cman expected_votes="1" two_node="1"/>
</cluster>

Take note of the self-closing <... /> tag. This is XML syntax that tells the parser not to look for any child tags or a separate closing tag.

Defining Cluster Nodes

This example is a little artificial; please don't load it into your cluster, as we will need to add a few child tags, but one thing at a time.

This actually introduces two tags.

The first is the parent clusternodes tag, which takes no attributes of its own. Its sole purpose is to contain the clusternode child tags.

<?xml version="1.0"?>
<cluster name="an-clusterA" config_version="3">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-node01.alteeve.com" nodeid="1" />
		<clusternode name="an-node02.alteeve.com" nodeid="2" />
	</clusternodes>
</cluster>

The clusternode tag defines each cluster node. There are many attributes available, but we will look at just the two required ones.

The first is the name="" attribute. This should match the name given by uname -n ($HOSTNAME) when run on each node. The IP address that the name resolves to also determines the interface and subnet that the totem ring will run on; that is, the main cluster communications, which we are calling the Back-Channel Network. This is why it is so important to set up our /etc/hosts file correctly. Please see the clusternode's name attribute documentation for details on how name-to-interface mapping is resolved.
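A quick way to confirm, on each node, what name will be used and what it resolves to (and therefore which interface the totem ring will use) is, for example:

uname -n
getent hosts $(uname -n)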

The second attribute is nodeid="". This must be a unique integer amongst the <clusternode ...> tags. It is used by the cluster to identify the node.

Defining Fence Devices

Fence devices are designed to forcibly eject a node from a cluster. This is generally done by forcing it to power off or reboot. Some SAN switches can logically disconnect a node from the shared storage device, which has the same effect of guaranteeing that the defective node can not alter the shared storage. A common third type of fence device is one that cuts the mains power to the server.

In this tutorial, our nodes support IPMI, which we will use as the primary fence device. We also have an APC brand switched PDU which will act as a backup in case a fault in the node disables the IPMI BMC.

Note: Not all brands of switched PDUs are supported as fence devices. Before you purchase a fence device, confirm that it is supported.

All fence devices are contained within the parent fencedevices tag. This parent tag has no attributes. Within this parent tag are one or more fencedevice child tags.

<?xml version="1.0"?>
<cluster name="an-clusterA" config_version="4">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-node01.alteeve.com" nodeid="1" />
		<clusternode name="an-node02.alteeve.com" nodeid="2" />
	</clusternodes>
	<fencedevices>
		<fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="an-node01.ipmi" login="root" passwd="secret" />
		<fencedevice name="ipmi_an02" agent="fence_ipmilan" ipaddr="an-node02.ipmi" login="root" passwd="secret" />
		<fencedevice name="pdu2" agent="fence_apc" ipaddr="pdu2.alteeve.com" login="apc" passwd="secret" />
	</fencedevices>
</cluster>

Every fence device used in your cluster will have its own fencedevice tag. If you are using IPMI, this means you will have a fencedevice entry for each node, as each physical IPMI BMC is a unique fence device. On the other hand, fence devices that support multiple nodes, like switched PDUs, will have just one entry. In our case, we're using both types, so we have three fence devices; the two IPMI BMCs plus the switched PDU.

All fencedevice tags share two basic attributes; name="" and agent="".

  • The name attribute must be unique among all the fence devices in your cluster. As we will see in the next step, this name will be used within the <clusternode...> tag.
  • The agent attribute tells the cluster which fence agent to use when the fenced daemon needs to communicate with the physical fence device. A fence agent is simply a script that acts as a glue layer between the fenced daemon and the fence hardware. The agent takes the arguments from the daemon, like what port to act on and what action to take, and executes the requested action against the node. The agent is responsible for ensuring that the action succeeded and for returning an appropriate success or failure exit code. For those curious, the full details are described in the FenceAgentAPI. If you have two or more of the same type of fence device, like IPMI, then you will use the same fence agent value a corresponding number of times.

Beyond these two attributes, each fence agent will have its own set of attributes, the scope of which is outside this tutorial, though we will see examples for IPMI and a switched PDU. Most, if not all, fence agents have a corresponding man page that will show you what attributes they accept and how they are used. The two fence agents we will see here have their attributes defined in the following man pages.

  • man fence_ipmilan - IPMI fence agent.
  • man fence_apc - APC-brand switched PDU.

The example above is what this tutorial will use.

Example <fencedevice...> Tag For IPMI

Here we will show what IPMI <fencedevice...> tags look like. IPMI is the primary fence device in this tutorial and it is quite popular as a fence device in general, so let's look at an example of its use.

                ...
                        <fence>
                                <method name="ipmi">
                                        <device name="ipmi_an01" action="reboot"/>
                                </method>
                        </fence>
                ...
	<fencedevices>
		<fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="an-node01.ipmi" login="root" passwd="secret" />
		<fencedevice name="ipmi_an02" agent="fence_ipmilan" ipaddr="an-node02.ipmi" login="root" passwd="secret" />
	</fencedevices>
  • ipaddr; This is the resolvable name or IP address of the device. If you use a resolvable name, it is strongly advised that you put the name in /etc/hosts as DNS is another layer of abstraction which could fail.
  • login; This is the login name to use when the fenced daemon connects to the device.
  • passwd; This is the login password to use when the fenced daemon connects to the device.
  • name; This is the name of this particular fence device within the cluster which, as we will see shortly, is matched in the <clusternode...> element where appropriate.
Note: We will see shortly that, unlike switched PDUs or other network fence devices, IPMI does not have ports. This is because each IPMI BMC supports just its host system. More on that later.
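It is a good idea to test a fence device by hand before trusting it to the cluster. Purely as an illustrative sketch using the values above, the IPMI agent could be queried from the other node with something like:

fence_ipmilan -a an-node01.ipmi -l root -p secret -o status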

Example <fencedevice...> Tag For APC Switched PDUs

Here we will show how to configure APC switched PDU <fencedevice...> tags. In this tutorial, the switched PDU serves as the backup fence device. In the real world, it is highly recommended as a backup for IPMI and similar primary fence devices.

		...
			<fence>
				<method name="pdu">
					<device name="pdu2" port="1" action="reboot"/>
				</method>
			</fence>
		...
	<fencedevices>
		<fencedevice name="pdu2" agent="fence_apc" ipaddr="pdu2.alteeve.com" login="apc" passwd="secret" />
	</fencedevices>
  • ipaddr; This is the resolvable name or IP address of the device. If you use a resolvable name, it is strongly advised that you put the name in /etc/hosts as DNS is another layer of abstraction which could fail.
  • login; This is the login name to use when the fenced daemon connects to the device.
  • passwd; This is the login password to use when the fenced daemon connects to the device.
  • name; This is the name of this particular fence device within the cluster which, as we will see shortly, is matched in the <clusternode...> element where appropriate.
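Likewise, and again only as a sketch using the values above, the switched PDU agent could be asked for the state of a given port (port 2 here is an assumed example) with something like:

fence_apc -a pdu2.alteeve.com -l apc -p secret -n 2 -o status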

Using the Fence Devices

Now that we have nodes and fence devices defined, we will go back and tie them together. This is done by:

  • Defining a fence tag containing all fence methods and devices.
    • Defining one or more method tag(s) containing the device call(s) needed for each fence attempt.
      • Defining one or more device tag(s) containing attributes describing how to call the fence device to kill this node.

Here is how we implement IPMI as the primary fence device with the switched PDU as the backup method.

<?xml version="1.0"?>
<cluster name="an-clusterA" config_version="5">
        <cman expected_votes="1" two_node="1"/>
        <clusternodes>
                <clusternode name="an-node01.alteeve.com" nodeid="1">
                        <fence>
                                <method name="ipmi">
                                        <device name="ipmi_an01" action="reboot"/>
                                </method>
                                <method name="pdu">
                                        <device name="pdu2" port="1" action="reboot"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node02.alteeve.com" nodeid="2">
                        <fence>
                                <method name="ipmi">
                                        <device name="ipmi_an02" action="reboot"/>
                                </method>
                                <method name="pdu">
                                        <device name="pdu2" port="2" action="reboot"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="an-node01.ipmi" login="root" passwd="secret" />
                <fencedevice name="ipmi_an02" agent="fence_ipmilan" ipaddr="an-node02.ipmi" login="root" passwd="secret" />
                <fencedevice name="pdu2" agent="fence_apc" ipaddr="pdu2.alteeve.com" login="apc" passwd="secret" />
        </fencedevices>
</cluster>

First, notice that the fence tag has no attributes. It's merely a container for the method(s).

The next level is the method named ipmi. This name is merely a description and can be whatever you feel is most appropriate. Its purpose is simply to help you distinguish this method from other methods. The reason for method tags is that some fence calls take two or more steps. A classic example would be a node with redundant power supplies fed by a switched PDU acting as the fence device. In such a case, you will need to define multiple device tags, one for each power cable feeding the node, and the cluster will not consider the fence a success unless and until all contained device calls execute successfully.

The same pair of method and device tags is supplied a second time. The first pair defines the IPMI interface, and the second pair defines the switched PDU. Note that the PDU definition needs a port="" attribute where the IPMI fence device does not. When a fence call is needed, the fence methods will be tried in the order they are found here. If both fail, the cluster will go back to the start and try again, looping indefinitely until one succeeds.

Note: It's important to understand why we use IPMI as the primary fence device. It is suggested, but not required, that a fence device confirm that the node is off. IPMI can do this; a switched PDU can not. Thus, IPMI won't return a success unless the node is truly off, whereas the PDU will return a success as soon as power is cut to the requested port. A misconfigured node with redundant power supplies may in fact still be running, leading to disastrous consequences.

The actual fence device configuration is the final piece of the puzzle. It is here that you specify per-node configuration options and link these attributes to a given fencedevice. Here, we see the link to the fencedevice via its name, ipmi_an01 in this example.

Let's step through an example fence call to help show how the per-node and fence device attributes are combined during a fence call.

  • The cluster manager decides that a node needs to be fenced. Let's say that the victim is an-node02.
  • The fence section under an-node02 is consulted. Within it there are two method entries, named ipmi and pdu. The IPMI method's device has one attribute while the PDU's device has two attributes;
    • port; only found in the pdu method, this tells the cluster that an-node02 is connected to the switched PDU's port number 2.
    • action; Found on both devices, this tells the cluster that the fence action to take is reboot. How this action is actually interpreted depends on the fence device in use, though the name certainly implies that the node will be forced off and then restarted.
  • The cluster searches in fencedevices for a fencedevice matching the name ipmi_an02. This fence device has four attributes;
    • agent; This tells the cluster to call the fence_ipmilan fence agent script, as we discussed earlier.
    • ipaddr; This tells the fence agent where on the network to find this particular IPMI BMC. This is how multiple fence devices of the same type can be used in the cluster.
    • login; This is the login user name to use when authenticating against the fence device.
    • passwd; This is the password to supply along with the login name when authenticating against the fence device.
  • Should the IPMI fence call fail for some reason, the cluster will move on to the second pdu method, repeating the steps above but using the PDU values.

When the cluster calls the fence agent, it does so by initially calling the fence agent script with no arguments.

/usr/sbin/fence_ipmilan

Then it will pass to that agent the following arguments:

ipaddr=an-node02.ipmi
login=root
passwd=secret
action=reboot

As you can see then, the first three arguments are from the fencedevice attributes and the last one is from the device attributes under an-node02's clusternode's fence tag.

If this method fails, then the PDU will be called in a very similar way, but with an extra argument from the device attributes.

/usr/sbin/fence_apc

Then it will pass to that agent the following arguments:

ipaddr=pdu2.alteeve.com
login=apc
passwd=secret
port=2
action=reboot

Should this fail, the cluster will go back and try the IPMI interface again. It will loop through the fence device methods forever until one of the methods succeeds.

Give Nodes More Time To Start

Clusters with more than two nodes must gain quorum before they can fence other nodes. As we saw earlier though, this is not the case when using the two_node="1" attribute in the cman tag. What this means in practice is that if you start the cluster on one node and then wait too long to start the cluster on the second node, the first will fence the second.

The logic behind this is: when the cluster starts, it will try to talk to its fellow node and fail. With the special two_node="1" attribute set, the cluster knows that it is allowed to start clustered services, but it has no way to say for sure what state the other node is in. It could well be online and hosting services for all it knows. So it has to proceed on the assumption that the other node is alive and using shared resources. Given that, and given that it can not talk to the other node, its only safe option is to fence the other node. Only then can it be confident that it is safe to start providing clustered services.

<?xml version="1.0"?>
<cluster name="an-clusterA" config_version="7">
        <cman expected_votes="1" two_node="1"/>
        <clusternodes>
                <clusternode name="an-node01.alteeve.com" nodeid="1">
                        <fence>
                                <method name="ipmi">
                                        <device name="ipmi_an01" action="reboot"/>
                                </method>
                                <method name="pdu">
                                        <device name="pdu2" port="1" action="reboot"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node02.alteeve.com" nodeid="2">
                        <fence>
                                <method name="ipmi">
                                        <device name="ipmi_an02" action="reboot"/>
                                </method>
                                <method name="pdu">
                                        <device name="pdu2" port="2" action="reboot"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="an-node01.ipmi" login="root" passwd="secret" />
                <fencedevice name="ipmi_an02" agent="fence_ipmilan" ipaddr="an-node02.ipmi" login="root" passwd="secret" />
                <fencedevice name="pdu2" agent="fence_apc" ipaddr="pdu2.alteeve.com" login="apc" passwd="secret" />
        </fencedevices>
        <fence_daemon post_join_delay="30"/>
</cluster>

The new tag is fence_daemon, seen near the bottom of the file above. The change is made using the post_join_delay="30" attribute. By default, the cluster will declare the other node dead after just 6 seconds. The reason for the short default is that the larger this value is, the slower the start-up of the cluster services will be. During testing and development though, I find the default to be far too short; it frequently led to unnecessary fencing. Once your cluster is set up and working, it's not a bad idea to reduce this value to the lowest value that you are comfortable with.

Configuring Totem

This is almost a misnomer, as we're more or less not configuring the totem protocol in this cluster.

<?xml version="1.0"?>
<cluster name="an-clusterA" config_version="8">
        <cman expected_votes="1" two_node="1"/>
        <clusternodes>
                <clusternode name="an-node01.alteeve.com" nodeid="1">
                        <fence>
                                <method name="ipmi">
                                        <device name="ipmi_an01" action="reboot"/>
                                </method>
                                <method name="pdu">
                                        <device name="pdu2" port="1" action="reboot"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node02.alteeve.com" nodeid="2">
                        <fence>
                                <method name="ipmi">
                                        <device name="ipmi_an02" action="reboot"/>
                                </method>
                                <method name="pdu">
                                        <device name="pdu2" port="2" action="reboot"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="an-node01.ipmi" login="root" passwd="secret" />
                <fencedevice name="ipmi_an02" agent="fence_ipmilan" ipaddr="an-node02.ipmi" login="root" passwd="secret" />
                <fencedevice name="pdu2" agent="fence_apc" ipaddr="pdu2.alteeve.com" login="apc" passwd="secret" />
        </fencedevices>
        <fence_daemon post_join_delay="30"/>
        <totem rrp_mode="none" secauth="off"/>
</cluster>
Note: At this time, the redundant ring protocol is not supported (RHEL 6.1 and lower) and is in technology preview (RHEL 6.2 and above). For this reason, we will not be using it. However, we are using bonding, so we have still removed a single point of failure.

RRP is an optional second ring that can be used for cluster communication in the case of a breakdown in the first ring. If you wish to explore it further, please take a look at the clusternode element's child tag called <altname...>. When altname is used, the rrp_mode attribute will need to be changed to either active or passive (the details of which are outside the scope of this tutorial).

The second option we're looking at here is the secauth="off" attribute. This controls whether the cluster communications are encrypted or not. We can safely disable this because we're working on a known-private network, which yields two benefits; it's simpler to set up and it's a lot faster. If you must encrypt the cluster communications, then you can do so here. The details of that are, again, outside the scope of this tutorial.

Validating and Pushing the /etc/cluster/cluster.conf File

One of the most noticeable changes in RHCS cluster stable 3 is that we no longer have to make a long, cryptic xmllint call to validate our cluster configuration. Now we can simply call ccs_config_validate.

ccs_config_validate
Configuration validates

If there was a problem, you need to go back and fix it. DO NOT proceed until your configuration validates. Once it does, we're ready to move on!

With it validated, we need to push it to the other node. As the cluster is not running yet, we will push it out using rsync.

rsync -av /etc/cluster/cluster.conf root@an-node02:/etc/cluster/
sending incremental file list
cluster.conf

sent 1228 bytes  received 31 bytes  2518.00 bytes/sec
total size is 1148  speedup is 0.91

Setting Up ricci

Another change from RHCS stable 2 is how configuration changes are propagated. Before, after a change, we'd push out the updated cluster configuration by calling ccs_tool update /etc/cluster/cluster.conf. Now this is done with cman_tool version -r. More fundamentally though, the cluster needs to authenticate against each node and does this using the local ricci system user. The user has no password initially, so we need to set one.

On both nodes:

passwd ricci
Changing password for user ricci.
New password: 
Retype new password: 
passwd: all authentication tokens updated successfully.

You will need to enter this password once from each node against the other node. We will see this later.

Now make sure that the ricci daemon is set to start on boot and is running now.

chkconfig ricci on
/etc/init.d/ricci start
Starting ricci:                                            [  OK  ]
Note: If you don't see [ OK ], don't worry, it is probably because it was already running.

Starting the Cluster for the First Time

It's a good idea to open a second terminal on either node and tail the /var/log/messages syslog file. All cluster messages will be recorded here and it will help to debug problems if you can watch the logs. To do this, in the new terminal windows run;

clear; tail -f -n 0 /var/log/messages

This will clear the screen and start watching for new lines to be written to syslog. When you are done watching syslog, press the <ctrl> + c key combination.

How you lay out your terminal windows is, obviously, up to your own preferences. Below is a configuration I have found very useful.

Terminal window layout for watching 2 nodes. Left windows are used for entering commands and the right windows are used for tailing syslog.

With the terminals set up, let's start the cluster!

Warning: If you don't start cman on both nodes within 30 seconds, the slower node will be fenced.

On both nodes, run:

/etc/init.d/cman start
Starting cluster: 
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]
   Waiting for quorum...                                   [  OK  ]
   Starting fenced...                                      [  OK  ]
   Starting dlm_controld...                                [  OK  ]
   Starting gfs_controld...                                [  OK  ]
   Unfencing self...                                       [  OK  ]
   Joining fence domain...                                 [  OK  ]

Here is what you should see in syslog:

Sep 14 13:33:58 an-node01 kernel: DLM (built Jun 27 2011 19:51:46) installed
Sep 14 13:33:58 an-node01 corosync[18897]:   [MAIN  ] Corosync Cluster Engine ('1.2.3'): started and ready to provide service.
Sep 14 13:33:58 an-node01 corosync[18897]:   [MAIN  ] Corosync built-in features: nss rdma
Sep 14 13:33:58 an-node01 corosync[18897]:   [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
Sep 14 13:33:58 an-node01 corosync[18897]:   [MAIN  ] Successfully parsed cman config
Sep 14 13:33:58 an-node01 corosync[18897]:   [TOTEM ] Initializing transport (UDP/IP).
Sep 14 13:33:58 an-node01 corosync[18897]:   [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Sep 14 13:33:58 an-node01 corosync[18897]:   [TOTEM ] The network interface [10.20.0.1] is now up.
Sep 14 13:33:58 an-node01 corosync[18897]:   [QUORUM] Using quorum provider quorum_cman
Sep 14 13:33:58 an-node01 corosync[18897]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Sep 14 13:33:58 an-node01 corosync[18897]:   [CMAN  ] CMAN 3.0.12 (built Jul  4 2011 22:35:06) started
Sep 14 13:33:58 an-node01 corosync[18897]:   [SERV  ] Service engine loaded: corosync CMAN membership service 2.90
Sep 14 13:33:58 an-node01 corosync[18897]:   [SERV  ] Service engine loaded: openais checkpoint service B.01.01
Sep 14 13:33:58 an-node01 corosync[18897]:   [SERV  ] Service engine loaded: corosync extended virtual synchrony service
Sep 14 13:33:58 an-node01 corosync[18897]:   [SERV  ] Service engine loaded: corosync configuration service
Sep 14 13:33:58 an-node01 corosync[18897]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
Sep 14 13:33:58 an-node01 corosync[18897]:   [SERV  ] Service engine loaded: corosync cluster config database access v1.01
Sep 14 13:33:58 an-node01 corosync[18897]:   [SERV  ] Service engine loaded: corosync profile loading service
Sep 14 13:33:58 an-node01 corosync[18897]:   [QUORUM] Using quorum provider quorum_cman
Sep 14 13:33:58 an-node01 corosync[18897]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Sep 14 13:33:58 an-node01 corosync[18897]:   [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
Sep 14 13:33:58 an-node01 corosync[18897]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 14 13:33:58 an-node01 corosync[18897]:   [CMAN  ] quorum regained, resuming activity
Sep 14 13:33:58 an-node01 corosync[18897]:   [QUORUM] This node is within the primary component and will provide service.
Sep 14 13:33:58 an-node01 corosync[18897]:   [QUORUM] Members[1]: 1
Sep 14 13:33:58 an-node01 corosync[18897]:   [QUORUM] Members[1]: 1
Sep 14 13:33:58 an-node01 corosync[18897]:   [CPG   ] downlist received left_list: 0
Sep 14 13:33:58 an-node01 corosync[18897]:   [CPG   ] chosen downlist from node r(0) ip(10.20.0.1) 
Sep 14 13:33:58 an-node01 corosync[18897]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 14 13:34:02 an-node01 corosync[18897]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 14 13:34:02 an-node01 corosync[18897]:   [QUORUM] Members[2]: 1 2
Sep 14 13:34:02 an-node01 corosync[18897]:   [QUORUM] Members[2]: 1 2
Sep 14 13:34:02 an-node01 corosync[18897]:   [CPG   ] downlist received left_list: 0
Sep 14 13:34:02 an-node01 corosync[18897]:   [CPG   ] downlist received left_list: 0
Sep 14 13:34:02 an-node01 corosync[18897]:   [CPG   ] chosen downlist from node r(0) ip(10.20.0.1) 
Sep 14 13:34:02 an-node01 corosync[18897]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 14 13:34:02 an-node01 fenced[18954]: fenced 3.0.12 started
Sep 14 13:34:02 an-node01 dlm_controld[18978]: dlm_controld 3.0.12 started
Sep 14 13:34:02 an-node01 gfs_controld[19000]: gfs_controld 3.0.12 started

Now to confirm that the cluster is operating properly, run cman_tool status;

cman_tool status
Version: 6.2.0
Config Version: 8
Cluster Name: an-clusterA
Cluster Id: 29382
Cluster Member: Yes
Cluster Generation: 8
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Node votes: 1
Quorum: 1  
Active subsystems: 7
Flags: 2node 
Ports Bound: 0  
Node name: an-node01.alteeve.com
Node ID: 1
Multicast addresses: 239.192.114.57 
Node addresses: 10.20.0.1

We can see that both nodes are talking because of the Nodes: 2 entry.
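
For a per-node view, cman_tool nodes lists each member and its state. This is just an optional sanity check:

cman_tool nodes
# Both an-node01 and an-node02 should be listed with 'M' (member) in the 'Sts' column.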

If you ever want to see the nitty-gritty configuration, you can run corosync-objctl.

corosync-objctl
cluster.name=an-clusterA
cluster.config_version=8
cluster.cman.expected_votes=1
cluster.cman.two_node=1
cluster.cman.nodename=an-node01.alteeve.com
cluster.cman.cluster_id=29382
cluster.clusternodes.clusternode.name=an-node01.alteeve.com
cluster.clusternodes.clusternode.nodeid=1
cluster.clusternodes.clusternode.fence.method.name=ipmi
cluster.clusternodes.clusternode.fence.method.device.name=ipmi_an01
cluster.clusternodes.clusternode.fence.method.device.action=reboot
cluster.clusternodes.clusternode.fence.method.name=pdu
cluster.clusternodes.clusternode.fence.method.device.name=pdu2
cluster.clusternodes.clusternode.fence.method.device.port=1
cluster.clusternodes.clusternode.fence.method.device.action=reboot
cluster.clusternodes.clusternode.name=an-node02.alteeve.com
cluster.clusternodes.clusternode.nodeid=2
cluster.clusternodes.clusternode.fence.method.name=ipmi
cluster.clusternodes.clusternode.fence.method.device.name=ipmi_an02
cluster.clusternodes.clusternode.fence.method.device.action=reboot
cluster.clusternodes.clusternode.fence.method.name=pdu
cluster.clusternodes.clusternode.fence.method.device.name=pdu2
cluster.clusternodes.clusternode.fence.method.device.port=2
cluster.clusternodes.clusternode.fence.method.device.action=reboot
cluster.fencedevices.fencedevice.name=ipmi_an01
cluster.fencedevices.fencedevice.agent=fence_ipmilan
cluster.fencedevices.fencedevice.ipaddr=an-node01.ipmi
cluster.fencedevices.fencedevice.login=root
cluster.fencedevices.fencedevice.passwd=secret
cluster.fencedevices.fencedevice.name=ipmi_an02
cluster.fencedevices.fencedevice.agent=fence_ipmilan
cluster.fencedevices.fencedevice.ipaddr=an-node02.ipmi
cluster.fencedevices.fencedevice.login=root
cluster.fencedevices.fencedevice.passwd=secret
cluster.fencedevices.fencedevice.name=pdu2
cluster.fencedevices.fencedevice.agent=fence_apc
cluster.fencedevices.fencedevice.ipaddr=pdu2.alteeve.com
cluster.fencedevices.fencedevice.login=apc
cluster.fencedevices.fencedevice.passwd=secret
cluster.fence_daemon.post_join_delay=30
cluster.totem.rrp_mode=none
cluster.totem.secauth=off
totem.rrp_mode=none
totem.secauth=off
totem.version=2
totem.nodeid=1
totem.vsftype=none
totem.token=10000
totem.join=60
totem.fail_recv_const=2500
totem.consensus=2000
totem.key=an-clusterA
totem.interface.ringnumber=0
totem.interface.bindnetaddr=10.20.0.1
totem.interface.mcastaddr=239.192.114.57
totem.interface.mcastport=5405
libccs.next_handle=7
libccs.connection.ccs_handle=3
libccs.connection.config_version=8
libccs.connection.fullxpath=0
libccs.connection.ccs_handle=4
libccs.connection.config_version=8
libccs.connection.fullxpath=0
libccs.connection.ccs_handle=5
libccs.connection.config_version=8
libccs.connection.fullxpath=0
logging.timestamp=on
logging.to_logfile=yes
logging.logfile=/var/log/cluster/corosync.log
logging.logfile_priority=info
logging.to_syslog=yes
logging.syslog_facility=local4
logging.syslog_priority=info
aisexec.user=ais
aisexec.group=ais
service.name=corosync_quorum
service.ver=0
service.name=corosync_cman
service.ver=0
quorum.provider=quorum_cman
service.name=openais_ckpt
service.ver=0
runtime.services.quorum.service_id=12
runtime.services.cman.service_id=9
runtime.services.ckpt.service_id=3
runtime.services.ckpt.0.tx=0
runtime.services.ckpt.0.rx=0
runtime.services.ckpt.1.tx=0
runtime.services.ckpt.1.rx=0
runtime.services.ckpt.2.tx=0
runtime.services.ckpt.2.rx=0
runtime.services.ckpt.3.tx=0
runtime.services.ckpt.3.rx=0
runtime.services.ckpt.4.tx=0
runtime.services.ckpt.4.rx=0
runtime.services.ckpt.5.tx=0
runtime.services.ckpt.5.rx=0
runtime.services.ckpt.6.tx=0
runtime.services.ckpt.6.rx=0
runtime.services.ckpt.7.tx=0
runtime.services.ckpt.7.rx=0
runtime.services.ckpt.8.tx=0
runtime.services.ckpt.8.rx=0
runtime.services.ckpt.9.tx=0
runtime.services.ckpt.9.rx=0
runtime.services.ckpt.10.tx=0
runtime.services.ckpt.10.rx=0
runtime.services.ckpt.11.tx=2
runtime.services.ckpt.11.rx=3
runtime.services.ckpt.12.tx=0
runtime.services.ckpt.12.rx=0
runtime.services.ckpt.13.tx=0
runtime.services.ckpt.13.rx=0
runtime.services.evs.service_id=0
runtime.services.evs.0.tx=0
runtime.services.evs.0.rx=0
runtime.services.cfg.service_id=7
runtime.services.cfg.0.tx=0
runtime.services.cfg.0.rx=0
runtime.services.cfg.1.tx=0
runtime.services.cfg.1.rx=0
runtime.services.cfg.2.tx=0
runtime.services.cfg.2.rx=0
runtime.services.cfg.3.tx=0
runtime.services.cfg.3.rx=0
runtime.services.cpg.service_id=8
runtime.services.cpg.0.tx=4
runtime.services.cpg.0.rx=8
runtime.services.cpg.1.tx=0
runtime.services.cpg.1.rx=0
runtime.services.cpg.2.tx=0
runtime.services.cpg.2.rx=0
runtime.services.cpg.3.tx=16
runtime.services.cpg.3.rx=23
runtime.services.cpg.4.tx=0
runtime.services.cpg.4.rx=0
runtime.services.cpg.5.tx=2
runtime.services.cpg.5.rx=3
runtime.services.confdb.service_id=11
runtime.services.pload.service_id=13
runtime.services.pload.0.tx=0
runtime.services.pload.0.rx=0
runtime.services.pload.1.tx=0
runtime.services.pload.1.rx=0
runtime.services.quorum.service_id=12
runtime.connections.active=6
runtime.connections.closed=110
runtime.connections.fenced:18954:16.service_id=8
runtime.connections.fenced:18954:16.client_pid=18954
runtime.connections.fenced:18954:16.responses=5
runtime.connections.fenced:18954:16.dispatched=9
runtime.connections.fenced:18954:16.requests=5
runtime.connections.fenced:18954:16.sem_retry_count=0
runtime.connections.fenced:18954:16.send_retry_count=0
runtime.connections.fenced:18954:16.recv_retry_count=0
runtime.connections.fenced:18954:16.flow_control=0
runtime.connections.fenced:18954:16.flow_control_count=0
runtime.connections.fenced:18954:16.queue_size=0
runtime.connections.dlm_controld:18978:24.service_id=8
runtime.connections.dlm_controld:18978:24.client_pid=18978
runtime.connections.dlm_controld:18978:24.responses=5
runtime.connections.dlm_controld:18978:24.dispatched=8
runtime.connections.dlm_controld:18978:24.requests=5
runtime.connections.dlm_controld:18978:24.sem_retry_count=0
runtime.connections.dlm_controld:18978:24.send_retry_count=0
runtime.connections.dlm_controld:18978:24.recv_retry_count=0
runtime.connections.dlm_controld:18978:24.flow_control=0
runtime.connections.dlm_controld:18978:24.flow_control_count=0
runtime.connections.dlm_controld:18978:24.queue_size=0
runtime.connections.dlm_controld:18978:19.service_id=3
runtime.connections.dlm_controld:18978:19.client_pid=18978
runtime.connections.dlm_controld:18978:19.responses=0
runtime.connections.dlm_controld:18978:19.dispatched=0
runtime.connections.dlm_controld:18978:19.requests=0
runtime.connections.dlm_controld:18978:19.sem_retry_count=0
runtime.connections.dlm_controld:18978:19.send_retry_count=0
runtime.connections.dlm_controld:18978:19.recv_retry_count=0
runtime.connections.dlm_controld:18978:19.flow_control=0
runtime.connections.dlm_controld:18978:19.flow_control_count=0
runtime.connections.dlm_controld:18978:19.queue_size=0
runtime.connections.gfs_controld:19000:22.service_id=8
runtime.connections.gfs_controld:19000:22.client_pid=19000
runtime.connections.gfs_controld:19000:22.responses=5
runtime.connections.gfs_controld:19000:22.dispatched=8
runtime.connections.gfs_controld:19000:22.requests=5
runtime.connections.gfs_controld:19000:22.sem_retry_count=0
runtime.connections.gfs_controld:19000:22.send_retry_count=0
runtime.connections.gfs_controld:19000:22.recv_retry_count=0
runtime.connections.gfs_controld:19000:22.flow_control=0
runtime.connections.gfs_controld:19000:22.flow_control_count=0
runtime.connections.gfs_controld:19000:22.queue_size=0
runtime.connections.fenced:18954:25.service_id=8
runtime.connections.fenced:18954:25.client_pid=18954
runtime.connections.fenced:18954:25.responses=5
runtime.connections.fenced:18954:25.dispatched=8
runtime.connections.fenced:18954:25.requests=5
runtime.connections.fenced:18954:25.sem_retry_count=0
runtime.connections.fenced:18954:25.send_retry_count=0
runtime.connections.fenced:18954:25.recv_retry_count=0
runtime.connections.fenced:18954:25.flow_control=0
runtime.connections.fenced:18954:25.flow_control_count=0
runtime.connections.fenced:18954:25.queue_size=0
runtime.connections.corosync-objctl:19188:23.service_id=11
runtime.connections.corosync-objctl:19188:23.client_pid=19188
runtime.connections.corosync-objctl:19188:23.responses=435
runtime.connections.corosync-objctl:19188:23.dispatched=0
runtime.connections.corosync-objctl:19188:23.requests=438
runtime.connections.corosync-objctl:19188:23.sem_retry_count=0
runtime.connections.corosync-objctl:19188:23.send_retry_count=0
runtime.connections.corosync-objctl:19188:23.recv_retry_count=0
runtime.connections.corosync-objctl:19188:23.flow_control=0
runtime.connections.corosync-objctl:19188:23.flow_control_count=0
runtime.connections.corosync-objctl:19188:23.queue_size=0
runtime.totem.pg.mrp.srp.orf_token_tx=2
runtime.totem.pg.mrp.srp.orf_token_rx=744
runtime.totem.pg.mrp.srp.memb_merge_detect_tx=365
runtime.totem.pg.mrp.srp.memb_merge_detect_rx=365
runtime.totem.pg.mrp.srp.memb_join_tx=3
runtime.totem.pg.mrp.srp.memb_join_rx=5
runtime.totem.pg.mrp.srp.mcast_tx=46
runtime.totem.pg.mrp.srp.mcast_retx=0
runtime.totem.pg.mrp.srp.mcast_rx=57
runtime.totem.pg.mrp.srp.memb_commit_token_tx=4
runtime.totem.pg.mrp.srp.memb_commit_token_rx=4
runtime.totem.pg.mrp.srp.token_hold_cancel_tx=4
runtime.totem.pg.mrp.srp.token_hold_cancel_rx=7
runtime.totem.pg.mrp.srp.operational_entered=2
runtime.totem.pg.mrp.srp.operational_token_lost=0
runtime.totem.pg.mrp.srp.gather_entered=2
runtime.totem.pg.mrp.srp.gather_token_lost=0
runtime.totem.pg.mrp.srp.commit_entered=2
runtime.totem.pg.mrp.srp.commit_token_lost=0
runtime.totem.pg.mrp.srp.recovery_entered=2
runtime.totem.pg.mrp.srp.recovery_token_lost=0
runtime.totem.pg.mrp.srp.consensus_timeouts=0
runtime.totem.pg.mrp.srp.mtt_rx_token=1903
runtime.totem.pg.mrp.srp.avg_token_workload=0
runtime.totem.pg.mrp.srp.avg_backlog_calc=0
runtime.totem.pg.mrp.srp.rx_msg_dropped=0
runtime.blackbox.dump_flight_data=no
runtime.blackbox.dump_state=no
cman_private.COROSYNC_DEFAULT_CONFIG_IFACE=xmlconfig:cmanpreconfig

Testing Fencing

We need to thoroughly test our fence configuration and devices before we proceed. Should the cluster call a fence, and if the fence call fails, the cluster will hang until the fence finally succeeds. There is no way to abort a fence, so this could effectively hang the cluster. If we have problems, we need to find them now.

We need to run two tests from each node against the other node for a total of four tests.

  • The first test will use fence_ipmilan. To do this, we will hang the victim node by running echo c > /proc/sysrq-trigger on it. This will immediately and completely hang the kernel. The other node should detect the failure and reboot the victim. You can confirm that IPMI was used by watching the fence PDU and not seeing it power-cycle the port.
  • Secondly, we will pull the power on the victim node. This is done to ensure that the IPMI BMC is also dead, simulating a failure of the power supply. You should see the other node try to fence the victim, fail initially, then try again using the second, switched PDU. If you watch the PDU, you should see the power indicator LED go off and then come back on.
Note: To "pull the power", we can actually just log into the PDU and turn off the victim's power. In this case, we'll see the power restored when the PDU is used to fence the node. We can actually use the fence_apc fence agent to pull the power, as we'll see.
Test                                                         Victim      Pass?
echo c > /proc/sysrq-trigger                                 an-node01
fence_apc -a pdu2.alteeve.com -l apc -p secret -n 1 -o off   an-node01
echo c > /proc/sysrq-trigger                                 an-node02
fence_apc -a pdu2.alteeve.com -l apc -p secret -n 2 -o off   an-node02

After the lost node is recovered, remember to restart cman before starting the next test.
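
Before running the destructive tests, it can be reassuring to ask each fence device for the victim's status. The sketch below is non-destructive and reuses this tutorial's example addresses and credentials; a successful status reply tells you the agent can reach and log into the device:

# From an-node02, query an-node01's fence devices (adjust names and passwords to your setup).
fence_ipmilan -a an-node01.ipmi -l root -p secret -o status
fence_apc -a pdu2.alteeve.com -l apc -p secret -n 1 -o status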

Hanging an-node01

Be sure to be tailing the /var/log/messages on an-node02. Go to an-node01's first terminal and run the following command.

Warning: This command will not return and you will lose all ability to talk to this node until it is rebooted.

On an-node01 run:

echo c > /proc/sysrq-trigger

On an-node02's syslog terminal, you should see the following entries in the log.

Sep 15 16:08:17 an-node02 corosync[12347]:   [TOTEM ] A processor failed, forming new configuration.
Sep 15 16:08:19 an-node02 corosync[12347]:   [QUORUM] Members[1]: 2
Sep 15 16:08:19 an-node02 corosync[12347]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 15 16:08:19 an-node02 corosync[12347]:   [CPG   ] downlist received left_list: 1
Sep 15 16:08:19 an-node02 corosync[12347]:   [CPG   ] chosen downlist from node r(0) ip(10.20.0.2) 
Sep 15 16:08:19 an-node02 corosync[12347]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 15 16:08:19 an-node02 kernel: dlm: closing connection to node 1
Sep 15 16:08:19 an-node02 fenced[12403]: fencing node an-node01.alteeve.com
Sep 15 16:08:33 an-node02 fenced[12403]: fence an-node01.alteeve.com success

Perfect!

If you are watching an-node01's display, you should now see it starting back up. Once it has finished booting, log into it and restart cman before moving on to the next test.

Cutting the Power to an-node01

As was discussed earlier, IPMI has a fatal flaw as a fence device. IPMI's BMC draws its power from the same power supply as the node itself. Thus, when the power supply itself fails (or the mains connection is pulled or tripped over), fencing via IPMI will fail. This makes the power supply a single point of failure, which is what the PDU protects us against.

So to simulate a failed power supply, we're going to use an-node02's fence_apc fence agent to turn off the power to an-node01.

Alternatively, you could also just unplug the power and the fence would still succeed. The fence call only needs to confirm that the node is off to succeed. Whether the node restarts after or not is not important so far as the cluster is concerned.

From an-node02, pull the power on an-node01 with the following call;

fence_apc -a pdu2.alteeve.com -l apc -p secret -n 1 -o off
Success: Powered OFF

Back on an-node02's syslog, we should see the following entries;

Sep 15 16:18:06 an-node02 corosync[12347]:   [TOTEM ] A processor failed, forming new configuration.
Sep 15 16:18:08 an-node02 corosync[12347]:   [QUORUM] Members[1]: 2
Sep 15 16:18:08 an-node02 corosync[12347]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 15 16:18:08 an-node02 kernel: dlm: closing connection to node 1
Sep 15 16:18:08 an-node02 corosync[12347]:   [CPG   ] downlist received left_list: 1
Sep 15 16:18:08 an-node02 corosync[12347]:   [CPG   ] chosen downlist from node r(0) ip(10.20.0.2) 
Sep 15 16:18:08 an-node02 corosync[12347]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 15 16:18:08 an-node02 fenced[12403]: fencing node an-node01.alteeve.com
Sep 15 16:18:31 an-node02 fenced[12403]: fence an-node01.alteeve.com dev 0.0 agent fence_ipmilan result: error from agent
Sep 15 16:18:31 an-node02 fenced[12403]: fence an-node01.alteeve.com success

Hoozah!

Notice that there is an error from the fence_ipmilan. This is exactly what we expected because of the IPMI BMC losing power.
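
A nice side effect of using the PDU to cut power is that the same agent can restore it. Before rebooting an-node01 for the next test, turn its outlet (port 1 in this tutorial's example) back on:

fence_apc -a pdu2.alteeve.com -l apc -p secret -n 1 -o on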

So now we know that an-node01 can be fenced successfully from both fence devices. Now we need to run the same tests against an-node02.

Hanging an-node02

Warning: DO NOT ASSUME THAT an-node02 WILL FENCE PROPERLY JUST BECAUSE an-node01 PASSED! There are many ways that a fence could fail: a bad password, a misconfigured device, a cable plugged into the wrong port on the PDU and so on. Always test all nodes using all methods!

Be sure to be tailing the /var/log/messages on an-node01. Go to an-node02's first terminal and run the following command.

Note: This command will not return and you will lose all ability to talk to this node until it is rebooted.

On an-node02 run:

echo c > /proc/sysrq-trigger

On an-node01's syslog terminal, you should see the following entries in the log.

Sep 15 15:26:19 an-node01 corosync[2223]:   [TOTEM ] A processor failed, forming new configuration.
Sep 15 15:26:21 an-node01 corosync[2223]:   [QUORUM] Members[1]: 1
Sep 15 15:26:21 an-node01 corosync[2223]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 15 15:26:21 an-node01 corosync[2223]:   [CPG   ] downlist received left_list: 1
Sep 15 15:26:21 an-node01 corosync[2223]:   [CPG   ] chosen downlist from node r(0) ip(10.20.0.1) 
Sep 15 15:26:21 an-node01 corosync[2223]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 15 15:26:21 an-node01 fenced[2280]: fencing node an-node02.alteeve.com
Sep 15 15:26:21 an-node01 kernel: dlm: closing connection to node 2
Sep 15 15:26:36 an-node01 fenced[2280]: fence an-node02.alteeve.com success

Again, perfect!

Cutting the Power to an-node02

From an-node01, pull the power on an-node02 with the following call;

fence_apc -a pdu2.alteeve.com -l apc -p secret -n 2 -o off
Success: Powered OFF

Back on an-node01's syslog, we should see the following entries;

Sep 15 15:36:30 an-node01 corosync[2223]:   [TOTEM ] A processor failed, forming new configuration.
Sep 15 15:36:32 an-node01 corosync[2223]:   [QUORUM] Members[1]: 1
Sep 15 15:36:32 an-node01 corosync[2223]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 15 15:36:32 an-node01 kernel: dlm: closing connection to node 2
Sep 15 15:36:32 an-node01 corosync[2223]:   [CPG   ] downlist received left_list: 1
Sep 15 15:36:32 an-node01 corosync[2223]:   [CPG   ] chosen downlist from node r(0) ip(10.20.0.1) 
Sep 15 15:36:32 an-node01 corosync[2223]:   [MAIN  ] Completed service synchronization, ready to provide service.
Sep 15 15:36:32 an-node01 fenced[2280]: fencing node an-node02.alteeve.com
Sep 15 15:36:55 an-node01 fenced[2280]: fence an-node02.alteeve.com dev 0.0 agent fence_ipmilan result: error from agent
Sep 15 15:36:55 an-node01 fenced[2280]: fence an-node02.alteeve.com success

Woot!

We can now safely say that our fencing is setup and working properly.

Testing Network Redundancy

Next up on the testing block is our network configuration. Seeing as we've built our bonds, we now need to test that they are working properly.

To run this test, we're going to do the following;

  • Make sure that cman has started on both nodes.
  • On both nodes, start a ping flood against the opposing node in the first window and start tailing syslog in the second window (an example of the ping flood and bond watch is sketched after this list).
  • Look at the current state of the bonds to see which interfaces are active.
  • Pull the power on the switch those interfaces are using. If the interfaces are spread across both switches, don't worry. Pick one to start with; we will kill it again later.
  • Check the state of the bonds again and see that they've switched to their backup links. If a node gets fenced, you know something went wrong.
  • Wait about a minute, then restore power to the lost switch. Wait a good five minutes to ensure that it is in fact back up and that the network was not interrupted.
  • Repeat the power off/on cycle for the second switch.
  • If the initial state had the bonds spread across both switches, repeat the power off/on for the first switch.

If all of these steps pass and the cluster doesn't partition, then you can be confident that your network is configured properly for full redundancy.
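
As a rough sketch of the monitoring side of this test, the ping flood and bond watch on an-node01 might look like the following (10.20.0.2 is an-node02's BCN IP in this tutorial; adjust to your network):

# Terminal 1: flood-ping the peer. Any pause or packet loss will be immediately obvious.
ping -f 10.20.0.2

# Terminal 2: watch the active slave of a bond (repeat for bond1 and bond2).
watch -n 1 cat /proc/net/bonding/bond0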

How to Know if the Tests Passed

Well, the most obvious answer to this question is if the cluster is still working after a switch is powered off.

We can be a little more subtle than that though.

The state of each bond is viewable by looking in the special /proc/net/bonding/bondX files, where X is the bond number. Let's take a look at bond0 on an-node01.

cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:e0:81:c7:ec:49

Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1b:21:9d:59:fc

We can see that the currently active interface is eth0. This is the key bit we're going to be watching for these tests. I know that eth0 on an-node01 is connected to the first switch, so when I pull the power to that switch, I should see eth3 take over.

We'll also be watching syslog. If things work right, we should not see any messages from the cluster during failure and recovery.

If you have the screen space for it, I'd recommend opening six more terminal windows, one for each bond. Run watch cat /proc/net/bonding/bondX so that you can quickly see any change in the bond states. Below's an example of the layout I use for this test.

Terminal layout used for monitoring the bonded link status during HA network testing. The right window shows two columns of terminals, an-node01 on the left and an-node02 on the right, stacked into three rows, bond0 on the top, bond1 in the middle and bond2 at the bottom. The left window shows the standard tail on syslog.

Failing The First Switch

In my case, all of the bonds on both nodes are using their first links as the currently active links. This means that all network traffic is going through the first switch, so I will power down that switch first. You need to sort out which switch you should shut down first. If you've got the network traffic going over both switches, then just pick one to start with.

Note: Make sure that cman is running before beginning the test!

After killing the switch, I can see in syslog the following messages:

Sep 16 13:12:46 an-node01 kernel: e1000e: eth2 NIC Link is Down
Sep 16 13:12:46 an-node01 kernel: e1000e: eth0 NIC Link is Down
Sep 16 13:12:46 an-node01 kernel: e1000e: eth1 NIC Link is Down
Sep 16 13:12:46 an-node01 kernel: bonding: bond1: link status definitely down for interface eth1, disabling it
Sep 16 13:12:46 an-node01 kernel: bonding: bond1: making interface eth4 the new active one.
Sep 16 13:12:46 an-node01 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Sep 16 13:12:46 an-node01 kernel: bonding: bond0: making interface eth3 the new active one.
Sep 16 13:12:46 an-node01 kernel: device eth0 left promiscuous mode
Sep 16 13:12:46 an-node01 kernel: device eth3 entered promiscuous mode
Sep 16 13:12:46 an-node01 kernel: bonding: bond2: link status definitely down for interface eth2, disabling it
Sep 16 13:12:46 an-node01 kernel: bonding: bond2: making interface eth5 the new active one.
Sep 16 13:12:46 an-node01 kernel: device eth2 left promiscuous mode
Sep 16 13:12:46 an-node01 kernel: device eth5 entered promiscuous mode

I can look at an-node01's /proc/net/bonding/bond0 file and see:

cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth3
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: down
Link Failure Count: 1
Permanent HW addr: 00:e0:81:c7:ec:49

Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1b:21:9d:59:fc

Notice Currently Active Slave is now eth3? You can also see now that eth0's link is down (MII Status: down).

It's the same story for all the other bonds on both nodes.
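
Rather than reading each bond file in full, a quick way to see every bond's active interface at once is a simple grep (a convenience only):

grep "Currently Active Slave" /proc/net/bonding/bond*
# Each line shows the bond file name and the interface currently carrying its traffic.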

If we check the status of the cluster, we'll see that all is good.

cman_tool status
Version: 6.2.0
Config Version: 8
Cluster Name: an-clusterA
Cluster Id: 29382
Cluster Member: Yes
Cluster Generation: 72
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Node votes: 1
Quorum: 1  
Active subsystems: 7
Flags: 2node 
Ports Bound: 0  
Node name: an-node01.alteeve.com
Node ID: 1
Multicast addresses: 239.192.114.57 
Node addresses: 10.20.0.1

How cool is that?!

Restoring The First Switch

Now that we've confirmed all of the bonds are working on the backup switch, let's restore power to the first switch.

Warning: Be sure to wait a solid five minutes after restoring power before declaring the recovery a success!

It is very important to wait for a while after restoring power to the switch. Some of the common problems that can break your cluster will not show up immediately. A good example is a misconfiguration of STP. In this case, the switch will come up, a short time will pass and then the switch will trigger an STP reconfiguration. Once this happens, both switches will block traffic for many seconds. This will partition your cluster.

So then, let's power it back up.

Within a few moments, you should see this in your syslog;

Sep 16 13:23:57 an-node01 kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:23:57 an-node01 kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:23:57 an-node01 kernel: bonding: bond0: link status definitely up for interface eth0.
Sep 16 13:23:57 an-node01 kernel: bonding: bond1: link status definitely up for interface eth1.
Sep 16 13:23:58 an-node01 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:23:58 an-node01 kernel: bonding: bond2: link status definitely up for interface eth2.

It looks up, but let's keep waiting for another minute (note the time stamps).

Sep 16 13:24:52 an-node01 kernel: e1000e: eth0 NIC Link is Down
Sep 16 13:24:52 an-node01 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Sep 16 13:24:53 an-node01 kernel: e1000e: eth1 NIC Link is Down
Sep 16 13:24:53 an-node01 kernel: bonding: bond1: link status definitely down for interface eth1, disabling it
Sep 16 13:24:54 an-node01 kernel: e1000e: eth2 NIC Link is Down
Sep 16 13:24:54 an-node01 kernel: bonding: bond2: link status definitely down for interface eth2, disabling it
Sep 16 13:24:55 an-node01 kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:24:55 an-node01 kernel: bonding: bond0: link status definitely up for interface eth0.
Sep 16 13:24:56 an-node01 kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:24:56 an-node01 kernel: bonding: bond1: link status definitely up for interface eth1.
Sep 16 13:24:57 an-node01 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:24:57 an-node01 kernel: bonding: bond2: link status definitely up for interface eth2.
Sep 16 13:24:58 an-node01 kernel: e1000e: eth0 NIC Link is Down
Sep 16 13:24:58 an-node01 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Sep 16 13:24:59 an-node01 kernel: e1000e: eth1 NIC Link is Down
Sep 16 13:24:59 an-node01 kernel: bonding: bond1: link status definitely down for interface eth1, disabling it
Sep 16 13:25:00 an-node01 kernel: e1000e: eth2 NIC Link is Down
Sep 16 13:25:00 an-node01 kernel: bonding: bond2: link status definitely down for interface eth2, disabling it
Sep 16 13:25:00 an-node01 kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:25:00 an-node01 kernel: bonding: bond0: link status definitely up for interface eth0.
Sep 16 13:25:02 an-node01 kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:25:02 an-node01 kernel: bonding: bond1: link status definitely up for interface eth1.
Sep 16 13:25:02 an-node01 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:25:02 an-node01 kernel: bonding: bond2: link status definitely up for interface eth2.

See all that bouncing? That is caused by many switches reporting a link (that is, the MII status) before they are actually able to push traffic. As part of the switch's boot sequence, the links will go down and come back up a couple of times.

This is partly why the updelay option exists in the BONDING_OPTS, but it is also why we don't set a primary interface or enable primary_reselect. Every time the active slave interface changes, there is a small chance of a problem that could break the cluster.

You will notice after several minutes that the backup slave interfaces are still in use, despite the first switch being back online. This is just fine. We can check the cluster status again and we'll see that everything is still fine. The recovery test passed!

Failing The Second Switch

For the same reason that we need to test all fence devices from both nodes, we also need to test failure and recovery of both switches. So now let's pull the plug on the second switch!

As before, we'll see messages showing the interfaces dropping.

Sep 16 13:35:36 an-node01 kernel: e1000e: eth3 NIC Link is Down
Sep 16 13:35:36 an-node01 kernel: bonding: bond0: link status definitely down for interface eth3, disabling it
Sep 16 13:35:36 an-node01 kernel: bonding: bond0: making interface eth0 the new active one.
Sep 16 13:35:36 an-node01 kernel: device eth3 left promiscuous mode
Sep 16 13:35:36 an-node01 kernel: device eth0 entered promiscuous mode
Sep 16 13:35:38 an-node01 kernel: e1000e: eth5 NIC Link is Down
Sep 16 13:35:38 an-node01 kernel: bonding: bond2: link status definitely down for interface eth5, disabling it
Sep 16 13:35:38 an-node01 kernel: bonding: bond2: making interface eth2 the new active one.
Sep 16 13:35:38 an-node01 kernel: device eth5 left promiscuous mode
Sep 16 13:35:38 an-node01 kernel: device eth2 entered promiscuous mode
Sep 16 13:35:39 an-node01 kernel: e1000e: eth4 NIC Link is Down
Sep 16 13:35:39 an-node01 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Sep 16 13:35:39 an-node01 kernel: bonding: bond1: making interface eth1 the new active one.

Let's take a look at an-node01's bond0 again.

cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 3
Permanent HW addr: 00:e0:81:c7:ec:49

Slave Interface: eth3
MII Status: down
Link Failure Count: 1
Permanent HW addr: 00:1b:21:9d:59:fc

We can see that eth0 has returned to being the active slave and that eth3 is now down. Again, it's the same story across the other bonds, and cman_tool status shows that all is right in the world.

Restoring The Second Switch

Again, we're going to wait a good five minutes after restoring power before calling this test a success.

Checking out an-node01's syslog, we see that the links came back and didn't bounce. These are two identical switches, which you might expect to behave the same way, but they didn't. This is a good example of why you need to test everything, even when you have identical hardware. You just can't guess how things will behave until you test and see for yourself.

Sep 16 13:53:54 an-node01 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:53:54 an-node01 kernel: e1000e: eth5 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:53:54 an-node01 kernel: bonding: bond0: link status definitely up for interface eth3.
Sep 16 13:53:54 an-node01 kernel: bonding: bond2: link status definitely up for interface eth5.
Sep 16 13:53:55 an-node01 kernel: e1000e: eth4 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 16 13:53:55 an-node01 kernel: bonding: bond1: link status definitely up for interface eth4.

Now we're done! We can truly say we've got a full high-availability network configuration that is tested and trusted!

Installing DRBD

DRBD is an open-source application for real-time, block-level disk replication created and maintained by Linbit. We will use this to keep the data on our cluster consistent between the two nodes.

To install it, we have two choices;

  • Install from source files.
  • Install from ELRepo.

Installing from source ensures that you have full control over the installed software. However, you become solely responsible for installing future patches and bugfixes.

Installing from ELRepo means ceding some control to the ELRepo maintainers, but it also means that future patches and bugfixes are applied as part of a standard update.

Which you choose is, ultimately, a decision you need to make.

Option A - Install From Source

On Both nodes run:

# Obliterate peer - fence via cman
wget -c https://alteeve.com/files/an-cluster/sbin/obliterate-peer.sh -O /sbin/obliterate-peer.sh
chmod a+x /sbin/obliterate-peer.sh
ls -lah /sbin/obliterate-peer.sh

# Download, compile and install DRBD
wget -c http://oss.linbit.com/drbd/8.3/drbd-8.3.11.tar.gz
tar -xvzf drbd-8.3.11.tar.gz
cd drbd-8.3.11
./configure \
   --prefix=/usr \
   --localstatedir=/var \
   --sysconfdir=/etc \
   --with-utils \
   --with-km \
   --with-udev \
   --with-pacemaker \
   --with-rgmanager \
   --with-bashcompletion
make
make install
chkconfig --add drbd
chkconfig drbd off

Option B - Install From ELRepo

On Both nodes run:

# Obliterate peer - fence via cman
wget -c https://alteeve.com/files/an-cluster/sbin/obliterate-peer.sh -O /sbin/obliterate-peer.sh
chmod a+x /sbin/obliterate-peer.sh
ls -lah /sbin/obliterate-peer.sh

# Install the ELRepo GPG key, add the repo and install DRBD.
rpm --import http://elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://elrepo.org/elrepo-release-6-4.el6.elrepo.noarch.rpm
yum install drbd83-utils kmod-drbd83
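
Whichever option you chose, it is worth confirming that the DRBD kernel module actually loads before going further. A minimal check (the exact version string will depend on the packages you installed):

modprobe drbd
cat /proc/drbd
# The first line reports the DRBD version; no resources will be listed yet.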

Creating The DRBD Partitions

It is possible to use LVM on the hosts, and simply create LVs to back our DRBD resources. However, this causes confusion as LVM will see the PV signatures on both the DRBD backing devices and the DRBD device itself. Getting around this requires editing LVM's filter option, which is somewhat complicated. Not overly so, mind you, but enough to be outside the scope of this document.

Also, working with fdisk directly gives us a chance to make sure that the DRBD partitions start on an even 64 KiB boundary. This is important for decent performance on Windows VMs, as we will see later. It is true for both traditional platter and modern solid-state drives.

On our nodes, we created three primary disk partitions;

  • /dev/sda1; The /boot partition.
  • /dev/sda2; The root / partition.
  • /dev/sda3; The swap partition.

We will create a new extended partition. Then within it we will create three new partitions;

  • /dev/sda5; a small partition we will later use for our shared GFS2 partition.
  • /dev/sda6; a partition big enough to host the VMs that will normally run on an-node01.
  • /dev/sda7; a partition big enough to host the VMs that will normally run on an-node02.

As we create each partition, we will do a little math to ensure that the start sector is on a 64 KiB boundary.

Alignment Math

Before we can start the alignment math, we need to know how big each sector is on our hard drive. This is almost always 512 bytes, but it's still best to be sure. To check, run;

fdisk -l /dev/sda | grep Sector
Sector size (logical/physical): 512 bytes / 512 bytes

So now that we have confirmed our sector size, we can look at the math.

  • Each 64 KiB block will use 128 sectors ((64 * 1024) / 512) == 128.
  • As we create each partition, we will be asked to enter the starting sector (using fdisk -u). Take the first free sector and divide it by 128. If it does not divide evenly, then;
    • Add 127 (one sector shy of another block to guarantee we've gone past the start sector we want).
    • Divide the new number by 128. This will give you a fractional number. Remove (do not round!) any number after the decimal place.
    • Multiply by 128 to get the sector number we want.

Let's look at an example using real numbers. Let's say we create a new partition and the first free sector is 92807568;

92807568 ÷ 128 = 725059.125

We have a remainder, so it's not on an even 64 KiB block boundary. Now we need to figure out what sector above 92807568 is evenly divisible by 128. To do that, let's add 127 (one sector shy of the next 64 KiB block), divide by 128 to get the number of 64 KiB blocks (with a remainder), drop the remainder to get a whole number (do not round, you just want the bare integer), then finally multiply by 128 to get the sector number. This will give us the sector number we want our partition to start on.

92807568 + 127 = 92807695
92807695 ÷ 128 = 725060.1171875
int(725060.1171875) = 725060
725060 x 128 = 92807680

So now we know that sector number 92807680 is the first sector above 92807568 that falls on an even 64 KiB block. Now we need to alter our partition's starting sector. To do this, we will need to go into fdisk's extra functions.
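
If you would rather not do this arithmetic by hand, the same calculation can be done with shell integer math (bash's integer division drops the fraction for us):

# Round a starting sector up to the next 64 KiB (128-sector) boundary.
SECTOR=92807568
echo $(( ((SECTOR + 127) / 128) * 128 ))
92807680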

Note: Pay attention to the last sector number of each partition you create. As you create partitions, fdisk will see free space, as tiny as it is, and it will default to that as the first sector for the next partition. This is annoying. By noting the last sector of each partition you create, you can add 1 sector and do the math to find the first sector above that which sits on a 64 KiB boundary.

Creating the Three Partitions

Here I will show you the values I entered to create the three partitions I needed on my nodes.

DO NOT COPY THIS!

The values you enter will almost certainly be different.

Start fdisk in sector mode on /dev/sda.

Note: If you are using software RAID, you will need to do the following steps on all disks, then you can proceed to create the RAID partitions normally and they will be aligned.
fdisk -u /dev/sda
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
         switch off the mode (command 'c').

Disable DOS compatibility because hey, it's not the 80s any more.

Command (m for help): c
DOS Compatibility flag is not set

Let's take a look at the current partition layout.

Command (m for help): p
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00056856

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048      526335      262144   83  Linux
/dev/sda2          526336    84412415    41943040   83  Linux
/dev/sda3        84412416    92801023     4194304   82  Linux swap / Solaris

Perfect. Now let's create a new extended partition that will use the rest of the disk. We don't care if this is aligned so we'll just accept the default start and end sectors.

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
e
Selected partition 4
First sector (92801024-976773167, default 92801024):

Just press <enter>.

Using default value 92801024
Last sector, +sectors or +size{K,M,G} (92801024-976773167, default 976773167):

Just press <enter> again.

Using default value 976773167

Now we'll create the first partition. This will be a 20 GiB partition used for the shared GFS2 file system. As it will never host a VM, I don't care whether it is aligned.

Command (m for help): n
First sector (92803072-976773167, default 92803072):

Just press <enter>.

Using default value 92803072
Last sector, +sectors or +size{K,M,G} (92803072-976773167, default 976773167): +20G

Now we will create the last two partitions that will host our VMs. I want to split the remaining space in half, so I need to do a little bit more math before I can proceed. I will need to see how many sectors are still free, divide by two to get the number of sectors in half the remaining free space, then add the number of already-used sectors so that I know where the first partition should end. We'll do this math in just a moment.

So let's print the current partition layout:

Command (m for help): p
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00056856

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048      526335      262144   83  Linux
/dev/sda2          526336    84412415    41943040   83  Linux
/dev/sda3        84412416    92801023     4194304   82  Linux swap / Solaris
/dev/sda4        92801024   976773167   441986072    5  Extended
/dev/sda5        92803072   134746111    20971520   83  Linux

Start to create the new partition. Before we can sort out the last sector, we first need to find the first sector.

Command (m for help): n
First sector (134748160-976773167, default 134748160):

Now I see that the first free sector is 134748160. Dividing this by 128 gives 1052720 with no remainder, so it is already on a 64 KiB boundary and I don't need to do anything more. I can just press <enter> to accept it.

Using default value 134748160
Last sector, +sectors or +size{K,M,G} (134748160-976773167, default 976773167):

Now we need to do the math to find what sector marks half of the remaining free space. Let's gather some numbers;

  • This partition started at sector 134748160
  • The default end sector is 976773167
  • That means that there are currently (976773167 - 134748160) == 842025007 sectors free.
    • Half of that is (842025007 / 2) == int(421012503.5) == 421012503 sectors (int() simply means to drop the fractional part).
  • So if we want a partition that is 421012503 long, we need to add the start sector to get our offset. That is, (421012503 + 134748160) == 555760663. This is what we will enter now.
Last sector, +sectors or +size{K,M,G} (134748160-976773167, default 976773167): 555760663

Now to create the last partition, we will repeat the steps above.

Command (m for help): n
First sector (555762712-976773167, default 555762712):

Let's make sure that 555762712 is on a 64 KiB boundary;

  • (555762712 / 128) == 4341896.1875 is not an even number, so we need to find the next sector on an even boundary.
  • Add 127 sectors and divide by 128 again;
    • (555762712 + 127) == 555762839
    • (555762839 / 128) == int(4341897.1796875) == 4341897
    • (4341897 * 128) == 555762816
  • Now we know that we want our start sector to be 555762816.
First sector (555762712-976773167, default 555762712): 555762816
Last sector, +sectors or +size{K,M,G} (555762816-976773167, default 976773167):

This is the last partition, so we can just press <enter> to get the last sector on the disk.

Using default value 976773167

Let's take a final look at the new partition table before committing the changes to disk.

Command (m for help): p
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00056856

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048      526335      262144   83  Linux
/dev/sda2          526336    84412415    41943040   83  Linux
/dev/sda3        84412416    92801023     4194304   82  Linux swap / Solaris
/dev/sda4        92801024   976773167   441986072    5  Extended
/dev/sda5        92803072   134746111    20971520   83  Linux
/dev/sda6       134748160   555760663   210506252   83  Linux
/dev/sda7       555762816   976773167   210505176   83  Linux

Perfect. If you divide partition six or seven's start sector by 128, you will see that both have no remainder, which means that they are, in fact, aligned. This is the last time we need to worry about alignment, because LVM uses an even multiple of 64 KiB for its extent sizes, so all normal extent sizes will always produce LVs on even 64 KiB boundaries.
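
If you want to double-check this without a calculator, the modulo operator makes the test trivial; both partitions should report a remainder of 0:

for start in 134748160 555762816; do echo "sector $start: remainder $(( start % 128 ))"; done
sector 134748160: remainder 0
sector 555762816: remainder 0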

So now write out the changes, re-probe the disk (or reboot) and then repeat all these steps on the other node.

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.

WARNING: Re-reading the partition table failed with error 16: Device or resource busy.
The kernel still uses the old table. The new table will be used at
the next reboot or after you run partprobe(8) or kpartx(8)
Syncing disks.

Now reprobe the disk using partprobe.

partprobe /dev/sda
Warning: WARNING: the kernel failed to re-read the partition table on /dev/sda 
(Device or resource busy).  As a result, it may not reflect all of your changes 
until after reboot.

In my case, the probe failed so I will reboot. To do this most safely, stop the cluster before calling reboot.

/etc/init.d/cman stop
Stopping cluster: 
   Leaving fence domain...                                 [  OK  ]
   Stopping gfs_controld...                                [  OK  ]
   Stopping dlm_controld...                                [  OK  ]
   Stopping fenced...                                      [  OK  ]
   Stopping cman...                                        [  OK  ]
   Waiting for corosync to shutdown:                       [  OK  ]
   Unloading kernel modules...                             [  OK  ]
   Unmounting configfs...                                  [  OK  ]

Now reboot.

reboot

Configuring DRBD

DRBD is configured in two parts;

  • Global and common configuration options
  • Resource configurations

We will be creating three separate DRBD resources, so we will create three separate resource configuration files. More on that in a moment.

Configuring DRBD Global and Common Options

The first file to edit is /etc/drbd.d/global_common.conf. In this file, we will set global configuration options and set default resource configuration options. These default resource options can be overwritten in the actual resource files which we'll create once we're done here.

cp /etc/drbd.d/global_common.conf /etc/drbd.d/global_common.conf.orig
vim /etc/drbd.d/global_common.conf 
diff -u  /etc/drbd.d/global_common.conf.orig /etc/drbd.d/global_common.conf
--- /etc/drbd.d/global_common.conf.orig	2011-09-14 14:03:56.364566109 -0400
+++ /etc/drbd.d/global_common.conf	2011-09-14 14:23:37.287566400 -0400
@@ -15,24 +15,81 @@
 		# out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
 		# before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
 		# after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
+
+		# This script is a wrapper for RHCS's 'fence_node' command line
+		# tool. It will call a fence against the other node and return
+		# the appropriate exit code to DRBD.
+		fence-peer		"/sbin/obliterate-peer.sh";
 	}
 
 	startup {
 		# wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
+
+		# This tells DRBD to promote both nodes to Primary on start.
+		become-primary-on	both;
+
+		# This tells DRBD to wait five minutes for the other node to
+		# connect. This should be longer than it takes for cman to
+		# timeout and fence the other node *plus* the amount of time it
+		# takes the other node to reboot. If you set this too short,
+		# you could corrupt your data. If you want to be extra safe, do
+		# not use this at all and DRBD will wait for the other node 
+		# forever.
+		wfc-timeout		300;
+		
+		# This tells DRBD to wait for the other node for three minutes
+		# if the other node was degraded the last time it was seen by
+		# this node. This is a way to speed up the boot process when
+		# the other node is out of commission for an extended duration.
+		degr-wfc-timeout	120;
 	}
 
 	disk {
 		# on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
 		# no-disk-drain no-md-flushes max-bio-bvecs
+
+		# This tells DRBD to block IO and fence the remote node (using
+		# the 'fence-peer' helper) when connection with the other node 
+		# is unexpectedly lost. This is what helps prevent split-brain
+		# condition and it is incredible important in dual-primary 
+		# setups!
+		fencing			resource-and-stonith;
 	}
 
 	net {
 		# sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
 		# max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
 		# after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
+
+		# This tells DRBD to allow two nodes to be Primary at the same
+		# time. It is needed when 'become-primary-on both' is set.
+		allow-two-primaries;
+
+		# The following three commands tell DRBD how to react should 
+		# our best efforts fail and a split brain occurs. You can learn
+		# more about these options by reading the drbd.conf man page.
+		# NOTE! It is not possible to safely recover from a split brain
+		# where both nodes were primary. This care requires human
+		# intervention, so 'disconnect' is the only safe policy.
+		after-sb-0pri		discard-zero-changes;
+		after-sb-1pri		discard-secondary;
+		after-sb-2pri		disconnect;
 	}
 
 	syncer {
 		# rate after al-extents use-rle cpu-mask verify-alg csums-alg
+
+		# This alters DRBD's default syncer rate. Note that is it 
+		# *very* important that you do *not* configure the syncer rate
+		# to be too fast. If it is too fast, it can significantly 
+		# impact applications using the DRBD resource. If it's set to a
+		# rate higher than the underlying network and storage can 
+		# handle, the sync can stall completely.
+		# This should be set to ~30% of the *tested* sustainable read 
+		# or write speed of the raw /dev/drbdX device (whichever is 
+		# slower). In this example, the underlying resource was tested
+		# as being able to sustain roughly 60 MB/sec, so this is set to
+		# one third of that rate, 20M.
+		rate			20M;
 	}
 }

Configuring the DRBD Resources

As mentioned earlier, we are going to create three DRBD resources.

  • Resource r0, which will be device /dev/drbd0, will be the shared GFS2 partition.
  • Resource r1, which will be device /dev/drbd1, will provide disk space for VMs that will normally run on an-node01.
  • Resource r2, which will be device /dev/drbd2, will provide disk space for VMs that will normally run on an-node02.
Note: The reason for the two separate VM resources is to help protect against data loss in the off chance that a split-brain occurs, despite our counter-measures. As we will see later, recovering from a split brain requires discarding the changes on one side of the resource. If VMs are running on the same resource but on different nodes, this would lead to data loss. Using two resources helps prevent that scenario.

Each resource configuration will be in its own file saved as /etc/drbd.d/rX.res. The three of them will be pretty much the same. So let's take a look at the first GFS2 resource, r0.res, then we'll just look at the changes for r1.res and r2.res. These files won't exist initially.

vim /etc/drbd.d/r0.res
# This is the resource used for the shared GFS2 partition.
resource r0 {
	# This is the block device path.
        device          /dev/drbd0;

	# We'll use the normal internal metadisk (takes about 32MB/TB)
        meta-disk       internal;

	# This is the `uname -n` of the first node
        on an-node01.alteeve.com {
		# The 'address' has to be the IP, not a hostname. This is the
		# node's SN (bond1) IP. The port number must be unique amoung
		# resources.
                address         10.10.0.1:7789;

		# This is the block device backing this resource on this node.
                disk            /dev/sda5;
        }
	# Now the same information again for the second node.
        on an-node02.alteeve.com {
                address         10.10.0.2:7789;
                disk            /dev/sda5;
        }
}

Now copy this to r1.res and edit it for the an-node01 VM resource. The main differences are the resource name, r1, the block device, /dev/drbd1, the port, 7790, and the backing block device, /dev/sda6.

cp /etc/drbd.d/r0.res /etc/drbd.d/r1.res
vim /etc/drbd.d/r1.res
# This is the resource used for VMs that will normally run on an-node01.
resource r1 {
	# This is the block device path.
        device          /dev/drbd1;

	# We'll use the normal internal metadisk (takes about 32MB/TB)
        meta-disk       internal;

	# This is the `uname -n` of the first node
        on an-node01.alteeve.com {
		# The 'address' has to be the IP, not a hostname. This is the
		# node's SN (bond1) IP. The port number must be unique amoung
		# resources.
                address         10.10.0.1:7790;

		# This is the block device backing this resource on this node.
                disk            /dev/sda6;
        }
	# Now the same information again for the second node.
        on an-node02.alteeve.com {
                address         10.10.0.2:7790;
                disk            /dev/sda6;
        }
}

The last resource is the same again, with the corresponding changes: resource name r2, device /dev/drbd2, port 7791 and backing device /dev/sda7.

cp /etc/drbd.d/r1.res /etc/drbd.d/r2.res
vim /etc/drbd.d/r2.res
# This is the resource used for VMs that will normally run on an-node02.
resource r2 {
	# This is the block device path.
        device          /dev/drbd2;

	# We'll use the normal internal metadisk (takes about 32MB/TB)
        meta-disk       internal;

	# This is the `uname -n` of the first node
        on an-node01.alteeve.com {
		# The 'address' has to be the IP, not a hostname. This is the
		# node's SN (bond1) IP. The port number must be unique among
		# resources.
                address         10.10.0.1:7791;

		# This is the block device backing this resource on this node.
                disk            /dev/sda7;
        }
	# Now the same information again for the second node.
        on an-node02.alteeve.com {
                address         10.10.0.2:7791;
                disk            /dev/sda7;
        }
}

The final step is to validate the configuration. This is done by running the following command;

drbdadm dump
# /etc/drbd.conf
common {
    protocol               C;
    net {
        allow-two-primaries;
        after-sb-0pri    discard-zero-changes;
        after-sb-1pri    discard-secondary;
        after-sb-2pri    disconnect;
    }
    disk {
        fencing          resource-and-stonith;
    }
    syncer {
        rate             20M;
    }
    startup {
        wfc-timeout      300;
        degr-wfc-timeout 120;
        become-primary-on both;
    }
    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error   "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        fence-peer       /sbin/obliterate-peer.sh;
    }
}

# resource r0 on an-node01.alteeve.com: not ignored, not stacked
resource r0 {
    on an-node01.alteeve.com {
        device           /dev/drbd0 minor 0;
        disk             /dev/sda5;
        address          ipv4 10.10.0.1:7789;
        meta-disk        internal;
    }
    on an-node02.alteeve.com {
        device           /dev/drbd0 minor 0;
        disk             /dev/sda5;
        address          ipv4 10.10.0.2:7789;
        meta-disk        internal;
    }
}

# resource r1 on an-node01.alteeve.com: not ignored, not stacked
resource r1 {
    on an-node01.alteeve.com {
        device           /dev/drbd1 minor 1;
        disk             /dev/sda6;
        address          ipv4 10.10.0.1:7790;
        meta-disk        internal;
    }
    on an-node02.alteeve.com {
        device           /dev/drbd1 minor 1;
        disk             /dev/sda6;
        address          ipv4 10.10.0.2:7790;
        meta-disk        internal;
    }
}

# resource r2 on an-node01.alteeve.com: not ignored, not stacked
resource r2 {
    on an-node01.alteeve.com {
        device           /dev/drbd2 minor 2;
        disk             /dev/sda7;
        address          ipv4 10.10.0.1:7791;
        meta-disk        internal;
    }
    on an-node02.alteeve.com {
        device           /dev/drbd2 minor 2;
        disk             /dev/sda7;
        address          ipv4 10.10.0.2:7791;
        meta-disk        internal;
    }
}

You'll note that the output is formatted differently, but the values themselves are the same. If there had been errors, you would have seen them printed; fix any problems before proceeding. Once you get a clean dump, copy the configuration over to the other node.

rsync -av /etc/drbd.d root@an-node02:/etc/
sending incremental file list
drbd.d/
drbd.d/global_common.conf
drbd.d/global_common.conf.orig
drbd.d/r0.res
drbd.d/r1.res
drbd.d/r2.res

sent 7619 bytes  received 129 bytes  15496.00 bytes/sec
total size is 7946  speedup is 1.03

Initializing The DRBD Resources

Now that we have DRBD configured, we need to initialize the DRBD backing devices and then bring up the resources for the first time.

Note: To save a bit of time and typing, the following sections will use a little bash magic. When commands need to be run on all three resources, rather than running the same command three times with the different resource names, we will use the short-hand form r{0,1,2} or r{0..2}.
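If you haven't seen bash brace expansion before, the shell expands these forms into a plain list of arguments before the command ever runs. A quick way to see this for yourself (echo is used here purely for illustration):

echo drbdadm create-md r{0..2}
drbdadm create-md r0 r1 r2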

On both nodes, create the new metadata on the backing devices. You may need to type yes to confirm the action if any data is seen. If DRBD sees an actual file system, it will error and insist that you clear the partition. You can do this by running; dd if=/dev/zero of=/dev/sdaX bs=4M count=1000, where X is the partition you want to clear.

drbdadm create-md r{0..2}
md_offset 21474832384
al_offset 21474799616
bm_offset 21474144256

Found some data

 ==> This might destroy existing data! <==

Do you want to proceed?
[need to type 'yes' to confirm] yes
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success
md_offset 215558397952
al_offset 215558365184
bm_offset 215551782912

Found some data

 ==> This might destroy existing data! <==

Do you want to proceed?
[need to type 'yes' to confirm] yes
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success
md_offset 215557296128
al_offset 215557263360
bm_offset 215550681088

Found some data

 ==> This might destroy existing data! <==

Do you want to proceed?
[need to type 'yes' to confirm] yes
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success

Before going any further, we need to load the drbd kernel module. Note that you won't normally need to do this; later, once everything has been brought up for the first time, we'll start and stop the DRBD resources using the /etc/init.d/drbd script, which loads and unloads the drbd kernel module as needed.

modprobe drbd
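If you want to be sure the module actually loaded, a quick sanity check (this is just a suggestion, not part of the original procedure):

# Should print a line starting with 'drbd' if the module is loaded.
lsmod | grep drbd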

Now go back to the terminal windows we used earlier to watch the cluster start. This time we want to watch the output of cat /proc/drbd so we can keep tabs on the current state of the DRBD resources. We'll do this with the watch program, which refreshes the output of the cat call every couple of seconds.

watch cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by dag@Build64R6, 2011-08-08 08:54:05

Back in the first terminal, we need to attach the backing devices, /dev/sda{5..7}, to their respective DRBD resources, r{0..2}. After running the following command you will see no output in the first terminal, but the second terminal's /proc/drbd view should update.

drbdadm attach r{0..2}
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by dag@Build64R6, 2011-08-08 08:54:05
 0: cs:StandAlone ro:Secondary/Unknown ds:Inconsistent/DUnknown   r----s
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:20970844
 1: cs:StandAlone ro:Secondary/Unknown ds:Inconsistent/DUnknown   r----s
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:210499788
 2: cs:StandAlone ro:Secondary/Unknown ds:Inconsistent/DUnknown   r----s
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:210498712

Take note of the connection state, cs:StandAlone, the current role, ro:Secondary/Unknown, and the disk state, ds:Inconsistent/DUnknown. This tells us that the resources are not talking to one another, are not usable because they are in the Secondary state (you can't even read the /dev/drbdX device), and that the backing devices do not have an up-to-date view of the data.

This all makes sense of course, as the resources are brand new.

So the next step is to connect the two nodes together. As before, we won't see any output from the first terminal, but the second terminal will change.

Note: After running the following command on the first node, its connection state will become cs:WFConnection, which means that it is waiting for a connection from the other node.
drbdadm connect r{0..2}
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by dag@Build64R6, 2011-08-08 08:54:05
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:20970844
 1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:210499788
 2: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:210498712

We can now see that the two nodes are talking to one another properly as the connection state has changed to cs:Connected. They can see that their peer node is in the same state as they are; Secondary/Inconsistent.

Seeing as the resources are brand new, there is no data to synchronize. So we're going to issue a special command that will only ever be used this one time. It tells DRBD to immediately consider the resources up to date on both nodes, skipping the initial full synchronization.

On one node only, run;

drbdadm -- --clear-bitmap new-current-uuid r{0..2}

As before, look to the second terminal to see the new state of affairs.

version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by dag@Build64R6, 2011-08-08 08:54:05
 0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 2: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Voila!

We could promote both sides to Primary by running drbdadm primary r{0..2} on both nodes, but there is no purpose in doing that at this stage as we can safely say our DRBD is ready to go. So instead, let's just stop DRBD entirely. We'll also prevent it from starting on boot as drbd will be managed by the cluster in a later step.

On both nodes run;

chkconfig drbd off
/etc/init.d/drbd stop
Stopping all DRBD resources: .

The second terminal will start complaining that /proc/drbd no longer exists. This is because the drbd init script unloaded the drbd kernel module.

Configuring Clustered Storage

Before we can provision the first virtual machine, we must first create the storage that will back the VMs. This will take a few steps;

  • Configuring LVM's clustered locking and creating the PVs, VGs and LVs
  • Formatting and configuring the shared GFS2 partition.
  • Adding storage to the cluster's resource management.

Configuring Clustered LVM Locking

Before we create the clustered LVM volumes, we first need to make a couple of changes to the LVM configuration.

  • We need to filter out the DRBD backing devices so that LVM doesn't see the same signature twice.
  • Switch and enforce the locking type from local locking to clustered locking.

The configuration option used to filter out the DRBD backing devices is, unsurprisingly, filter = [ ... ]. By default it is set to allow everything via the "a/.*/" regular expression. We're only using DRBD for our LVM volumes, so we're going to flip that to reject everything except DRBD by changing the regex to "a|/dev/drbd*|", "r/.*/".

For the locking, we're going to change the locking_type from 1 (local locking) to 3 (clustered locking). We're also going to disallow any fall-back to local locking; normally LVM will quietly drop back to local locking if the cluster's lock manager is unavailable, and we never want uncoordinated changes made to these clustered volumes.
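Here is a minimal sketch of the relevant /etc/lvm/lvm.conf entries, assuming the stock EL6 file is edited in place. Only the lines shown change, and they live in the sections indicated; everything else keeps its default value.

vim /etc/lvm/lvm.conf
devices {
	# Only accept DRBD devices; reject everything else.
	filter = [ "a|/dev/drbd*|", "r/.*/" ]
}
global {
	# 3 = clustered (DLM-backed) locking.
	locking_type = 3

	# Never fall back to local locking if the cluster is unavailable.
	fallback_to_local_locking = 0
}

Remember that both nodes need the same change; you can edit the file on the second node by hand or copy it over, as we did with the DRBD configuration.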


Creating The Shared GFS2 Partition

On both nodes;

/etc/init.d/drbd start
/etc/init.d/clvmd start

On an-node01 only;

# Create the PV on the shared DRBD device, then a clustered ('-c y') volume
# group and a 10 GB logical volume for the shared GFS2 space.
pvcreate /dev/drbd0 && \
vgcreate -c y vg0 /dev/drbd0 && \
lvcreate -L 10G -n shared /dev/vg0

# '-j 2' creates two journals (one per node) and '-t <cluster>:<name>' sets the
# lock table; change 'ClusterA' to match your cluster's name in cluster.conf.
mkfs.gfs2 -p lock_dlm -j 2 -t ClusterA:shared /dev/vg0/shared

On both nodes;

mkdir /shared
mount /dev/vg0/shared /shared/
# Pull the GFS2 superblock UUID and append a matching /etc/fstab entry.
echo `gfs2_edit -p sb /dev/vg0/shared | grep sb_uuid | sed -e "s/.*sb_uuid  *\(.*\)/UUID=\L\1\E \/shared\t\tgfs2\trw,suid,dev,exec,nouser,async\t0 0/"` >> /etc/fstab
/etc/init.d/gfs2 status
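The appended /etc/fstab entry should look roughly like the line below; the UUID is filled in from your filesystem's superblock, so the placeholder shown here is purely illustrative:

UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx /shared gfs2 rw,suid,dev,exec,nouser,async 0 0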

On an-node01 only;

mkdir /shared/definitions
mkdir /shared/provision
# Create the LV for the virtual machine's storage.
# Change 'vm0001-01' to the proper 'vmXXXX-YY' name.
lvcreate -l 100%free -n vm0001-01 /dev/vg0


# Stop the manually-started storage stack, then start rgmanager and watch
# the cluster take over.
/etc/init.d/gfs2 stop && /etc/init.d/clvmd stop && /etc/init.d/drbd stop
/etc/init.d/rgmanager start && watch clustat


Thanks

 

Any questions, feedback, advice, complaints or meanderings are welcome.