2x5 Scalable Cluster Tutorial: Difference between revisions

From Alteeve Wiki
Jump to navigation Jump to search
No edit summary
 
(83 intermediate revisions by the same user not shown)
Line 1: Line 1:
{{howto_header}}
{{howto_header}}


= Overview =


This paper has one goal;
{{warning|1=This document is old, abandoned and very out of date. DON'T USE ANYTHING HERE! Consider it only as historical note taking.}}


* Creating a 2-node, high-availability cluster hosting [[Xen]] virtual machines.
= The Design =


== Technologies We Will Use ==
== Storage ==


We will introduce and use the following technologies:
Storage, high-level:
<source lang="text">
[ Storage Cluster ]                                                     
    _____________________________            _____________________________
  | [ an-node01 ]              |          | [ an-node02 ]              |
  |  _____    _____            |          |            _____    _____  |
  | ( HDD )  ( SSD )            |          |            ( SSD )  ( HDD ) |
  | (_____)  (_____)  __________|          |__________  (_____)  (_____) |
  |    |        |    | Storage  =--\    /--=  Storage |    |        |    |
  |    |        \----| Network ||  |    |  || Network |----/        |    |
  |    \-------------|_________||  |    |  ||_________|-------------/    |
  |_____________________________|  |    |  |_____________________________|
                                  __|_____|__                             
                                |  HDD LUN  |                             
                                |  SDD LUN  |                             
                                |___________|                             
                                      |                                   
                                  _____|_____                           
                                | Floating  |                           
                                |  SAN IP  |                               
[ VM Cluster ]                  |___________|                               
  ______________________________  | | | | |  ______________________________
| [ an-node03 ]                |  | | | | |  |                [ an-node06 ] |
|  _________                  |  | | | | |  |                  _________  |
| | [ vmA ] |                  |  | | | | |  |                  | [ vmJ ] | |
| |  _____  |                  |  | | | | |  |                  |  _____  | |
| | (_hdd_)-=----\            |  | | | | |  |            /----=-(_hdd_) | |
| |_________|    |            |  | | | | |  |            |    |_________| |
|  _________    |            |  | | | | |  |            |    _________  |
| | [ vmB ] |    |            |  | | | | |  |            |    | [ vmK ] | |
| |  _____  |    |            |  | | | | |  |            |    |  _____  | |
| | (_hdd_)-=--\ |  __________|  | | | | |  |__________  | /--=-(_hdd_) | |
| |_________|  | \--| Storage  =--/ | | | \--=  Storage |--/ |  |_________| |
|  _________  \----| Network ||    | | |    || Network |----/  _________  |
| | [ vmC ] |  /----|_________||    | | |    ||_________|----\  | [ vmL ] | |
| |  _____  |  |              |    | | |    |              |  |  _____  | |
| | (_hdd_)-=--/              |    | | |    |              \--=-(_hdd_) | |
| |_________|                  |    | | |    |                  |_________| |
|______________________________|    | | |    |______________________________|           
  ______________________________    | | |    ______________________________
| [ an-node04 ]                |    | | |    |                [ an-node07 ] |
|  _________                  |    | | |    |                  _________  |
| | [ vmD ] |                  |    | | |    |                  | [ vmM ] | |
| |  _____  |                  |    | | |    |                  |  _____  | |
| | (_hdd_)-=----\            |    | | |    |            /----=-(_hdd_) | |
| |_________|    |            |    | | |    |            |    |_________| |
|  _________    |            |    | | |    |            |    _________  |
| | [ vmE ] |    |            |    | | |    |            |    | [ vmN ] | |
| |  _____  |    |            |    | | |    |            |    |  _____  | |
| | (_hdd_)-=--\ |  __________|    | | |    |__________  | /--=-(_hdd_) | |
| |_________|  | \--| Storage  =----/ | \----=  Storage |--/ |  |_________| |
|  _________  \----| Network ||      |      || Network |----/  _________  |
| | [ vmF ] |  /----|_________||      |      ||_________|----\  | [ vmO ] | |
| |  _____  |  |              |      |      |              |  |  _____  | |
| | (_hdd_)-=--+              |      |      |              \--=-(_hdd_) | |
| | (_ssd_)-=--/              |      |      |                  |_________| |
| |_________|                  |      |      |                              |
|______________________________|      |      |______________________________|           
  ______________________________      |                                     
| [ an-node05 ]                |      |                                     
|  _________                  |      |                                     
| | [ vmG ] |                  |      |                                     
| |  _____  |                  |      |                                     
| | (_hdd_)-=----\            |      |                                     
| |_________|    |            |      |                                     
|  _________    |            |      |                                     
| | [ vmH ] |    |            |      |                                     
| |  _____  |    |            |      |                                     
| | (_hdd_)-=--\ |            |      |                                     
| | (_sdd_)-=--+ |  __________|      |                                     
| |_________|  | \--| Storage  =------/                                     
|  _________  \----| Network ||                                           
| | [ vmI ] |  /----|_________||                                           
| |  _____  |  |              |                                           
| | (_hdd_)-=--/              |                                           
| |_________|                  |                                           
|______________________________|                                                       
</source>


* [[RHCS]], Red Hat Cluster Services version 3, aka "Cluster 3", running on Red Hat Enterprise Linux 6.0 x86_64.
== Long View ==
** RHCS implements:
*** <span class="code">cman</span>; The cluster manager.
*** <span class="code">corosync</span>; The cluster engine, implementing the <span class="code">totem</span> protocol and <span class="code">[[cpg]]</span> and other core cluster services.
*** <span class="code">rgmanager</span>; The resource group manager handles restoring and failing over services in the cluster, including our Xen [[VM]]s.
* [[Fencing]] devices needed to keep a cluster safe.
** Two fencing types are discussed;
*** [[IPMI]]; The most common fence method used in servers.
*** [[Node Assassin]]; A home-brew fence device ideal for learning or as a backup to [[IPMI]].
* Xen; The virtual server [[hypervisor]].
** Converting the host OS into the special access [[dom0]] virtual machine.
** Provisioning [[domU]] VMs.
* Putting all cluster-related daemons under the control of <span class="code">rgmanager</span>.
** Making the VMs highly available.


== Prerequisites ==
{{note|1=Yes, this is a big graphic, but this is also a big project. I am no artist though, and any help making this clearer is greatly appreciated!}}


It is expected that you are already comfortable with the Linux command line, specifically <span class="code">[[bash]]</span>, and that you are familiar with general administrative tasks in Red Hat based distributions. You will also need to be comfortable using editors like [[vim]], [[nano]], [[gedit]], [[kate]] or similar. This paper uses <span class="code">vim</span> in examples. Simply substitute your favourite editor in it's place.
[[Image:2x5_the-plan_01.png|center|thumb|800px|The planned network. This shows separate IPMI and full redundancy through-out the cluster. This is the way a production cluster should be built, but is not expected for dev/test clusters.]]


You are also expected to be comfortable with networking concepts. You will be expected to understand [[TCP/IP]], [[multicast]], [[broadcast]], [[subnets]] and [[netmasks]], routing and other relatively basic networking concepts. Please take the time to become familiar with these concepts before proceeding.
== Failure Mapping ==


Where feasible, as much detail as is possible will be provided. For example, all configuration file locations will be shown and functioning sample files will be provided.
VM Cluster; Guest VM failure migration planning;


== Platform ==
* Each node can host 5 VMs @ 2GB/VM.
* This is an N-1 cluster with five nodes; 20 VMs total.


Red Hat Cluster Service version 3, also known as "Cluster Stable 3" or "RHCS 3", entered the server distribution world with the release of [[RHEL]] 6. It is used by downstream distributions like CentOS and Scientific Linux. This tutorial should be easily adapted to any Red Hat derivative distribution. It is expected that most users will have 64-[[bit]] [[CPU]]s, thus, we will use the [[x86_64]] distribution and packages.
<source lang="text">
          |    All    | an-node03 | an-node04 | an-node05 | an-node06 | an-node07 |
          | on-line  |  down    |  down    |  down    |  down    |  down    |
----------+-----------+-----------+-----------+-----------+-----------+-----------+
an-node03 |  vm01    |    --    |  vm01    |  vm01    |  vm01    |  vm01    |
          |  vm02    |    --    |  vm02    |  vm02    |  vm02    |  vm02    |
          |  vm03    |    --    |  vm03    |  vm03    |  vm03    |  vm03    |
          |  vm04    |    --    |  vm04    |  vm04    |  vm04    |  vm04    |
          |    --    |    --    |  vm05    |  vm09    |  vm13    |  vm17    |
----------|-----------|-----------|-----------|-----------|-----------|-----------|
an-node04 |  vm05    |  vm05    |    --    |  vm05    |  vm05    |  vm05    |
          |  vm06    |  vm06    |    --    |  vm06    |  vm06    |  vm06    |
          |  vm07    |  vm07    |    --    |  vm07    |  vm07    |  vm07    |
          |  vm08    |  vm08    |    --    |  vm08    |  vm08    |  vm08    |
          |    --    |  vm01    |    --    |  vm10    |  vm14    |  vm18    |
----------|-----------|-----------|-----------|-----------|-----------|-----------|
an-node05 |  vm09    |  vm09    |  vm09    |    --    |  vm09    |  vm09    |
          |  vm10    |  vm10    |  vm10    |    --    |  vm10    |  vm10    |
          |  vm11    |  vm11    |  vm11    |    --    |  vm11    |  vm11    |
          |  vm12    |  vm12    |  vm12    |    --    |  vm12    |  vm12    |
          |    --    |  vm02    |  vm06    |    --    |  vm15    |  vm19    |
----------|-----------|-----------|-----------|-----------|-----------|-----------|
an-node06 |  vm13    |  vm13    |  vm13    |  vm13    |    --    |  vm13    |
          |  vm14    |  vm14    |  vm14    |  vm14    |    --    |  vm14    |
          |  vm15    |  vm15    |  vm15    |  vm15    |    --    |  vm15    |
          |  vm16    |  vm16    |  vm16    |  vm16    |    --    |  vm16    |
          |    --    |  vm03    |  vm07    |  vm11    |    --    |  vm20    |
----------|-----------|-----------|-----------|-----------|-----------|-----------|
an-node07 |  vm17    |  vm17    |  vm17    |  vm17    |  vm17    |    --    |
          |  vm18    |  vm18    |  vm18    |  vm18    |  vm18    |    --    |
          |  vm19    |  vm19    |  vm19    |  vm19    |  vm19    |    --    |
          |  vm20    |  vm20    |  vm20    |  vm20    |  vm20    |    --    |
          |    --    |  vm04    |  vm08    |  vm12    |  vm16    |    --    |
----------+-----------+-----------+-----------+-----------+-----------+-----------+
</source>


If you are on an 32-bit system, you should be able to follow along fine. Simply replace <span class="code">x86_64</span> with <span class="code">i386</span> or <span class="code">i686</span> in package names. Be aware though that issues arising from the need for [[PAE]] will not be discussed.
== Cluster Overview ==


If you do not have a Red Hat Network account, you can download [[CentOS]] or another derivative of the same release, currently <span class="code">6.0</span>.
{{note|1=This is not programatically accurate!}}


'''Note''': When last checked, down-stream distributions have not yet been released. It is expected that they will be available around mid to late December.
This is meant to show, at a logical level, how the parts of a cluster work together. It is the first draft and is likely defective in terrible ways.


== Focus and Goal ==
<source lang="text">
[ Resource Managment ]                                                                                                         
  ___________    ___________                                                                                               
|          |  |          |                                                                                             
| Service A |  | Service B |                                                                                             
|___________|  |___________|                                                                                             
            |    |        |                                                                                                   
          __|_____|__    ___|_______________                                                                                   
        |          |  |                  |                                                                                   
        | RGManager |  | Clustered Storage |================================================.                                 
        |___________|  |___________________|                                                |                                 
              |                  |                                                          |                                 
              |__________________|______________                                            |                               
                            |                    \                                          |                               
        _________      ____|____                |                                          |                               
        |        |    |        |                |                                          |                               
/------| Fencing |----| Locking |                |                                          |                               
|      |_________|    |_________|                |                                          |                               
_|___________|_____________|______________________|__________________________________________|_____
|          |            |                      |                                          |                                 
|    ______|_____    ____|___                  |                                          |                                 
|    |            |  |        |                  |                                          |                                 
|    | Membership |  | Quorum |                  |                                          |                                 
|    |____________|  |________|                  |                                          |                                 
|          |____________|                      |                                          |                                 
|                      __|__                    |                                          |                                 
|                    /    \                    |                                          |                                 
|                    { Totem }                  |                                          |                                 
|                    \_____/                    |                                          |                                 
|      __________________|_______________________|_______________ ______________            |                                   
|    |-----------|-----------|----------------|-----------------|--------------|          |                                   
|  ___|____    ___|____    ___|____        ___|____        _____|_____    _____|_____    __|___                               
| |        |  |        |  |        |      |        |      |          |  |          |  |      |                               
| | Node 1 |  | Node 2 |  | Node 3 |  ...  | Node N |      | Storage 1 |==| Storage 2 |==| DRBD |                               
| |________|  |________|  |________|      |________|      |___________|  |___________|  |______|                               
\_____|___________|___________|________________|_________________|______________|                                               
                                                                                                                               
[ Cluster Communication ]                                                                                                       
</source>


Clusters can serve to solve three problems; '''Reliability''', '''Performance''' and '''Scalability'''.
== Network IPs ==


This paper will build a cluster designed to be more reliable, also known as a '''High-Availability cluster''' or simply '''HA Cluster'''. At the end of this paper, you should have a fully functioning two-node cluster capable of hosting a "floating" virtual servers. That is, VMs that exist on one node that can be easily moved to the other node with minimal or no down time.
<source lang="text">
SAN: 10.10.1.1


= Base System Setup =
Node:
          | IFN        | SN        | BCN      | IPMI      |
----------+-------------+------------+-----------+-----------+
an-node01 | 10.255.0.1  | 10.10.0.1  | 10.20.0.1 | 10.20.1.1 |                                               
an-node02 | 10.255.0.2  | 10.10.0.2  | 10.20.0.2 | 10.20.1.2 |                                               
an-node03 | 10.255.0.3  | 10.10.0.3  | 10.20.0.3 | 10.20.1.3 |                                               
an-node04 | 10.255.0.4  | 10.10.0.4  | 10.20.0.4 | 10.20.1.4 |                                               
an-node05 | 10.255.0.5  | 10.10.0.5  | 10.20.0.5 | 10.20.1.5 |                                               
an-node06 | 10.255.0.6  | 10.10.0.6  | 10.20.0.6 | 10.20.1.6 |                                               
an-node07 | 10.255.0.7  | 10.10.0.7  | 10.20.0.7 | 10.20.1.7 |                                               
----------+-------------+------------+-----------+-----------+


This paper is admittedly long-winded. There is a "cheat-sheet" version planned, but it will be written only after this main tutorial is complete. Please be patient! Clustering is not inherently difficult, but there are a lot of pieces that need to work together for anything to work. Grab a coffee or tea and settle in.
Aux Equipment:
          | BCN        |
----------+-------------+
pdu1      | 10.20.2.1  |                                                                                                 
pdu2      | 10.20.2.2  |                                                                                                 
switch1  | 10.20.2.3  |                                                                                                 
switch2  | 10.20.2.4  |                                                                                                 
ups1      | 10.20.2.5  |                                                                                       
ups2      | 10.20.2.6  |                                                                                       
----------+-------------+
                                                                                                                 
VMs:                                                                                                             
          | VMN        |                                                                                       
----------+-------------+
vm01      | 10.254.0.1  |                                                                                       
vm02      | 10.254.0.2  |                                                                                       
vm03      | 10.254.0.3  |                                                                                       
vm04      | 10.254.0.4  |                                                                                       
vm05      | 10.254.0.5  |                                                                                       
vm06      | 10.254.0.6  |                                                                                       
vm07      | 10.254.0.7  |                                                                                       
vm08      | 10.254.0.8  |                                                                                       
vm09      | 10.254.0.9  |                                                                                       
vm10      | 10.254.0.10 |                                                                                       
vm11      | 10.254.0.11 |                                                                                       
vm12      | 10.254.0.12 |                                                                                       
vm13      | 10.254.0.13 |                                                                                       
vm14      | 10.254.0.14 |                                                                                       
vm15      | 10.254.0.15 |                                                                                       
vm16      | 10.254.0.16 |                                                                                       
vm17      | 10.254.0.17 |                                                                                       
vm18      | 10.254.0.18 |                                                                                       
vm19      | 10.254.0.19 |                                                                                       
vm20      | 10.254.0.20 |                                                                                       
----------+-------------+
</source>


== Hardware ==
= Install The Cluster Software =


We will need two physical servers each with the following hardware:
If you are using Red Hat Enterprise Linux, you will need to add the <span class="code">RHEL Server Optional (v. 6 64-bit x86_64)</span> channel for each node in your cluster. You can do this in [[RHN]] by going the your subscription management page, clicking on each server, clicking on "Alter Channel Subscriptions", click to enable the <span class="code">RHEL Server Optional (v. 6 64-bit x86_64)</span> channel and then by clicking on "Change Subscription".


* One or more multi-core [[CPU]]s with Virtualization support.
This actual installation is simple, just use <span class="code">yum</span> to install <span class="code">cman</span>.
* Three network cards; At least one should be gigabit or faster.
* One or more hard drives.
* You will need some form of a [[fence|fence device]]. This can be an [[IPMI]]-enabled server, a [http://nodeassassin.org Node Assassin], a fenceable [http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900&tab=features PDU] or similar.


This paper uses the following hardware, and would suggest these be "minimum specifications":
<source lang="bash">
yum install cman fence-agents rgmanager resource-agents lvm2-cluster gfs2-utils python-virtinst libvirt qemu-kvm-tools qemu-kvm virt-manager virt-viewer virtio-win
</source>


* ASUS [http://support.asus.com/search/search.aspx?keyword=m4a78l-m&SLanguage=en-us M4A78L-M]
== Initial Config ==
* AMD Athlon II x2 250
* 2GB Kingston DDR2 KVR800D2N6K2/4G (4GB kit split between the two nodes)
* 1x Intel 82540 PCI NICs
* 1x D-Link DGE-560T
* Node Assassin


This is not an endorsement of the above hardware. I bought what was within my budget that would serve the purposes of creating this document. What you purchase shouldn't matter, so long as the minimum requirements are met.
Everything uses <span class="code">ricci</span>, which itself needs to have a password set. I set this to match <span class="code">root</span>.


'''Note''': I use three physical [[NIC]]s, but you can get away with fewer by using [[VLAN]]s or by simply re-using a given interface. Neither appealed to me given the minimal cost of add-in network cards and the relative complexity of VLANs. If you wish to alter your network setup, please do so.
'''Both''':


== Pre-Assembly Information ==
<source lang="bash">
passwd ricci
</source>
<source lang="bash">
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
</source>


With multiple NICs, it is quite likely that the mapping of physical devices to logical <span class="code">ethX</span> devices may not be ideal. This is a particular issue if you decide to [[Setting Up a PXE Server in Fedora|network boot]] your install media.
With these decisions and the information gathered, here is what our first <span class="code">/etc/cluster/cluster.conf</span> file will look like.  


There is no requirement, from a clustering point of view, that any given network card be mapped to any given <span class="code">ethX</span> device. However, you will be jumping between servers fairly often and having various setups adds one more level of complexity. For this reason, I strongly recommend you follow this section.
<source lang="bash">
touch /etc/cluster/cluster.conf
vim /etc/cluster/cluster.conf
</source>
<source lang="xml">
<?xml version="1.0"?>
<cluster config_version="1" name="an-cluster">
<cman expected_votes="1" two_node="1" />
<clusternodes>
<clusternode name="an-node01.alteeve.ca" nodeid="1">
<fence>
<method name="ipmi">
<device action="reboot" name="ipmi_an01" />
</method>
<method name="pdu">
<device action="reboot" name="pdu1" port="1" />
<device action="reboot" name="pdu2" port="1" />
</method>
</fence>
</clusternode>
<clusternode name="an-node02.alteeve.ca" nodeid="2">
<fence>
<method name="ipmi">
<device action="reboot" name="ipmi_an02" />
</method>
<method name="pdu">
<device action="reboot" name="pdu1" port="2" />
<device action="reboot" name="pdu2" port="2" />
</method>
</fence>
</clusternode>
<clusternode name="an-node03.alteeve.ca" nodeid="3">
<fence>
<method name="ipmi">
<device action="reboot" name="ipmi_an03" />
</method>
<method name="pdu">
<device action="reboot" name="pdu1" port="3" />
<device action="reboot" name="pdu2" port="3" />
</method>
</fence>
</clusternode>
<clusternode name="an-node04.alteeve.ca" nodeid="4">
<fence>
<method name="ipmi">
<device action="reboot" name="ipmi_an04" />
</method>
<method name="pdu">
<device action="reboot" name="pdu1" port="4" />
<device action="reboot" name="pdu2" port="4" />
</method>
</fence>
</clusternode>
<clusternode name="an-node05.alteeve.ca" nodeid="5">
<fence>
<method name="ipmi">
<device action="reboot" name="ipmi_an05" />
</method>
<method name="pdu">
<device action="reboot" name="pdu1" port="5" />
<device action="reboot" name="pdu2" port="5" />
</method>
</fence>
</clusternode>
<clusternode name="an-node06.alteeve.ca" nodeid="6">
<fence>
<method name="ipmi">
<device action="reboot" name="ipmi_an06" />
</method>
<method name="pdu">
<device action="reboot" name="pdu1" port="6" />
<device action="reboot" name="pdu2" port="6" />
</method>
</fence>
</clusternode>
<clusternode name="an-node07.alteeve.ca" nodeid="7">
<fence>
<method name="ipmi">
<device action="reboot" name="ipmi_an07" />
</method>
<method name="pdu">
<device action="reboot" name="pdu1" port="7" />
<device action="reboot" name="pdu2" port="7" />
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice agent="fence_ipmilan" ipaddr="an-node01.ipmi" login="root" name="ipmi_an01" passwd="secret" />
<fencedevice agent="fence_ipmilan" ipaddr="an-node02.ipmi" login="root" name="ipmi_an02" passwd="secret" />
<fencedevice agent="fence_ipmilan" ipaddr="an-node03.ipmi" login="root" name="ipmi_an03" passwd="secret" />
<fencedevice agent="fence_ipmilan" ipaddr="an-node04.ipmi" login="root" name="ipmi_an04" passwd="secret" />
<fencedevice agent="fence_ipmilan" ipaddr="an-node05.ipmi" login="root" name="ipmi_an05" passwd="secret" />
<fencedevice agent="fence_ipmilan" ipaddr="an-node06.ipmi" login="root" name="ipmi_an06" passwd="secret" />
<fencedevice agent="fence_ipmilan" ipaddr="an-node07.ipmi" login="root" name="ipmi_an07" passwd="secret" />
<fencedevice agent="fence_apc_snmp" ipaddr="pdu1.alteeve.ca" name="pdu1" />
<fencedevice agent="fence_apc_snmp" ipaddr="pdu2.alteeve.ca" name="pdu2" />
</fencedevices>
<fence_daemon post_join_delay="30" />
<totem rrp_mode="none" secauth="off" />
<rm>
<resources>
<ip address="10.10.1.1" monitor_link="on" />
<script file="/etc/init.d/tgtd" name="tgtd" />
<script file="/etc/init.d/drbd" name="drbd" />
<script file="/etc/init.d/clvmd" name="clvmd" />
<script file="/etc/init.d/gfs2" name="gfs2" />
<script file="/etc/init.d/libvirtd" name="libvirtd" />
</resources>
<failoverdomains>
<!-- Used for storage -->
<!-- SAN Nodes -->
<failoverdomain name="an1_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node01.alteeve.ca" />
</failoverdomain>
<failoverdomain name="an2_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node02.alteeve.ca" />
</failoverdomain>
<!-- VM Nodes -->
<failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" />
</failoverdomain>
<failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node04.alteeve.ca" />
</failoverdomain>
<failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node05.alteeve.ca" />
</failoverdomain>
<failoverdomain name="an6_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node06.alteeve.ca" />
</failoverdomain>
<failoverdomain name="an7_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node07.alteeve.ca" />
</failoverdomain>
<!-- Domain for the SAN -->
<failoverdomain name="an1_primary" nofailback="1" ordered="1" restricted="0">
<failoverdomainnode name="an-node01.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node02.alteeve.ca" priority="2" />
</failoverdomain>
<!-- Domains for VMs running primarily on an-node03 -->
<failoverdomain name="an3_an4" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node04.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an3_an5" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node05.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an3_an6" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node06.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an3_an7" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node07.alteeve.ca" priority="2" />
</failoverdomain>
<!-- Domains for VMs running primarily on an-node04 -->
<failoverdomain name="an4_an3" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node04.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node03.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an4_an5" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node04.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node05.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an4_an6" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node04.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node06.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an4_an7" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node04.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node07.alteeve.ca" priority="2" />
</failoverdomain>
<!-- Domains for VMs running primarily on an-node05 -->
<failoverdomain name="an5_an3" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node05.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node03.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an5_an4" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node05.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node04.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an5_an6" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node05.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node06.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an5_an7" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node05.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node07.alteeve.ca" priority="2" />
</failoverdomain>
<!-- Domains for VMs running primarily on an-node06 -->
<failoverdomain name="an6_an3" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node06.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node03.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an6_an4" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node06.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node04.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an6_an5" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node06.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node05.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an6_an7" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node06.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node07.alteeve.ca" priority="2" />
</failoverdomain>
<!-- Domains for VMs running primarily on an-node07 -->
<failoverdomain name="an7_an3" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node07.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node03.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an7_an4" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node07.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node04.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an7_an5" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node07.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node05.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an7_an6" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node07.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node06.alteeve.ca" priority="2" />
</failoverdomain>
</failoverdomains>
<!-- SAN Services -->
<service autostart="1" domain="an1_only" exclusive="0" max_restarts="0" name="an1_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="tgtd" />
</script>
</script>
</service>
<service autostart="1" domain="an2_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="tgtd" />
</script>
</script>
</service>
<service autostart="1" domain="an1_primary" name="san_ip" recovery="relocate">
<ip ref="10.10.1.1" />
</service>
<!-- VM Storage services. -->
<service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an3_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd" />
</script>
</script>
</script>
</service>
<service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd" />
</script>
</script>
</script>
</service>
<service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd" />
</script>
</script>
</script>
</service>
<service autostart="1" domain="an6_only" exclusive="0" max_restarts="0" name="an6_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd" />
</script>
</script>
</script>
</service>
<service autostart="1" domain="an7_only" exclusive="0" max_restarts="0" name="an7_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd" />
</script>
</script>
</script>
</service>
<!-- VM Services -->
<!-- VMs running primarily on an-node03 -->
<vm name="vm01" domain="an03_an04" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm02" domain="an03_an05" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm03" domain="an03_an06" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm04" domain="an03_an07" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<!-- VMs running primarily on an-node04 -->
<vm name="vm05" domain="an04_an03" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm06" domain="an04_an05" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm07" domain="an04_an06" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm08" domain="an04_an07" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<!-- VMs running primarily on an-node05 -->
<vm name="vm09" domain="an05_an03" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm10" domain="an05_an04" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm11" domain="an05_an06" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm12" domain="an05_an07" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<!-- VMs running primarily on an-node06 -->
<vm name="vm13" domain="an06_an03" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm14" domain="an06_an04" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm15" domain="an06_an05" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm16" domain="an06_an07" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<!-- VMs running primarily on an-node07 -->
<vm name="vm17" domain="an07_an03" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm18" domain="an07_an04" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm19" domain="an07_an05" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm20" domain="an07_an06" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
</rm>
</cluster>
</source>


Before you assemble your servers, record their network cards' [[MAC]] addresses. I like to keep simple text files like these:
Save the file, then validate it. If it fails, address the errors and try again.


<source lang="bash">
<source lang="bash">
cat an-node01.mac
ip addr list | grep <ip>
rg_test test /etc/cluster/cluster.conf
ccs_config_validate
</source>
</source>
<source lang="text">
<source lang="text">
90:E6:BA:71:82:EA eth0 # Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
Configuration validates
00:21:91:19:96:53 eth1 # D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter
00:0E:0C:59:46:E4 eth2 # Intel Corporation 82540EM Gigabit Ethernet Controller
</source>
</source>
Push it to the other node:


<source lang="bash">
<source lang="bash">
cat an-node02.mac
rsync -av /etc/cluster/cluster.conf root@an-node02:/etc/cluster/
</source>
</source>
<source lang="text">
<source lang="text">
90:E6:BA:71:82:D8 eth0 # Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
sending incremental file list
00:21:91:19:96:5A eth1 # D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter
cluster.conf
00:0E:0C:59:45:78 eth2 # Intel Corporation 82540EM Gigabit Ethernet Controller
 
sent 781 bytes  received 31 bytes  541.33 bytes/sec
total size is 701  speedup is 0.86
</source>
</source>


This will prove very handy later.
Start:
 
 
 
'''''DO NOT PROCEED UNTIL YOUR cluster.conf FILE VALIDATES!'''''
 
Unless you have it perfect, your cluster will fail.


== OS Install ==
Once it validates, proceed.


There is no hard and fast rule on how you install the host operating systems. Ultimately, it's a question of what you prefer. There are some things that you should keep in mind though.
== Starting The Cluster For The First Time ==


* Balance the desire for tools against the reality that all programs have bugs.
By default, if you start one node only and you've enabled the <span class="code"><cman two_node="1" expected_votes="1"/></span> option as we have done, the lone server will effectively gain quorum. It will try to connect to the cluster, but there won't be a cluster to connect to, so it will fence the other node after a timeout period. This timeout is <span class="code">6</span> seconds by default.
** Bugs could be exploited to gain access to your server. If the host is compromised, all of the virtual servers are compromised.
* The host operating system, known as <span class="code">[[dom0]]</span> in Xen, should do nothing but run the [[hypervisor]].
* If you install a graphical interface, like [[Xorg]] and [[Gnome]], consider disabling it.
** This paper takes this approach and will cover disabling the graphical interface.


Below is the kickstart script used by the nodes for this paper. You should be able to adapt it easily to suit your needs. All options are documented.
For now, we will leave the default as it is. If you're interested in changing it though, the argument you are looking for is <span class="code">[[RHCS v3 cluster.conf#post_join_delay|post_join_delay]]</span>.


* [https://alteeve.com/files/an-cluster/ks/generic_server_rhel6.ks generic_server_rhel6.ks]
This behaviour means that we'll want to start both nodes well within six seconds of one another, least the slower one get needlessly fenced.  


== Post OS Install ==
'''Left off here'''


There are a handful of changes we will want to make now that the install is complete. Some of these are optional and you may skip them if you prefer. However, the remainder of this paper assumes these changes have been made. If you used the [[kickstart]] script, then some of these steps will have already been completed.
Note to help minimize dual-fences:
* <span class="code">you could add FENCED_OPTS="-f 5" to /etc/sysconfig/cman on *one* node</span> (ilo fence devices may need this)


=== Disable selinux ===
== DRBD Config ==


Given the complexity of clustering, we will disable <span class="code">selinux</span> to keep it from being more complex. Obviously, this introduces security issues that you may not be comfortable with.
Install from source:


To disable <span class="code">selinux</span>, edit <span class="code">/etc/selinux/config</span> and change <span class="code">SELINUX=enforcing</span> to <span class="code">SELINUX=permissive</span>. You will need to reboot in order for the changes to take effect, but don't do it yet as some changes to come may also need a reboot.
'''Both''':


=== Change the Default Run-Level ===
<source lang="bash">
# Obliterate peer - fence via cman
wget -c https://alteeve.ca/files/an-cluster/sbin/obliterate-peer.sh -O /sbin/obliterate-peer.sh
chmod a+x /sbin/obliterate-peer.sh
ls -lah /sbin/obliterate-peer.sh


This is an optional step intended to improve performance.
# Download, compile and install DRBD
wget -c http://oss.linbit.com/drbd/8.3/drbd-8.3.11.tar.gz
tar -xvzf drbd-8.3.11.tar.gz
cd drbd-8.3.11
./configure \
  --prefix=/usr \
  --localstatedir=/var \
  --sysconfdir=/etc \
  --with-utils \
  --with-km \
  --with-udev \
  --with-pacemaker \
  --with-rgmanager \
  --with-bashcompletion
make
make install
chkconfig --add drbd
chkconfig drbd off
</source>


If you don't plan to work on your nodes directly, it makes sense to switch the default run level from <span class="code">5</span> to <span class="code">3</span>. This prevents the window manager, like Gnome or KDE, from starting at boot. This frees up a fair of memory and system resources and reduces the possible attack vectors.
== Configure ==


To do this, edit <span class="code">/etc/inittab</span>, change the <span class="code">id:5:initdefault:</span> line to <span class="code">id:3:initdefault:</span> and then switch to run level 3:
'''<span class="code">an-node01</span>''':


<source lang="bash">
<source lang="bash">
vim /etc/inittab
# Configure DRBD's global value.
cp /etc/drbd.d/global_common.conf /etc/drbd.d/global_common.conf.orig
vim /etc/drbd.d/global_common.conf
diff -u /etc/drbd.d/global_common.conf
</source>
<source lang="diff">
--- /etc/drbd.d/global_common.conf.orig 2011-08-01 21:58:46.000000000 -0400
+++ /etc/drbd.d/global_common.conf 2011-08-01 23:18:27.000000000 -0400
@@ -15,24 +15,35 @@
# out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
# before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
# after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
+ fence-peer "/sbin/obliterate-peer.sh";
}
startup {
# wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
+ become-primary-on both;
+ wfc-timeout 300;
+ degr-wfc-timeout 120;
}
disk {
# on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
# no-disk-drain no-md-flushes max-bio-bvecs
+ fencing resource-and-stonith;
}
net {
# sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
# max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
# after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
+ allow-two-primaries;
+ after-sb-0pri discard-zero-changes;
+ after-sb-1pri discard-secondary;
+ after-sb-2pri disconnect;
}
syncer {
# rate after al-extents use-rle cpu-mask verify-alg csums-alg
+ # This should be no more than 30% of the maximum sustainable write speed.
+ rate 20M;
}
}
</source>
<source lang="bash">
vim /etc/drbd.d/r0.res
</source>
</source>
<source lang="text">
<source lang="text">
id:3:initdefault:
resource r0 {
        device          /dev/drbd0;
        meta-disk      internal;
        on an-node01.alteeve.ca {
                address        192.168.2.71:7789;
                disk            /dev/sda5;
        }
        on an-node02.alteeve.ca {
                address        192.168.2.72:7789;
                disk            /dev/sda5;
        }
}
</source>
</source>
<source lang="bash">
<source lang="bash">
init 3
cp /etc/drbd.d/r0.res /etc/drbd.d/r1.res
vim /etc/drbd.d/r1.res
</source>
<source lang="text">
resource r1 {
        device          /dev/drbd1;
        meta-disk      internal;
        on an-node01.alteeve.ca {
                address        192.168.2.71:7790;
                disk            /dev/sdb1;
        }
        on an-node02.alteeve.ca {
                address        192.168.2.72:7790;
                disk            /dev/sdb1;
        }
}
</source>
</source>


=== Make Boot Messages Visible ===
{{note|1=If you have multiple DRBD resources on on (set of) backing disks, consider adding <span class="code">syncer { after <minor-1>; }</span>. For example, tell <span class="code">/dev/drbd1</span> to wait for <span class="code">/dev/drbd0</span> by adding <span class="code">syncer { after 0; }</span>. This will prevent simultaneous resync's which could seriously impact performance. Resources will wait in <span class="code"></span> state until the defined resource has completed sync'ing.}}
 
This is another optional step that disables the <span class="code">rhgb</span> (Red Hat Graphical Boot) and <span class="code">quiet</span> kernel arguments. These options provide the nice boot-time splash screen. I like to turn them off though as they also hide a lot of boot messages that can be helpful.  


To make this change, edit the grub menu and remove the <span class="code">rhgb quiet</span> arguments from the <span class="code">kernel /vmlinuz...</span> line.
Validate:


<source lang="bash">
<source lang="bash">
vim /boot/grub/menu.lst
drbdadm dump
</source>
</source>
<source lang="text">
  --==  Thank you for participating in the global usage survey  ==--
The server's response is:


Change:
you are the 369th user to install this version
# /usr/etc/drbd.conf
common {
    protocol              C;
    net {
        allow-two-primaries;
        after-sb-0pri    discard-zero-changes;
        after-sb-1pri    discard-secondary;
        after-sb-2pri    disconnect;
    }
    disk {
        fencing          resource-and-stonith;
    }
    syncer {
        rate            20M;
    }
    startup {
        wfc-timeout      300;
        degr-wfc-timeout 120;
        become-primary-on both;
    }
    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error  "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        fence-peer      /sbin/obliterate-peer.sh;
    }
}


<source lang="text">
# resource r0 on an-node01.alteeve.ca: not ignored, not stacked
title Red Hat Enterprise Linux (2.6.32-71.el6.x86_64)
resource r0 {
         root (hd0,0)
    on an-node01.alteeve.ca {
         kernel /vmlinuz-2.6.32-71.el6.x86_64 ro root=UUID=ef8ebd1b-8c5f-4bc8-b683-ead5f4603fec rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=auto rhgb quiet
         device          /dev/drbd0 minor 0;
         initrd /initramfs-2.6.32-71.el6.x86_64.img
         disk            /dev/sda5;
        address          ipv4 192.168.2.71:7789;
        meta-disk        internal;
    }
    on an-node02.alteeve.ca {
        device          /dev/drbd0 minor 0;
        disk            /dev/sda5;
        address          ipv4 192.168.2.72:7789;
        meta-disk        internal;
    }
}
 
# resource r1 on an-node01.alteeve.ca: not ignored, not stacked
resource r1 {
    on an-node01.alteeve.ca {
        device          /dev/drbd1 minor 1;
         disk            /dev/sdb1;
        address          ipv4 192.168.2.71:7790;
        meta-disk        internal;
    }
    on an-node02.alteeve.ca {
        device          /dev/drbd1 minor 1;
        disk            /dev/sdb1;
        address          ipv4 192.168.2.72:7790;
        meta-disk        internal;
    }
}
</source>
</source>


To:
<source lang="bash">
rsync -av /etc/drbd.d root@an-node02:/etc/
</source>
<source lang="text">
drbd.d/
drbd.d/global_common.conf
drbd.d/global_common.conf.orig
drbd.d/r0.res
drbd.d/r1.res


<source lang="text">
sent 3523 bytes  received 110 bytes  7266.00 bytes/sec
title Red Hat Enterprise Linux (2.6.32-71.el6.x86_64)
total size is 3926  speedup is 1.08
        root (hd0,0)
        kernel /vmlinuz-2.6.32-71.el6.x86_64 ro root=UUID=ef8ebd1b-8c5f-4bc8-b683-ead5f4603fec rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=auto
        initrd /initramfs-2.6.32-71.el6.x86_64.img
</source>
</source>


= Setup Inter-Node Networking =
== Initialize and First start ==


This is the first stage of our network setup. Here we will walk through setting up the three networks between our two nodes. Later we will revisit networking to tie the virtual machines together.
'''Both''':


== Warning About Managed Switches ==
Create the meta-data.


'''WARNING''': Please pay attention to this warning! The vast majority of cluster problems end up being network related. The hardest ones to diagnose are usually [[multicast]] issues.
<source lang="bash">
modprobe
drbdadm create-md r{0,1}
</source>
<source lang="text">
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success
</source>


If you use a managed switch, be careful about enabling [[Multicast IGMP Snooping]] or [[Spanning Tree Protocol]]. They have been known to cause problems by not allowing multicast packets to reach all nodes. This can cause somewhat random break-downs in communication between your nodes, leading to seemingly random fences and DLM lock timeouts. If your switches support [[PIM Routing]], be sure to use it!
Attach, connect and confirm (after both have attached and connected):


If you have problems with your cluster not forming, or seemingly random fencing, try using a cheap [http://dlink.ca/products/?pid=230 unmanaged] switch. If the problem goes away, you are most likely dealing with a managed switch configuration problem.
<source lang="bash">
drbdadm attach r{0,1}
drbdadm connect r{0,1}
cat /proc/drbd
</source>
<source lang="text">
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:441969960
1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:29309628
</source>


== Network Layout ==
There is no data, so force both devices to be instantly UpToDate:
 
<source lang="bash">
drbdadm -- --clear-bitmap new-current-uuid r{0,1}
cat /proc/drbd
</source>
<source lang="text">
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
</source>


This setup expects you to have three physical network cards connected to three independent networks. Each network serves a purpose:
Set both to primary and run a final check.


* Network connected to the Internet and thus has untrusted traffic.
<source lang="bash">
* Storage network used for keeping data between the nodes in sync.
drbdadm primary r{0,1}
* Back-channel network used for secure internode communication.
cat /proc/drbd
</source>
<source lang="text">
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:672 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:672 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
</source>


These are the networks and names that will be used in this tutorial. Please note that, inside [[VM]]s, device names will not match the list below. This table is valid for the operating systems running the [[hypervisor]]s, known as [[dom0]] in Xen or as the host in other virtualized environments.
== Update the cluster ==


{|
<source lang="bash">
!class="cell_all"|Network Description
vim /etc/cluster/cluster.conf
!class="cell_tbr"|Short Name
</source>
!class="cell_tbr"|Device Name
<source lang="xml">
!class="cell_tbr"|Suggested Subnet
<?xml version="1.0"?>
!class="cell_tbr"|NIC Properties
<cluster config_version="17" name="an-clusterA">
|-
        <cman expected_votes="1" two_node="1"/>
|class="cell_blr"|Back-Channel Network
        <totem rrp_mode="none" secauth="off"/>
|class="cell_br"|<span class="code">BCN</span>
        <clusternodes>
|class="cell_br"|<span class="code">eth0</span>
                <clusternode name="an-node01.alteeve.ca" nodeid="1">
|class="cell_br"|<span class="code">192.168.1.0/24</span>
                        <fence>
|class="cell_br"|NICs with [[IPMI]] piggy-back '''must''' be used here.<br />Second-fastest NIC should be used here.<br />If using a [[Setting Up a PXE Server in Fedora|PXE server]], this should be a bootable NIC.
                                <method name="apc_pdu">
|-
                                        <device action="reboot" name="pdu2" port="1"/>
|class="cell_blr"|Storage Network
                                </method>
|class="cell_br"|<span class="code">SN</span>
                        </fence>
|class="cell_br"|<span class="code">eth1</span>
                </clusternode>
|class="cell_br"|<span class="code">192.168.2.0/24</span>
                <clusternode name="an-node02.alteeve.ca" nodeid="2">
|class="cell_br"|Fastest NIC should be used here.
                        <fence>
|-
                                <method name="apc_pdu">
|class="cell_blr"|Internet-Facing Network
                                        <device action="reboot" name="pdu2" port="2"/>
|class="cell_br"|<span class="code">IFN</span>
                                </method>
|class="cell_br"|<span class="code">eth2</span>
                        </fence>
|class="cell_br"|<span class="code">192.168.3.0/24</span>
                </clusternode>
|class="cell_br"|Remaining NIC should be used here.
        </clusternodes>
|}
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
        </fencedevices>
        <fence_daemon post_join_delay="30"/>
        <rm>
                <resources>
                        <ip address="192.168.2.100" monitor_link="on"/>
                        <script file="/etc/init.d/drbd" name="drbd"/>
                        <script file="/etc/init.d/clvmd" name="clvmd"/>
                        <script file="/etc/init.d/tgtd" name="tgtd"/>
                </resources>
                <failoverdomains>
                        <failoverdomain name="an1_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node01.alteeve.ca"/>
                        </failoverdomain>
                        <failoverdomain name="an2_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node02.alteeve.ca"/>
                        </failoverdomain>
                        <failoverdomain name="an1_primary" nofailback="1" ordered="1" restricted="0">
                                <failoverdomainnode name="an-node01.alteeve.ca" priority="1"/>
                                <failoverdomainnode name="an-node02.alteeve.ca" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <service autostart="0" domain="an1_only" exclusive="0" max_restarts="0" name="an1_storage" recovery="restart" restart_expire_time="0">
                        <script ref="drbd">
                                <script ref="clvmd"/>
                        </script>
                </service>
                <service autostart="0" domain="an2_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart" restart_expire_time="0">
                        <script ref="drbd">
                                <script ref="clvmd"/>
                        </script>
                </service>
        </rm>
</cluster>
</source>
<source lang="bash">
rg_test test /etc/cluster/cluster.conf
</source>
<source lang="text">
Running in test mode.
Loading resource rule from /usr/share/cluster/oralistener.sh
Loading resource rule from /usr/share/cluster/apache.sh
Loading resource rule from /usr/share/cluster/SAPDatabase
Loading resource rule from /usr/share/cluster/postgres-8.sh
Loading resource rule from /usr/share/cluster/lvm.sh
Loading resource rule from /usr/share/cluster/mysql.sh
Loading resource rule from /usr/share/cluster/lvm_by_vg.sh
Loading resource rule from /usr/share/cluster/service.sh
Loading resource rule from /usr/share/cluster/samba.sh
Loading resource rule from /usr/share/cluster/SAPInstance
Loading resource rule from /usr/share/cluster/checkquorum
Loading resource rule from /usr/share/cluster/ocf-shellfuncs
Loading resource rule from /usr/share/cluster/svclib_nfslock
Loading resource rule from /usr/share/cluster/script.sh
Loading resource rule from /usr/share/cluster/clusterfs.sh
Loading resource rule from /usr/share/cluster/fs.sh
Loading resource rule from /usr/share/cluster/oracledb.sh
Loading resource rule from /usr/share/cluster/nfsserver.sh
Loading resource rule from /usr/share/cluster/netfs.sh
Loading resource rule from /usr/share/cluster/orainstance.sh
Loading resource rule from /usr/share/cluster/vm.sh
Loading resource rule from /usr/share/cluster/lvm_by_lv.sh
Loading resource rule from /usr/share/cluster/tomcat-6.sh
Loading resource rule from /usr/share/cluster/ip.sh
Loading resource rule from /usr/share/cluster/nfsexport.sh
Loading resource rule from /usr/share/cluster/openldap.sh
Loading resource rule from /usr/share/cluster/ASEHAagent.sh
Loading resource rule from /usr/share/cluster/nfsclient.sh
Loading resource rule from /usr/share/cluster/named.sh
Loaded 24 resource rules
=== Resources List ===
Resource type: ip
Instances: 1/1
Agent: ip.sh
Attributes:
  address = 192.168.2.100 [ primary unique ]
  monitor_link = on
  nfslock [ inherit("service%nfslock") ]


Take note of these concerns when planning which NIC to use on each subnet. These issues are presented in the order that they must be addressed in:
Resource type: script
Agent: script.sh
Attributes:
  name = drbd [ primary unique ]
  file = /etc/init.d/drbd [ unique required ]
  service_name [ inherit("service%name") ]


# If your nodes have [[IPMI]] piggy-backing on a normal NIC, that NIC '''''must''''' be used on <span class="code">BCN</span> subnet. Having your fence device accessible on a subnet that can be remotely accessed can pose a ''major'' security risk.
Resource type: script
# The fastest NIC should be used for your <span class="code">SN</span> subnet. Be sure to know which NICs support the largest jumbo frames when considering this.
Agent: script.sh
# If you still have two NICs to choose from, use the fastest remaining NIC for your <span class="code">BCN</span> subnet. This will minimize the time it takes to perform tasks like hot-migration of live virtual machines.
Attributes:
# The final NIC should be used for the <span class="code">IFN</span> subnet.
  name = clvmd [ primary unique ]
  file = /etc/init.d/clvmd [ unique required ]
  service_name [ inherit("service%name") ]


== Node IP Addresses ==
Resource type: script
Agent: script.sh
Attributes:
  name = tgtd [ primary unique ]
  file = /etc/init.d/tgtd [ unique required ]
  service_name [ inherit("service%name") ]


Obviously, the IP addresses you give to your nodes should be ones that suit you best. In this example, the following IP addresses are used:
Resource type: service [INLINE]
Instances: 1/1
Agent: service.sh
Attributes:
  name = an1_storage [ primary unique required ]
  domain = an1_only [ reconfig ]
  autostart = 0 [ reconfig ]
  exclusive = 0 [ reconfig ]
  nfslock = 0
  nfs_client_cache = 0
  recovery = restart [ reconfig ]
  depend_mode = hard
  max_restarts = 0
  restart_expire_time = 0
  priority = 0


{|
Resource type: service [INLINE]
|class="cell_br"|&nbsp;
Instances: 1/1
|class="cell_tbr"|'''Internet-Facing Network''' (<span class="code">IFN</span>)
Agent: service.sh
|class="cell_tbr"|'''Storage Network''' (<span class="code">SN</span>)
Attributes:
|class="cell_tbr"|'''Back-Channel Network''' (<span class="code">BCN</span>)
  name = an2_storage [ primary unique required ]
|-
  domain = an2_only [ reconfig ]
|class="cell_blr"|'''an-node01'''
  autostart = 0 [ reconfig ]
|class="cell_br"|<span class="code">192.168.1.71</span>
  exclusive = 0 [ reconfig ]
|class="cell_br"|<span class="code">192.168.2.71</span>
  nfslock = 0
|class="cell_br"|<span class="code">192.168.3.71</span>
  nfs_client_cache = 0
|-
  recovery = restart [ reconfig ]
|class="cell_blr"|'''an-node02'''
  depend_mode = hard
|class="cell_br"|<span class="code">192.168.1.72</span>
  max_restarts = 0
|class="cell_br"|<span class="code">192.168.2.72</span>
  restart_expire_time = 0
|class="cell_br"|<span class="code">192.168.3.72</span>
  priority = 0
|}


== Disable The NetworkManager Daemon ==
Resource type: service [INLINE]
Instances: 1/1
Agent: service.sh
Attributes:
  name = san_ip [ primary unique required ]
  domain = an1_primary [ reconfig ]
  autostart = 0 [ reconfig ]
  exclusive = 0 [ reconfig ]
  nfslock = 0
  nfs_client_cache = 0
  recovery = relocate [ reconfig ]
  depend_mode = hard
  max_restarts = 0
  restart_expire_time = 0
  priority = 0


Some cluster software '''will not''' start with <span class="code">NetworkManager</span> running! This is because <span class="code">NetworkManager</span> is designed to be a highly-adaptive, easy to use network configuration system that can adapt to frequent changes in a network. For workstations and laptops, this is wonderful. For clustering, this can be disastrous. We need to ensure that, once set, the network will not change.
=== Resource Tree ===
service (S0) {
  name = "an1_storage";
  domain = "an1_only";
  autostart = "0";
  exclusive = "0";
  nfslock = "0";
  nfs_client_cache = "0";
  recovery = "restart";
  depend_mode = "hard";
  max_restarts = "0";
  restart_expire_time = "0";
  priority = "0";
  script (S0) {
    name = "drbd";
    file = "/etc/init.d/drbd";
    service_name = "an1_storage";
    script (S0) {
      name = "clvmd";
      file = "/etc/init.d/clvmd";
      service_name = "an1_storage";
    }
  }
}
service (S0) {
  name = "an2_storage";
  domain = "an2_only";
  autostart = "0";
  exclusive = "0";
  nfslock = "0";
  nfs_client_cache = "0";
  recovery = "restart";
  depend_mode = "hard";
  max_restarts = "0";
  restart_expire_time = "0";
  priority = "0";
  script (S0) {
    name = "drbd";
    file = "/etc/init.d/drbd";
    service_name = "an2_storage";
    script (S0) {
      name = "clvmd";
      file = "/etc/init.d/clvmd";
      service_name = "an2_storage";
    }
  }
}
service (S0) {
  name = "san_ip";
  domain = "an1_primary";
  autostart = "0";
  exclusive = "0";
  nfslock = "0";
  nfs_client_cache = "0";
  recovery = "relocate";
  depend_mode = "hard";
  max_restarts = "0";
  restart_expire_time = "0";
  priority = "0";
  ip (S0) {
    address = "192.168.2.100";
    monitor_link = "on";
    nfslock = "0";
  }
}
=== Failover Domains ===
Failover domain: an1_only
Flags: Restricted No Failback
  Node an-node01.alteeve.ca (id 1, priority 0)
Failover domain: an2_only
Flags: Restricted No Failback
  Node an-node02.alteeve.ca (id 2, priority 0)
Failover domain: an1_primary
Flags: Ordered No Failback
  Node an-node01.alteeve.ca (id 1, priority 1)
  Node an-node02.alteeve.ca (id 2, priority 2)
=== Event Triggers ===
Event Priority Level 100:
  Name: Default
    (Any event)
    File: /usr/share/cluster/default_event_script.sl
[root@an-node01 ~]# cman_tool version -r
You have not authenticated to the ricci daemon on an-node01.alteeve.ca
Password:
[root@an-node01 ~]# clusvcadm -e service:an1_storage
Local machine trying to enable service:an1_storage...Success
service:an1_storage is now running on an-node01.alteeve.ca
[root@an-node01 ~]# cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:924 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:916 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
</source>
<source lang="bash">
cman_tool version -r
</source>
<source lang="text">
You have not authenticated to the ricci daemon on an-node01.alteeve.ca
Password:
</source>


Disable <span class="code">NetworkManager</span> from starting with the system.
'''<span class="code">an-node01</span>''':


<source lang="bash">
<source lang="bash">
chkconfig NetworkManager off
clusvcadm -e service:an1_storage
chkconfig --list NetworkManager
</source>
</source>
<source lang="text">
<source lang="text">
NetworkManager 0:off 1:off 2:off 3:off 4:off 5:off 6:off
service:an1_storage is now running on an-node01.alteeve.ca
</source>
</source>


The second command shows us that <span class="code">NetworkManager</span> is now disabled in all [[run-levels]].
'''<span class="code">an-node02</span>''':


== Enable the network Daemon ==
<source lang="bash">
clusvcadm -e service:an2_storage
</source>
<source lang="text">
service:an2_storage is now running on an-node02.alteeve.ca
</source>


The first step is to map your physical interfaces to the desired <span class="code">ethX</span> name. There is an existing tutorial that will show you how to do this.
'''Either'''


* [[Changing the ethX to Ethernet Device Mapping in RHEL6]]
<source lang="bash">
cat /proc/drbd
</source>
<source lang="text">
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:924 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:916 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
</source>


There are a few ways to configure <span class="code">network</span> in Fedora:
== Configure Clustered LVM ==


* <span class="code">system-config-network</span> (graphical)
<span class="code">an-node01</span>''':
* <span class="code">system-config-network-tui</span> (ncurses)
* Directly editing the <span class="code">/etc/sysconfig/network-scripts/ifcfg-eth*</span> files.


If you decide that you want to hand-craft your network interfaces, take a look at the tutorial above. In it are example configuration files that are compatible with this tutorial. There are also links to documentation on what options are available in the network configuration files.
<source lang="bash">
cp /etc/lvm/lvm.conf /etc/lvm/lvm.conf.orig
vim /etc/lvm/lvm.conf
diff -u /etc/lvm/lvm.conf.orig /etc/lvm/lvm.conf
</source>
<source lang="diff">
--- /etc/lvm/lvm.conf.orig 2011-08-02 21:59:01.000000000 -0400
+++ /etc/lvm/lvm.conf 2011-08-02 22:00:17.000000000 -0400
@@ -50,7 +50,8 @@
    # By default we accept every block device:
-    filter = [ "a/.*/" ]
+    #filter = [ "a/.*/" ]
+    filter = [ "a|/dev/drbd*|", "r/.*/" ]
    # Exclude the cdrom drive
    # filter = [ "r|/dev/cdrom|" ]
@@ -308,7 +309,8 @@
    # Type 3 uses built-in clustered locking.
    # Type 4 uses read-only locking which forbids any operations that might
    # change metadata.
-    locking_type = 1
+    #locking_type = 1
+    locking_type = 3
    # Set to 0 to fail when a lock request cannot be satisfied immediately.
    wait_for_locks = 1
@@ -324,7 +326,8 @@
    # to 1 an attempt will be made to use local file-based locking (type 1).
    # If this succeeds, only commands against local volume groups will proceed.
    # Volume Groups marked as clustered will be ignored.
-    fallback_to_local_locking = 1
+    #fallback_to_local_locking = 1
+    fallback_to_local_locking = 0
    # Local non-LV directory that holds file-based locks while commands are
    # in progress.  A directory like /tmp that may get wiped on reboot is OK.
</source>
<source lang="bash">
rsync -av /etc/lvm/lvm.conf root@an-node02:/etc/lvm/
</source>
<source lang="text">
sending incremental file list
lvm.conf


'''''WARNING''''': Do '''not''' proceed until your node's networking is fully configured! This may be a small sub-section, but it is critical that you have everything setup properly before going any further!
sent 2412 bytes  received 247 bytes  5318.00 bytes/sec
total size is 24668  speedup is 9.28
</source>


== Update the Hosts File ==
Create the LVM PVs, VGs and LVs.


Some applications expect to be able to call nodes by their name. To accommodate this, and to ensure that inter-node communication takes place on the back-channel subnet, we remove any existing hostname entries and then add the following to the <span class="code">/etc/hosts</span> file:
'''<span class="code">an-node01</span>''':


'''Note''': Any pre-existing entries matching the name returned by <span class="code">uname -n</span> must be removed from <span class="code">/etc/hosts</span>. There is a good chance there will be an entry that resolves to <span class="code">127.0.0.1</span> which would cause problems later.
<source lang="bash">
pvcreate /dev/drbd{0,1}
</source>
<source lang="text">
  Physical volume "/dev/drbd0" successfully created
  Physical volume "/dev/drbd1" successfully created
</source>


Obviously, adapt the names and IPs to match your nodes and subnets. The only critical thing is to make sure that the name returned by <span class="code">uname -n</span> is resolvable to the back-channel subnet. I like to add a entries for all networks, but this is optional.
'''<span class="code">an-node02</span>''':
 
The updated <span class="code">/etc/hosts</span> file should look something like this:


<source lang="bash">
<source lang="bash">
vim /etc/hosts
pvscan
</source>
</source>
<source lang="text">
<source lang="text">
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
  PV /dev/drbd0                      lvm2 [421.50 GiB]
::1        localhost localhost.localdomain localhost6 localhost6.localdomain6
  PV /dev/drbd1                      lvm2 [27.95 GiB]
   Total: 2 [449.45 GiB] / in use: 0 [0  ] / in no VG: 2 [449.45 GiB]
</source>


# Internet Facing Network
'''<span class="code">an-node01</span>''':
192.168.1.71    an-node01 an-node01.alteeve.com an-node01.ifn
192.168.1.72    an-node02 an-node02.alteeve.com an-node02.ifn


# Storage Network
<source lang="bash">
192.168.2.71    an-node01.sn
vgcreate -c y hdd_vg0 /dev/drbd0 && vgcreate -c y sdd_vg0 /dev/drbd1
192.168.2.72    an-node02.sn
</source>
<source lang="text">
  Clustered volume group "hdd_vg0" successfully created
  Clustered volume group "ssd_vg0" successfully created
</source>


# Back Channel Network
'''<span class="code">an-node02</span>''':
192.168.3.71    an-node01.bcn
192.168.3.72    an-node02.bcn


# Node Assassins
<source lang="bash">
192.168.3.61    fence_na01 fence_na01.alteeve.com
vgscan
192.168.3.62    motoko motoko.alteeve.com
</source>
<source lang="text">
  Reading all physical volumes. This may take a while...
  Found volume group "ssd_vg0" using metadata type lvm2
  Found volume group "hdd_vg0" using metadata type lvm2
</source>
</source>


Now to test this, ping both nodes by their name, as returned by <span class="code">uname -n</span>, and make sure the ping packets are sent on the back channel network (<span class="code">192.168.1.0/24</span>).
'''<span class="code">an-node01</span>''':


<source lang="bash">
<source lang="bash">
ping -c 5 an-node01.alteeve.com
lvcreate -l 100%FREE -n lun0 /dev/hdd_vg0 && lvcreate -l 100%FREE -n lun1 /dev/ssd_vg0
</source>
</source>
<source lang="text">
<source lang="text">
PING an-node01 (192.168.1.71) 56(84) bytes of data.
  Logical volume "lun0" created
64 bytes from an-node01 (192.168.1.71): icmp_seq=1 ttl=64 time=0.399 ms
  Logical volume "lun1" created
64 bytes from an-node01 (192.168.1.71): icmp_seq=2 ttl=64 time=0.403 ms
</source>
64 bytes from an-node01 (192.168.1.71): icmp_seq=3 ttl=64 time=0.413 ms
 
64 bytes from an-node01 (192.168.1.71): icmp_seq=4 ttl=64 time=0.365 ms
'''<span class="code">an-node02</span>''':
64 bytes from an-node01 (192.168.1.71): icmp_seq=5 ttl=64 time=0.428 ms


--- an-node01 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4001ms
rtt min/avg/max/mdev = 0.365/0.401/0.428/0.030 ms
</source>
<source lang="bash">
<source lang="bash">
ping -c 5 an-node02.alteeve.com
lvscan
</source>
</source>
<source lang="text">
<source lang="text">
PING an-node02 (192.168.1.72) 56(84) bytes of data.
  ACTIVE            '/dev/ssd_vg0/lun1' [27.95 GiB] inherit
64 bytes from an-node02 (192.168.1.72): icmp_seq=1 ttl=64 time=0.419 ms
  ACTIVE            '/dev/hdd_vg0/lun0' [421.49 GiB] inherit
64 bytes from an-node02 (192.168.1.72): icmp_seq=2 ttl=64 time=0.405 ms
</source>
64 bytes from an-node02 (192.168.1.72): icmp_seq=3 ttl=64 time=0.416 ms
 
64 bytes from an-node02 (192.168.1.72): icmp_seq=4 ttl=64 time=0.373 ms
= iSCSI notes =
64 bytes from an-node02 (192.168.1.72): icmp_seq=5 ttl=64 time=0.396 ms


--- an-node02 ping statistics ---
IET vs tgt pros and cons needed.
5 packets transmitted, 5 received, 0% packet loss, time 4001ms
 
rtt min/avg/max/mdev = 0.373/0.401/0.419/0.030 ms
default iscsi port is 3260
</source>


If you did name your other nodes in <span class="code">/etc/hosts</span>, now is a good time to make sure that everything is working by pinging each interface by name and also pinging the fence devices.
''initiator'': This is the client.
''target'': This is the server side.
''sid'': Session ID; Found with <span class="code">iscsiadm -m session -P 1</span>. SID and sysfs path are not persistent, partially start-order based.
''iQN'': iSCSI Qualified Name; This is a string that uniquely identifies targets and initiators.


From <span class="code">an-node01</span>
'''Both''':


<source lang="bash">
<source lang="bash">
ping -c 5 an-node02
yum install iscsi-initiator-utils scsi-target-utils
ping -c 5 an-node02.ifn
ping -c 5 an-node02.sn
ping -c 5 an-node02.bcn
ping -c 5 fence_na01
ping -c 5 fence_na01.alteeve.com
ping -c 5 motoko
ping -c 5 motoko.alteeve.com
</source>
</source>


Then repeat the set of pings from <span class="code">an-node02</span> to the <span class="code">an-node01</span> networks and the fence devices.
'''<span class="code">an-node01</span>''':


From <span class="code">an-node02</span>
<source lang="bash">
cp /etc/tgt/targets.conf /etc/tgt/targets.conf.orig
vim /etc/tgt/targets.conf
diff -u /etc/tgt/targets.conf.orig /etc/tgt/targets.conf
</source>
<source lang="bash">
--- /etc/tgt/targets.conf.orig 2011-07-31 12:38:35.000000000 -0400
+++ /etc/tgt/targets.conf 2011-08-02 22:19:06.000000000 -0400
@@ -251,3 +251,9 @@
#        vendor_id VENDOR1
#    </direct-store>
#</target>
+
+<target iqn.2011-08.com.alteeve:an-clusterA.target01>
+ direct-store /dev/drbd0
+ direct-store /dev/drbd1
+ vendor_id Alteeve
</source>
<source lang="bash">
rsync -av /etc/tgt/targets.conf root@an-node02:/etc/tgt/
</source>
<source lang="text">
sending incremental file list
targets.conf


<source lang="bash">
sent 909 bytes  received 97 bytes  670.67 bytes/sec
ping -c 5 an-node01
total size is 7093  speedup is 7.05
ping -c 5 an-node01.ifn
ping -c 5 an-node01.sn
ping -c 5 an-node01.bcn
ping -c 5 fence_na01
ping -c 5 fence_na01.alteeve.com
ping -c 5 motoko
ping -c 5 motoko.alteeve.com
</source>
</source>
Be sure that, if your [[fence]] device uses a name, that you include entries to resolve it as well. You can see how I've done this with the two [[Node Assassin]] devices I use. The same applies to [[IPMI]] or other devices, if you plan to reference them by name.


Fencing will be discussed in more detail later on in this HowTo.
=== Update the cluster ===
 
<source lang="xml">
              <service autostart="0" domain="an1_only" exclusive="0" max_restarts="0" name="an1_storage" recovery="restart" restart_expire_time="0">
                        <script ref="drbd">
                                <script ref="clvmd">
                                        <script ref="tgtd"/>
                                </script>
                        </script>
                </service>
                <service autostart="0" domain="an2_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart" restart_expire_time="0">
                        <script ref="drbd">
                                <script ref="clvmd">
                                        <script ref="tgtd"/>
                                </script>
                        </script>
                </service>
</source>


== Disable Firewalls ==
= Connect to the SAN from a VM node =


In the spirit of keeping things simple, and understanding that this is a test cluster, we will flush netfilter tables and disable <span class="code">iptables</span> and <span class="code">ip6tables</span> from starting on our nodes.
'''<span class="code">an-node03+</span>''':


<source lang="bash">
<source lang="bash">
chkconfig --level 2345 iptables off
iscsiadm -m discovery -t sendtargets -p 192.168.2.100
/etc/init.d/iptables stop
</source>
chkconfig --level 2345 ip6tables off
<source lang="text">
/etc/init.d/ip6tables stop
192.168.2.100:3260,1 iqn.2011-08.com.alteeve:an-clusterA.target01
</source>
</source>


What I like to do in production clusters is disable the IP address on the internet-facing interfaces on the [[dom0]] machines. The only real connection to the interface is inside a [[VM]] designed to be a firewall running Shorewall. That VM will have two virtual interfaces connected to <span class="code">eth0</span> and <span class="code">eth2</span>. With that VM in place, and with all other VMs only having a virtual interface connected to <span class="code">eth0</span>, all Internet traffic is forced through the one firewall VM.
<source lang="bash">
iscsiadm --mode node --portal 192.168.2.100 --target iqn.2011-08.com.alteeve:an-clusterA.target01 --login
</source>
<source lang="text">
Logging in to [iface: default, target: iqn.2011-08.com.alteeve:an-clusterA.target01, portal: 192.168.2.100,3260]
Login to [iface: default, target: iqn.2011-08.com.alteeve:an-clusterA.target01, portal: 192.168.2.100,3260] successful.
</source>


When you are finished building your cluster, you may want to check out the Shorewall tutorial below.
<source lang="bash">
fdisk -l
</source>
<source lang="text">
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00062f4a


* [[Shorewall on RPM-based Servers]]
  Device Boot      Start        End      Blocks  Id  System
/dev/sda1  *           1          33      262144  83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              33        5255    41943040  83  Linux
/dev/sda3            5255        5777    4194304  82  Linux swap / Solaris


== Setup SSH Shared Keys ==
Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000


'''This is an optional step'''. Setting up shared keys will allow your nodes to pass files between one another and execute commands remotely without needing to enter a password. This is obviously somewhat risky from a security point of view. As such, it is up to you whether you do this or not. Keep in mind, this tutorial assumes that you are building a test cluster, so there is a focus on ease of use.
Disk /dev/sdb doesn't contain a valid partition table


If you're a little new to [[SSH]], it can be a bit confusing keeping connections straight in you head. When you connect to a remote machine, you start the connection on your machine as the user you are logged in as. This is the source user. When you call the remote machine, you tell the machine what user you want to log in as. This is the remote user.
Disk /dev/sdc: 30.0 GB, 30010245120 bytes
64 heads, 32 sectors/track, 28620 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000


You will need to create an SSH key for each source user, and then you will need to copy it's newly generated public key to each remote user's home directory that you want to connect to. In this example, we want to connect to either node, from either node, as the <span class="code">root</span> user. So we will create a key for each node's <span class="code">root</span> user and then copy the generated public key to the ''other'' node's <span class="code">root</span> user's directory.
Disk /dev/sdc doesn't contain a valid partition table
</source>
 
== Setup the VM Cluster ==


Here, simply, is what we will do.
Install RPMs.


* Log in to <span class="code">an-node01</span> as <span class="code">root</span>
<source lang="bash">
** Generate an ssh key
yum -y install lvm2-cluster cman fence-agents
** Copy the contents from <span class="code">/root/.ssh/id_rsa.pub</span>
</source>
* Log in to <span class="code">an-node02</span> as <span class="code">root</span>
** Edit the file <span class="code">/root/.ssh/authorized_keys</span>
** Paste in the content's of <span class="code">root@an-node01</span>'s public key.
** Generate an ssh key
** Copy the contents from <span class="code">/root/.ssh/id_rsa.pub</span>
* Log back in to <span class="code">an-node01</span> as <span class="code">root</span>
** Edit the file <span class="code">/root/.ssh/authorized_keys</span>
** Paste in the content's of <span class="code">root@an-node02</span>'s public key.


Here are the detailed steps.
Configure <span class="code">lvm.conf</span>.


For each user, on each machine you want to connect '''from''', run:
<source lang="bash">
cp /etc/lvm/lvm.conf /etc/lvm/lvm.conf.orig
vim /etc/lvm/lvm.conf
diff -u /etc/lvm/lvm.conf.orig /etc/lvm/lvm.conf
</source>
<source lang="diff">
--- /etc/lvm/lvm.conf.orig 2011-08-02 21:59:01.000000000 -0400
+++ /etc/lvm/lvm.conf 2011-08-03 00:35:45.000000000 -0400
@@ -308,7 +308,8 @@
    # Type 3 uses built-in clustered locking.
    # Type 4 uses read-only locking which forbids any operations that might
    # change metadata.
-    locking_type = 1
+    #locking_type = 1
+    locking_type = 3
    # Set to 0 to fail when a lock request cannot be satisfied immediately.
    wait_for_locks = 1
@@ -324,7 +325,8 @@
    # to 1 an attempt will be made to use local file-based locking (type 1).
    # If this succeeds, only commands against local volume groups will proceed.
    # Volume Groups marked as clustered will be ignored.
-    fallback_to_local_locking = 1
+    #fallback_to_local_locking = 1
+    fallback_to_local_locking = 0
    # Local non-LV directory that holds file-based locks while commands are
    # in progress.  A directory like /tmp that may get wiped on reboot is OK.
</source>


<source lang="bash">
<source lang="bash">
# The '2047' is just to screw with brute-forces a bit.
rsync -av /etc/lvm/lvm.conf root@an-node04:/etc/lvm/
ssh-keygen -t rsa -N "" -b 2047 -f ~/.ssh/id_rsa
</source>
</source>
<source lang="text">
<source lang="text">
Generating public/private rsa key pair.
sending incremental file list
Your identification has been saved in /root/.ssh/id_rsa.
lvm.conf
Your public key has been saved in /root/.ssh/id_rsa.pub.
 
The key fingerprint is:
sent 873 bytes  received 247 bytes  2240.00 bytes/sec
08:d8:ed:72:38:61:c5:0e:cf:bf:dc:28:e5:3c:a7:88 root@an-node01.alteeve.com
total size is 24625  speedup is 21.99
The key's randomart image is:
+--[ RSA 2047]----+
|    ..          |
|  o.o.          |
|  . ==.          |
|  . =+.        |
|    + +.S        |
|    +  o        |
|      = +      |
|    ...B o      |
|    E ...+      |
+-----------------+
</source>
</source>
<source lang="bash">
rsync -av /etc/lvm/lvm.conf root@an-node05:/etc/lvm/
</source>
<source lang="text">
sending incremental file list
lvm.conf


This will create two files: the private key called <span class="code">~/.ssh/id_dsa</span> and the public key called <span class="code">~/.ssh/id_dsa.pub</span>. The private '''''must never''''' be group or world readable! That is, it should be set to mode <span class="code">0600</span>.
sent 873 bytes  received 247 bytes  2240.00 bytes/sec
total size is 24625  speedup is 21.99
</source>


The two files should look like:
Config the cluster.


'''Private key''':
<source lang="bash">
<source lang="bash">
cat ~/.ssh/id_rsa
vim /etc/cluster/cluster.conf
</source>
<source lang="xml">
<?xml version="1.0"?>
<cluster config_version="5" name="an-clusterB">
        <totem rrp_mode="none" secauth="off"/>
        <clusternodes>
                <clusternode name="an-node03.alteeve.ca" nodeid="1">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="3"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node04.alteeve.ca" nodeid="2">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="4"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node05.alteeve.ca" nodeid="3">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="5"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
        </fencedevices>
        <fence_daemon post_join_delay="30"/>
        <rm>
                <resources>
                        <script file="/etc/init.d/iscsi" name="iscsi" />
                        <script file="/etc/init.d/clvmd" name="clvmd" />
                </resources>
                <failoverdomains>
                        <failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node03.alteeve.ca" />
                        </failoverdomain>
                        <failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node04.alteeve.ca" />
                        </failoverdomain>
                        <failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node05.alteeve.ca" />
                        </failoverdomain>
                </failoverdomains>
                <service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an1_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd"/>
                        </script>
                </service>
                <service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd"/>
                        </script>
                </service>
                <service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd"/>
                        </script>
                </service>
        </rm> 
</cluster>
</source>
<source lang="bash">
ccs_config_validate
</source>
</source>
<source lang="text">
<source lang="text">
-----BEGIN RSA PRIVATE KEY-----
Configuration validates
MIIEoQIBAAKCAQBlL42DC+NJVpJ0rdrWQ1rxGEbPrDoe8j8+RQx3QYiB014R7jY5
EaTenThxG/cudgbLluFxq6Merfl9Tq2It3k9Koq9nV9ZC/vXBcl4MC7pGSQaUw2h
DVwI7OCSWtnS+awR/1d93tANXRwy7K5ic1pcviJeN66dPuuPqJEF/SKE7yEBapMq
sN28G4IiLdimsV+UYXPQLOiMy5stmyGhFhQGH3kxYzJPOgiwZEFPZyXinGVoV+qa
9ERSjSKAL+g21zbYB/XFK9jLNSJqDIPa//wz0T+73agZ0zNlxygmXcJvapEsFGDG
O6tcy/3XlatSxjEZvvfdOnC310gJVp0bcyWDAgMBAAECggEAMZd0y91vr+n2Laln
r8ujLravPekzMyeXR3Wf/nLn7HkjibYubRnwrApyNz11kBfYjL+ODqAIemjZ9kgx
VOhXS1smVHhk2se8zk3PyFAVLblcsGo0K9LYYKd4CULtrzEe3FNBFje10FbqEytc
7HOMvheR0IuJ0Reda/M54K2H1Y6VemtMbT+aTcgxOSOgflkjCTAeeOajqP5r0TRg
1tY6/k46hLiBka9Oaj+QHHoWp+aQkb+ReHUBcUihnz3jcw2u8HYrQIO4+v4Ud2kr
C9QHPW907ykQTMAzhMvZ3DIOcqTzA0r857ps6FANTM87tqpse5h2KfdIjc0Ok/AY
eKgYAQKBgQDm/P0RygIJl6szVhOb5EsQU0sBUoMT3oZKmPcjHSsyVFPuEDoq1FG7
uZYMESkVVSYKvv5hTkRuVOqNE/EKtk5bwu4mM0S3qJo99cLREKB6zNdBp9z2ACDn
0XIIFIalXAPwYpoFYi1YfG8tFfSDvinLI6JLDT003N47qW1cC5rmgQKBgHAkbfX9
8u3LiT8JqCf1I+xoBTwH64grq/7HQ+PmwRqId+HyyDCm9Y/mkAW1hYQB+cL4y3OO
kGL60CZJ4eFiTYrSfmVa0lTbAlEfcORK/HXZkLRRW03iuwdAbZ7DIMzTvY2HgFlU
L1CfemtmzEC4E6t5/nA4Ytk9kPSlzbzxfXIDAoGAY/WtaqpZ0V7iRpgEal0UIt94
wPy9HrcYtGWX5Yk07VXS8F3zXh99s1hv148BkWrEyLe4i9F8CacTzbOIh1M3e7xS
pRNgtH3xKckV4rVoTVwh9xa2p3qMwuU/jMGdNygnyDpTXusKppVK417x7qU3nuIv
1HzJNPwz6+u5GLEo+oECgYAs++AEKj81dkzytXv3s1UasstOvlqTv/j5dZNdKyZQ
72cvgsUdBwxAEhu5vov1XRmERWrPSuPOYI/4m/B5CYbTZgZ/v8PZeBTg17zgRtgo
qgJq4qu+fXHKweR3KAzTPSivSiiJLMTiEWb5CD5sw6pYQdJ3z5aPUCwChzQVU8Wf
YwKBgQCvoYG7gwx/KGn5zm5tDpeWb3GBJdCeZDaj1ulcnHR0wcuBlxkw/TcIadZ3
kqIHlkjll5qk5EiNGNlnpHjEU9X67OKk211QDiNkg3KAIDMKBltE2AHe8DhFsV8a
Mc/t6vHYZ632hZ7b0WNuudB4GHJShOumXD+NfJgzxqKJyfGkpQ==
-----END RSA PRIVATE KEY-----
</source>
</source>


'''Public key''':
Make sure iscsi and clvmd do not start on boot, stop both, then make sure they start and stop cleanly.
 
<source lang="bash">
<source lang="bash">
cat ~/.ssh/id_rsa.pub
chkconfig clvmd off; chkconfig iscsi off; /etc/init.d/iscsi stop && /etc/init.d/clvmd stop
</source>
</source>
<source lang="text">
<source lang="text">
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAGUvjYML40lWknSt2tZDWvEYRs+sOh7yPz5FDHdBiIHTXhHuNjkRpN6dOHEb9y52BsuW4XGrox6t+X1OrYi3eT0qir2dX1kL+9cFyXgwLukZJBpTDaENXAjs4JJa2dL5rBH/V33e0A1dHDLsrmJzWly+Il43rp0+64+okQX9IoTvIQFqkyqw3bwbgiIt2KaxX5Rhc9As6IzLmy2bIaEWFAYfeTFjMk86CLBkQU9nJeKcZWhX6pr0RFKNIoAv6DbXNtgH9cUr2Ms1ImoMg9r//DPRP7vdqBnTM2XHKCZdwm9qkSwUYMY7q1zL/deVq1LGMRm+9906cLfXSAlWnRtzJYM= root@an-node01.alteeve.com
Stopping iscsi:                                            [  OK  ]
</source>
<source lang="bash">
/etc/init.d/clvmd start && /etc/init.d/iscsi start && /etc/init.d/iscsi stop && /etc/init.d/clvmd stop
</source>
<source lang="text">
Starting clvmd:
Activating VG(s):  No volume groups found
                                                          [  OK  ]
Starting iscsi:                                            [  OK  ]
Stopping iscsi:                                            [  OK  ]
Signaling clvmd to exit                                    [  OK  ]
clvmd terminated                                          [  OK  ]
</source>
</source>


Copy the public key and then <span class="code">ssh</span> normally into the remote machine as the <span class="code">root</span> user. Create a file called <span class="code">~/.ssh/authorized_keys</span> and paste in the key.
Use the cluster to stop (in case it autostarted before now) and then start the services.
 
From '''an-node01''', type:


<source lang="bash">
<source lang="bash">
ssh root@an-node02
# Disable (stop)
clusvcadm -d service:an3_storage
clusvcadm -d service:an4_storage
clusvcadm -d service:an5_storage
# Enable (start)
clusvcadm -e service:an3_storage -m an-node03.alteeve.ca
clusvcadm -e service:an4_storage -m an-node04.alteeve.ca
clusvcadm -e service:an5_storage -m an-node05.alteeve.ca
# Check
clustat
</source>
</source>
<source lang="text">
<source lang="text">
The authenticity of host 'an-node02 (192.168.1.72)' can't be established.
Cluster Status for an-clusterB @ Wed Aug  3 00:25:10 2011
RSA key fingerprint is d4:1b:68:5f:fa:ef:0f:0c:16:e7:f9:0a:8d:69:3e:c5.
Member Status: Quorate
Are you sure you want to continue connecting (yes/no)? yes
 
Warning: Permanently added 'an-node02,192.168.1.72' (RSA) to the list of known hosts.
Member Name                            ID  Status
root@an-node02's password:  
------ ----                            ---- ------
Last login: Fri Oct  1 20:07:01 2010 from 192.168.1.102
an-node03.alteeve.ca                        1 Online, Local, rgmanager
an-node04.alteeve.ca                        2 Online, rgmanager
an-node05.alteeve.ca                        3 Online, rgmanager
 
Service Name                  Owner (Last)                   State       
------- ----                  ----- ------                  -----       
service:an3_storage            an-node03.alteeve.ca          started     
service:an4_storage            an-node04.alteeve.ca          started     
service:an5_storage            an-node05.alteeve.ca          started     
</source>
 
== Flush iSCSI's Cache ==
 
If you remove an iQN (or change the name of one), the <span class="code">/etc/init.d/iscsi</span> script will return errors. To flush it and re-scan:
 
I am sure there is a more elegant way.
 
<source lang="bash">
/etc/init.d/iscsi stop && rm -rf /var/lib/iscsi/nodes/* && iscsiadm -m discovery -t sendtargets -p 192.168.2.100
</source>
</source>


You will now be logged into <span class="code">an-node02</span> as the <span class="code">root</span> user. Create the <span class="code">~/.ssh/authorized_keys</span> file and paste into it the public key from <span class="code">an-node01</span>. If the remote machine's user hasn't used <span class="code">ssh</span> yet, their <span class="code">~/.ssh</span> directory will not exist.
== Setup the VM Cluster's Clustered LVM ==
 
=== Partition the SAN disks ===
 
'''<span class="code">an-node03</span>''':


<source lang="bash">
<source lang="bash">
cat ~/.ssh/authorized_keys
fdisk -l
</source>
</source>
<source lang="text">
<source lang="text">
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAGUvjYML40lWknSt2tZDWvEYRs+sOh7yPz5FDHdBiIHTXhHuNjkRpN6dOHEb9y52BsuW4XGrox6t+X1OrYi3eT0qir2dX1kL+9cFyXgwLukZJBpTDaENXAjs4JJa2dL5rBH/V33e0A1dHDLsrmJzWly+Il43rp0+64+okQX9IoTvIQFqkyqw3bwbgiIt2KaxX5Rhc9As6IzLmy2bIaEWFAYfeTFjMk86CLBkQU9nJeKcZWhX6pr0RFKNIoAv6DbXNtgH9cUr2Ms1ImoMg9r//DPRP7vdqBnTM2XHKCZdwm9qkSwUYMY7q1zL/deVq1LGMRm+9906cLfXSAlWnRtzJYM= root@an-node01.alteeve.com
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00062f4a
 
  Device Boot      Start        End      Blocks  Id  System
/dev/sda1  *          1          33      262144  83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              33        5255    41943040  83  Linux
/dev/sda3            5255        5777    4194304  82  Linux swap / Solaris
 
Disk /dev/sdc: 30.0 GB, 30010245120 bytes
64 heads, 32 sectors/track, 28620 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
 
Disk /dev/sdc doesn't contain a valid partition table
 
Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
 
Disk /dev/sdb doesn't contain a valid partition table
</source>
</source>


Now log out and then log back into the remote machine. This time, the connection should succeed without having entered a password!
Create partitions.


= Cluster Setup, Part 1 =
<source lang="bash">
fdisk /dev/sdb
</source>
<source lang="text">
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0x403f1fb8.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.


There will be two stages to setting up the cluster.
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)


* Part 1; Setting up the core cluster.
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
* Setting up DRBD, Clustered LVM, GFS2 and Xen then provisioning a Virtual server.
        switch off the mode (command 'c') and change display units to
* Part 2; Adding DRBD and Xen domU VMs to the cluster.
        sectors (command 'u').


== A Word On Complexity ==
Command (m for help): c
DOS Compatibility flag is not set


Clustering is not inherently hard, but it is inherently complex. Consider;
Command (m for help): u
Changing display/entry units to sectors


* Any given program has <span class="code">N</span> bugs.
Command (m for help): n
** [[RHCS]] uses; <span class="code">cman</span>, <span class="code">corosync</span>, <span class="code">totem</span>, <span class="code">fenced</span>, <span class="code">rgmanager</span>, <span class="code">dlm</span>, <span class="code">qdisk</span> and <span class="code">GFS2</span>,
Command action
** We will be adding <span class="code">DRBD</span>, <span class="code">CLVM</span> and <span class="code">Xen</span>.
  e  extended
** Right there, we have <span class="code">N^11</span> possible bugs. We'll call this <span class="code">A</span>.
  p  primary partition (1-4)
* A cluster has <span class="code">Y</span> nodes.
p
** In our case, <span class="code">2</span> nodes, each with <span class="code">3</span> networks.
Partition number (1-4): 1
** The network infrastructure (Switches, routers, etc). If you use managed switches, add another layer of complexity.
First cylinder (1-55022, default 1): 1
** This gives us another <span class="code">Y^(2*3)</span>, and then <span class="code">^2</span> again for managed switches. We'll call this <span class="code">B</span>.
Last cylinder, +cylinders or +size{K,M,G} (1-55022, default 55022):
* Let's add the human factor. Let's say that a person needs roughly 5 years of cluster experience to be considered an expert. For each year less than this, add a <span class="code">Z</span> "oops" factor, <span class="code">(5-Z)^2</span>. We'll call this <span class="code">C</span>.
Using default value 55022
* So, finally, add up the complexity, using this tutorial's layout, 0-years of experience and managed switches.
** <span class="code">(N^11) * (Y^(2*3)^2) * ((5-0)^2) == (A * B * C)</span> == an-unknown-but-big-number.


This isn't meant to scare you away, but it is meant to be a sobering statement. Obviously, those numbers are somewhat artificial, but the point remains.
Command (m for help): p


Any one piece is easy to understand, thus, clustering is inherently easy. However, given the large number of variables, you must really understand all the pieces and how they work together. '''''DO NOT''''' think that you will have this mastered and working in a month. Certainly don't try to sell clusters as a service without a ''lot'' of internal testing.
Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x403f1fb8


Clustering is kind of like chess. The rules are pretty straight forward, but the complexity can take some time to master.
  Device Boot      Start        End      Blocks  Id  System
/dev/sdb1              1      55022  441964183+  83  Linux


== An Overview Before We Begin ==
Command (m for help): t
Selected partition 1
Hex code (type L to list codes): 8e
Changed system type of partition 1 to 8e (Linux LVM)


When looking at a cluster, there is a tendency to want to dive right into the configuration file. That is not very useful in clustering.
Command (m for help): p


* When you look at the configuration file, it is quite short.
Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x403f1fb8


It isn't like most applications or technologies though. Most of us learn by taking something, like a configuration file, and tweaking it this way and that to see what happens. I tried that with clustering and learned only what it was like to bang my head against the wall.
  Device Boot      Start        End      Blocks  Id  System
/dev/sdb1              1      55022  441964183+  8e  Linux LVM


* Understanding the parts and how they work together is critical.
Command (m for help): w
The partition table has been altered!


You will find that the discussion on the components of clustering, and how those components and concepts interact, will be much longer than the initial configuration. It is true that we could talk very briefly about the actual syntax, but it would be a disservice. Please, don't rush through the next section or, worse, skip it and go right to the configuration. You will waste far more time than you will save.
Calling ioctl() to re-read partition table.
Syncing disks.
</source>


* Clustering is easy, but it has a complex web of inter-connectivity. You must grasp this network if you want to be an effective cluster administrator!
<source lang="bash">
fdisk /dev/sdc
</source>
<source lang="text">
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0xba7503eb.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.


=== Component; cman ===
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)


This was, traditionally, the <span class="code">c</span>luster <span class="code">man</span>ager. In the 3.0 series, it acts mainly as a service manager, handling the starting and stopping of clustered services. In the 3.1 series, <span class="code">cman</span> will be removed entirely.
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
        switch off the mode (command 'c') and change display units to
        sectors (command 'u').


=== Component; corosync ===
Command (m for help): c
DOS Compatibility flag is not set


Corosync is the heart of the cluster. All other computers operate though this component, and no cluster component can work without it. Further, it is shared between both Pacemaker and RHCS clusters.
Command (m for help): u
Changing display/entry units to sectors


In Red Hat clusters, <span class="code">corosync</span> is configured via the central <span class="code">cluster.conf</span> file. In Pacemaker clusters, it is configured directly in <span class="code">corosync.conf</span>. As we will be building an RHCS, we will only use <span class="code">cluster.conf</span>. That said, (almost?) all <span class="code">corosync.conf</span> options are available in <span class="code">cluster.conf</span>. This is important to note as you will see references to both configuration files when searching the Internet.
Command (m for help): n
Command action
  e  extended
  p  primary partition (1-4)
p
Partition number (1-4): 1
First sector (2048-58613759, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-58613759, default 58613759):
Using default value 58613759


=== Concept; quorum ===
Command (m for help): t
Selected partition 1
Hex code (type L to list codes): 8e
Changed system type of partition 1 to 8e (Linux LVM)


[[Quorum]] is defined as a collection of machines and devices in a cluster with a clear majority of votes.
Command (m for help): p


The idea behind quorum is that, which ever group of machines has it, can safely start clustered services even when defined members are not accessible.
Disk /dev/sdc: 30.0 GB, 30010245120 bytes
64 heads, 32 sectors/track, 28620 cylinders, total 58613760 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xba7503eb


Take this scenario;
  Device Boot      Start        End      Blocks  Id  System
/dev/sdc1            2048    58613759    29305856  8e  Linux LVM


* You have a cluster of four nodes, each with one vote.
Command (m for help): w
** The cluster's <span class="code">expected_votes</span> is <span class="code">4</span>. A clear majority, in this case, is <span class="code">3</span> because <span class="code">(4/2)+1</span>, rounded down, is <span class="code">3</span>.
The partition table has been altered!
** Now imagine that there is a failure in the network equipment and one of the nodes disconnects from the rest of the cluster.
** You now have two partitions; One partition contains three machines and the other partition has one.
** The three machines will have quorum, and the other machine will lose quorum.
** The partition with quorum will reconfigure and continue to provide cluster services.
** The partition without quorum will withdraw from the cluster and shut down all cluster services.


This behaviour acts as a guarantee that the two partitions will never try to access the same clustered resources, like a shared filesystem, thus guaranteeing the safety of those shared resources.
Calling ioctl() to re-read partition table.
Syncing disks.
</source>


This also helps explain why an even <span class="code">50%</span> is not enough to have quorum, a common question for people new to clustering. Using the above scenario, imagine if the split were 2-nodes and 2-nodes. Because either can't be sure what the other would do, neither can safely proceed. If we allowed an even 50% to have quorum, both partition might try to take over the clustered services and disaster would soon follow.
<source lang="bash">
fdisk -l
</source>
<source lang="text">
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00062f4a


There is one, and '''only''' one except to this rule.
  Device Boot      Start        End      Blocks  Id  System
/dev/sda1  *          1          33      262144  83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              33        5255    41943040  83  Linux
/dev/sda3            5255        5777    4194304  82  Linux swap / Solaris


In the case of a two node cluster, as we will be building here, any failure results in a 50/50 split. If we enforced quorum in a two-node cluster, there would never be high availability because and failure would cause both nodes to withdraw. The risk with this exception is that we now place the entire safety of the cluster on [[fencing]], a concept we will cover in a second. Fencing is a second line of defense and something we are loath to rely on alone.
Disk /dev/sdc: 30.0 GB, 30010245120 bytes
64 heads, 32 sectors/track, 28620 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xba7503eb


Even in a two-node cluster though, proper quorum can be maintained by using a quorum disk, called a [[qdisk]]. This is another topic we will touch on in a moment. This tutorial will implement a <span class="code">qdisk</span> specifically so that we can get away from this <span class="code">two_node</span> exception.
  Device Boot      Start        End      Blocks  Id  System
/dev/sdc1              2      28620    29305856  8e  Linux LVM


=== Concept; Virtual Synchrony ===
Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x403f1fb8


All cluster operations have to occur in the same order across all nodes. This concept is called "virtual synchrony", and it is provided by <span class="code">corosync</span> using "closed process groups", <span class="code">[[CPG]]</span>.
  Device Boot      Start        End      Blocks  Id  System
/dev/sdb1              1      55022  441964183+  8e  Linux LVM
</source>


Let's look at how locks are handled on clustered file systems as an example.
=== Setup LVM devices ===


* As various nodes want to work on files, they send a lock request to the cluster. When they are done, they send a lock release to the cluster.
Create PV.
** Lock and unlock messages must arrive in the same order to all nodes, regardless of the real chronological order that they were issued.
* Let's say one node sends out messages "<span class="code">a1 a2 a3 a4</span>". Meanwhile, the other node sends out "<span class="code">b1 b2 b3 b4</span>".
** All of these messages go to <span class="code">corosync</span> which gathers them up and sends them up and sorts them.
** It is totally possible that corosync will get the messages as "<span class="code">a2 b1 b2 a1 b4 a3 a4 b4</span>".
** The <span class="code">corosync</span> application will then ensure that all nodes get the messages in the above order, one at a time. All nodes must confirm that they got a given message before the next message is sent to any node.


This will tie into fencing and <span class="code">totem</span>, as we'll see in the next sections.
'''<span class="code">an-node03</span>''':


=== Concept; Fencing ===
<source lang="bash">
pvcreate /dev/sd{b,c}1
</source>
<source lang="text">
  Physical volume "/dev/sdb1" successfully created
  Physical volume "/dev/sdc1" successfully created
</source>


Fencing is a '''absolutely critical''' part of clustering. Without '''fully''' working fence devices, '''''your cluster will fail'''''.
'''<span class="code">an-node04</span> and <span class="code">an-node05</span>''':


Was that strong enough, or should I say that again? Let's be safe:
<source lang="bash">
pvscan
</source>
<source lang="text">
  PV /dev/sdb1                      lvm2 [421.49 GiB]
  PV /dev/sdc1                      lvm2 [27.95 GiB]
  Total: 2 [449.44 GiB] / in use: 0 [0  ] / in no VG: 2 [449.44 GiB]
</source>


'''''DO NOT BUILD A CLUSTER WITHOUT PROPER, WORKING AND TESTED FENCING'''''.
Create the VGs.


Sorry, I promise that this will be the only time that I speak so strongly. Fencing really is critical, and explaining the need for fencing is nearly a weekly event. So then, let's discuss fencing.
'''<span class="code">an-node03</span>''':


When a node stops responding, an internal timeout and counter start ticking away. During this time, no messages are moving through the cluster and the cluster is, essentially, hung. If the node responds in time, the timeout and counter reset and the cluster begins operating properly again.
<source lang="bash">
vgcreate -c y san_vg01 /dev/sdb1
</source>
<source lang="text">
  Clustered volume group "san_vg01" successfully created
</source>
<source lang="bash">
vgcreate -c y san_vg02 /dev/sdc1
</source>
<source lang="text">
  Clustered volume group "san_vg02" successfully created
</source>


If, on the other hand, the node does not respond in time, the node will be declared dead. The cluster will take a "head count" to see which nodes it still has contact with and will determine then if there are enough to have quorum. If so, the cluster will issue a "fence" against the silent node. This is a call to a program called <span class="code">fenced</span>, the fence daemon.
'''<span class="code">an-node04</span> and <span class="code">an-node05</span>''':


The fence daemon will look at the cluster configuration and get the fence devices configured for the dead node. Then, one at a time and in the order that they appear in the configuration, the fence daemon will call those fence devices, via their fence agents, passing to the fence agent any configured arguments like username, password, port number and so on. If the first fence agent returns a failure, the next fence agent will be called. If the second fails, the third will be called, then the forth and so on. Once the last (or perhaps only) fence device fails, the fence daemon will retry again, starting back at the start of the list. It will do this indefinitely until one of the fence devices success.
<source lang="bash">
vgscan
</source>
<source lang="text">
  Reading all physical volumes. This may take a while...
  Found volume group "san_vg02" using metadata type lvm2
  Found volume group "san_vg01" using metadata type lvm2
</source>


Here's the flow, in point form:
Create the first VM's LVs.


* The <span class="code">corosync</span> program collects messages and sends them off, one at a time, to all nodes.
'''<span class="code">an-node03</span>''':
* All nodes respond, and the next message is sent. Repeat continuously during normal operation.
* Suddenly, one node stops responding.
** Communication freezes while the cluster waits for the silent node.
** A timeout starts (<span class="code">300</span>ms by default), and each time the timeout is hit, and error counter increments.
** The silent node responds before the counter reaches the limit.
*** The counter is reset to <span class="code">0</span>
*** The cluster operates normally again.
* Again, one node stops responding.
** Again, the timeout begins and the error count increments each time the timeout is reached.
** Time error exceeds the limit (<span class="code">10</span> is the default); Three seconds have passed (<span class="code">300ms * 10</span>).
** The node is declared dead.
** The cluster checks which members it still has, and if that provides enough votes for quorum.
*** If there are too few votes for quorum, the cluster software freezes and the node(s) withdraw from the cluster.
*** If there are enough votes for quorum, the silent node is declared dead.
**** <span class="code">corosync</span> calls <span class="code">fenced</span>, telling it to fence the node.
**** Which fence device(s) to use, that is, what <span class="code">fence_agent</span> to call and what arguments to pass, is gathered.
**** For each configured fence device:
***** The agent is called and <span class="code">fenced</span> waits for the <span class="code">fence_agent</span> to exit.
***** The <span class="code">fence_agent</span>'s exit code is examined. If it's a success, recovery starts. If it failed, the next configured fence agent is called.
**** If all (or the only) configured fence fails, <span class="code">fenced</span> will start over.
**** <span class="code">fenced</span> will wait and loop forever until a fence agent succeeds. During this time, '''the cluster is hung'''.
** Once a <span class="code">fence_agent</span> succeeds, the cluster is reconfigured.
*** A new closed process group (<span class="code">cpg</span>) is formed.
*** A new fence domain is formed.
*** Lost cluster resources are recovered as per <span class="code">rgmanager</span>'s configuration (including file system recovery as needed).
*** Normal cluster operation is restored.


This skipped a few key things, but the general flow of logic should be there.
<source lang="bash">
lvcreate -L 10G -n shared01 /dev/san_vg01
</source>
<source lang="text">
  Logical volume "shared01" created
</source>
<source lang="bash">
lvcreate -L 50G -n vm0001_hdd1 /dev/san_vg01
</source>
<source lang="text">
  Logical volume "vm0001_hdd1" created
</source>
<source lang="bash">
lvcreate -L 10G -n vm0001_ssd1 /dev/san_vg02
</source>
<source lang="text">
  Logical volume "vm0001_ssd1" created
</source>


This is why fencing is so important. Without a properly configured and tested fence device or devices, the cluster will never successfully fence and the cluster will stay hung forever.
'''<span class="code">an-node04</span> and <span class="code">an-node05</span>''':


=== Component; totem ===
<source lang="bash">
lvscan
</source>
<source lang="text">
  ACTIVE            '/dev/san_vg01/shared01' [10.00 GiB] inherit
  ACTIVE            '/dev/san_vg02/vm0001_ssd1' [10.00 GiB] inherit
  ACTIVE            '/dev/san_vg01/vm0001_hdd1' [50.00 GiB] inherit
</source>


The <span class="code">totem</span> protocol defined message passing within the cluster and it is used by <span class="code">corosync</span>. A token is passed around all the nodes in the cluster, and the timeout discussed in [[Red_Hat_Cluster_Service_3_Tutorial#Concept;_Fencing|fencing]] above is actually a token timeout. The counter, then, is the number of lost tokens that are allowed before a node is considered dead.
== Create Shared GFS2 Partition ==


The <span class="code">totem</span> protocol supports something called '<span class="code">rrp</span>', '''R'''edundant '''R'''ing '''P'''rotocol. Through <span class="code">rrp</span>, you can add a second backup ring on a separate network to take over in the event of a failure in the first ring. In RHCS, these rings are known as "<span class="code">ring 0</span>" and "<span class="code">ring 1</span>".
'''<span class="code">an-node03</span>''':


=== Component; rgmanager ===
<source lang="bash">
mkfs.gfs2 -p lock_dlm -j 5 -t an-clusterB:shared01 /dev/san_vg01/shared01
</source>
<source lang="text">
This will destroy any data on /dev/san_vg01/shared01.
It appears to contain: symbolic link to `../dm-2'


When the cluster configuration changes, <span class="code">corosync</span> calls <span class="code">rgmanager</span>, the resource group manager. It will examine what changed and then will start, stop, migrate or recover cluster resources as needed.
Are you sure you want to proceed? [y/n] y


=== Component; qdisk ===
Device:                    /dev/san_vg01/shared01
Blocksize:                4096
Device Size                10.00 GB (2621440 blocks)
Filesystem Size:          10.00 GB (2621438 blocks)
Journals:                  5
Resource Groups:          40
Locking Protocol:          "lock_dlm"
Lock Table:                "an-clusterB:shared01"
UUID:                      6C0D7D1D-A1D3-ED79-705D-28EE3D674E75
</source>


If you have a cluster of <span class="code">2</span> to <span class="code">16</span> nodes, you can use a quorum disk. This is a small partition on shared storage device, like a [[SAN]] or [[DRBD]] device, that the cluster can use to make much better decisions about which nodes should have quorum when a split in the network happens.
Add it to <span class="code">/etc/fstab</span> (needed for the <span class="code">gfs2</span> init script to find and mount):


The way a <span class="code">qdisk</span> works, at it's most basic, is to have one or more votes in quorum. Generally, but not necessarily always, the <span class="code">qdisk</span> device has one vote less than the total number of nodes (<span class="code">N-1</span>).
'''<span class="code">an-node03</span> - <span class="code">an-node07</span>''':


* In a two node cluster, the <span class="code">qdisk</span> would have one vote.
<source lang="bash">
* In a seven node cluster, the <span class="code">qdisk</span> would have six votes.
echo `gfs2_edit -p sb /dev/san_vg01/shared01 | grep sb_uuid | sed -e "s/.*sb_uuid  *\(.*\)/UUID=\L\1\E \/shared01\t\tgfs2\trw,suid,dev,exec,nouser,async\t0 0/"` >> /etc/fstab
cat /etc/fstab
</source>
<source lang="bash">
#
# /etc/fstab
# Created by anaconda on Fri Jul  8 22:01:41 2011
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=2c1f4cb1-959f-4675-b9c7-5d753c303dd1 /                      ext3    defaults        1 1
UUID=9a0224dc-15b4-439e-8d7c-5f9dbcd05e3f /boot                  ext3    defaults        1 2
UUID=4f2a83e8-1769-40d8-ba2a-e1f535306848 swap                    swap    defaults        0 0
tmpfs                  /dev/shm                tmpfs  defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                  /sys                    sysfs  defaults        0 0
proc                    /proc                  proc    defaults        0 0
UUID=6c0d7d1d-a1d3-ed79-705d-28ee3d674e75 /shared01 gfs2 rw,suid,dev,exec,nouser,async 0 0
</source>


Imagine these two scenarios; First without <span class="code">qdisk</span>, the revisited to see how <span class="code">qdisk</span> helps.
Make the mount point and mount it.


* First Scenarion; A two node cluster, which we will implement here.
<source lang="bash">
mkdir /shared01
/etc/init.d/gfs2 start
</source>
<source lang="text">
Mounting GFS2 filesystem (/shared01):                      [  OK  ]
</source>
<source lang="bash">
df -h
</source>
<source lang="text">
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              40G  3.3G  35G  9% /
tmpfs                1.8G  32M  1.8G  2% /dev/shm
/dev/sda1            248M  85M  151M  36% /boot
/dev/mapper/san_vg01-shared01
                      10G  647M  9.4G  7% /shared01
</source>


If the network connection on the <span class="code">totem</span> ring(s) breaks, you will enter into a dangerous state called a "'''split-brain'''". Normally, this can't happen because quorum can only be held by one side at a time. In a <span class="code">two_node</span> cluster though, this is allowed.  
Stop GFS2 on all five nodes and update the cluster.conf config.


Without a <span class="code">qdisk</span>, either node could potentially start the cluster resources. This is a disastrous possibility and it is avoided by a fence dual. Both nodes will try to fence the other at the same time, but only the fastest one wins. The idea behind this is that one will always live because the other will die before it can get it's fence call out. In theory, this works fine. In practice though, there are cases where fence calls can be "queued", thus, in fact, allow both nodes to die. This defeats the whole "high availability" thing, now doesn't it? Also, this possibility is why the <span class="code">two_node</span> option is the '''''only''''' exception to the quorum rules.
<source lang="bash">
/etc/init.d/gfs2 stop
</source>
<source lang="text">
Unmounting GFS2 filesystem (/shared01):                    [  OK  ]
</source>
<source lang="bash">
df -h
</source>
<source lang="text">
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              40G  3.3G  35G  9% /
tmpfs                1.8G  32M  1.8G  2% /dev/shm
/dev/sda1            248M  85M  151M  36% /boot
/dev/mapper/san_vg01-shared01
                      10G  647M  9.4G  7% /shared01
</source>


So how does a <span class="code">qdisk</span> help?
'''<span class="code">an-node03</span>''':


Two ways!
<source lang="xml">
<?xml version="1.0"?>
<cluster config_version="9" name="an-clusterB">
        <totem rrp_mode="none" secauth="off"/>
        <clusternodes>
                <clusternode name="an-node03.alteeve.ca" nodeid="3">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="3"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node04.alteeve.ca" nodeid="4">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="4"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node05.alteeve.ca" nodeid="5">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="5"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node06.alteeve.ca" nodeid="6">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="6"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node07.alteeve.ca" nodeid="7">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="7"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
        </fencedevices>
        <fence_daemon post_join_delay="30"/>
        <rm>
                <resources>
                        <script file="/etc/init.d/iscsi" name="iscsi"/>
                        <script file="/etc/init.d/clvmd" name="clvmd"/>
                        <script file="/etc/init.d/gfs2" name="gfs2"/>
                </resources>
                <failoverdomains>
                        <failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node03.alteeve.ca"/>
                        </failoverdomain>
                        <failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node04.alteeve.ca"/>
                        </failoverdomain>
                        <failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node05.alteeve.ca"/>
                        </failoverdomain>
                        <failoverdomain name="an6_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node06.alteeve.ca"/>
                        </failoverdomain>
                        <failoverdomain name="an7_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node07.alteeve.ca"/>
                        </failoverdomain>
                </failoverdomains>
                <service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an3_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
                <service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
                <service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
                <service autostart="1" domain="an6_only" exclusive="0" max_restarts="0" name="an6_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
                <service autostart="1" domain="an7_only" exclusive="0" max_restarts="0" name="an7_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
        </rm>
</cluster>
</source>
<source lang="bash">
cman_tool version -r
</source>


First;
Check that <span class="code">rgmanager</span> picked up the updated config and remounted the GFS2 partition.


The biggest way it helps is by getting away from the <span class="code">two_node</span> exception. With the <span class="code">qdisk</span> partition, you are back up to three votes, so there will never be a 50/50 split. If either node retains access to the quorum disk while the other loses access, then right there things are decided. The one with the <span class="code">disk</span> has <span class="code">2</span> votes and wins quorum and will fence the other. Meanwhile, the other will only have <span class="code">1</span> votes, thus it will lose quorum, and will withdraw from the cluster and ''not'' try to fence the other node.
<source lang="bash">
df -h
</source>
<source lang="text">
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              40G  3.3G  35G  9% /
tmpfs                1.8G  32M  1.8G  2% /dev/shm
/dev/sda1            248M  85M  151M  36% /boot
/dev/mapper/san_vg01-shared01
                      10G  647M  9.4G  7% /shared01
</source>


Second;
= Configure KVM =


You can use [[heuristics]] with <span class="code">qdisk</span> to have a more intelligent partition recovery mechanism. For example, let's look again at the scenario where the link(s) between the two nodes hosting the <span class="code">totem</span> ring is cut. This time though, let's assume that the storage network link is still up, so both nodes have access to the <span class="code">qdisk</span> partition. How would the <span class="code">qdisk</span> act as a tie breaker?
Host network and VM hypervisor config.


One way is to have a heuristics test that checks to see if one of the nodes has access to a particular router. With this heuristics test, if only one node had access to that switch, the <span class="code">qdisk</span> would give it's vote to that node and ensure that the "healthiest" node survived. Pretty cool, eh?
== Disable the 'qemu' Bridge ==


* Second Scenarion; A seven node cluster with six dead members.
By default, <span class="code">[[libvirtd]]</span> creates a bridge called <span class="code">virbr0</span> designed to connect virtual machines to the first <span class="code">eth0</span> interface. Our system will not need this, so we will remove it. This bridge is configured in the <span class="code">/etc/libvirt/qemu/networks/default.xml</span> file.  


Admittedly, this is an extreme scenario, but it serves to illustrate the point well. Remember how we said that the general rule is that the <span class="code">qdisk</span> has <span class="code">N-1</span> votes?
So to remove this bridge, simply delete the contents of the file, stop the bridge, delete the bridge and then stop <span class="code">iptables</span> to make sure any rules created for the bridge are flushed.


With our seven node cluster, on it's own, there would be a total of <span class="code">7</span> votes, so normally quorum would require <span class="code">4</span> nodes be alive (<span class="code">((7/2)+1) = (3.5+1) = 4.5</span>, rounded down is <span class="code">4</span>). With the death of the fourth node, all cluster services would fail. We understand now why this would be the case, but what if the nodes are, for example, serving up websites? In this case, 3 nodes are still sufficient to do the job. Heck, even 1 node is better than nothing. With the rules of quorum though, it just wouldn't happen.
<source lang="bash">
cat /dev/null >/etc/libvirt/qemu/networks/default.xml
ifconfig virbr0 down
brctl delbr virbr0
/etc/init.d/iptables stop
</source>


Let's now look at how the <span class="code">qdisk</span> can help.
== Configure Bridges ==


By giving the <span class="code">qdisk</span> partition <span class="code">6</span> votes, you raise the cluster's total expected votes from <span class="code">7</span> to <span class="code">13</span>. With this new count, the votes needed to for quorum is <span class="code">7</span> (<span class="code">((13/2)+1) = (6.5+1) = 7.5</span>, rounded down is <span class="code">7</span>).
On '''<span class="code">an-node03</span>''' through '''<span class="code">an-node07</span>''':


So looking back at the scenario where we've lost four of our seven nodes; The surviving nodes have <span class="code">3</span> votes, but they can talk to the <span class="code">qdisk</span> which provides another <span class="code">6</span> votes, for a total of <span class="code">9</span>. With that, quorum is achieved and the three nodes are allowed to form a cluster and continue to provide services. Even if you lose all but one node, you are still in business because the one surviving node, which is still able to talk to the <span class="code">qdisk</span> and thus win it's <span class="code">6</span> votes, has a total of <span class="code">7</span> and thus has quorum!
<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-{eth,vbr}0
</source>


There is another benefit. As we mentioned in the first scenario, we can add heuristics to the <span class="code">qdisk</span>. Imagine that, rather than having six nodes die, they instead partition off because of a break in the network. Without <span class="code">qdisk</span>, the six nodes would easily win quorum, fence the one other node and then reform the cluster. What if, though, the one lone node was the only one with access to a critical route to the Internet? The six nodes would be useless in a web-server environment. With the heuristics provided by <span class="code">qdisk</span>, that one useful node would get the <span class="code">qdisk</span>'s 6 votes and win quorum over the other six nodes!
''<span class="code">ifcfg-eth0</span>'':
<source lang="bash">
# Internet facing
HWADDR="bc:ae:c5:44:8a:de"
DEVICE="eth0"
BRIDGE="vbr0"
BOOTPROTO="static"
IPV6INIT="yes"
NM_CONTROLLED="no"
ONBOOT="yes"
</source>


A little <span class="code">qdisk</span> goes a long way.
Note that you can use what ever bridge names makes sense to you. However, the file name for the bridge configuration must sort after the <span class="code">ifcfg-ethX</span> file. If the bridge file is read before the ethernet interface, it will fail to come up. Also, the bridge name as defined in the file does not need to match the one used it the actual file name. Personally, I like <span class="code">vbrX</span> for "''v''m ''br''idge".


=== Component; DRBD ===
''<span class="code">ifcfg-vbr0</span>'':
<source lang="bash">
# Bridge - IFN
DEVICE="vbr0"
TYPE="Bridge"
IPADDR=192.168.1.73
NETMASK=255.255.255.0
GATEWAY=192.168.1.254
DNS1=192.139.81.117
DNS2=192.139.81.1
</source>


[[DRBD]]; Distributed Replicating Block Device, is a technology that takes raw storage from two or more nodes and keeps their data synchronized in real time. It is sometimes described as "RAID 1 over Nodes", and that is conceptually accurate. In this tutorial's cluster, DRBD will be used to provide that back-end storage as a cost-effective alternative to a tranditional [[SAN]] or [[iSCSI]] device.
You may wish to not make the Back-Channel Network accessible to the virtual machines, then there is no need to setup this second bridge.


To help visualize DRBD's use and role, look at the map of our [[Red Hat Cluster Service 3 Tutorial#A Map of the Cluster's Storage|cluster's storage]].
<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-{eth,vbr}2
</source>


=== Component; CLVM ===
''<span class="code">ifcfg-eth2</span>'':
<source lang="bash">
# Back-channel
HWADDR="00:1B:21:72:9B:56"
DEVICE="eth2"
BRIDGE="vbr2"
BOOTPROTO="static"
IPV6INIT="yes"
NM_CONTROLLED="no"
ONBOOT="yes"
</source>


With DRBD providing the raw storage for the cluster, we must now create partitions. This is where Clustered [[LVM]], known as CLVM, comes into play.
''<span class="code">ifcfg-vbr2</span>'':
<source lang="bash">
# Bridge - BCN
DEVICE="vbr2"
TYPE="Bridge"
IPADDR=192.168.3.73
NETMASK=255.255.255.0
</source>


CLVM is ideal in that it understands that it is clustered and therefor won't provide access to nodes outside of the formed cluster. That is, not a member of <span class="code">corosync</span>'s closed process group, which, in turn, requires quorum.
Leave the cluster, lest we be fenced.


It is ideal because it can take one or more raw devices, known as "physical volumes", or simple as [[PV]]s, and combine their raw space into one or more "volume groups", known as [[VG]]s. These volume groups then act just like a typical hard drive and can be "partitioned" into one or more "logical volumes", known as [[LV]]s. These LVs are what will be formatted with a clustered file system.
<source lang="bash">
/etc/init.d/rgmanager stop && /etc/init.d/cman stop
</source>


LVM is particularly attractive because of how incredibly flexible it is. We can easily add new physical volumes later, and then grow an existing volume group to use the new space. This new space can then be given to existing logical volumes, or entirely new logical volumes can be created. This can all be done while the cluster is online offering an upgrade path with no down time.
Restart networking and then check that the new bridges are up and that the proper ethernet devices are slaved to them.


=== Component; GFS2 ===
<source lang="bash">
/etc/init.d/network restart
</source>
<source lang="text">
Shutting down interface eth0:                              [  OK  ]
Shutting down interface eth1:                              [  OK  ]
Shutting down interface eth2:                              [  OK  ]
Shutting down loopback interface:                          [  OK  ]
Bringing up loopback interface:                            [  OK  ]
Bringing up interface eth0:                                [  OK  ]
Bringing up interface eth1:                                [  OK  ]
Bringing up interface eth2:                                [  OK  ]
Bringing up interface vbr0:                                [  OK  ]
Bringing up interface vbr2:                                [  OK  ]
</source>


With DRBD providing the clusters raw storage space, and Clustered LVM providing the logical partitions, we can now look at the clustered file system. This is the role of the Global File System version 2, known simply as [[GFS2]].
<source lang="bash">
brctl show
</source>
<source lang="text">
bridge name bridge id STP enabled interfaces
vbr0 8000.bcaec5448ade no eth0
vbr2 8000.001b21729b56 no eth2
</source>


It works much like standard filesystem, with <span class="code">mkfs.gfs2</span>, <span class="code">fsck.gfs2</span> and so on. The major difference is that it and <span class="code">clvmd</span> use the cluster's distributed locking mechanism provided by <span class="code">dlm_controld</span>. Once formatted, the GFS2-formatted partition can be mounted and used by any node in the cluster's closed process group. All nodes can then safely read from and write to the data on the partition simultaneously.
<source lang="bash">
ifconfig
</source>
<source lang="text">
eth0      Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE 
          inet6 addr: fe80::beae:c5ff:fe44:8ade/64 Scope:Link
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
          RX packets:4439 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2752 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:508352 (496.4 KiB)  TX bytes:494345 (482.7 KiB)
          Interrupt:31 Base address:0x8000


=== Component; DLM ===
eth1      Link encap:Ethernet  HWaddr 00:1B:21:72:96:E8 
          inet addr:192.168.2.73  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe72:96e8/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:617100 errors:0 dropped:0 overruns:0 frame:0
          TX packets:847718 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:772489353 (736.7 MiB)  TX bytes:740536232 (706.2 MiB)
          Interrupt:18 Memory:fe9e0000-fea00000


One of the major roles of a cluster is to provide distributed locking on clustered storage. In fact, storage software can not be clustered without using DLM, as provided by the <span class="code">dlm_controld</span> daemon, using <span class="code">corosync</span>'s virtual synchrony.  
eth2      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56 
          inet6 addr: fe80::21b:21ff:fe72:9b56/64 Scope:Link
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
          RX packets:86586 errors:0 dropped:0 overruns:0 frame:0
          TX packets:80934 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:11366700 (10.8 MiB)  TX bytes:10091579 (9.6 MiB)
          Interrupt:17 Memory:feae0000-feb00000


Through DLM, all nodes accessing clustered storage are guaranteed to get [[POSIX]] locks, called <span class="code">plock</span>s, in the same order across all nodes. Both CLVM and GFS2 rely on DLM, though other clustered storage, like OCFS2, use it as well.
lo        Link encap:Local Loopback 
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:32 errors:0 dropped:0 overruns:0 frame:0
          TX packets:32 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:11507 (11.2 KiB)  TX bytes:11507 (11.2 KiB)


=== Component; Xen ===
vbr0      Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE 
          inet addr:192.168.1.73  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::beae:c5ff:fe44:8ade/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:165 errors:0 dropped:0 overruns:0 frame:0
          TX packets:89 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:25875 (25.2 KiB)  TX bytes:17081 (16.6 KiB)


There are two major open-source virtualization platforms available in the Linux world today; Xen and KVM. The former is maintained by [http://www.citrix.com/xenserver Citrix] and the other by [http://www.redhat.com/solutions/virtualization/ Redhat]. It would be difficult to say which is "better", as they're both very good. Xen can be argued to be more mature where KVM is the "official" solution supported by Red Hat directly.
vbr2      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56 
          inet addr:192.168.3.73  Bcast:192.168.3.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe72:9b56/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:74 errors:0 dropped:0 overruns:0 frame:0
          TX packets:27 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:19021 (18.5 KiB)  TX bytes:4137 (4.0 KiB)
</source>


We will be using the Xen [[hypervisor]] and a "host" virtual server called [[dom0]]. In Xen, every machine is a virtual server, including the system you installed when you built the server. This is possible thanks to a small Xen micro-operating system that initially boots, then starts up your original installed operating system as a virtual server with special access to the underlying hardware and hypervisor management tools.
Rejoin the cluster.


The rest of the virtual servers in a Xen environment are collectively called "[[domU]]" virtual servers. These will be the highly-available resource that will migrate between nodes during failure events.
<source lang="bash">
/etc/init.d/cman start && /etc/init.d/rgmanager start
</source>


== A Little History ==


In the RHCS version 2 days (RHEL 5.x and derivatives), there was a component called <span class="code">openais</span> which handled <span class="code">totem</span>. The OpenAIS project was designed to be the heart of the cluster and was based around the [http://www.saforum.org/ Service Availability Forum]'s [http://www.saforum.org/Application-Interface-Specification~217404~16627.htm Application Interface Specification]. AIS is an open [[API]] designed to provide inter-operable high availability services.
Repeat these configurations, altering for [[MAC]] and [[IP]] addresses as appropriate, for the other four VM cluster nodes.


In 2008, it was decided that the AIS specification was overkill for clustering and a duplication of effort in the existing and easier to maintain <span class="code">corosync</span> project. OpenAIS was then split off as a separate project specifically designed to act as an optional add-on to corosync for users who wanted AIS functionality.
<span class="code"></span>
<source lang="bash">
</source>
<source lang="text">
</source>


You will see a lot of references to OpenAIS while searching the web for information on clustering. Understanding it's evolution will hopefully help you avoid confusion.
== Benchmarks ==


== Finally; Begin Configuration ==
GFS2 partition on <span class="code">an-node07</span>'s <span class="code">/shared01</span> partition. Test #1, no optimization:


At the heart of Red Hat Cluster Services is the [[Cluster.conf|<span class="code">/etc/cluster/cluster.conf</span>]] configuration file.
<source lang="bash">
bonnie++ -d /shared01/ -s 8g -u root:root
</source>
<source lang="text">
Version  1.96      ------Sequential Output------ --Sequential Input- --Random-
Concurrency  1    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
an-node07.alteev 8G  388  95 22203  6 14875  8  2978  95 48406  10 107.3  5
Latency              312ms  44400ms  31355ms  41505us    540ms  11926ms
Version  1.96      ------Sequential Create------ --------Random Create--------
an-node07.alteeve.c -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                16  1144  18 +++++ +++  8643  56  939  19 +++++ +++  8262  55
Latency              291ms    586us    2085us    3511ms      51us    3669us
1.96,1.96,an-node07.alteeve.ca,1,1312497509,8G,,388,95,22203,6,14875,8,2978,95,48406,10,107.3,5,16,,,,,1144,18,+++++,+++,8643,56,939,19,+++++,+++,8262,55,312ms,44400ms,31355ms,41505us,540ms,11926ms,291ms,586us,2085us,3511ms,51us,3669us
</source>


This is an [[XML]] configuration file that stores and controls all of the cluster configuration, including node setups, resource management, fault tolerances, fence devices and their use and so on.  
CentOS 5.6 x86_64 VM <span class="code">vm0001_labzilla</span>'s <span class="code">/root</span> directory. Test #1, no optimization. VM provisioned using command in below section.


The goal of this tutorial is to introduce you to clustering, so only four components will be shown here, configured in two stages:
<source lang="bash">
bonnie++ -d /root/ -s 8g -u root:root
</source>
<source lang="text">
Version  1.96      ------Sequential Output------ --Sequential Input- --Random-
Concurrency  1    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
labzilla-new.can 8G  674  98 15708  5 14875  7  1570  65 47806  10 119.1  7
Latency            66766us    7680ms    1588ms    187ms    269ms    1292ms
Version  1.96      ------Sequential Create------ --------Random Create--------
labzilla-new.candco -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                16 27666  39 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency            11360us    1904us    799us    290us      44us      41us
1.96,1.96,labzilla-new.candcoptical.com,1,1312522208,8G,,674,98,15708,5,14875,7,1570,65,47806,10,119.1,7,16,,,,,27666,39,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,66766us,7680ms,1588ms,187ms,269ms,1292ms,11360us,1904us,799us,290us,44us,41us
</source>


* Stage 1
== Provision vm0001 ==
** Node definitions
** Fence device setup
* Stage 2
** Quorum Disk setup
** Resource Management


There is a tremendous amount of options to allow for extremely fine-grained control of the cluster. To discuss all of the option would require a dedicated article and would distract quite a bit at this stage. There is an ongoing project to do just that, but it is not complete yet.
Created LV already, so:


* [[Cluster.conf|All cluster.conf Options]]
<source lang="bash">
virt-install --connect qemu:///system \
  --name vm0001_labzilla \
  --ram 1024 \
  --arch x86_64 \
  --vcpus 2 \
  --cpuset 1-3 \
  --location http://192.168.1.254/c5/x86_64/img/ \
  --extra-args "ks=http://192.168.1.254/c5/x86_64/ks/labzilla_c5.ks" \
  --os-type linux \
  --os-variant rhel5.4 \
  --disk path=/dev/san_vg01/vm0001_hdd1 \
  --network bridge=vbr0 \
  --vnc
</source>


It is strongly advised that, ''after'' this tutorial, you take the time to review these options. For now though, let's keep it simple.
== Provision vm0002 ==


=== What Do We Need To Start? ===
Created LV already, so:


* First
<source lang="bash">
virt-install --connect qemu:///system \
  --name vm0002_innovations \
  --ram 1024 \
  --arch x86_64 \
  --vcpus 2 \
  --cpuset 1-3 \
  --cdrom /shared01/media/Win_Server_2008_Bis_x86_64.iso \
  --os-type windows \
  --os-variant win2k8 \
  --disk path=/dev/san_vg01/vm0002_hdd2 \
  --network bridge=vbr0 \
  --hvm \
  --vnc
</source>


You need a name for your cluster. It's important that it be unique if the cluster will be on a network with other clusters. This paper will use <span class="code">an-cluster-01</span>.
Update the <span class="code">cluster.conf</span> to add the VMs to the cluster.


* Second
<source lang="xml">
<?xml version="1.0"?>
<cluster config_version="12" name="an-clusterB">
<totem rrp_mode="none" secauth="off"/>
<clusternodes>
<clusternode name="an-node03.alteeve.ca" nodeid="3">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="3"/>
</method>
</fence>
</clusternode>
<clusternode name="an-node04.alteeve.ca" nodeid="4">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="4"/>
</method>
</fence>
</clusternode>
<clusternode name="an-node05.alteeve.ca" nodeid="5">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="5"/>
</method>
</fence>
</clusternode>
<clusternode name="an-node06.alteeve.ca" nodeid="6">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="6"/>
</method>
</fence>
</clusternode>
<clusternode name="an-node07.alteeve.ca" nodeid="7">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="7"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
</fencedevices>
<fence_daemon post_join_delay="30"/>
<rm log_level="5">
<resources>
<script file="/etc/init.d/iscsi" name="iscsi"/>
<script file="/etc/init.d/clvmd" name="clvmd"/>
<script file="/etc/init.d/gfs2" name="gfs2"/>
</resources>
<failoverdomains>
<failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node04.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node05.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an6_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node06.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an7_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node07.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an3_primary" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" priority="1"/>
<failoverdomainnode name="an-node04.alteeve.ca" priority="2"/>
<failoverdomainnode name="an-node05.alteeve.ca" priority="3"/>
<failoverdomainnode name="an-node06.alteeve.ca" priority="4"/>
<failoverdomainnode name="an-node07.alteeve.ca" priority="5"/>
</failoverdomain>
<failoverdomain name="an4_primary" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" priority="5"/>
<failoverdomainnode name="an-node04.alteeve.ca" priority="1"/>
<failoverdomainnode name="an-node05.alteeve.ca" priority="2"/>
<failoverdomainnode name="an-node06.alteeve.ca" priority="3"/>
<failoverdomainnode name="an-node07.alteeve.ca" priority="4"/>
</failoverdomain>
</failoverdomains>
<service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an3_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<service autostart="1" domain="an6_only" exclusive="0" max_restarts="0" name="an6_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<service autostart="1" domain="an7_only" exclusive="0" max_restarts="0" name="an7_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<vm autostart="0" domain="an3_primary" exclusive="0" max_restarts="2" name="vm0001_labzilla" path="/shared01/definitions/" recovery="restart" restart_expire_time="600"/>
<vm autostart="0" domain="an4_primary" exclusive="0" max_restarts="2" name="vm0002_innovations" path="/shared01/definitions/" recovery="restart" restart_expire_time="600"/>
</rm>
</cluster>
</source>


We need the <span class="code">hostname</span> of each cluster node. You can get this by using the <span class="code">uname</span> program.


<span class="code"></span>
<source lang="bash">
<source lang="bash">
uname -n
</source>
</source>
<source lang="text">
<source lang="text">
an-node01.alteeve.com
</source>
</source>


In the example above, the first node's <span class="code">hostname</span> is <span class="code">an-node01</span>. This will be used shortly when defining the node.
= Stuff =


* Third
Multi-VM after primary SAN (violent) ejection from cluster. Both VMs remained up!


You need to know what fence device you will be using and how to use them. Exactly how you do this will depend on what fence device you have available to you. There are many possibilities, but [[IPMI]] is probably the most common and [[Node Assassin]] is available to everyone, so those two will be discussed in detail shortly.
[[Image:two_vms_on_two_by_seven_cluster_build_01.png|thumb|center|800px|Two VMs (windows and Linux) running on the SAN. Initial testing of survivability of primary SAN failure completed successfully!]]


The main thing to know about your fence device(s) are:
First build of the 2x7 Cluster.


* What is the fence agent called? IPMI uses <span class="code">fence_ipmilan</span> and Node Assassin uses <span class="code">fence_na</span>.
[[Image:first_ever_build.png|thumb|center|700px|First-ever successful build/install of the "Cluster Set" cluster configuration. Fully HA, fully home-brew on all open source software using only commodity hardware. Much tuning/testing to come!]]
* Does one device support multiple devices? If so, what port number is each node on?
* What IP address or resolvable hostname is the device available at?
* Does the device require a user name and password? If so, what are they?
* What are the supported fence options and which do you want to use? The most common being <span class="code">reboot</span>, but <span class="code">off</span> is another popular option.
* Does the fence device support or require other options? If so, what are they and what values do you want to use?


* Summary
== Bonding and Trunking ==


For this tutorial, we will have the following information now:
The goal here is to take the network out as a single point of failure.


* Cluster name: <span class="code">an-cluster-01</span>
The design is to use two stacked switches, bonded connections in the nodes with each leg of the bond cabled through either switch. While both are up, the aggregate bandwidth will be achieved using trunking in the switch and the appropriate bond driver configuration. The recovery from failure will need to be configured in such a way that it will be faster than the cluster's token loss timeouts multiplied by the token retransmit loss count.
* Node: <span class="code">an-node01</span>
** Fist fence device: <span class="code">IPMI</span>
*** IPMI is per-node, so no port number is needed.
*** The IPMI interface is at <span class="code">192.168.3.51</span>.
*** Username is <span class="code">admin</span> and the password is <span class="code">secret</span>.
*** No special arguments are needed.
** Second fence device: <span class="code">Node Assassin</span>
*** Node Assassin supports four nodes, and <span class="code">an-node01</span> is connected to port <span class="code">1</span>.
*** The Node Assassin interface is at <span class="code">192.168.3.61</span>.
*** Username is <span class="code">admin</span> and the password is <span class="code">sauce</span>.
*** We want to use the <span class="code">quiet</span> option (<span class="code">quiet="true"</span>).
* Node: <span class="code">an-node02</span>
** Fist fence device: <span class="code">IPMI</span>
*** IPMI is per-node, so no port number is needed.
*** The IPMI interface is at <span class="code">192.168.3.52</span>.
*** Username is <span class="code">admin</span> and the password is <span class="code">secret</span>.
*** No special arguments are needed.
** Second fence device: <span class="code">Node Assassin</span>
*** Node Assassin supports four nodes, and <span class="code">an-node01</span> is connected to port <span class="code">2</span>.
*** The Node Assassin interface is at <span class="code">192.168.3.61</span>.
*** Username is <span class="code">admin</span> and the password is <span class="code">sauce</span>.
*** We want to use the <span class="code">quiet</span> option (<span class="code">quiet="true"</span>).


Note that:
This tutorial uses 2x [http://dlink.ca/products/?pid=DGS-3100-24&tab=3 D-Link DGS-3100-24] switches. This is not to endorse these switches, per-se, but it does provide a relatively affordable, decent quality switches for those who'd like to replicate this setup.
* IPMI is per-node, and thus has different IP addresses an no ports.
* Node Assassin supports multiple nodes, so has a common IP address and different ports per node.


We now know enough to write the first version of the <span class="code">cluster.conf</span> file!
=== Configure The Stack ===


== Install The Cluster Software ==
First, stack the switches using a ring topology (both HDMI connectors/cables used). If both switches are brand new, simple cable them together and the switches will auto-negotiate the stack configuration. If you are adding a new switch, then power on the existing switch, cable up the second switch and then power on the second switch. After a short time, it's stack ID should increment and you should see the new switch appear in the existing switch's interface.


If you are using Red Hat Enterprise Linux, you will need to add the <span class="code">RHEL Server Optional (v. 6 64-bit x86_64)</span> channel for each node in your cluster. You can do this in [[RHN]] by going the your subscription management page, clicking on each server, clicking on "Alter Channel Subscriptions", click to enable the <span class="code">RHEL Server Optional (v. 6 64-bit x86_64)</span> channel and then by clicking on "Change Subscription".
=== Configuring the Bonding Drivers ===
 
This tutorial uses four interfaces joined into two bonds of two NICs like so:


This actual installation is simple, just use <span class="code">yum</span> to install <span class="code">cman</span>.
<source lang="text">
# Internet Facing Network:
eth0 + eth1 == bond0
</source>


<source lang="bash">
<source lang="text">
yum install cman
# Storage and Cluster Communications:
eth2 + eth3 == bond1
</source>
</source>


This will pull in a good number of dependencies.  
This requires a few steps.


For now, we do not want the cluster manager to start on boot. We'll turn it off and disable the service until we are finished testing.
* Create <span class="code">/etc/modprobe.d/bonding.conf</span> and add an entry for the two bonding channels we will create.
 
''Note'': My <span class="code">eth0</span> device is an onboard controller with a maximum [[MTU]] of 7200 [[bytes]]. This means that the whole bond is restricted to this MTU.


<source lang="bash">
<source lang="bash">
chkconfig cman off
vim /etc/modprobe.d/bonding.conf
/etc/init.d/cman stop
</source>
</source>
<source lang="text">
<source lang="text">
  Leaving fence domain...                                [  OK  ]
alias bond0 bonding
  Stopping gfs_controld...                                [  OK  ]
alias bond1 bonding
  Stopping dlm_controld...                                [  OK  ]
  Stopping fenced...                                      [  OK  ]
  Stopping cman...                                        [  OK  ]
  Unloading kernel modules...                            [  OK  ]
  Unmounting configfs...                                  [  OK  ]
</source>
</source>


== Configuring And Testing Fence Devices ==
* Create the <span class="code">ifcfg-bondX</span> configuration files.


Before we can look at the cluster configuration, we first need to make sure the fence devices are setup and working properly. We're going to setup IPMI on either node and a Node Assassin fence device. If you only use one, you can safely ignore the other for now. If you use a different fence device, please consult the manufacturer's documentation. Better yet, contribute the setup to this document!
Internet Facing configuration


=== Configure And Test IPMI ===
<source lang="bash">
touch /etc/sysconfig/network-scripts/ifcfg-bond{0,1}
vim /etc/sysconfig/network-scripts/ifcfg-eth{0,1} /etc/sysconfig/network-scripts/ifcfg-bond0
cat /etc/sysconfig/network-scripts/ifcfg-eth0
</source>
<source lang="bash">
# Internet Facing Network - Link 1
HWADDR="BC:AE:C5:44:8A:DE"
DEVICE="eth0"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER="bond0"
SLAVE="yes"
MTU="7200"
</source>


IPMI requires having a system board with an IPMI baseboard management controller, known as a BMC. IPMI is often the underlying technology used by many OEM-branded remote access tools. If you don't see a specific fence agent for your server's remote access application, experiment with generic IPMI tools.
<source lang="bash">
cat /etc/sysconfig/network-scripts/ifcfg-eth1
</source>
<source lang="bash">
# Internet Facing Network - Link 2
HWADDR="00:1B:21:72:96:E8"
DEVICE="eth1"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER="bond0"
SLAVE="yes"
MTU="7200"
</source>


Most manufacturers provide a method of configuring the BMC. Many provide a menu in the BIOS or at boot time. Modern IPMI-enabled systems offer a dedicated web interface that you can access using a browser. There is a third option as well, which we will show here, and that is by using a command line tool called, conveniently, <span class="code">ipmitool</span>.
<source lang="bash">
cat /etc/sysconfig/network-scripts/ifcfg-bond0
</source>
<source lang="bash">
# Internet Facing Network - Bonded Interface
DEVICE="bond0"
BOOTPROTO="static"
NM_CONTROLLED="no"
ONBOOT="yes"
IPADDR="192.168.1.73"
NETMASK="255.255.255.0"
GATEWAY="192.168.1.254"
DNS1="192.139.81.117"
DNS2="192.139.81.1"
BONDING_OPTS="miimon=1000 mode=0"
MTU="7200"
</source>


==== Configuring IPMI From The Command Line ====
Merged Storage Network and Back Channel Network configuration.


To start, we need to install the IPMI user software.
''Note'': The interfaces in this bond all support maximum [[MTU]] of 9000 [[bytes]].


'''ToDo''': confirm this is valid for RHEL6
<source lang="bash">
<source lang="bash">
yum install ipmitool freeipmi freeipmi-bmc-watchdog freeipmi-ipmidetectd OpenIPMI
vim /etc/sysconfig/network-scripts/ifcfg-eth{2,3} /etc/sysconfig/network-scripts/ifcfg-bond1
cat /etc/sysconfig/network-scripts/ifcfg-eth2
</source>
<source lang="bash">
# Storage and Back Channel Networks - Link 1
HWADDR="00:1B:21:72:9B:56"
DEVICE="eth2"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER="bond1"
SLAVE="yes"
MTU="9000"
</source>
</source>


Once installed and and the daemon has been started, you should be able to check the local IPMI BMC using <span class="code">ipmitool</span>.
<source lang="bash">
cat /etc/sysconfig/network-scripts/ifcfg-eth3
</source>
<source lang="bash">
# Storage and Back Channel Networks - Link 2
HWADDR="00:0E:0C:59:46:E4"
DEVICE="eth3"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER="bond1"
SLAVE="yes"
MTU="9000"
</source>


<source lang="bash">
<source lang="bash">
/etc/init.d/ipmi start
cat /etc/sysconfig/network-scripts/ifcfg-bond1
</source>
</source>
<source lang="text">
<source lang="bash">
Starting ipmi drivers:                                    [  OK  ]
# Storage and Back Channel Networks - Bonded Interface
DEVICE="bond1"
BOOTPROTO="static"
NM_CONTROLLED="no"
ONBOOT="yes"
IPADDR="192.168.3.73"
NETMASK="255.255.255.0"
BONDING_OPTS="miimon=1000 mode=0"
MTU="9000"
</source>
</source>
Restart networking.
{{note|1=I've noticed that this can error out and fail to start slaved devices at times when using <span class="code">/etc/init.d/network restart</span>. If you have any trouble, you may need to completely stop all networking, then start it back up. This, of course, requires network-less access to the node's console (direct access, [[iKVM]], console redirection, etc).}}
Some of the errors we will see below are because the network interface configuration changed while the interfaces were still up. To avoid this, if you have networkless access to the nodes, would be to stop the network interfaces prior to beginning editing.
<source lang="bash">
<source lang="bash">
ipmitool chassis status
/etc/init.d/network restart
</source>
</source>
<source lang="text">
<source lang="text">
System Power        : on
Shutting down interface eth0:                             [  OK  ]
Power Overload      : false
Shutting down interface eth1:                             [  OK  ]
Power Interlock      : inactive
Shutting down interface eth2:  /etc/sysconfig/network-scripts/ifdown-eth: line 99: /sys/class/net/bond1/bonding/slaves: No such file or directory
Main Power Fault    : false
                                                          [ OK  ]
Power Control Fault : false
Shutting down interface eth3: /etc/sysconfig/network-scripts/ifdown-eth: line 99: /sys/class/net/bond1/bonding/slaves: No such file or directory
Power Restore Policy : always-off
                                                          [  OK  ]
Last Power Event    : command
Shutting down loopback interface:                         [  OK  ]
Chassis Intrusion    : inactive
Bringing up loopback interface:                           [  OK  ]
Front-Panel Lockout : inactive
Bringing up interface bond0: RTNETLINK answers: File exists
Drive Fault          : false
Error adding address 192.168.1.73 for bond0.
Cooling/Fan Fault    : false
RTNETLINK answers: File exists
Front Panel Control : none
                                                          [  OK ]
Bringing up interface bond1:                               [  OK  ]
</source>
</source>


If you see something similar, you're up and running. You can now check the current configuration using the following command.
Confirm that we've got our new bonded interfaces


<source lang="bash">
<source lang="bash">
ipmitool -I open lan print 1
ifconfig
</source>
</source>
<source lang="text">
<source lang="text">
Set in Progress        : Set Complete
bond0     Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE 
Auth Type Support      : NONE MD2 MD5 OEM
          inet addr:192.168.1.73  Bcast:192.168.1.255  Mask:255.255.255.0
Auth Type Enable        : Callback : NONE MD2 MD5 OEM
          inet6 addr: fe80::beae:c5ff:fe44:8ade/64 Scope:Link
                        : User     : NONE MD2 MD5 OEM
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:7200  Metric:1
                        : Operator : NONE MD2 MD5 OEM
          RX packets:1021 errors:0 dropped:0 overruns:0 frame:0
                        : Admin    : NONE MD2 MD5 OEM
          TX packets:502 errors:0 dropped:0 overruns:0 carrier:0
                        : OEM      :  
          collisions:0 txqueuelen:0  
IP Address Source      : Static Address
          RX bytes:128516 (125.5 KiB)  TX bytes:95092 (92.8 KiB)
IP Address              : 192.168.3.51
Subnet Mask             : 255.255.0.0
MAC Address            : 00:e0:81:aa:bb:cc
SNMP Community String  : AMI
IP Header              : TTL=0x00 Flags=0x00 Precedence=0x00 TOS=0x00
BMC ARP Control        : ARP Responses Disabled, Gratuitous ARP Disabled
Gratituous ARP Intrvl  : 0.0 seconds
Default Gateway IP      : 0.0.0.0
Default Gateway MAC    : 00:00:00:00:00:00
Backup Gateway IP      : 0.0.0.0
Backup Gateway MAC      : 00:00:00:00:00:00
802.1q VLAN ID          : Disabled
802.1q VLAN Priority    : 0
RMCP+ Cipher Suites    : 1,2,3,6,7,8,11,12,0,0,0,0,0,0,0,0
Cipher Suite Priv Max  : aaaaXXaaaXXaaXX
                        :    X=Cipher Suite Unused
                        :    c=CALLBACK
                        :    u=USER
                        :    o=OPERATOR
                        :    a=ADMIN
                        :    O=OEM
</source>


You can change the MAC address, but this isn't advised without a good reason to do so.  
bond1    Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56 
          inet addr:192.168.3.73  Bcast:192.168.3.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe72:9b56/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:9000  Metric:1
          RX packets:787028 errors:0 dropped:0 overruns:0 frame:0
          TX packets:788651 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:65753950 (62.7 MiB)  TX bytes:1194295932 (1.1 GiB)


Below is an example set of commands that will configure the IPMI BMC, save the new settings and then check that the new settings took. Adapt the values to suit your environment and preferences.
eth0      Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE 
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:7200  Metric:1
          RX packets:535 errors:0 dropped:0 overruns:0 frame:0
          TX packets:261 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:66786 (65.2 KiB)  TX bytes:47749 (46.6 KiB)
          Interrupt:31 Base address:0x8000


<source lang="bash">
eth1      Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE 
# Don't change the MAC without a good reason. If you need to though, this should work.
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:7200  Metric:1
#ipmitool -I open lan set 1 macaddr 00:e0:81:aa:bb:cd
          RX packets:486 errors:0 dropped:0 overruns:0 frame:0
          TX packets:241 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:61730 (60.2 KiB)  TX bytes:47343 (46.2 KiB)
          Interrupt:18 Memory:fe8e0000-fe900000


# Set the IP to be static (instead of DHCP)
eth2      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56 
ipmitool -I open lan set 1 ipsrc static
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:9000  Metric:1
          RX packets:360190 errors:0 dropped:0 overruns:0 frame:0
          TX packets:394844 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:28756400 (27.4 MiB)  TX bytes:598159146 (570.4 MiB)
          Interrupt:17 Memory:fe9e0000-fea00000


# Set the IP, default gateway and subnet mask address of the IPMI interface.
eth3      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56 
ipmitool -I open lan set 1 ipaddr 192.168.3.51
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:9000  Metric:1
ipmitool -I open lan set 1 defgw ipaddr 0.0.0.0
          RX packets:426838 errors:0 dropped:0 overruns:0 frame:0
ipmitool -I open lan set 1 netmask 255.255.255.0
          TX packets:393807 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:36997550 (35.2 MiB)  TX bytes:596136786 (568.5 MiB)


# Set the password.
lo        Link encap:Local Loopback 
ipmitool -I open lan set 1 password secret
          inet addr:127.0.0.1  Mask:255.0.0.0
ipmitool -I open user set password 2 secret
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
</source>


# Set the snmp community string, if appropriate
== Configuring High-Availability Networking ==
ipmitool -I open lan set 1 snmp alteeve


# Enable access
There are [http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/sec-Using_Channel_Bonding.html seven bonding] modes, which you can read about in [http://www.kernel.org/doc/Documentation/networking/bonding.txt detail here]. However, the [[RHCS]] stack only supports one of the modes, called [[Active/Passive Bonding]], also known as <span class="code">mode=1</span>.
ipmitool -I open lan set 1 access on


# Reset the IPMI BMC to make sure the changes took effect.
=== Configuring Your Switches ===
ipmitool mc reset cold


# Wait a few seconds and then re-run the call that dumped the setup to ensure
This method provides for no performance gains, but instead treats the slaved interfaces as independent paths. One of which acts as the primary while the other sits dormant. On failure, the bond switches to the backup interface, promoting it to the primary interface. Which is normally primary, and under what conditions a restored link might return to the primary role, are configurable should you wish to do so.
# it is now what we want.
sleep 5
ipmitool -I open lan print 1
</source>


If all went well, you should see the same output as above, but now with your new configuration.
Your managed switch will no doubt have one or more bonding, also known as [[trunking]] configuration options. Likewise, your switches may be stackable. It is strongly advised that you do *not* stack your switches. Being unstacked, it is obviously not possible to configure trunking. Should you decide to disregard this, be very sure to extensive test failure and recovery of both switches under real-world work loads.


==== Testing IPMI ====
Still on the topic of switches; Do not configure [[STP]] (spanning tree protocol) on any port connected to your cluster nodes! When a switch is added to the network, as is the case after restoring a lost switch, STP-enabled switches and port on those switches may block traffic for a period of time while STP renegotiates and reconfigures. This takes more than enough time to cause a cluster to partition. You may still enable and configure STP if you need to do so, simply ensure that you only do in on the appropriate ports.


The <span class="code">ipmitool</span> tool needs to be installed on the workstation that you want to run the tests from. You will certainly want to test fencing from each node against all other nodes! If you want to test from you personal computer though, be sure to install <span class="code">ipmitool</span> before hand. Note that the example below if for [[RPM]] based distributions. Please check your distribution for the availability of <span class="code">ipmitool</span> if you can't use <span class="code">yum</span>.
=== Preparing The Bonding Driver ===


<source lang="bash">
Before we modify the network, we will need to create the following file:
yum install ipmitool
</source>


The following commands only work against remote servers. You must use the example commands in the previous section when checking the local server.
You will need to create a file called


The least invasive test is to simply check remote machine's <span class="code">chassis power status</span>. Until this check works, there is no sense trying to actually reboot the remote servers.
With the switches unstacked and STP disabled, we can now configure your bonding interface.
 
Let's check <span class="code">an-node01</span> from <span class="code">an-node02</span>. Note that here we use the IP address directly, but in practice I like to use a name that resolves to the IP address of the '''IPMI''' interface (denoted by a <span class="code">.ipmi</span> suffix after the normal short hostname).


<source lang="bash">
<source lang="bash">
ipmitool -I lan -H 192.168.3.51 -U admin -P secret chassis power status
vim /etc/modprobe.d/bonding.conf
</source>
</source>
<source lang="text">
<source lang="text">
Chassis Power is on
alias bond0 bonding
alias bond1 bonding
alias bond2 bonding
</source>
</source>


Once this works, you can test a reboot, power off or power on event by replacing <span class="code">status</span> with <span class="code">cycle</span>, <span class="code">off</span> and <span class="code">on</span>, respectively. This is, in fact, what the <span class="code">fence_ipmilan</span> fence agent does!
If you only have four interfaces and plan to merge the [[SN]] and [[BCN]] networks, you can omit the <span class="code">bond2</span> entry.
 
You can then copy and paste the <span class="code">alias ...</span> entries from the file above into the terminal to avoid the need to reboot.
 
=== Deciding Which NICs to Bond ===


Lastly, make sure that the <span class="code">ipmi</span> daemon starts with the server.
If all of the interfaces in your server are identical, you can probably skip this step. Before you jump though, consider that not all of the [[PCIe]] interfaces may have all of their lanes connected, resulting in differing speeds. If you are unsure, I strongly recommend you run these tests.
 
TODO: Upload <span class="code">network_profiler.pl</span> here and explain it's use.
* Before we do that though, let's look at how we will verify the current link speed using <span class="code">[http://rpm.pbone.net/index.php3?stat=3&search=iperf&Search.x=33&Search.y=6&simple=2&dist%5B%5D=74&dist%5B%5D=0&dl=40&sr=1&field%5B%5D=1&field%5B%5D=2&srodzaj=1 ipperf]</span> (local copy of [https://alteeve.ca/files/iperf-2.0.5-1.el6.x86_64.rpm iperf-2.0.5-1.el6.x86_64.rpm]).


<source lang="bash">
chkconfig ipmi on
</source>


There are a few more option than what I mentioned here, which you can read about in <span class="code">man ipmitool</span>.
Once you've determined the various capabilities on your interfaces, pair them off with their closest-performing partners.


=== Configure And Test Node Assassin ===
Keep in mind:


The [[Node Assassin]] fence agent was added to the RHCS' <span class="code">fence-agents</span> in version <span class="code">3.0.16</span>. RHEL 6.0 ships with version <span class="code">3.0.12</span> though, so we'll need to install it manually, first.
* Any interface piggy-backing on an IPMI interface *must* be part of the [[BCN]] bond!
* The fasted interfaces should be paired for the [[SN]] bond.
* The lowest latency interfaces should be used the [[BCN]] bond.
* The lowest remaining two interfaces should be used in the [[IFN]] bond.


<source lang="bash">
=== Creating the Bonds ===
cd ~
wget -c http://nodeassassin.org/files/node_assassin/node_assassin-1.1.6.tar.gz
tar -xvzf node_assassin-1.1.6.tar.gz
cd node_assassin-1.1.6
./install
</source>
<source lang="text">
</source>


If you want to remove the Node Assassin fence agent, simply run <span class="code">./uninstall</span> from the same directory. Do note that it will delete the configuration file, too.
{{warning|1=This step will almost certainly leave you without a network access to your servers. It is *strongly* advised that you do the next steps when you have physical access to your servers. If that is simply not possible, then proceed with extreme caution.}}


Now that the fence agent is installed, you will need to configure it. The configuration file is pretty well documented, so we will just look at the specific lines that need to be edited.
In my case, I found that bonding the following optimal configuration:


<source lang="bash">
* <span class="code">eth0</span> and <span class="code">eth3</span> bonded as <span class="code">bond0</span>.
vim /etc/cluster/fence_na.conf
* <span class="code">eth1</span> and <span class="code">eth2</span> bonded as <span class="code">bond1</span>.
</source>
<source lang="text">
# This is the authentication information... It is currently a simple plain text
# compare, but this will change prior to first release.
system::username        =      admin
system::password        =      sauce
</source>
<source lang="text">
# The nodes name. This must match exactly with the name set in the given node.
na::1::na_name          =      fence_na01
</source>
<source lang="text">
# These are aliases for each Node Assassin port. They should match the name or
# URI of the node connected to the given port. This is optional but will make
# the fenced 'list' argument more accurate and sane. If a port is listed here,
# then the 'list' action will return '<node_id>,<value>'. If a port is not
# defined, 'list' will return '<node_id>,<node::X::name-node_id>'. If a port is
# set to 'unused', it will be skipped when replying to a 'list'.
na::1::alias::1        =      an_node01.alteeve.com
na::1::alias::2        =      an_node02.alteeve.com
na::1::alias::3        =      unused
na::1::alias::4        =      unused
</source>


== The First cluster.conf File ==
I did not have enough interfaces for three bonds, so I will configure the following:


Before we begin, let's discuss briefly a few things about the cluster we will build.
* <span class="code">bond0</span> will be the [[IFN]] interface on the <span class="code">192.168.1.0/24</span> subnet.
* <span class="code">bond1</span> will be the merged [[BCN]] and [[SN]] interfaces on the <span class="code">192.168.3.0/24</span> subnet.


* The <span class="code">totem</span> communication will be over private and secure networks. This means that we can ignore encrypting our cluster communications which will improve performance and simplify our configuration.
TODO: Create/show the <span class="code">diff</span>s for the following <span class="code">ifcfg-ethX</span> files.
* To start, we will use the special <span class="code">two_node</span> option. We will remove this when we add in the quorum disk in stage 2.
* We will not implement redundant ring protocol just now, but will add it later.


With these decisions and the information gathered, here is what our first <span class="code">/etc/cluster/cluster.conf</span> file will look like.
* Create <span class="code">bond0</span> our of <span class="code">eth0</span> and <span class="code">eth3</span>:


<source lang="bash">
<source lang="bash">
touch /etc/cluster/cluster.conf
vim /etc/sysconfig/network-scripts/ifcfg-eth0
vim /etc/cluster/cluster.conf
</source>
</source>
<source lang="xml">
<source lang="bash">
<?xml version="1.0"?>
# Internet Facing Network - Link 1
<cluster name="an-cluster" config_version="1">
HWADDR="BC:AE:C5:44:8A:DE"
<cman two_node="1" expected_votes="1" />
DEVICE="eth0"
<totem secauth="off" rrp_mode="none" />
BOOTPROTO="none"
<clusternodes>
NM_CONTROLLED="no"
<clusternode name="an-node01.alteeve.com" nodeid="1">
ONBOOT="no"
<fence>
MASTER="bond0"
<method name="ipmi">
SLAVE="yes"
<device name="fence_ipmi01" action="reboot" />
</method>
<method name="node_assassin">
<device name="fence_na01" port="01" action="reboot" />
</method>
</fence>
</clusternode>
<clusternode name="an-node02.alteeve.com" nodeid="2">
<fence>
<method name="ipmi">
<device name="fence_ipmi02" action="reboot" />
</method>
<method name="node_assassin">
<device name="fence_na01" port="02" action="reboot" />
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice name="fence_ipmi01" agent="fence_ipmilan" ipaddr="192.168.3.51" login="admin" passwd="secret" />
<fencedevice name="fence_ipmi02" agent="fence_ipmilan" ipaddr="192.168.3.52" login="admin" passwd="secret" />
<fencedevice name="fence_na01"  agent="fence_na"      ipaddr="192.168.3.61" login="admin" passwd="sauce" quiet="true" />
</fencedevices>
</cluster>
</source>
</source>


Save the file, then validate it using the <span class="code">xmllint</span> program. If it validates, the contents will be printed followed by a success message. If it fails, address the errors and try again.
<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-eth3
</source>
<source lang="bash">
# Storage and Back Channel Networks - Link 2
HWADDR="00:0E:0C:59:46:E4"
DEVICE="eth3"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="no"
MASTER="bond0"
SLAVE="yes"
</source>


<source lang="bash">
<source lang="bash">
xmllint --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf
vim /etc/sysconfig/network-scripts/ifcfg-bond0
</source>
</source>
<source lang="text">
<source lang="bash">
<?xml version="1.0"?>
# Internet Facing Network - Bonded Interface
<cluster name="an-cluster" config_version="1">
DEVICE="bond0"
<cman two_node="1" expected_votes="1"/>
BOOTPROTO="static"
<totem secauth="off" rrp_mode="none"/>
NM_CONTROLLED="no"
<clusternodes>
ONBOOT="yes"
<clusternode name="an-node01.alteeve.com" nodeid="1">
IPADDR="192.168.1.73"
<fence>
NETMASK="255.255.255.0"
<method name="ipmi">
GATEWAY="192.168.1.254"
<device name="fence_ipmi01" action="reboot"/>
DNS1="192.139.81.117"
</method>
DNS2="192.139.81.1"
<method name="node_assassin">
# Clustering *only* supports mode=1 (active-passive)
<device name="fence_na01" port="01" action="reboot"/>
BONDING_OPTS="mode=1 miimon=100 use_carrier=0 updelay=0 downdelay=0"
</method>
</fence>
</clusternode>
<clusternode name="an-node02.alteeve.com" nodeid="2">
<fence>
<method name="ipmi">
<device name="fence_ipmi02" action="reboot"/>
</method>
<method name="node_assassin">
<device name="fence_na01" port="02" action="reboot"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice name="fence_ipmi01" agent="fence_ipmilan" ipaddr="192.168.3.51" login="admin" passwd="secret"/>
<fencedevice name="fence_ipmi02" agent="fence_ipmilan" ipaddr="192.168.3.52" login="admin" passwd="secret"/>
<fencedevice name="fence_na01" agent="fence_na" ipaddr="192.168.3.61" login="admin" passwd="sauce" quiet="true"/>
</fencedevices>
</cluster>
/etc/cluster/cluster.conf validates
</source>
</source>


'''''DO NOT PROCEED UNTIL YOUR cluster.conf FILE VALIDATES!'''''
== GFS2 ==


Unless you have it perfect, your cluster will fail.
Try adding <span class="code">noatime</span> to the <span class="code">/etc/fstab</span> options. h/t to Dak1n1; "it avoids cluster reads from turning into unnecessary writes, and can improve performance".


<span class="code"></span>
<span class="code"></span>

Latest revision as of 21:43, 20 June 2016

 AN!Wiki :: How To :: 2x5 Scalable Cluster Tutorial


Warning: This document is old, abandoned and very out of date. DON'T USE ANYTHING HERE! Consider it only as historical note taking.

The Design

Storage

Storage, high-level:

[ Storage Cluster ]                                                       
    _____________________________             _____________________________ 
   | [ an-node01 ]               |           | [ an-node02 ]               |
   |  _____    _____             |           |             _____    _____  |
   | ( HDD )  ( SSD )            |           |            ( SSD )  ( HDD ) |
   | (_____)  (_____)  __________|           |__________  (_____)  (_____) |
   |    |        |    | Storage  =--\     /--=  Storage |    |        |    |
   |    |        \----| Network ||  |     |  || Network |----/        |    |
   |    \-------------|_________||  |     |  ||_________|-------------/    |
   |_____________________________|  |     |  |_____________________________|
                                  __|_____|__                               
                                 |  HDD LUN  |                              
                                 |  SDD LUN  |                              
                                 |___________|                              
                                       |                                    
                                  _____|_____                             
                                 | Floating  |                            
                                 |   SAN IP  |                                
[ VM Cluster ]                   |___________|                                
  ______________________________   | | | | |   ______________________________ 
 | [ an-node03 ]                |  | | | | |  |                [ an-node06 ] |
 |  _________                   |  | | | | |  |                   _________  |
 | | [ vmA ] |                  |  | | | | |  |                  | [ vmJ ] | |
 | |  _____  |                  |  | | | | |  |                  |  _____  | |
 | | (_hdd_)-=----\             |  | | | | |  |             /----=-(_hdd_) | |
 | |_________|    |             |  | | | | |  |             |    |_________| |
 |  _________     |             |  | | | | |  |             |     _________  |
 | | [ vmB ] |    |             |  | | | | |  |             |    | [ vmK ] | |
 | |  _____  |    |             |  | | | | |  |             |    |  _____  | |
 | | (_hdd_)-=--\ |   __________|  | | | | |  |__________   | /--=-(_hdd_) | |
 | |_________|  | \--| Storage  =--/ | | | \--=  Storage |--/ |  |_________| |
 |  _________   \----| Network ||    | | |    || Network |----/   _________  |
 | | [ vmC ] |  /----|_________||    | | |    ||_________|----\  | [ vmL ] | |
 | |  _____  |  |               |    | | |    |               |  |  _____  | |
 | | (_hdd_)-=--/               |    | | |    |               \--=-(_hdd_) | |
 | |_________|                  |    | | |    |                  |_________| |
 |______________________________|    | | |    |______________________________|            
  ______________________________     | | |     ______________________________ 
 | [ an-node04 ]                |    | | |    |                [ an-node07 ] |
 |  _________                   |    | | |    |                   _________  |
 | | [ vmD ] |                  |    | | |    |                  | [ vmM ] | |
 | |  _____  |                  |    | | |    |                  |  _____  | |
 | | (_hdd_)-=----\             |    | | |    |             /----=-(_hdd_) | |
 | |_________|    |             |    | | |    |             |    |_________| |
 |  _________     |             |    | | |    |             |     _________  |
 | | [ vmE ] |    |             |    | | |    |             |    | [ vmN ] | |
 | |  _____  |    |             |    | | |    |             |    |  _____  | |
 | | (_hdd_)-=--\ |   __________|    | | |    |__________   | /--=-(_hdd_) | |
 | |_________|  | \--| Storage  =----/ | \----=  Storage |--/ |  |_________| |
 |  _________   \----| Network ||      |      || Network |----/   _________  |
 | | [ vmF ] |  /----|_________||      |      ||_________|----\  | [ vmO ] | |
 | |  _____  |  |               |      |      |               |  |  _____  | |
 | | (_hdd_)-=--+               |      |      |               \--=-(_hdd_) | |
 | | (_ssd_)-=--/               |      |      |                  |_________| |
 | |_________|                  |      |      |                              |
 |______________________________|      |      |______________________________|            
  ______________________________       |                                      
 | [ an-node05 ]                |      |                                      
 |  _________                   |      |                                      
 | | [ vmG ] |                  |      |                                      
 | |  _____  |                  |      |                                      
 | | (_hdd_)-=----\             |      |                                      
 | |_________|    |             |      |                                      
 |  _________     |             |      |                                      
 | | [ vmH ] |    |             |      |                                      
 | |  _____  |    |             |      |                                      
 | | (_hdd_)-=--\ |             |      |                                      
 | | (_sdd_)-=--+ |   __________|      |                                      
 | |_________|  | \--| Storage  =------/                                      
 |  _________   \----| Network ||                                             
 | | [ vmI ] |  /----|_________||                                             
 | |  _____  |  |               |                                             
 | | (_hdd_)-=--/               |                                             
 | |_________|                  |                                             
 |______________________________|

Long View

Note: Yes, this is a big graphic, but this is also a big project. I am no artist though, and any help making this clearer is greatly appreciated!
The planned network. This shows separate IPMI and full redundancy through-out the cluster. This is the way a production cluster should be built, but is not expected for dev/test clusters.

Failure Mapping

VM Cluster; Guest VM failure migration planning;

  • Each node can host 5 VMs @ 2GB/VM.
  • This is an N-1 cluster with five nodes; 20 VMs total.
          |    All    | an-node03 | an-node04 | an-node05 | an-node06 | an-node07 |
          | on-line   |   down    |   down    |   down    |   down    |   down    |
----------+-----------+-----------+-----------+-----------+-----------+-----------+
an-node03 |   vm01    |    --     |   vm01    |   vm01    |   vm01    |   vm01    |
          |   vm02    |    --     |   vm02    |   vm02    |   vm02    |   vm02    |
          |   vm03    |    --     |   vm03    |   vm03    |   vm03    |   vm03    |
          |   vm04    |    --     |   vm04    |   vm04    |   vm04    |   vm04    |
          |    --     |    --     |   vm05    |   vm09    |   vm13    |   vm17    |
----------|-----------|-----------|-----------|-----------|-----------|-----------|
an-node04 |   vm05    |   vm05    |    --     |   vm05    |   vm05    |   vm05    |
          |   vm06    |   vm06    |    --     |   vm06    |   vm06    |   vm06    |
          |   vm07    |   vm07    |    --     |   vm07    |   vm07    |   vm07    |
          |   vm08    |   vm08    |    --     |   vm08    |   vm08    |   vm08    |
          |    --     |   vm01    |    --     |   vm10    |   vm14    |   vm18    |
----------|-----------|-----------|-----------|-----------|-----------|-----------|
an-node05 |   vm09    |   vm09    |   vm09    |    --     |   vm09    |   vm09    |
          |   vm10    |   vm10    |   vm10    |    --     |   vm10    |   vm10    |
          |   vm11    |   vm11    |   vm11    |    --     |   vm11    |   vm11    |
          |   vm12    |   vm12    |   vm12    |    --     |   vm12    |   vm12    |
          |    --     |   vm02    |   vm06    |    --     |   vm15    |   vm19    |
----------|-----------|-----------|-----------|-----------|-----------|-----------|
an-node06 |   vm13    |   vm13    |   vm13    |   vm13    |    --     |   vm13    |
          |   vm14    |   vm14    |   vm14    |   vm14    |    --     |   vm14    |
          |   vm15    |   vm15    |   vm15    |   vm15    |    --     |   vm15    |
          |   vm16    |   vm16    |   vm16    |   vm16    |    --     |   vm16    |
          |    --     |   vm03    |   vm07    |   vm11    |    --     |   vm20    |
----------|-----------|-----------|-----------|-----------|-----------|-----------|
an-node07 |   vm17    |   vm17    |   vm17    |   vm17    |   vm17    |    --     |
          |   vm18    |   vm18    |   vm18    |   vm18    |   vm18    |    --     |
          |   vm19    |   vm19    |   vm19    |   vm19    |   vm19    |    --     |
          |   vm20    |   vm20    |   vm20    |   vm20    |   vm20    |    --     |
          |    --     |   vm04    |   vm08    |   vm12    |   vm16    |    --     |
----------+-----------+-----------+-----------+-----------+-----------+-----------+

Cluster Overview

Note: This is not programatically accurate!

This is meant to show, at a logical level, how the parts of a cluster work together. It is the first draft and is likely defective in terrible ways.

[ Resource Managment ]                                                                                                           
  ___________     ___________                                                                                                
 |           |   |           |                                                                                               
 | Service A |   | Service B |                                                                                               
 |___________|   |___________|                                                                                               
            |     |         |                                                                                                    
          __|_____|__    ___|_______________                                                                                     
         |           |  |                   |                                                                                    
         | RGManager |  | Clustered Storage |================================================.                                   
         |___________|  |___________________|                                                |                                   
               |                  |                                                          |                                   
               |__________________|______________                                            |                                 
                            |                    \                                           |                                 
         _________      ____|____                 |                                          |                                 
        |         |    |         |                |                                          |                                 
 /------| Fencing |----| Locking |                |                                          |                                 
 |      |_________|    |_________|                |                                          |                                 
_|___________|_____________|______________________|__________________________________________|_____
 |           |             |                      |                                          |                                  
 |     ______|_____    ____|___                   |                                          |                                  
 |    |            |  |        |                  |                                          |                                  
 |    | Membership |  | Quorum |                  |                                          |                                  
 |    |____________|  |________|                  |                                          |                                  
 |           |____________|                       |                                          |                                  
 |                      __|__                     |                                          |                                  
 |                     /     \                    |                                          |                                  
 |                    { Totem }                   |                                          |                                  
 |                     \_____/                    |                                          |                                  
 |      __________________|_______________________|_______________ ______________            |                                    
 |     |-----------|-----------|----------------|-----------------|--------------|           |                                    
 |  ___|____    ___|____    ___|____         ___|____        _____|_____    _____|_____    __|___                                 
 | |        |  |        |  |        |       |        |      |           |  |           |  |      |                                
 | | Node 1 |  | Node 2 |  | Node 3 |  ...  | Node N |      | Storage 1 |==| Storage 2 |==| DRBD |                                
 | |________|  |________|  |________|       |________|      |___________|  |___________|  |______|                                
 \_____|___________|___________|________________|_________________|______________|                                                
                                                                                                                                 
[ Cluster Communication ]

Network IPs

SAN: 10.10.1.1

Node:
          | IFN         | SN         | BCN       | IPMI      |
----------+-------------+------------+-----------+-----------+
an-node01 | 10.255.0.1  | 10.10.0.1  | 10.20.0.1 | 10.20.1.1 |                                                 
an-node02 | 10.255.0.2  | 10.10.0.2  | 10.20.0.2 | 10.20.1.2 |                                                 
an-node03 | 10.255.0.3  | 10.10.0.3  | 10.20.0.3 | 10.20.1.3 |                                                 
an-node04 | 10.255.0.4  | 10.10.0.4  | 10.20.0.4 | 10.20.1.4 |                                                 
an-node05 | 10.255.0.5  | 10.10.0.5  | 10.20.0.5 | 10.20.1.5 |                                                 
an-node06 | 10.255.0.6  | 10.10.0.6  | 10.20.0.6 | 10.20.1.6 |                                                 
an-node07 | 10.255.0.7  | 10.10.0.7  | 10.20.0.7 | 10.20.1.7 |                                                 
----------+-------------+------------+-----------+-----------+

Aux Equipment:
          | BCN         |
----------+-------------+
pdu1      | 10.20.2.1   |                                                                                                  
pdu2      | 10.20.2.2   |                                                                                                  
switch1   | 10.20.2.3   |                                                                                                  
switch2   | 10.20.2.4   |                                                                                                  
ups1      | 10.20.2.5   |                                                                                         
ups2      | 10.20.2.6   |                                                                                         
----------+-------------+
                                                                                                                  
VMs:                                                                                                              
          | VMN         |                                                                                         
----------+-------------+
vm01      | 10.254.0.1  |                                                                                         
vm02      | 10.254.0.2  |                                                                                         
vm03      | 10.254.0.3  |                                                                                         
vm04      | 10.254.0.4  |                                                                                         
vm05      | 10.254.0.5  |                                                                                         
vm06      | 10.254.0.6  |                                                                                         
vm07      | 10.254.0.7  |                                                                                         
vm08      | 10.254.0.8  |                                                                                         
vm09      | 10.254.0.9  |                                                                                         
vm10      | 10.254.0.10 |                                                                                         
vm11      | 10.254.0.11 |                                                                                         
vm12      | 10.254.0.12 |                                                                                         
vm13      | 10.254.0.13 |                                                                                         
vm14      | 10.254.0.14 |                                                                                         
vm15      | 10.254.0.15 |                                                                                         
vm16      | 10.254.0.16 |                                                                                         
vm17      | 10.254.0.17 |                                                                                         
vm18      | 10.254.0.18 |                                                                                         
vm19      | 10.254.0.19 |                                                                                         
vm20      | 10.254.0.20 |                                                                                         
----------+-------------+

Install The Cluster Software

If you are using Red Hat Enterprise Linux, you will need to add the RHEL Server Optional (v. 6 64-bit x86_64) channel for each node in your cluster. You can do this in RHN by going the your subscription management page, clicking on each server, clicking on "Alter Channel Subscriptions", click to enable the RHEL Server Optional (v. 6 64-bit x86_64) channel and then by clicking on "Change Subscription".

This actual installation is simple, just use yum to install cman.

yum install cman fence-agents rgmanager resource-agents lvm2-cluster gfs2-utils python-virtinst libvirt qemu-kvm-tools qemu-kvm virt-manager virt-viewer virtio-win

Initial Config

Everything uses ricci, which itself needs to have a password set. I set this to match root.

Both:

passwd ricci
New password: 
Retype new password: 
passwd: all authentication tokens updated successfully.

With these decisions and the information gathered, here is what our first /etc/cluster/cluster.conf file will look like.

touch /etc/cluster/cluster.conf
vim /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="1" name="an-cluster">
	<cman expected_votes="1" two_node="1" />
	<clusternodes>
		<clusternode name="an-node01.alteeve.ca" nodeid="1">
			<fence>
				<method name="ipmi">
					<device action="reboot" name="ipmi_an01" />
				</method>
				<method name="pdu">
					<device action="reboot" name="pdu1" port="1" />
					<device action="reboot" name="pdu2" port="1" />
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node02.alteeve.ca" nodeid="2">
			<fence>
				<method name="ipmi">
					<device action="reboot" name="ipmi_an02" />
				</method>
				<method name="pdu">
					<device action="reboot" name="pdu1" port="2" />
					<device action="reboot" name="pdu2" port="2" />
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node03.alteeve.ca" nodeid="3">
			<fence>
				<method name="ipmi">
					<device action="reboot" name="ipmi_an03" />
				</method>
				<method name="pdu">
					<device action="reboot" name="pdu1" port="3" />
					<device action="reboot" name="pdu2" port="3" />
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node04.alteeve.ca" nodeid="4">
			<fence>
				<method name="ipmi">
					<device action="reboot" name="ipmi_an04" />
				</method>
				<method name="pdu">
					<device action="reboot" name="pdu1" port="4" />
					<device action="reboot" name="pdu2" port="4" />
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node05.alteeve.ca" nodeid="5">
			<fence>
				<method name="ipmi">
					<device action="reboot" name="ipmi_an05" />
				</method>
				<method name="pdu">
					<device action="reboot" name="pdu1" port="5" />
					<device action="reboot" name="pdu2" port="5" />
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node06.alteeve.ca" nodeid="6">
			<fence>
				<method name="ipmi">
					<device action="reboot" name="ipmi_an06" />
				</method>
				<method name="pdu">
					<device action="reboot" name="pdu1" port="6" />
					<device action="reboot" name="pdu2" port="6" />
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node07.alteeve.ca" nodeid="7">
			<fence>
				<method name="ipmi">
					<device action="reboot" name="ipmi_an07" />
				</method>
				<method name="pdu">
					<device action="reboot" name="pdu1" port="7" />
					<device action="reboot" name="pdu2" port="7" />
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice agent="fence_ipmilan" ipaddr="an-node01.ipmi" login="root" name="ipmi_an01" passwd="secret" />
		<fencedevice agent="fence_ipmilan" ipaddr="an-node02.ipmi" login="root" name="ipmi_an02" passwd="secret" />
		<fencedevice agent="fence_ipmilan" ipaddr="an-node03.ipmi" login="root" name="ipmi_an03" passwd="secret" />
		<fencedevice agent="fence_ipmilan" ipaddr="an-node04.ipmi" login="root" name="ipmi_an04" passwd="secret" />
		<fencedevice agent="fence_ipmilan" ipaddr="an-node05.ipmi" login="root" name="ipmi_an05" passwd="secret" />
		<fencedevice agent="fence_ipmilan" ipaddr="an-node06.ipmi" login="root" name="ipmi_an06" passwd="secret" />
		<fencedevice agent="fence_ipmilan" ipaddr="an-node07.ipmi" login="root" name="ipmi_an07" passwd="secret" />
		<fencedevice agent="fence_apc_snmp" ipaddr="pdu1.alteeve.ca" name="pdu1" />
		<fencedevice agent="fence_apc_snmp" ipaddr="pdu2.alteeve.ca" name="pdu2" />
	</fencedevices>
	<fence_daemon post_join_delay="30" />
	<totem rrp_mode="none" secauth="off" />
	<rm>
		<resources>
			<ip address="10.10.1.1" monitor_link="on" />
			<script file="/etc/init.d/tgtd" name="tgtd" />
			<script file="/etc/init.d/drbd" name="drbd" />
			<script file="/etc/init.d/clvmd" name="clvmd" />
			<script file="/etc/init.d/gfs2" name="gfs2" />
			<script file="/etc/init.d/libvirtd" name="libvirtd" />
		</resources>
		<failoverdomains>
			<!-- Used for storage -->
			<!-- SAN Nodes -->
			<failoverdomain name="an1_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node01.alteeve.ca" />
			</failoverdomain>
			<failoverdomain name="an2_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node02.alteeve.ca" />
			</failoverdomain>
			
			<!-- VM Nodes -->
			<failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node03.alteeve.ca" />
			</failoverdomain>
			<failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node04.alteeve.ca" />
			</failoverdomain>
			<failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node05.alteeve.ca" />
			</failoverdomain>
			<failoverdomain name="an6_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node06.alteeve.ca" />
			</failoverdomain>
			<failoverdomain name="an7_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node07.alteeve.ca" />
			</failoverdomain>
			
			<!-- Domain for the SAN -->
			<failoverdomain name="an1_primary" nofailback="1" ordered="1" restricted="0">
				<failoverdomainnode name="an-node01.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node02.alteeve.ca" priority="2" />
			</failoverdomain>
			
			<!-- Domains for VMs running primarily on an-node03 -->
			<failoverdomain name="an3_an4" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node03.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node04.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an3_an5" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node03.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node05.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an3_an6" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node03.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node06.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an3_an7" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node03.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node07.alteeve.ca" priority="2" />
			</failoverdomain>
			
			<!-- Domains for VMs running primarily on an-node04 -->
			<failoverdomain name="an4_an3" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node04.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node03.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an4_an5" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node04.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node05.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an4_an6" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node04.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node06.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an4_an7" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node04.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node07.alteeve.ca" priority="2" />
			</failoverdomain>
			
			<!-- Domains for VMs running primarily on an-node05 -->
			<failoverdomain name="an5_an3" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node05.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node03.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an5_an4" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node05.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node04.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an5_an6" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node05.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node06.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an5_an7" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node05.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node07.alteeve.ca" priority="2" />
			</failoverdomain>
			
			<!-- Domains for VMs running primarily on an-node06 -->
			<failoverdomain name="an6_an3" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node06.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node03.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an6_an4" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node06.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node04.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an6_an5" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node06.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node05.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an6_an7" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node06.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node07.alteeve.ca" priority="2" />
			</failoverdomain>
			
			<!-- Domains for VMs running primarily on an-node07 -->
			<failoverdomain name="an7_an3" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node07.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node03.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an7_an4" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node07.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node04.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an7_an5" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node07.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node05.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an7_an6" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node07.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node06.alteeve.ca" priority="2" />
			</failoverdomain>
		</failoverdomains>
		
		<!-- SAN Services -->
		<service autostart="1" domain="an1_only" exclusive="0" max_restarts="0" name="an1_storage" recovery="restart" restart_expire_time="0">
			<script ref="drbd">
				<script ref="clvmd">
					<script ref="tgtd" />
				</script>
			</script>
		</service>
		<service autostart="1" domain="an2_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart" restart_expire_time="0">
			<script ref="drbd">
				<script ref="clvmd">
					<script ref="tgtd" />
				</script>
			</script>
		</service>
		<service autostart="1" domain="an1_primary" name="san_ip" recovery="relocate">
			<ip ref="10.10.1.1" />
		</service>
		
		<!-- VM Storage services. -->
		<service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an3_storage" recovery="restart" restart_expire_time="0">
			<script ref="drbd">
				<script ref="clvmd">
					<script ref="gfs2">
						<script ref="libvirtd" />
					</script>
				</script>
			</script>
		</service>
		<service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart" restart_expire_time="0">
			<script ref="drbd">
				<script ref="clvmd">
					<script ref="gfs2">
						<script ref="libvirtd" />
					</script>
				</script>
			</script>
		</service>
		<service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart" restart_expire_time="0">
			<script ref="drbd">
				<script ref="clvmd">
					<script ref="gfs2">
						<script ref="libvirtd" />
					</script>
				</script>
			</script>
		</service>
		<service autostart="1" domain="an6_only" exclusive="0" max_restarts="0" name="an6_storage" recovery="restart" restart_expire_time="0">
			<script ref="drbd">
				<script ref="clvmd">
					<script ref="gfs2">
						<script ref="libvirtd" />
					</script>
				</script>
			</script>
		</service>
		<service autostart="1" domain="an7_only" exclusive="0" max_restarts="0" name="an7_storage" recovery="restart" restart_expire_time="0">
			<script ref="drbd">
				<script ref="clvmd">
					<script ref="gfs2">
						<script ref="libvirtd" />
					</script>
				</script>
			</script>
		</service>
		
		<!-- VM Services -->
		<!-- VMs running primarily on an-node03 -->
		<vm name="vm01" domain="an03_an04" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm02" domain="an03_an05" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm03" domain="an03_an06" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm04" domain="an03_an07" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		
		<!-- VMs running primarily on an-node04 -->
		<vm name="vm05" domain="an04_an03" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm06" domain="an04_an05" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm07" domain="an04_an06" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm08" domain="an04_an07" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		
		<!-- VMs running primarily on an-node05 -->
		<vm name="vm09" domain="an05_an03" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm10" domain="an05_an04" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm11" domain="an05_an06" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm12" domain="an05_an07" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		
		<!-- VMs running primarily on an-node06 -->
		<vm name="vm13" domain="an06_an03" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm14" domain="an06_an04" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm15" domain="an06_an05" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm16" domain="an06_an07" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		
		<!-- VMs running primarily on an-node07 -->
		<vm name="vm17" domain="an07_an03" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm18" domain="an07_an04" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm19" domain="an07_an05" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm20" domain="an07_an06" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
	</rm>
</cluster>

Save the file, then validate it. If it fails, address the errors and try again.

ip addr list | grep <ip>
rg_test test /etc/cluster/cluster.conf
ccs_config_validate
Configuration validates

Push it to the other node:

rsync -av /etc/cluster/cluster.conf root@an-node02:/etc/cluster/
sending incremental file list
cluster.conf

sent 781 bytes  received 31 bytes  541.33 bytes/sec
total size is 701  speedup is 0.86

Start:


DO NOT PROCEED UNTIL YOUR cluster.conf FILE VALIDATES!

Unless you have it perfect, your cluster will fail.

Once it validates, proceed.

Starting The Cluster For The First Time

By default, if you start one node only and you've enabled the <cman two_node="1" expected_votes="1"/> option as we have done, the lone server will effectively gain quorum. It will try to connect to the cluster, but there won't be a cluster to connect to, so it will fence the other node after a timeout period. This timeout is 6 seconds by default.

For now, we will leave the default as it is. If you're interested in changing it though, the argument you are looking for is post_join_delay.

This behaviour means that we'll want to start both nodes well within six seconds of one another, least the slower one get needlessly fenced.

Left off here

Note to help minimize dual-fences:

  • you could add FENCED_OPTS="-f 5" to /etc/sysconfig/cman on *one* node (ilo fence devices may need this)

DRBD Config

Install from source:

Both:

# Obliterate peer - fence via cman
wget -c https://alteeve.ca/files/an-cluster/sbin/obliterate-peer.sh -O /sbin/obliterate-peer.sh
chmod a+x /sbin/obliterate-peer.sh
ls -lah /sbin/obliterate-peer.sh

# Download, compile and install DRBD
wget -c http://oss.linbit.com/drbd/8.3/drbd-8.3.11.tar.gz
tar -xvzf drbd-8.3.11.tar.gz
cd drbd-8.3.11
./configure \
   --prefix=/usr \
   --localstatedir=/var \
   --sysconfdir=/etc \
   --with-utils \
   --with-km \
   --with-udev \
   --with-pacemaker \
   --with-rgmanager \
   --with-bashcompletion
make
make install
chkconfig --add drbd
chkconfig drbd off

Configure

an-node01:

# Configure DRBD's global value.
cp /etc/drbd.d/global_common.conf /etc/drbd.d/global_common.conf.orig
vim /etc/drbd.d/global_common.conf
diff -u /etc/drbd.d/global_common.conf
--- /etc/drbd.d/global_common.conf.orig	2011-08-01 21:58:46.000000000 -0400
+++ /etc/drbd.d/global_common.conf	2011-08-01 23:18:27.000000000 -0400
@@ -15,24 +15,35 @@
 		# out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
 		# before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
 		# after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
+		fence-peer		"/sbin/obliterate-peer.sh";
 	}
 
 	startup {
 		# wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
+		become-primary-on	both;
+		wfc-timeout		300;
+		degr-wfc-timeout	120;
 	}
 
 	disk {
 		# on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
 		# no-disk-drain no-md-flushes max-bio-bvecs
+		fencing			resource-and-stonith;
 	}
 
 	net {
 		# sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
 		# max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
 		# after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
+		allow-two-primaries;
+		after-sb-0pri		discard-zero-changes;
+		after-sb-1pri		discard-secondary;
+		after-sb-2pri		disconnect;
 	}
 
 	syncer {
 		# rate after al-extents use-rle cpu-mask verify-alg csums-alg
+		# This should be no more than 30% of the maximum sustainable write speed.
+		rate			20M;
 	}
 }
vim /etc/drbd.d/r0.res
resource r0 {
        device          /dev/drbd0;
        meta-disk       internal;
        on an-node01.alteeve.ca {
                address         192.168.2.71:7789;
                disk            /dev/sda5;
        }
        on an-node02.alteeve.ca {
                address         192.168.2.72:7789;
                disk            /dev/sda5;
        }
}
cp /etc/drbd.d/r0.res /etc/drbd.d/r1.res 
vim /etc/drbd.d/r1.res
resource r1 {
        device          /dev/drbd1;
        meta-disk       internal;
        on an-node01.alteeve.ca {
                address         192.168.2.71:7790;
                disk            /dev/sdb1;
        }
        on an-node02.alteeve.ca {
                address         192.168.2.72:7790;
                disk            /dev/sdb1;
        }
}
Note: If you have multiple DRBD resources on on (set of) backing disks, consider adding syncer { after <minor-1>; }. For example, tell /dev/drbd1 to wait for /dev/drbd0 by adding syncer { after 0; }. This will prevent simultaneous resync's which could seriously impact performance. Resources will wait in state until the defined resource has completed sync'ing.

Validate:

drbdadm dump
  --==  Thank you for participating in the global usage survey  ==--
The server's response is:

you are the 369th user to install this version
# /usr/etc/drbd.conf
common {
    protocol               C;
    net {
        allow-two-primaries;
        after-sb-0pri    discard-zero-changes;
        after-sb-1pri    discard-secondary;
        after-sb-2pri    disconnect;
    }
    disk {
        fencing          resource-and-stonith;
    }
    syncer {
        rate             20M;
    }
    startup {
        wfc-timeout      300;
        degr-wfc-timeout 120;
        become-primary-on both;
    }
    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error   "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        fence-peer       /sbin/obliterate-peer.sh;
    }
}

# resource r0 on an-node01.alteeve.ca: not ignored, not stacked
resource r0 {
    on an-node01.alteeve.ca {
        device           /dev/drbd0 minor 0;
        disk             /dev/sda5;
        address          ipv4 192.168.2.71:7789;
        meta-disk        internal;
    }
    on an-node02.alteeve.ca {
        device           /dev/drbd0 minor 0;
        disk             /dev/sda5;
        address          ipv4 192.168.2.72:7789;
        meta-disk        internal;
    }
}

# resource r1 on an-node01.alteeve.ca: not ignored, not stacked
resource r1 {
    on an-node01.alteeve.ca {
        device           /dev/drbd1 minor 1;
        disk             /dev/sdb1;
        address          ipv4 192.168.2.71:7790;
        meta-disk        internal;
    }
    on an-node02.alteeve.ca {
        device           /dev/drbd1 minor 1;
        disk             /dev/sdb1;
        address          ipv4 192.168.2.72:7790;
        meta-disk        internal;
    }
}
rsync -av /etc/drbd.d root@an-node02:/etc/
drbd.d/
drbd.d/global_common.conf
drbd.d/global_common.conf.orig
drbd.d/r0.res
drbd.d/r1.res

sent 3523 bytes  received 110 bytes  7266.00 bytes/sec
total size is 3926  speedup is 1.08

Initialize and First start

Both:

Create the meta-data.

modprobe
drbdadm create-md r{0,1}
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success

Attach, connect and confirm (after both have attached and connected):

drbdadm attach r{0,1}
drbdadm connect r{0,1}
cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:441969960
 1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:29309628

There is no data, so force both devices to be instantly UpToDate:

drbdadm -- --clear-bitmap new-current-uuid r{0,1}
cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
 0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Set both to primary and run a final check.

drbdadm primary r{0,1}
cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:672 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:672 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Update the cluster

vim /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="17" name="an-clusterA">
        <cman expected_votes="1" two_node="1"/>
        <totem rrp_mode="none" secauth="off"/>
        <clusternodes>
                <clusternode name="an-node01.alteeve.ca" nodeid="1">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node02.alteeve.ca" nodeid="2">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="2"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
        </fencedevices>
        <fence_daemon post_join_delay="30"/>
        <rm>
                <resources>
                        <ip address="192.168.2.100" monitor_link="on"/>
                        <script file="/etc/init.d/drbd" name="drbd"/>
                        <script file="/etc/init.d/clvmd" name="clvmd"/>
                        <script file="/etc/init.d/tgtd" name="tgtd"/>
                </resources>
                <failoverdomains>
                        <failoverdomain name="an1_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node01.alteeve.ca"/>
                        </failoverdomain>
                        <failoverdomain name="an2_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node02.alteeve.ca"/>
                        </failoverdomain>
                        <failoverdomain name="an1_primary" nofailback="1" ordered="1" restricted="0">
                                <failoverdomainnode name="an-node01.alteeve.ca" priority="1"/>
                                <failoverdomainnode name="an-node02.alteeve.ca" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <service autostart="0" domain="an1_only" exclusive="0" max_restarts="0" name="an1_storage" recovery="restart" restart_expire_time="0">
                        <script ref="drbd">
                                <script ref="clvmd"/>
                        </script>
                </service>
                <service autostart="0" domain="an2_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart" restart_expire_time="0">
                        <script ref="drbd">
                                <script ref="clvmd"/>
                        </script>
                </service>
        </rm>
</cluster>
rg_test test /etc/cluster/cluster.conf
Running in test mode.
Loading resource rule from /usr/share/cluster/oralistener.sh
Loading resource rule from /usr/share/cluster/apache.sh
Loading resource rule from /usr/share/cluster/SAPDatabase
Loading resource rule from /usr/share/cluster/postgres-8.sh
Loading resource rule from /usr/share/cluster/lvm.sh
Loading resource rule from /usr/share/cluster/mysql.sh
Loading resource rule from /usr/share/cluster/lvm_by_vg.sh
Loading resource rule from /usr/share/cluster/service.sh
Loading resource rule from /usr/share/cluster/samba.sh
Loading resource rule from /usr/share/cluster/SAPInstance
Loading resource rule from /usr/share/cluster/checkquorum
Loading resource rule from /usr/share/cluster/ocf-shellfuncs
Loading resource rule from /usr/share/cluster/svclib_nfslock
Loading resource rule from /usr/share/cluster/script.sh
Loading resource rule from /usr/share/cluster/clusterfs.sh
Loading resource rule from /usr/share/cluster/fs.sh
Loading resource rule from /usr/share/cluster/oracledb.sh
Loading resource rule from /usr/share/cluster/nfsserver.sh
Loading resource rule from /usr/share/cluster/netfs.sh
Loading resource rule from /usr/share/cluster/orainstance.sh
Loading resource rule from /usr/share/cluster/vm.sh
Loading resource rule from /usr/share/cluster/lvm_by_lv.sh
Loading resource rule from /usr/share/cluster/tomcat-6.sh
Loading resource rule from /usr/share/cluster/ip.sh
Loading resource rule from /usr/share/cluster/nfsexport.sh
Loading resource rule from /usr/share/cluster/openldap.sh
Loading resource rule from /usr/share/cluster/ASEHAagent.sh
Loading resource rule from /usr/share/cluster/nfsclient.sh
Loading resource rule from /usr/share/cluster/named.sh
Loaded 24 resource rules
=== Resources List ===
Resource type: ip
Instances: 1/1
Agent: ip.sh
Attributes:
  address = 192.168.2.100 [ primary unique ]
  monitor_link = on
  nfslock [ inherit("service%nfslock") ]

Resource type: script
Agent: script.sh
Attributes:
  name = drbd [ primary unique ]
  file = /etc/init.d/drbd [ unique required ]
  service_name [ inherit("service%name") ]

Resource type: script
Agent: script.sh
Attributes:
  name = clvmd [ primary unique ]
  file = /etc/init.d/clvmd [ unique required ]
  service_name [ inherit("service%name") ]

Resource type: script
Agent: script.sh
Attributes:
  name = tgtd [ primary unique ]
  file = /etc/init.d/tgtd [ unique required ]
  service_name [ inherit("service%name") ]

Resource type: service [INLINE]
Instances: 1/1
Agent: service.sh
Attributes:
  name = an1_storage [ primary unique required ]
  domain = an1_only [ reconfig ]
  autostart = 0 [ reconfig ]
  exclusive = 0 [ reconfig ]
  nfslock = 0
  nfs_client_cache = 0
  recovery = restart [ reconfig ]
  depend_mode = hard
  max_restarts = 0
  restart_expire_time = 0
  priority = 0

Resource type: service [INLINE]
Instances: 1/1
Agent: service.sh
Attributes:
  name = an2_storage [ primary unique required ]
  domain = an2_only [ reconfig ]
  autostart = 0 [ reconfig ]
  exclusive = 0 [ reconfig ]
  nfslock = 0
  nfs_client_cache = 0
  recovery = restart [ reconfig ]
  depend_mode = hard
  max_restarts = 0
  restart_expire_time = 0
  priority = 0

Resource type: service [INLINE]
Instances: 1/1
Agent: service.sh
Attributes:
  name = san_ip [ primary unique required ]
  domain = an1_primary [ reconfig ]
  autostart = 0 [ reconfig ]
  exclusive = 0 [ reconfig ]
  nfslock = 0
  nfs_client_cache = 0
  recovery = relocate [ reconfig ]
  depend_mode = hard
  max_restarts = 0
  restart_expire_time = 0
  priority = 0

=== Resource Tree ===
service (S0) {
  name = "an1_storage";
  domain = "an1_only";
  autostart = "0";
  exclusive = "0";
  nfslock = "0";
  nfs_client_cache = "0";
  recovery = "restart";
  depend_mode = "hard";
  max_restarts = "0";
  restart_expire_time = "0";
  priority = "0";
  script (S0) {
    name = "drbd";
    file = "/etc/init.d/drbd";
    service_name = "an1_storage";
    script (S0) {
      name = "clvmd";
      file = "/etc/init.d/clvmd";
      service_name = "an1_storage";
    }
  }
}
service (S0) {
  name = "an2_storage";
  domain = "an2_only";
  autostart = "0";
  exclusive = "0";
  nfslock = "0";
  nfs_client_cache = "0";
  recovery = "restart";
  depend_mode = "hard";
  max_restarts = "0";
  restart_expire_time = "0";
  priority = "0";
  script (S0) {
    name = "drbd";
    file = "/etc/init.d/drbd";
    service_name = "an2_storage";
    script (S0) {
      name = "clvmd";
      file = "/etc/init.d/clvmd";
      service_name = "an2_storage";
    }
  }
}
service (S0) {
  name = "san_ip";
  domain = "an1_primary";
  autostart = "0";
  exclusive = "0";
  nfslock = "0";
  nfs_client_cache = "0";
  recovery = "relocate";
  depend_mode = "hard";
  max_restarts = "0";
  restart_expire_time = "0";
  priority = "0";
  ip (S0) {
    address = "192.168.2.100";
    monitor_link = "on";
    nfslock = "0";
  }
}
=== Failover Domains ===
Failover domain: an1_only
Flags: Restricted No Failback
  Node an-node01.alteeve.ca (id 1, priority 0)
Failover domain: an2_only
Flags: Restricted No Failback
  Node an-node02.alteeve.ca (id 2, priority 0)
Failover domain: an1_primary
Flags: Ordered No Failback
  Node an-node01.alteeve.ca (id 1, priority 1)
  Node an-node02.alteeve.ca (id 2, priority 2)
=== Event Triggers ===
Event Priority Level 100:
  Name: Default
    (Any event)
    File: /usr/share/cluster/default_event_script.sl
[root@an-node01 ~]# cman_tool version -r
You have not authenticated to the ricci daemon on an-node01.alteeve.ca
Password: 
[root@an-node01 ~]# clusvcadm -e service:an1_storage
Local machine trying to enable service:an1_storage...Success
service:an1_storage is now running on an-node01.alteeve.ca
[root@an-node01 ~]# cat /proc/drbd 
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:924 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:916 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
cman_tool version -r
You have not authenticated to the ricci daemon on an-node01.alteeve.ca
Password:

an-node01:

clusvcadm -e service:an1_storage
service:an1_storage is now running on an-node01.alteeve.ca

an-node02:

clusvcadm -e service:an2_storage
service:an2_storage is now running on an-node02.alteeve.ca

Either

cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:924 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:916 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Configure Clustered LVM

an-node01:

cp /etc/lvm/lvm.conf /etc/lvm/lvm.conf.orig
vim /etc/lvm/lvm.conf
diff -u /etc/lvm/lvm.conf.orig /etc/lvm/lvm.conf
--- /etc/lvm/lvm.conf.orig	2011-08-02 21:59:01.000000000 -0400
+++ /etc/lvm/lvm.conf	2011-08-02 22:00:17.000000000 -0400
@@ -50,7 +50,8 @@
 
 
     # By default we accept every block device:
-    filter = [ "a/.*/" ]
+    #filter = [ "a/.*/" ]
+    filter = [ "a|/dev/drbd*|", "r/.*/" ]
 
     # Exclude the cdrom drive
     # filter = [ "r|/dev/cdrom|" ]
@@ -308,7 +309,8 @@
     # Type 3 uses built-in clustered locking.
     # Type 4 uses read-only locking which forbids any operations that might 
     # change metadata.
-    locking_type = 1
+    #locking_type = 1
+    locking_type = 3
 
     # Set to 0 to fail when a lock request cannot be satisfied immediately.
     wait_for_locks = 1
@@ -324,7 +326,8 @@
     # to 1 an attempt will be made to use local file-based locking (type 1).
     # If this succeeds, only commands against local volume groups will proceed.
     # Volume Groups marked as clustered will be ignored.
-    fallback_to_local_locking = 1
+    #fallback_to_local_locking = 1
+    fallback_to_local_locking = 0
 
     # Local non-LV directory that holds file-based locks while commands are
     # in progress.  A directory like /tmp that may get wiped on reboot is OK.
rsync -av /etc/lvm/lvm.conf root@an-node02:/etc/lvm/
sending incremental file list
lvm.conf

sent 2412 bytes  received 247 bytes  5318.00 bytes/sec
total size is 24668  speedup is 9.28

Create the LVM PVs, VGs and LVs.

an-node01:

pvcreate /dev/drbd{0,1}
  Physical volume "/dev/drbd0" successfully created
  Physical volume "/dev/drbd1" successfully created

an-node02:

pvscan
  PV /dev/drbd0                      lvm2 [421.50 GiB]
  PV /dev/drbd1                      lvm2 [27.95 GiB]
  Total: 2 [449.45 GiB] / in use: 0 [0   ] / in no VG: 2 [449.45 GiB]

an-node01:

vgcreate -c y hdd_vg0 /dev/drbd0 && vgcreate -c y sdd_vg0 /dev/drbd1
  Clustered volume group "hdd_vg0" successfully created
  Clustered volume group "ssd_vg0" successfully created

an-node02:

vgscan
  Reading all physical volumes.  This may take a while...
  Found volume group "ssd_vg0" using metadata type lvm2
  Found volume group "hdd_vg0" using metadata type lvm2

an-node01:

lvcreate -l 100%FREE -n lun0 /dev/hdd_vg0 && lvcreate -l 100%FREE -n lun1 /dev/ssd_vg0
  Logical volume "lun0" created
  Logical volume "lun1" created

an-node02:

lvscan
  ACTIVE            '/dev/ssd_vg0/lun1' [27.95 GiB] inherit
  ACTIVE            '/dev/hdd_vg0/lun0' [421.49 GiB] inherit

iSCSI notes

IET vs tgt pros and cons needed.

default iscsi port is 3260

initiator: This is the client. target: This is the server side. sid: Session ID; Found with iscsiadm -m session -P 1. SID and sysfs path are not persistent, partially start-order based. iQN: iSCSI Qualified Name; This is a string that uniquely identifies targets and initiators.

Both:

yum install iscsi-initiator-utils scsi-target-utils

an-node01:

cp /etc/tgt/targets.conf /etc/tgt/targets.conf.orig
vim /etc/tgt/targets.conf
diff -u /etc/tgt/targets.conf.orig /etc/tgt/targets.conf
--- /etc/tgt/targets.conf.orig	2011-07-31 12:38:35.000000000 -0400
+++ /etc/tgt/targets.conf	2011-08-02 22:19:06.000000000 -0400
@@ -251,3 +251,9 @@
 #        vendor_id VENDOR1
 #    </direct-store>
 #</target>
+
+<target iqn.2011-08.com.alteeve:an-clusterA.target01>
+	direct-store /dev/drbd0
+	direct-store /dev/drbd1
+	vendor_id Alteeve
rsync -av /etc/tgt/targets.conf root@an-node02:/etc/tgt/
sending incremental file list
targets.conf

sent 909 bytes  received 97 bytes  670.67 bytes/sec
total size is 7093  speedup is 7.05

Update the cluster

               <service autostart="0" domain="an1_only" exclusive="0" max_restarts="0" name="an1_storage" recovery="restart" restart_expire_time="0">
                        <script ref="drbd">
                                <script ref="clvmd">
                                        <script ref="tgtd"/>
                                </script>
                        </script>
                </service>
                <service autostart="0" domain="an2_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart" restart_expire_time="0">
                        <script ref="drbd">
                                <script ref="clvmd">
                                        <script ref="tgtd"/>
                                </script>
                        </script>
                </service>

Connect to the SAN from a VM node

an-node03+:

iscsiadm -m discovery -t sendtargets -p 192.168.2.100
192.168.2.100:3260,1 iqn.2011-08.com.alteeve:an-clusterA.target01
iscsiadm --mode node --portal 192.168.2.100 --target iqn.2011-08.com.alteeve:an-clusterA.target01 --login
Logging in to [iface: default, target: iqn.2011-08.com.alteeve:an-clusterA.target01, portal: 192.168.2.100,3260]
Login to [iface: default, target: iqn.2011-08.com.alteeve:an-clusterA.target01, portal: 192.168.2.100,3260] successful.
fdisk -l
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00062f4a

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          33      262144   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              33        5255    41943040   83  Linux
/dev/sda3            5255        5777     4194304   82  Linux swap / Solaris

Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sdb doesn't contain a valid partition table

Disk /dev/sdc: 30.0 GB, 30010245120 bytes
64 heads, 32 sectors/track, 28620 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sdc doesn't contain a valid partition table

Setup the VM Cluster

Install RPMs.

yum -y install lvm2-cluster cman fence-agents

Configure lvm.conf.

cp /etc/lvm/lvm.conf /etc/lvm/lvm.conf.orig
vim /etc/lvm/lvm.conf
diff -u /etc/lvm/lvm.conf.orig /etc/lvm/lvm.conf
--- /etc/lvm/lvm.conf.orig	2011-08-02 21:59:01.000000000 -0400
+++ /etc/lvm/lvm.conf	2011-08-03 00:35:45.000000000 -0400
@@ -308,7 +308,8 @@
     # Type 3 uses built-in clustered locking.
     # Type 4 uses read-only locking which forbids any operations that might 
     # change metadata.
-    locking_type = 1
+    #locking_type = 1
+    locking_type = 3
 
     # Set to 0 to fail when a lock request cannot be satisfied immediately.
     wait_for_locks = 1
@@ -324,7 +325,8 @@
     # to 1 an attempt will be made to use local file-based locking (type 1).
     # If this succeeds, only commands against local volume groups will proceed.
     # Volume Groups marked as clustered will be ignored.
-    fallback_to_local_locking = 1
+    #fallback_to_local_locking = 1
+    fallback_to_local_locking = 0
 
     # Local non-LV directory that holds file-based locks while commands are
     # in progress.  A directory like /tmp that may get wiped on reboot is OK.
rsync -av /etc/lvm/lvm.conf root@an-node04:/etc/lvm/
sending incremental file list
lvm.conf

sent 873 bytes  received 247 bytes  2240.00 bytes/sec
total size is 24625  speedup is 21.99
rsync -av /etc/lvm/lvm.conf root@an-node05:/etc/lvm/
sending incremental file list
lvm.conf

sent 873 bytes  received 247 bytes  2240.00 bytes/sec
total size is 24625  speedup is 21.99

Config the cluster.

vim /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="5" name="an-clusterB">
        <totem rrp_mode="none" secauth="off"/>
        <clusternodes>
                <clusternode name="an-node03.alteeve.ca" nodeid="1">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="3"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node04.alteeve.ca" nodeid="2">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="4"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node05.alteeve.ca" nodeid="3">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="5"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
        </fencedevices>
        <fence_daemon post_join_delay="30"/>
        <rm>
                <resources>
                        <script file="/etc/init.d/iscsi" name="iscsi" />
                        <script file="/etc/init.d/clvmd" name="clvmd" />
                </resources>
                <failoverdomains>
                        <failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node03.alteeve.ca" />
                        </failoverdomain>
                        <failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node04.alteeve.ca" />
                        </failoverdomain>
                        <failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node05.alteeve.ca" />
                        </failoverdomain>
                </failoverdomains>
                <service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an1_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd"/>
                        </script>
                </service>
                <service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd"/>
                        </script>
                </service>
                <service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd"/>
                        </script>
                </service>
        </rm>   
</cluster>
ccs_config_validate
Configuration validates

Make sure iscsi and clvmd do not start on boot, stop both, then make sure they start and stop cleanly.

chkconfig clvmd off; chkconfig iscsi off; /etc/init.d/iscsi stop && /etc/init.d/clvmd stop
Stopping iscsi:                                            [  OK  ]
/etc/init.d/clvmd start && /etc/init.d/iscsi start && /etc/init.d/iscsi stop && /etc/init.d/clvmd stop
Starting clvmd: 
Activating VG(s):   No volume groups found
                                                           [  OK  ]
Starting iscsi:                                            [  OK  ]
Stopping iscsi:                                            [  OK  ]
Signaling clvmd to exit                                    [  OK  ]
clvmd terminated                                           [  OK  ]

Use the cluster to stop (in case it autostarted before now) and then start the services.

# Disable (stop)
clusvcadm -d service:an3_storage
clusvcadm -d service:an4_storage
clusvcadm -d service:an5_storage
# Enable (start)
clusvcadm -e service:an3_storage -m an-node03.alteeve.ca
clusvcadm -e service:an4_storage -m an-node04.alteeve.ca
clusvcadm -e service:an5_storage -m an-node05.alteeve.ca
# Check
clustat
Cluster Status for an-clusterB @ Wed Aug  3 00:25:10 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node03.alteeve.ca                        1 Online, Local, rgmanager
 an-node04.alteeve.ca                        2 Online, rgmanager
 an-node05.alteeve.ca                        3 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 service:an3_storage            an-node03.alteeve.ca           started       
 service:an4_storage            an-node04.alteeve.ca           started       
 service:an5_storage            an-node05.alteeve.ca           started

Flush iSCSI's Cache

If you remove an iQN (or change the name of one), the /etc/init.d/iscsi script will return errors. To flush it and re-scan:

I am sure there is a more elegant way.

/etc/init.d/iscsi stop && rm -rf /var/lib/iscsi/nodes/* && iscsiadm -m discovery -t sendtargets -p 192.168.2.100

Setup the VM Cluster's Clustered LVM

Partition the SAN disks

an-node03:

fdisk -l
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00062f4a

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          33      262144   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              33        5255    41943040   83  Linux
/dev/sda3            5255        5777     4194304   82  Linux swap / Solaris

Disk /dev/sdc: 30.0 GB, 30010245120 bytes
64 heads, 32 sectors/track, 28620 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sdc doesn't contain a valid partition table

Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sdb doesn't contain a valid partition table

Create partitions.

fdisk /dev/sdb
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0x403f1fb8.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.

Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
         switch off the mode (command 'c') and change display units to
         sectors (command 'u').

Command (m for help): c
DOS Compatibility flag is not set

Command (m for help): u
Changing display/entry units to sectors

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-55022, default 1): 1
Last cylinder, +cylinders or +size{K,M,G} (1-55022, default 55022): 
Using default value 55022

Command (m for help): p

Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x403f1fb8

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       55022   441964183+  83  Linux

Command (m for help): t
Selected partition 1
Hex code (type L to list codes): 8e
Changed system type of partition 1 to 8e (Linux LVM)

Command (m for help): p

Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x403f1fb8

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       55022   441964183+  8e  Linux LVM

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
fdisk /dev/sdc
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0xba7503eb.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.

Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
         switch off the mode (command 'c') and change display units to
         sectors (command 'u').

Command (m for help): c
DOS Compatibility flag is not set

Command (m for help): u
Changing display/entry units to sectors

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First sector (2048-58613759, default 2048): 
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-58613759, default 58613759): 
Using default value 58613759

Command (m for help): t
Selected partition 1
Hex code (type L to list codes): 8e
Changed system type of partition 1 to 8e (Linux LVM)

Command (m for help): p

Disk /dev/sdc: 30.0 GB, 30010245120 bytes
64 heads, 32 sectors/track, 28620 cylinders, total 58613760 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xba7503eb

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1            2048    58613759    29305856   8e  Linux LVM

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
fdisk -l
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00062f4a

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          33      262144   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              33        5255    41943040   83  Linux
/dev/sda3            5255        5777     4194304   82  Linux swap / Solaris

Disk /dev/sdc: 30.0 GB, 30010245120 bytes
64 heads, 32 sectors/track, 28620 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xba7503eb

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               2       28620    29305856   8e  Linux LVM

Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x403f1fb8

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       55022   441964183+  8e  Linux LVM

Setup LVM devices

Create PV.

an-node03:

pvcreate /dev/sd{b,c}1
  Physical volume "/dev/sdb1" successfully created
  Physical volume "/dev/sdc1" successfully created

an-node04 and an-node05:

pvscan
  PV /dev/sdb1                      lvm2 [421.49 GiB]
  PV /dev/sdc1                      lvm2 [27.95 GiB]
  Total: 2 [449.44 GiB] / in use: 0 [0   ] / in no VG: 2 [449.44 GiB]

Create the VGs.

an-node03:

vgcreate -c y san_vg01 /dev/sdb1
  Clustered volume group "san_vg01" successfully created
vgcreate -c y san_vg02 /dev/sdc1
  Clustered volume group "san_vg02" successfully created

an-node04 and an-node05:

vgscan
  Reading all physical volumes.  This may take a while...
  Found volume group "san_vg02" using metadata type lvm2
  Found volume group "san_vg01" using metadata type lvm2

Create the first VM's LVs.

an-node03:

lvcreate -L 10G -n shared01 /dev/san_vg01
  Logical volume "shared01" created
lvcreate -L 50G -n vm0001_hdd1 /dev/san_vg01
  Logical volume "vm0001_hdd1" created
lvcreate -L 10G -n vm0001_ssd1 /dev/san_vg02
  Logical volume "vm0001_ssd1" created

an-node04 and an-node05:

lvscan
  ACTIVE            '/dev/san_vg01/shared01' [10.00 GiB] inherit
  ACTIVE            '/dev/san_vg02/vm0001_ssd1' [10.00 GiB] inherit
  ACTIVE            '/dev/san_vg01/vm0001_hdd1' [50.00 GiB] inherit

Create Shared GFS2 Partition

an-node03:

mkfs.gfs2 -p lock_dlm -j 5 -t an-clusterB:shared01 /dev/san_vg01/shared01
This will destroy any data on /dev/san_vg01/shared01.
It appears to contain: symbolic link to `../dm-2'

Are you sure you want to proceed? [y/n] y

Device:                    /dev/san_vg01/shared01
Blocksize:                 4096
Device Size                10.00 GB (2621440 blocks)
Filesystem Size:           10.00 GB (2621438 blocks)
Journals:                  5
Resource Groups:           40
Locking Protocol:          "lock_dlm"
Lock Table:                "an-clusterB:shared01"
UUID:                      6C0D7D1D-A1D3-ED79-705D-28EE3D674E75

Add it to /etc/fstab (needed for the gfs2 init script to find and mount):

an-node03 - an-node07:

echo `gfs2_edit -p sb /dev/san_vg01/shared01 | grep sb_uuid | sed -e "s/.*sb_uuid  *\(.*\)/UUID=\L\1\E \/shared01\t\tgfs2\trw,suid,dev,exec,nouser,async\t0 0/"` >> /etc/fstab 
cat /etc/fstab
#
# /etc/fstab
# Created by anaconda on Fri Jul  8 22:01:41 2011
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=2c1f4cb1-959f-4675-b9c7-5d753c303dd1 /                       ext3    defaults        1 1
UUID=9a0224dc-15b4-439e-8d7c-5f9dbcd05e3f /boot                   ext3    defaults        1 2
UUID=4f2a83e8-1769-40d8-ba2a-e1f535306848 swap                    swap    defaults        0 0
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
UUID=6c0d7d1d-a1d3-ed79-705d-28ee3d674e75 /shared01 gfs2 rw,suid,dev,exec,nouser,async 0 0

Make the mount point and mount it.

mkdir /shared01
/etc/init.d/gfs2 start
Mounting GFS2 filesystem (/shared01):                      [  OK  ]
df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              40G  3.3G   35G   9% /
tmpfs                 1.8G   32M  1.8G   2% /dev/shm
/dev/sda1             248M   85M  151M  36% /boot
/dev/mapper/san_vg01-shared01
                       10G  647M  9.4G   7% /shared01

Stop GFS2 on all five nodes and update the cluster.conf config.

/etc/init.d/gfs2 stop
Unmounting GFS2 filesystem (/shared01):                    [  OK  ]
df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              40G  3.3G   35G   9% /
tmpfs                 1.8G   32M  1.8G   2% /dev/shm
/dev/sda1             248M   85M  151M  36% /boot
/dev/mapper/san_vg01-shared01
                       10G  647M  9.4G   7% /shared01

an-node03:

<?xml version="1.0"?>
<cluster config_version="9" name="an-clusterB">
        <totem rrp_mode="none" secauth="off"/>
        <clusternodes>
                <clusternode name="an-node03.alteeve.ca" nodeid="3">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="3"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node04.alteeve.ca" nodeid="4">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="4"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node05.alteeve.ca" nodeid="5">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="5"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node06.alteeve.ca" nodeid="6">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="6"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node07.alteeve.ca" nodeid="7">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="7"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
        </fencedevices>
        <fence_daemon post_join_delay="30"/>
        <rm>
                <resources>
                        <script file="/etc/init.d/iscsi" name="iscsi"/>
                        <script file="/etc/init.d/clvmd" name="clvmd"/>
                        <script file="/etc/init.d/gfs2" name="gfs2"/>
                </resources>
                <failoverdomains>
                        <failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node03.alteeve.ca"/>
                        </failoverdomain>
                        <failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node04.alteeve.ca"/>
                        </failoverdomain>
                        <failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node05.alteeve.ca"/>
                        </failoverdomain>
                        <failoverdomain name="an6_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node06.alteeve.ca"/>
                        </failoverdomain>
                        <failoverdomain name="an7_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node07.alteeve.ca"/>
                        </failoverdomain>
                </failoverdomains>
                <service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an3_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
                <service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
                <service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
                <service autostart="1" domain="an6_only" exclusive="0" max_restarts="0" name="an6_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
                <service autostart="1" domain="an7_only" exclusive="0" max_restarts="0" name="an7_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
        </rm>
</cluster>
cman_tool version -r

Check that rgmanager picked up the updated config and remounted the GFS2 partition.

df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              40G  3.3G   35G   9% /
tmpfs                 1.8G   32M  1.8G   2% /dev/shm
/dev/sda1             248M   85M  151M  36% /boot
/dev/mapper/san_vg01-shared01
                       10G  647M  9.4G   7% /shared01

Configure KVM

Host network and VM hypervisor config.

Disable the 'qemu' Bridge

By default, libvirtd creates a bridge called virbr0 designed to connect virtual machines to the first eth0 interface. Our system will not need this, so we will remove it. This bridge is configured in the /etc/libvirt/qemu/networks/default.xml file.

So to remove this bridge, simply delete the contents of the file, stop the bridge, delete the bridge and then stop iptables to make sure any rules created for the bridge are flushed.

cat /dev/null >/etc/libvirt/qemu/networks/default.xml
ifconfig virbr0 down
brctl delbr virbr0
/etc/init.d/iptables stop

Configure Bridges

On an-node03 through an-node07:

vim /etc/sysconfig/network-scripts/ifcfg-{eth,vbr}0

ifcfg-eth0:

# Internet facing
HWADDR="bc:ae:c5:44:8a:de"
DEVICE="eth0"
BRIDGE="vbr0"
BOOTPROTO="static"
IPV6INIT="yes"
NM_CONTROLLED="no"
ONBOOT="yes"

Note that you can use what ever bridge names makes sense to you. However, the file name for the bridge configuration must sort after the ifcfg-ethX file. If the bridge file is read before the ethernet interface, it will fail to come up. Also, the bridge name as defined in the file does not need to match the one used it the actual file name. Personally, I like vbrX for "vm bridge".

ifcfg-vbr0:

# Bridge - IFN
DEVICE="vbr0"
TYPE="Bridge"
IPADDR=192.168.1.73
NETMASK=255.255.255.0
GATEWAY=192.168.1.254
DNS1=192.139.81.117
DNS2=192.139.81.1

You may wish to not make the Back-Channel Network accessible to the virtual machines, then there is no need to setup this second bridge.

vim /etc/sysconfig/network-scripts/ifcfg-{eth,vbr}2

ifcfg-eth2:

# Back-channel
HWADDR="00:1B:21:72:9B:56"
DEVICE="eth2"
BRIDGE="vbr2"
BOOTPROTO="static"
IPV6INIT="yes"
NM_CONTROLLED="no"
ONBOOT="yes"

ifcfg-vbr2:

# Bridge - BCN
DEVICE="vbr2"
TYPE="Bridge"
IPADDR=192.168.3.73
NETMASK=255.255.255.0

Leave the cluster, lest we be fenced.

/etc/init.d/rgmanager stop && /etc/init.d/cman stop

Restart networking and then check that the new bridges are up and that the proper ethernet devices are slaved to them.

/etc/init.d/network restart
Shutting down interface eth0:                              [  OK  ]
Shutting down interface eth1:                              [  OK  ]
Shutting down interface eth2:                              [  OK  ]
Shutting down loopback interface:                          [  OK  ]
Bringing up loopback interface:                            [  OK  ]
Bringing up interface eth0:                                [  OK  ]
Bringing up interface eth1:                                [  OK  ]
Bringing up interface eth2:                                [  OK  ]
Bringing up interface vbr0:                                [  OK  ]
Bringing up interface vbr2:                                [  OK  ]
brctl show
bridge name	bridge id		STP enabled	interfaces
vbr0		8000.bcaec5448ade	no		eth0
vbr2		8000.001b21729b56	no		eth2
ifconfig
eth0      Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE  
          inet6 addr: fe80::beae:c5ff:fe44:8ade/64 Scope:Link
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
          RX packets:4439 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2752 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:508352 (496.4 KiB)  TX bytes:494345 (482.7 KiB)
          Interrupt:31 Base address:0x8000 

eth1      Link encap:Ethernet  HWaddr 00:1B:21:72:96:E8  
          inet addr:192.168.2.73  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe72:96e8/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:617100 errors:0 dropped:0 overruns:0 frame:0
          TX packets:847718 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:772489353 (736.7 MiB)  TX bytes:740536232 (706.2 MiB)
          Interrupt:18 Memory:fe9e0000-fea00000 

eth2      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56  
          inet6 addr: fe80::21b:21ff:fe72:9b56/64 Scope:Link
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
          RX packets:86586 errors:0 dropped:0 overruns:0 frame:0
          TX packets:80934 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:11366700 (10.8 MiB)  TX bytes:10091579 (9.6 MiB)
          Interrupt:17 Memory:feae0000-feb00000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:32 errors:0 dropped:0 overruns:0 frame:0
          TX packets:32 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:11507 (11.2 KiB)  TX bytes:11507 (11.2 KiB)

vbr0      Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE  
          inet addr:192.168.1.73  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::beae:c5ff:fe44:8ade/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:165 errors:0 dropped:0 overruns:0 frame:0
          TX packets:89 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:25875 (25.2 KiB)  TX bytes:17081 (16.6 KiB)

vbr2      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56  
          inet addr:192.168.3.73  Bcast:192.168.3.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe72:9b56/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:74 errors:0 dropped:0 overruns:0 frame:0
          TX packets:27 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:19021 (18.5 KiB)  TX bytes:4137 (4.0 KiB)

Rejoin the cluster.

/etc/init.d/cman start && /etc/init.d/rgmanager start


Repeat these configurations, altering for MAC and IP addresses as appropriate, for the other four VM cluster nodes.

Benchmarks

GFS2 partition on an-node07's /shared01 partition. Test #1, no optimization:

bonnie++ -d /shared01/ -s 8g -u root:root
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
an-node07.alteev 8G   388  95 22203   6 14875   8  2978  95 48406  10 107.3   5
Latency               312ms   44400ms   31355ms   41505us     540ms   11926ms
Version  1.96       ------Sequential Create------ --------Random Create--------
an-node07.alteeve.c -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1144  18 +++++ +++  8643  56   939  19 +++++ +++  8262  55
Latency               291ms     586us    2085us    3511ms      51us    3669us
1.96,1.96,an-node07.alteeve.ca,1,1312497509,8G,,388,95,22203,6,14875,8,2978,95,48406,10,107.3,5,16,,,,,1144,18,+++++,+++,8643,56,939,19,+++++,+++,8262,55,312ms,44400ms,31355ms,41505us,540ms,11926ms,291ms,586us,2085us,3511ms,51us,3669us

CentOS 5.6 x86_64 VM vm0001_labzilla's /root directory. Test #1, no optimization. VM provisioned using command in below section.

bonnie++ -d /root/ -s 8g -u root:root
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
labzilla-new.can 8G   674  98 15708   5 14875   7  1570  65 47806  10 119.1   7
Latency             66766us    7680ms    1588ms     187ms     269ms    1292ms
Version  1.96       ------Sequential Create------ --------Random Create--------
labzilla-new.candco -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 27666  39 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency             11360us    1904us     799us     290us      44us      41us
1.96,1.96,labzilla-new.candcoptical.com,1,1312522208,8G,,674,98,15708,5,14875,7,1570,65,47806,10,119.1,7,16,,,,,27666,39,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,66766us,7680ms,1588ms,187ms,269ms,1292ms,11360us,1904us,799us,290us,44us,41us

Provision vm0001

Created LV already, so:

virt-install --connect qemu:///system \
  --name vm0001_labzilla \
  --ram 1024 \
  --arch x86_64 \
  --vcpus 2 \
  --cpuset 1-3 \
  --location http://192.168.1.254/c5/x86_64/img/ \
  --extra-args "ks=http://192.168.1.254/c5/x86_64/ks/labzilla_c5.ks" \
  --os-type linux \
  --os-variant rhel5.4 \
  --disk path=/dev/san_vg01/vm0001_hdd1 \
  --network bridge=vbr0 \
  --vnc

Provision vm0002

Created LV already, so:

virt-install --connect qemu:///system \
  --name vm0002_innovations \
  --ram 1024 \
  --arch x86_64 \
  --vcpus 2 \
  --cpuset 1-3 \
  --cdrom /shared01/media/Win_Server_2008_Bis_x86_64.iso \
  --os-type windows \
  --os-variant win2k8 \
  --disk path=/dev/san_vg01/vm0002_hdd2 \
  --network bridge=vbr0 \
  --hvm \
  --vnc

Update the cluster.conf to add the VMs to the cluster.

<?xml version="1.0"?>
<cluster config_version="12" name="an-clusterB">
	<totem rrp_mode="none" secauth="off"/>
	<clusternodes>
		<clusternode name="an-node03.alteeve.ca" nodeid="3">
			<fence>
				<method name="apc_pdu">
					<device action="reboot" name="pdu2" port="3"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node04.alteeve.ca" nodeid="4">
			<fence>
				<method name="apc_pdu">
					<device action="reboot" name="pdu2" port="4"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node05.alteeve.ca" nodeid="5">
			<fence>
				<method name="apc_pdu">
					<device action="reboot" name="pdu2" port="5"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node06.alteeve.ca" nodeid="6">
			<fence>
				<method name="apc_pdu">
					<device action="reboot" name="pdu2" port="6"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node07.alteeve.ca" nodeid="7">
			<fence>
				<method name="apc_pdu">
					<device action="reboot" name="pdu2" port="7"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
	</fencedevices>
	<fence_daemon post_join_delay="30"/>
	<rm log_level="5">
		<resources>
			<script file="/etc/init.d/iscsi" name="iscsi"/>
			<script file="/etc/init.d/clvmd" name="clvmd"/>
			<script file="/etc/init.d/gfs2" name="gfs2"/>
		</resources>
		<failoverdomains>
			<failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node03.alteeve.ca"/>
			</failoverdomain>
			<failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node04.alteeve.ca"/>
			</failoverdomain>
			<failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node05.alteeve.ca"/>
			</failoverdomain>
			<failoverdomain name="an6_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node06.alteeve.ca"/>
			</failoverdomain>
			<failoverdomain name="an7_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node07.alteeve.ca"/>
			</failoverdomain>
			<failoverdomain name="an3_primary" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node03.alteeve.ca" priority="1"/>
				<failoverdomainnode name="an-node04.alteeve.ca" priority="2"/>
				<failoverdomainnode name="an-node05.alteeve.ca" priority="3"/>
				<failoverdomainnode name="an-node06.alteeve.ca" priority="4"/>
				<failoverdomainnode name="an-node07.alteeve.ca" priority="5"/>
			</failoverdomain>
			<failoverdomain name="an4_primary" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node03.alteeve.ca" priority="5"/>
				<failoverdomainnode name="an-node04.alteeve.ca" priority="1"/>
				<failoverdomainnode name="an-node05.alteeve.ca" priority="2"/>
				<failoverdomainnode name="an-node06.alteeve.ca" priority="3"/>
				<failoverdomainnode name="an-node07.alteeve.ca" priority="4"/>
			</failoverdomain>
		</failoverdomains>
		<service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an3_storage" recovery="restart">
			<script ref="iscsi">
				<script ref="clvmd">
					<script ref="gfs2"/>
				</script>
			</script>
		</service>
		<service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart">
			<script ref="iscsi">
				<script ref="clvmd">
					<script ref="gfs2"/>
				</script>
			</script>
		</service>
		<service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart">
			<script ref="iscsi">
				<script ref="clvmd">
					<script ref="gfs2"/>
				</script>
			</script>
		</service>
		<service autostart="1" domain="an6_only" exclusive="0" max_restarts="0" name="an6_storage" recovery="restart">
			<script ref="iscsi">
				<script ref="clvmd">
					<script ref="gfs2"/>
				</script>
			</script>
		</service>
		<service autostart="1" domain="an7_only" exclusive="0" max_restarts="0" name="an7_storage" recovery="restart">
			<script ref="iscsi">
				<script ref="clvmd">
					<script ref="gfs2"/>
				</script>
			</script>
		</service>
		<vm autostart="0" domain="an3_primary" exclusive="0" max_restarts="2" name="vm0001_labzilla" path="/shared01/definitions/" recovery="restart" restart_expire_time="600"/>
		<vm autostart="0" domain="an4_primary" exclusive="0" max_restarts="2" name="vm0002_innovations" path="/shared01/definitions/" recovery="restart" restart_expire_time="600"/>
	</rm>
</cluster>


Stuff

Multi-VM after primary SAN (violent) ejection from cluster. Both VMs remained up!

Two VMs (windows and Linux) running on the SAN. Initial testing of survivability of primary SAN failure completed successfully!

First build of the 2x7 Cluster.

First-ever successful build/install of the "Cluster Set" cluster configuration. Fully HA, fully home-brew on all open source software using only commodity hardware. Much tuning/testing to come!

Bonding and Trunking

The goal here is to take the network out as a single point of failure.

The design is to use two stacked switches, bonded connections in the nodes with each leg of the bond cabled through either switch. While both are up, the aggregate bandwidth will be achieved using trunking in the switch and the appropriate bond driver configuration. The recovery from failure will need to be configured in such a way that it will be faster than the cluster's token loss timeouts multiplied by the token retransmit loss count.

This tutorial uses 2x D-Link DGS-3100-24 switches. This is not to endorse these switches, per-se, but it does provide a relatively affordable, decent quality switches for those who'd like to replicate this setup.

Configure The Stack

First, stack the switches using a ring topology (both HDMI connectors/cables used). If both switches are brand new, simple cable them together and the switches will auto-negotiate the stack configuration. If you are adding a new switch, then power on the existing switch, cable up the second switch and then power on the second switch. After a short time, it's stack ID should increment and you should see the new switch appear in the existing switch's interface.

Configuring the Bonding Drivers

This tutorial uses four interfaces joined into two bonds of two NICs like so:

# Internet Facing Network:
eth0 + eth1 == bond0
# Storage and Cluster Communications:
eth2 + eth3 == bond1

This requires a few steps.

  • Create /etc/modprobe.d/bonding.conf and add an entry for the two bonding channels we will create.

Note: My eth0 device is an onboard controller with a maximum MTU of 7200 bytes. This means that the whole bond is restricted to this MTU.

vim /etc/modprobe.d/bonding.conf
alias bond0 bonding
alias bond1 bonding
  • Create the ifcfg-bondX configuration files.

Internet Facing configuration

touch /etc/sysconfig/network-scripts/ifcfg-bond{0,1}
vim /etc/sysconfig/network-scripts/ifcfg-eth{0,1} /etc/sysconfig/network-scripts/ifcfg-bond0
cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Internet Facing Network - Link 1
HWADDR="BC:AE:C5:44:8A:DE"
DEVICE="eth0"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER="bond0"
SLAVE="yes"
MTU="7200"
cat /etc/sysconfig/network-scripts/ifcfg-eth1
# Internet Facing Network - Link 2
HWADDR="00:1B:21:72:96:E8"
DEVICE="eth1"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER="bond0"
SLAVE="yes"
MTU="7200"
cat /etc/sysconfig/network-scripts/ifcfg-bond0
# Internet Facing Network - Bonded Interface
DEVICE="bond0"
BOOTPROTO="static"
NM_CONTROLLED="no"
ONBOOT="yes"
IPADDR="192.168.1.73"
NETMASK="255.255.255.0"
GATEWAY="192.168.1.254"
DNS1="192.139.81.117"
DNS2="192.139.81.1"
BONDING_OPTS="miimon=1000 mode=0"
MTU="7200"

Merged Storage Network and Back Channel Network configuration.

Note: The interfaces in this bond all support maximum MTU of 9000 bytes.

vim /etc/sysconfig/network-scripts/ifcfg-eth{2,3} /etc/sysconfig/network-scripts/ifcfg-bond1
cat /etc/sysconfig/network-scripts/ifcfg-eth2
# Storage and Back Channel Networks - Link 1
HWADDR="00:1B:21:72:9B:56"
DEVICE="eth2"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER="bond1"
SLAVE="yes"
MTU="9000"
cat /etc/sysconfig/network-scripts/ifcfg-eth3
# Storage and Back Channel Networks - Link 2
HWADDR="00:0E:0C:59:46:E4"
DEVICE="eth3"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER="bond1"
SLAVE="yes"
MTU="9000"
cat /etc/sysconfig/network-scripts/ifcfg-bond1
# Storage and Back Channel Networks - Bonded Interface
DEVICE="bond1"
BOOTPROTO="static"
NM_CONTROLLED="no"
ONBOOT="yes"
IPADDR="192.168.3.73"
NETMASK="255.255.255.0"
BONDING_OPTS="miimon=1000 mode=0"
MTU="9000"

Restart networking.

Note: I've noticed that this can error out and fail to start slaved devices at times when using /etc/init.d/network restart. If you have any trouble, you may need to completely stop all networking, then start it back up. This, of course, requires network-less access to the node's console (direct access, iKVM, console redirection, etc).

Some of the errors we will see below are because the network interface configuration changed while the interfaces were still up. To avoid this, if you have networkless access to the nodes, would be to stop the network interfaces prior to beginning editing.

/etc/init.d/network restart
Shutting down interface eth0:                              [  OK  ]
Shutting down interface eth1:                              [  OK  ]
Shutting down interface eth2:  /etc/sysconfig/network-scripts/ifdown-eth: line 99: /sys/class/net/bond1/bonding/slaves: No such file or directory
                                                           [  OK  ]
Shutting down interface eth3:  /etc/sysconfig/network-scripts/ifdown-eth: line 99: /sys/class/net/bond1/bonding/slaves: No such file or directory
                                                           [  OK  ]
Shutting down loopback interface:                          [  OK  ]
Bringing up loopback interface:                            [  OK  ]
Bringing up interface bond0:  RTNETLINK answers: File exists
Error adding address 192.168.1.73 for bond0.
RTNETLINK answers: File exists
                                                           [  OK  ]
Bringing up interface bond1:                               [  OK  ]

Confirm that we've got our new bonded interfaces

ifconfig
bond0     Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE  
          inet addr:192.168.1.73  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::beae:c5ff:fe44:8ade/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:7200  Metric:1
          RX packets:1021 errors:0 dropped:0 overruns:0 frame:0
          TX packets:502 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:128516 (125.5 KiB)  TX bytes:95092 (92.8 KiB)

bond1     Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56  
          inet addr:192.168.3.73  Bcast:192.168.3.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe72:9b56/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:9000  Metric:1
          RX packets:787028 errors:0 dropped:0 overruns:0 frame:0
          TX packets:788651 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:65753950 (62.7 MiB)  TX bytes:1194295932 (1.1 GiB)

eth0      Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:7200  Metric:1
          RX packets:535 errors:0 dropped:0 overruns:0 frame:0
          TX packets:261 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:66786 (65.2 KiB)  TX bytes:47749 (46.6 KiB)
          Interrupt:31 Base address:0x8000 

eth1      Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:7200  Metric:1
          RX packets:486 errors:0 dropped:0 overruns:0 frame:0
          TX packets:241 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:61730 (60.2 KiB)  TX bytes:47343 (46.2 KiB)
          Interrupt:18 Memory:fe8e0000-fe900000 

eth2      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:9000  Metric:1
          RX packets:360190 errors:0 dropped:0 overruns:0 frame:0
          TX packets:394844 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:28756400 (27.4 MiB)  TX bytes:598159146 (570.4 MiB)
          Interrupt:17 Memory:fe9e0000-fea00000 

eth3      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:9000  Metric:1
          RX packets:426838 errors:0 dropped:0 overruns:0 frame:0
          TX packets:393807 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:36997550 (35.2 MiB)  TX bytes:596136786 (568.5 MiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

Configuring High-Availability Networking

There are seven bonding modes, which you can read about in detail here. However, the RHCS stack only supports one of the modes, called Active/Passive Bonding, also known as mode=1.

Configuring Your Switches

This method provides for no performance gains, but instead treats the slaved interfaces as independent paths. One of which acts as the primary while the other sits dormant. On failure, the bond switches to the backup interface, promoting it to the primary interface. Which is normally primary, and under what conditions a restored link might return to the primary role, are configurable should you wish to do so.

Your managed switch will no doubt have one or more bonding, also known as trunking configuration options. Likewise, your switches may be stackable. It is strongly advised that you do *not* stack your switches. Being unstacked, it is obviously not possible to configure trunking. Should you decide to disregard this, be very sure to extensive test failure and recovery of both switches under real-world work loads.

Still on the topic of switches; Do not configure STP (spanning tree protocol) on any port connected to your cluster nodes! When a switch is added to the network, as is the case after restoring a lost switch, STP-enabled switches and port on those switches may block traffic for a period of time while STP renegotiates and reconfigures. This takes more than enough time to cause a cluster to partition. You may still enable and configure STP if you need to do so, simply ensure that you only do in on the appropriate ports.

Preparing The Bonding Driver

Before we modify the network, we will need to create the following file:

You will need to create a file called

With the switches unstacked and STP disabled, we can now configure your bonding interface.

vim /etc/modprobe.d/bonding.conf
alias bond0 bonding
alias bond1 bonding
alias bond2 bonding

If you only have four interfaces and plan to merge the SN and BCN networks, you can omit the bond2 entry.

You can then copy and paste the alias ... entries from the file above into the terminal to avoid the need to reboot.

Deciding Which NICs to Bond

If all of the interfaces in your server are identical, you can probably skip this step. Before you jump though, consider that not all of the PCIe interfaces may have all of their lanes connected, resulting in differing speeds. If you are unsure, I strongly recommend you run these tests.

TODO: Upload network_profiler.pl here and explain it's use.


Once you've determined the various capabilities on your interfaces, pair them off with their closest-performing partners.

Keep in mind:

  • Any interface piggy-backing on an IPMI interface *must* be part of the BCN bond!
  • The fasted interfaces should be paired for the SN bond.
  • The lowest latency interfaces should be used the BCN bond.
  • The lowest remaining two interfaces should be used in the IFN bond.

Creating the Bonds

Warning: This step will almost certainly leave you without a network access to your servers. It is *strongly* advised that you do the next steps when you have physical access to your servers. If that is simply not possible, then proceed with extreme caution.

In my case, I found that bonding the following optimal configuration:

  • eth0 and eth3 bonded as bond0.
  • eth1 and eth2 bonded as bond1.

I did not have enough interfaces for three bonds, so I will configure the following:

  • bond0 will be the IFN interface on the 192.168.1.0/24 subnet.
  • bond1 will be the merged BCN and SN interfaces on the 192.168.3.0/24 subnet.

TODO: Create/show the diffs for the following ifcfg-ethX files.

  • Create bond0 our of eth0 and eth3:
vim /etc/sysconfig/network-scripts/ifcfg-eth0
# Internet Facing Network - Link 1
HWADDR="BC:AE:C5:44:8A:DE"
DEVICE="eth0"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="no"
MASTER="bond0"
SLAVE="yes"
vim /etc/sysconfig/network-scripts/ifcfg-eth3
# Storage and Back Channel Networks - Link 2
HWADDR="00:0E:0C:59:46:E4"
DEVICE="eth3"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="no"
MASTER="bond0"
SLAVE="yes"
vim /etc/sysconfig/network-scripts/ifcfg-bond0
# Internet Facing Network - Bonded Interface
DEVICE="bond0"
BOOTPROTO="static"
NM_CONTROLLED="no"
ONBOOT="yes"
IPADDR="192.168.1.73"
NETMASK="255.255.255.0"
GATEWAY="192.168.1.254"
DNS1="192.139.81.117"
DNS2="192.139.81.1"
# Clustering *only* supports mode=1 (active-passive)
BONDING_OPTS="mode=1 miimon=100 use_carrier=0 updelay=0 downdelay=0"

GFS2

Try adding noatime to the /etc/fstab options. h/t to Dak1n1; "it avoids cluster reads from turning into unnecessary writes, and can improve performance".

 

Any questions, feedback, advice, complaints or meanderings are welcome.
Alteeve's Niche! Enterprise Support:
Alteeve Support
Community Support
© Alteeve's Niche! Inc. 1997-2024   Anvil! "Intelligent Availability®" Platform
legal stuff: All info is provided "As-Is". Do not use anything here unless you are willing and able to take responsibility for your own actions.