2x5 Scalable Cluster Tutorial
{{howto_header}}


{{warning|1=This document is old, abandoned and very out of date. DON'T USE ANYTHING HERE! Consider it only as historical note taking. The only up to date clustering tutorial is: [[Red Hat Cluster Service 2 Tutorial]].}}

= The Design =
== Storage ==
Storage, high-level:
<source lang="text">
[ Storage Cluster ]                                                     
    _____________________________            _____________________________
  | [ an-node01 ]              |          | [ an-node02 ]              |
  |  _____    _____            |          |            _____    _____  |
  | ( HDD )  ( SSD )            |          |            ( SSD )  ( HDD ) |
  | (_____)  (_____)  __________|          |__________  (_____)  (_____) |
  |    |        |    | Storage  =--\    /--=  Storage |    |        |    |
  |    |        \----| Network ||  |    |  || Network |----/        |    |
  |    \-------------|_________||  |    |  ||_________|-------------/    |
  |_____________________________|  |    |  |_____________________________|
                                  __|_____|__                             
                                |  HDD LUN  |                             
                                |  SSD LUN  |                             
                                |___________|                             
                                      |                                   
                                  _____|_____                           
                                | Floating  |                           
                                |  SAN IP  |                               
[ VM Cluster ]                  |___________|                               
  ______________________________  | | | | |  ______________________________
| [ an-node03 ]                |  | | | | |  |                [ an-node06 ] |
|  _________                  |  | | | | |  |                  _________  |
| | [ vmA ] |                  |  | | | | |  |                  | [ vmJ ] | |
| |  _____  |                  |  | | | | |  |                  |  _____  | |
| | (_hdd_)-=----\            |  | | | | |  |            /----=-(_hdd_) | |
| |_________|    |            |  | | | | |  |            |    |_________| |
|  _________    |            |  | | | | |  |            |    _________  |
| | [ vmB ] |    |            |  | | | | |  |            |    | [ vmK ] | |
| |  _____  |    |            |  | | | | |  |            |    |  _____  | |
| | (_hdd_)-=--\ |  __________|  | | | | |  |__________  | /--=-(_hdd_) | |
| |_________|  | \--| Storage  =--/ | | | \--=  Storage |--/ |  |_________| |
|  _________  \----| Network ||    | | |    || Network |----/  _________  |
| | [ vmC ] |  /----|_________||    | | |    ||_________|----\  | [ vmL ] | |
| |  _____  |  |              |    | | |    |              |  |  _____  | |
| | (_hdd_)-=--/              |    | | |    |              \--=-(_hdd_) | |
| |_________|                  |    | | |    |                  |_________| |
|______________________________|    | | |    |______________________________|           
  ______________________________    | | |    ______________________________
| [ an-node04 ]                |    | | |    |                [ an-node07 ] |
|  _________                  |    | | |    |                  _________  |
| | [ vmD ] |                  |    | | |    |                  | [ vmM ] | |
| |  _____  |                  |    | | |    |                  |  _____  | |
| | (_hdd_)-=----\            |    | | |    |            /----=-(_hdd_) | |
| |_________|    |            |    | | |    |            |    |_________| |
|  _________    |            |    | | |    |            |    _________  |
| | [ vmE ] |    |            |    | | |    |            |    | [ vmN ] | |
| |  _____  |    |            |    | | |    |            |    |  _____  | |
| | (_hdd_)-=--\ |  __________|    | | |    |__________  | /--=-(_hdd_) | |
| |_________|  | \--| Storage  =----/ | \----=  Storage |--/ |  |_________| |
|  _________  \----| Network ||      |      || Network |----/  _________  |
| | [ vmF ] |  /----|_________||      |      ||_________|----\  | [ vmO ] | |
| |  _____  |  |              |      |      |              |  |  _____  | |
| | (_hdd_)-=--+              |      |      |              \--=-(_hdd_) | |
| | (_ssd_)-=--/              |      |      |                  |_________| |
| |_________|                  |      |      |                              |
|______________________________|      |      |______________________________|           
  ______________________________      |                                     
| [ an-node05 ]                |      |                                     
|  _________                  |      |                                     
| | [ vmG ] |                  |      |                                     
| |  _____  |                  |      |                                     
| | (_hdd_)-=----\            |      |                                     
| |_________|    |            |      |                                     
|  _________    |            |      |                                     
| | [ vmH ] |    |            |      |                                     
| |  _____  |    |            |      |                                     
| | (_hdd_)-=--\ |            |      |                                     
| | (_ssd_)-=--+ |  __________|      |                                     
| |_________|  | \--| Storage  =------/                                     
|  _________  \----| Network ||                                           
| | [ vmI ] |  /----|_________||                                           
| |  _____  |  |              |                                           
| | (_hdd_)-=--/              |                                           
| |_________|                  |                                           
|______________________________|                                                       
</source>
== Long View ==
{{note|1=Yes, this is a big graphic, but this is also a big project. I am no artist though, and any help making this clearer is greatly appreciated!}}
[[Image:2x5_the-plan_01.png|center|thumb|800px|The planned network. This shows separate IPMI and full redundancy throughout the cluster. This is the way a production cluster should be built, but is not expected for dev/test clusters.]]
== Failure Mapping ==
VM cluster guest failure migration planning:
* Each node can host up to 5 VMs @ 2 GB/VM.
* This is an N-1 cluster: the five nodes normally run four VMs each (20 VMs total), leaving every node one spare slot to absorb a VM from a failed peer.
<source lang="text">
          |    All    | an-node03 | an-node04 | an-node05 | an-node06 | an-node07 |
          | on-line  |  down    |  down    |  down    |  down    |  down    |
----------+-----------+-----------+-----------+-----------+-----------+-----------+
an-node03 |  vm01    |    --    |  vm01    |  vm01    |  vm01    |  vm01    |
          |  vm02    |    --    |  vm02    |  vm02    |  vm02    |  vm02    |
          |  vm03    |    --    |  vm03    |  vm03    |  vm03    |  vm03    |
          |  vm04    |    --    |  vm04    |  vm04    |  vm04    |  vm04    |
          |    --    |    --    |  vm05    |  vm09    |  vm13    |  vm17    |
----------|-----------|-----------|-----------|-----------|-----------|-----------|
an-node04 |  vm05    |  vm05    |    --    |  vm05    |  vm05    |  vm05    |
          |  vm06    |  vm06    |    --    |  vm06    |  vm06    |  vm06    |
          |  vm07    |  vm07    |    --    |  vm07    |  vm07    |  vm07    |
          |  vm08    |  vm08    |    --    |  vm08    |  vm08    |  vm08    |
          |    --    |  vm01    |    --    |  vm10    |  vm14    |  vm18    |
----------|-----------|-----------|-----------|-----------|-----------|-----------|
an-node05 |  vm09    |  vm09    |  vm09    |    --    |  vm09    |  vm09    |
          |  vm10    |  vm10    |  vm10    |    --    |  vm10    |  vm10    |
          |  vm11    |  vm11    |  vm11    |    --    |  vm11    |  vm11    |
          |  vm12    |  vm12    |  vm12    |    --    |  vm12    |  vm12    |
          |    --    |  vm02    |  vm06    |    --    |  vm15    |  vm19    |
----------|-----------|-----------|-----------|-----------|-----------|-----------|
an-node06 |  vm13    |  vm13    |  vm13    |  vm13    |    --    |  vm13    |
          |  vm14    |  vm14    |  vm14    |  vm14    |    --    |  vm14    |
          |  vm15    |  vm15    |  vm15    |  vm15    |    --    |  vm15    |
          |  vm16    |  vm16    |  vm16    |  vm16    |    --    |  vm16    |
          |    --    |  vm03    |  vm07    |  vm11    |    --    |  vm20    |
----------|-----------|-----------|-----------|-----------|-----------|-----------|
an-node07 |  vm17    |  vm17    |  vm17    |  vm17    |  vm17    |    --    |
          |  vm18    |  vm18    |  vm18    |  vm18    |  vm18    |    --    |
          |  vm19    |  vm19    |  vm19    |  vm19    |  vm19    |    --    |
          |  vm20    |  vm20    |  vm20    |  vm20    |  vm20    |    --    |
          |    --    |  vm04    |  vm08    |  vm12    |  vm16    |    --    |
----------+-----------+-----------+-----------+-----------+-----------+-----------+
</source>
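The table above covers automatic recovery. For a ''controlled'' move (for example, draining a node before maintenance), rgmanager can live-migrate a VM service to another member of its failover domain. The command below is a hypothetical illustration only, using the VM and node names defined later in this document; it is not a step from the original notes.

<source lang="bash">
# Hypothetical example: live-migrate vm01 from an-node03 to its backup node.
clusvcadm -M vm:vm01 -m an-node04.alteeve.ca
</source>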
== Cluster Overview ==
{{note|1=This is not programmatically accurate!}}
This is meant to show, at a logical level, how the parts of a cluster work together. It is the first draft and is likely defective in terrible ways.


<source lang="text">
<source lang="text">
All nodes have IPs as follows:
[ Resource Managment ]                                                                                                         
  * eth0 == Internet Facing Network == 192.168.1.x
  ___________    ___________                                                                                               
  * eth1 == Storage Network        == 192.168.2.x
|          |  |          |                                                                                             
  * eth2 == Back Channel Network   == 192.168.3.x
| Service A |  | Service B |                                                                                             
  * Where 'x' = the node ID (ie: an-node01 -> x=1)
|___________|  |___________|                                                                                             
            |    |        |                                                                                                   
          __|_____|__    ___|_______________                                                                                   
        |          |  |                  |                                                                                   
        | RGManager | | Clustered Storage |================================================.                                  
        |___________|  |___________________|                                                |                                 
              |                  |                                                          |                                 
              |__________________|______________                                            |                               
                            |                    \                                          |                               
        _________      ____|____                |                                          |                               
        |        |    |        |                |                                          |                               
/------| Fencing |----| Locking |                |                                          |                               
|      |_________|    |_________|                |                                          |                               
_|___________|_____________|______________________|__________________________________________|_____
|          |            |                      |                                          |                                 
|    ______|_____    ____|___                  |                                          |                                 
|    |            |  |        |                  |                                          |                                 
|    | Membership |  | Quorum |                  |                                          |                                 
|    |____________|  |________|                  |                                          |                                 
|          |____________|                      |                                          |                                 
|                      __|__                    |                                          |                                 
|                    /    \                    |                                          |                                 
|                    { Totem }                  |                                          |                                 
|                    \_____/                    |                                          |                                 
|      __________________|_______________________|_______________ ______________            |                                   
|    |-----------|-----------|----------------|-----------------|--------------|          |                                   
|  ___|____    ___|____    ___|____        ___|____        _____|_____    _____|_____    __|___                               
| |        |  |        |  |        |      |        |      |          |  |          |  |      |                               
| | Node 1 |  | Node 2 |  | Node 3 |  ...  | Node N |      | Storage 1 |==| Storage 2 |==| DRBD |                               
| |________|  |________|  |________|      |________|      |___________|  |___________|  |______|                               
  \_____|___________|___________|________________|_________________|______________|                                               
                                                                                                                               
[ Cluster Communication ]                                                                                                       
</source>
 
== Network IPs ==


* If a node has an IPMI (or similar) interface piggy-backed on a network interface, it will be shared with eth2. If it has a dedicated interface, it will be connected to the BCN.
* All subnets are /24 (255.255.255.0).
* Storage nodes use 2x SATA drives ('sda' and 'sdb') plus 2x SSD drives ('sdc' and 'sdd').

<source lang="text">
SAN: 10.10.1.1

Node:
          | IFN         | SN         | BCN       | IPMI      |
----------+-------------+------------+-----------+-----------+
an-node01 | 10.255.0.1  | 10.10.0.1  | 10.20.0.1 | 10.20.1.1 |
an-node02 | 10.255.0.2  | 10.10.0.2  | 10.20.0.2 | 10.20.1.2 |
an-node03 | 10.255.0.3  | 10.10.0.3  | 10.20.0.3 | 10.20.1.3 |
an-node04 | 10.255.0.4  | 10.10.0.4  | 10.20.0.4 | 10.20.1.4 |
an-node05 | 10.255.0.5  | 10.10.0.5  | 10.20.0.5 | 10.20.1.5 |
an-node06 | 10.255.0.6  | 10.10.0.6  | 10.20.0.6 | 10.20.1.6 |
an-node07 | 10.255.0.7  | 10.10.0.7  | 10.20.0.7 | 10.20.1.7 |
----------+-------------+------------+-----------+-----------+

Aux Equipment:
          | BCN         |
----------+-------------+
pdu1      | 10.20.2.1   |
pdu2      | 10.20.2.2   |
switch1   | 10.20.2.3   |
switch2   | 10.20.2.4   |
ups1      | 10.20.2.5   |
ups2      | 10.20.2.6   |
----------+-------------+

VMs:
          | VMN         |
----------+-------------+
vm01      | 10.254.0.1  |
vm02      | 10.254.0.2  |
vm03      | 10.254.0.3  |
vm04      | 10.254.0.4  |
vm05      | 10.254.0.5  |
vm06      | 10.254.0.6  |
vm07      | 10.254.0.7  |
vm08      | 10.254.0.8  |
vm09      | 10.254.0.9  |
vm10      | 10.254.0.10 |
vm11      | 10.254.0.11 |
vm12      | 10.254.0.12 |
vm13      | 10.254.0.13 |
vm14      | 10.254.0.14 |
vm15      | 10.254.0.15 |
vm16      | 10.254.0.16 |
vm17      | 10.254.0.17 |
vm18      | 10.254.0.18 |
vm19      | 10.254.0.19 |
vm20      | 10.254.0.20 |
----------+-------------+
</source>

Logical map:

<source lang="text">
  ___________________________________________                    ___________________________________________
| [ an-node01 ]                      ______|                  |______                      [ an-node02 ] |
|  ______    _____    _______        | eth0 =------\    /------= eth0 |        _______    _____    ______  |
| [_sda1_]--[_md0_]--[_/boot_]      |_____||      |    |      ||_____|      [_/boot_]--[_md0_]--[_sda1_] |
| [_sdb1_]                                  |      |    |      |                                  [_sdb1_] |
|  ______    _____    ______          ______|      |    |      |______          ______    _____    ______  |
| [_sda2_]--[_md1_]--[_swap_]  /----| eth1 =----\ |    | /----= eth1 |----\  [_swap_]--[_md1_]--[_sda2_] |
| [_sdb2_]                      | /--|_____||    | |    | |    ||_____|--\ |                      [_sdb2_] |
|  ______    _____    ___      | |        |    | |    | |    |        | |      ___    _____    ______  |
| [_sda3_]--[_md2_]--[_/_]     | |   ______|    | |    | |    |______  | |      [_/_]--[_md2_]--[_sda3_] |
| [_sdb3_]                      | |  | eth2 =--\ | |    | | /--= eth2 |  | |                      [_sdb3_] |
|  ______    _____    _______   | |  |_____||  | | |    | | |  ||_____|  | |  _______    _____    ______  |
| [_sda5_]--[_md3_]--[_drbd0_]--/ |        |  | | |    | | |  |        | \--[_drbd0_]--[_md3_]--[_sda5_] |
| [_sdb5_]                        |        |  | | |    | | |  |        |                        [_sdb5_] |
|  ______    _____    _______    |        |  | | |    | | |  |        |    _______    _____    ______  |
| [_sdc1_]--[_md4_]--[_drbd1_]----/        |  | | |    | | |  |        \----[_drbd1_]--[_md4_]--[_sdc1_] |
| [_sdd1_]                                  |  | | |    | | |  |                                  [_sdd1_] |
|___________________________________________|  | | |    | | |  |___________________________________________|
                                                | | |    | | |
                        /---------------------------/    | | |
                        |                      | |      | | |
                        | /-------------------------------/ | |
                        | |                    | |        | \-----------------------\
                        | |                    | |        |                        |
                        | |                    \-----------------------------------\ |
                        | |                      |        |                      | |
                        | |   ____________________|_________|____________________  | |
                        | |  [ iqn.2011-08.com.alteeve:an-clusterA.target01.hdd  ]  | |
                        | |  [ iqn.2011-08.com.alteeve:an-clusterA.target02.sdd  ]  | |
  _________________    | |  [    drbd0 == hdd == vg01          Floating IP    ]  | |     ___________________
  [ Internet Facing ]  | |  [____drbd1_==_sdd_==_vg02__________192.168.2.100____]  | | /---[ Internal Managed  ]
  [_____Routers_____]  | |                            | |                        | | |   [  Private Network  ]
                  |    | \-----------\                | |                  /-----/ | |  [_and_fence_devices_]
                  |    \-----------\ |                | |                  | /-----/ |
                  \---------------\ | |                | |                  | | /-----/
                                _|_|_|___________    _|_|_____________    _|_|_|___________
[ Storage Cluster ]            [ Internet Facing ]  [ Storage Network ]  [  Back-Channel  ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~[_____Network_____]~~~[_________________]~~~[_____Network_____]~~~~~~~~~~~~~~~
[    VM Cluster  ]              | | | | |            | | | | |            | | | | | 
                                  | | | | |            | | | | |            | | | | |
  __________________________     | | | | |            | | | | |            | | | | |
| [ an-node03 ]      ______|    | | | | |            | | | | |            | | | | |
|                  | eth0 =-----/ | | | |            | | | | |            | | | | |
|                  |_____||      | | | |            | | | | |            | | | | |
|                          |      | | | |            | | | | |            | | | | |
|                    ______|      | | | |            | | | | |            | | | | |
|                  | eth1 =---------------------------/ | | | |            | | | | |
|                  |_____||      | | | |              | | | |            | | | | |
|                          |      | | | |              | | | |            | | | | |
|                    ______|      | | | |              | | | |            | | | | |
|                  | eth2 =-------------------------------------------------/ | | | |
|                  |_____||      | | | |              | | | |              | | | |
|__________________________|      | | | |              | | | |              | | | |
                                    | | | |              | | | |              | | | |
   __________________________        | | | |              | | | |              | | | |
| [ an-node04 ]      ______|      | | | |              | | | |              | | | |
|                  | eth0 =-------/ | | |              | | | |              | | | |
|                  |_____||        | | |              | | | |              | | | |
|                          |        | | |              | | | |              | | | |
|                    ______|        | | |              | | | |              | | | |
|                  | eth1 =-----------------------------/ | | |              | | | |
|                  |_____||        | | |                | | |              | | | |
|                          |        | | |                | | |              | | | |
|                    ______|        | | |                | | |              | | | |
|                  | eth2 =---------------------------------------------------/ | | |
|                   |_____||        | | |                | | |                | | |
  |__________________________|        | | |                | | |                | | |
                                      | | |                | | |                | | |
  __________________________          | | |                | | |                | | |
| [ an-node05 ]     ______|         | | |                | | |                | | |
  |                   | eth0 =---------/ | |                | | |                | | |
|                   |_____||          | |                | | |                | | |
  |                         |          | |                | | |                | | |
|                   ______|          | |                | | |                | | |
  |                   | eth1 =-------------------------------/ | |                | | |
|                   |_____||          | |                  | |                | | |
  |                         |          | |                  | |                | | |
|                   ______|          | |                  | |                | | |
  |                   | eth2 =-----------------------------------------------------/ | |
|                   |_____||          | |                  | |                  | |
  |__________________________|          | |                  | |                  | |
                                        | |                  | |                  | |
  __________________________            | |                  | |                  | |
| [ an-node06 ]     ______|           | |                  | |                  | |
  |                   | eth0 =-----------/ |                  | |                  | |
|                   |_____||            |                  | |                  | |
  |                         |            |                  | |                  | |
|                    ______|            |                  | |                  | |
|                  | eth1 =---------------------------------/ |                  | |
|                  |_____||            |                    |                  | |
|                          |            |                    |                  | |
|                    ______|            |                    |                  | |
|                   | eth2 =-------------------------------------------------------/ |
|                   |_____||            |                    |                    |
|__________________________|            |                    |                    |
                                          |                     |                    |
  __________________________              |                     |                    |
| [ an-node07 ]     ______|             |                    |                    |
|                  | eth0 =-------------/                    |                    |
|                  |_____||                                  |                    |
|                          |                                  |                    |
|                    ______|                                  |                    |
|                  | eth1 =-----------------------------------/                    |
|                  |_____||                                                        |
|                          |                                                        |
|                    ______|                                                        |
|                  | eth2 =---------------------------------------------------------/
|                  |_____||
|__________________________|
</source>
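To make the addressing above concrete, here is a rough sketch of what an interface configuration file could look like for <span class="code">an-node01</span>'s IFN link. The device name (<span class="code">eth0</span>) and the use of EL6-style <span class="code">ifcfg</span> files are assumptions for illustration, not part of the original notes.

<source lang="bash">
# /etc/sysconfig/network-scripts/ifcfg-eth0 (hypothetical; an-node01's Internet-Facing Network address)
DEVICE=eth0
BOOTPROTO=static
ONBOOT=yes
IPADDR=10.255.0.1
NETMASK=255.255.255.0
</source>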




<source lang="bash">
<source lang="bash">
yum install cman fence-agents rgmanager resource-agents lvm2-cluster gfs2-utils python-virtinst libvirt qemu-kvm-tools qemu-kvm virt-manager virt-viewer
yum install cman fence-agents rgmanager resource-agents lvm2-cluster gfs2-utils python-virtinst libvirt qemu-kvm-tools qemu-kvm virt-manager virt-viewer virtio-win
</source>
</source>
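The original notes don't say how these daemons should behave at boot. As an assumption only, it is common to keep them from auto-starting until the cluster is actually configured, along these lines:

<source lang="bash">
# Hypothetical; keep the cluster daemons from starting before cluster.conf exists.
chkconfig cman off
chkconfig rgmanager off
chkconfig clvmd off
chkconfig gfs2 off
</source>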


<source lang="xml">
<source lang="xml">
<?xml version="1.0"?>
<?xml version="1.0"?>
<cluster name="an-clusterA" config_version="1">
<cluster config_version="1" name="an-cluster">
        <cman two_node="1" expected_votes="1" />
<cman expected_votes="1" two_node="1" />
        <totem secauth="off" rrp_mode="none" />
<clusternodes>
        <clusternodes>
<clusternode name="an-node01.alteeve.ca" nodeid="1">
                <clusternode name="an-node01.alteeve.com" nodeid="1">
<fence>
                        <fence>
<method name="ipmi">
                                <method name="PDU">
<device action="reboot" name="ipmi_an01" />
                                        <device name="pdu2" action="reboot" port="1" />
</method>
                                </method>
<method name="pdu">
                        </fence>
<device action="reboot" name="pdu1" port="1" />
                </clusternode>
<device action="reboot" name="pdu2" port="1" />
                <clusternode name="an-node02.alteeve.com" nodeid="2">
</method>
                        <fence>
</fence>
                                <method name="PDU">
</clusternode>
                                        <device name="pdu2" action="reboot" port="2" />
<clusternode name="an-node02.alteeve.ca" nodeid="2">
                                </method>
<fence>
                        </fence>
<method name="ipmi">
                </clusternode>
<device action="reboot" name="ipmi_an02" />
        </clusternodes>
</method>
        <fencedevices>
<method name="pdu">
                <fencedevice name="pdu2" agent="fence_apc" ipaddr="192.168.1.6" login="apc" passwd="secret" />
<device action="reboot" name="pdu1" port="2" />
        </fencedevices>
<device action="reboot" name="pdu2" port="2" />
        <rm>
</method>
                <resources>
</fence>
                        <ip address="192.168.2.100" monitor_link="on"/>
</clusternode>
                </resources>
<clusternode name="an-node03.alteeve.ca" nodeid="3">
                <failoverdomains>
<fence>
                        <failoverdomain name="an1_primary" nofailback="0" ordered="0" restricted="1">
<method name="ipmi">
                                <failoverdomainnode name="an-node01.alteeve.com" priority="1"/>
<device action="reboot" name="ipmi_an03" />
                                <failoverdomainnode name="an-node02.alteeve.com" priority="1"/>
</method>
                        </failoverdomain>
<method name="pdu">
                </failoverdomains>
<device action="reboot" name="pdu1" port="3" />
                <service autostart="1" name="san_ip1" domain="an1_primary">
<device action="reboot" name="pdu2" port="3" />
                        <ip ref="192.168.2.100"/>
</method>
                </service>
</fence>
        </rm>
</clusternode>
</cluster>
<clusternode name="an-node04.alteeve.ca" nodeid="4">
</source>
<fence>
 
<method name="ipmi">
Save the file, then validate it. If it fails, address the errors and try again.
<device action="reboot" name="ipmi_an04" />
 
</method>
<source lang="bash">
<method name="pdu">
ip addr list | grep <ip>
<device action="reboot" name="pdu1" port="4" />
rg_test test /etc/cluster/cluster.conf
<device action="reboot" name="pdu2" port="4" />
ccs_config_validate
</method>
</source>
</fence>
<source lang="text">
</clusternode>
Configuration validates
<clusternode name="an-node05.alteeve.ca" nodeid="5">
</source>
<fence>
 
<method name="ipmi">
Push it to the other node:
<device action="reboot" name="ipmi_an05" />
 
</method>
<source lang="bash">
<method name="pdu">
rsync -av /etc/cluster/cluster.conf root@an-node02:/etc/cluster/
<device action="reboot" name="pdu1" port="5" />
</source>
<device action="reboot" name="pdu2" port="5" />
<source lang="text">
</method>
sending incremental file list
</fence>
cluster.conf
</clusternode>
 
<clusternode name="an-node06.alteeve.ca" nodeid="6">
sent 781 bytes  received 31 bytes  541.33 bytes/sec
<fence>
total size is 701  speedup is 0.86
<method name="ipmi">
</source>
<device action="reboot" name="ipmi_an06" />
 
</method>
Start:
<method name="pdu">
 
<device action="reboot" name="pdu1" port="6" />
 
<device action="reboot" name="pdu2" port="6" />
 
</method>
'''''DO NOT PROCEED UNTIL YOUR cluster.conf FILE VALIDATES!'''''
</fence>
 
</clusternode>
Unless you have it perfect, your cluster will fail.
<clusternode name="an-node07.alteeve.ca" nodeid="7">
 
<fence>
Once it validates, proceed.
<method name="ipmi">
 
<device action="reboot" name="ipmi_an07" />
== Starting The Cluster For The First Time ==
</method>
 
<method name="pdu">
By default, if you start one node only and you've enabled the <span class="code"><cman two_node="1" expected_votes="1"/></span> option as we have done, the lone server will effectively gain quorum. It will try to connect to the cluster, but there won't be a cluster to connect to, so it will fence the other node after a timeout period. This timeout is <span class="code">6</span> seconds by default.
<device action="reboot" name="pdu1" port="7" />
 
<device action="reboot" name="pdu2" port="7" />
For now, we will leave the default as it is. If you're interested in changing it though, the argument you are looking for is <span class="code">[[RHCS v3 cluster.conf#post_join_delay|post_join_delay]]</span>.
</method>
 
</fence>
This behaviour means that we'll want to start both nodes well within six seconds of one another, least the slower one get needlessly fenced.
</clusternode>
 
</clusternodes>
'''Left off here'''
<fencedevices>
 
<fencedevice agent="fence_ipmilan" ipaddr="an-node01.ipmi" login="root" name="ipmi_an01" passwd="secret" />
Note to help minimize dual-fences:
<fencedevice agent="fence_ipmilan" ipaddr="an-node02.ipmi" login="root" name="ipmi_an02" passwd="secret" />
* <span class="code">you could add FENCED_OPTS="-f 5" to /etc/sysconfig/cman on *one* node</span> (ilo fence devices may need this)
<fencedevice agent="fence_ipmilan" ipaddr="an-node03.ipmi" login="root" name="ipmi_an03" passwd="secret" />
 
<fencedevice agent="fence_ipmilan" ipaddr="an-node04.ipmi" login="root" name="ipmi_an04" passwd="secret" />
== DRBD Config ==
<fencedevice agent="fence_ipmilan" ipaddr="an-node05.ipmi" login="root" name="ipmi_an05" passwd="secret" />
 
<fencedevice agent="fence_ipmilan" ipaddr="an-node06.ipmi" login="root" name="ipmi_an06" passwd="secret" />
Install from source:
<fencedevice agent="fence_ipmilan" ipaddr="an-node07.ipmi" login="root" name="ipmi_an07" passwd="secret" />
 
<fencedevice agent="fence_apc_snmp" ipaddr="pdu1.alteeve.ca" name="pdu1" />
'''Both''':
<fencedevice agent="fence_apc_snmp" ipaddr="pdu2.alteeve.ca" name="pdu2" />
 
</fencedevices>
<source lang="bash">
<fence_daemon post_join_delay="30" />
# Obliterate peer - fence via cman
<totem rrp_mode="none" secauth="off" />
wget -c https://alteeve.com/files/an-cluster/sbin/obliterate-peer.sh -O /sbin/obliterate-peer.sh
<rm>
chmod a+x /sbin/obliterate-peer.sh
<resources>
ls -lah /sbin/obliterate-peer.sh
<ip address="10.10.1.1" monitor_link="on" />
 
<script file="/etc/init.d/tgtd" name="tgtd" />
# Download, compile and install DRBD
<script file="/etc/init.d/drbd" name="drbd" />
wget -c http://oss.linbit.com/drbd/8.3/drbd-8.3.11.tar.gz
<script file="/etc/init.d/clvmd" name="clvmd" />
tar -xvzf drbd-8.3.11.tar.gz
<script file="/etc/init.d/gfs2" name="gfs2" />
cd drbd-8.3.11
<script file="/etc/init.d/libvirtd" name="libvirtd" />
./configure \
</resources>
  --prefix=/usr \
<failoverdomains>
  --localstatedir=/var \
<!-- Used for storage -->
  --sysconfdir=/etc \
<!-- SAN Nodes -->
  --with-utils \
<failoverdomain name="an1_only" nofailback="1" ordered="0" restricted="1">
  --with-km \
<failoverdomainnode name="an-node01.alteeve.ca" />
  --with-udev \
</failoverdomain>
  --with-pacemaker \
<failoverdomain name="an2_only" nofailback="1" ordered="0" restricted="1">
  --with-rgmanager \
<failoverdomainnode name="an-node02.alteeve.ca" />
  --with-bashcompletion
</failoverdomain>
make
make install
<!-- VM Nodes -->
</source>
<failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
 
<failoverdomainnode name="an-node03.alteeve.ca" />
== Configure ==
</failoverdomain>
 
<failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
'''<span class="code">an-node01</span>''':
<failoverdomainnode name="an-node04.alteeve.ca" />
</failoverdomain>
<failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node05.alteeve.ca" />
</failoverdomain>
<failoverdomain name="an6_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node06.alteeve.ca" />
</failoverdomain>
<failoverdomain name="an7_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node07.alteeve.ca" />
</failoverdomain>
<!-- Domain for the SAN -->
<failoverdomain name="an1_primary" nofailback="1" ordered="1" restricted="0">
<failoverdomainnode name="an-node01.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node02.alteeve.ca" priority="2" />
</failoverdomain>
<!-- Domains for VMs running primarily on an-node03 -->
<failoverdomain name="an3_an4" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node04.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an3_an5" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node05.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an3_an6" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node06.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an3_an7" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node07.alteeve.ca" priority="2" />
</failoverdomain>
<!-- Domains for VMs running primarily on an-node04 -->
<failoverdomain name="an4_an3" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node04.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node03.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an4_an5" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node04.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node05.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an4_an6" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node04.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node06.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an4_an7" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node04.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node07.alteeve.ca" priority="2" />
</failoverdomain>
<!-- Domains for VMs running primarily on an-node05 -->
<failoverdomain name="an5_an3" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node05.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node03.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an5_an4" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node05.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node04.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an5_an6" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node05.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node06.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an5_an7" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node05.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node07.alteeve.ca" priority="2" />
</failoverdomain>
<!-- Domains for VMs running primarily on an-node06 -->
<failoverdomain name="an6_an3" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node06.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node03.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an6_an4" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node06.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node04.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an6_an5" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node06.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node05.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an6_an7" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node06.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node07.alteeve.ca" priority="2" />
</failoverdomain>
<!-- Domains for VMs running primarily on an-node07 -->
<failoverdomain name="an7_an3" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node07.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node03.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an7_an4" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node07.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node04.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an7_an5" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node07.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node05.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an7_an6" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node07.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node06.alteeve.ca" priority="2" />
</failoverdomain>
</failoverdomains>
<!-- SAN Services -->
<service autostart="1" domain="an1_only" exclusive="0" max_restarts="0" name="an1_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="tgtd" />
</script>
</script>
</service>
<service autostart="1" domain="an2_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="tgtd" />
</script>
</script>
</service>
<service autostart="1" domain="an1_primary" name="san_ip" recovery="relocate">
<ip ref="10.10.1.1" />
</service>
<!-- VM Storage services. -->
<service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an3_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd" />
</script>
</script>
</script>
</service>
<service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd" />
</script>
</script>
</script>
</service>
<service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd" />
</script>
</script>
</script>
</service>
<service autostart="1" domain="an6_only" exclusive="0" max_restarts="0" name="an6_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd" />
</script>
</script>
</script>
</service>
<service autostart="1" domain="an7_only" exclusive="0" max_restarts="0" name="an7_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd" />
</script>
</script>
</script>
</service>
<!-- VM Services -->
<!-- VMs running primarily on an-node03 -->
<vm name="vm01" domain="an03_an04" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm02" domain="an03_an05" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm03" domain="an03_an06" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm04" domain="an03_an07" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<!-- VMs running primarily on an-node04 -->
<vm name="vm05" domain="an04_an03" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm06" domain="an04_an05" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm07" domain="an04_an06" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm08" domain="an04_an07" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<!-- VMs running primarily on an-node05 -->
<vm name="vm09" domain="an05_an03" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm10" domain="an05_an04" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm11" domain="an05_an06" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm12" domain="an05_an07" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<!-- VMs running primarily on an-node06 -->
<vm name="vm13" domain="an06_an03" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm14" domain="an06_an04" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm15" domain="an06_an05" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm16" domain="an06_an07" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<!-- VMs running primarily on an-node07 -->
<vm name="vm17" domain="an07_an03" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm18" domain="an07_an04" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm19" domain="an07_an05" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm20" domain="an07_an06" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
</rm>
</cluster>
</source>
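One note on the <span class="code">&lt;vm&gt;</span> entries above: the <span class="code">path="/shared/definitions/"</span> attribute is the directory rgmanager will search for each guest's libvirt definition file. How those files are created is not covered in these notes; as a purely hypothetical example, a definition could be exported from an already-defined guest like this:

<source lang="bash">
# Hypothetical; 'vm01' must already be defined in libvirt on this node.
virsh dumpxml vm01 > /shared/definitions/vm01.xml
</source>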
 
Save the file, then validate it. If it fails, address the errors and try again.


<source lang="bash">
<source lang="bash">
# Configure DRBD's global value.
ip addr list | grep <ip>
cp /etc/drbd.d/global_common.conf /etc/drbd.d/global_common.conf.orig
rg_test test /etc/cluster/cluster.conf
vim /etc/drbd.d/global_common.conf
ccs_config_validate
diff -u /etc/drbd.d/global_common.conf
</source>
<source lang="text">
Configuration validates
</source>
</source>
<source lang="diff">
 
--- /etc/drbd.d/global_common.conf.orig 2011-08-01 21:58:46.000000000 -0400
Push it to the other node:
+++ /etc/drbd.d/global_common.conf 2011-08-01 23:18:27.000000000 -0400
 
@@ -15,24 +15,35 @@
<source lang="bash">
# out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
rsync -av /etc/cluster/cluster.conf root@an-node02:/etc/cluster/
# before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
# after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
+ fence-peer "/sbin/obliterate-peer.sh";
}
startup {
# wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
+ become-primary-on both;
+ wfc-timeout 300;
+ degr-wfc-timeout 120;
}
disk {
# on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
# no-disk-drain no-md-flushes max-bio-bvecs
+ fencing resource-and-stonith;
}
net {
# sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
# max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
# after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
+ allow-two-primaries;
+ after-sb-0pri discard-zero-changes;
+ after-sb-1pri discard-secondary;
+ after-sb-2pri disconnect;
}
syncer {
# rate after al-extents use-rle cpu-mask verify-alg csums-alg
+ # This should be no more than 30% of the maximum sustainable write speed.
+ rate 20M;
}
}
</source>
</source>
<source lang="bash">
<source lang="text">
vim /etc/drbd.d/r0.res
sending incremental file list
</source>
cluster.conf
<source lang="text">
 
resource r0 {
sent 781 bytes  received 31 bytes  541.33 bytes/sec
        device          /dev/drbd0;
total size is 701  speedup is 0.86
        meta-disk      internal;
        on an-node01.alteeve.com {
                address        192.168.2.71:7789;
                disk            /dev/sda5;
        }
        on an-node02.alteeve.com {
                address        192.168.2.72:7789;
                disk            /dev/sda5;
        }
}
</source>
<source lang="bash">
cp /etc/drbd.d/r0.res /etc/drbd.d/r1.res
vim /etc/drbd.d/r1.res
</source>
<source lang="text">
resource r1 {
        device          /dev/drbd1;
        meta-disk      internal;
        on an-node01.alteeve.com {
                address        192.168.2.71:7790;
                disk            /dev/sdb1;
        }
        on an-node02.alteeve.com {
                address        192.168.2.72:7790;
                disk            /dev/sdb1;
        }
}
</source>
</source>


{{note|1=If you have multiple DRBD resources on one (set of) backing disks, consider adding <span class="code">syncer { after <minor-1>; }</span>. For example, tell <span class="code">/dev/drbd1</span> to wait for <span class="code">/dev/drbd0</span> by adding <span class="code">syncer { after 0; }</span>. This will prevent simultaneous resyncs, which could seriously impact performance. A resource will wait until the resource it depends on has completed syncing.}}
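A minimal sketch of what that could look like, using the note's minor-number form and the <span class="code">r1</span> resource defined later in this document:

<source lang="text">
resource r1 {
        # Do not start resyncing until /dev/drbd0 (minor 0) has finished.
        syncer {
                after 0;
        }
        # ... rest of the resource definition as shown below ...
}
</source>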
Start:
 
 
 
'''''DO NOT PROCEED UNTIL YOUR cluster.conf FILE VALIDATES!'''''
 
Unless you have it perfect, your cluster will fail.
 
Once it validates, proceed.
 
== Starting The Cluster For The First Time ==
 
By default, if you start one node only and you've enabled the <span class="code"><cman two_node="1" expected_votes="1"/></span> option as we have done, the lone server will effectively gain quorum. It will try to connect to the cluster, but there won't be a cluster to connect to, so it will fence the other node after a timeout period. This timeout is <span class="code">6</span> seconds by default.
 
For now, we will leave the default as it is. If you're interested in changing it though, the argument you are looking for is <span class="code">[[RHCS v3 cluster.conf#post_join_delay|post_join_delay]]</span>.
 
This behaviour means that we'll want to start both nodes well within six seconds of one another, lest the slower one get needlessly fenced.
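For reference, the delay is set with the <span class="code">fence_daemon</span> element in <span class="code">cluster.conf</span>; the example configuration above already carries one:

<source lang="xml">
<!-- Example only; a higher post_join_delay gives peers longer to join before fencing starts. -->
<fence_daemon post_join_delay="30" />
</source>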


'''Left off here'''

Note to help minimize dual-fences:
* You could add <span class="code">FENCED_OPTS="-f 5"</span> to <span class="code">/etc/sysconfig/cman</span> on ''one'' node (iLO fence devices may need this).
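A minimal sketch of applying that option (one node only, assuming the stock <span class="code">/etc/sysconfig/cman</span> file):

<source lang="bash">
# On ONE node only; staggers fencing so both nodes are less likely to fence each other at once.
echo 'FENCED_OPTS="-f 5"' >> /etc/sysconfig/cman
</source>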
== DRBD Config ==

Install from source:

'''Both''':

<source lang="bash">
# Obliterate peer - fence via cman
wget -c https://alteeve.ca/files/an-cluster/sbin/obliterate-peer.sh -O /sbin/obliterate-peer.sh
chmod a+x /sbin/obliterate-peer.sh
ls -lah /sbin/obliterate-peer.sh

# Download, compile and install DRBD
wget -c http://oss.linbit.com/drbd/8.3/drbd-8.3.11.tar.gz
tar -xvzf drbd-8.3.11.tar.gz
cd drbd-8.3.11
./configure \
  --prefix=/usr \
  --localstatedir=/var \
  --sysconfdir=/etc \
  --with-utils \
  --with-km \
  --with-udev \
  --with-pacemaker \
  --with-rgmanager \
  --with-bashcompletion
make
make install
chkconfig --add drbd
chkconfig drbd off
</source>
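Before moving on to configuration, it may be worth confirming that the freshly built module loads and that the tools can see it. These two checks are an assumption on my part, not a step from the original notes:

<source lang="bash">
# Load the module built above and show the running DRBD version.
modprobe drbd
cat /proc/drbd
</source>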


== Configure ==

'''<span class="code">an-node01</span>''':

<source lang="bash">
# Configure DRBD's global value.
cp /etc/drbd.d/global_common.conf /etc/drbd.d/global_common.conf.orig
vim /etc/drbd.d/global_common.conf
diff -u /etc/drbd.d/global_common.conf.orig /etc/drbd.d/global_common.conf
</source>
<source lang="diff">
--- /etc/drbd.d/global_common.conf.orig 2011-08-01 21:58:46.000000000 -0400
+++ /etc/drbd.d/global_common.conf 2011-08-01 23:18:27.000000000 -0400
@@ -15,24 +15,35 @@
# out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
# before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
# after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
+ fence-peer "/sbin/obliterate-peer.sh";
}
startup {
# wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
+ become-primary-on both;
+ wfc-timeout 300;
+ degr-wfc-timeout 120;
}
disk {
# on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
# no-disk-drain no-md-flushes max-bio-bvecs
+ fencing resource-and-stonith;
}
net {
# sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
# max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
# after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
+ allow-two-primaries;
+ after-sb-0pri discard-zero-changes;
+ after-sb-1pri discard-secondary;
+ after-sb-2pri disconnect;
}
syncer {
# rate after al-extents use-rle cpu-mask verify-alg csums-alg
+ # This should be no more than 30% of the maximum sustainable write speed.
+ rate 20M;
}
}
</source>
</source>
<source lang="text">
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success
</source>
Attach, connect and confirm (after both have attached and connected):
<source lang="bash">
<source lang="bash">
drbdadm attach r{0,1}
vim /etc/drbd.d/r0.res
drbdadm connect r{0,1}
cat /proc/drbd
</source>
</source>
<source lang="text">
<source lang="text">
version: 8.3.11 (api:88/proto:86-96)
resource r0 {
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.com, 2011-08-01 22:04:32
        device          /dev/drbd0;
0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
        meta-disk      internal;
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:441969960
        on an-node01.alteeve.ca {
1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
                address        192.168.2.71:7789;
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:29309628
                disk            /dev/sda5;
        }
        on an-node02.alteeve.ca {
                address        192.168.2.72:7789;
                disk            /dev/sda5;
        }
}
</source>
</source>
There is no data, so force both devices to be instantly UpToDate:
<source lang="bash">
<source lang="bash">
drbdadm -- --clear-bitmap new-current-uuid r{0,1}
cp /etc/drbd.d/r0.res /etc/drbd.d/r1.res
cat /proc/drbd  
vim /etc/drbd.d/r1.res
</source>
</source>
<source lang="text">
<source lang="text">
version: 8.3.11 (api:88/proto:86-96)
resource r1 {
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.com, 2011-08-01 22:04:32
        device          /dev/drbd1;
0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
        meta-disk      internal;
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
        on an-node01.alteeve.ca {
1: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
                address        192.168.2.71:7790;
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
                disk            /dev/sdb1;
</source>
        }
        on an-node02.alteeve.ca {
                address        192.168.2.72:7790;
                disk            /dev/sdb1;
        }
}
</source>
 
{{note|1=If you have multiple DRBD resources on on (set of) backing disks, consider adding <span class="code">syncer { after <minor-1>; }</span>. For example, tell <span class="code">/dev/drbd1</span> to wait for <span class="code">/dev/drbd0</span> by adding <span class="code">syncer { after 0; }</span>. This will prevent simultaneous resync's which could seriously impact performance. Resources will wait in <span class="code"></span> state until the defined resource has completed sync'ing.}}
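
As an illustration only (this tutorial does not need it, since <span class="code">r0</span> and <span class="code">r1</span> sit on different physical disks), <span class="code">r1.res</span> with such a dependency might look like:

<source lang="text">
resource r1 {
        device          /dev/drbd1;
        meta-disk       internal;
        syncer {
                # Do not start resyncing until minor 0 (/dev/drbd0) has finished.
                after 0;
        }
        on an-node01.alteeve.ca {
                address         192.168.2.71:7790;
                disk            /dev/sdb1;
        }
        on an-node02.alteeve.ca {
                address         192.168.2.72:7790;
                disk            /dev/sdb1;
        }
}
</source>
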


Validate:

<source lang="bash">
drbdadm dump
</source>
<source lang="text">
  --==  Thank you for participating in the global usage survey  ==--
The server's response is:

you are the 369th user to install this version

# /usr/etc/drbd.conf
common {
    protocol               C;
    net {
        allow-two-primaries;
        after-sb-0pri    discard-zero-changes;
        after-sb-1pri    discard-secondary;
        after-sb-2pri    disconnect;
    }
    disk {
        fencing          resource-and-stonith;
    }
    syncer {
        rate             20M;
    }
    startup {
        wfc-timeout      300;
        degr-wfc-timeout 120;
        become-primary-on both;
    }
    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error   "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        fence-peer       /sbin/obliterate-peer.sh;
    }
}

# resource r0 on an-node01.alteeve.ca: not ignored, not stacked
resource r0 {
    on an-node01.alteeve.ca {
        device           /dev/drbd0 minor 0;
        disk             /dev/sda5;
        address          ipv4 192.168.2.71:7789;
        meta-disk        internal;
    }
    on an-node02.alteeve.ca {
        device           /dev/drbd0 minor 0;
        disk             /dev/sda5;
        address          ipv4 192.168.2.72:7789;
        meta-disk        internal;
    }
}

# resource r1 on an-node01.alteeve.ca: not ignored, not stacked
resource r1 {
    on an-node01.alteeve.ca {
        device           /dev/drbd1 minor 1;
        disk             /dev/sdb1;
        address          ipv4 192.168.2.71:7790;
        meta-disk        internal;
    }
    on an-node02.alteeve.ca {
        device           /dev/drbd1 minor 1;
        disk             /dev/sdb1;
        address          ipv4 192.168.2.72:7790;
        meta-disk        internal;
    }
}
</source>

Copy the configuration to the second node:

<source lang="bash">
rsync -av /etc/drbd.d root@an-node02:/etc/
</source>
<source lang="text">
drbd.d/
drbd.d/global_common.conf
drbd.d/global_common.conf.orig
drbd.d/r0.res
drbd.d/r1.res

sent 3523 bytes  received 110 bytes  7266.00 bytes/sec
total size is 3926  speedup is 1.08
</source>
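
Since the configuration must be identical on both nodes, a quick sanity check after the <span class="code">rsync</span> is to make sure the copied files also parse cleanly on <span class="code">an-node02</span> (assuming root ssh access is already set up, as it was for the <span class="code">rsync</span> itself):

<source lang="bash">
# Run from an-node01; prints "OK" if an-node02 can parse the copied config.
ssh root@an-node02 "drbdadm dump >/dev/null && echo OK"
</source>
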
== Initialize and First start ==

'''Both''':

Create the meta-data.

<source lang="bash">
modprobe drbd
drbdadm create-md r{0,1}
</source>
<source lang="text">
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success
</source>

Attach, connect and confirm (after both have attached and connected):

<source lang="bash">
drbdadm attach r{0,1}
drbdadm connect r{0,1}
cat /proc/drbd
</source>
<source lang="text">
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:441969960
 1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:29309628
</source>

There is no data, so force both devices to be instantly UpToDate:

<source lang="bash">
drbdadm -- --clear-bitmap new-current-uuid r{0,1}
cat /proc/drbd
</source>
<source lang="text">
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
 0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
</source>

Set both to primary and run a final check.

<source lang="bash">
drbdadm primary r{0,1}
cat /proc/drbd
</source>
<source lang="text">
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:672 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:672 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
</source>
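
The <span class="code">--clear-bitmap</span> trick above is only safe because both backing devices are empty. If one node already held data, the other node would instead need a full initial sync; with DRBD 8.3 that is done by promoting the node holding the good data, along the lines of:

<source lang="bash">
# Hypothetical recovery/initial-sync case only -- not needed in this tutorial.
# Run on the node whose data should be kept; the peer's copy will be overwritten.
drbdadm -- --overwrite-data-of-peer primary r{0,1}
cat /proc/drbd
</source>
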
== Update the cluster ==

<source lang="bash">
vim /etc/cluster/cluster.conf
</source>
<source lang="xml">
<?xml version="1.0"?>
<cluster config_version="17" name="an-clusterA">
        <cman expected_votes="1" two_node="1"/>
        <totem rrp_mode="none" secauth="off"/>
        <clusternodes>
                <clusternode name="an-node01.alteeve.ca" nodeid="1">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node02.alteeve.ca" nodeid="2">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="2"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
        </fencedevices>
        <fence_daemon post_join_delay="30"/>
        <rm>
                <resources>
                        <ip address="192.168.2.100" monitor_link="on"/>
                        <script file="/etc/init.d/drbd" name="drbd"/>
                        <script file="/etc/init.d/clvmd" name="clvmd"/>
                        <script file="/etc/init.d/tgtd" name="tgtd"/>
                </resources>
                <failoverdomains>
                        <failoverdomain name="an1_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node01.alteeve.ca"/>
                        </failoverdomain>
                        <failoverdomain name="an2_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node02.alteeve.ca"/>
                        </failoverdomain>
                        <failoverdomain name="an1_primary" nofailback="1" ordered="1" restricted="0">
                                <failoverdomainnode name="an-node01.alteeve.ca" priority="1"/>
                                <failoverdomainnode name="an-node02.alteeve.ca" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <service autostart="0" domain="an1_only" exclusive="0" max_restarts="0" name="an1_storage" recovery="restart" restart_expire_time="0">
                        <script ref="drbd">
                                <script ref="clvmd"/>
                        </script>
                </service>
                <service autostart="0" domain="an2_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart" restart_expire_time="0">
                        <script ref="drbd">
                                <script ref="clvmd"/>
                        </script>
                </service>
        </rm>
</cluster>
</source>

Test the configuration before pushing it out:

<source lang="bash">
rg_test test /etc/cluster/cluster.conf
</source>
<source lang="text">
Running in test mode.
Loading resource rule from /usr/share/cluster/oralistener.sh
Loading resource rule from /usr/share/cluster/apache.sh
Loading resource rule from /usr/share/cluster/SAPDatabase
Loading resource rule from /usr/share/cluster/postgres-8.sh
Loading resource rule from /usr/share/cluster/lvm.sh
Loading resource rule from /usr/share/cluster/mysql.sh
Loading resource rule from /usr/share/cluster/lvm_by_vg.sh
Loading resource rule from /usr/share/cluster/service.sh
Loading resource rule from /usr/share/cluster/samba.sh
Loading resource rule from /usr/share/cluster/SAPInstance
Loading resource rule from /usr/share/cluster/checkquorum
Loading resource rule from /usr/share/cluster/ocf-shellfuncs
Loading resource rule from /usr/share/cluster/svclib_nfslock
Loading resource rule from /usr/share/cluster/script.sh
Loading resource rule from /usr/share/cluster/clusterfs.sh
Loading resource rule from /usr/share/cluster/fs.sh
Loading resource rule from /usr/share/cluster/oracledb.sh
Loading resource rule from /usr/share/cluster/nfsserver.sh
Loading resource rule from /usr/share/cluster/netfs.sh
Loading resource rule from /usr/share/cluster/orainstance.sh
Loading resource rule from /usr/share/cluster/vm.sh
Loading resource rule from /usr/share/cluster/lvm_by_lv.sh
Loading resource rule from /usr/share/cluster/tomcat-6.sh
Loading resource rule from /usr/share/cluster/ip.sh
Loading resource rule from /usr/share/cluster/nfsexport.sh
Loading resource rule from /usr/share/cluster/openldap.sh
Loading resource rule from /usr/share/cluster/ASEHAagent.sh
Loading resource rule from /usr/share/cluster/nfsclient.sh
Loading resource rule from /usr/share/cluster/named.sh
Loaded 24 resource rules
=== Resources List ===
Resource type: ip
Instances: 1/1
Agent: ip.sh
Attributes:
  address = 192.168.2.100 [ primary unique ]
  monitor_link = on
  nfslock [ inherit("service%nfslock") ]

Resource type: script
Agent: script.sh
Attributes:
  name = drbd [ primary unique ]
  file = /etc/init.d/drbd [ unique required ]
  service_name [ inherit("service%name") ]

Resource type: script
Agent: script.sh
Attributes:
  name = clvmd [ primary unique ]
  file = /etc/init.d/clvmd [ unique required ]
  service_name [ inherit("service%name") ]

Resource type: script
Agent: script.sh
Attributes:
  name = tgtd [ primary unique ]
  file = /etc/init.d/tgtd [ unique required ]
  service_name [ inherit("service%name") ]

Resource type: service [INLINE]
Instances: 1/1
Agent: service.sh
Attributes:
  name = an1_storage [ primary unique required ]
  domain = an1_only [ reconfig ]
  autostart = 0 [ reconfig ]
  exclusive = 0 [ reconfig ]
  nfslock = 0
  nfs_client_cache = 0
  recovery = restart [ reconfig ]
  depend_mode = hard
  max_restarts = 0
  restart_expire_time = 0
  priority = 0

Resource type: service [INLINE]
Instances: 1/1
Agent: service.sh
Attributes:
  name = an2_storage [ primary unique required ]
  domain = an2_only [ reconfig ]
  autostart = 0 [ reconfig ]
  exclusive = 0 [ reconfig ]
  nfslock = 0
  nfs_client_cache = 0
  recovery = restart [ reconfig ]
  depend_mode = hard
  max_restarts = 0
  restart_expire_time = 0
  priority = 0

Resource type: service [INLINE]
Instances: 1/1
Agent: service.sh
Attributes:
  name = san_ip [ primary unique required ]
  domain = an1_primary [ reconfig ]
  autostart = 0 [ reconfig ]
  exclusive = 0 [ reconfig ]
  nfslock = 0
  nfs_client_cache = 0
  recovery = relocate [ reconfig ]
  depend_mode = hard
  max_restarts = 0
  restart_expire_time = 0
  priority = 0

=== Resource Tree ===
service (S0) {
  name = "an1_storage";
  domain = "an1_only";
  autostart = "0";
  exclusive = "0";
  nfslock = "0";
  nfs_client_cache = "0";
  recovery = "restart";
  depend_mode = "hard";
  max_restarts = "0";
  restart_expire_time = "0";
  priority = "0";
  script (S0) {
    name = "drbd";
    file = "/etc/init.d/drbd";
    service_name = "an1_storage";
    script (S0) {
      name = "clvmd";
      file = "/etc/init.d/clvmd";
      service_name = "an1_storage";
    }
  }
}
service (S0) {
  name = "an2_storage";
  domain = "an2_only";
  autostart = "0";
  exclusive = "0";
  nfslock = "0";
  nfs_client_cache = "0";
  recovery = "restart";
  depend_mode = "hard";
  max_restarts = "0";
  restart_expire_time = "0";
  priority = "0";
  script (S0) {
    name = "drbd";
    file = "/etc/init.d/drbd";
    service_name = "an2_storage";
    script (S0) {
      name = "clvmd";
      file = "/etc/init.d/clvmd";
      service_name = "an2_storage";
    }
  }
}
service (S0) {
  name = "san_ip";
  domain = "an1_primary";
  autostart = "0";
  exclusive = "0";
  nfslock = "0";
  nfs_client_cache = "0";
  recovery = "relocate";
  depend_mode = "hard";
  max_restarts = "0";
  restart_expire_time = "0";
  priority = "0";
  ip (S0) {
    address = "192.168.2.100";
    monitor_link = "on";
    nfslock = "0";
  }
}
=== Failover Domains ===
Failover domain: an1_only
Flags: Restricted No Failback
  Node an-node01.alteeve.ca (id 1, priority 0)
Failover domain: an2_only
Flags: Restricted No Failback
  Node an-node02.alteeve.ca (id 2, priority 0)
Failover domain: an1_primary
Flags: Ordered No Failback
  Node an-node01.alteeve.ca (id 1, priority 1)
  Node an-node02.alteeve.ca (id 2, priority 2)
=== Event Triggers ===
Event Priority Level 100:
  Name: Default
    (Any event)
    File: /usr/share/cluster/default_event_script.sl
</source>

Tell the cluster to load the updated configuration (this is distributed via <span class="code">ricci</span>, hence the password prompt):

<source lang="bash">
cman_tool version -r
</source>
<source lang="text">
You have not authenticated to the ricci daemon on an-node01.alteeve.ca
Password:
</source>

'''<span class="code">an-node01</span>''':

<source lang="bash">
clusvcadm -e service:an1_storage
</source>
<source lang="text">
service:an1_storage is now running on an-node01.alteeve.ca
</source>

'''<span class="code">an-node02</span>''':

<source lang="bash">
clusvcadm -e service:an2_storage
</source>
<source lang="text">
service:an2_storage is now running on an-node02.alteeve.ca
</source>

'''Either''':

<source lang="bash">
cat /proc/drbd
</source>
<source lang="text">
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:924 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:916 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
</source>
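
If you want to confirm that a node actually loaded the new configuration after <span class="code">cman_tool version -r</span>, the running config version can be checked on each node; it should now report the version from the file above (17):

<source lang="bash">
cman_tool version
</source>
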
== Configure Clustered LVM ==

'''<span class="code">an-node01</span>''':

<source lang="bash">
cp /etc/lvm/lvm.conf /etc/lvm/lvm.conf.orig
vim /etc/lvm/lvm.conf
diff -u /etc/lvm/lvm.conf.orig /etc/lvm/lvm.conf
</source>
<source lang="diff">
--- /etc/lvm/lvm.conf.orig 2011-08-02 21:59:01.000000000 -0400
+++ /etc/lvm/lvm.conf 2011-08-02 22:00:17.000000000 -0400
@@ -50,7 +50,8 @@
     # By default we accept every block device:
-    filter = [ "a/.*/" ]
+    #filter = [ "a/.*/" ]
+    filter = [ "a|/dev/drbd*|", "r/.*/" ]
 
     # Exclude the cdrom drive
     # filter = [ "r|/dev/cdrom|" ]
@@ -308,7 +309,8 @@
     # Type 3 uses built-in clustered locking.
     # Type 4 uses read-only locking which forbids any operations that might
     # change metadata.
-    locking_type = 1
+    #locking_type = 1
+    locking_type = 3
 
     # Set to 0 to fail when a lock request cannot be satisfied immediately.
     wait_for_locks = 1
@@ -324,7 +326,8 @@
     # to 1 an attempt will be made to use local file-based locking (type 1).
     # If this succeeds, only commands against local volume groups will proceed.
     # Volume Groups marked as clustered will be ignored.
-    fallback_to_local_locking = 1
+    #fallback_to_local_locking = 1
+    fallback_to_local_locking = 0
 
     # Local non-LV directory that holds file-based locks while commands are
     # in progress.  A directory like /tmp that may get wiped on reboot is OK.
</source>
<source lang="bash">
rsync -av /etc/lvm/lvm.conf root@an-node02:/etc/lvm/
</source>
<source lang="text">
sending incremental file list
lvm.conf

sent 2412 bytes  received 247 bytes  5318.00 bytes/sec
total size is 24668  speedup is 9.28
</source>
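
Before creating any clustered volumes, it can be worth confirming that LVM is actually using the cluster locking just configured. One quick, optional check (assuming your <span class="code">lvm2</span> build provides <span class="code">dumpconfig</span>):

<source lang="bash">
# Should print locking_type=3 if lvm.conf was edited correctly.
lvm dumpconfig global/locking_type
</source>
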
Create the LVM PVs, VGs and LVs.

'''<span class="code">an-node01</span>''':

<source lang="bash">
pvcreate /dev/drbd{0,1}
</source>
<source lang="text">
  Physical volume "/dev/drbd0" successfully created
  Physical volume "/dev/drbd1" successfully created
</source>

'''<span class="code">an-node02</span>''':

<source lang="bash">
pvscan
</source>
<source lang="text">
  PV /dev/drbd0                      lvm2 [421.50 GiB]
  PV /dev/drbd1                      lvm2 [27.95 GiB]
  Total: 2 [449.45 GiB] / in use: 0 [0   ] / in no VG: 2 [449.45 GiB]
</source>

'''<span class="code">an-node01</span>''':

<source lang="bash">
vgcreate -c y hdd_vg0 /dev/drbd0 && vgcreate -c y ssd_vg0 /dev/drbd1
</source>
<source lang="text">
  Clustered volume group "hdd_vg0" successfully created
  Clustered volume group "ssd_vg0" successfully created
</source>

'''<span class="code">an-node02</span>''':

<source lang="bash">
vgscan
</source>
<source lang="text">
  Reading all physical volumes.  This may take a while...
  Found volume group "ssd_vg0" using metadata type lvm2
  Found volume group "hdd_vg0" using metadata type lvm2
</source>

'''<span class="code">an-node01</span>''':

<source lang="bash">
lvcreate -l 100%FREE -n lun0 /dev/hdd_vg0 && lvcreate -l 100%FREE -n lun1 /dev/ssd_vg0
</source>
<source lang="text">
  Logical volume "lun0" created
  Logical volume "lun1" created
</source>

'''<span class="code">an-node02</span>''':

<source lang="bash">
lvscan
</source>
<source lang="text">
  ACTIVE            '/dev/ssd_vg0/lun1' [27.95 GiB] inherit
  ACTIVE            '/dev/hdd_vg0/lun0' [421.49 GiB] inherit
</source>
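
Either node can also confirm that the new volume groups carry the clustered flag (the <span class="code">c</span> in the VG attributes):

<source lang="bash">
vgs -o vg_name,vg_attr,vg_size
</source>
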
= iSCSI notes =

IET vs tgt pros and cons needed.

The default iSCSI port is 3260.

* ''initiator'': This is the client.
* ''target'': This is the server side.
* ''sid'': Session ID; found with <span class="code">iscsiadm -m session -P 1</span>. The SID and sysfs path are not persistent; they are partially start-order based.
* ''iQN'': iSCSI Qualified Name; a string that uniquely identifies targets and initiators.

'''Both''':

<source lang="bash">
yum install iscsi-initiator-utils scsi-target-utils
</source>
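
Once an initiator has logged into the target (done below, in the SAN-connection section), the session details mentioned above, including the SID, can be inspected from the client with:

<source lang="bash">
iscsiadm -m session -P 1
</source>
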


'''<span class="code">an-node01</span>''':

<source lang="bash">
cp /etc/tgt/targets.conf /etc/tgt/targets.conf.orig
vim /etc/tgt/targets.conf
diff -u /etc/tgt/targets.conf.orig /etc/tgt/targets.conf
</source>
<source lang="diff">
--- /etc/tgt/targets.conf.orig 2011-07-31 12:38:35.000000000 -0400
+++ /etc/tgt/targets.conf 2011-08-02 22:19:06.000000000 -0400
@@ -251,3 +251,9 @@
 #        vendor_id VENDOR1
 #    </direct-store>
 #</target>
+
+<target iqn.2011-08.com.alteeve:an-clusterA.target01>
+	direct-store /dev/drbd0
+	direct-store /dev/drbd1
+	vendor_id Alteeve
+</target>
</source>

<source lang="bash">
rsync -av /etc/tgt/targets.conf root@an-node02:/etc/tgt/
</source>
<source lang="text">
sending incremental file list
targets.conf

sent 909 bytes  received 97 bytes  670.67 bytes/sec
total size is 7093  speedup is 7.05
</source>

=== Update the cluster ===

Add <span class="code">tgtd</span> to the two storage services so that it starts after <span class="code">clvmd</span>:

<source lang="xml">
                <service autostart="0" domain="an1_only" exclusive="0" max_restarts="0" name="an1_storage" recovery="restart" restart_expire_time="0">
                        <script ref="drbd">
                                <script ref="clvmd">
                                        <script ref="tgtd"/>
                                </script>
                        </script>
                </service>
                <service autostart="0" domain="an2_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart" restart_expire_time="0">
                        <script ref="drbd">
                                <script ref="clvmd">
                                        <script ref="tgtd"/>
                                </script>
                        </script>
                </service>
</source>
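
Once <span class="code">tgtd</span> is running with this configuration on a node, the exported target and its LUNs can be reviewed from that node (a quick check using a tool from <span class="code">scsi-target-utils</span>):

<source lang="bash">
tgt-admin --show
</source>
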
= Connect to the SAN from a VM node =

'''<span class="code">an-node03+</span>''':

<source lang="bash">
iscsiadm -m discovery -t sendtargets -p 192.168.2.100
</source>
<source lang="text">
192.168.2.100:3260,1 iqn.2011-08.com.alteeve:an-clusterA.target01
</source>

<source lang="bash">
iscsiadm --mode node --portal 192.168.2.100 --target iqn.2011-08.com.alteeve:an-clusterA.target01 --login
</source>
<source lang="text">
Logging in to [iface: default, target: iqn.2011-08.com.alteeve:an-clusterA.target01, portal: 192.168.2.100,3260]
Login to [iface: default, target: iqn.2011-08.com.alteeve:an-clusterA.target01, portal: 192.168.2.100,3260] successful.
</source>

<source lang="bash">
fdisk -l
</source>
<source lang="text">
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00062f4a

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          33      262144   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              33        5255    41943040   83  Linux
/dev/sda3            5255        5777     4194304   82  Linux swap / Solaris

Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sdb doesn't contain a valid partition table

Disk /dev/sdc: 30.0 GB, 30010245120 bytes
64 heads, 32 sectors/track, 28620 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sdc doesn't contain a valid partition table
</source>

== Setup the VM Cluster ==

Install RPMs.

<source lang="bash">
yum -y install lvm2-cluster cman fence-agents
</source>

Configure <span class="code">lvm.conf</span>.

<source lang="bash">
cp /etc/lvm/lvm.conf /etc/lvm/lvm.conf.orig
vim /etc/lvm/lvm.conf
diff -u /etc/lvm/lvm.conf.orig /etc/lvm/lvm.conf
</source>
<source lang="diff">
--- /etc/lvm/lvm.conf.orig 2011-08-02 21:59:01.000000000 -0400
+++ /etc/lvm/lvm.conf 2011-08-03 00:35:45.000000000 -0400
@@ -308,7 +308,8 @@
     # Type 3 uses built-in clustered locking.
     # Type 4 uses read-only locking which forbids any operations that might
     # change metadata.
-    locking_type = 1
+    #locking_type = 1
+    locking_type = 3
 
     # Set to 0 to fail when a lock request cannot be satisfied immediately.
     wait_for_locks = 1
@@ -324,7 +325,8 @@
     # to 1 an attempt will be made to use local file-based locking (type 1).
     # If this succeeds, only commands against local volume groups will proceed.
     # Volume Groups marked as clustered will be ignored.
-    fallback_to_local_locking = 1
+    #fallback_to_local_locking = 1
+    fallback_to_local_locking = 0
 
     # Local non-LV directory that holds file-based locks while commands are
     # in progress.  A directory like /tmp that may get wiped on reboot is OK.
</source>

<source lang="bash">
rsync -av /etc/lvm/lvm.conf root@an-node04:/etc/lvm/
</source>
<source lang="text">
sending incremental file list
lvm.conf

sent 873 bytes  received 247 bytes  2240.00 bytes/sec
total size is 24625  speedup is 21.99
</source>
<source lang="bash">
rsync -av /etc/lvm/lvm.conf root@an-node05:/etc/lvm/
</source>
<source lang="text">
sending incremental file list
lvm.conf

sent 873 bytes  received 247 bytes  2240.00 bytes/sec
total size is 24625  speedup is 21.99
</source>

Config the cluster.

<source lang="bash">
vim /etc/cluster/cluster.conf
</source>
<source lang="xml">
<?xml version="1.0"?>
<cluster config_version="5" name="an-clusterB">
        <totem rrp_mode="none" secauth="off"/>
        <clusternodes>
                <clusternode name="an-node03.alteeve.com" nodeid="1">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="3"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node04.alteeve.com" nodeid="2">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="4"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node05.alteeve.com" nodeid="3">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="5"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
        </fencedevices>
        <fence_daemon post_join_delay="30"/>
        <rm>
                <resources>
                        <script file="/etc/init.d/iscsi" name="iscsi" />
                        <script file="/etc/init.d/clvmd" name="clvmd" />
                </resources>
                <failoverdomains>
                        <failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node03.alteeve.com" />
                        </failoverdomain>
                        <failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node04.alteeve.com" />
                        </failoverdomain>
                        <failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node05.alteeve.com" />
                        </failoverdomain>
                </failoverdomains>
                <service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an3_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd"/>
                        </script>
                </service>
                <service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd"/>
                        </script>
                </service>
                <service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd"/>
                        </script>
                </service>
        </rm>
</cluster>
</source>

<source lang="bash">
ccs_config_validate
</source>
<source lang="text">
Configuration validates
</source>

Make sure iscsi and clvmd do not start on boot, stop both, then make sure they start and stop cleanly.

<source lang="bash">
chkconfig clvmd off; chkconfig iscsi off; /etc/init.d/iscsi stop && /etc/init.d/clvmd stop
</source>
<source lang="text">
Stopping iscsi:                                            [  OK  ]
</source>
<source lang="bash">
/etc/init.d/clvmd start && /etc/init.d/iscsi start && /etc/init.d/iscsi stop && /etc/init.d/clvmd stop
</source>
<source lang="text">
Starting clvmd: 
Activating VG(s):   No volume groups found
                                                           [  OK  ]
Starting iscsi:                                            [  OK  ]
Stopping iscsi:                                            [  OK  ]
Signaling clvmd to exit                                    [  OK  ]
clvmd terminated                                           [  OK  ]
</source>

Use the cluster to stop (in case it autostarted before now) and then start the services.

<source lang="bash">
# Disable (stop)
clusvcadm -d service:an3_storage
clusvcadm -d service:an4_storage
clusvcadm -d service:an5_storage
# Enable (start)
clusvcadm -e service:an3_storage -m an-node03.alteeve.com
clusvcadm -e service:an4_storage -m an-node04.alteeve.com
clusvcadm -e service:an5_storage -m an-node05.alteeve.com
# Check
clustat
</source>
<source lang="text">
Cluster Status for an-clusterB @ Wed Aug  3 00:25:10 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node03.alteeve.com                       1 Online, Local, rgmanager
 an-node04.alteeve.com                       2 Online, rgmanager
 an-node05.alteeve.com                       3 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 service:an3_storage            an-node03.alteeve.com          started       
 service:an4_storage            an-node04.alteeve.com          started       
 service:an5_storage            an-node05.alteeve.com          started       
</source>

== Flush iSCSI's Cache ==

If you remove an iQN (or change the name of one), the <span class="code">/etc/init.d/iscsi</span> script will return errors. To flush it and re-scan (I am sure there is a more elegant way):

<source lang="bash">
/etc/init.d/iscsi stop && rm -rf /var/lib/iscsi/nodes/* && iscsiadm -m discovery -t sendtargets -p 192.168.2.100
</source>
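
A slightly more surgical option, rather than wiping everything under <span class="code">/var/lib/iscsi/nodes/</span>, may be to delete just the one stale record and then re-run discovery (shown here with this tutorial's target and portal as the example):

<source lang="bash">
iscsiadm -m node -T iqn.2011-08.com.alteeve:an-clusterA.target01 -p 192.168.2.100 -o delete
iscsiadm -m discovery -t sendtargets -p 192.168.2.100
</source>
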
</source>


== Setup the VM Cluster's Clustered LVM ==
== Setup the VM Cluster ==


=== Partition the SAN disks ===
Install RPMs.
 
'''<span class="code">an-node03</span>''':


<source lang="bash">
<source lang="bash">
fdisk -l
yum -y install lvm2-cluster cman fence-agents
</source>
</source>
<source lang="text">
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00062f4a


  Device Boot      Start        End      Blocks  Id  System
Configure <span class="code">lvm.conf</span>.
/dev/sda1  *          1          33      262144  83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              33        5255    41943040  83  Linux
/dev/sda3            5255        5777    4194304  82  Linux swap / Solaris


Disk /dev/sdc: 30.0 GB, 30010245120 bytes
<source lang="bash">
64 heads, 32 sectors/track, 28620 cylinders
cp /etc/lvm/lvm.conf /etc/lvm/lvm.conf.orig
Units = cylinders of 2048 * 512 = 1048576 bytes
vim /etc/lvm/lvm.conf
Sector size (logical/physical): 512 bytes / 512 bytes
diff -u /etc/lvm/lvm.conf.orig /etc/lvm/lvm.conf
I/O size (minimum/optimal): 512 bytes / 512 bytes
</source>
Disk identifier: 0x00000000
<source lang="diff">
 
--- /etc/lvm/lvm.conf.orig 2011-08-02 21:59:01.000000000 -0400
Disk /dev/sdc doesn't contain a valid partition table
+++ /etc/lvm/lvm.conf 2011-08-03 00:35:45.000000000 -0400
 
@@ -308,7 +308,8 @@
Disk /dev/sdb: 452.6 GB, 452573790208 bytes
    # Type 3 uses built-in clustered locking.
255 heads, 63 sectors/track, 55022 cylinders
    # Type 4 uses read-only locking which forbids any operations that might
Units = cylinders of 16065 * 512 = 8225280 bytes
    # change metadata.
Sector size (logical/physical): 512 bytes / 512 bytes
-    locking_type = 1
I/O size (minimum/optimal): 512 bytes / 512 bytes
+    #locking_type = 1
Disk identifier: 0x00000000
+    locking_type = 3
 
Disk /dev/sdb doesn't contain a valid partition table
    # Set to 0 to fail when a lock request cannot be satisfied immediately.
</source>
    wait_for_locks = 1
 
@@ -324,7 +325,8 @@
Create partitions.
    # to 1 an attempt will be made to use local file-based locking (type 1).
 
    # If this succeeds, only commands against local volume groups will proceed.
<source lang="bash">
    # Volume Groups marked as clustered will be ignored.
fdisk /dev/sdb
-    fallback_to_local_locking = 1
+    #fallback_to_local_locking = 1
+    fallback_to_local_locking = 0
    # Local non-LV directory that holds file-based locks while commands are
    # in progress.  A directory like /tmp that may get wiped on reboot is OK.
</source>
 
<source lang="bash">
rsync -av /etc/lvm/lvm.conf root@an-node04:/etc/lvm/
</source>
</source>
<source lang="text">
<source lang="text">
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
sending incremental file list
Building a new DOS disklabel with disk identifier 0x403f1fb8.
lvm.conf
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.


Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
sent 873 bytes  received 247 bytes  2240.00 bytes/sec
 
total size is 24625  speedup is 21.99
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
</source>
        switch off the mode (command 'c') and change display units to
<source lang="bash">
        sectors (command 'u').
rsync -av /etc/lvm/lvm.conf root@an-node05:/etc/lvm/
</source>
<source lang="text">
sending incremental file list
lvm.conf


Command (m for help): c
sent 873 bytes  received 247 bytes  2240.00 bytes/sec
DOS Compatibility flag is not set
total size is 24625  speedup is 21.99
</source>
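Before moving on, it doesn't hurt to confirm that all three nodes now agree on the locking settings. This is only a sanity check, not a required step; it assumes the same password-less <span class="code">root</span> SSH access used for the <span class="code">rsync</span> calls above.

<source lang="bash">
# Sanity check only; confirm clustered locking is set on all three nodes.
for node in an-node03 an-node04 an-node05; do
	echo "== ${node} =="
	ssh root@${node} "grep -E '^[[:space:]]*(locking_type|fallback_to_local_locking)' /etc/lvm/lvm.conf"
done
</source>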


Command (m for help): u
Configure the cluster.
Changing display/entry units to sectors


Command (m for help): n
<source lang="bash">
Command action
vim /etc/cluster/cluster.conf
  e  extended
</source>
  p  primary partition (1-4)
<source lang="xml">
p
<?xml version="1.0"?>
Partition number (1-4): 1
<cluster config_version="5" name="an-clusterB">
First cylinder (1-55022, default 1): 1
        <totem rrp_mode="none" secauth="off"/>
Last cylinder, +cylinders or +size{K,M,G} (1-55022, default 55022):
        <clusternodes>
Using default value 55022
                <clusternode name="an-node03.alteeve.ca" nodeid="1">
 
                        <fence>
Command (m for help): p
                                <method name="apc_pdu">
 
                                        <device action="reboot" name="pdu2" port="3"/>
Disk /dev/sdb: 452.6 GB, 452573790208 bytes
                                </method>
255 heads, 63 sectors/track, 55022 cylinders
                        </fence>
Units = cylinders of 16065 * 512 = 8225280 bytes
                </clusternode>
Sector size (logical/physical): 512 bytes / 512 bytes
                <clusternode name="an-node04.alteeve.ca" nodeid="2">
I/O size (minimum/optimal): 512 bytes / 512 bytes
                        <fence>
Disk identifier: 0x403f1fb8
                                <method name="apc_pdu">
 
                                        <device action="reboot" name="pdu2" port="4"/>
  Device Boot      Start         End      Blocks  Id  System
                                </method>
/dev/sdb1              1      55022  441964183+  83  Linux
                        </fence>
 
                </clusternode>
Command (m for help): t
                <clusternode name="an-node05.alteeve.ca" nodeid="3">
Selected partition 1
                        <fence>
Hex code (type L to list codes): 8e
                                <method name="apc_pdu">
Changed system type of partition 1 to 8e (Linux LVM)
                                        <device action="reboot" name="pdu2" port="5"/>
 
                                </method>
Command (m for help): p
                        </fence>
 
                </clusternode>
Disk /dev/sdb: 452.6 GB, 452573790208 bytes
        </clusternodes>
255 heads, 63 sectors/track, 55022 cylinders
        <fencedevices>
Units = cylinders of 16065 * 512 = 8225280 bytes
                <fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
Sector size (logical/physical): 512 bytes / 512 bytes
        </fencedevices>
I/O size (minimum/optimal): 512 bytes / 512 bytes
        <fence_daemon post_join_delay="30"/>
Disk identifier: 0x403f1fb8
         <rm>
 
                <resources>
  Device Boot      Start        End      Blocks  Id  System
                        <script file="/etc/init.d/iscsi" name="iscsi" />
/dev/sdb1              1       55022  441964183+  8e  Linux LVM
                        <script file="/etc/init.d/clvmd" name="clvmd" />
 
                </resources>
Command (m for help): w
                <failoverdomains>
The partition table has been altered!
                        <failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
 
                                <failoverdomainnode name="an-node03.alteeve.ca" />
Calling ioctl() to re-read partition table.
                        </failoverdomain>
Syncing disks.
                        <failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node04.alteeve.ca" />
                        </failoverdomain>
                        <failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node05.alteeve.ca" />
                        </failoverdomain>
                </failoverdomains>
                <service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an3_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd"/>
                        </script>
                </service>
                <service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd"/>
                        </script>
                </service>
                <service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd"/>
                        </script>
                </service>
        </rm> 
</cluster>
</source>
</source>
<source lang="bash">
<source lang="bash">
fdisk /dev/sdc
ccs_config_validate
</source>
</source>
<source lang="text">
<source lang="text">
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Configuration validates
Building a new DOS disklabel with disk identifier 0xba7503eb.
</source>
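The validated <span class="code">cluster.conf</span> also has to exist on <span class="code">an-node04</span> and <span class="code">an-node05</span> before <span class="code">cman</span> starts on them. A minimal sketch, mirroring the <span class="code">rsync</span> used for <span class="code">lvm.conf</span> earlier:

<source lang="bash">
# Sketch; push the validated config to the other two nodes.
rsync -av /etc/cluster/cluster.conf root@an-node04:/etc/cluster/
rsync -av /etc/cluster/cluster.conf root@an-node05:/etc/cluster/
</source>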
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.


Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
Make sure that iscsi and clvmd do not start on boot, stop both, and then confirm that they start and stop cleanly.


WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
<source lang="bash">
        switch off the mode (command 'c') and change display units to
chkconfig clvmd off; chkconfig iscsi off; /etc/init.d/iscsi stop && /etc/init.d/clvmd stop
        sectors (command 'u').
</source>
<source lang="text">
Stopping iscsi:                                           [  OK  ]
</source>
<source lang="bash">
/etc/init.d/clvmd start && /etc/init.d/iscsi start && /etc/init.d/iscsi stop && /etc/init.d/clvmd stop
</source>
<source lang="text">
Starting clvmd:
Activating VG(s):  No volume groups found
                                                          [  OK  ]
Starting iscsi:                                            [  OK  ]
Stopping iscsi:                                            [  OK  ]
Signaling clvmd to exit                                    [  OK  ]
clvmd terminated                                          [  OK  ]
</source>
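If you want to double-check that neither script will start on boot, <span class="code">chkconfig</span> can list their runlevel settings. (Just a quick, optional check.)

<source lang="bash">
# Both should show "off" in runlevels 2 through 5.
chkconfig --list clvmd
chkconfig --list iscsi
</source>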


Command (m for help): c
Use the cluster to stop the services (in case they started automatically before now) and then start them again.
DOS Compatibility flag is not set


Command (m for help): u
<source lang="bash">
Changing display/entry units to sectors
# Disable (stop)
clusvcadm -d service:an3_storage
clusvcadm -d service:an4_storage
clusvcadm -d service:an5_storage
# Enable (start)
clusvcadm -e service:an3_storage -m an-node03.alteeve.ca
clusvcadm -e service:an4_storage -m an-node04.alteeve.ca
clusvcadm -e service:an5_storage -m an-node05.alteeve.ca
# Check
clustat
</source>
<source lang="text">
Cluster Status for an-clusterB @ Wed Aug  3 00:25:10 2011
Member Status: Quorate


Command (m for help): n
Member Name                            ID   Status
Command action
------ ----                            ---- ------
  e   extended
an-node03.alteeve.ca                        1 Online, Local, rgmanager
  p  primary partition (1-4)
an-node04.alteeve.ca                        2 Online, rgmanager
p
an-node05.alteeve.ca                        3 Online, rgmanager
Partition number (1-4): 1
First sector (2048-58613759, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-58613759, default 58613759):
Using default value 58613759


Command (m for help): t
Service Name                  Owner (Last)                   State       
Selected partition 1
------- ----                  ----- ------                  -----       
Hex code (type L to list codes): 8e
service:an3_storage            an-node03.alteeve.ca          started     
Changed system type of partition 1 to 8e (Linux LVM)
service:an4_storage            an-node04.alteeve.ca          started     
service:an5_storage            an-node05.alteeve.ca          started     
</source>
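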


Command (m for help): p
== Flush iSCSI's Cache ==


Disk /dev/sdc: 30.0 GB, 30010245120 bytes
If you remove an IQN (or change its name), the <span class="code">/etc/init.d/iscsi</span> script will return errors. To flush the cached targets and re-scan:
64 heads, 32 sectors/track, 28620 cylinders, total 58613760 sectors
 
Units = sectors of 1 * 512 = 512 bytes
I am sure there is a more elegant way.
Sector size (logical/physical): 512 bytes / 512 bytes
 
I/O size (minimum/optimal): 512 bytes / 512 bytes
<source lang="bash">
Disk identifier: 0xba7503eb
/etc/init.d/iscsi stop && rm -rf /var/lib/iscsi/nodes/* && iscsiadm -m discovery -t sendtargets -p 192.168.2.100
</source>
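One possibly more elegant option, offered only as an untested sketch: <span class="code">iscsiadm</span> can log out of and delete a single stale record instead of wiping the whole node database. The IQN below is just a placeholder; use the name of the record you actually want to drop.

<source lang="bash">
# Sketch only; 'iqn.2011-08.ca.alteeve:san01.target01' is a placeholder IQN.
iscsiadm -m node -T iqn.2011-08.ca.alteeve:san01.target01 -u
iscsiadm -m node -T iqn.2011-08.ca.alteeve:san01.target01 -o delete
iscsiadm -m discovery -t sendtargets -p 192.168.2.100
</source>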


  Device Boot      Start        End      Blocks  Id  System
== Setup the VM Cluster's Clustered LVM ==
/dev/sdc1            2048    58613759    29305856  8e  Linux LVM


Command (m for help): w
=== Partition the SAN disks ===
The partition table has been altered!


Calling ioctl() to re-read partition table.
'''<span class="code">an-node03</span>''':
Syncing disks.
</source>


<source lang="bash">
<source lang="bash">
Sector size (logical/physical): 512 bytes / 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xba7503eb
Disk identifier: 0x00000000


  Device Boot      Start        End      Blocks  Id  System
Disk /dev/sdc doesn't contain a valid partition table
/dev/sdc1              2      28620    29305856  8e  Linux LVM


Disk /dev/sdb: 452.6 GB, 452573790208 bytes
Disk /dev/sdb: 452.6 GB, 452573790208 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x403f1fb8
Disk identifier: 0x00000000


  Device Boot      Start        End      Blocks  Id  System
Disk /dev/sdb doesn't contain a valid partition table
/dev/sdb1              1      55022  441964183+  8e  Linux LVM
</source>
</source>


=== Setup LVM devices ===
Create partitions.
 
Create PV.
 
'''<span class="code">an-node03</span>''':


<source lang="bash">
<source lang="bash">
pvcreate /dev/sd{b,c}1
fdisk /dev/sdb
</source>
</source>
<source lang="text">
<source lang="text">
  Physical volume "/dev/sdb1" successfully created
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
  Physical volume "/dev/sdc1" successfully created
Building a new DOS disklabel with disk identifier 0x403f1fb8.
</source>
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.


'''<span class="code">an-node04</span> and <span class="code">an-node05</span>''':
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)


<source lang="bash">
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
pvscan
        switch off the mode (command 'c') and change display units to
</source>
        sectors (command 'u').
<source lang="text">
  PV /dev/sdb1                      lvm2 [421.49 GiB]
  PV /dev/sdc1                      lvm2 [27.95 GiB]
  Total: 2 [449.44 GiB] / in use: 0 [0  ] / in no VG: 2 [449.44 GiB]
</source>


Create the VGs.
Command (m for help): c
DOS Compatibility flag is not set


'''<span class="code">an-node03</span>''':
Command (m for help): u
Changing display/entry units to sectors


<source lang="bash">
Command (m for help): n
vgcreate -c y san_vg01 /dev/sdb1
Command action
</source>
  e  extended
<source lang="text">
  p   primary partition (1-4)
   Clustered volume group "san_vg01" successfully created
p
</source>
Partition number (1-4): 1
<source lang="bash">
First cylinder (1-55022, default 1): 1
vgcreate -c y san_vg02 /dev/sdc1
Last cylinder, +cylinders or +size{K,M,G} (1-55022, default 55022):
</source>
Using default value 55022
<source lang="text">
  Clustered volume group "san_vg02" successfully created
</source>


'''<span class="code">an-node04</span> and <span class="code">an-node05</span>''':
Command (m for help): p


<source lang="bash">
Disk /dev/sdb: 452.6 GB, 452573790208 bytes
vgscan
255 heads, 63 sectors/track, 55022 cylinders
</source>
Units = cylinders of 16065 * 512 = 8225280 bytes
<source lang="text">
Sector size (logical/physical): 512 bytes / 512 bytes
   Reading all physical volumes. This may take a while...
I/O size (minimum/optimal): 512 bytes / 512 bytes
   Found volume group "san_vg02" using metadata type lvm2
Disk identifier: 0x403f1fb8
  Found volume group "san_vg01" using metadata type lvm2
 
</source>
  Device Boot      Start        End      Blocks   Id System
/dev/sdb1              1      55022   441964183+  83  Linux
 
Command (m for help): t
Selected partition 1
Hex code (type L to list codes): 8e
Changed system type of partition 1 to 8e (Linux LVM)


Create the first VM's LVs.
Command (m for help): p


'''<span class="code">an-node03</span>''':
Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x403f1fb8


<source lang="bash">
  Device Boot      Start        End      Blocks   Id  System
lvcreate -L 10G -n shared01 /dev/san_vg01
/dev/sdb1              1      55022   441964183+  8e  Linux LVM
</source>
<source lang="text">
   Logical volume "shared01" created
</source>
<source lang="bash">
lvcreate -L 50G -n vm0001_hdd1 /dev/san_vg01
</source>
<source lang="text">
   Logical volume "vm0001_hdd1" created
</source>
<source lang="bash">
lvcreate -L 10G -n vm0001_ssd1 /dev/san_vg02
</source>
<source lang="text">
  Logical volume "vm0001_ssd1" created
</source>


'''<span class="code">an-node04</span> and <span class="code">an-node05</span>''':
Command (m for help): w
The partition table has been altered!


<source lang="bash">
Calling ioctl() to re-read partition table.
lvscan
Syncing disks.
</source>
<source lang="text">
  ACTIVE            '/dev/san_vg01/shared01' [10.00 GiB] inherit
  ACTIVE            '/dev/san_vg02/vm0001_ssd1' [10.00 GiB] inherit
  ACTIVE            '/dev/san_vg01/vm0001_hdd1' [50.00 GiB] inherit
</source>
</source>
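To confirm that both volume groups really were created as clustered (an optional check), look for a <span class="code">c</span> in the sixth position of the VG attribute string:

<source lang="bash">
# The 'c' in the attribute string (eg: 'wz--nc') marks a clustered VG.
vgs -o vg_name,vg_attr,vg_size
</source>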
== Create Shared GFS2 Partition ==
'''<span class="code">an-node03</span>''':


<source lang="bash">
<source lang="bash">
mkfs.gfs2 -p lock_dlm -j 5 -t an-clusterB:shared01 /dev/san_vg01/shared01
fdisk /dev/sdc
</source>
</source>
<source lang="text">
<source lang="text">
This will destroy any data on /dev/san_vg01/shared01.
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
It appears to contain: symbolic link to `../dm-2'
Building a new DOS disklabel with disk identifier 0xba7503eb.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.


Are you sure you want to proceed? [y/n] y
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)


Device:                   /dev/san_vg01/shared01
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
Blocksize:                4096
        switch off the mode (command 'c') and change display units to
Device Size                10.00 GB (2621440 blocks)
        sectors (command 'u').
Filesystem Size:          10.00 GB (2621438 blocks)
Journals:                  5
Resource Groups:          40
Locking Protocol:          "lock_dlm"
Lock Table:                "an-clusterB:shared01"
UUID:                      6C0D7D1D-A1D3-ED79-705D-28EE3D674E75
</source>
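The <span class="code">-j 5</span> above creates one journal for each of the five nodes (<span class="code">an-node03</span> through <span class="code">an-node07</span>) that will mount this filesystem. If more nodes are ever added, extra journals can be added to the mounted filesystem later; a sketch only:

<source lang="bash">
# Sketch; add two more journals to the mounted /shared01 filesystem.
gfs2_jadd -j 2 /shared01
</source>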


Add it to <span class="code">/etc/fstab</span> (needed for the <span class="code">gfs2</span> init script to find and mount):
Command (m for help): c
DOS Compatibility flag is not set


'''<span class="code">an-node03</span> - <span class="code">an-node07</span>''':
Command (m for help): u
Changing display/entry units to sectors


<source lang="bash">
Command (m for help): n
echo `gfs2_edit -p sb /dev/san_vg01/shared01 | grep sb_uuid | sed -e "s/.*sb_uuid  *\(.*\)/UUID=\L\1\E \/shared01\t\tgfs2\trw,suid,dev,exec,nouser,async\t0 0/"` >> /etc/fstab
Command action
cat /etc/fstab
  e  extended
</source>
  p  primary partition (1-4)
<source lang="bash">
p
#
Partition number (1-4): 1
# /etc/fstab
First sector (2048-58613759, default 2048):  
# Created by anaconda on Fri Jul  8 22:01:41 2011
Using default value 2048
#
Last sector, +sectors or +size{K,M,G} (2048-58613759, default 58613759):
# Accessible filesystems, by reference, are maintained under '/dev/disk'
Using default value 58613759
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
 
#
Command (m for help): t
UUID=2c1f4cb1-959f-4675-b9c7-5d753c303dd1 /                      ext3    defaults        1 1
Selected partition 1
UUID=9a0224dc-15b4-439e-8d7c-5f9dbcd05e3f /boot                  ext3    defaults        1 2
Hex code (type L to list codes): 8e
UUID=4f2a83e8-1769-40d8-ba2a-e1f535306848 swap                    swap    defaults        0 0
Changed system type of partition 1 to 8e (Linux LVM)
tmpfs                  /dev/shm                tmpfs  defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                  /sys                    sysfs  defaults        0 0
proc                    /proc                  proc    defaults        0 0
UUID=6c0d7d1d-a1d3-ed79-705d-28ee3d674e75 /shared01 gfs2 rw,suid,dev,exec,nouser,async 0 0
</source>
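The <span class="code">gfs2_edit</span> one-liner is just one way to get the UUID; if you prefer to build the line by hand, <span class="code">blkid</span> should report the same value. A small sketch:

<source lang="bash">
# Sketch; print the UUID and the matching fstab line.
uuid=$(blkid -o value -s UUID /dev/san_vg01/shared01)
echo "UUID=${uuid} /shared01 gfs2 rw,suid,dev,exec,nouser,async 0 0"
</source>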


Make the mount point and mount it.
Command (m for help): p


<source lang="bash">
Disk /dev/sdc: 30.0 GB, 30010245120 bytes
mkdir /shared01
64 heads, 32 sectors/track, 28620 cylinders, total 58613760 sectors
/etc/init.d/gfs2 start
Units = sectors of 1 * 512 = 512 bytes
</source>
Sector size (logical/physical): 512 bytes / 512 bytes
<source lang="text">
I/O size (minimum/optimal): 512 bytes / 512 bytes
Mounting GFS2 filesystem (/shared01):                     [  OK  ]
Disk identifier: 0xba7503eb
 
  Device Boot      Start        End      Blocks  Id  System
/dev/sdc1            2048    58613759    29305856  8e  Linux LVM
 
Command (m for help): w
The partition table has been altered!
 
Calling ioctl() to re-read partition table.
Syncing disks.
</source>
</source>
<source lang="bash">
<source lang="bash">
df -h
fdisk -l
</source>
</source>
<source lang="text">
<source lang="text">
Filesystem            Size Used Avail Use% Mounted on
Disk /dev/sda: 500.1 GB, 500107862016 bytes
/dev/sda2              40G  3.3G   35G  9% /
255 heads, 63 sectors/track, 60801 cylinders
tmpfs                1.8G   32M 1.8G  2% /dev/shm
Units = cylinders of 16065 * 512 = 8225280 bytes
/dev/sda1            248M   85M 151M  36% /boot
Sector size (logical/physical): 512 bytes / 512 bytes
/dev/mapper/san_vg01-shared01
I/O size (minimum/optimal): 512 bytes / 512 bytes
                      10G 647M 9.4G   7% /shared01
Disk identifier: 0x00062f4a
 
  Device Boot      Start        End      Blocks  Id System
/dev/sda1   *          1         33      262144   83 Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              33        5255    41943040  83  Linux
/dev/sda3            5255        5777    4194304   82 Linux swap / Solaris
 
Disk /dev/sdc: 30.0 GB, 30010245120 bytes
64 heads, 32 sectors/track, 28620 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xba7503eb
 
  Device Boot      Start        End      Blocks  Id System
/dev/sdc1              2      28620    29305856  8e Linux LVM
 
Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x403f1fb8
 
  Device Boot      Start        End      Blocks   Id  System
/dev/sdb1              1      55022  441964183+  8e  Linux LVM
</source>
</source>


Stop GFS2 on all five nodes and update <span class="code">cluster.conf</span>.
=== Setup LVM devices ===


<source lang="bash">
Create PV.
/etc/init.d/gfs2 stop
 
'''<span class="code">an-node03</span>''':
 
<source lang="bash">
pvcreate /dev/sd{b,c}1
</source>
</source>
<source lang="text">
<source lang="text">
Unmounting GFS2 filesystem (/shared01):                    [  OK  ]
  Physical volume "/dev/sdb1" successfully created
  Physical volume "/dev/sdc1" successfully created
</source>
</source>
'''<span class="code">an-node04</span> and <span class="code">an-node05</span>''':
<source lang="bash">
<source lang="bash">
df -h
pvscan
</source>
</source>
<source lang="text">
<source lang="text">
Filesystem            Size  Used Avail Use% Mounted on
  PV /dev/sdb1                      lvm2 [421.49 GiB]
/dev/sda2              40G  3.3G  35G  9% /
   PV /dev/sdc1                      lvm2 [27.95 GiB]
tmpfs                1.8G  32M  1.8G   2% /dev/shm
  Total: 2 [449.44 GiB] / in use: 0 [0   ] / in no VG: 2 [449.44 GiB]
/dev/sda1            248M   85M  151M  36% /boot
/dev/mapper/san_vg01-shared01
                      10G  647M  9.4G  7% /shared01
</source>
</source>
Create the VGs.


'''<span class="code">an-node03</span>''':
'''<span class="code">an-node03</span>''':


<source lang="xml">
<source lang="bash">
<?xml version="1.0"?>
vgcreate -c y san_vg01 /dev/sdb1
<cluster config_version="9" name="an-clusterB">
</source>
        <totem rrp_mode="none" secauth="off"/>
<source lang="text">
        <clusternodes>
  Clustered volume group "san_vg01" successfully created
                <clusternode name="an-node03.alteeve.com" nodeid="3">
</source>
                        <fence>
<source lang="bash">
                                <method name="apc_pdu">
vgcreate -c y san_vg02 /dev/sdc1
                                        <device action="reboot" name="pdu2" port="3"/>
</source>
                                </method>
<source lang="text">
                        </fence>
  Clustered volume group "san_vg02" successfully created
                </clusternode>
</source>
                <clusternode name="an-node04.alteeve.com" nodeid="4">
 
                        <fence>
'''<span class="code">an-node04</span> and <span class="code">an-node05</span>''':
                                <method name="apc_pdu">
 
                                        <device action="reboot" name="pdu2" port="4"/>
<source lang="bash">
                                </method>
vgscan
                        </fence>
</source>
                </clusternode>
<source lang="text">
                <clusternode name="an-node05.alteeve.com" nodeid="5">
  Reading all physical volumes.  This may take a while...
                        <fence>
  Found volume group "san_vg02" using metadata type lvm2
                                <method name="apc_pdu">
  Found volume group "san_vg01" using metadata type lvm2
                                        <device action="reboot" name="pdu2" port="5"/>
</source>
                                </method>
 
                        </fence>
Create the first VM's LVs.
                </clusternode>
 
                <clusternode name="an-node06.alteeve.com" nodeid="6">
'''<span class="code">an-node03</span>''':
                        <fence>
 
                                <method name="apc_pdu">
<source lang="bash">
                                        <device action="reboot" name="pdu2" port="6"/>
lvcreate -L 10G -n shared01 /dev/san_vg01
                                </method>
</source>
                        </fence>
<source lang="text">
                </clusternode>
  Logical volume "shared01" created
                <clusternode name="an-node07.alteeve.com" nodeid="7">
</source>
                        <fence>
<source lang="bash">
                                <method name="apc_pdu">
lvcreate -L 50G -n vm0001_hdd1 /dev/san_vg01
                                        <device action="reboot" name="pdu2" port="7"/>
</source>
                                </method>
<source lang="text">
                        </fence>
  Logical volume "vm0001_hdd1" created
                </clusternode>
</source>
        </clusternodes>
<source lang="bash">
        <fencedevices>
lvcreate -L 10G -n vm0001_ssd1 /dev/san_vg02
                <fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
</source>
        </fencedevices>
<source lang="text">
        <fence_daemon post_join_delay="30"/>
  Logical volume "vm0001_ssd1" created
        <rm>
</source>
                <resources>
 
                        <script file="/etc/init.d/iscsi" name="iscsi"/>
'''<span class="code">an-node04</span> and <span class="code">an-node05</span>''':
                        <script file="/etc/init.d/clvmd" name="clvmd"/>
 
                        <script file="/etc/init.d/gfs2" name="gfs2"/>
<source lang="bash">
                </resources>
lvscan
                <failoverdomains>
</source>
                        <failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
<source lang="text">
                                <failoverdomainnode name="an-node03.alteeve.com"/>
  ACTIVE            '/dev/san_vg01/shared01' [10.00 GiB] inherit
                        </failoverdomain>
  ACTIVE            '/dev/san_vg02/vm0001_ssd1' [10.00 GiB] inherit
                        <failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
  ACTIVE            '/dev/san_vg01/vm0001_hdd1' [50.00 GiB] inherit
                                <failoverdomainnode name="an-node04.alteeve.com"/>
</source>
                        </failoverdomain>
 
                        <failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
== Create Shared GFS2 Partition ==
                                <failoverdomainnode name="an-node05.alteeve.com"/>
 
                        </failoverdomain>
'''<span class="code">an-node03</span>''':
                        <failoverdomain name="an6_only" nofailback="1" ordered="0" restricted="1">
 
                                <failoverdomainnode name="an-node06.alteeve.com"/>
<source lang="bash">
                        </failoverdomain>
mkfs.gfs2 -p lock_dlm -j 5 -t an-clusterB:shared01 /dev/san_vg01/shared01
                        <failoverdomain name="an7_only" nofailback="1" ordered="0" restricted="1">
</source>
                                <failoverdomainnode name="an-node07.alteeve.com"/>
<source lang="text">
                        </failoverdomain>
This will destroy any data on /dev/san_vg01/shared01.
                </failoverdomains>
It appears to contain: symbolic link to `../dm-2'
                <service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an3_storage" recovery="restart">
 
                        <script ref="iscsi">
Are you sure you want to proceed? [y/n] y
                                <script ref="clvmd">
 
                                        <script ref="gfs2"/>
Device:                    /dev/san_vg01/shared01
                                </script>
Blocksize:                 4096
                        </script>
Device Size                10.00 GB (2621440 blocks)
                </service>
Filesystem Size:          10.00 GB (2621438 blocks)
                <service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart">
Journals:                  5
                        <script ref="iscsi">
Resource Groups:          40
                                <script ref="clvmd">
Locking Protocol:          "lock_dlm"
                                        <script ref="gfs2"/>
Lock Table:                "an-clusterB:shared01"
                                </script>
UUID:                      6C0D7D1D-A1D3-ED79-705D-28EE3D674E75
                        </script>
</source>
                </service>
 
                <service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart">
Add it to <span class="code">/etc/fstab</span> (needed for the <span class="code">gfs2</span> init script to find and mount):
                        <script ref="iscsi">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
                <service autostart="1" domain="an6_only" exclusive="0" max_restarts="0" name="an6_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                 </service>
                <service autostart="1" domain="an7_only" exclusive="0" max_restarts="0" name="an7_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
        </rm>
</cluster>
</source>
<source lang="bash">
cman_tool version -r
</source>
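To confirm that the new version was pushed out everywhere, <span class="code">cman_tool version</span> (without <span class="code">-r</span>) reports the configuration version the running cluster is using; it should now show <span class="code">9</span> on every node. (Just a quick check.)

<source lang="bash">
# Run on each node; the reported config version should match cluster.conf (9 here).
cman_tool version
</source>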


Check that <span class="code">rgmanager</span> picked up the updated config and remounted the GFS2 partition.
'''<span class="code">an-node03</span> - <span class="code">an-node07</span>''':


<source lang="bash">
<source lang="bash">
df -h
echo `gfs2_edit -p sb /dev/san_vg01/shared01 | grep sb_uuid | sed -e "s/.*sb_uuid  *\(.*\)/UUID=\L\1\E \/shared01\t\tgfs2\trw,suid,dev,exec,nouser,async\t0 0/"` >> /etc/fstab
cat /etc/fstab
</source>
</source>
<source lang="text">
<source lang="bash">
Filesystem            Size Used Avail Use% Mounted on
#
/dev/sda2              40G  3.3G  35G  9% /
# /etc/fstab
tmpfs                1.8G  32M  1.8G  2% /dev/shm
# Created by anaconda on Fri Jul 8 22:01:41 2011
/dev/sda1            248M  85M 151M 36% /boot
#
/dev/mapper/san_vg01-shared01
# Accessible filesystems, by reference, are maintained under '/dev/disk'
                      10G  647M  9.4G  7% /shared01
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=2c1f4cb1-959f-4675-b9c7-5d753c303dd1 /                      ext3    defaults        1 1
UUID=9a0224dc-15b4-439e-8d7c-5f9dbcd05e3f /boot                  ext3    defaults        1 2
UUID=4f2a83e8-1769-40d8-ba2a-e1f535306848 swap                    swap    defaults        0 0
tmpfs                  /dev/shm               tmpfs  defaults        0 0
devpts                  /dev/pts                devpts gid=5,mode=620 0 0
sysfs                  /sys                    sysfs  defaults        0 0
proc                    /proc                  proc    defaults        0 0
UUID=6c0d7d1d-a1d3-ed79-705d-28ee3d674e75 /shared01 gfs2 rw,suid,dev,exec,nouser,async 0 0
</source>
</source>


= Configure KVM =
Make the mount point and mount it.
 
Host network and VM hypervisor config.
 
== Configure Bridges ==
 
On '''<span class="code">an-node03</span>''' through '''<span class="code">an-node07</span>''':


<source lang="bash">
<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-{eth,vbr}0
mkdir /shared01
/etc/init.d/gfs2 start
</source>
<source lang="text">
Mounting GFS2 filesystem (/shared01):                      [  OK  ]
</source>
</source>
''<span class="code">ifcfg-eth0</span>'':
<source lang="bash">
<source lang="bash">
# Internet facing
df -h
HWADDR="bc:ae:c5:44:8a:de"
</source>
DEVICE="eth0"
<source lang="text">
BRIDGE="vbr0"
Filesystem            Size  Used Avail Use% Mounted on
BOOTPROTO="static"
/dev/sda2              40G  3.3G  35G  9% /
IPV6INIT="yes"
tmpfs                1.8G  32M  1.8G  2% /dev/shm
NM_CONTROLLED="no"
/dev/sda1            248M  85M  151M  36% /boot
ONBOOT="yes"
/dev/mapper/san_vg01-shared01
                      10G  647M  9.4G  7% /shared01
</source>
</source>


Note that you can use whatever bridge name makes sense to you. However, the file name for the bridge configuration must sort after the <span class="code">ifcfg-ethX</span> file. If the bridge file is read before the ethernet interface, it will fail to come up. Also, the bridge name as defined in the file does not need to match the one used in the actual file name. Personally, I like <span class="code">vbrX</span> for "''v''m ''br''idge".
Stop GFS2 on all five nodes and update <span class="code">cluster.conf</span>.


''<span class="code">ifcfg-vbr0</span>'':
<source lang="bash">
<source lang="bash">
# Bridge - IFN
/etc/init.d/gfs2 stop
DEVICE="vbr0"
</source>
TYPE="Bridge"
<source lang="text">
IPADDR=192.168.1.73
Unmounting GFS2 filesystem (/shared01):                    [  OK  ]
NETMASK=255.255.255.0
GATEWAY=192.168.1.254
DNS1=192.139.81.117
DNS2=192.139.81.1
</source>
</source>
If you would rather not make the Back-Channel Network accessible to the virtual machines, there is no need to set up this second bridge.
<source lang="bash">
<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-{eth,vbr}2
df -h
</source>
</source>
 
<source lang="text">
''<span class="code">ifcfg-eth2</span>'':
Filesystem            Size  Used Avail Use% Mounted on
<source lang="bash">
/dev/sda2              40G  3.3G  35G  9% /
# Back-channel
tmpfs                1.8G  32M  1.8G  2% /dev/shm
HWADDR="00:1B:21:72:9B:56"
/dev/sda1            248M  85M  151M  36% /boot
DEVICE="eth2"
/dev/mapper/san_vg01-shared01
BRIDGE="vbr2"
                      10G  647M  9.4G  7% /shared01
BOOTPROTO="static"
IPV6INIT="yes"
NM_CONTROLLED="no"
ONBOOT="yes"
</source>
</source>


''<span class="code">ifcfg-vbr2</span>'':
'''<span class="code">an-node03</span>''':
<source lang="bash">
# Bridge - BCN
DEVICE="vbr2"
TYPE="Bridge"
IPADDR=192.168.3.73
NETMASK=255.255.255.0
</source>


Leave the cluster, lest we be fenced.
<source lang="xml">
 
<?xml version="1.0"?>
<source lang="bash">
<cluster config_version="9" name="an-clusterB">
/etc/init.d/rgmanager stop && /etc/init.d/cman stop
        <totem rrp_mode="none" secauth="off"/>
</source>
        <clusternodes>
 
                <clusternode name="an-node03.alteeve.ca" nodeid="3">
Restart networking and then check that the new bridges are up and that the proper ethernet devices are slaved to them.
                        <fence>
 
                                <method name="apc_pdu">
<source lang="bash">
                                        <device action="reboot" name="pdu2" port="3"/>
/etc/init.d/network restart
                                </method>
</source>
                        </fence>
<source lang="text">
                </clusternode>
Shutting down interface eth0:                              [  OK  ]
                <clusternode name="an-node04.alteeve.ca" nodeid="4">
Shutting down interface eth1:                              [  OK  ]
                        <fence>
Shutting down interface eth2:                              [  OK  ]
                                <method name="apc_pdu">
Shutting down loopback interface:                          [  OK  ]
                                        <device action="reboot" name="pdu2" port="4"/>
Bringing up loopback interface:                            [  OK  ]
                                </method>
Bringing up interface eth0:                                [  OK  ]
                        </fence>
Bringing up interface eth1:                                [  OK  ]
                </clusternode>
Bringing up interface eth2:                                [  OK  ]
                <clusternode name="an-node05.alteeve.ca" nodeid="5">
Bringing up interface vbr0:                                [  OK  ]
                        <fence>
Bringing up interface vbr2:                                [  OK  ]
                                <method name="apc_pdu">
</source>
                                        <device action="reboot" name="pdu2" port="5"/>
 
                                </method>
<source lang="bash">
                        </fence>
brctl show
                </clusternode>
</source>
                <clusternode name="an-node06.alteeve.ca" nodeid="6">
<source lang="text">
                        <fence>
bridge name bridge id STP enabled interfaces
                                <method name="apc_pdu">
vbr0 8000.bcaec5448ade no eth0
                                        <device action="reboot" name="pdu2" port="6"/>
vbr2 8000.001b21729b56 no eth2
                                </method>
</source>
                        </fence>
 
                </clusternode>
<source lang="bash">
                <clusternode name="an-node07.alteeve.ca" nodeid="7">
ifconfig
                        <fence>
</source>
                                <method name="apc_pdu">
<source lang="text">
                                        <device action="reboot" name="pdu2" port="7"/>
eth0      Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE 
                                </method>
          inet6 addr: fe80::beae:c5ff:fe44:8ade/64 Scope:Link
                        </fence>
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
                </clusternode>
          RX packets:4439 errors:0 dropped:0 overruns:0 frame:0
        </clusternodes>
          TX packets:2752 errors:0 dropped:0 overruns:0 carrier:0
        <fencedevices>
          collisions:0 txqueuelen:1000
                <fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
          RX bytes:508352 (496.4 KiB)  TX bytes:494345 (482.7 KiB)
        </fencedevices>
          Interrupt:31 Base address:0x8000
        <fence_daemon post_join_delay="30"/>
 
        <rm>
eth1      Link encap:Ethernet  HWaddr 00:1B:21:72:96:E8 
                <resources>
          inet addr:192.168.2.73  Bcast:192.168.2.255  Mask:255.255.255.0
                        <script file="/etc/init.d/iscsi" name="iscsi"/>
          inet6 addr: fe80::21b:21ff:fe72:96e8/64 Scope:Link
                        <script file="/etc/init.d/clvmd" name="clvmd"/>
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
                        <script file="/etc/init.d/gfs2" name="gfs2"/>
          RX packets:617100 errors:0 dropped:0 overruns:0 frame:0
                </resources>
          TX packets:847718 errors:0 dropped:0 overruns:0 carrier:0
                <failoverdomains>
          collisions:0 txqueuelen:1000
                        <failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
          RX bytes:772489353 (736.7 MiB)  TX bytes:740536232 (706.2 MiB)
                                <failoverdomainnode name="an-node03.alteeve.ca"/>
          Interrupt:18 Memory:fe9e0000-fea00000
                        </failoverdomain>
 
                        <failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
eth2      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56 
                                <failoverdomainnode name="an-node04.alteeve.ca"/>
          inet6 addr: fe80::21b:21ff:fe72:9b56/64 Scope:Link
                        </failoverdomain>
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
                        <failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
          RX packets:86586 errors:0 dropped:0 overruns:0 frame:0
                                <failoverdomainnode name="an-node05.alteeve.ca"/>
          TX packets:80934 errors:0 dropped:0 overruns:0 carrier:0
                        </failoverdomain>
          collisions:0 txqueuelen:1000
                        <failoverdomain name="an6_only" nofailback="1" ordered="0" restricted="1">
          RX bytes:11366700 (10.8 MiB)  TX bytes:10091579 (9.6 MiB)
                                <failoverdomainnode name="an-node06.alteeve.ca"/>
          Interrupt:17 Memory:feae0000-feb00000
                        </failoverdomain>
 
                        <failoverdomain name="an7_only" nofailback="1" ordered="0" restricted="1">
lo        Link encap:Local Loopback 
                                <failoverdomainnode name="an-node07.alteeve.ca"/>
          inet addr:127.0.0.1 Mask:255.0.0.0
                        </failoverdomain>
          inet6 addr: ::1/128 Scope:Host
                </failoverdomains>
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
                <service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an3_storage" recovery="restart">
          RX packets:32 errors:0 dropped:0 overruns:0 frame:0
                        <script ref="iscsi">
          TX packets:32 errors:0 dropped:0 overruns:0 carrier:0
                                <script ref="clvmd">
          collisions:0 txqueuelen:0
                                        <script ref="gfs2"/>
          RX bytes:11507 (11.2 KiB)  TX bytes:11507 (11.2 KiB)
                                </script>
 
                        </script>
vbr0      Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE 
                </service>
          inet addr:192.168.1.73  Bcast:192.168.1.255  Mask:255.255.255.0
                <service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart">
          inet6 addr: fe80::beae:c5ff:fe44:8ade/64 Scope:Link
                        <script ref="iscsi">
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
                                <script ref="clvmd">
          RX packets:165 errors:0 dropped:0 overruns:0 frame:0
                                        <script ref="gfs2"/>
          TX packets:89 errors:0 dropped:0 overruns:0 carrier:0
                                </script>
          collisions:0 txqueuelen:0
                        </script>
          RX bytes:25875 (25.2 KiB)  TX bytes:17081 (16.6 KiB)
                </service>
 
                <service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart">
vbr2      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56 
                        <script ref="iscsi">
          inet addr:192.168.3.73  Bcast:192.168.3.255  Mask:255.255.255.0
                                <script ref="clvmd">
          inet6 addr: fe80::21b:21ff:fe72:9b56/64 Scope:Link
                                        <script ref="gfs2"/>
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
                                </script>
          RX packets:74 errors:0 dropped:0 overruns:0 frame:0
                        </script>
          TX packets:27 errors:0 dropped:0 overruns:0 carrier:0
                </service>
          collisions:0 txqueuelen:0
                <service autostart="1" domain="an6_only" exclusive="0" max_restarts="0" name="an6_storage" recovery="restart">
          RX bytes:19021 (18.5 KiB)  TX bytes:4137 (4.0 KiB)
                        <script ref="iscsi">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
                <service autostart="1" domain="an7_only" exclusive="0" max_restarts="0" name="an7_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
        </rm>
</cluster>
</source>
</source>
Rejoin the cluster.
<source lang="bash">
<source lang="bash">
/etc/init.d/cman start && /etc/init.d/rgmanager start
cman_tool version -r
</source>
</source>


Check that <span class="code">rgmanager</span> picked up the updated config and remounted the GFS2 partition.


Repeat these configurations, altering for [[MAC]] and [[IP]] addresses as appropriate, for the other four VM cluster nodes.
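As a rough sketch of what "altering" means here: only <span class="code">HWADDR</span> and the host part of the IP addresses change from node to node. The <span class="code">.74</span> addresses below are an assumption for <span class="code">an-node04</span>, following the <span class="code">.73</span> pattern used for <span class="code">an-node03</span>.

<source lang="bash">
# Sketch for an-node04, after copying the ifcfg files from an-node03.
# The .74 host addresses are assumed, not taken from the original configs.
sed -i 's/192\.168\.1\.73/192.168.1.74/' /etc/sysconfig/network-scripts/ifcfg-vbr0
sed -i 's/192\.168\.3\.73/192.168.3.74/' /etc/sysconfig/network-scripts/ifcfg-vbr2
vim /etc/sysconfig/network-scripts/ifcfg-eth{0,2}    # update HWADDR for the local NICs
</source>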
<span class="code"></span>
<source lang="bash">
<source lang="bash">
df -h
</source>
</source>
<source lang="text">
<source lang="text">
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              40G  3.3G  35G  9% /
tmpfs                1.8G  32M  1.8G  2% /dev/shm
/dev/sda1            248M  85M  151M  36% /boot
/dev/mapper/san_vg01-shared01
                      10G  647M  9.4G  7% /shared01
</source>
</source>


== Benchmarks ==
= Configure KVM =
 
Host network and VM hypervisor config.
 
== Disable the 'qemu' Bridge ==
 
By default, <span class="code">[[libvirtd]]</span> creates a bridge called <span class="code">virbr0</span> designed to connect virtual machines to the first <span class="code">eth0</span> interface. Our system will not need this, so we will remove it. This bridge is configured in the <span class="code">/etc/libvirt/qemu/networks/default.xml</span> file.


GFS2 partition on <span class="code">an-node07</span>'s <span class="code">/shared01</span> partition. Test #1, no optimization:
So to remove this bridge, simply delete the contents of the file, take the bridge down, delete the bridge, and then stop <span class="code">iptables</span> to make sure any rules created for the bridge are flushed.


<source lang="bash">
<source lang="bash">
bonnie++ -d /shared01/ -s 8g -u root:root
cat /dev/null >/etc/libvirt/qemu/networks/default.xml
</source>
ifconfig virbr0 down
<source lang="text">
brctl delbr virbr0
Version  1.96      ------Sequential Output------ --Sequential Input- --Random-
/etc/init.d/iptables stop
Concurrency  1    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
an-node07.alteev 8G  388  95 22203  6 14875  8  2978  95 48406  10 107.3  5
Latency              312ms  44400ms  31355ms  41505us    540ms  11926ms
Version  1.96      ------Sequential Create------ --------Random Create--------
an-node07.alteeve.c -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                16  1144  18 +++++ +++  8643  56  939  19 +++++ +++  8262  55
Latency              291ms    586us    2085us    3511ms      51us    3669us
1.96,1.96,an-node07.alteeve.com,1,1312497509,8G,,388,95,22203,6,14875,8,2978,95,48406,10,107.3,5,16,,,,,1144,18,+++++,+++,8643,56,939,19,+++++,+++,8262,55,312ms,44400ms,31355ms,41505us,540ms,11926ms,291ms,586us,2085us,3511ms,51us,3669us
</source>
</source>
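An equivalent way to get rid of the <span class="code">default</span> network is to remove it through <span class="code">libvirt</span> itself. This is a sketch I haven't verified on this cluster, but it avoids hand-deleting the bridge:

<source lang="bash">
# Sketch; stop and permanently remove libvirt's 'default' NAT network.
virsh net-destroy default
virsh net-autostart default --disable
virsh net-undefine default
virsh net-list --all
</source>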


CentOS 5.6 x86_64 VM <span class="code">vm0001_labzilla</span>'s <span class="code">/root</span> directory. Test #1, no optimization. The VM was provisioned using the command in the section below.
== Configure Bridges ==
 
On '''<span class="code">an-node03</span>''' through '''<span class="code">an-node07</span>''':


<source lang="bash">
<source lang="bash">
bonnie++ -d /root/ -s 8g -u root:root
vim /etc/sysconfig/network-scripts/ifcfg-{eth,vbr}0
</source>
</source>
<source lang="text">
 
Version  1.96      ------Sequential Output------ --Sequential Input- --Random-
''<span class="code">ifcfg-eth0</span>'':
Concurrency  1    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
<source lang="bash">
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
# Internet facing
labzilla-new.can 8G  674  98 15708  5 14875  7  1570  65 47806  10 119.1  7
HWADDR="bc:ae:c5:44:8a:de"
Latency            66766us    7680ms    1588ms    187ms    269ms    1292ms
DEVICE="eth0"
Version  1.96      ------Sequential Create------ --------Random Create--------
BRIDGE="vbr0"
labzilla-new.candco -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
BOOTPROTO="static"
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
IPV6INIT="yes"
                16 27666  39 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
NM_CONTROLLED="no"
Latency            11360us    1904us    799us    290us      44us      41us
ONBOOT="yes"
1.96,1.96,labzilla-new.candcoptical.com,1,1312522208,8G,,674,98,15708,5,14875,7,1570,65,47806,10,119.1,7,16,,,,,27666,39,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,66766us,7680ms,1588ms,187ms,269ms,1292ms,11360us,1904us,799us,290us,44us,41us
</source>
</source>


== Provision vm0001 ==
Note that you can use whatever bridge name makes sense to you. However, the file name for the bridge configuration must sort after the <span class="code">ifcfg-ethX</span> file. If the bridge file is read before the ethernet interface, it will fail to come up. Also, the bridge name as defined in the file does not need to match the one used in the actual file name. Personally, I like <span class="code">vbrX</span> for "''v''m ''br''idge".
 
Created LV already, so:


''<span class="code">ifcfg-vbr0</span>'':
<source lang="bash">
<source lang="bash">
virt-install --connect qemu:///system \
# Bridge - IFN
  --name vm0001_labzilla \
DEVICE="vbr0"
  --ram 1024 \
TYPE="Bridge"
  --arch x86_64 \
IPADDR=192.168.1.73
  --vcpus 2 \
NETMASK=255.255.255.0
  --cpuset 1-3 \
GATEWAY=192.168.1.254
  --location http://192.168.1.254/c5/x86_64/img/ \
DNS1=192.139.81.117
  --extra-args "ks=http://192.168.1.254/c5/x86_64/ks/labzilla_c5.ks" \
DNS2=192.139.81.1
  --os-type linux \
  --os-variant rhel5.4 \
  --disk path=/dev/san_vg01/vm0001_hdd1 \
  --network bridge=vbr0 \
  --vnc
</source>
</source>
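Once <span class="code">virt-install</span> kicks off, you can confirm the guest is defined and find its VNC display in order to watch the kickstart install. (A couple of optional checks.)

<source lang="bash">
# Optional checks; list defined guests and find vm0001's VNC display.
virsh list --all
virsh vncdisplay vm0001_labzilla
</source>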


== Provision vm0002 ==
If you would rather not make the Back-Channel Network accessible to the virtual machines, there is no need to set up this second bridge.


Created LV already, so:
<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-{eth,vbr}2
</source>


''<span class="code">ifcfg-eth2</span>'':
<source lang="bash">
<source lang="bash">
virt-install --connect qemu:///system \
# Back-channel
  --name vm0002_innovations \
HWADDR="00:1B:21:72:9B:56"
  --ram 1024 \
DEVICE="eth2"
  --arch x86_64 \
BRIDGE="vbr2"
  --vcpus 2 \
BOOTPROTO="static"
  --cpuset 1-3 \
IPV6INIT="yes"
  --cdrom /shared01/media/Win_Server_2008_Bis_x86_64.iso \
NM_CONTROLLED="no"
  --os-type windows \
ONBOOT="yes"
  --os-variant win2k8 \
</source>
  --disk path=/dev/san_vg01/vm0002_hdd2 \
 
  --network bridge=vbr0 \
''<span class="code">ifcfg-vbr2</span>'':
  --hvm \
<source lang="bash">
  --vnc
# Bridge - BCN
DEVICE="vbr2"
TYPE="Bridge"
IPADDR=192.168.3.73
NETMASK=255.255.255.0
</source>
</source>
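The <span class="code">cluster.conf</span> below points <span class="code">rgmanager</span> at <span class="code">path="/shared01/definitions/"</span>, so the VM definitions need to exist there before the <span class="code">vm</span> resources can be managed. One way to export them, assuming both guests are currently defined on this node (sketch only):

<source lang="bash">
# Sketch; export the libvirt definitions onto the shared GFS2 partition so
# any node in the failover domains can start the VMs.
mkdir -p /shared01/definitions
virsh dumpxml vm0001_labzilla    > /shared01/definitions/vm0001_labzilla.xml
virsh dumpxml vm0002_innovations > /shared01/definitions/vm0002_innovations.xml
</source>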


Update the <span class="code">cluster.conf</span> to add the VMs to the cluster.
Leave the cluster, lest we be fenced.


<source lang="xml">
<source lang="bash">
<?xml version="1.0"?>
/etc/init.d/rgmanager stop && /etc/init.d/cman stop
<cluster config_version="12" name="an-clusterB">
</source>
<totem rrp_mode="none" secauth="off"/>
 
<clusternodes>
Restart networking and then check that the new bridges are up and that the proper ethernet devices are slaved to them.
<clusternode name="an-node03.alteeve.com" nodeid="3">
 
<fence>
<source lang="bash">
<method name="apc_pdu">
/etc/init.d/network restart
<device action="reboot" name="pdu2" port="3"/>
</source>
</method>
<source lang="text">
</fence>
Shutting down interface eth0:                              [  OK  ]
</clusternode>
Shutting down interface eth1:                              [  OK  ]
<clusternode name="an-node04.alteeve.com" nodeid="4">
Shutting down interface eth2:                              [  OK  ]
<fence>
Shutting down loopback interface:                          [  OK  ]
<method name="apc_pdu">
Bringing up loopback interface:                            [  OK  ]
<device action="reboot" name="pdu2" port="4"/>
Bringing up interface eth0:                                [  OK  ]
</method>
Bringing up interface eth1:                                [  OK  ]
</fence>
Bringing up interface eth2:                                [  OK  ]
</clusternode>
Bringing up interface vbr0:                                [  OK  ]
<clusternode name="an-node05.alteeve.com" nodeid="5">
Bringing up interface vbr2:                                [  OK  ]
<fence>
</source>
<method name="apc_pdu">
 
<device action="reboot" name="pdu2" port="5"/>
<source lang="bash">
</method>
brctl show
</fence>
</source>
</clusternode>
<source lang="text">
<clusternode name="an-node06.alteeve.com" nodeid="6">
bridge name bridge id STP enabled interfaces
<fence>
vbr0 8000.bcaec5448ade no eth0
<method name="apc_pdu">
vbr2 8000.001b21729b56 no eth2
<device action="reboot" name="pdu2" port="6"/>
</source>
</method>
 
</fence>
<source lang="bash">
</clusternode>
ifconfig
<clusternode name="an-node07.alteeve.com" nodeid="7">
</source>
<fence>
<source lang="text">
<method name="apc_pdu">
eth0      Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE 
<device action="reboot" name="pdu2" port="7"/>
          inet6 addr: fe80::beae:c5ff:fe44:8ade/64 Scope:Link
</method>
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
          RX packets:4439 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2752 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:508352 (496.4 KiB)  TX bytes:494345 (482.7 KiB)
          Interrupt:31 Base address:0x8000

eth1      Link encap:Ethernet  HWaddr 00:1B:21:72:96:E8
          inet addr:192.168.2.73  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe72:96e8/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:617100 errors:0 dropped:0 overruns:0 frame:0
          TX packets:847718 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:772489353 (736.7 MiB)  TX bytes:740536232 (706.2 MiB)
          Interrupt:18 Memory:fe9e0000-fea00000

eth2      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56
          inet6 addr: fe80::21b:21ff:fe72:9b56/64 Scope:Link
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
          RX packets:86586 errors:0 dropped:0 overruns:0 frame:0
          TX packets:80934 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:11366700 (10.8 MiB)  TX bytes:10091579 (9.6 MiB)
          Interrupt:17 Memory:feae0000-feb00000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:32 errors:0 dropped:0 overruns:0 frame:0
          TX packets:32 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:11507 (11.2 KiB)  TX bytes:11507 (11.2 KiB)

vbr0      Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE
          inet addr:192.168.1.73  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::beae:c5ff:fe44:8ade/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:165 errors:0 dropped:0 overruns:0 frame:0
          TX packets:89 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:25875 (25.2 KiB)  TX bytes:17081 (16.6 KiB)

vbr2      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56
          inet addr:192.168.3.73  Bcast:192.168.3.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe72:9b56/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:74 errors:0 dropped:0 overruns:0 frame:0
          TX packets:27 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:19021 (18.5 KiB)  TX bytes:4137 (4.0 KiB)
</source>

The remainder of the <span class="code">cluster.conf</span> for the VM cluster nodes:

<source lang="xml">
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
</fencedevices>
<fence_daemon post_join_delay="30"/>
<rm log_level="5">
<resources>
<script file="/etc/init.d/iscsi" name="iscsi"/>
<script file="/etc/init.d/clvmd" name="clvmd"/>
<script file="/etc/init.d/gfs2" name="gfs2"/>
</resources>
<failoverdomains>
<failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node03.alteeve.com"/>
</failoverdomain>
<failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node04.alteeve.com"/>
</failoverdomain>
<failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node05.alteeve.com"/>
</failoverdomain>
<failoverdomain name="an6_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node06.alteeve.com"/>
</failoverdomain>
<failoverdomain name="an7_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node07.alteeve.com"/>
</failoverdomain>
<failoverdomain name="an3_primary" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node03.alteeve.com" priority="1"/>
<failoverdomainnode name="an-node04.alteeve.com" priority="2"/>
<failoverdomainnode name="an-node05.alteeve.com" priority="3"/>
<failoverdomainnode name="an-node06.alteeve.com" priority="4"/>
<failoverdomainnode name="an-node07.alteeve.com" priority="5"/>
</failoverdomain>
<failoverdomain name="an4_primary" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node03.alteeve.com" priority="5"/>
<failoverdomainnode name="an-node04.alteeve.com" priority="1"/>
<failoverdomainnode name="an-node05.alteeve.com" priority="2"/>
<failoverdomainnode name="an-node06.alteeve.com" priority="3"/>
<failoverdomainnode name="an-node07.alteeve.com" priority="4"/>
</failoverdomain>
</failoverdomains>
<service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an3_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<service autostart="1" domain="an6_only" exclusive="0" max_restarts="0" name="an6_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<service autostart="1" domain="an7_only" exclusive="0" max_restarts="0" name="an7_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<vm autostart="0" domain="an3_primary" exclusive="0" max_restarts="2" name="vm0001_labzilla" path="/shared01/definitions/" recovery="restart" restart_expire_time="600"/>
<vm autostart="0" domain="an4_primary" exclusive="0" max_restarts="2" name="vm0002_innovations" path="/shared01/definitions/" recovery="restart" restart_expire_time="600"/>
</rm>
</cluster>
</source>

Rejoin the cluster.

<source lang="bash">
/etc/init.d/cman start && /etc/init.d/rgmanager start
</source>


Repeat these configurations, altering for [[MAC]] and [[IP]] addresses as appropriate, for the other four VM cluster nodes.


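Once <span class="code">cman</span> and <span class="code">rgmanager</span> are back up on a node, it's worth confirming that it actually rejoined before moving on. This quick check is my addition here; the commands are the standard cluster tools already used in this tutorial:

<source lang="bash">
# Show membership and quorum state as this node sees it.
cman_tool status

# Show which nodes are members and where the rgmanager services are running.
clustat
</source>

If the node is listed as a member and its storage service shows as <span class="code">started</span>, it is safe to carry on.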


= Stuff =

Multi-VM after primary SAN (violent) ejection from cluster. Both VMs remained up!

[[Image:two_vms_on_two_by_seven_cluster_build_01.png|thumb|center|800px|Two VMs (windows and Linux) running on the SAN. Initial testing of survivability of primary SAN failure completed successfully!]]

First build of the 2x7 Cluster.

[[Image:first_ever_build.png|thumb|center|700px|First-ever successful build/install of the "Cluster Set" cluster configuration. Fully HA, fully home-brew on all open source software using only commodity hardware. Much tuning/testing to come!]]

== Benchmarks ==

GFS2 partition on <span class="code">an-node07</span>'s <span class="code">/shared01</span> partition. Test #1, no optimization:

<source lang="bash">
bonnie++ -d /shared01/ -s 8g -u root:root
</source>
<source lang="text">
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
an-node07.alteev 8G   388  95 22203   6 14875   8  2978  95 48406  10 107.3   5
Latency               312ms   44400ms   31355ms   41505us     540ms   11926ms
Version  1.96       ------Sequential Create------ --------Random Create--------
an-node07.alteeve.c -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1144  18 +++++ +++  8643  56   939  19 +++++ +++  8262  55
Latency               291ms     586us    2085us    3511ms      51us    3669us
1.96,1.96,an-node07.alteeve.ca,1,1312497509,8G,,388,95,22203,6,14875,8,2978,95,48406,10,107.3,5,16,,,,,1144,18,+++++,+++,8643,56,939,19,+++++,+++,8262,55,312ms,44400ms,31355ms,41505us,540ms,11926ms,291ms,586us,2085us,3511ms,51us,3669us
</source>

CentOS 5.6 x86_64 VM <span class="code">vm0001_labzilla</span>'s <span class="code">/root</span> directory. Test #1, no optimization. VM provisioned using the command in the section below.

<source lang="bash">
bonnie++ -d /root/ -s 8g -u root:root
</source>
<source lang="text">
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
labzilla-new.can 8G   674  98 15708   5 14875   7  1570  65 47806  10 119.1   7
Latency             66766us    7680ms    1588ms     187ms     269ms    1292ms
Version  1.96       ------Sequential Create------ --------Random Create--------
labzilla-new.candco -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 27666  39 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency             11360us    1904us     799us     290us      44us      41us
1.96,1.96,labzilla-new.candcoptical.com,1,1312522208,8G,,674,98,15708,5,14875,7,1570,65,47806,10,119.1,7,16,,,,,27666,39,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,66766us,7680ms,1588ms,187ms,269ms,1292ms,11360us,1904us,799us,290us,44us,41us
</source>

== Provision vm0001 ==

Created LV already, so:

<source lang="bash">
virt-install --connect qemu:///system \
  --name vm0001_labzilla \
  --ram 1024 \
  --arch x86_64 \
  --vcpus 2 \
  --cpuset 1-3 \
  --location http://192.168.1.254/c5/x86_64/img/ \
  --extra-args "ks=http://192.168.1.254/c5/x86_64/ks/labzilla_c5.ks" \
  --os-type linux \
  --os-variant rhel5.4 \
  --disk path=/dev/san_vg01/vm0001_hdd1 \
  --network bridge=vbr0 \
  --vnc
</source>

== Provision vm0002 ==

Created LV already, so:

<source lang="bash">
virt-install --connect qemu:///system \
  --name vm0002_innovations \
  --ram 1024 \
  --arch x86_64 \
  --vcpus 2 \
  --cpuset 1-3 \
  --cdrom /shared01/media/Win_Server_2008_Bis_x86_64.iso \
  --os-type windows \
  --os-variant win2k8 \
  --disk path=/dev/san_vg01/vm0002_hdd2 \
  --network bridge=vbr0 \
  --hvm \
  --vnc
</source>
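The <span class="code">&lt;vm&gt;</span> entries added to <span class="code">cluster.conf</span> below point at <span class="code">/shared01/definitions/</span>, so rgmanager will expect each guest's libvirt definition to live there. Exporting the definitions is my addition (not in the original notes); <span class="code">virsh dumpxml</span> is the stock libvirt way to do it:

<source lang="bash">
# Dump each guest's definition onto the shared GFS2 partition so that
# any node in the failover domain can start it.
virsh dumpxml vm0001_labzilla     > /shared01/definitions/vm0001_labzilla.xml
virsh dumpxml vm0002_innovations  > /shared01/definitions/vm0002_innovations.xml
</source>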


Update the <span class="code">cluster.conf</span> to add the VMs to the cluster.

<source lang="xml">
<?xml version="1.0"?>
<cluster config_version="12" name="an-clusterB">
<totem rrp_mode="none" secauth="off"/>
<clusternodes>
<clusternode name="an-node03.alteeve.ca" nodeid="3">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="3"/>
</method>
</fence>
</clusternode>
<clusternode name="an-node04.alteeve.ca" nodeid="4">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="4"/>
</method>
</fence>
</clusternode>
<clusternode name="an-node05.alteeve.ca" nodeid="5">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="5"/>
</method>
</fence>
</clusternode>
<clusternode name="an-node06.alteeve.ca" nodeid="6">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="6"/>
</method>
</fence>
</clusternode>
<clusternode name="an-node07.alteeve.ca" nodeid="7">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="7"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
</fencedevices>
<fence_daemon post_join_delay="30"/>
<rm log_level="5">
<resources>
<script file="/etc/init.d/iscsi" name="iscsi"/>
<script file="/etc/init.d/clvmd" name="clvmd"/>
<script file="/etc/init.d/gfs2" name="gfs2"/>
</resources>
<failoverdomains>
<failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node04.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node05.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an6_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node06.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an7_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node07.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an3_primary" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" priority="1"/>
<failoverdomainnode name="an-node04.alteeve.ca" priority="2"/>
<failoverdomainnode name="an-node05.alteeve.ca" priority="3"/>
<failoverdomainnode name="an-node06.alteeve.ca" priority="4"/>
<failoverdomainnode name="an-node07.alteeve.ca" priority="5"/>
</failoverdomain>
<failoverdomain name="an4_primary" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" priority="5"/>
<failoverdomainnode name="an-node04.alteeve.ca" priority="1"/>
<failoverdomainnode name="an-node05.alteeve.ca" priority="2"/>
<failoverdomainnode name="an-node06.alteeve.ca" priority="3"/>
<failoverdomainnode name="an-node07.alteeve.ca" priority="4"/>
</failoverdomain>
</failoverdomains>
<service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an3_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<service autostart="1" domain="an6_only" exclusive="0" max_restarts="0" name="an6_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<service autostart="1" domain="an7_only" exclusive="0" max_restarts="0" name="an7_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<vm autostart="0" domain="an3_primary" exclusive="0" max_restarts="2" name="vm0001_labzilla" path="/shared01/definitions/" recovery="restart" restart_expire_time="600"/>
<vm autostart="0" domain="an4_primary" exclusive="0" max_restarts="2" name="vm0002_innovations" path="/shared01/definitions/" recovery="restart" restart_expire_time="600"/>
</rm>
</cluster>
</source>
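With the new <span class="code">config_version="12"</span> file in place, it has to be validated and pushed out to the other nodes before rgmanager will manage the VMs. The steps below are my sketch using the stock RHCS tools; they are not part of the original notes, and the member names in the <span class="code">clusvcadm</span> calls simply follow the failover domains defined above:

<source lang="bash">
# Check the updated cluster.conf against the schema.
ccs_config_validate

# Tell cman to read and distribute the new configuration version.
cman_tool version -r

# Enable the new VM services once the configuration is active.
clusvcadm -e vm:vm0001_labzilla -m an-node03.alteeve.ca
clusvcadm -e vm:vm0002_innovations -m an-node04.alteeve.ca
</source>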
== Bonding and Trunking ==

The goal here is to take the network out as a single point of failure.

The design is to use two stacked switches, bonded connections in the nodes with each leg of the bond cabled through either switch. While both are up, the aggregate bandwidth will be achieved using trunking in the switch and the appropriate bond driver configuration. The recovery from failure will need to be configured in such a way that it is faster than the cluster's token loss timeout multiplied by the token retransmit loss count.

This tutorial uses 2x [http://dlink.ca/products/?pid=DGS-3100-24&tab=3 D-Link DGS-3100-24] switches. This is not to endorse these switches, per se, but they are relatively affordable, decent-quality switches for those who'd like to replicate this setup.

=== Configure The Stack ===

First, stack the switches using a ring topology (both HDMI connectors/cables used). If both switches are brand new, simply cable them together and the switches will auto-negotiate the stack configuration. If you are adding a new switch, then power on the existing switch, cable up the second switch and then power on the second switch. After a short time, its stack ID should increment and you should see the new switch appear in the existing switch's interface.

=== Configuring the Bonding Drivers ===

This tutorial uses four interfaces joined into two bonds of two NICs, like so:

<source lang="text">
# Internet Facing Network:
eth0 + eth1 == bond0
</source>
<source lang="text">
# Storage and Cluster Communications:
eth2 + eth3 == bond1
</source>

This requires a few steps.

* Create <span class="code">/etc/modprobe.d/bonding.conf</span> and add an entry for the two bonding channels we will create.

''Note'': My <span class="code">eth0</span> device is an onboard controller with a maximum [[MTU]] of 7200 [[bytes]]. This means that the whole bond is restricted to this MTU.

<source lang="bash">
vim /etc/modprobe.d/bonding.conf
</source>
<source lang="text">
alias bond0 bonding
alias bond1 bonding
</source>

* Create the <span class="code">ifcfg-bondX</span> configuration files.

Internet Facing configuration:

<source lang="bash">
touch /etc/sysconfig/network-scripts/ifcfg-bond{0,1}
vim /etc/sysconfig/network-scripts/ifcfg-eth{0,1} /etc/sysconfig/network-scripts/ifcfg-bond0
cat /etc/sysconfig/network-scripts/ifcfg-eth0
</source>
<source lang="bash">
# Internet Facing Network - Link 1
HWADDR="BC:AE:C5:44:8A:DE"
DEVICE="eth0"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER="bond0"
SLAVE="yes"
MTU="7200"
</source>

<source lang="bash">
cat /etc/sysconfig/network-scripts/ifcfg-eth1
</source>
<source lang="bash">
# Internet Facing Network - Link 2
HWADDR="00:1B:21:72:96:E8"
DEVICE="eth1"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER="bond0"
SLAVE="yes"
MTU="7200"
</source>

<source lang="bash">
cat /etc/sysconfig/network-scripts/ifcfg-bond0
</source>
<source lang="bash">
# Internet Facing Network - Bonded Interface
DEVICE="bond0"
BOOTPROTO="static"
NM_CONTROLLED="no"
ONBOOT="yes"
IPADDR="192.168.1.73"
NETMASK="255.255.255.0"
GATEWAY="192.168.1.254"
DNS1="192.139.81.117"
DNS2="192.139.81.1"
BONDING_OPTS="miimon=1000 mode=0"
MTU="7200"
</source>

Merged Storage Network and Back Channel Network configuration.

''Note'': The interfaces in this bond all support a maximum [[MTU]] of 9000 [[bytes]].

<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-eth{2,3} /etc/sysconfig/network-scripts/ifcfg-bond1
cat /etc/sysconfig/network-scripts/ifcfg-eth2
</source>
<source lang="bash">
# Storage and Back Channel Networks - Link 1
HWADDR="00:1B:21:72:9B:56"
DEVICE="eth2"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER="bond1"
SLAVE="yes"
MTU="9000"
</source>


<source lang="bash">
cat /etc/sysconfig/network-scripts/ifcfg-eth3
</source>
<source lang="bash">
# Storage and Back Channel Networks - Link 2
HWADDR="00:0E:0C:59:46:E4"
DEVICE="eth3"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER="bond1"
SLAVE="yes"
MTU="9000"
</source>

<source lang="bash">
cat /etc/sysconfig/network-scripts/ifcfg-bond1
</source>
<source lang="bash">
# Storage and Back Channel Networks - Bonded Interface
DEVICE="bond1"
BOOTPROTO="static"
NM_CONTROLLED="no"
ONBOOT="yes"
IPADDR="192.168.3.73"
NETMASK="255.255.255.0"
BONDING_OPTS="miimon=1000 mode=0"
MTU="9000"
</source>

Restart networking.

{{note|1=I've noticed that this can error out and fail to start slaved devices at times when using <span class="code">/etc/init.d/network restart</span>. If you have any trouble, you may need to completely stop all networking, then start it back up. This, of course, requires network-less access to the node's console (direct access, [[iKVM]], console redirection, etc).}}

Some of the errors we will see below are because the network interface configuration changed while the interfaces were still up. To avoid this, if you have network-less access to the nodes, stop the network interfaces before you begin editing.

<source lang="bash">
/etc/init.d/network restart
</source>
<source lang="text">
Shutting down interface eth0:                              [  OK  ]
Shutting down interface eth1:                              [  OK  ]
Shutting down interface eth2:  /etc/sysconfig/network-scripts/ifdown-eth: line 99: /sys/class/net/bond1/bonding/slaves: No such file or directory
                                                           [  OK  ]
Shutting down interface eth3:  /etc/sysconfig/network-scripts/ifdown-eth: line 99: /sys/class/net/bond1/bonding/slaves: No such file or directory
                                                           [  OK  ]
Shutting down loopback interface:                          [  OK  ]
Bringing up loopback interface:                            [  OK  ]
Bringing up interface bond0:  RTNETLINK answers: File exists
Error adding address 192.168.1.73 for bond0.
RTNETLINK answers: File exists
                                                           [  OK  ]
Bringing up interface bond1:                               [  OK  ]
</source>
Confirm that we've got our new bonded interfaces:

<source lang="bash">
ifconfig
</source>
<source lang="text">
bond0     Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE
          inet addr:192.168.1.73  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::beae:c5ff:fe44:8ade/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:7200  Metric:1
          RX packets:1021 errors:0 dropped:0 overruns:0 frame:0
          TX packets:502 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:128516 (125.5 KiB)  TX bytes:95092 (92.8 KiB)

bond1     Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56
          inet addr:192.168.3.73  Bcast:192.168.3.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe72:9b56/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:9000  Metric:1
          RX packets:787028 errors:0 dropped:0 overruns:0 frame:0
          TX packets:788651 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:65753950 (62.7 MiB)  TX bytes:1194295932 (1.1 GiB)

eth0      Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:7200  Metric:1
          RX packets:535 errors:0 dropped:0 overruns:0 frame:0
          TX packets:261 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:66786 (65.2 KiB)  TX bytes:47749 (46.6 KiB)
          Interrupt:31 Base address:0x8000

eth1      Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:7200  Metric:1
          RX packets:486 errors:0 dropped:0 overruns:0 frame:0
          TX packets:241 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:61730 (60.2 KiB)  TX bytes:47343 (46.2 KiB)
          Interrupt:18 Memory:fe8e0000-fe900000

eth2      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:9000  Metric:1
          RX packets:360190 errors:0 dropped:0 overruns:0 frame:0
          TX packets:394844 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:28756400 (27.4 MiB)  TX bytes:598159146 (570.4 MiB)
          Interrupt:17 Memory:fe9e0000-fea00000

eth3      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:9000  Metric:1
          RX packets:426838 errors:0 dropped:0 overruns:0 frame:0
          TX packets:393807 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:36997550 (35.2 MiB)  TX bytes:596136786 (568.5 MiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
</source>
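<span class="code">ifconfig</span> only shows that the bonds exist; the bonding driver's proc files show which slave is active, and a do-not-fragment ping confirms that jumbo frames survive end to end. This check is my addition (the 8972-byte payload is the 9000-byte MTU minus 28 bytes of IP/ICMP headers, and <span class="code">192.168.3.74</span> is the peer node's address on this subnet in this tutorial's numbering):

<source lang="bash">
# Show the bond mode, link state and slave details for the SN/BCN bond.
cat /proc/net/bonding/bond1

# Confirm jumbo frames pass without fragmentation to the peer node.
ping -c 3 -M do -s 8972 192.168.3.74
</source>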
== Configuring Trunking ==

By default, we will not get aggregated bandwidth across the stacked switches because routing is done based on [[MAC]] addresses. This means that when the switch sees a packet destined for a given MAC address, all traffic is routed down the one corresponding port. If you look at the MAC address assigned to <span class="code">bond0</span>, you will see that it matches the MAC address of <span class="code">eth0</span>. Thus, all traffic destined for that bonded interface will be pushed down <span class="code">eth0</span>'s port on the switch, despite the bond now also being able to receive data on the <span class="code">eth1</span> interface.

To tell the switch to use both ports, we need to [[trunk]] them. This tells the switch to push data down both ports.

Before we do that though, let's look at how we will verify the current link speed using <span class="code">[http://rpm.pbone.net/index.php3?stat=3&search=iperf&Search.x=33&Search.y=6&simple=2&dist%5B%5D=74&dist%5B%5D=0&dl=40&sr=1&field%5B%5D=1&field%5B%5D=2&srodzaj=1 iperf]</span> (local copy of [https://alteeve.com/files/iperf-2.0.5-1.el6.x86_64.rpm iperf-2.0.5-1.el6.x86_64.rpm]).

<source lang="bash">
rpm -Uvh --nopgp https://alteeve.com/files/iperf-2.0.5-1.el6.x86_64.rpm
</source>
<source lang="text">
Retrieving https://alteeve.com/files/iperf-2.0.5-1.el6.x86_64.rpm
warning: /var/tmp/rpm-tmp.aAqbEm: Header V3 RSA/SHA256 Signature, key ID 0608b895: NOKEY
Preparing...                ########################################### [100%]
   1:iperf                  ########################################### [100%]
</source>

Now, on one node, you need to tell <span class="code">iperf</span> to act as a server, then run your test from the second node.

Set ''<span class="code">an-node04</span>'' to listen (be the server):

<source lang="bash">
iperf --server --bind an-node04.bcn
</source>
<source lang="text">
------------------------------------------------------------
Server listening on TCP port 5001
Binding to local address an-node04.bcn
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
</source>

Run the test from ''<span class="code">an-node03</span>'' (be the client):
<source lang="bash">
iperf --client an-node04.bcn
</source>

=== LACP Bonding/Trunking ===

<source lang="bash">
# Storage and Back Channel Networks - Bonded Interface
IPADDR="192.168.3.73"
NETMASK="255.255.255.0"
#MTU="7500"

# Bonding modes/options: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/sec-Using_Channel_Bonding.html

# Balanced Round-Robin. 'arp_ip_target' can have up to 16 IPs, comma-separated.
#BONDING_OPTS="mode=0 miimon=100"
#BONDING_OPTS="mode=0 arp_interval=100 arp_ip_target=192.168.3.74"

# Active Backup
#BONDING_OPTS="mode=1 miimon=100 primary=eth1 primary_reselect=1"
#BONDING_OPTS="mode=1 arp_interval=100 arp_ip_target=192.168.3.74 primary=eth1 primary_reselect=1"

# Balanced XOR. Policies: 0 = layer 2, 1 = layer 3+4, 2 = layer 2+3
#BONDING_OPTS="mode=2 miimon=100 xmit_hash_policy=0"

# Broadcast
#BONDING_OPTS="mode=3 miimon=100"

# 802.3ad (aka: LACP mode)
BONDING_OPTS="mode=4 miimon=100 lacp_rate=1 xmit_hash_policy=2"

# Balance TLB (Transmit Load Balance)
#BONDING_OPTS="mode=5 miimon=100"

# Balance ALB (Active Load Balance)
#BONDING_OPTS="mode=6 miimon=100"
</source>

=== D-Link DGS-3100-24 Configuration ===

Before *touching* the switch, be sure that you are in a window where downtime is allowed. The chance of partitioning the cluster caused by the switch reconfiguring is very high! If you have a two node cluster, I highly recommend moving all services to one node and then stopping the cluster on the backup node. This way, should the network fail, there will be no need for fencing and the services will deal with minimal interruptions.
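Once the ports are trunked on the switch, a single iperf stream will still tend to ride one leg of the bond, since the hashing is per-flow. Running several streams in parallel gives a rough idea of whether the aggregate bandwidth is actually available. This example is my addition, reusing the listener started on <span class="code">an-node04</span> above:

<source lang="bash">
# Four parallel streams for 30 seconds against the iperf server on an-node04.
iperf --client an-node04.bcn --parallel 4 --time 30
</source>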
== Configuring High-Availability Networking ==

There are [http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/sec-Using_Channel_Bonding.html seven bonding] modes, which you can read about in [http://www.kernel.org/doc/Documentation/networking/bonding.txt detail here]. However, the [[RHCS]] stack only supports one of the modes, called [[Active/Passive Bonding]], also known as <span class="code">mode=1</span>. This method provides no performance gain, but instead treats the slaved interfaces as independent paths: one acts as the primary while the other sits dormant. On failure, the bond switches to the backup interface, promoting it to the primary interface. Which interface is normally primary, and under what conditions a restored link returns to the primary role, are configurable should you wish to do so.

=== Configuring Your Switches ===

Your managed switch will no doubt have one or more bonding, also known as [[trunking]], configuration options. Likewise, your switches may be stackable. It is strongly advised that you do *not* stack your switches. Unstacked, it is obviously not possible to configure trunking; should you decide to disregard this, be very sure to extensively test failure and recovery of both switches under real-world workloads.

Still on the topic of switches: do not configure [[STP]] (spanning tree protocol) on any port connected to your cluster nodes! When a switch is added to the network, as is the case after restoring a lost switch, STP-enabled switches and the ports on those switches may block traffic for a period of time while STP renegotiates and reconfigures. This takes more than enough time to cause a cluster to partition. You may still enable and configure STP if you need to do so; simply ensure that you only do so on the appropriate ports.

With the switches unstacked and STP disabled, we can now configure the bonding interfaces.

=== Preparing The Bonding Driver ===

Before we modify the network, we will need to create the following file:
<source lang="bash">
vim /etc/modprobe.d/bonding.conf
</source>
<source lang="text">
alias bond0 bonding
alias bond1 bonding
alias bond2 bonding
</source>
If you only have four interfaces and plan to merge the [[SN]] and [[BCN]] networks, you can omit the <span class="code">bond2</span> entry.
You can then copy and paste the <span class="code">alias ...</span> entries from the file above into the terminal to avoid the need to reboot.
=== Deciding Which NICs to Bond ===
If all of the interfaces in your server are identical, you can probably skip this step. Before you jump though, consider that not all of the [[PCIe]] interfaces may have all of their lanes connected, resulting in differing speeds. If you are unsure, I strongly recommend you run these tests.
TODO: Upload <span class="code">network_profiler.pl</span> here and explain its use.
Before we do that though, let's look at how we will verify the current link speed using <span class="code">[http://rpm.pbone.net/index.php3?stat=3&search=iperf&Search.x=33&Search.y=6&simple=2&dist%5B%5D=74&dist%5B%5D=0&dl=40&sr=1&field%5B%5D=1&field%5B%5D=2&srodzaj=1 iperf]</span> (local copy of [https://alteeve.ca/files/iperf-2.0.5-1.el6.x86_64.rpm iperf-2.0.5-1.el6.x86_64.rpm]).
Once you've determined the various capabilities of your interfaces, pair them off with their closest-performing partners (a quick link-speed check is sketched after the list below).
Keep in mind:
* Any interface piggy-backing on an IPMI interface *must* be part of the [[BCN]] bond!
* The fastest interfaces should be paired for the [[SN]] bond.
* The lowest latency interfaces should be used for the [[BCN]] bond.
* The lowest remaining two interfaces should be used in the [[IFN]] bond.
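A quick way to compare interfaces before pairing them off, while the profiling script above is still a TODO, is to check each NIC's negotiated speed and link state. This is only a minimal sketch of mine, not the promised <span class="code">network_profiler.pl</span>:

<source lang="bash">
# Report negotiated speed and link state for each interface.
for nic in eth0 eth1 eth2 eth3; do
    echo "== ${nic} =="
    ethtool ${nic} | grep -E 'Speed|Link detected'
done
</source>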
=== Creating the Bonds ===
{{warning|1=This step will almost certainly leave you without network access to your servers. It is *strongly* advised that you do the next steps when you have physical access to your servers. If that is simply not possible, then proceed with extreme caution.}}
In my case, I found the following bonding configuration to be optimal:
* <span class="code">eth0</span> and <span class="code">eth3</span> bonded as <span class="code">bond0</span>.
* <span class="code">eth1</span> and <span class="code">eth2</span> bonded as <span class="code">bond1</span>.
I did not have enough interfaces for three bonds, so I will configure the following:
* <span class="code">bond0</span> will be the [[IFN]] interface on the <span class="code">192.168.1.0/24</span> subnet.
* <span class="code">bond1</span> will be the merged [[BCN]] and [[SN]] interfaces on the <span class="code">192.168.3.0/24</span> subnet.
TODO: Create/show the <span class="code">diff</span>s for the following <span class="code">ifcfg-ethX</span> files.
* Create <span class="code">bond0</span> out of <span class="code">eth0</span> and <span class="code">eth3</span>:
<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-eth0
</source>
<source lang="bash">
# Internet Facing Network - Link 1
HWADDR="BC:AE:C5:44:8A:DE"
DEVICE="eth0"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="no"
MASTER="bond0"
SLAVE="yes"
</source>
<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-eth3
</source>
<source lang="bash">
# Storage and Back Channel Networks - Link 2
HWADDR="00:0E:0C:59:46:E4"
DEVICE="eth3"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="no"
MASTER="bond0"
SLAVE="yes"
</source>
<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-bond0
</source>
<source lang="bash">
# Internet Facing Network - Bonded Interface
DEVICE="bond0"
BOOTPROTO="static"
NM_CONTROLLED="no"
ONBOOT="yes"
IPADDR="192.168.1.73"
NETMASK="255.255.255.0"
GATEWAY="192.168.1.254"
DNS1="192.139.81.117"
DNS2="192.139.81.1"
# Clustering *only* supports mode=1 (active-passive)
BONDING_OPTS="mode=1 miimon=100 use_carrier=0 updelay=0 downdelay=0"
</source>
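The matching <span class="code">bond1</span> interface (the merged [[BCN]] and [[SN]] link on the <span class="code">192.168.3.0/24</span> subnet) follows the same pattern. The original notes stop before showing it, so the file below is only my sketch, reusing the address, MTU and <span class="code">mode=1</span> options already established above:

<source lang="bash">
vim /etc/sysconfig/network-scripts/ifcfg-bond1
</source>
<source lang="bash">
# Storage and Back Channel Networks - Bonded Interface (hypothetical sketch)
DEVICE="bond1"
BOOTPROTO="static"
NM_CONTROLLED="no"
ONBOOT="yes"
IPADDR="192.168.3.73"
NETMASK="255.255.255.0"
MTU="9000"
# Clustering *only* supports mode=1 (active-passive)
BONDING_OPTS="mode=1 miimon=100 use_carrier=0 updelay=0 downdelay=0"
</source>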



Long View

Note: Yes, this is a big graphic, but this is also a big project. I am no artist though, and any help making this clearer is greatly appreciated!
The planned network. This shows separate IPMI and full redundancy through-out the cluster. This is the way a production cluster should be built, but is not expected for dev/test clusters.

Failure Mapping

VM Cluster; Guest VM failure migration planning;

  • Each node can host 5 VMs @ 2GB/VM.
  • This is an N-1 cluster with five nodes; 20 VMs total.
          |    All    | an-node03 | an-node04 | an-node05 | an-node06 | an-node07 |
          | on-line   |   down    |   down    |   down    |   down    |   down    |
----------+-----------+-----------+-----------+-----------+-----------+-----------+
an-node03 |   vm01    |    --     |   vm01    |   vm01    |   vm01    |   vm01    |
          |   vm02    |    --     |   vm02    |   vm02    |   vm02    |   vm02    |
          |   vm03    |    --     |   vm03    |   vm03    |   vm03    |   vm03    |
          |   vm04    |    --     |   vm04    |   vm04    |   vm04    |   vm04    |
          |    --     |    --     |   vm05    |   vm09    |   vm13    |   vm17    |
----------|-----------|-----------|-----------|-----------|-----------|-----------|
an-node04 |   vm05    |   vm05    |    --     |   vm05    |   vm05    |   vm05    |
          |   vm06    |   vm06    |    --     |   vm06    |   vm06    |   vm06    |
          |   vm07    |   vm07    |    --     |   vm07    |   vm07    |   vm07    |
          |   vm08    |   vm08    |    --     |   vm08    |   vm08    |   vm08    |
          |    --     |   vm01    |    --     |   vm10    |   vm14    |   vm18    |
----------|-----------|-----------|-----------|-----------|-----------|-----------|
an-node05 |   vm09    |   vm09    |   vm09    |    --     |   vm09    |   vm09    |
          |   vm10    |   vm10    |   vm10    |    --     |   vm10    |   vm10    |
          |   vm11    |   vm11    |   vm11    |    --     |   vm11    |   vm11    |
          |   vm12    |   vm12    |   vm12    |    --     |   vm12    |   vm12    |
          |    --     |   vm02    |   vm06    |    --     |   vm15    |   vm19    |
----------|-----------|-----------|-----------|-----------|-----------|-----------|
an-node06 |   vm13    |   vm13    |   vm13    |   vm13    |    --     |   vm13    |
          |   vm14    |   vm14    |   vm14    |   vm14    |    --     |   vm14    |
          |   vm15    |   vm15    |   vm15    |   vm15    |    --     |   vm15    |
          |   vm16    |   vm16    |   vm16    |   vm16    |    --     |   vm16    |
          |    --     |   vm03    |   vm07    |   vm11    |    --     |   vm20    |
----------|-----------|-----------|-----------|-----------|-----------|-----------|
an-node07 |   vm17    |   vm17    |   vm17    |   vm17    |   vm17    |    --     |
          |   vm18    |   vm18    |   vm18    |   vm18    |   vm18    |    --     |
          |   vm19    |   vm19    |   vm19    |   vm19    |   vm19    |    --     |
          |   vm20    |   vm20    |   vm20    |   vm20    |   vm20    |    --     |
          |    --     |   vm04    |   vm08    |   vm12    |   vm16    |    --     |
----------+-----------+-----------+-----------+-----------+-----------+-----------+

Cluster Overview

Note: This is not programmatically accurate!

This is meant to show, at a logical level, how the parts of a cluster work together. It is the first draft and is likely defective in terrible ways.

[ Resource Management ]                                                                                                          
  ___________     ___________                                                                                                
 |           |   |           |                                                                                               
 | Service A |   | Service B |                                                                                               
 |___________|   |___________|                                                                                               
            |     |         |                                                                                                    
          __|_____|__    ___|_______________                                                                                     
         |           |  |                   |                                                                                    
         | RGManager |  | Clustered Storage |================================================.                                   
         |___________|  |___________________|                                                |                                   
               |                  |                                                          |                                   
               |__________________|______________                                            |                                 
                            |                    \                                           |                                 
         _________      ____|____                 |                                          |                                 
        |         |    |         |                |                                          |                                 
 /------| Fencing |----| Locking |                |                                          |                                 
 |      |_________|    |_________|                |                                          |                                 
_|___________|_____________|______________________|__________________________________________|_____
 |           |             |                      |                                          |                                  
 |     ______|_____    ____|___                   |                                          |                                  
 |    |            |  |        |                  |                                          |                                  
 |    | Membership |  | Quorum |                  |                                          |                                  
 |    |____________|  |________|                  |                                          |                                  
 |           |____________|                       |                                          |                                  
 |                      __|__                     |                                          |                                  
 |                     /     \                    |                                          |                                  
 |                    { Totem }                   |                                          |                                  
 |                     \_____/                    |                                          |                                  
 |      __________________|_______________________|_______________ ______________            |                                    
 |     |-----------|-----------|----------------|-----------------|--------------|           |                                    
 |  ___|____    ___|____    ___|____         ___|____        _____|_____    _____|_____    __|___                                 
 | |        |  |        |  |        |       |        |      |           |  |           |  |      |                                
 | | Node 1 |  | Node 2 |  | Node 3 |  ...  | Node N |      | Storage 1 |==| Storage 2 |==| DRBD |                                
 | |________|  |________|  |________|       |________|      |___________|  |___________|  |______|                                
 \_____|___________|___________|________________|_________________|______________|                                                
                                                                                                                                 
[ Cluster Communication ]

Network IPs

SAN: 10.10.1.1

Node:
          | IFN         | SN         | BCN       | IPMI      |
----------+-------------+------------+-----------+-----------+
an-node01 | 10.255.0.1  | 10.10.0.1  | 10.20.0.1 | 10.20.1.1 |                                                 
an-node02 | 10.255.0.2  | 10.10.0.2  | 10.20.0.2 | 10.20.1.2 |                                                 
an-node03 | 10.255.0.3  | 10.10.0.3  | 10.20.0.3 | 10.20.1.3 |                                                 
an-node04 | 10.255.0.4  | 10.10.0.4  | 10.20.0.4 | 10.20.1.4 |                                                 
an-node05 | 10.255.0.5  | 10.10.0.5  | 10.20.0.5 | 10.20.1.5 |                                                 
an-node06 | 10.255.0.6  | 10.10.0.6  | 10.20.0.6 | 10.20.1.6 |                                                 
an-node07 | 10.255.0.7  | 10.10.0.7  | 10.20.0.7 | 10.20.1.7 |                                                 
----------+-------------+------------+-----------+-----------+

Aux Equipment:
          | BCN         |
----------+-------------+
pdu1      | 10.20.2.1   |                                                                                                  
pdu2      | 10.20.2.2   |                                                                                                  
switch1   | 10.20.2.3   |                                                                                                  
switch2   | 10.20.2.4   |                                                                                                  
ups1      | 10.20.2.5   |                                                                                         
ups2      | 10.20.2.6   |                                                                                         
----------+-------------+
                                                                                                                  
VMs:                                                                                                              
          | VMN         |                                                                                         
----------+-------------+
vm01      | 10.254.0.1  |                                                                                         
vm02      | 10.254.0.2  |                                                                                         
vm03      | 10.254.0.3  |                                                                                         
vm04      | 10.254.0.4  |                                                                                         
vm05      | 10.254.0.5  |                                                                                         
vm06      | 10.254.0.6  |                                                                                         
vm07      | 10.254.0.7  |                                                                                         
vm08      | 10.254.0.8  |                                                                                         
vm09      | 10.254.0.9  |                                                                                         
vm10      | 10.254.0.10 |                                                                                         
vm11      | 10.254.0.11 |                                                                                         
vm12      | 10.254.0.12 |                                                                                         
vm13      | 10.254.0.13 |                                                                                         
vm14      | 10.254.0.14 |                                                                                         
vm15      | 10.254.0.15 |                                                                                         
vm16      | 10.254.0.16 |                                                                                         
vm17      | 10.254.0.17 |                                                                                         
vm18      | 10.254.0.18 |                                                                                         
vm19      | 10.254.0.19 |                                                                                         
vm20      | 10.254.0.20 |                                                                                         
----------+-------------+

Install The Cluster Software

If you are using Red Hat Enterprise Linux, you will need to add the "RHEL Server Optional (v. 6 64-bit x86_64)" channel for each node in your cluster. You can do this in RHN by going to your subscription management page, clicking on each server, clicking on "Alter Channel Subscriptions", enabling the "RHEL Server Optional (v. 6 64-bit x86_64)" channel and then clicking on "Change Subscription".
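
If you prefer the command line to the RHN web interface, the rhn-channel tool (from the rhn-setup package) can do the same thing. This is only a sketch; the channel label below is what the RHEL 6 x86_64 optional channel is normally called, so confirm it against what rhn-channel lists before adding it.

# List the channels this machine is currently subscribed to.
rhn-channel --list
# Add the optional channel (label assumed; rhn-channel will prompt for your RHN credentials).
rhn-channel --add --channel=rhel-x86_64-server-optional-6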

The actual installation is simple; just use yum to install cman and the rest of the cluster, storage and virtualization packages.

yum install cman fence-agents rgmanager resource-agents lvm2-cluster gfs2-utils python-virtinst libvirt qemu-kvm-tools qemu-kvm virt-manager virt-viewer virtio-win

Initial Config

Everything uses ricci, which needs to have a password set. I set it to match the root password.

Both:

passwd ricci
New password: 
Retype new password: 
passwd: all authentication tokens updated successfully.
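
The ricci daemon also has to be running (and set to start with the system) so that cman_tool version -r can push configuration updates out later. A quick sketch, run on both nodes:

chkconfig ricci on
/etc/init.d/ricci start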

With these decisions and the information gathered, here is what our first /etc/cluster/cluster.conf file will look like.

touch /etc/cluster/cluster.conf
vim /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="1" name="an-cluster">
	<cman expected_votes="1" two_node="1" />
	<clusternodes>
		<clusternode name="an-node01.alteeve.ca" nodeid="1">
			<fence>
				<method name="ipmi">
					<device action="reboot" name="ipmi_an01" />
				</method>
				<method name="pdu">
					<device action="reboot" name="pdu1" port="1" />
					<device action="reboot" name="pdu2" port="1" />
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node02.alteeve.ca" nodeid="2">
			<fence>
				<method name="ipmi">
					<device action="reboot" name="ipmi_an02" />
				</method>
				<method name="pdu">
					<device action="reboot" name="pdu1" port="2" />
					<device action="reboot" name="pdu2" port="2" />
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node03.alteeve.ca" nodeid="3">
			<fence>
				<method name="ipmi">
					<device action="reboot" name="ipmi_an03" />
				</method>
				<method name="pdu">
					<device action="reboot" name="pdu1" port="3" />
					<device action="reboot" name="pdu2" port="3" />
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node04.alteeve.ca" nodeid="4">
			<fence>
				<method name="ipmi">
					<device action="reboot" name="ipmi_an04" />
				</method>
				<method name="pdu">
					<device action="reboot" name="pdu1" port="4" />
					<device action="reboot" name="pdu2" port="4" />
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node05.alteeve.ca" nodeid="5">
			<fence>
				<method name="ipmi">
					<device action="reboot" name="ipmi_an05" />
				</method>
				<method name="pdu">
					<device action="reboot" name="pdu1" port="5" />
					<device action="reboot" name="pdu2" port="5" />
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node06.alteeve.ca" nodeid="6">
			<fence>
				<method name="ipmi">
					<device action="reboot" name="ipmi_an06" />
				</method>
				<method name="pdu">
					<device action="reboot" name="pdu1" port="6" />
					<device action="reboot" name="pdu2" port="6" />
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node07.alteeve.ca" nodeid="7">
			<fence>
				<method name="ipmi">
					<device action="reboot" name="ipmi_an07" />
				</method>
				<method name="pdu">
					<device action="reboot" name="pdu1" port="7" />
					<device action="reboot" name="pdu2" port="7" />
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice agent="fence_ipmilan" ipaddr="an-node01.ipmi" login="root" name="ipmi_an01" passwd="secret" />
		<fencedevice agent="fence_ipmilan" ipaddr="an-node02.ipmi" login="root" name="ipmi_an02" passwd="secret" />
		<fencedevice agent="fence_ipmilan" ipaddr="an-node03.ipmi" login="root" name="ipmi_an03" passwd="secret" />
		<fencedevice agent="fence_ipmilan" ipaddr="an-node04.ipmi" login="root" name="ipmi_an04" passwd="secret" />
		<fencedevice agent="fence_ipmilan" ipaddr="an-node05.ipmi" login="root" name="ipmi_an05" passwd="secret" />
		<fencedevice agent="fence_ipmilan" ipaddr="an-node06.ipmi" login="root" name="ipmi_an06" passwd="secret" />
		<fencedevice agent="fence_ipmilan" ipaddr="an-node07.ipmi" login="root" name="ipmi_an07" passwd="secret" />
		<fencedevice agent="fence_apc_snmp" ipaddr="pdu1.alteeve.ca" name="pdu1" />
		<fencedevice agent="fence_apc_snmp" ipaddr="pdu2.alteeve.ca" name="pdu2" />
	</fencedevices>
	<fence_daemon post_join_delay="30" />
	<totem rrp_mode="none" secauth="off" />
	<rm>
		<resources>
			<ip address="10.10.1.1" monitor_link="on" />
			<script file="/etc/init.d/tgtd" name="tgtd" />
			<script file="/etc/init.d/drbd" name="drbd" />
			<script file="/etc/init.d/clvmd" name="clvmd" />
			<script file="/etc/init.d/gfs2" name="gfs2" />
			<script file="/etc/init.d/libvirtd" name="libvirtd" />
		</resources>
		<failoverdomains>
			<!-- Used for storage -->
			<!-- SAN Nodes -->
			<failoverdomain name="an1_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node01.alteeve.ca" />
			</failoverdomain>
			<failoverdomain name="an2_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node02.alteeve.ca" />
			</failoverdomain>
			
			<!-- VM Nodes -->
			<failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node03.alteeve.ca" />
			</failoverdomain>
			<failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node04.alteeve.ca" />
			</failoverdomain>
			<failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node05.alteeve.ca" />
			</failoverdomain>
			<failoverdomain name="an6_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node06.alteeve.ca" />
			</failoverdomain>
			<failoverdomain name="an7_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node07.alteeve.ca" />
			</failoverdomain>
			
			<!-- Domain for the SAN -->
			<failoverdomain name="an1_primary" nofailback="1" ordered="1" restricted="0">
				<failoverdomainnode name="an-node01.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node02.alteeve.ca" priority="2" />
			</failoverdomain>
			
			<!-- Domains for VMs running primarily on an-node03 -->
			<failoverdomain name="an3_an4" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node03.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node04.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an3_an5" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node03.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node05.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an3_an6" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node03.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node06.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an3_an7" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node03.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node07.alteeve.ca" priority="2" />
			</failoverdomain>
			
			<!-- Domains for VMs running primarily on an-node04 -->
			<failoverdomain name="an4_an3" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node04.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node03.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an4_an5" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node04.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node05.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an4_an6" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node04.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node06.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an4_an7" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node04.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node07.alteeve.ca" priority="2" />
			</failoverdomain>
			
			<!-- Domains for VMs running primarily on an-node05 -->
			<failoverdomain name="an5_an3" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node05.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node03.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an5_an4" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node05.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node04.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an5_an6" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node05.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node06.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an5_an7" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node05.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node07.alteeve.ca" priority="2" />
			</failoverdomain>
			
			<!-- Domains for VMs running primarily on an-node06 -->
			<failoverdomain name="an6_an3" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node06.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node03.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an6_an4" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node06.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node04.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an6_an5" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node06.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node05.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an6_an7" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node06.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node07.alteeve.ca" priority="2" />
			</failoverdomain>
			
			<!-- Domains for VMs running primarily on an-node07 -->
			<failoverdomain name="an7_an3" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node07.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node03.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an7_an4" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node07.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node04.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an7_an5" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node07.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node05.alteeve.ca" priority="2" />
			</failoverdomain>
			<failoverdomain name="an7_an6" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node07.alteeve.ca" priority="1" />
				<failoverdomainnode name="an-node06.alteeve.ca" priority="2" />
			</failoverdomain>
		</failoverdomains>
		
		<!-- SAN Services -->
		<service autostart="1" domain="an1_only" exclusive="0" max_restarts="0" name="an1_storage" recovery="restart" restart_expire_time="0">
			<script ref="drbd">
				<script ref="clvmd">
					<script ref="tgtd" />
				</script>
			</script>
		</service>
		<service autostart="1" domain="an2_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart" restart_expire_time="0">
			<script ref="drbd">
				<script ref="clvmd">
					<script ref="tgtd" />
				</script>
			</script>
		</service>
		<service autostart="1" domain="an1_primary" name="san_ip" recovery="relocate">
			<ip ref="10.10.1.1" />
		</service>
		
		<!-- VM Storage services. -->
		<service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an3_storage" recovery="restart" restart_expire_time="0">
			<script ref="drbd">
				<script ref="clvmd">
					<script ref="gfs2">
						<script ref="libvirtd" />
					</script>
				</script>
			</script>
		</service>
		<service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart" restart_expire_time="0">
			<script ref="drbd">
				<script ref="clvmd">
					<script ref="gfs2">
						<script ref="libvirtd" />
					</script>
				</script>
			</script>
		</service>
		<service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart" restart_expire_time="0">
			<script ref="drbd">
				<script ref="clvmd">
					<script ref="gfs2">
						<script ref="libvirtd" />
					</script>
				</script>
			</script>
		</service>
		<service autostart="1" domain="an6_only" exclusive="0" max_restarts="0" name="an6_storage" recovery="restart" restart_expire_time="0">
			<script ref="drbd">
				<script ref="clvmd">
					<script ref="gfs2">
						<script ref="libvirtd" />
					</script>
				</script>
			</script>
		</service>
		<service autostart="1" domain="an7_only" exclusive="0" max_restarts="0" name="an7_storage" recovery="restart" restart_expire_time="0">
			<script ref="drbd">
				<script ref="clvmd">
					<script ref="gfs2">
						<script ref="libvirtd" />
					</script>
				</script>
			</script>
		</service>
		
		<!-- VM Services -->
		<!-- VMs running primarily on an-node03 -->
		<vm name="vm01" domain="an3_an4" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm02" domain="an3_an5" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm03" domain="an3_an6" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm04" domain="an3_an7" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		
		<!-- VMs running primarily on an-node04 -->
		<vm name="vm05" domain="an4_an3" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm06" domain="an4_an5" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm07" domain="an4_an6" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm08" domain="an4_an7" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		
		<!-- VMs running primarily on an-node05 -->
		<vm name="vm09" domain="an5_an3" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm10" domain="an5_an4" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm11" domain="an5_an6" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm12" domain="an5_an7" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		
		<!-- VMs running primarily on an-node06 -->
		<vm name="vm13" domain="an6_an3" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm14" domain="an6_an4" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm15" domain="an6_an5" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm16" domain="an6_an7" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		
		<!-- VMs running primarily on an-node07 -->
		<vm name="vm17" domain="an7_an3" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm18" domain="an7_an4" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm19" domain="an7_an5" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm20" domain="an7_an6" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
	</rm>
</cluster>

Save the file, then validate it. If it fails, address the errors and try again.

ip addr list | grep <ip>
rg_test test /etc/cluster/cluster.conf
ccs_config_validate
Configuration validates

Push it to the other nodes. Only an-node02 is shown here; a loop covering all of the nodes follows.

rsync -av /etc/cluster/cluster.conf root@an-node02:/etc/cluster/
sending incremental file list
cluster.conf

sent 781 bytes  received 31 bytes  541.33 bytes/sec
total size is 701  speedup is 0.86
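
With seven nodes in this configuration, it is less error prone to push the file to all of the peers in one go. A sketch using the same rsync approach (hostnames taken from the table above):

for node in an-node0{2..7}; do rsync -av /etc/cluster/cluster.conf root@${node}:/etc/cluster/; done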

DO NOT PROCEED UNTIL YOUR cluster.conf FILE VALIDATES!

Unless you have it perfect, your cluster will fail.

Once it validates, proceed.

Starting The Cluster For The First Time

By default, if you start one node only and you've enabled the <cman two_node="1" expected_votes="1"/> option as we have done, the lone server will effectively gain quorum. It will try to connect to the cluster, but there won't be a cluster to connect to, so it will fence the other node after a timeout period. This timeout is 6 seconds by default.

For now, we will leave the default as it is. If you're interested in changing it though, the argument you are looking for is post_join_delay.

This behaviour means that we'll want to start both nodes well within six seconds of one another, lest the slower one be needlessly fenced.

Left off here

Note to help minimize dual-fences:

  • You can add FENCED_OPTS="-f 5" to /etc/sysconfig/cman on *one* node (iLO fence devices may need this); a sketch follows.
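
A minimal sketch of that change, on one node only ("5" is just an example delay in seconds):

# Make this node pause 5 seconds before fencing, so its peer wins a simultaneous fence race.
echo 'FENCED_OPTS="-f 5"' >> /etc/sysconfig/cman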

DRBD Config

Install from source:

Both:

# Obliterate peer - fence via cman
wget -c https://alteeve.ca/files/an-cluster/sbin/obliterate-peer.sh -O /sbin/obliterate-peer.sh
chmod a+x /sbin/obliterate-peer.sh
ls -lah /sbin/obliterate-peer.sh
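
Before running the build below, the usual toolchain and the headers for the running kernel need to be in place. The package list here is an assumption for a stock RHEL/CentOS 6 install; adjust as needed.

# Build prerequisites for the userland tools and the kernel module.
yum install gcc make flex kernel-devel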

# Download, compile and install DRBD
wget -c http://oss.linbit.com/drbd/8.3/drbd-8.3.11.tar.gz
tar -xvzf drbd-8.3.11.tar.gz
cd drbd-8.3.11
./configure \
   --prefix=/usr \
   --localstatedir=/var \
   --sysconfdir=/etc \
   --with-utils \
   --with-km \
   --with-udev \
   --with-pacemaker \
   --with-rgmanager \
   --with-bashcompletion
make
make install
chkconfig --add drbd
chkconfig drbd off

Configure

an-node01:

# Configure DRBD's global value.
cp /etc/drbd.d/global_common.conf /etc/drbd.d/global_common.conf.orig
vim /etc/drbd.d/global_common.conf
diff -u /etc/drbd.d/global_common.conf.orig /etc/drbd.d/global_common.conf
--- /etc/drbd.d/global_common.conf.orig	2011-08-01 21:58:46.000000000 -0400
+++ /etc/drbd.d/global_common.conf	2011-08-01 23:18:27.000000000 -0400
@@ -15,24 +15,35 @@
 		# out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
 		# before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
 		# after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
+		fence-peer		"/sbin/obliterate-peer.sh";
 	}
 
 	startup {
 		# wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
+		become-primary-on	both;
+		wfc-timeout		300;
+		degr-wfc-timeout	120;
 	}
 
 	disk {
 		# on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
 		# no-disk-drain no-md-flushes max-bio-bvecs
+		fencing			resource-and-stonith;
 	}
 
 	net {
 		# sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
 		# max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
 		# after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
+		allow-two-primaries;
+		after-sb-0pri		discard-zero-changes;
+		after-sb-1pri		discard-secondary;
+		after-sb-2pri		disconnect;
 	}
 
 	syncer {
 		# rate after al-extents use-rle cpu-mask verify-alg csums-alg
+		# This should be no more than 30% of the maximum sustainable write speed.
+		rate			20M;
 	}
 }
vim /etc/drbd.d/r0.res
resource r0 {
        device          /dev/drbd0;
        meta-disk       internal;
        on an-node01.alteeve.ca {
                address         192.168.2.71:7789;
                disk            /dev/sda5;
        }
        on an-node02.alteeve.ca {
                address         192.168.2.72:7789;
                disk            /dev/sda5;
        }
}
cp /etc/drbd.d/r0.res /etc/drbd.d/r1.res 
vim /etc/drbd.d/r1.res
resource r1 {
        device          /dev/drbd1;
        meta-disk       internal;
        on an-node01.alteeve.ca {
                address         192.168.2.71:7790;
                disk            /dev/sdb1;
        }
        on an-node02.alteeve.ca {
                address         192.168.2.72:7790;
                disk            /dev/sdb1;
        }
}
Note: If you have multiple DRBD resources on one (set of) backing disk(s), consider adding a syncer { after <minor>; } option, where <minor> is the minor number of the resource that should finish first. For example, tell /dev/drbd1 to wait for /dev/drbd0 by adding syncer { after 0; }. This prevents simultaneous resyncs, which could seriously impact performance; the later resource simply waits until the named resource has finished syncing. A fragment showing this follows.
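
As a hypothetical fragment (not applied to the files above), r1.res would gain one extra option telling it to wait for resource minor 0:

resource r1 {
        # device, meta-disk and the two "on" sections stay as shown above.
        # Do not start resyncing until /dev/drbd0 (minor 0) has finished.
        syncer { after 0; }
}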

Validate:

drbdadm dump
  --==  Thank you for participating in the global usage survey  ==--
The server's response is:

you are the 369th user to install this version
# /usr/etc/drbd.conf
common {
    protocol               C;
    net {
        allow-two-primaries;
        after-sb-0pri    discard-zero-changes;
        after-sb-1pri    discard-secondary;
        after-sb-2pri    disconnect;
    }
    disk {
        fencing          resource-and-stonith;
    }
    syncer {
        rate             20M;
    }
    startup {
        wfc-timeout      300;
        degr-wfc-timeout 120;
        become-primary-on both;
    }
    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error   "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        fence-peer       /sbin/obliterate-peer.sh;
    }
}

# resource r0 on an-node01.alteeve.ca: not ignored, not stacked
resource r0 {
    on an-node01.alteeve.ca {
        device           /dev/drbd0 minor 0;
        disk             /dev/sda5;
        address          ipv4 192.168.2.71:7789;
        meta-disk        internal;
    }
    on an-node02.alteeve.ca {
        device           /dev/drbd0 minor 0;
        disk             /dev/sda5;
        address          ipv4 192.168.2.72:7789;
        meta-disk        internal;
    }
}

# resource r1 on an-node01.alteeve.ca: not ignored, not stacked
resource r1 {
    on an-node01.alteeve.ca {
        device           /dev/drbd1 minor 1;
        disk             /dev/sdb1;
        address          ipv4 192.168.2.71:7790;
        meta-disk        internal;
    }
    on an-node02.alteeve.ca {
        device           /dev/drbd1 minor 1;
        disk             /dev/sdb1;
        address          ipv4 192.168.2.72:7790;
        meta-disk        internal;
    }
}
rsync -av /etc/drbd.d root@an-node02:/etc/
drbd.d/
drbd.d/global_common.conf
drbd.d/global_common.conf.orig
drbd.d/r0.res
drbd.d/r1.res

sent 3523 bytes  received 110 bytes  7266.00 bytes/sec
total size is 3926  speedup is 1.08

Initialize and First start

Both:

Create the meta-data.

modprobe drbd
drbdadm create-md r{0,1}
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success

Attach, connect and confirm (after both have attached and connected):

drbdadm attach r{0,1}
drbdadm connect r{0,1}
cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:441969960
 1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:29309628

There is no data, so force both devices to be instantly UpToDate:

drbdadm -- --clear-bitmap new-current-uuid r{0,1}
cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
 0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Set both to primary and run a final check.

drbdadm primary r{0,1}
cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:672 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:672 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Update the cluster

vim /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="17" name="an-clusterA">
        <cman expected_votes="1" two_node="1"/>
        <totem rrp_mode="none" secauth="off"/>
        <clusternodes>
                <clusternode name="an-node01.alteeve.ca" nodeid="1">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node02.alteeve.ca" nodeid="2">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="2"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
        </fencedevices>
        <fence_daemon post_join_delay="30"/>
        <rm>
                <resources>
                        <ip address="192.168.2.100" monitor_link="on"/>
                        <script file="/etc/init.d/drbd" name="drbd"/>
                        <script file="/etc/init.d/clvmd" name="clvmd"/>
                        <script file="/etc/init.d/tgtd" name="tgtd"/>
                </resources>
                <failoverdomains>
                        <failoverdomain name="an1_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node01.alteeve.ca"/>
                        </failoverdomain>
                        <failoverdomain name="an2_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node02.alteeve.ca"/>
                        </failoverdomain>
                        <failoverdomain name="an1_primary" nofailback="1" ordered="1" restricted="0">
                                <failoverdomainnode name="an-node01.alteeve.ca" priority="1"/>
                                <failoverdomainnode name="an-node02.alteeve.ca" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <service autostart="0" domain="an1_only" exclusive="0" max_restarts="0" name="an1_storage" recovery="restart" restart_expire_time="0">
                        <script ref="drbd">
                                <script ref="clvmd"/>
                        </script>
                </service>
                <service autostart="0" domain="an2_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart" restart_expire_time="0">
                        <script ref="drbd">
                                <script ref="clvmd"/>
                        </script>
                </service>
        </rm>
</cluster>
rg_test test /etc/cluster/cluster.conf
Running in test mode.
Loading resource rule from /usr/share/cluster/oralistener.sh
Loading resource rule from /usr/share/cluster/apache.sh
Loading resource rule from /usr/share/cluster/SAPDatabase
Loading resource rule from /usr/share/cluster/postgres-8.sh
Loading resource rule from /usr/share/cluster/lvm.sh
Loading resource rule from /usr/share/cluster/mysql.sh
Loading resource rule from /usr/share/cluster/lvm_by_vg.sh
Loading resource rule from /usr/share/cluster/service.sh
Loading resource rule from /usr/share/cluster/samba.sh
Loading resource rule from /usr/share/cluster/SAPInstance
Loading resource rule from /usr/share/cluster/checkquorum
Loading resource rule from /usr/share/cluster/ocf-shellfuncs
Loading resource rule from /usr/share/cluster/svclib_nfslock
Loading resource rule from /usr/share/cluster/script.sh
Loading resource rule from /usr/share/cluster/clusterfs.sh
Loading resource rule from /usr/share/cluster/fs.sh
Loading resource rule from /usr/share/cluster/oracledb.sh
Loading resource rule from /usr/share/cluster/nfsserver.sh
Loading resource rule from /usr/share/cluster/netfs.sh
Loading resource rule from /usr/share/cluster/orainstance.sh
Loading resource rule from /usr/share/cluster/vm.sh
Loading resource rule from /usr/share/cluster/lvm_by_lv.sh
Loading resource rule from /usr/share/cluster/tomcat-6.sh
Loading resource rule from /usr/share/cluster/ip.sh
Loading resource rule from /usr/share/cluster/nfsexport.sh
Loading resource rule from /usr/share/cluster/openldap.sh
Loading resource rule from /usr/share/cluster/ASEHAagent.sh
Loading resource rule from /usr/share/cluster/nfsclient.sh
Loading resource rule from /usr/share/cluster/named.sh
Loaded 24 resource rules
=== Resources List ===
Resource type: ip
Instances: 1/1
Agent: ip.sh
Attributes:
  address = 192.168.2.100 [ primary unique ]
  monitor_link = on
  nfslock [ inherit("service%nfslock") ]

Resource type: script
Agent: script.sh
Attributes:
  name = drbd [ primary unique ]
  file = /etc/init.d/drbd [ unique required ]
  service_name [ inherit("service%name") ]

Resource type: script
Agent: script.sh
Attributes:
  name = clvmd [ primary unique ]
  file = /etc/init.d/clvmd [ unique required ]
  service_name [ inherit("service%name") ]

Resource type: script
Agent: script.sh
Attributes:
  name = tgtd [ primary unique ]
  file = /etc/init.d/tgtd [ unique required ]
  service_name [ inherit("service%name") ]

Resource type: service [INLINE]
Instances: 1/1
Agent: service.sh
Attributes:
  name = an1_storage [ primary unique required ]
  domain = an1_only [ reconfig ]
  autostart = 0 [ reconfig ]
  exclusive = 0 [ reconfig ]
  nfslock = 0
  nfs_client_cache = 0
  recovery = restart [ reconfig ]
  depend_mode = hard
  max_restarts = 0
  restart_expire_time = 0
  priority = 0

Resource type: service [INLINE]
Instances: 1/1
Agent: service.sh
Attributes:
  name = an2_storage [ primary unique required ]
  domain = an2_only [ reconfig ]
  autostart = 0 [ reconfig ]
  exclusive = 0 [ reconfig ]
  nfslock = 0
  nfs_client_cache = 0
  recovery = restart [ reconfig ]
  depend_mode = hard
  max_restarts = 0
  restart_expire_time = 0
  priority = 0

Resource type: service [INLINE]
Instances: 1/1
Agent: service.sh
Attributes:
  name = san_ip [ primary unique required ]
  domain = an1_primary [ reconfig ]
  autostart = 0 [ reconfig ]
  exclusive = 0 [ reconfig ]
  nfslock = 0
  nfs_client_cache = 0
  recovery = relocate [ reconfig ]
  depend_mode = hard
  max_restarts = 0
  restart_expire_time = 0
  priority = 0

=== Resource Tree ===
service (S0) {
  name = "an1_storage";
  domain = "an1_only";
  autostart = "0";
  exclusive = "0";
  nfslock = "0";
  nfs_client_cache = "0";
  recovery = "restart";
  depend_mode = "hard";
  max_restarts = "0";
  restart_expire_time = "0";
  priority = "0";
  script (S0) {
    name = "drbd";
    file = "/etc/init.d/drbd";
    service_name = "an1_storage";
    script (S0) {
      name = "clvmd";
      file = "/etc/init.d/clvmd";
      service_name = "an1_storage";
    }
  }
}
service (S0) {
  name = "an2_storage";
  domain = "an2_only";
  autostart = "0";
  exclusive = "0";
  nfslock = "0";
  nfs_client_cache = "0";
  recovery = "restart";
  depend_mode = "hard";
  max_restarts = "0";
  restart_expire_time = "0";
  priority = "0";
  script (S0) {
    name = "drbd";
    file = "/etc/init.d/drbd";
    service_name = "an2_storage";
    script (S0) {
      name = "clvmd";
      file = "/etc/init.d/clvmd";
      service_name = "an2_storage";
    }
  }
}
service (S0) {
  name = "san_ip";
  domain = "an1_primary";
  autostart = "0";
  exclusive = "0";
  nfslock = "0";
  nfs_client_cache = "0";
  recovery = "relocate";
  depend_mode = "hard";
  max_restarts = "0";
  restart_expire_time = "0";
  priority = "0";
  ip (S0) {
    address = "192.168.2.100";
    monitor_link = "on";
    nfslock = "0";
  }
}
=== Failover Domains ===
Failover domain: an1_only
Flags: Restricted No Failback
  Node an-node01.alteeve.ca (id 1, priority 0)
Failover domain: an2_only
Flags: Restricted No Failback
  Node an-node02.alteeve.ca (id 2, priority 0)
Failover domain: an1_primary
Flags: Ordered No Failback
  Node an-node01.alteeve.ca (id 1, priority 1)
  Node an-node02.alteeve.ca (id 2, priority 2)
=== Event Triggers ===
Event Priority Level 100:
  Name: Default
    (Any event)
    File: /usr/share/cluster/default_event_script.sl
[root@an-node01 ~]# cman_tool version -r
You have not authenticated to the ricci daemon on an-node01.alteeve.ca
Password: 
[root@an-node01 ~]# clusvcadm -e service:an1_storage
Local machine trying to enable service:an1_storage...Success
service:an1_storage is now running on an-node01.alteeve.ca
[root@an-node01 ~]# cat /proc/drbd 
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:924 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:916 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
cman_tool version -r
You have not authenticated to the ricci daemon on an-node01.alteeve.ca
Password:

an-node01:

clusvcadm -e service:an1_storage
service:an1_storage is now running on an-node01.alteeve.ca

an-node02:

clusvcadm -e service:an2_storage
service:an2_storage is now running on an-node02.alteeve.ca

Either node:

cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:924 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:916 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Configure Clustered LVM

an-node01:

cp /etc/lvm/lvm.conf /etc/lvm/lvm.conf.orig
vim /etc/lvm/lvm.conf
diff -u /etc/lvm/lvm.conf.orig /etc/lvm/lvm.conf
--- /etc/lvm/lvm.conf.orig	2011-08-02 21:59:01.000000000 -0400
+++ /etc/lvm/lvm.conf	2011-08-02 22:00:17.000000000 -0400
@@ -50,7 +50,8 @@
 
 
     # By default we accept every block device:
-    filter = [ "a/.*/" ]
+    #filter = [ "a/.*/" ]
+    filter = [ "a|/dev/drbd*|", "r/.*/" ]
 
     # Exclude the cdrom drive
     # filter = [ "r|/dev/cdrom|" ]
@@ -308,7 +309,8 @@
     # Type 3 uses built-in clustered locking.
     # Type 4 uses read-only locking which forbids any operations that might 
     # change metadata.
-    locking_type = 1
+    #locking_type = 1
+    locking_type = 3
 
     # Set to 0 to fail when a lock request cannot be satisfied immediately.
     wait_for_locks = 1
@@ -324,7 +326,8 @@
     # to 1 an attempt will be made to use local file-based locking (type 1).
     # If this succeeds, only commands against local volume groups will proceed.
     # Volume Groups marked as clustered will be ignored.
-    fallback_to_local_locking = 1
+    #fallback_to_local_locking = 1
+    fallback_to_local_locking = 0
 
     # Local non-LV directory that holds file-based locks while commands are
     # in progress.  A directory like /tmp that may get wiped on reboot is OK.
rsync -av /etc/lvm/lvm.conf root@an-node02:/etc/lvm/
sending incremental file list
lvm.conf

sent 2412 bytes  received 247 bytes  5318.00 bytes/sec
total size is 24668  speedup is 9.28
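
As an aside, the lvm2-cluster package ships an lvmconf helper that can flip the locking type for you (the filter line above still has to be edited by hand). A one-line alternative to the locking_type edit:

lvmconf --enable-cluster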

Create the LVM PVs, VGs and LVs.

an-node01:

pvcreate /dev/drbd{0,1}
  Physical volume "/dev/drbd0" successfully created
  Physical volume "/dev/drbd1" successfully created

an-node02:

pvscan
  PV /dev/drbd0                      lvm2 [421.50 GiB]
  PV /dev/drbd1                      lvm2 [27.95 GiB]
  Total: 2 [449.45 GiB] / in use: 0 [0   ] / in no VG: 2 [449.45 GiB]

an-node01:

vgcreate -c y hdd_vg0 /dev/drbd0 && vgcreate -c y ssd_vg0 /dev/drbd1
  Clustered volume group "hdd_vg0" successfully created
  Clustered volume group "ssd_vg0" successfully created

an-node02:

vgscan
  Reading all physical volumes.  This may take a while...
  Found volume group "ssd_vg0" using metadata type lvm2
  Found volume group "hdd_vg0" using metadata type lvm2

an-node01:

lvcreate -l 100%FREE -n lun0 /dev/hdd_vg0 && lvcreate -l 100%FREE -n lun1 /dev/ssd_vg0
  Logical volume "lun0" created
  Logical volume "lun1" created

an-node02:

lvscan
  ACTIVE            '/dev/ssd_vg0/lun1' [27.95 GiB] inherit
  ACTIVE            '/dev/hdd_vg0/lun0' [421.49 GiB] inherit

iSCSI notes

IET vs tgt pros and cons needed.

default iscsi port is 3260

initiator: This is the client.
target: This is the server side.
SID: Session ID; found with iscsiadm -m session -P 1. The SID and its sysfs path are not persistent; they are partially start-order based.
IQN: iSCSI Qualified Name; a string that uniquely identifies targets and initiators.
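
Two look-ups that come in handy here (standard iscsi-initiator-utils locations, shown as examples):

# The local initiator's IQN, generated when iscsi-initiator-utils was installed.
cat /etc/iscsi/initiatorname.iscsi
# List current sessions, including their SIDs.
iscsiadm -m session -P 1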

Both:

yum install iscsi-initiator-utils scsi-target-utils

an-node01:

cp /etc/tgt/targets.conf /etc/tgt/targets.conf.orig
vim /etc/tgt/targets.conf
diff -u /etc/tgt/targets.conf.orig /etc/tgt/targets.conf
--- /etc/tgt/targets.conf.orig	2011-07-31 12:38:35.000000000 -0400
+++ /etc/tgt/targets.conf	2011-08-02 22:19:06.000000000 -0400
@@ -251,3 +251,9 @@
 #        vendor_id VENDOR1
 #    </direct-store>
 #</target>
+
+<target iqn.2011-08.com.alteeve:an-clusterA.target01>
+	direct-store /dev/drbd0
+	direct-store /dev/drbd1
+	vendor_id Alteeve
+</target>
rsync -av /etc/tgt/targets.conf root@an-node02:/etc/tgt/
sending incremental file list
targets.conf

sent 909 bytes  received 97 bytes  670.67 bytes/sec
total size is 7093  speedup is 7.05

Update the cluster

               <service autostart="0" domain="an1_only" exclusive="0" max_restarts="0" name="an1_storage" recovery="restart" restart_expire_time="0">
                        <script ref="drbd">
                                <script ref="clvmd">
                                        <script ref="tgtd"/>
                                </script>
                        </script>
                </service>
                <service autostart="0" domain="an2_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart" restart_expire_time="0">
                        <script ref="drbd">
                                <script ref="clvmd">
                                        <script ref="tgtd"/>
                                </script>
                        </script>
                </service>
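
Remember that any cluster.conf change also needs config_version bumped and then pushed out; following the earlier pattern, something like:

# Increment config_version in /etc/cluster/cluster.conf first, then:
rsync -av /etc/cluster/cluster.conf root@an-node02:/etc/cluster/
cman_tool version -r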

Connect to the SAN from a VM node

an-node03+:

iscsiadm -m discovery -t sendtargets -p 192.168.2.100
192.168.2.100:3260,1 iqn.2011-08.com.alteeve:an-clusterA.target01
iscsiadm --mode node --portal 192.168.2.100 --target iqn.2011-08.com.alteeve:an-clusterA.target01 --login
Logging in to [iface: default, target: iqn.2011-08.com.alteeve:an-clusterA.target01, portal: 192.168.2.100,3260]
Login to [iface: default, target: iqn.2011-08.com.alteeve:an-clusterA.target01, portal: 192.168.2.100,3260] successful.
fdisk -l
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00062f4a

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          33      262144   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              33        5255    41943040   83  Linux
/dev/sda3            5255        5777     4194304   82  Linux swap / Solaris

Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sdb doesn't contain a valid partition table

Disk /dev/sdc: 30.0 GB, 30010245120 bytes
64 heads, 32 sectors/track, 28620 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sdc doesn't contain a valid partition table

Setup the VM Cluster

Install RPMs.

yum -y install lvm2-cluster cman fence-agents

Configure lvm.conf.

cp /etc/lvm/lvm.conf /etc/lvm/lvm.conf.orig
vim /etc/lvm/lvm.conf
diff -u /etc/lvm/lvm.conf.orig /etc/lvm/lvm.conf
--- /etc/lvm/lvm.conf.orig	2011-08-02 21:59:01.000000000 -0400
+++ /etc/lvm/lvm.conf	2011-08-03 00:35:45.000000000 -0400
@@ -308,7 +308,8 @@
     # Type 3 uses built-in clustered locking.
     # Type 4 uses read-only locking which forbids any operations that might 
     # change metadata.
-    locking_type = 1
+    #locking_type = 1
+    locking_type = 3
 
     # Set to 0 to fail when a lock request cannot be satisfied immediately.
     wait_for_locks = 1
@@ -324,7 +325,8 @@
     # to 1 an attempt will be made to use local file-based locking (type 1).
     # If this succeeds, only commands against local volume groups will proceed.
     # Volume Groups marked as clustered will be ignored.
-    fallback_to_local_locking = 1
+    #fallback_to_local_locking = 1
+    fallback_to_local_locking = 0
 
     # Local non-LV directory that holds file-based locks while commands are
     # in progress.  A directory like /tmp that may get wiped on reboot is OK.
rsync -av /etc/lvm/lvm.conf root@an-node04:/etc/lvm/
sending incremental file list
lvm.conf

sent 873 bytes  received 247 bytes  2240.00 bytes/sec
total size is 24625  speedup is 21.99
rsync -av /etc/lvm/lvm.conf root@an-node05:/etc/lvm/
sending incremental file list
lvm.conf

sent 873 bytes  received 247 bytes  2240.00 bytes/sec
total size is 24625  speedup is 21.99

Config the cluster.

vim /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="5" name="an-clusterB">
        <totem rrp_mode="none" secauth="off"/>
        <clusternodes>
                <clusternode name="an-node03.alteeve.ca" nodeid="1">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="3"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node04.alteeve.ca" nodeid="2">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="4"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node05.alteeve.ca" nodeid="3">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="5"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
        </fencedevices>
        <fence_daemon post_join_delay="30"/>
        <rm>
                <resources>
                        <script file="/etc/init.d/iscsi" name="iscsi" />
                        <script file="/etc/init.d/clvmd" name="clvmd" />
                </resources>
                <failoverdomains>
                        <failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node03.alteeve.ca" />
                        </failoverdomain>
                        <failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node04.alteeve.ca" />
                        </failoverdomain>
                        <failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node05.alteeve.ca" />
                        </failoverdomain>
                </failoverdomains>
                <service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an1_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd"/>
                        </script>
                </service>
                <service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd"/>
                        </script>
                </service>
                <service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd"/>
                        </script>
                </service>
        </rm>   
</cluster>
ccs_config_validate
Configuration validates
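
Not shown above: the new cluster.conf still has to reach an-node04 and an-node05, and cman plus rgmanager need to be running on all three VM-cluster nodes before clustat and clusvcadm below will work. Something like this (assumed, following the same pattern as the storage cluster):

rsync -av /etc/cluster/cluster.conf root@an-node04:/etc/cluster/
rsync -av /etc/cluster/cluster.conf root@an-node05:/etc/cluster/
# Then, on an-node03 through an-node05:
/etc/init.d/cman start
/etc/init.d/rgmanager start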

Make sure iscsi and clvmd do not start on boot, stop both, then make sure they start and stop cleanly.

chkconfig clvmd off; chkconfig iscsi off; /etc/init.d/iscsi stop && /etc/init.d/clvmd stop
Stopping iscsi:                                            [  OK  ]
/etc/init.d/clvmd start && /etc/init.d/iscsi start && /etc/init.d/iscsi stop && /etc/init.d/clvmd stop
Starting clvmd: 
Activating VG(s):   No volume groups found
                                                           [  OK  ]
Starting iscsi:                                            [  OK  ]
Stopping iscsi:                                            [  OK  ]
Signaling clvmd to exit                                    [  OK  ]
clvmd terminated                                           [  OK  ]

Use the cluster to stop (in case it autostarted before now) and then start the services.

# Disable (stop)
clusvcadm -d service:an3_storage
clusvcadm -d service:an4_storage
clusvcadm -d service:an5_storage
# Enable (start)
clusvcadm -e service:an3_storage -m an-node03.alteeve.ca
clusvcadm -e service:an4_storage -m an-node04.alteeve.ca
clusvcadm -e service:an5_storage -m an-node05.alteeve.ca
# Check
clustat
Cluster Status for an-clusterB @ Wed Aug  3 00:25:10 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node03.alteeve.ca                        1 Online, Local, rgmanager
 an-node04.alteeve.ca                        2 Online, rgmanager
 an-node05.alteeve.ca                        3 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 service:an3_storage            an-node03.alteeve.ca           started       
 service:an4_storage            an-node04.alteeve.ca           started       
 service:an5_storage            an-node05.alteeve.ca           started

Flush iSCSI's Cache

If you remove a target's IQN (or change its name), the /etc/init.d/iscsi script will return errors. To flush the cached node records and re-scan:

I am sure there is a more elegant way.

/etc/init.d/iscsi stop && rm -rf /var/lib/iscsi/nodes/* && iscsiadm -m discovery -t sendtargets -p 192.168.2.100
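
A slightly gentler option is to log out of just the target whose record went stale, delete only that node record, then re-run discovery. A sketch using the target name from above:

iscsiadm --mode node --portal 192.168.2.100 --target iqn.2011-08.com.alteeve:an-clusterA.target01 --logout
rm -rf /var/lib/iscsi/nodes/iqn.2011-08.com.alteeve:an-clusterA.target01
iscsiadm -m discovery -t sendtargets -p 192.168.2.100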

Setup the VM Cluster's Clustered LVM

Partition the SAN disks

an-node03:

fdisk -l
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00062f4a

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          33      262144   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              33        5255    41943040   83  Linux
/dev/sda3            5255        5777     4194304   82  Linux swap / Solaris

Disk /dev/sdc: 30.0 GB, 30010245120 bytes
64 heads, 32 sectors/track, 28620 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sdc doesn't contain a valid partition table

Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sdb doesn't contain a valid partition table

Create partitions.

fdisk /dev/sdb
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0x403f1fb8.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.

Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
         switch off the mode (command 'c') and change display units to
         sectors (command 'u').

Command (m for help): c
DOS Compatibility flag is not set

Command (m for help): u
Changing display/entry units to sectors

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-55022, default 1): 1
Last cylinder, +cylinders or +size{K,M,G} (1-55022, default 55022): 
Using default value 55022

Command (m for help): p

Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x403f1fb8

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       55022   441964183+  83  Linux

Command (m for help): t
Selected partition 1
Hex code (type L to list codes): 8e
Changed system type of partition 1 to 8e (Linux LVM)

Command (m for help): p

Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x403f1fb8

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       55022   441964183+  8e  Linux LVM

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
fdisk /dev/sdc
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0xba7503eb.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.

Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
         switch off the mode (command 'c') and change display units to
         sectors (command 'u').

Command (m for help): c
DOS Compatibility flag is not set

Command (m for help): u
Changing display/entry units to sectors

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First sector (2048-58613759, default 2048): 
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-58613759, default 58613759): 
Using default value 58613759

Command (m for help): t
Selected partition 1
Hex code (type L to list codes): 8e
Changed system type of partition 1 to 8e (Linux LVM)

Command (m for help): p

Disk /dev/sdc: 30.0 GB, 30010245120 bytes
64 heads, 32 sectors/track, 28620 cylinders, total 58613760 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xba7503eb

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1            2048    58613759    29305856   8e  Linux LVM

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
fdisk -l
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00062f4a

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          33      262144   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              33        5255    41943040   83  Linux
/dev/sda3            5255        5777     4194304   82  Linux swap / Solaris

Disk /dev/sdc: 30.0 GB, 30010245120 bytes
64 heads, 32 sectors/track, 28620 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xba7503eb

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               2       28620    29305856   8e  Linux LVM

Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x403f1fb8

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       55022   441964183+  8e  Linux LVM

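Because /dev/sdb and /dev/sdc are the shared iSCSI LUNs, the other nodes may still be holding the old, empty partition tables in memory. A quick sketch to force a re-read on an-node04 and an-node05 before creating the LVM signatures, assuming partprobe (from the parted package) is installed; logging out of and back into the iSCSI targets, or rebooting, achieves the same thing:

partprobe /dev/sdb /dev/sdc
fdisk -l /dev/sdb /dev/sdc
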
Set Up LVM Devices

Create PV.

an-node03:

pvcreate /dev/sd{b,c}1
  Physical volume "/dev/sdb1" successfully created
  Physical volume "/dev/sdc1" successfully created

an-node04 and an-node05:

pvscan
  PV /dev/sdb1                      lvm2 [421.49 GiB]
  PV /dev/sdc1                      lvm2 [27.95 GiB]
  Total: 2 [449.44 GiB] / in use: 0 [0   ] / in no VG: 2 [449.44 GiB]

Create the VGs.

an-node03:

vgcreate -c y san_vg01 /dev/sdb1
  Clustered volume group "san_vg01" successfully created
vgcreate -c y san_vg02 /dev/sdc1
  Clustered volume group "san_vg02" successfully created

an-node04 and an-node05:

vgscan
  Reading all physical volumes.  This may take a while...
  Found volume group "san_vg02" using metadata type lvm2
  Found volume group "san_vg01" using metadata type lvm2

Create the first VM's LVs.

an-node03:

lvcreate -L 10G -n shared01 /dev/san_vg01
  Logical volume "shared01" created
lvcreate -L 50G -n vm0001_hdd1 /dev/san_vg01
  Logical volume "vm0001_hdd1" created
lvcreate -L 10G -n vm0001_ssd1 /dev/san_vg02
  Logical volume "vm0001_ssd1" created

an-node04 and an-node05:

lvscan
  ACTIVE            '/dev/san_vg01/shared01' [10.00 GiB] inherit
  ACTIVE            '/dev/san_vg02/vm0001_ssd1' [10.00 GiB] inherit
  ACTIVE            '/dev/san_vg01/vm0001_hdd1' [50.00 GiB] inherit
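
The -c y flag used above marks these volume groups as clustered, which only works if LVM on every node is configured for cluster locking. A quick sanity check (lvmconf comes from the lvm2-cluster package; locking_type should read 3 when clvmd is in use):

grep locking_type /etc/lvm/lvm.conf
# If it still reads 'locking_type = 1', enable cluster locking and restart clvmd:
lvmconf --enable-cluster
/etc/init.d/clvmd restart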

Create Shared GFS2 Partition

an-node03:

mkfs.gfs2 -p lock_dlm -j 5 -t an-clusterB:shared01 /dev/san_vg01/shared01
This will destroy any data on /dev/san_vg01/shared01.
It appears to contain: symbolic link to `../dm-2'

Are you sure you want to proceed? [y/n] y

Device:                    /dev/san_vg01/shared01
Blocksize:                 4096
Device Size                10.00 GB (2621440 blocks)
Filesystem Size:           10.00 GB (2621438 blocks)
Journals:                  5
Resource Groups:           40
Locking Protocol:          "lock_dlm"
Lock Table:                "an-clusterB:shared01"
UUID:                      6C0D7D1D-A1D3-ED79-705D-28EE3D674E75

Add it to /etc/fstab (needed for the gfs2 init script to find and mount it):

an-node03 - an-node07:

echo `gfs2_edit -p sb /dev/san_vg01/shared01 | grep sb_uuid | sed -e "s/.*sb_uuid  *\(.*\)/UUID=\L\1\E \/shared01\t\tgfs2\trw,suid,dev,exec,nouser,async\t0 0/"` >> /etc/fstab 
cat /etc/fstab
#
# /etc/fstab
# Created by anaconda on Fri Jul  8 22:01:41 2011
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=2c1f4cb1-959f-4675-b9c7-5d753c303dd1 /                       ext3    defaults        1 1
UUID=9a0224dc-15b4-439e-8d7c-5f9dbcd05e3f /boot                   ext3    defaults        1 2
UUID=4f2a83e8-1769-40d8-ba2a-e1f535306848 swap                    swap    defaults        0 0
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
UUID=6c0d7d1d-a1d3-ed79-705d-28ee3d674e75 /shared01 gfs2 rw,suid,dev,exec,nouser,async 0 0
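
If the sed one-liner above looks opaque, it is only reading the filesystem UUID out of the GFS2 superblock, lower-casing it, and appending an fstab entry. A more explicit sketch that does the same thing by hand (the UUID is the one reported by mkfs.gfs2 above; the mount options match the generated line):

gfs2_edit -p sb /dev/san_vg01/shared01 | grep sb_uuid
echo "UUID=6c0d7d1d-a1d3-ed79-705d-28ee3d674e75 /shared01 gfs2 rw,suid,dev,exec,nouser,async 0 0" >> /etc/fstab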

Make the mount point and mount it.

mkdir /shared01
/etc/init.d/gfs2 start
Mounting GFS2 filesystem (/shared01):                      [  OK  ]
df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              40G  3.3G   35G   9% /
tmpfs                 1.8G   32M  1.8G   2% /dev/shm
/dev/sda1             248M   85M  151M  36% /boot
/dev/mapper/san_vg01-shared01
                       10G  647M  9.4G   7% /shared01

Stop GFS2 on all five nodes and update the cluster.conf config.

/etc/init.d/gfs2 stop
Unmounting GFS2 filesystem (/shared01):                    [  OK  ]
df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              40G  3.3G   35G   9% /
tmpfs                 1.8G   32M  1.8G   2% /dev/shm
/dev/sda1             248M   85M  151M  36% /boot
/dev/mapper/san_vg01-shared01
                       10G  647M  9.4G   7% /shared01

an-node03:

<?xml version="1.0"?>
<cluster config_version="9" name="an-clusterB">
        <totem rrp_mode="none" secauth="off"/>
        <clusternodes>
                <clusternode name="an-node03.alteeve.ca" nodeid="3">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="3"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node04.alteeve.ca" nodeid="4">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="4"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node05.alteeve.ca" nodeid="5">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="5"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node06.alteeve.ca" nodeid="6">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="6"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node07.alteeve.ca" nodeid="7">
                        <fence>
                                <method name="apc_pdu">
                                        <device action="reboot" name="pdu2" port="7"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
        </fencedevices>
        <fence_daemon post_join_delay="30"/>
        <rm>
                <resources>
                        <script file="/etc/init.d/iscsi" name="iscsi"/>
                        <script file="/etc/init.d/clvmd" name="clvmd"/>
                        <script file="/etc/init.d/gfs2" name="gfs2"/>
                </resources>
                <failoverdomains>
                        <failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node03.alteeve.ca"/>
                        </failoverdomain>
                        <failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node04.alteeve.ca"/>
                        </failoverdomain>
                        <failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node05.alteeve.ca"/>
                        </failoverdomain>
                        <failoverdomain name="an6_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node06.alteeve.ca"/>
                        </failoverdomain>
                        <failoverdomain name="an7_only" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node07.alteeve.ca"/>
                        </failoverdomain>
                </failoverdomains>
                <service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an3_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
                <service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
                <service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
                <service autostart="1" domain="an6_only" exclusive="0" max_restarts="0" name="an6_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
                <service autostart="1" domain="an7_only" exclusive="0" max_restarts="0" name="an7_storage" recovery="restart">
                        <script ref="iscsi">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
        </rm>
</cluster>
cman_tool version -r

Check that rgmanager picked up the updated config and remounted the GFS2 partition.

df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              40G  3.3G   35G   9% /
tmpfs                 1.8G   32M  1.8G   2% /dev/shm
/dev/sda1             248M   85M  151M  36% /boot
/dev/mapper/san_vg01-shared01
                       10G  647M  9.4G   7% /shared01

Configure KVM

Host network and VM hypervisor config.

Disable the 'qemu' Bridge

By default, libvirtd creates a bridge called virbr0, designed to connect virtual machines out through the host's first interface, eth0. Our system will not need this, so we will remove it. This bridge is configured in the /etc/libvirt/qemu/networks/default.xml file.

So to remove this bridge, simply delete the contents of the file, stop the bridge, delete the bridge and then stop iptables to make sure any rules created for the bridge are flushed.

cat /dev/null >/etc/libvirt/qemu/networks/default.xml
ifconfig virbr0 down
brctl delbr virbr0
/etc/init.d/iptables stop
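
If you prefer not to blank the file by hand, the same result can be had through virsh while libvirtd is running; this is an alternative sketch, not an additional step:

virsh net-destroy default
virsh net-autostart default --disable
virsh net-undefine default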

Configure Bridges

On an-node03 through an-node07:

vim /etc/sysconfig/network-scripts/ifcfg-{eth,vbr}0

ifcfg-eth0:

# Internet facing
HWADDR="bc:ae:c5:44:8a:de"
DEVICE="eth0"
BRIDGE="vbr0"
BOOTPROTO="static"
IPV6INIT="yes"
NM_CONTROLLED="no"
ONBOOT="yes"

Note that you can use whatever bridge name makes sense to you. However, the file name for the bridge configuration must sort after the ifcfg-ethX file; if the bridge file is read before the ethernet interface, it will fail to come up. Also, the bridge name defined inside the file does not need to match the one used in the file name itself. Personally, I like vbrX for "vm bridge".

ifcfg-vbr0:

# Bridge - IFN
DEVICE="vbr0"
TYPE="Bridge"
IPADDR=192.168.1.73
NETMASK=255.255.255.0
GATEWAY=192.168.1.254
DNS1=192.139.81.117
DNS2=192.139.81.1

If you do not wish to make the Back-Channel Network accessible to the virtual machines, there is no need to set up this second bridge.

vim /etc/sysconfig/network-scripts/ifcfg-{eth,vbr}2

ifcfg-eth2:

# Back-channel
HWADDR="00:1B:21:72:9B:56"
DEVICE="eth2"
BRIDGE="vbr2"
BOOTPROTO="static"
IPV6INIT="yes"
NM_CONTROLLED="no"
ONBOOT="yes"

ifcfg-vbr2:

# Bridge - BCN
DEVICE="vbr2"
TYPE="Bridge"
IPADDR=192.168.3.73
NETMASK=255.255.255.0

Leave the cluster, lest we be fenced.

/etc/init.d/rgmanager stop && /etc/init.d/cman stop

Restart networking and then check that the new bridges are up and that the proper ethernet devices are slaved to them.

/etc/init.d/network restart
Shutting down interface eth0:                              [  OK  ]
Shutting down interface eth1:                              [  OK  ]
Shutting down interface eth2:                              [  OK  ]
Shutting down loopback interface:                          [  OK  ]
Bringing up loopback interface:                            [  OK  ]
Bringing up interface eth0:                                [  OK  ]
Bringing up interface eth1:                                [  OK  ]
Bringing up interface eth2:                                [  OK  ]
Bringing up interface vbr0:                                [  OK  ]
Bringing up interface vbr2:                                [  OK  ]
brctl show
bridge name	bridge id		STP enabled	interfaces
vbr0		8000.bcaec5448ade	no		eth0
vbr2		8000.001b21729b56	no		eth2
ifconfig
eth0      Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE  
          inet6 addr: fe80::beae:c5ff:fe44:8ade/64 Scope:Link
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
          RX packets:4439 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2752 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:508352 (496.4 KiB)  TX bytes:494345 (482.7 KiB)
          Interrupt:31 Base address:0x8000 

eth1      Link encap:Ethernet  HWaddr 00:1B:21:72:96:E8  
          inet addr:192.168.2.73  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe72:96e8/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:617100 errors:0 dropped:0 overruns:0 frame:0
          TX packets:847718 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:772489353 (736.7 MiB)  TX bytes:740536232 (706.2 MiB)
          Interrupt:18 Memory:fe9e0000-fea00000 

eth2      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56  
          inet6 addr: fe80::21b:21ff:fe72:9b56/64 Scope:Link
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
          RX packets:86586 errors:0 dropped:0 overruns:0 frame:0
          TX packets:80934 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:11366700 (10.8 MiB)  TX bytes:10091579 (9.6 MiB)
          Interrupt:17 Memory:feae0000-feb00000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:32 errors:0 dropped:0 overruns:0 frame:0
          TX packets:32 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:11507 (11.2 KiB)  TX bytes:11507 (11.2 KiB)

vbr0      Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE  
          inet addr:192.168.1.73  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::beae:c5ff:fe44:8ade/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:165 errors:0 dropped:0 overruns:0 frame:0
          TX packets:89 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:25875 (25.2 KiB)  TX bytes:17081 (16.6 KiB)

vbr2      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56  
          inet addr:192.168.3.73  Bcast:192.168.3.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe72:9b56/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:74 errors:0 dropped:0 overruns:0 frame:0
          TX packets:27 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:19021 (18.5 KiB)  TX bytes:4137 (4.0 KiB)

Rejoin the cluster.

/etc/init.d/cman start && /etc/init.d/rgmanager start


Repeat these configurations, altering for MAC and IP addresses as appropriate, for the other four VM cluster nodes.

Benchmarks

GFS2-backed /shared01 partition on an-node07. Test #1, no optimization:

bonnie++ -d /shared01/ -s 8g -u root:root
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
an-node07.alteev 8G   388  95 22203   6 14875   8  2978  95 48406  10 107.3   5
Latency               312ms   44400ms   31355ms   41505us     540ms   11926ms
Version  1.96       ------Sequential Create------ --------Random Create--------
an-node07.alteeve.c -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1144  18 +++++ +++  8643  56   939  19 +++++ +++  8262  55
Latency               291ms     586us    2085us    3511ms      51us    3669us
1.96,1.96,an-node07.alteeve.ca,1,1312497509,8G,,388,95,22203,6,14875,8,2978,95,48406,10,107.3,5,16,,,,,1144,18,+++++,+++,8643,56,939,19,+++++,+++,8262,55,312ms,44400ms,31355ms,41505us,540ms,11926ms,291ms,586us,2085us,3511ms,51us,3669us

CentOS 5.6 x86_64 VM vm0001_labzilla's /root directory. Test #1, no optimization. The VM was provisioned using the command in the section below.

bonnie++ -d /root/ -s 8g -u root:root
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
labzilla-new.can 8G   674  98 15708   5 14875   7  1570  65 47806  10 119.1   7
Latency             66766us    7680ms    1588ms     187ms     269ms    1292ms
Version  1.96       ------Sequential Create------ --------Random Create--------
labzilla-new.candco -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 27666  39 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency             11360us    1904us     799us     290us      44us      41us
1.96,1.96,labzilla-new.candcoptical.com,1,1312522208,8G,,674,98,15708,5,14875,7,1570,65,47806,10,119.1,7,16,,,,,27666,39,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,66766us,7680ms,1588ms,187ms,269ms,1292ms,11360us,1904us,799us,290us,44us,41us

Provision vm0001

The LV has already been created, so:

virt-install --connect qemu:///system \
  --name vm0001_labzilla \
  --ram 1024 \
  --arch x86_64 \
  --vcpus 2 \
  --cpuset 1-3 \
  --location http://192.168.1.254/c5/x86_64/img/ \
  --extra-args "ks=http://192.168.1.254/c5/x86_64/ks/labzilla_c5.ks" \
  --os-type linux \
  --os-variant rhel5.4 \
  --disk path=/dev/san_vg01/vm0001_hdd1 \
  --network bridge=vbr0 \
  --vnc

Provision vm0002

The LV has already been created, so:

virt-install --connect qemu:///system \
  --name vm0002_innovations \
  --ram 1024 \
  --arch x86_64 \
  --vcpus 2 \
  --cpuset 1-3 \
  --cdrom /shared01/media/Win_Server_2008_Bis_x86_64.iso \
  --os-type windows \
  --os-variant win2k8 \
  --disk path=/dev/san_vg01/vm0002_hdd2 \
  --network bridge=vbr0 \
  --hvm \
  --vnc
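
Both virt-install calls end with --vnc, so the installers can be watched graphically. A sketch of connecting from a workstation, assuming virt-viewer is installed and root SSH access to the node currently running the guest (an-node03 here):

virt-viewer --connect qemu+ssh://root@an-node03.alteeve.ca/system vm0001_labzilla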

Update the cluster.conf to add the VMs to the cluster.

<?xml version="1.0"?>
<cluster config_version="12" name="an-clusterB">
	<totem rrp_mode="none" secauth="off"/>
	<clusternodes>
		<clusternode name="an-node03.alteeve.ca" nodeid="3">
			<fence>
				<method name="apc_pdu">
					<device action="reboot" name="pdu2" port="3"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node04.alteeve.ca" nodeid="4">
			<fence>
				<method name="apc_pdu">
					<device action="reboot" name="pdu2" port="4"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node05.alteeve.ca" nodeid="5">
			<fence>
				<method name="apc_pdu">
					<device action="reboot" name="pdu2" port="5"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node06.alteeve.ca" nodeid="6">
			<fence>
				<method name="apc_pdu">
					<device action="reboot" name="pdu2" port="6"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node07.alteeve.ca" nodeid="7">
			<fence>
				<method name="apc_pdu">
					<device action="reboot" name="pdu2" port="7"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
	</fencedevices>
	<fence_daemon post_join_delay="30"/>
	<rm log_level="5">
		<resources>
			<script file="/etc/init.d/iscsi" name="iscsi"/>
			<script file="/etc/init.d/clvmd" name="clvmd"/>
			<script file="/etc/init.d/gfs2" name="gfs2"/>
		</resources>
		<failoverdomains>
			<failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node03.alteeve.ca"/>
			</failoverdomain>
			<failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node04.alteeve.ca"/>
			</failoverdomain>
			<failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node05.alteeve.ca"/>
			</failoverdomain>
			<failoverdomain name="an6_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node06.alteeve.ca"/>
			</failoverdomain>
			<failoverdomain name="an7_only" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-node07.alteeve.ca"/>
			</failoverdomain>
			<failoverdomain name="an3_primary" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node03.alteeve.ca" priority="1"/>
				<failoverdomainnode name="an-node04.alteeve.ca" priority="2"/>
				<failoverdomainnode name="an-node05.alteeve.ca" priority="3"/>
				<failoverdomainnode name="an-node06.alteeve.ca" priority="4"/>
				<failoverdomainnode name="an-node07.alteeve.ca" priority="5"/>
			</failoverdomain>
			<failoverdomain name="an4_primary" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-node03.alteeve.ca" priority="5"/>
				<failoverdomainnode name="an-node04.alteeve.ca" priority="1"/>
				<failoverdomainnode name="an-node05.alteeve.ca" priority="2"/>
				<failoverdomainnode name="an-node06.alteeve.ca" priority="3"/>
				<failoverdomainnode name="an-node07.alteeve.ca" priority="4"/>
			</failoverdomain>
		</failoverdomains>
		<service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an3_storage" recovery="restart">
			<script ref="iscsi">
				<script ref="clvmd">
					<script ref="gfs2"/>
				</script>
			</script>
		</service>
		<service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart">
			<script ref="iscsi">
				<script ref="clvmd">
					<script ref="gfs2"/>
				</script>
			</script>
		</service>
		<service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart">
			<script ref="iscsi">
				<script ref="clvmd">
					<script ref="gfs2"/>
				</script>
			</script>
		</service>
		<service autostart="1" domain="an6_only" exclusive="0" max_restarts="0" name="an6_storage" recovery="restart">
			<script ref="iscsi">
				<script ref="clvmd">
					<script ref="gfs2"/>
				</script>
			</script>
		</service>
		<service autostart="1" domain="an7_only" exclusive="0" max_restarts="0" name="an7_storage" recovery="restart">
			<script ref="iscsi">
				<script ref="clvmd">
					<script ref="gfs2"/>
				</script>
			</script>
		</service>
		<vm autostart="0" domain="an3_primary" exclusive="0" max_restarts="2" name="vm0001_labzilla" path="/shared01/definitions/" recovery="restart" restart_expire_time="600"/>
		<vm autostart="0" domain="an4_primary" exclusive="0" max_restarts="2" name="vm0002_innovations" path="/shared01/definitions/" recovery="restart" restart_expire_time="600"/>
	</rm>
</cluster>
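
As before, push the updated configuration out with cman_tool. The two vm resources above are set to autostart="0", so once the new config is active they need to be enabled by hand. A sketch (ccs_config_validate is an optional sanity check, if it is installed):

ccs_config_validate
cman_tool version -r
clusvcadm -e vm:vm0001_labzilla -m an-node03.alteeve.ca
clusvcadm -e vm:vm0002_innovations -m an-node04.alteeve.ca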


Stuff

Multi-VM after primary SAN (violent) ejection from cluster. Both VMs remained up!

Two VMs (windows and Linux) running on the SAN. Initial testing of survivability of primary SAN failure completed successfully!

First build of the 2x7 Cluster.

First-ever successful build/install of the "Cluster Set" cluster configuration. Fully HA, fully home-brew on all open source software using only commodity hardware. Much tuning/testing to come!

Bonding and Trunking

The goal here is to remove the network as a single point of failure.

The design is to use two stacked switches and bonded connections in the nodes, with each leg of a bond cabled to a different switch. While both switches are up, aggregate bandwidth is achieved using trunking in the switches and the appropriate bond driver configuration. Recovery from failure needs to be configured so that it completes faster than the cluster's token loss timeout multiplied by the token retransmit loss count.

This tutorial uses two D-Link DGS-3100-24 switches. This is not to endorse these switches per se, but they are relatively affordable, decent-quality switches for those who'd like to replicate this setup.

Configure The Stack

First, stack the switches using a ring topology (both HDMI connectors/cables used). If both switches are brand new, simply cable them together and they will auto-negotiate the stack configuration. If you are adding a new switch, power on the existing switch, cable up the second switch and then power it on. After a short time, its stack ID should increment and you should see the new switch appear in the existing switch's interface.

Configuring the Bonding Drivers

This tutorial uses four interfaces joined into two bonds of two NICs like so:

# Internet Facing Network:
eth0 + eth1 == bond0
# Storage and Cluster Communications:
eth2 + eth3 == bond1

This requires a few steps.

  • Create /etc/modprobe.d/bonding.conf and add an entry for the two bonding channels we will create.

Note: My eth0 device is an onboard controller with a maximum MTU of 7200 bytes. This means that the whole bond is restricted to this MTU.

vim /etc/modprobe.d/bonding.conf
alias bond0 bonding
alias bond1 bonding
  • Create the ifcfg-bondX configuration files.

Internet Facing configuration

touch /etc/sysconfig/network-scripts/ifcfg-bond{0,1}
vim /etc/sysconfig/network-scripts/ifcfg-eth{0,1} /etc/sysconfig/network-scripts/ifcfg-bond0
cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Internet Facing Network - Link 1
HWADDR="BC:AE:C5:44:8A:DE"
DEVICE="eth0"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER="bond0"
SLAVE="yes"
MTU="7200"
cat /etc/sysconfig/network-scripts/ifcfg-eth1
# Internet Facing Network - Link 2
HWADDR="00:1B:21:72:96:E8"
DEVICE="eth1"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER="bond0"
SLAVE="yes"
MTU="7200"
cat /etc/sysconfig/network-scripts/ifcfg-bond0
# Internet Facing Network - Bonded Interface
DEVICE="bond0"
BOOTPROTO="static"
NM_CONTROLLED="no"
ONBOOT="yes"
IPADDR="192.168.1.73"
NETMASK="255.255.255.0"
GATEWAY="192.168.1.254"
DNS1="192.139.81.117"
DNS2="192.139.81.1"
BONDING_OPTS="miimon=1000 mode=0"
MTU="7200"

Merged Storage Network and Back Channel Network configuration.

Note: The interfaces in this bond all support a maximum MTU of 9000 bytes.

vim /etc/sysconfig/network-scripts/ifcfg-eth{2,3} /etc/sysconfig/network-scripts/ifcfg-bond1
cat /etc/sysconfig/network-scripts/ifcfg-eth2
# Storage and Back Channel Networks - Link 1
HWADDR="00:1B:21:72:9B:56"
DEVICE="eth2"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER="bond1"
SLAVE="yes"
MTU="9000"
cat /etc/sysconfig/network-scripts/ifcfg-eth3
# Storage and Back Channel Networks - Link 2
HWADDR="00:0E:0C:59:46:E4"
DEVICE="eth3"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER="bond1"
SLAVE="yes"
MTU="9000"
cat /etc/sysconfig/network-scripts/ifcfg-bond1
# Storage and Back Channel Networks - Bonded Interface
DEVICE="bond1"
BOOTPROTO="static"
NM_CONTROLLED="no"
ONBOOT="yes"
IPADDR="192.168.3.73"
NETMASK="255.255.255.0"
BONDING_OPTS="miimon=1000 mode=0"
MTU="9000"

Restart networking.

Note: I've noticed that this can error out and fail to start slaved devices at times when using /etc/init.d/network restart. If you have any trouble, you may need to completely stop all networking, then start it back up. This, of course, requires network-less access to the node's console (direct access, iKVM, console redirection, etc).

Some of the errors we will see below are because the network interface configuration changed while the interfaces were still up. If you have network-less access to the nodes, you can avoid this by stopping the network interfaces before you begin editing.
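
If you do have console access, the safer sequence is simply to take networking down for the duration of the edits:

/etc/init.d/network stop
# ... edit the ifcfg-* files ...
/etc/init.d/network start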

/etc/init.d/network restart
Shutting down interface eth0:                              [  OK  ]
Shutting down interface eth1:                              [  OK  ]
Shutting down interface eth2:  /etc/sysconfig/network-scripts/ifdown-eth: line 99: /sys/class/net/bond1/bonding/slaves: No such file or directory
                                                           [  OK  ]
Shutting down interface eth3:  /etc/sysconfig/network-scripts/ifdown-eth: line 99: /sys/class/net/bond1/bonding/slaves: No such file or directory
                                                           [  OK  ]
Shutting down loopback interface:                          [  OK  ]
Bringing up loopback interface:                            [  OK  ]
Bringing up interface bond0:  RTNETLINK answers: File exists
Error adding address 192.168.1.73 for bond0.
RTNETLINK answers: File exists
                                                           [  OK  ]
Bringing up interface bond1:                               [  OK  ]

Confirm that we've got our new bonded interfaces

ifconfig
bond0     Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE  
          inet addr:192.168.1.73  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::beae:c5ff:fe44:8ade/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:7200  Metric:1
          RX packets:1021 errors:0 dropped:0 overruns:0 frame:0
          TX packets:502 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:128516 (125.5 KiB)  TX bytes:95092 (92.8 KiB)

bond1     Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56  
          inet addr:192.168.3.73  Bcast:192.168.3.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe72:9b56/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:9000  Metric:1
          RX packets:787028 errors:0 dropped:0 overruns:0 frame:0
          TX packets:788651 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:65753950 (62.7 MiB)  TX bytes:1194295932 (1.1 GiB)

eth0      Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:7200  Metric:1
          RX packets:535 errors:0 dropped:0 overruns:0 frame:0
          TX packets:261 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:66786 (65.2 KiB)  TX bytes:47749 (46.6 KiB)
          Interrupt:31 Base address:0x8000 

eth1      Link encap:Ethernet  HWaddr BC:AE:C5:44:8A:DE  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:7200  Metric:1
          RX packets:486 errors:0 dropped:0 overruns:0 frame:0
          TX packets:241 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:61730 (60.2 KiB)  TX bytes:47343 (46.2 KiB)
          Interrupt:18 Memory:fe8e0000-fe900000 

eth2      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:9000  Metric:1
          RX packets:360190 errors:0 dropped:0 overruns:0 frame:0
          TX packets:394844 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:28756400 (27.4 MiB)  TX bytes:598159146 (570.4 MiB)
          Interrupt:17 Memory:fe9e0000-fea00000 

eth3      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:9000  Metric:1
          RX packets:426838 errors:0 dropped:0 overruns:0 frame:0
          TX packets:393807 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:36997550 (35.2 MiB)  TX bytes:596136786 (568.5 MiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
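
The bonding driver also reports per-bond state under /proc, including which slave is currently active and the link status of each leg; it is worth a quick look here and after any failover test:

cat /proc/net/bonding/bond0
cat /proc/net/bonding/bond1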

Configuring High-Availability Networking

There are seven bonding modes, which you can read about in detail in the kernel's bonding driver documentation. However, the RHCS stack only supports one of them, called Active/Passive Bonding, also known as mode=1.

Configuring Your Switches

This method provides no performance gain; instead, it treats the slaved interfaces as independent paths. One acts as the primary while the other sits dormant. On failure, the bond switches to the backup interface, promoting it to primary. Which interface is normally the primary, and under what conditions a restored link returns to the primary role, are both configurable should you wish to do so.
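
That behaviour is controlled by the bonding driver's primary and primary_reselect options. A sketch of how they might look in an ifcfg file's BONDING_OPTS line; these two options are illustrative only and are not used in the configuration later in this section:

# Prefer eth0; only make it active again if the currently active slave fails
BONDING_OPTS="mode=1 miimon=100 primary=eth0 primary_reselect=failure"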

Your managed switch will no doubt have one or more bonding (also known as trunking) configuration options. Likewise, your switches may be stackable. It is strongly advised that you do *not* stack your switches. Being unstacked, it is obviously not possible to configure trunking. Should you decide to disregard this, be very sure to extensively test failure and recovery of both switches under real-world workloads.

Still on the topic of switches: do not configure STP (spanning tree protocol) on any port connected to your cluster nodes! When a switch is added to the network, as is the case after restoring a lost switch, STP-enabled switches and the ports on those switches may block traffic for a period of time while STP renegotiates and reconfigures. This takes more than enough time to cause a cluster to partition. You may still enable and configure STP if you need to do so; simply ensure that you only do so on the appropriate ports.

Preparing The Bonding Driver

Before we modify the network, we will need to create the /etc/modprobe.d/bonding.conf file shown below.

With the switches unstacked and STP disabled, we can now configure the bonding interfaces.

vim /etc/modprobe.d/bonding.conf
alias bond0 bonding
alias bond1 bonding
alias bond2 bonding

If you only have four interfaces and plan to merge the SN and BCN networks, you can omit the bond2 entry.

To avoid the need to reboot, you can load the bonding driver immediately with modprobe bonding; the alias entries above will take effect on their own from the next boot onward.

Deciding Which NICs to Bond

If all of the interfaces in your server are identical, you can probably skip this step. Before you do, though, consider that not all of the PCIe interfaces may have all of their lanes connected, resulting in differing speeds. If you are unsure, I strongly recommend you run these tests.

TODO: Upload network_profiler.pl here and explain its use.
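
Until that script is posted, a rough stand-in is to push a fixed amount of data between two nodes over each interface in turn and compare dd's closing throughput figure. A minimal sketch using dd and nc (the IP and port are examples only; depending on the netcat variant, the listener may need 'nc -l -p 5001' instead):

# On the receiving node:
nc -l 5001 > /dev/null
# On the sending node, repeated once per interface/IP under test:
dd if=/dev/zero bs=1M count=1024 | nc 192.168.3.74 5001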


Once you've determined the various capabilities of your interfaces, pair them off with their closest-performing partners.

Keep in mind:

  • Any interface piggy-backing on an IPMI interface *must* be part of the BCN bond!
  • The fastest interfaces should be paired for the SN bond.
  • The lowest-latency interfaces should be used for the BCN bond.
  • The two remaining interfaces should be used for the IFN bond.

Creating the Bonds

Warning: This step will almost certainly leave you without network access to your servers. It is *strongly* advised that you do the next steps when you have physical access to your servers. If that is simply not possible, then proceed with extreme caution.

In my case, I found the following bonding configuration to be optimal:

  • eth0 and eth3 bonded as bond0.
  • eth1 and eth2 bonded as bond1.

I did not have enough interfaces for three bonds, so I will configure the following:

  • bond0 will be the IFN interface on the 192.168.1.0/24 subnet.
  • bond1 will be the merged BCN and SN interfaces on the 192.168.3.0/24 subnet.

TODO: Create/show the diffs for the following ifcfg-ethX files.

  • Create bond0 out of eth0 and eth3:
vim /etc/sysconfig/network-scripts/ifcfg-eth0
# Internet Facing Network - Link 1
HWADDR="BC:AE:C5:44:8A:DE"
DEVICE="eth0"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="no"
MASTER="bond0"
SLAVE="yes"
vim /etc/sysconfig/network-scripts/ifcfg-eth3
# Storage and Back Channel Networks - Link 2
HWADDR="00:0E:0C:59:46:E4"
DEVICE="eth3"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="no"
MASTER="bond0"
SLAVE="yes"
vim /etc/sysconfig/network-scripts/ifcfg-bond0
# Internet Facing Network - Bonded Interface
DEVICE="bond0"
BOOTPROTO="static"
NM_CONTROLLED="no"
ONBOOT="yes"
IPADDR="192.168.1.73"
NETMASK="255.255.255.0"
GATEWAY="192.168.1.254"
DNS1="192.139.81.117"
DNS2="192.139.81.1"
# Clustering *only* supports mode=1 (active-passive)
BONDING_OPTS="mode=1 miimon=100 use_carrier=0 updelay=0 downdelay=0"
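
The matching bond1 files follow the same pattern. The ifcfg-eth1 and ifcfg-eth2 files mirror the slave files above (each with its own HWADDR, plus MASTER="bond1" and SLAVE="yes"), and the bond itself reuses the merged SN/BCN address from earlier in this document; a sketch:

vim /etc/sysconfig/network-scripts/ifcfg-bond1
# Storage and Back Channel Networks - Bonded Interface
DEVICE="bond1"
BOOTPROTO="static"
NM_CONTROLLED="no"
ONBOOT="yes"
IPADDR="192.168.3.73"
NETMASK="255.255.255.0"
# Clustering *only* supports mode=1 (active-passive)
BONDING_OPTS="mode=1 miimon=100 use_carrier=0 updelay=0 downdelay=0"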

GFS2

Try adding noatime to the /etc/fstab options. h/t to Dak1n1; "it avoids cluster reads from turning into unnecessary writes, and can improve performance".
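
Applied to the /shared01 entry created earlier, that looks something like the line below (only noatime is new; everything else matches the existing fstab entry). Unmount and remount the filesystem, or restart the gfs2 init script, for it to take effect:

UUID=6c0d7d1d-a1d3-ed79-705d-28ee3d674e75 /shared01 gfs2 rw,suid,dev,exec,nouser,noatime,async 0 0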

 

Any questions, feedback, advice, complaints or meanderings are welcome.