2x5 Scalable Cluster Tutorial
Warning: This document is old, abandoned and very out of date. DON'T USE ANYTHING HERE! Consider it only as historical note taking. |
The Design
Storage
Storage, high-level:
[ Storage Cluster ]
_____________________________ _____________________________
| [ an-node01 ] | | [ an-node02 ] |
| _____ _____ | | _____ _____ |
| ( HDD ) ( SSD ) | | ( SSD ) ( HDD ) |
| (_____) (_____) __________| |__________ (_____) (_____) |
| | | | Storage =--\ /--= Storage | | | |
| | \----| Network || | | || Network |----/ | |
| \-------------|_________|| | | ||_________|-------------/ |
|_____________________________| | | |_____________________________|
__|_____|__
| HDD LUN |
| SDD LUN |
|___________|
|
_____|_____
| Floating |
| SAN IP |
[ VM Cluster ] |___________|
______________________________ | | | | | ______________________________
| [ an-node03 ] | | | | | | | [ an-node06 ] |
| _________ | | | | | | | _________ |
| | [ vmA ] | | | | | | | | | [ vmJ ] | |
| | _____ | | | | | | | | | _____ | |
| | (_hdd_)-=----\ | | | | | | | /----=-(_hdd_) | |
| |_________| | | | | | | | | | |_________| |
| _________ | | | | | | | | | _________ |
| | [ vmB ] | | | | | | | | | | | [ vmK ] | |
| | _____ | | | | | | | | | | | _____ | |
| | (_hdd_)-=--\ | __________| | | | | | |__________ | /--=-(_hdd_) | |
| |_________| | \--| Storage =--/ | | | \--= Storage |--/ | |_________| |
| _________ \----| Network || | | | || Network |----/ _________ |
| | [ vmC ] | /----|_________|| | | | ||_________|----\ | [ vmL ] | |
| | _____ | | | | | | | | | _____ | |
| | (_hdd_)-=--/ | | | | | \--=-(_hdd_) | |
| |_________| | | | | | |_________| |
|______________________________| | | | |______________________________|
______________________________ | | | ______________________________
| [ an-node04 ] | | | | | [ an-node07 ] |
| _________ | | | | | _________ |
| | [ vmD ] | | | | | | | [ vmM ] | |
| | _____ | | | | | | | _____ | |
| | (_hdd_)-=----\ | | | | | /----=-(_hdd_) | |
| |_________| | | | | | | | |_________| |
| _________ | | | | | | | _________ |
| | [ vmE ] | | | | | | | | | [ vmN ] | |
| | _____ | | | | | | | | | _____ | |
| | (_hdd_)-=--\ | __________| | | | |__________ | /--=-(_hdd_) | |
| |_________| | \--| Storage =----/ | \----= Storage |--/ | |_________| |
| _________ \----| Network || | || Network |----/ _________ |
| | [ vmF ] | /----|_________|| | ||_________|----\ | [ vmO ] | |
| | _____ | | | | | | | _____ | |
| | (_hdd_)-=--+ | | | \--=-(_hdd_) | |
| | (_ssd_)-=--/ | | | |_________| |
| |_________| | | | |
|______________________________| | |______________________________|
______________________________ |
| [ an-node05 ] | |
| _________ | |
| | [ vmG ] | | |
| | _____ | | |
| | (_hdd_)-=----\ | |
| |_________| | | |
| _________ | | |
| | [ vmH ] | | | |
| | _____ | | | |
| | (_hdd_)-=--\ | | |
| | (_sdd_)-=--+ | __________| |
| |_________| | \--| Storage =------/
| _________ \----| Network ||
| | [ vmI ] | /----|_________||
| | _____ | | |
| | (_hdd_)-=--/ |
| |_________| |
|______________________________|
Long View
Note: Yes, this is a big graphic, but this is also a big project. I am no artist though, and any help making this clearer is greatly appreciated! |
Failure Mapping
VM Cluster; Guest VM failure migration planning;
- Each node can host 5 VMs @ 2GB/VM.
- This is an N-1 cluster with five nodes; 20 VMs total.
| All | an-node03 | an-node04 | an-node05 | an-node06 | an-node07 |
| on-line | down | down | down | down | down |
----------+-----------+-----------+-----------+-----------+-----------+-----------+
an-node03 | vm01 | -- | vm01 | vm01 | vm01 | vm01 |
| vm02 | -- | vm02 | vm02 | vm02 | vm02 |
| vm03 | -- | vm03 | vm03 | vm03 | vm03 |
| vm04 | -- | vm04 | vm04 | vm04 | vm04 |
| -- | -- | vm05 | vm09 | vm13 | vm17 |
----------|-----------|-----------|-----------|-----------|-----------|-----------|
an-node04 | vm05 | vm05 | -- | vm05 | vm05 | vm05 |
| vm06 | vm06 | -- | vm06 | vm06 | vm06 |
| vm07 | vm07 | -- | vm07 | vm07 | vm07 |
| vm08 | vm08 | -- | vm08 | vm08 | vm08 |
| -- | vm01 | -- | vm10 | vm14 | vm18 |
----------|-----------|-----------|-----------|-----------|-----------|-----------|
an-node05 | vm09 | vm09 | vm09 | -- | vm09 | vm09 |
| vm10 | vm10 | vm10 | -- | vm10 | vm10 |
| vm11 | vm11 | vm11 | -- | vm11 | vm11 |
| vm12 | vm12 | vm12 | -- | vm12 | vm12 |
| -- | vm02 | vm06 | -- | vm15 | vm19 |
----------|-----------|-----------|-----------|-----------|-----------|-----------|
an-node06 | vm13 | vm13 | vm13 | vm13 | -- | vm13 |
| vm14 | vm14 | vm14 | vm14 | -- | vm14 |
| vm15 | vm15 | vm15 | vm15 | -- | vm15 |
| vm16 | vm16 | vm16 | vm16 | -- | vm16 |
| -- | vm03 | vm07 | vm11 | -- | vm20 |
----------|-----------|-----------|-----------|-----------|-----------|-----------|
an-node07 | vm17 | vm17 | vm17 | vm17 | vm17 | -- |
| vm18 | vm18 | vm18 | vm18 | vm18 | -- |
| vm19 | vm19 | vm19 | vm19 | vm19 | -- |
| vm20 | vm20 | vm20 | vm20 | vm20 | -- |
| -- | vm04 | vm08 | vm12 | vm16 | -- |
----------+-----------+-----------+-----------+-----------+-----------+-----------+
Cluster Overview
Note: This is not programmatically accurate! |
This is meant to show, at a logical level, how the parts of a cluster work together. It is the first draft and is likely defective in terrible ways.
[ Resource Managment ]
___________ ___________
| | | |
| Service A | | Service B |
|___________| |___________|
| | |
__|_____|__ ___|_______________
| | | |
| RGManager | | Clustered Storage |================================================.
|___________| |___________________| |
| | |
|__________________|______________ |
| \ |
_________ ____|____ | |
| | | | | |
/------| Fencing |----| Locking | | |
| |_________| |_________| | |
_|___________|_____________|______________________|__________________________________________|_____
| | | | |
| ______|_____ ____|___ | |
| | | | | | |
| | Membership | | Quorum | | |
| |____________| |________| | |
| |____________| | |
| __|__ | |
| / \ | |
| { Totem } | |
| \_____/ | |
| __________________|_______________________|_______________ ______________ |
| |-----------|-----------|----------------|-----------------|--------------| |
| ___|____ ___|____ ___|____ ___|____ _____|_____ _____|_____ __|___
| | | | | | | | | | | | | | |
| | Node 1 | | Node 2 | | Node 3 | ... | Node N | | Storage 1 |==| Storage 2 |==| DRBD |
| |________| |________| |________| |________| |___________| |___________| |______|
\_____|___________|___________|________________|_________________|______________|
[ Cluster Communication ]
Network IPs
SAN: 10.10.1.1
Node:
| IFN | SN | BCN | IPMI |
----------+-------------+------------+-----------+-----------+
an-node01 | 10.255.0.1 | 10.10.0.1 | 10.20.0.1 | 10.20.1.1 |
an-node02 | 10.255.0.2 | 10.10.0.2 | 10.20.0.2 | 10.20.1.2 |
an-node03 | 10.255.0.3 | 10.10.0.3 | 10.20.0.3 | 10.20.1.3 |
an-node04 | 10.255.0.4 | 10.10.0.4 | 10.20.0.4 | 10.20.1.4 |
an-node05 | 10.255.0.5 | 10.10.0.5 | 10.20.0.5 | 10.20.1.5 |
an-node06 | 10.255.0.6 | 10.10.0.6 | 10.20.0.6 | 10.20.1.6 |
an-node07 | 10.255.0.7 | 10.10.0.7 | 10.20.0.7 | 10.20.1.7 |
----------+-------------+------------+-----------+-----------+
Aux Equipment:
| BCN |
----------+-------------+
pdu1 | 10.20.2.1 |
pdu2 | 10.20.2.2 |
switch1 | 10.20.2.3 |
switch2 | 10.20.2.4 |
ups1 | 10.20.2.5 |
ups2 | 10.20.2.6 |
----------+-------------+
VMs:
| VMN |
----------+-------------+
vm01 | 10.254.0.1 |
vm02 | 10.254.0.2 |
vm03 | 10.254.0.3 |
vm04 | 10.254.0.4 |
vm05 | 10.254.0.5 |
vm06 | 10.254.0.6 |
vm07 | 10.254.0.7 |
vm08 | 10.254.0.8 |
vm09 | 10.254.0.9 |
vm10 | 10.254.0.10 |
vm11 | 10.254.0.11 |
vm12 | 10.254.0.12 |
vm13 | 10.254.0.13 |
vm14 | 10.254.0.14 |
vm15 | 10.254.0.15 |
vm16 | 10.254.0.16 |
vm17 | 10.254.0.17 |
vm18 | 10.254.0.18 |
vm19 | 10.254.0.19 |
vm20 | 10.254.0.20 |
----------+-------------+
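The cluster.conf below points its IPMI fence devices at names like an-node01.ipmi, so those names need to resolve on every node. A sketch of matching /etc/hosts entries, built from the IPMI column above (the ".ipmi" suffix is just the naming convention assumed here):
# IPMI BMCs, used by the fence_ipmilan devices in cluster.conf.
10.20.1.1   an-node01.ipmi
10.20.1.2   an-node02.ipmi
10.20.1.3   an-node03.ipmi
10.20.1.4   an-node04.ipmi
10.20.1.5   an-node05.ipmi
10.20.1.6   an-node06.ipmi
10.20.1.7   an-node07.ipmi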
Install The Cluster Software
If you are using Red Hat Enterprise Linux, you will need to add the RHEL Server Optional (v. 6 64-bit x86_64) channel for each node in your cluster. You can do this in RHN by going to your subscription management page, clicking on each server, clicking on "Alter Channel Subscriptions", ticking the RHEL Server Optional (v. 6 64-bit x86_64) channel and then clicking on "Change Subscription".
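If you prefer the command line over the RHN web interface, the rhn-channel tool can do the same thing. This is only a sketch; the channel label and the RHN credentials below are assumptions, so confirm the label with --available-channels first.
# List the channel labels this system is entitled to (run on each node).
rhn-channel --available-channels --user admin --password secret
# Subscribe this node to the Optional channel (label shown is an assumption; use whatever the list above reports).
rhn-channel --add --channel rhel-x86_64-server-optional-6 --user admin --password secret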
The actual installation is simple; just use yum to install cman and its supporting packages.
yum install cman fence-agents rgmanager resource-agents lvm2-cluster gfs2-utils python-virtinst libvirt qemu-kvm-tools qemu-kvm virt-manager virt-viewer virtio-win
Initial Config
Everything uses ricci, which itself needs to have a password set. I set this to match root.
Both:
passwd ricci
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
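Setting the password isn't enough on its own; the ricci daemon also has to be running (and set to start on boot) on every node before tools like cman_tool version -r can push configuration updates. A minimal sketch:
chkconfig ricci on
/etc/init.d/ricci start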
With these decisions and the information gathered, here is what our first /etc/cluster/cluster.conf file will look like.
touch /etc/cluster/cluster.conf
vim /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="1" name="an-cluster">
<cman expected_votes="1" two_node="1" />
<clusternodes>
<clusternode name="an-node01.alteeve.ca" nodeid="1">
<fence>
<method name="ipmi">
<device action="reboot" name="ipmi_an01" />
</method>
<method name="pdu">
<device action="reboot" name="pdu1" port="1" />
<device action="reboot" name="pdu2" port="1" />
</method>
</fence>
</clusternode>
<clusternode name="an-node02.alteeve.ca" nodeid="2">
<fence>
<method name="ipmi">
<device action="reboot" name="ipmi_an02" />
</method>
<method name="pdu">
<device action="reboot" name="pdu1" port="2" />
<device action="reboot" name="pdu2" port="2" />
</method>
</fence>
</clusternode>
<clusternode name="an-node03.alteeve.ca" nodeid="3">
<fence>
<method name="ipmi">
<device action="reboot" name="ipmi_an03" />
</method>
<method name="pdu">
<device action="reboot" name="pdu1" port="3" />
<device action="reboot" name="pdu2" port="3" />
</method>
</fence>
</clusternode>
<clusternode name="an-node04.alteeve.ca" nodeid="4">
<fence>
<method name="ipmi">
<device action="reboot" name="ipmi_an04" />
</method>
<method name="pdu">
<device action="reboot" name="pdu1" port="4" />
<device action="reboot" name="pdu2" port="4" />
</method>
</fence>
</clusternode>
<clusternode name="an-node05.alteeve.ca" nodeid="5">
<fence>
<method name="ipmi">
<device action="reboot" name="ipmi_an05" />
</method>
<method name="pdu">
<device action="reboot" name="pdu1" port="5" />
<device action="reboot" name="pdu2" port="5" />
</method>
</fence>
</clusternode>
<clusternode name="an-node06.alteeve.ca" nodeid="6">
<fence>
<method name="ipmi">
<device action="reboot" name="ipmi_an06" />
</method>
<method name="pdu">
<device action="reboot" name="pdu1" port="6" />
<device action="reboot" name="pdu2" port="6" />
</method>
</fence>
</clusternode>
<clusternode name="an-node07.alteeve.ca" nodeid="7">
<fence>
<method name="ipmi">
<device action="reboot" name="ipmi_an07" />
</method>
<method name="pdu">
<device action="reboot" name="pdu1" port="7" />
<device action="reboot" name="pdu2" port="7" />
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice agent="fence_ipmilan" ipaddr="an-node01.ipmi" login="root" name="ipmi_an01" passwd="secret" />
<fencedevice agent="fence_ipmilan" ipaddr="an-node02.ipmi" login="root" name="ipmi_an02" passwd="secret" />
<fencedevice agent="fence_ipmilan" ipaddr="an-node03.ipmi" login="root" name="ipmi_an03" passwd="secret" />
<fencedevice agent="fence_ipmilan" ipaddr="an-node04.ipmi" login="root" name="ipmi_an04" passwd="secret" />
<fencedevice agent="fence_ipmilan" ipaddr="an-node05.ipmi" login="root" name="ipmi_an05" passwd="secret" />
<fencedevice agent="fence_ipmilan" ipaddr="an-node06.ipmi" login="root" name="ipmi_an06" passwd="secret" />
<fencedevice agent="fence_ipmilan" ipaddr="an-node07.ipmi" login="root" name="ipmi_an07" passwd="secret" />
<fencedevice agent="fence_apc_snmp" ipaddr="pdu1.alteeve.ca" name="pdu1" />
<fencedevice agent="fence_apc_snmp" ipaddr="pdu2.alteeve.ca" name="pdu2" />
</fencedevices>
<fence_daemon post_join_delay="30" />
<totem rrp_mode="none" secauth="off" />
<rm>
<resources>
<ip address="10.10.1.1" monitor_link="on" />
<script file="/etc/init.d/tgtd" name="tgtd" />
<script file="/etc/init.d/drbd" name="drbd" />
<script file="/etc/init.d/clvmd" name="clvmd" />
<script file="/etc/init.d/gfs2" name="gfs2" />
<script file="/etc/init.d/libvirtd" name="libvirtd" />
</resources>
<failoverdomains>
<!-- Used for storage -->
<!-- SAN Nodes -->
<failoverdomain name="an1_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node01.alteeve.ca" />
</failoverdomain>
<failoverdomain name="an2_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node02.alteeve.ca" />
</failoverdomain>
<!-- VM Nodes -->
<failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" />
</failoverdomain>
<failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node04.alteeve.ca" />
</failoverdomain>
<failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node05.alteeve.ca" />
</failoverdomain>
<failoverdomain name="an6_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node06.alteeve.ca" />
</failoverdomain>
<failoverdomain name="an7_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node07.alteeve.ca" />
</failoverdomain>
<!-- Domain for the SAN -->
<failoverdomain name="an1_primary" nofailback="1" ordered="1" restricted="0">
<failoverdomainnode name="an-node01.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node02.alteeve.ca" priority="2" />
</failoverdomain>
<!-- Domains for VMs running primarily on an-node03 -->
<failoverdomain name="an3_an4" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node04.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an3_an5" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node05.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an3_an6" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node06.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an3_an7" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node07.alteeve.ca" priority="2" />
</failoverdomain>
<!-- Domains for VMs running primarily on an-node04 -->
<failoverdomain name="an4_an3" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node04.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node03.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an4_an5" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node04.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node05.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an4_an6" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node04.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node06.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an4_an7" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node04.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node07.alteeve.ca" priority="2" />
</failoverdomain>
<!-- Domains for VMs running primarily on an-node05 -->
<failoverdomain name="an5_an3" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node05.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node03.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an5_an4" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node05.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node04.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an5_an6" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node05.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node06.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an5_an7" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node05.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node07.alteeve.ca" priority="2" />
</failoverdomain>
<!-- Domains for VMs running primarily on an-node06 -->
<failoverdomain name="an6_an3" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node06.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node03.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an6_an4" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node06.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node04.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an6_an5" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node06.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node05.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an6_an7" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node06.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node07.alteeve.ca" priority="2" />
</failoverdomain>
<!-- Domains for VMs running primarily on an-node07 -->
<failoverdomain name="an7_an3" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node07.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node03.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an7_an4" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node07.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node04.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an7_an5" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node07.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node05.alteeve.ca" priority="2" />
</failoverdomain>
<failoverdomain name="an7_an6" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node07.alteeve.ca" priority="1" />
<failoverdomainnode name="an-node06.alteeve.ca" priority="2" />
</failoverdomain>
</failoverdomains>
<!-- SAN Services -->
<service autostart="1" domain="an1_only" exclusive="0" max_restarts="0" name="an1_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="tgtd" />
</script>
</script>
</service>
<service autostart="1" domain="an2_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="tgtd" />
</script>
</script>
</service>
<service autostart="1" domain="an1_primary" name="san_ip" recovery="relocate">
<ip ref="10.10.1.1" />
</service>
<!-- VM Storage services. -->
<service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an3_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd" />
</script>
</script>
</script>
</service>
<service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd" />
</script>
</script>
</script>
</service>
<service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd" />
</script>
</script>
</script>
</service>
<service autostart="1" domain="an6_only" exclusive="0" max_restarts="0" name="an6_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd" />
</script>
</script>
</script>
</service>
<service autostart="1" domain="an7_only" exclusive="0" max_restarts="0" name="an7_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd" />
</script>
</script>
</script>
</service>
<!-- VM Services -->
<!-- VMs running primarily on an-node03 -->
<vm name="vm01" domain="an03_an04" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm02" domain="an03_an05" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm03" domain="an03_an06" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm04" domain="an03_an07" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<!-- VMs running primarily on an-node04 -->
<vm name="vm05" domain="an04_an03" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm06" domain="an04_an05" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm07" domain="an04_an06" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm08" domain="an04_an07" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<!-- VMs running primarily on an-node05 -->
<vm name="vm09" domain="an05_an03" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm10" domain="an05_an04" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm11" domain="an05_an06" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm12" domain="an05_an07" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<!-- VMs running primarily on an-node06 -->
<vm name="vm13" domain="an06_an03" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm14" domain="an06_an04" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm15" domain="an06_an05" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm16" domain="an06_an07" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<!-- VMs running primarily on an-node07 -->
<vm name="vm17" domain="an07_an03" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm18" domain="an07_an04" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm19" domain="an07_an05" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm20" domain="an07_an06" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
</rm>
</cluster>
Save the file, then validate it. If it fails, address the errors and try again.
ip addr list | grep <ip>
rg_test test /etc/cluster/cluster.conf
ccs_config_validate
Configuration validates
Push it to the other nodes (an-node02 shown here; repeat for the rest):
rsync -av /etc/cluster/cluster.conf root@an-node02:/etc/cluster/
sending incremental file list
cluster.conf
sent 781 bytes received 31 bytes 541.33 bytes/sec
total size is 701 speedup is 0.86
Start:
DO NOT PROCEED UNTIL YOUR cluster.conf FILE VALIDATES!
Unless you have it perfect, your cluster will fail.
Once it validates, proceed.
Starting The Cluster For The First Time
By default, if you start one node only and you've enabled the <cman two_node="1" expected_votes="1"/> option as we have done, the lone server will effectively gain quorum. It will try to connect to the cluster, but there won't be a cluster to connect to, so it will fence the other node after a timeout period. This timeout is 6 seconds by default.
For now, we will leave the default as it is. If you're interested in changing it though, the argument you are looking for is post_join_delay.
This behaviour means that we'll want to start both nodes well within six seconds of one another, lest the slower one get needlessly fenced.
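For reference, this is the fence_daemon attribute in cluster.conf; the configuration above already raises it from the 6 second default to 30 seconds:
<fence_daemon post_join_delay="30" />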
Left off here
Note to help minimize dual-fences:
- you could add FENCED_OPTS="-f 5" to /etc/sysconfig/cman on *one* node (iLO fence devices may need this); see the sketch below
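A minimal sketch of that change, assuming an-node01 is the node chosen to wait:
an-node01:
# Delay this node's fence actions by 5 seconds so the two nodes don't fence each other simultaneously.
echo 'FENCED_OPTS="-f 5"' >> /etc/sysconfig/cman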
DRBD Config
Install from source:
Both:
# Obliterate peer - fence via cman
wget -c https://alteeve.ca/files/an-cluster/sbin/obliterate-peer.sh -O /sbin/obliterate-peer.sh
chmod a+x /sbin/obliterate-peer.sh
ls -lah /sbin/obliterate-peer.sh
# Download, compile and install DRBD
wget -c http://oss.linbit.com/drbd/8.3/drbd-8.3.11.tar.gz
tar -xvzf drbd-8.3.11.tar.gz
cd drbd-8.3.11
./configure \
--prefix=/usr \
--localstatedir=/var \
--sysconfdir=/etc \
--with-utils \
--with-km \
--with-udev \
--with-pacemaker \
--with-rgmanager \
--with-bashcompletion
make
make install
chkconfig --add drbd
chkconfig drbd off
Configure
an-node01:
# Configure DRBD's global value.
cp /etc/drbd.d/global_common.conf /etc/drbd.d/global_common.conf.orig
vim /etc/drbd.d/global_common.conf
diff -u /etc/drbd.d/global_common.conf.orig /etc/drbd.d/global_common.conf
--- /etc/drbd.d/global_common.conf.orig 2011-08-01 21:58:46.000000000 -0400
+++ /etc/drbd.d/global_common.conf 2011-08-01 23:18:27.000000000 -0400
@@ -15,24 +15,35 @@
# out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
# before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
# after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
+ fence-peer "/sbin/obliterate-peer.sh";
}
startup {
# wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
+ become-primary-on both;
+ wfc-timeout 300;
+ degr-wfc-timeout 120;
}
disk {
# on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
# no-disk-drain no-md-flushes max-bio-bvecs
+ fencing resource-and-stonith;
}
net {
# sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
# max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
# after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
+ allow-two-primaries;
+ after-sb-0pri discard-zero-changes;
+ after-sb-1pri discard-secondary;
+ after-sb-2pri disconnect;
}
syncer {
# rate after al-extents use-rle cpu-mask verify-alg csums-alg
+ # This should be no more than 30% of the maximum sustainable write speed.
+ rate 20M;
}
}
vim /etc/drbd.d/r0.res
resource r0 {
device /dev/drbd0;
meta-disk internal;
on an-node01.alteeve.ca {
address 192.168.2.71:7789;
disk /dev/sda5;
}
on an-node02.alteeve.ca {
address 192.168.2.72:7789;
disk /dev/sda5;
}
}
cp /etc/drbd.d/r0.res /etc/drbd.d/r1.res
vim /etc/drbd.d/r1.res
resource r1 {
device /dev/drbd1;
meta-disk internal;
on an-node01.alteeve.ca {
address 192.168.2.71:7790;
disk /dev/sdb1;
}
on an-node02.alteeve.ca {
address 192.168.2.72:7790;
disk /dev/sdb1;
}
}
Note: If you have multiple DRBD resources on one (set of) backing disks, consider adding syncer { after <minor-1>; }. For example, tell /dev/drbd1 to wait for /dev/drbd0 by adding syncer { after 0; }. This will prevent simultaneous resyncs, which could seriously impact performance. A resource will wait in this state until the resource it depends on has finished syncing; see the sketch below. |
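If you did want that here, a sketch of how r1.res from above would look with the syncer block added:
resource r1 {
device /dev/drbd1;
meta-disk internal;
syncer {
# Wait for minor 0 (/dev/drbd0, resource r0) to finish any resync before this resource starts its own.
after 0;
}
on an-node01.alteeve.ca {
address 192.168.2.71:7790;
disk /dev/sdb1;
}
on an-node02.alteeve.ca {
address 192.168.2.72:7790;
disk /dev/sdb1;
}
}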
Validate:
drbdadm dump
--== Thank you for participating in the global usage survey ==--
The server's response is:
you are the 369th user to install this version
# /usr/etc/drbd.conf
common {
protocol C;
net {
allow-two-primaries;
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
}
disk {
fencing resource-and-stonith;
}
syncer {
rate 20M;
}
startup {
wfc-timeout 300;
degr-wfc-timeout 120;
become-primary-on both;
}
handlers {
pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
fence-peer /sbin/obliterate-peer.sh;
}
}
# resource r0 on an-node01.alteeve.ca: not ignored, not stacked
resource r0 {
on an-node01.alteeve.ca {
device /dev/drbd0 minor 0;
disk /dev/sda5;
address ipv4 192.168.2.71:7789;
meta-disk internal;
}
on an-node02.alteeve.ca {
device /dev/drbd0 minor 0;
disk /dev/sda5;
address ipv4 192.168.2.72:7789;
meta-disk internal;
}
}
# resource r1 on an-node01.alteeve.ca: not ignored, not stacked
resource r1 {
on an-node01.alteeve.ca {
device /dev/drbd1 minor 1;
disk /dev/sdb1;
address ipv4 192.168.2.71:7790;
meta-disk internal;
}
on an-node02.alteeve.ca {
device /dev/drbd1 minor 1;
disk /dev/sdb1;
address ipv4 192.168.2.72:7790;
meta-disk internal;
}
}
rsync -av /etc/drbd.d root@an-node02:/etc/
drbd.d/
drbd.d/global_common.conf
drbd.d/global_common.conf.orig
drbd.d/r0.res
drbd.d/r1.res
sent 3523 bytes received 110 bytes 7266.00 bytes/sec
total size is 3926 speedup is 1.08
Initialize and First start
Both:
Create the meta-data.
modprobe drbd
drbdadm create-md r{0,1}
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success
Attach, connect and confirm (after both have attached and connected):
drbdadm attach r{0,1}
drbdadm connect r{0,1}
cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:441969960
1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:29309628
There is no data, so force both devices to be instantly UpToDate:
drbdadm -- --clear-bitmap new-current-uuid r{0,1}
cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
Set both to primary and run a final check.
drbdadm primary r{0,1}
cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:0 dw:0 dr:672 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:0 dw:0 dr:672 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
Update the cluster
vim /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="17" name="an-clusterA">
<cman expected_votes="1" two_node="1"/>
<totem rrp_mode="none" secauth="off"/>
<clusternodes>
<clusternode name="an-node01.alteeve.ca" nodeid="1">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="1"/>
</method>
</fence>
</clusternode>
<clusternode name="an-node02.alteeve.ca" nodeid="2">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="2"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
</fencedevices>
<fence_daemon post_join_delay="30"/>
<rm>
<resources>
<ip address="192.168.2.100" monitor_link="on"/>
<script file="/etc/init.d/drbd" name="drbd"/>
<script file="/etc/init.d/clvmd" name="clvmd"/>
<script file="/etc/init.d/tgtd" name="tgtd"/>
</resources>
<failoverdomains>
<failoverdomain name="an1_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node01.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an2_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node02.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an1_primary" nofailback="1" ordered="1" restricted="0">
<failoverdomainnode name="an-node01.alteeve.ca" priority="1"/>
<failoverdomainnode name="an-node02.alteeve.ca" priority="2"/>
</failoverdomain>
</failoverdomains>
<service autostart="0" domain="an1_only" exclusive="0" max_restarts="0" name="an1_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd"/>
</script>
</service>
<service autostart="0" domain="an2_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd"/>
</script>
</service>
</rm>
</cluster>
rg_test test /etc/cluster/cluster.conf
Running in test mode.
Loading resource rule from /usr/share/cluster/oralistener.sh
Loading resource rule from /usr/share/cluster/apache.sh
Loading resource rule from /usr/share/cluster/SAPDatabase
Loading resource rule from /usr/share/cluster/postgres-8.sh
Loading resource rule from /usr/share/cluster/lvm.sh
Loading resource rule from /usr/share/cluster/mysql.sh
Loading resource rule from /usr/share/cluster/lvm_by_vg.sh
Loading resource rule from /usr/share/cluster/service.sh
Loading resource rule from /usr/share/cluster/samba.sh
Loading resource rule from /usr/share/cluster/SAPInstance
Loading resource rule from /usr/share/cluster/checkquorum
Loading resource rule from /usr/share/cluster/ocf-shellfuncs
Loading resource rule from /usr/share/cluster/svclib_nfslock
Loading resource rule from /usr/share/cluster/script.sh
Loading resource rule from /usr/share/cluster/clusterfs.sh
Loading resource rule from /usr/share/cluster/fs.sh
Loading resource rule from /usr/share/cluster/oracledb.sh
Loading resource rule from /usr/share/cluster/nfsserver.sh
Loading resource rule from /usr/share/cluster/netfs.sh
Loading resource rule from /usr/share/cluster/orainstance.sh
Loading resource rule from /usr/share/cluster/vm.sh
Loading resource rule from /usr/share/cluster/lvm_by_lv.sh
Loading resource rule from /usr/share/cluster/tomcat-6.sh
Loading resource rule from /usr/share/cluster/ip.sh
Loading resource rule from /usr/share/cluster/nfsexport.sh
Loading resource rule from /usr/share/cluster/openldap.sh
Loading resource rule from /usr/share/cluster/ASEHAagent.sh
Loading resource rule from /usr/share/cluster/nfsclient.sh
Loading resource rule from /usr/share/cluster/named.sh
Loaded 24 resource rules
=== Resources List ===
Resource type: ip
Instances: 1/1
Agent: ip.sh
Attributes:
address = 192.168.2.100 [ primary unique ]
monitor_link = on
nfslock [ inherit("service%nfslock") ]
Resource type: script
Agent: script.sh
Attributes:
name = drbd [ primary unique ]
file = /etc/init.d/drbd [ unique required ]
service_name [ inherit("service%name") ]
Resource type: script
Agent: script.sh
Attributes:
name = clvmd [ primary unique ]
file = /etc/init.d/clvmd [ unique required ]
service_name [ inherit("service%name") ]
Resource type: script
Agent: script.sh
Attributes:
name = tgtd [ primary unique ]
file = /etc/init.d/tgtd [ unique required ]
service_name [ inherit("service%name") ]
Resource type: service [INLINE]
Instances: 1/1
Agent: service.sh
Attributes:
name = an1_storage [ primary unique required ]
domain = an1_only [ reconfig ]
autostart = 0 [ reconfig ]
exclusive = 0 [ reconfig ]
nfslock = 0
nfs_client_cache = 0
recovery = restart [ reconfig ]
depend_mode = hard
max_restarts = 0
restart_expire_time = 0
priority = 0
Resource type: service [INLINE]
Instances: 1/1
Agent: service.sh
Attributes:
name = an2_storage [ primary unique required ]
domain = an2_only [ reconfig ]
autostart = 0 [ reconfig ]
exclusive = 0 [ reconfig ]
nfslock = 0
nfs_client_cache = 0
recovery = restart [ reconfig ]
depend_mode = hard
max_restarts = 0
restart_expire_time = 0
priority = 0
Resource type: service [INLINE]
Instances: 1/1
Agent: service.sh
Attributes:
name = san_ip [ primary unique required ]
domain = an1_primary [ reconfig ]
autostart = 0 [ reconfig ]
exclusive = 0 [ reconfig ]
nfslock = 0
nfs_client_cache = 0
recovery = relocate [ reconfig ]
depend_mode = hard
max_restarts = 0
restart_expire_time = 0
priority = 0
=== Resource Tree ===
service (S0) {
name = "an1_storage";
domain = "an1_only";
autostart = "0";
exclusive = "0";
nfslock = "0";
nfs_client_cache = "0";
recovery = "restart";
depend_mode = "hard";
max_restarts = "0";
restart_expire_time = "0";
priority = "0";
script (S0) {
name = "drbd";
file = "/etc/init.d/drbd";
service_name = "an1_storage";
script (S0) {
name = "clvmd";
file = "/etc/init.d/clvmd";
service_name = "an1_storage";
}
}
}
service (S0) {
name = "an2_storage";
domain = "an2_only";
autostart = "0";
exclusive = "0";
nfslock = "0";
nfs_client_cache = "0";
recovery = "restart";
depend_mode = "hard";
max_restarts = "0";
restart_expire_time = "0";
priority = "0";
script (S0) {
name = "drbd";
file = "/etc/init.d/drbd";
service_name = "an2_storage";
script (S0) {
name = "clvmd";
file = "/etc/init.d/clvmd";
service_name = "an2_storage";
}
}
}
service (S0) {
name = "san_ip";
domain = "an1_primary";
autostart = "0";
exclusive = "0";
nfslock = "0";
nfs_client_cache = "0";
recovery = "relocate";
depend_mode = "hard";
max_restarts = "0";
restart_expire_time = "0";
priority = "0";
ip (S0) {
address = "192.168.2.100";
monitor_link = "on";
nfslock = "0";
}
}
=== Failover Domains ===
Failover domain: an1_only
Flags: Restricted No Failback
Node an-node01.alteeve.ca (id 1, priority 0)
Failover domain: an2_only
Flags: Restricted No Failback
Node an-node02.alteeve.ca (id 2, priority 0)
Failover domain: an1_primary
Flags: Ordered No Failback
Node an-node01.alteeve.ca (id 1, priority 1)
Node an-node02.alteeve.ca (id 2, priority 2)
=== Event Triggers ===
Event Priority Level 100:
Name: Default
(Any event)
File: /usr/share/cluster/default_event_script.sl
[root@an-node01 ~]# cman_tool version -r
You have not authenticated to the ricci daemon on an-node01.alteeve.ca
Password:
[root@an-node01 ~]# clusvcadm -e service:an1_storage
Local machine trying to enable service:an1_storage...Success
service:an1_storage is now running on an-node01.alteeve.ca
[root@an-node01 ~]# cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:0 dw:0 dr:924 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:0 dw:0 dr:916 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
cman_tool version -r
You have not authenticated to the ricci daemon on an-node01.alteeve.ca
Password:
an-node01:
clusvcadm -e service:an1_storage
service:an1_storage is now running on an-node01.alteeve.ca
an-node02:
clusvcadm -e service:an2_storage
service:an2_storage is now running on an-node02.alteeve.ca
Either node:
cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by root@an-node01.alteeve.ca, 2011-08-01 22:04:32
0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:0 dw:0 dr:924 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:0 dw:0 dr:916 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
Configure Clustered LVM
an-node01:
cp /etc/lvm/lvm.conf /etc/lvm/lvm.conf.orig
vim /etc/lvm/lvm.conf
diff -u /etc/lvm/lvm.conf.orig /etc/lvm/lvm.conf
--- /etc/lvm/lvm.conf.orig 2011-08-02 21:59:01.000000000 -0400
+++ /etc/lvm/lvm.conf 2011-08-02 22:00:17.000000000 -0400
@@ -50,7 +50,8 @@
# By default we accept every block device:
- filter = [ "a/.*/" ]
+ #filter = [ "a/.*/" ]
+ filter = [ "a|/dev/drbd*|", "r/.*/" ]
# Exclude the cdrom drive
# filter = [ "r|/dev/cdrom|" ]
@@ -308,7 +309,8 @@
# Type 3 uses built-in clustered locking.
# Type 4 uses read-only locking which forbids any operations that might
# change metadata.
- locking_type = 1
+ #locking_type = 1
+ locking_type = 3
# Set to 0 to fail when a lock request cannot be satisfied immediately.
wait_for_locks = 1
@@ -324,7 +326,8 @@
# to 1 an attempt will be made to use local file-based locking (type 1).
# If this succeeds, only commands against local volume groups will proceed.
# Volume Groups marked as clustered will be ignored.
- fallback_to_local_locking = 1
+ #fallback_to_local_locking = 1
+ fallback_to_local_locking = 0
# Local non-LV directory that holds file-based locks while commands are
# in progress. A directory like /tmp that may get wiped on reboot is OK.
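As an aside, the lvm2-cluster package ships an lvmconf helper that can flip the locking type for you (a sketch; it does not touch the filter line, so that still has to be edited by hand):
lvmconf --enable-cluster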
rsync -av /etc/lvm/lvm.conf root@an-node02:/etc/lvm/
sending incremental file list
lvm.conf
sent 2412 bytes received 247 bytes 5318.00 bytes/sec
total size is 24668 speedup is 9.28
Create the LVM PVs, VGs and LVs.
an-node01:
pvcreate /dev/drbd{0,1}
Physical volume "/dev/drbd0" successfully created
Physical volume "/dev/drbd1" successfully created
an-node02:
pvscan
PV /dev/drbd0 lvm2 [421.50 GiB]
PV /dev/drbd1 lvm2 [27.95 GiB]
Total: 2 [449.45 GiB] / in use: 0 [0 ] / in no VG: 2 [449.45 GiB]
an-node01:
vgcreate -c y hdd_vg0 /dev/drbd0 && vgcreate -c y ssd_vg0 /dev/drbd1
Clustered volume group "hdd_vg0" successfully created
Clustered volume group "ssd_vg0" successfully created
an-node02:
vgscan
Reading all physical volumes. This may take a while...
Found volume group "ssd_vg0" using metadata type lvm2
Found volume group "hdd_vg0" using metadata type lvm2
an-node01:
lvcreate -l 100%FREE -n lun0 /dev/hdd_vg0 && lvcreate -l 100%FREE -n lun1 /dev/ssd_vg0
Logical volume "lun0" created
Logical volume "lun1" created
an-node02:
lvscan
ACTIVE '/dev/ssd_vg0/lun1' [27.95 GiB] inherit
ACTIVE '/dev/hdd_vg0/lun0' [421.49 GiB] inherit
iSCSI notes
IET vs tgt pros and cons needed.
The default iSCSI port is 3260.
initiator: This is the client.
target: This is the server side.
sid: Session ID; found with iscsiadm -m session -P 1. The SID and sysfs path are not persistent and are partially start-order based.
IQN: iSCSI Qualified Name; a string that uniquely identifies targets and initiators.
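A couple of commands that are handy for finding these values on a node (a sketch; run them once the initiator utilities below are installed):
# The initiator's own IQN lives here and can be edited if you want a friendlier name.
cat /etc/iscsi/initiatorname.iscsi
# Show current sessions, including the SID and the target IQN we are logged in to.
iscsiadm -m session -P 1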
Both:
yum install iscsi-initiator-utils scsi-target-utils
an-node01:
cp /etc/tgt/targets.conf /etc/tgt/targets.conf.orig
vim /etc/tgt/targets.conf
diff -u /etc/tgt/targets.conf.orig /etc/tgt/targets.conf
--- /etc/tgt/targets.conf.orig 2011-07-31 12:38:35.000000000 -0400
+++ /etc/tgt/targets.conf 2011-08-02 22:19:06.000000000 -0400
@@ -251,3 +251,9 @@
# vendor_id VENDOR1
# </direct-store>
#</target>
+
+<target iqn.2011-08.com.alteeve:an-clusterA.target01>
+ direct-store /dev/drbd0
+ direct-store /dev/drbd1
+ vendor_id Alteeve
+</target>
rsync -av /etc/tgt/targets.conf root@an-node02:/etc/tgt/
sending incremental file list
targets.conf
sent 909 bytes received 97 bytes 670.67 bytes/sec
total size is 7093 speedup is 7.05
Update the cluster
<service autostart="0" domain="an1_only" exclusive="0" max_restarts="0" name="an1_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="tgtd"/>
</script>
</script>
</service>
<service autostart="0" domain="an2_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart" restart_expire_time="0">
<script ref="drbd">
<script ref="clvmd">
<script ref="tgtd"/>
</script>
</script>
</service>
Connect to the SAN from a VM node
an-node03+:
iscsiadm -m discovery -t sendtargets -p 192.168.2.100
192.168.2.100:3260,1 iqn.2011-08.com.alteeve:an-clusterA.target01
iscsiadm --mode node --portal 192.168.2.100 --target iqn.2011-08.com.alteeve:an-clusterA.target01 --login
Logging in to [iface: default, target: iqn.2011-08.com.alteeve:an-clusterA.target01, portal: 192.168.2.100,3260]
Login to [iface: default, target: iqn.2011-08.com.alteeve:an-clusterA.target01, portal: 192.168.2.100,3260] successful.
fdisk -l
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00062f4a
Device Boot Start End Blocks Id System
/dev/sda1 * 1 33 262144 83 Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2 33 5255 41943040 83 Linux
/dev/sda3 5255 5777 4194304 82 Linux swap / Solaris
Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/sdb doesn't contain a valid partition table
Disk /dev/sdc: 30.0 GB, 30010245120 bytes
64 heads, 32 sectors/track, 28620 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/sdc doesn't contain a valid partition table
Setup the VM Cluster
Install RPMs.
yum -y install lvm2-cluster cman fence-agents
Configure lvm.conf.
cp /etc/lvm/lvm.conf /etc/lvm/lvm.conf.orig
vim /etc/lvm/lvm.conf
diff -u /etc/lvm/lvm.conf.orig /etc/lvm/lvm.conf
--- /etc/lvm/lvm.conf.orig 2011-08-02 21:59:01.000000000 -0400
+++ /etc/lvm/lvm.conf 2011-08-03 00:35:45.000000000 -0400
@@ -308,7 +308,8 @@
# Type 3 uses built-in clustered locking.
# Type 4 uses read-only locking which forbids any operations that might
# change metadata.
- locking_type = 1
+ #locking_type = 1
+ locking_type = 3
# Set to 0 to fail when a lock request cannot be satisfied immediately.
wait_for_locks = 1
@@ -324,7 +325,8 @@
# to 1 an attempt will be made to use local file-based locking (type 1).
# If this succeeds, only commands against local volume groups will proceed.
# Volume Groups marked as clustered will be ignored.
- fallback_to_local_locking = 1
+ #fallback_to_local_locking = 1
+ fallback_to_local_locking = 0
# Local non-LV directory that holds file-based locks while commands are
# in progress. A directory like /tmp that may get wiped on reboot is OK.
rsync -av /etc/lvm/lvm.conf root@an-node04:/etc/lvm/
sending incremental file list
lvm.conf
sent 873 bytes received 247 bytes 2240.00 bytes/sec
total size is 24625 speedup is 21.99
rsync -av /etc/lvm/lvm.conf root@an-node05:/etc/lvm/
sending incremental file list
lvm.conf
sent 873 bytes received 247 bytes 2240.00 bytes/sec
total size is 24625 speedup is 21.99
Config the cluster.
vim /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="5" name="an-clusterB">
<totem rrp_mode="none" secauth="off"/>
<clusternodes>
<clusternode name="an-node03.alteeve.ca" nodeid="1">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="3"/>
</method>
</fence>
</clusternode>
<clusternode name="an-node04.alteeve.ca" nodeid="2">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="4"/>
</method>
</fence>
</clusternode>
<clusternode name="an-node05.alteeve.ca" nodeid="3">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="5"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
</fencedevices>
<fence_daemon post_join_delay="30"/>
<rm>
<resources>
<script file="/etc/init.d/iscsi" name="iscsi" />
<script file="/etc/init.d/clvmd" name="clvmd" />
</resources>
<failoverdomains>
<failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" />
</failoverdomain>
<failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node04.alteeve.ca" />
</failoverdomain>
<failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node05.alteeve.ca" />
</failoverdomain>
</failoverdomains>
<service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an1_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd"/>
</script>
</service>
<service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd"/>
</script>
</service>
<service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an2_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd"/>
</script>
</service>
</rm>
</cluster>
ccs_config_validate
Configuration validates
Make sure iscsi and clvmd do not start on boot, stop both, then make sure they start and stop cleanly.
chkconfig clvmd off; chkconfig iscsi off; /etc/init.d/iscsi stop && /etc/init.d/clvmd stop
Stopping iscsi: [ OK ]
/etc/init.d/clvmd start && /etc/init.d/iscsi start && /etc/init.d/iscsi stop && /etc/init.d/clvmd stop
Starting clvmd:
Activating VG(s): No volume groups found
[ OK ]
Starting iscsi: [ OK ]
Stopping iscsi: [ OK ]
Signaling clvmd to exit [ OK ]
clvmd terminated [ OK ]
Use the cluster to stop (in case it autostarted before now) and then start the services.
# Disable (stop)
clusvcadm -d service:an3_storage
clusvcadm -d service:an4_storage
clusvcadm -d service:an5_storage
# Enable (start)
clusvcadm -e service:an3_storage -m an-node03.alteeve.ca
clusvcadm -e service:an4_storage -m an-node04.alteeve.ca
clusvcadm -e service:an5_storage -m an-node05.alteeve.ca
# Check
clustat
Cluster Status for an-clusterB @ Wed Aug 3 00:25:10 2011
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-node03.alteeve.ca 1 Online, Local, rgmanager
an-node04.alteeve.ca 2 Online, rgmanager
an-node05.alteeve.ca 3 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:an3_storage an-node03.alteeve.ca started
service:an4_storage an-node04.alteeve.ca started
service:an5_storage an-node05.alteeve.ca started
Flush iSCSI's Cache
If you remove an IQN (or change its name), the /etc/init.d/iscsi script will return errors. To flush the cache and re-scan:
I am sure there is a more elegant way.
/etc/init.d/iscsi stop && rm -rf /var/lib/iscsi/nodes/* && iscsiadm -m discovery -t sendtargets -p 192.168.2.100
Setup the VM Cluster's Clustered LVM
Partition the SAN disks
an-node03:
fdisk -l
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00062f4a
Device Boot Start End Blocks Id System
/dev/sda1 * 1 33 262144 83 Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2 33 5255 41943040 83 Linux
/dev/sda3 5255 5777 4194304 82 Linux swap / Solaris
Disk /dev/sdc: 30.0 GB, 30010245120 bytes
64 heads, 32 sectors/track, 28620 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/sdc doesn't contain a valid partition table
Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/sdb doesn't contain a valid partition table
Create partitions.
fdisk /dev/sdb
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0x403f1fb8.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
switch off the mode (command 'c') and change display units to
sectors (command 'u').
Command (m for help): c
DOS Compatibility flag is not set
Command (m for help): u
Changing display/entry units to sectors
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-55022, default 1): 1
Last cylinder, +cylinders or +size{K,M,G} (1-55022, default 55022):
Using default value 55022
Command (m for help): p
Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x403f1fb8
Device Boot Start End Blocks Id System
/dev/sdb1 1 55022 441964183+ 83 Linux
Command (m for help): t
Selected partition 1
Hex code (type L to list codes): 8e
Changed system type of partition 1 to 8e (Linux LVM)
Command (m for help): p
Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x403f1fb8
Device Boot Start End Blocks Id System
/dev/sdb1 1 55022 441964183+ 8e Linux LVM
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
fdisk /dev/sdc
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0xba7503eb.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
switch off the mode (command 'c') and change display units to
sectors (command 'u').
Command (m for help): c
DOS Compatibility flag is not set
Command (m for help): u
Changing display/entry units to sectors
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First sector (2048-58613759, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-58613759, default 58613759):
Using default value 58613759
Command (m for help): t
Selected partition 1
Hex code (type L to list codes): 8e
Changed system type of partition 1 to 8e (Linux LVM)
Command (m for help): p
Disk /dev/sdc: 30.0 GB, 30010245120 bytes
64 heads, 32 sectors/track, 28620 cylinders, total 58613760 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xba7503eb
Device Boot Start End Blocks Id System
/dev/sdc1 2048 58613759 29305856 8e Linux LVM
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
fdisk -l
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00062f4a
Device Boot Start End Blocks Id System
/dev/sda1 * 1 33 262144 83 Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2 33 5255 41943040 83 Linux
/dev/sda3 5255 5777 4194304 82 Linux swap / Solaris
Disk /dev/sdc: 30.0 GB, 30010245120 bytes
64 heads, 32 sectors/track, 28620 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xba7503eb
Device Boot Start End Blocks Id System
/dev/sdc1 2 28620 29305856 8e Linux LVM
Disk /dev/sdb: 452.6 GB, 452573790208 bytes
255 heads, 63 sectors/track, 55022 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x403f1fb8
Device Boot Start End Blocks Id System
/dev/sdb1 1 55022 441964183+ 8e Linux LVM
Setup LVM devices
Create PV.
an-node03:
pvcreate /dev/sd{b,c}1
Physical volume "/dev/sdb1" successfully created
Physical volume "/dev/sdc1" successfully created
an-node04 and an-node05:
pvscan
PV /dev/sdb1 lvm2 [421.49 GiB]
PV /dev/sdc1 lvm2 [27.95 GiB]
Total: 2 [449.44 GiB] / in use: 0 [0 ] / in no VG: 2 [449.44 GiB]
Create the VGs.
an-node03:
vgcreate -c y san_vg01 /dev/sdb1
Clustered volume group "san_vg01" successfully created
vgcreate -c y san_vg02 /dev/sdc1
Clustered volume group "san_vg02" successfully created
an-node04 and an-node05:
vgscan
Reading all physical volumes. This may take a while...
Found volume group "san_vg02" using metadata type lvm2
Found volume group "san_vg01" using metadata type lvm2
Create the first VM's LVs.
an-node03:
lvcreate -L 10G -n shared01 /dev/san_vg01
Logical volume "shared01" created
lvcreate -L 50G -n vm0001_hdd1 /dev/san_vg01
Logical volume "vm0001_hdd1" created
lvcreate -L 10G -n vm0001_ssd1 /dev/san_vg02
Logical volume "vm0001_ssd1" created
an-node04 and an-node05:
lvscan
ACTIVE '/dev/san_vg01/shared01' [10.00 GiB] inherit
ACTIVE '/dev/san_vg02/vm0001_ssd1' [10.00 GiB] inherit
ACTIVE '/dev/san_vg01/vm0001_hdd1' [50.00 GiB] inherit
an-node03:
mkfs.gfs2 -p lock_dlm -j 5 -t an-clusterB:shared01 /dev/san_vg01/shared01
This will destroy any data on /dev/san_vg01/shared01.
It appears to contain: symbolic link to `../dm-2'
Are you sure you want to proceed? [y/n] y
Device: /dev/san_vg01/shared01
Blocksize: 4096
Device Size 10.00 GB (2621440 blocks)
Filesystem Size: 10.00 GB (2621438 blocks)
Journals: 5
Resource Groups: 40
Locking Protocol: "lock_dlm"
Lock Table: "an-clusterB:shared01"
UUID: 6C0D7D1D-A1D3-ED79-705D-28EE3D674E75
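Note: The -j 5 above creates one journal per node that will mount this filesystem. If you later grow the cluster, more journals can be added to the mounted filesystem with gfs2_jadd. A minimal sketch, assuming /shared01 is mounted and you want one extra journal:
gfs2_jadd -j 1 /shared01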
Add it to /etc/fstab (needed for the gfs2 init script to find and mount):
an-node03 - an-node07:
echo `gfs2_edit -p sb /dev/san_vg01/shared01 | grep sb_uuid | sed -e "s/.*sb_uuid *\(.*\)/UUID=\L\1\E \/shared01\t\tgfs2\trw,suid,dev,exec,nouser,async\t0 0/"` >> /etc/fstab
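If that one-liner looks opaque: it simply reads the filesystem's UUID from the superblock, lower-cases it and appends a gfs2 mount line. You could append the equivalent line by hand instead, using the UUID reported by mkfs.gfs2 above (this node's UUID is shown; yours will differ):
echo "UUID=6c0d7d1d-a1d3-ed79-705d-28ee3d674e75 /shared01 gfs2 rw,suid,dev,exec,nouser,async 0 0" >> /etc/fstab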
cat /etc/fstab
#
# /etc/fstab
# Created by anaconda on Fri Jul 8 22:01:41 2011
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=2c1f4cb1-959f-4675-b9c7-5d753c303dd1 / ext3 defaults 1 1
UUID=9a0224dc-15b4-439e-8d7c-5f9dbcd05e3f /boot ext3 defaults 1 2
UUID=4f2a83e8-1769-40d8-ba2a-e1f535306848 swap swap defaults 0 0
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
UUID=6c0d7d1d-a1d3-ed79-705d-28ee3d674e75 /shared01 gfs2 rw,suid,dev,exec,nouser,async 0 0
Make the mount point and mount it.
mkdir /shared01
/etc/init.d/gfs2 start
Mounting GFS2 filesystem (/shared01): [ OK ]
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 40G 3.3G 35G 9% /
tmpfs 1.8G 32M 1.8G 2% /dev/shm
/dev/sda1 248M 85M 151M 36% /boot
/dev/mapper/san_vg01-shared01
10G 647M 9.4G 7% /shared01
Stop GFS2 on all five nodes and update the cluster.conf config.
/etc/init.d/gfs2 stop
Unmounting GFS2 filesystem (/shared01): [ OK ]
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 40G 3.3G 35G 9% /
tmpfs 1.8G 32M 1.8G 2% /dev/shm
/dev/sda1 248M 85M 151M 36% /boot
/dev/mapper/san_vg01-shared01
10G 647M 9.4G 7% /shared01
an-node03:
<?xml version="1.0"?>
<cluster config_version="9" name="an-clusterB">
<totem rrp_mode="none" secauth="off"/>
<clusternodes>
<clusternode name="an-node03.alteeve.ca" nodeid="3">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="3"/>
</method>
</fence>
</clusternode>
<clusternode name="an-node04.alteeve.ca" nodeid="4">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="4"/>
</method>
</fence>
</clusternode>
<clusternode name="an-node05.alteeve.ca" nodeid="5">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="5"/>
</method>
</fence>
</clusternode>
<clusternode name="an-node06.alteeve.ca" nodeid="6">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="6"/>
</method>
</fence>
</clusternode>
<clusternode name="an-node07.alteeve.ca" nodeid="7">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="7"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
</fencedevices>
<fence_daemon post_join_delay="30"/>
<rm>
<resources>
<script file="/etc/init.d/iscsi" name="iscsi"/>
<script file="/etc/init.d/clvmd" name="clvmd"/>
<script file="/etc/init.d/gfs2" name="gfs2"/>
</resources>
<failoverdomains>
<failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node04.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node05.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an6_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node06.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an7_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node07.alteeve.ca"/>
</failoverdomain>
</failoverdomains>
<service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an3_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<service autostart="1" domain="an6_only" exclusive="0" max_restarts="0" name="an6_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<service autostart="1" domain="an7_only" exclusive="0" max_restarts="0" name="an7_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
</rm>
</cluster>
cman_tool version -r
Check that rgmanager picked up the updated config and remounted the GFS2 partition.
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 40G 3.3G 35G 9% /
tmpfs 1.8G 32M 1.8G 2% /dev/shm
/dev/sda1 248M 85M 151M 36% /boot
/dev/mapper/san_vg01-shared01
10G 647M 9.4G 7% /shared01
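As an extra, optional check, you can confirm the running configuration version and that the per-node storage services started; cman_tool version should report config version 9 and clustat should list the anX_storage services as started on their respective nodes:
cman_tool version
clustat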
Configure KVM
Host network and VM hypervisor config.
Disable the 'qemu' Bridge
By default, libvirtd creates a bridge called virbr0 designed to connect virtual machines to the first eth0 interface. Our system will not need this, so we will remove it. This bridge is configured in the /etc/libvirt/qemu/networks/default.xml file.
So to remove this bridge, simply delete the contents of the file, stop the bridge, delete the bridge and then stop iptables to make sure any rules created for the bridge are flushed.
cat /dev/null >/etc/libvirt/qemu/networks/default.xml
ifconfig virbr0 down
brctl delbr virbr0
/etc/init.d/iptables stop
Configure Bridges
On an-node03 through an-node07:
vim /etc/sysconfig/network-scripts/ifcfg-{eth,vbr}0
ifcfg-eth0:
# Internet facing
HWADDR="bc:ae:c5:44:8a:de"
DEVICE="eth0"
BRIDGE="vbr0"
BOOTPROTO="static"
IPV6INIT="yes"
NM_CONTROLLED="no"
ONBOOT="yes"
Note that you can use whatever bridge name makes sense to you. However, the file name for the bridge configuration must sort after the ifcfg-ethX file; if the bridge file is read before the ethernet interface, it will fail to come up. Also, the bridge name as defined in the file does not need to match the one used in the actual file name. Personally, I like vbrX for "vm bridge".
ifcfg-vbr0:
# Bridge - IFN
DEVICE="vbr0"
TYPE="Bridge"
IPADDR=192.168.1.73
NETMASK=255.255.255.0
GATEWAY=192.168.1.254
DNS1=192.139.81.117
DNS2=192.139.81.1
You may not wish to make the Back-Channel Network accessible to the virtual machines; in that case, there is no need to set up this second bridge.
vim /etc/sysconfig/network-scripts/ifcfg-{eth,vbr}2
ifcfg-eth2:
# Back-channel
HWADDR="00:1B:21:72:9B:56"
DEVICE="eth2"
BRIDGE="vbr2"
BOOTPROTO="static"
IPV6INIT="yes"
NM_CONTROLLED="no"
ONBOOT="yes"
ifcfg-vbr2:
# Bridge - BCN
DEVICE="vbr2"
TYPE="Bridge"
IPADDR=192.168.3.73
NETMASK=255.255.255.0
Leave the cluster, lest we be fenced.
/etc/init.d/rgmanager stop && /etc/init.d/cman stop
Restart networking and then check that the new bridges are up and that the proper ethernet devices are slaved to them.
/etc/init.d/network restart
Shutting down interface eth0: [ OK ]
Shutting down interface eth1: [ OK ]
Shutting down interface eth2: [ OK ]
Shutting down loopback interface: [ OK ]
Bringing up loopback interface: [ OK ]
Bringing up interface eth0: [ OK ]
Bringing up interface eth1: [ OK ]
Bringing up interface eth2: [ OK ]
Bringing up interface vbr0: [ OK ]
Bringing up interface vbr2: [ OK ]
brctl show
bridge name bridge id STP enabled interfaces
vbr0 8000.bcaec5448ade no eth0
vbr2 8000.001b21729b56 no eth2
ifconfig
eth0 Link encap:Ethernet HWaddr BC:AE:C5:44:8A:DE
inet6 addr: fe80::beae:c5ff:fe44:8ade/64 Scope:Link
UP BROADCAST RUNNING PROMISC MULTICAST MTU:1500 Metric:1
RX packets:4439 errors:0 dropped:0 overruns:0 frame:0
TX packets:2752 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:508352 (496.4 KiB) TX bytes:494345 (482.7 KiB)
Interrupt:31 Base address:0x8000
eth1 Link encap:Ethernet HWaddr 00:1B:21:72:96:E8
inet addr:192.168.2.73 Bcast:192.168.2.255 Mask:255.255.255.0
inet6 addr: fe80::21b:21ff:fe72:96e8/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:617100 errors:0 dropped:0 overruns:0 frame:0
TX packets:847718 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:772489353 (736.7 MiB) TX bytes:740536232 (706.2 MiB)
Interrupt:18 Memory:fe9e0000-fea00000
eth2 Link encap:Ethernet HWaddr 00:1B:21:72:9B:56
inet6 addr: fe80::21b:21ff:fe72:9b56/64 Scope:Link
UP BROADCAST RUNNING PROMISC MULTICAST MTU:1500 Metric:1
RX packets:86586 errors:0 dropped:0 overruns:0 frame:0
TX packets:80934 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:11366700 (10.8 MiB) TX bytes:10091579 (9.6 MiB)
Interrupt:17 Memory:feae0000-feb00000
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:32 errors:0 dropped:0 overruns:0 frame:0
TX packets:32 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:11507 (11.2 KiB) TX bytes:11507 (11.2 KiB)
vbr0 Link encap:Ethernet HWaddr BC:AE:C5:44:8A:DE
inet addr:192.168.1.73 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::beae:c5ff:fe44:8ade/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:165 errors:0 dropped:0 overruns:0 frame:0
TX packets:89 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:25875 (25.2 KiB) TX bytes:17081 (16.6 KiB)
vbr2 Link encap:Ethernet HWaddr 00:1B:21:72:9B:56
inet addr:192.168.3.73 Bcast:192.168.3.255 Mask:255.255.255.0
inet6 addr: fe80::21b:21ff:fe72:9b56/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:74 errors:0 dropped:0 overruns:0 frame:0
TX packets:27 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:19021 (18.5 KiB) TX bytes:4137 (4.0 KiB)
Rejoin the cluster.
/etc/init.d/cman start && /etc/init.d/rgmanager start
Repeat these configurations, altering for MAC and IP addresses as appropriate, for the other four VM cluster nodes.
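Assuming password-less SSH between the nodes (an assumption; adjust the host names to suit your environment), a quick way to confirm the bridges came up everywhere is:
for node in an-node0{3..7}; do
    echo "=== ${node} ==="
    ssh root@${node}.alteeve.ca "brctl show"
done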
Benchmarks
GFS2 partition on an-node07's /shared01 partition. Test #1, no optimization:
bonnie++ -d /shared01/ -s 8g -u root:root
Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
an-node07.alteev 8G 388 95 22203 6 14875 8 2978 95 48406 10 107.3 5
Latency 312ms 44400ms 31355ms 41505us 540ms 11926ms
Version 1.96 ------Sequential Create------ --------Random Create--------
an-node07.alteeve.c -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 1144 18 +++++ +++ 8643 56 939 19 +++++ +++ 8262 55
Latency 291ms 586us 2085us 3511ms 51us 3669us
1.96,1.96,an-node07.alteeve.ca,1,1312497509,8G,,388,95,22203,6,14875,8,2978,95,48406,10,107.3,5,16,,,,,1144,18,+++++,+++,8643,56,939,19,+++++,+++,8262,55,312ms,44400ms,31355ms,41505us,540ms,11926ms,291ms,586us,2085us,3511ms,51us,3669us
CentOS 5.6 x86_64 VM vm0001_labzilla's /root directory. Test #1, no optimization. The VM was provisioned using the command in the section below.
bonnie++ -d /root/ -s 8g -u root:root
Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
labzilla-new.can 8G 674 98 15708 5 14875 7 1570 65 47806 10 119.1 7
Latency 66766us 7680ms 1588ms 187ms 269ms 1292ms
Version 1.96 ------Sequential Create------ --------Random Create--------
labzilla-new.candco -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 27666 39 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency 11360us 1904us 799us 290us 44us 41us
1.96,1.96,labzilla-new.candcoptical.com,1,1312522208,8G,,674,98,15708,5,14875,7,1570,65,47806,10,119.1,7,16,,,,,27666,39,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,66766us,7680ms,1588ms,187ms,269ms,1292ms,11360us,1904us,799us,290us,44us,41us
Provision vm0001
Created LV already, so:
virt-install --connect qemu:///system \
--name vm0001_labzilla \
--ram 1024 \
--arch x86_64 \
--vcpus 2 \
--cpuset 1-3 \
--location http://192.168.1.254/c5/x86_64/img/ \
--extra-args "ks=http://192.168.1.254/c5/x86_64/ks/labzilla_c5.ks" \
--os-type linux \
--os-variant rhel5.4 \
--disk path=/dev/san_vg01/vm0001_hdd1 \
--network bridge=vbr0 \
--vnc
Provision vm0002
Created LV already, so:
virt-install --connect qemu:///system \
--name vm0002_innovations \
--ram 1024 \
--arch x86_64 \
--vcpus 2 \
--cpuset 1-3 \
--cdrom /shared01/media/Win_Server_2008_Bis_x86_64.iso \
--os-type windows \
--os-variant win2k8 \
--disk path=/dev/san_vg01/vm0002_hdd2 \
--network bridge=vbr0 \
--hvm \
--vnc
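The cluster.conf below points rgmanager at /shared01/definitions/ for the VM definitions. My assumption is that dumping each guest's XML there after provisioning is enough for the vm resource agent to find them; a sketch:
mkdir -p /shared01/definitions
virsh dumpxml vm0001_labzilla > /shared01/definitions/vm0001_labzilla.xml
virsh dumpxml vm0002_innovations > /shared01/definitions/vm0002_innovations.xml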
Update the cluster.conf to add the VMs to the cluster.
<?xml version="1.0"?>
<cluster config_version="12" name="an-clusterB">
<totem rrp_mode="none" secauth="off"/>
<clusternodes>
<clusternode name="an-node03.alteeve.ca" nodeid="3">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="3"/>
</method>
</fence>
</clusternode>
<clusternode name="an-node04.alteeve.ca" nodeid="4">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="4"/>
</method>
</fence>
</clusternode>
<clusternode name="an-node05.alteeve.ca" nodeid="5">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="5"/>
</method>
</fence>
</clusternode>
<clusternode name="an-node06.alteeve.ca" nodeid="6">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="6"/>
</method>
</fence>
</clusternode>
<clusternode name="an-node07.alteeve.ca" nodeid="7">
<fence>
<method name="apc_pdu">
<device action="reboot" name="pdu2" port="7"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice agent="fence_apc" ipaddr="192.168.1.6" login="apc" name="pdu2" passwd="secret"/>
</fencedevices>
<fence_daemon post_join_delay="30"/>
<rm log_level="5">
<resources>
<script file="/etc/init.d/iscsi" name="iscsi"/>
<script file="/etc/init.d/clvmd" name="clvmd"/>
<script file="/etc/init.d/gfs2" name="gfs2"/>
</resources>
<failoverdomains>
<failoverdomain name="an3_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an4_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node04.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an5_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node05.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an6_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node06.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an7_only" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-node07.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="an3_primary" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" priority="1"/>
<failoverdomainnode name="an-node04.alteeve.ca" priority="2"/>
<failoverdomainnode name="an-node05.alteeve.ca" priority="3"/>
<failoverdomainnode name="an-node06.alteeve.ca" priority="4"/>
<failoverdomainnode name="an-node07.alteeve.ca" priority="5"/>
</failoverdomain>
<failoverdomain name="an4_primary" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-node03.alteeve.ca" priority="5"/>
<failoverdomainnode name="an-node04.alteeve.ca" priority="1"/>
<failoverdomainnode name="an-node05.alteeve.ca" priority="2"/>
<failoverdomainnode name="an-node06.alteeve.ca" priority="3"/>
<failoverdomainnode name="an-node07.alteeve.ca" priority="4"/>
</failoverdomain>
</failoverdomains>
<service autostart="1" domain="an3_only" exclusive="0" max_restarts="0" name="an3_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<service autostart="1" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<service autostart="1" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<service autostart="1" domain="an6_only" exclusive="0" max_restarts="0" name="an6_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<service autostart="1" domain="an7_only" exclusive="0" max_restarts="0" name="an7_storage" recovery="restart">
<script ref="iscsi">
<script ref="clvmd">
<script ref="gfs2"/>
</script>
</script>
</service>
<vm autostart="0" domain="an3_primary" exclusive="0" max_restarts="2" name="vm0001_labzilla" path="/shared01/definitions/" recovery="restart" restart_expire_time="600"/>
<vm autostart="0" domain="an4_primary" exclusive="0" max_restarts="2" name="vm0002_innovations" path="/shared01/definitions/" recovery="restart" restart_expire_time="600"/>
</rm>
</cluster>
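After saving the new configuration, validate it, push it out and then enable the VM services by hand (autostart is 0 above). A sketch, using this cluster's node names:
ccs_config_validate
cman_tool version -r
clusvcadm -e vm:vm0001_labzilla -m an-node03.alteeve.ca
clusvcadm -e vm:vm0002_innovations -m an-node04.alteeve.ca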
Stuff
Multi-VM after primary SAN (violent) ejection from cluster. Both VMs remained up!
First build of the 2x7 Cluster.
Bonding and Trunking
The goal here is to remove the network as a single point of failure.
The design uses two stacked switches, with bonded connections in the nodes and each leg of a bond cabled through a different switch. While both switches are up, the aggregate bandwidth is achieved using trunking in the switches and the appropriate bond driver configuration. Recovery from a failure must be configured so that it completes faster than the cluster's token loss timeout multiplied by the token retransmit loss count.
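For illustration only (the values below are assumptions, not something tested in this tutorial), these timings live in the <totem> line of cluster.conf; token is the token timeout in milliseconds and token_retransmits_before_loss_const is the retransmit count referred to above:
<totem rrp_mode="none" secauth="off" token="10000" token_retransmits_before_loss_const="10"/>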
This tutorial uses 2x D-Link DGS-3100-24 switches. This is not to endorse these switches per se, but they are relatively affordable, decent-quality switches for those who'd like to replicate this setup.
Configure The Stack
First, stack the switches using a ring topology (both HDMI connectors/cables used). If both switches are brand new, simply cable them together and the switches will auto-negotiate the stack configuration. If you are adding a new switch, power on the existing switch, cable up the second switch and then power it on. After a short time, its stack ID should increment and you should see the new switch appear in the existing switch's interface.
Configuring the Bonding Drivers
This tutorial uses four interfaces joined into two bonds of two NICs like so:
# Internet Facing Network:
eth0 + eth1 == bond0
# Storage and Cluster Communications:
eth2 + eth3 == bond1
This requires a few steps.
- Create /etc/modprobe.d/bonding.conf and add an entry for the two bonding channels we will create.
Note: My eth0 device is an onboard controller with a maximum MTU of 7200 bytes. This means that the whole bond is restricted to this MTU.
vim /etc/modprobe.d/bonding.conf
alias bond0 bonding
alias bond1 bonding
- Create the ifcfg-bondX configuration files.
Internet Facing configuration
touch /etc/sysconfig/network-scripts/ifcfg-bond{0,1}
vim /etc/sysconfig/network-scripts/ifcfg-eth{0,1} /etc/sysconfig/network-scripts/ifcfg-bond0
cat /etc/sysconfig/network-scripts/ifcfg-eth0
# Internet Facing Network - Link 1
HWADDR="BC:AE:C5:44:8A:DE"
DEVICE="eth0"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER="bond0"
SLAVE="yes"
MTU="7200"
cat /etc/sysconfig/network-scripts/ifcfg-eth1
# Internet Facing Network - Link 2
HWADDR="00:1B:21:72:96:E8"
DEVICE="eth1"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER="bond0"
SLAVE="yes"
MTU="7200"
cat /etc/sysconfig/network-scripts/ifcfg-bond0
# Internet Facing Network - Bonded Interface
DEVICE="bond0"
BOOTPROTO="static"
NM_CONTROLLED="no"
ONBOOT="yes"
IPADDR="192.168.1.73"
NETMASK="255.255.255.0"
GATEWAY="192.168.1.254"
DNS1="192.139.81.117"
DNS2="192.139.81.1"
BONDING_OPTS="miimon=1000 mode=0"
MTU="7200"
Merged Storage Network and Back Channel Network configuration.
Note: The interfaces in this bond all support maximum MTU of 9000 bytes.
vim /etc/sysconfig/network-scripts/ifcfg-eth{2,3} /etc/sysconfig/network-scripts/ifcfg-bond1
cat /etc/sysconfig/network-scripts/ifcfg-eth2
# Storage and Back Channel Networks - Link 1
HWADDR="00:1B:21:72:9B:56"
DEVICE="eth2"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER="bond1"
SLAVE="yes"
MTU="9000"
cat /etc/sysconfig/network-scripts/ifcfg-eth3
# Storage and Back Channel Networks - Link 2
HWADDR="00:0E:0C:59:46:E4"
DEVICE="eth3"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER="bond1"
SLAVE="yes"
MTU="9000"
cat /etc/sysconfig/network-scripts/ifcfg-bond1
# Storage and Back Channel Networks - Bonded Interface
DEVICE="bond1"
BOOTPROTO="static"
NM_CONTROLLED="no"
ONBOOT="yes"
IPADDR="192.168.3.73"
NETMASK="255.255.255.0"
BONDING_OPTS="miimon=1000 mode=0"
MTU="9000"
Restart networking.
Note: I've noticed that this can error out and fail to start slaved devices at times when using /etc/init.d/network restart. If you have any trouble, you may need to completely stop all networking, then start it back up. This, of course, requires network-less access to the node's console (direct access, iKVM, console redirection, etc).
Some of the errors we will see below occur because the network interface configuration changed while the interfaces were still up. To avoid this, if you have network-less access to the nodes, stop the network interfaces before you begin editing.
/etc/init.d/network restart
Shutting down interface eth0: [ OK ]
Shutting down interface eth1: [ OK ]
Shutting down interface eth2: /etc/sysconfig/network-scripts/ifdown-eth: line 99: /sys/class/net/bond1/bonding/slaves: No such file or directory
[ OK ]
Shutting down interface eth3: /etc/sysconfig/network-scripts/ifdown-eth: line 99: /sys/class/net/bond1/bonding/slaves: No such file or directory
[ OK ]
Shutting down loopback interface: [ OK ]
Bringing up loopback interface: [ OK ]
Bringing up interface bond0: RTNETLINK answers: File exists
Error adding address 192.168.1.73 for bond0.
RTNETLINK answers: File exists
[ OK ]
Bringing up interface bond1: [ OK ]
Confirm that we've got our new bonded interfaces
ifconfig
bond0 Link encap:Ethernet HWaddr BC:AE:C5:44:8A:DE
inet addr:192.168.1.73 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::beae:c5ff:fe44:8ade/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:7200 Metric:1
RX packets:1021 errors:0 dropped:0 overruns:0 frame:0
TX packets:502 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:128516 (125.5 KiB) TX bytes:95092 (92.8 KiB)
bond1 Link encap:Ethernet HWaddr 00:1B:21:72:9B:56
inet addr:192.168.3.73 Bcast:192.168.3.255 Mask:255.255.255.0
inet6 addr: fe80::21b:21ff:fe72:9b56/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:9000 Metric:1
RX packets:787028 errors:0 dropped:0 overruns:0 frame:0
TX packets:788651 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:65753950 (62.7 MiB) TX bytes:1194295932 (1.1 GiB)
eth0 Link encap:Ethernet HWaddr BC:AE:C5:44:8A:DE
UP BROADCAST RUNNING SLAVE MULTICAST MTU:7200 Metric:1
RX packets:535 errors:0 dropped:0 overruns:0 frame:0
TX packets:261 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:66786 (65.2 KiB) TX bytes:47749 (46.6 KiB)
Interrupt:31 Base address:0x8000
eth1 Link encap:Ethernet HWaddr BC:AE:C5:44:8A:DE
UP BROADCAST RUNNING SLAVE MULTICAST MTU:7200 Metric:1
RX packets:486 errors:0 dropped:0 overruns:0 frame:0
TX packets:241 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:61730 (60.2 KiB) TX bytes:47343 (46.2 KiB)
Interrupt:18 Memory:fe8e0000-fe900000
eth2 Link encap:Ethernet HWaddr 00:1B:21:72:9B:56
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9000 Metric:1
RX packets:360190 errors:0 dropped:0 overruns:0 frame:0
TX packets:394844 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:28756400 (27.4 MiB) TX bytes:598159146 (570.4 MiB)
Interrupt:17 Memory:fe9e0000-fea00000
eth3 Link encap:Ethernet HWaddr 00:1B:21:72:9B:56
UP BROADCAST RUNNING SLAVE MULTICAST MTU:9000 Metric:1
RX packets:426838 errors:0 dropped:0 overruns:0 frame:0
TX packets:393807 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:36997550 (35.2 MiB) TX bytes:596136786 (568.5 MiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Configuring High-Availability Networking
There are seven bonding modes, which you can read about in detail in the kernel's bonding driver documentation. However, the RHCS stack only supports one of the modes, called Active/Passive Bonding, also known as mode=1.
Configuring Your Switches
This method provides no performance gain; instead, it treats the slaved interfaces as independent paths, one of which acts as the primary while the other sits dormant. On failure, the bond switches to the backup interface, promoting it to the primary role. Which interface is normally primary, and under what conditions a restored link returns to the primary role, are configurable should you wish to do so.
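To see which slave is currently active on a node, the bonding driver exposes its state under /proc. If you want to pin a preferred interface, the optional primary= module option can be added to BONDING_OPTS; it is shown here only as an illustration:
cat /proc/net/bonding/bond0
# Optional, in ifcfg-bond0:
# BONDING_OPTS="mode=1 miimon=100 primary=eth0"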
Your managed switch will no doubt have one or more bonding (also known as trunking) configuration options. Likewise, your switches may be stackable. It is strongly advised that you do *not* stack your switches. Being unstacked, it is obviously not possible to configure trunking. Should you decide to disregard this, be very sure to extensively test failure and recovery of both switches under real-world workloads.
Still on the topic of switches: do not configure STP (spanning tree protocol) on any port connected to your cluster nodes! When a switch is added to the network, as is the case after restoring a lost switch, STP-enabled switches and the ports on those switches may block traffic for a period of time while STP renegotiates and reconfigures. This takes more than enough time to cause a cluster to partition. You may still enable and configure STP if you need to do so; simply ensure that you only do so on the appropriate ports.
Preparing The Bonding Driver
Before we modify the network, we will need to create the /etc/modprobe.d/bonding.conf file so that the kernel's bonding driver is aliased to the bondX devices we are about to create. With the switches unstacked and STP disabled, we can now configure the bonding interfaces.
vim /etc/modprobe.d/bonding.conf
alias bond0 bonding
alias bond1 bonding
alias bond2 bonding
If you only have four interfaces and plan to merge the SN and BCN networks, you can omit the bond2 entry.
You can then load the bonding driver right away, rather than rebooting, so that the aliases above take effect (see the sketch below).
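A minimal sketch of loading the driver without a reboot, assuming the bonding.conf file above is in place; once loaded, the kernel lists the bond devices it knows about:
modprobe bonding
cat /sys/class/net/bonding_masters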
Deciding Which NICs to Bond
If all of the interfaces in your server are identical, you can probably skip this step. Before you skip ahead though, consider that not all of the PCIe interfaces may have all of their lanes connected, resulting in differing speeds. If you are unsure, I strongly recommend you run these tests.
TODO: Upload network_profiler.pl here and explain its use.
- Before we do that though, let's look at how we will verify the current link speed using iperf (local copy of iperf-2.0.5-1.el6.x86_64.rpm).
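A minimal iperf run between two nodes looks like the following; the IP below is an-node03's SN/BCN address from this tutorial and is only an example:
# On the node acting as the server:
iperf -s
# On the node acting as the client:
iperf -c 192.168.3.73 -t 30 -i 5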
Once you've determined the various capabilities on your interfaces, pair them off with their closest-performing partners.
Keep in mind:
- Any interface piggy-backing on an IPMI interface *must* be part of the BCN bond!
- The fastest interfaces should be paired for the SN bond.
- The lowest-latency interfaces should be used for the BCN bond.
- The remaining two (slowest) interfaces should be used for the IFN bond.
Creating the Bonds
Warning: This step will almost certainly leave you without network access to your servers. It is *strongly* advised that you do the next steps when you have physical access to your servers. If that is simply not possible, then proceed with extreme caution.
In my case, I found the following bonding configuration to be optimal:
- eth0 and eth3 bonded as bond0.
- eth1 and eth2 bonded as bond1.
I did not have enough interfaces for three bonds, so I will configure the following:
- bond0 will be the IFN interface on the 192.168.1.0/24 subnet.
- bond1 will be the merged BCN and SN interfaces on the 192.168.3.0/24 subnet.
TODO: Create/show the diffs for the following ifcfg-ethX files.
- Create bond0 out of eth0 and eth3:
vim /etc/sysconfig/network-scripts/ifcfg-eth0
# Internet Facing Network - Link 1
HWADDR="BC:AE:C5:44:8A:DE"
DEVICE="eth0"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="no"
MASTER="bond0"
SLAVE="yes"
vim /etc/sysconfig/network-scripts/ifcfg-eth3
# Storage and Back Channel Networks - Link 2
HWADDR="00:0E:0C:59:46:E4"
DEVICE="eth3"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="no"
MASTER="bond0"
SLAVE="yes"
vim /etc/sysconfig/network-scripts/ifcfg-bond0
# Internet Facing Network - Bonded Interface
DEVICE="bond0"
BOOTPROTO="static"
NM_CONTROLLED="no"
ONBOOT="yes"
IPADDR="192.168.1.73"
NETMASK="255.255.255.0"
GATEWAY="192.168.1.254"
DNS1="192.139.81.117"
DNS2="192.139.81.1"
# Clustering *only* supports mode=1 (active-passive)
BONDING_OPTS="mode=1 miimon=100 use_carrier=0 updelay=0 downdelay=0"
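Once the bond is up, a simple (hedged) way to confirm that failover actually works is to note the active slave, drop it, and watch the bond promote the other leg. Do this from the console, not over the link you are about to drop:
grep "Currently Active Slave" /proc/net/bonding/bond0
ip link set eth0 down
grep "Currently Active Slave" /proc/net/bonding/bond0
ip link set eth0 up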
GFS2
Try adding noatime to the /etc/fstab options. h/t to Dak1n1; "it avoids cluster reads from turning into unnecessary writes, and can improve performance".
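Using the shared01 mount from earlier as an example (your UUID will differ), the fstab entry would become:
UUID=6c0d7d1d-a1d3-ed79-705d-28ee3d674e75 /shared01 gfs2 noatime,rw,suid,dev,exec,nouser,async 0 0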
Any questions, feedback, advice, complaints or meanderings are welcome.