Watchdog Recovery

Note: This tutorial is written using Fedora 16.

The new fence_sanlock and checkquorum.wdmd fence agents provide new fencing options to users who may not have full out-of-band management, switched PDUs or similar traditional fence devices. They aim to provide a critical function in clusters to users who otherwise would have no (affordable) options.

Warning: This technology is Tech Preview! There is no support for this fence method yet. Feedback and bug reports are much appreciated.

About Fencing

Traditionally in clustering, all nodes must be in a known state. In practice, this meant that when a node stopped responding, the rest of the cluster could not safely proceed until the silent node was put into a known state.

The action of putting a node into a known state is called "fencing". Typically, this is done by one of the other nodes in the cluster either isolating or forcibly powering off the lost node.

  • With isolation, the lost node would not itself be touched, but its network link(s) would be disabled. This would ensure that even if the node recovered, it would no longer have access to the cluster or its shared storage. This form of fencing is called "fabric fencing".
  • The far more common form of fencing is to forcibly power off the lost node. This is done by using an external device, like a server's out-of-band management card (IPMI, iLO, etc) or by using a network-connected power bar, called a PDU.

In either case, the purpose of fencing is to ensure that the lost node can not access clustered resources, like shared storage, or continue to provide clustered services in an uncoordinated manner. Skipping this crucial step could cause data loss, so it is critical to always use fencing in clusters.
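
As a point of comparison, a traditional power fence amounts to something like the commands below (the address and credentials are placeholders; a real fence agent such as fence_ipmilan wraps this up and verifies the result);

# Query the victim's power state over its out-of-band management (IPMI) interface.
ipmitool -I lanplus -H 10.20.30.1 -U admin -P secret chassis power status
# A power-fence agent effectively does this, then confirms the node is off.
ipmitool -I lanplus -H 10.20.30.1 -U admin -P secret chassis power off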

Watchdog Timers

Many motherboards have "watchdog" timers built in. These timers will cause the host machine to reboot if the system appears to freeze for a period of time. The new fence_sanlock agent combines these with SAN storage to provide an alternative fence method.
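
If you are not sure whether your machine has a hardware watchdog timer, a quick check from the running OS usually tells you (driver names such as iTCO_wdt vary by board and are given here only as examples);

# A watchdog device node should exist once the driver is loaded.
ls -l /dev/watchdog
# Look for the kernel registering a watchdog driver (iTCO_wdt, ipmi_watchdog, etc).
dmesg | grep -i -e watchdog -e wdt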

Where "fabric fencing" can be thought of as a form of ostracism and "power fencing" can be thought of as a form of murder, watchdog fencing can be thought of as a form of suicide. Morbid, but accurate.

Options

There are currently two mechanisms to trigger a node recovery via a watchdog device:

  1. fence_sanlock: Preferred method, but always requires shared storage.
  2. checkquorum.wdmd: Only requires shared storage for 2-node clusters.

Differences Between fence_sanlock And checkquorum.wdmd

When choosing which type of watchdog-based fencing to use, consider the following;

fence_sanlock
  Advantages:
  • Preferred method; it is a real fence method.
  • Can recover the cluster if a node fails completely.
  Disadvantages:
  • Requires shared storage in all cases.

checkquorum.wdmd
  Advantages:
  • Shared storage is not needed for clusters of three or more nodes.
  Disadvantages:
  • Not a real fence method.
  • The fence action is only considered complete when the lost node rejoins.
  • Node failures that prevent reboot cause the cluster to remain blocked.
  • A network failure causing all nodes to lose quorum will result in a complete cluster restart.

Important Note On Timing

Watchdog timers work by having a constant countdown running. The host has to periodically and reliably reset this timer. If the watchdog timer is allowed to expire, the host machine will be reset. This timeout is often measured in minutes.

Traditional fencing methods communicate with external devices that can report success as soon as the target node has been fenced. This process is usually measured in a small number of seconds.

When a cluster loses contact with a node, it blocks by design. It is not safe to proceed until all nodes are in a known state, so the users of the cluster services will notice a period of interruption until the lost node recovers or is fenced.

Putting this together, this timing difference means that any watchdog-based fencing will be much slower than traditional fencing. Your users will most likely experience an outage of several minutes while fence_sanlock does its work. For this reason, fence_sanlock should be used only when traditional fence methods are unavailable.

In short; watchdog fencing is not a replacement for traditional fencing. It is only a replacement for no fencing at all.

fence_sanlock

The fence_sanlock daemon works by combining locks on shared storage with watchdog devices on each node.

On start-up, each node "unfences", during which time the node takes a lock from the shared storage and begins resetting the watchdog timer at set intervals. So long as the node is healthy, this reset of the watchdog timer repeats indefinitely.

If there is a failure on the node itself, the node will self-fence by allowing its watchdog timer to expire. This will happen either a period of time after losing quorum, or immediately if corosync fails.

When another node in the cluster wants to call a fence against a victim, it will try to take the victim's lock on the shared storage. If the victim is not already in the process of fencing itself, it will detect the attempt to take its lock and allow its watchdog timer to expire. Of course, if the victim has crashed entirely, its watchdog device will already be counting down. In any case, the victim will eventually reboot.

The cluster will know the victim is gone when it is finally able to take the victim's lock. This is safe because, so long as the victim is alive, it will maintain its lock.
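
Once fence_sanlockd is running (see the setup steps below), the sanlock client tool can be used as a quick sanity check of the lockspace and leases from any node;

# Show the lockspaces and resources this host currently holds.
sanlock client status
# Dump sanlock's internal debug log if something looks wrong.
sanlock client log_dump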

Requirements

You will need;

  • A hardware watchdog timer.
  • External shared storage, 1 GiB or larger in size.
    • Any shared storage will do. Here is a short tgtd tutorial for creating a SAN on a machine outside of the cluster if you don't have existing shared storage.

From the hardware end of things, you have to have a hardware watchdog timer. Many workstation and server mainboards offer this as a built-in feature which can be enabled in the BIOS. If your system does not have this, add-in and external watchdog timers can be used and are relatively inexpensive. To further save costs, some open-hardware watchdog timer designs are available for those handy with a soldering iron.

Note: Software watchdog timers exist but they are not supported in production. They rely on the host functioning to at least some degree which is a fatal design flaw. A simple test of issuing echo c > /proc/sysrq-trigger will demonstrate the flaw in using software watchdog timers.

You need to install;

  • cman ver. 3.1.99 +
  • wdmd ver. 2.6 + (available from the sanlock ver. 2.6 + package)
  • fence_sanlock ver. 2.6 +

Installation

To install fence_sanlock, run;

yum install cman fence-sanlock sanlock sanlock-lib
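
To confirm the packages installed and meet the minimum versions listed under Requirements;

# The versions reported should be at least cman 3.1.99 and sanlock/fence-sanlock 2.6.
rpm -q cman sanlock sanlock-lib fence-sanlock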

Configuring Shared Storage

Any shared storage device can be used. For the purpose of this document, a SAN LUN, exported by a machine outside of the cluster and made available as /dev/sdb, will be used. If you don't currently have a shared storage device, below is a brief tutorial on setting up a tgtd based SAN:

  • Setting Up tgtd As A SAN

You can use a [non-]clustered LVM volume, an NFS export or most any other type of shared storage. The only requirement is that it be at least 1 GiB in size.
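
If you want to double-check that the device is large enough before initializing it (using /dev/sdb, as in this document);

# Print the device size in bytes; it should be at least 1 GiB (1073741824 bytes).
blockdev --getsize64 /dev/sdb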

Once you have your shared storage, initialize it;

Note: The Initializing 128 sanlock host leases may take a while to complete, please be patient.
fence_sanlock -o sanlock_init -p /dev/sdb
Initializing fence sanlock lockspace on /dev/sdb: ok
Initializing 128 sanlock host leases on /dev/sdb: ok

Note the path, /dev/sdb in this case. You will use it in the next step.
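
If you want to verify the initialization, sanlock can read back the lockspace and lease records it just wrote. This step is optional;

# Dump the sanlock structures written to the shared device.
sanlock direct dump /dev/sdb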

Configuring cman

Below is an example cman configuration file. The main sections to note are the <fence>, <unfence> and <fencedevice> elements.

vim /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="2" name="an-cluster-01">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-c01n01.alteeve.ca" nodeid="1">
			<fence>
				<method name="wd">
					<device name="watchdog" host_id="1" />
				</method>
			</fence>
			<unfence>
				<device name="watchdog" host_id="1" action="on" />
			</unfence>
		</clusternode>
		<clusternode name="an-c01n02.alteeve.ca" nodeid="2">
			<fence>
				<method name="wd">
					<device name="watchdog" host_id="2" />
				</method>
			</fence>
			<unfence>
				<device name="watchdog" host_id="2" action="on" />
			</unfence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="watchdog" agent="fence_sanlock" path="/dev/sdb"/>
	</fencedevices>
	<fence_daemon post_join_delay="30"/>
	<totem rrp_mode="none" secauth="off"/>
</cluster>

The key attributes are:

  • host_id="x"; This tells fence_sanlock which node ID to act on. Generally it matches the corresponding <clusternode ... nodeid="x"> value. It doesn't have to, though, but it must be unique.
  • path="/dev/x"; This tells the cluster where each node's lock can be found. Here we used /dev/sdb/, which is the device we initialized in the previous step.

Once it passes ccs_config_validate, push it out to your other node(s).
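
For example, from the node where the file was edited (the host name below is the second node in this document's example cluster);

# Check the syntax and schema of the new configuration.
ccs_config_validate
# Copy the validated file to the other node.
scp /etc/cluster/cluster.conf root@an-c01n02.alteeve.ca:/etc/cluster/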

Enabling fence_sanlock

Note: fence_sanlock fencing and unfencing operations can take up to several minutes to complete. This is normal and expected behaviour. Other than this, the fencing operation will work as any other fence device implementation. From a user perspective there are no operational differences.

We need to disable wdmd and sanlock from starting at boot, then enable the fence_sanlockd daemon to start on boot. The fence_sanlockd daemon will start the wdmd and sanlock daemons itself.

Note: If you are using a pre-systemd OS, use chkconfig and init.d scripts instead of systemctl.
systemctl disable wdmd.service
systemctl disable sanlock.service
systemctl enable fence_sanlockd.service

If you started the daemons, stop them now.

systemctl stop wdmd.service
systemctl stop sanlock.service
systemctl stop fence_sanlockd.service

Now stop cman and then start fence_sanlockd. If you're not running the cluster yet, then stopping cman is not needed.

systemctl stop cman.service
systemctl start fence_sanlockd.service

Now we can re-enable the cman service.

Note: As stated earlier, the start-up time and time-to-fence of the cluster will be much longer with fence_sanlock configured. It could take up to five minutes, or more depending on configuration, for the unfence step to complete. Please be patient.
systemctl start cman.service
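
Once cman has started (and the unfence step has completed), a quick way to confirm that both nodes are members and that the fence domain is up;

# List cluster members as seen by cman.
cman_tool nodes
# List the fence domain membership and state.
fence_tool ls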

Testing fence_sanlock

If everything is working, you should now be able to recover from a hard-crash on a node in the cluster. To test this, you can echo c to /proc/sysrq-trigger and the other node should, eventually, fence the lost node and restore cluster services.

echo c > /proc/sysrq-trigger

In the system log of the other node, you will see messages similar to those in the example below.

Oct 21 23:25:23 an-c01n02 corosync[1652]:   [TOTEM ] A processor failed, forming new configuration.
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] New Configuration:
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] #011r(0) ip(10.20.10.2) 
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] Members Left:
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] #011r(0) ip(10.20.10.1) 
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] Members Joined:
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [QUORUM] Members[1]: 2
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] New Configuration:
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] #011r(0) ip(10.20.10.2) 
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] Members Left:
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] Members Joined:
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CPG   ] chosen downlist: sender r(0) ip(10.20.10.2) ; members(old:2 left:1)
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 21 23:25:25 an-c01n02 kernel: [ 1075.915453] dlm: closing connection to node 1
Oct 21 23:25:25 an-c01n02 fenced[1708]: fencing node an-c01n01.alteeve.ca
Oct 21 23:25:26 an-c01n02 fence_sanlock: 1997 host_id 1 gen 1 ver 1 timestamp 391
Oct 21 23:27:20 an-c01n02 fenced[1708]: fence an-c01n01.alteeve.ca success

This shows that fence_sanlock is working properly.

checkquorum.wdmd

The checkquorum.wdmd script is not really a fence device in the traditional sense. The only way for a fence action to be considered a success is to have the failed node rejoin the cluster in a clean (freshly started) state. If a node fails in such a way that it can not start back up, the cluster will remain blocked until cleared by an administrator issuing a fence_ack_manual on the survivor node.

The checkquorum.wdmd script works by having wdmd stop resetting the system's watchdog timer if the node loses quorum. For this reason, checkquorum.wdmd will not work if the 2-node (<cman expected_votes="1" two_node="1"/>) option is set in /etc/cluster/cluster.conf, because quorum is effectively disabled. When configured for 2-node clusters, a node will never lose quorum, so wdmd will never stop resetting the watchdog timer.

To get around this 2-node limitation, we can use qdisk on shared storage. This provides a third vote and allows quorum to be used properly.

Note: The qdisk device must be in master_wins mode. Please see man 5 qdisk for more information.

Requirements

You will need;

  • A hardware watchdog timer.
  • Two-Node clusters only; External shared storage, 10 MiB or larger in size, for a qdisk partition.

From the hardware end of things, you have to have a hardware watchdog timer. Many workstation and server mainboards offer this as a built-in feature which can be enabled in the BIOS. If your system does not have this, add-in and external watchdog timers can be used and are relatively inexpensive. To further save costs, some open-hardware watchdog timer designs are available for those handy with a soldering iron.

Note: Software watchdog timers exist but they are not supported in production. They rely on the host functioning to at least some degree which is a fatal design flaw. A simple test of issuing echo c > /proc/sysrq-trigger will demonstrate the flaw in using software watchdog timers.

You need to install;

  • cman ver. 3.1.99 +
  • wdmd ver. 2.6 + (available from the sanlock ver. 2.6 + package)
  • sanlock ver. 2.6 +

Installation

To use checkquorum.wdmd, install;

yum install cman sanlock sanlock-lib

Configuring qdisk (2-Node Only)

Note: If you have a cluster with three or more nodes, you can skip this step.
Warning: If you are already using qdisk, do not create a new qdisk device while the cluster is running! It will cause the cluster to malfunction.

In this document, the SAN device is mounted on each node as /dev/sdb. If you do not have an existing SAN, below is a brief tutorial on setting up a tgtd based SAN:

  • Setting Up tgtd As A SAN

Create the quorum disk using the following command;

Note: In the example below, the label an-cluster-01.wdmd is used. This is a free-form label between 1 and 128 characters in length.
mkqdisk -c /dev/sdb -l an-cluster-01.wdmd
mkqdisk v3.1.99

Writing new quorum disk label 'an-cluster-01.wdmd' to /dev/sdb.
WARNING: About to destroy all data on /dev/sdb; proceed [N/y] ? y
Initializing status block for node 1...
Initializing status block for node 2...
Initializing status block for node 3...
Initializing status block for node 4...
Initializing status block for node 5...
Initializing status block for node 6...
Initializing status block for node 7...
Initializing status block for node 8...
Initializing status block for node 9...
Initializing status block for node 10...
Initializing status block for node 11...
Initializing status block for node 12...
Initializing status block for node 13...
Initializing status block for node 14...
Initializing status block for node 15...
Initializing status block for node 16...
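
To confirm the quorum disk was created and carries the expected label, mkqdisk can list it;

# List visible quorum disks and their labels.
mkqdisk -L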

With this now created, edit your cluster.conf file to use the quorum disk. This involves removing the <cman expected_votes="1" two_node="1"/> entry and replacing it with the quorum configuration. The important value we will need to set is label="an-cluster-01.wdmd", which tells cman which device is its qdisk. For more information on the various available qdisk attributes and values, read man 5 qdisk.

vim /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="3" name="an-cluster-01">
	<quorumd label="an-cluster-01.wdmd"/>
	<clusternodes>
		<clusternode name="an-c01n01.alteeve.ca" nodeid="1"/>
		<clusternode name="an-c01n02.alteeve.ca" nodeid="2"/>
	</clusternodes>
	<fencedevices/>
	<fence_daemon post_join_delay="30"/>
	<totem rrp_mode="none" secauth="off"/>
</cluster>
Note: Fence methods and fence devices are not defined under each node when using checkquorum.wdmd.

If you want to configure checkquorum.wdmd, you can do so by creating or editing /etc/sysconfig/checkquorum.

Configuration options:

Option: waittime
  Value(s): a natural number
  Description: The number of seconds to wait after losing quorum before declaring a failure. What this means depends on what action is set to.

Option: action
  Value(s): one of the values below
  Description: The action to be taken, either immediately if corosync crashes or after waittime seconds when quorum is lost. The delay is used in case the node is able to rejoin the cluster, thus regaining quorum.
  • autodetect (default): If kdump is running, an attempt to crash dump is made. If kdump is not running, an error is returned to wdmd, which in turn allows the watchdog to reboot the machine.
  • hardreboot: Triggers a kernel hard-reboot.
  • crashdump: Triggers a kdump action in the kernel.
  • watchdog: Returns an error to wdmd, which in turn allows the watchdog to reboot the machine.
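
As a sketch only, a minimal /etc/sysconfig/checkquorum using the options above might look like this; the values are illustrative and shell-style variable assignments are assumed;

# /etc/sysconfig/checkquorum (example values only)
# Wait 60 seconds after losing quorum before declaring a failure.
waittime=60
# Hand the failure to wdmd and let the watchdog timer reboot the machine.
action=watchdog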

If kdump is running, it is advised that you also use fence_kdump. This will allow the failed node to inform the other nodes that it has rebooted, speeding up recovery time. This is optional and should be considered an optimization.

Warning: The fence_kdump agent is totally untested. Please use it cautiously.

There are limitations to using fence_kdump;

  • Note that checkquorum.wdmd does not work when two_node=1 is set in /etc/cluster/cluster.conf unless it is used in combination with qdiskd "master wins" mode. Given the need for shared storage however, it is better to just use fence_sanlock in this case.
  • When using checkquorum.wdmd, fencing is considered complete only after the failed node has rebooted and rejoined the cluster. If this fails, or if there is a failure in the cluster's network, the cluster will hang indefinitely. Likewise, if the node has failed completely and does not restart, the cluster will also hang.
  • If all communication is lost between the nodes, such as a failure in the core switch(es), all nodes will lose quorum and reboot.

Using checkquorum.wdmd

Before checkquorum.wdmd can be used, we must first copy the checkquorum.wdmd script into the watchdog script directory and make it executable. Then we will restart the wdmd daemon to enable the new script.

Copy checkquorum.wdmd and set up its permissions and ownership.

cp /usr/share/cluster/checkquorum.wdmd /etc/wdmd.d/
chown root:root /etc/wdmd.d/checkquorum.wdmd
chmod u+x /etc/wdmd.d/checkquorum.wdmd

Stop cman if it is running, restart the wdmd daemon, and then start cman again.

systemctl stop cman.service
systemctl stop wdmd.service
systemctl start wdmd.service
systemctl start cman.service

Testing checkquorum.wdmd

If everything is working, you should now be able to recover from a hard-crash on a node in the cluster. To test this, you can echo c to /proc/sysrq-trigger and the other node should, eventually, fence the lost node. Remember; The cluster will remain blocked until either;

  • The lost node reboots and rejoins the cluster in a clean state.

or;

  • You manually clear the fence by issuing fence_ack_manual.
Warning: You must be certain that the failed node has been manually powered off before manually clearing the fence. Failing to do so could cause a serious failure in your cluster!
echo c > /proc/sysrq-trigger

In the system log of the other node, you will see messages similar to those in the example below.

Oct 22 22:01:25 an-c01n01 qdiskd[2138]: Writing eviction notice for node 2
Oct 22 22:01:26 an-c01n01 qdiskd[2138]: Node 2 evicted
Oct 22 22:01:28 an-c01n01 corosync[2084]:   [TOTEM ] A processor failed, forming new configuration.
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] New Configuration:
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] #011r(0) ip(10.20.10.1) 
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] Members Left:
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] #011r(0) ip(10.20.10.2) 
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] Members Joined:
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [QUORUM] Members[1]: 1
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] New Configuration:
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] #011r(0) ip(10.20.10.1) 
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] Members Left:
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] Members Joined:
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CPG   ] chosen downlist: sender r(0) ip(10.20.10.1) ; members(old:2 left:1)
Oct 22 22:01:30 an-c01n01 kernel: [13038.313753] dlm: closing connection to node 2
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 22:01:30 an-c01n01 fenced[2218]: fencing node an-c01n02.alteeve.ca
Oct 22 22:01:30 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca dev 0.0 agent none result: error no method
Oct 22 22:01:30 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca failed
Oct 22 22:01:33 an-c01n01 fenced[2218]: fencing node an-c01n02.alteeve.ca
Oct 22 22:01:33 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca dev 0.0 agent none result: error no method
Oct 22 22:01:33 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca failed
Oct 22 22:01:36 an-c01n01 fenced[2218]: fencing node an-c01n02.alteeve.ca
Oct 22 22:01:37 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca dev 0.0 agent none result: error no method
Oct 22 22:01:37 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca failed

Note the errors;

Oct 22 22:01:30 an-c01n01 fenced[2218]: fencing node an-c01n02.alteeve.ca
Oct 22 22:01:30 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca dev 0.0 agent none result: error no method
Oct 22 22:01:30 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca failed

This is expected and can be safely ignored. After a minute or so, the hung node should reboot. Once it's back online and rejoins the cluster, the cluster should go back to normal operation.
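
If the lost node can not restart on its own, and you have confirmed that it really is powered off, you can clear the fence manually from the surviving node. The node name below is from this document's example cluster, and the exact invocation may differ slightly between cluster versions;

# Only run this after confirming the failed node is powered off!
fence_ack_manual an-c01n02.alteeve.ca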

Using fence_sanlock And checkquorum.wdmd

At this time, using both fence_sanlock and checkquorum.wdmd is not supported.

Permissions

I give unrestricted permission to Red Hat, Inc. to copy this document in whole or in part.

 
