Watchdog Recovery
|  | Note: This tutorial is written using Fedora 16. | 
The new fence_sanlock and checkquorum.wdmd fence agents provide new fencing options to users who may not have full out-of-band management, switched PDUs or similar traditional fence devices. They aim to provide a critical function in clusters to users who otherwise would have no (affordable) options.
|  | Warning: This technology is Tech Preview! There is no support for this fence method yet. Feedback and bug reports are much appreciated. | 
About Fencing
Traditionally in clustering, all nodes must be in a known state. In practice, this meant that when a node stopped responding, the rest of the cluster could not safely proceed until the silent node was put into a known state.
The action of putting a node into a known state is called "fencing". Typically, this is done by one of the other nodes in the cluster either isolating or forcibly powering off the lost node.
- With isolation, the lost node would not itself be touched, but its network link(s) would be disabled. This would ensure that even if the node recovered, it would no longer have access to the cluster or its shared storage. This form of fencing is called "fabric fencing".
- The far more common form of fencing is to forcibly power off the lost node. This is done by using an external device, like a server's out-of-band management card (IPMI, iLO, etc) or by using a network-connected power bar, called a PDU.
In either case, the purpose of fencing is to ensure that the lost node will not be able to access clustered resources, like shared storage, or provide clustered services in an uncoordinated manner. Skipping this crucial step could cause data loss, so it is critical to always use fencing in clusters.
Watchdog Timers
Many motherboards have "watchdog" timers built in. These timers will cause the host machine to reboot if the system appears to freeze for a period of time. The new fence_sanlock agent combines these with SAN storage to provide an alternative fence method.
Where "fabric fencing" can be thought of as a form of ostracism and "power fencing" can be thought of as a form of murder, watchdog fencing can be thought of as a form of suicide. Morbid, but accurate.
Options
There are currently two mechanisms to trigger a node recovery via a watchdog device:
- fence_sanlock: Preferred method, but always requires shared storage.
- checkquorum.wdmd: Only requires shared storage for 2-node clusters.
Differences Between fence_sanlock And checkquorum.wdmd
When choosing which type of watchdog-based fencing to use, consider the following:
| Method | Advantages | Disadvantages | 
|---|---|---|
| fence_sanlock |  |  |
| checkquorum.wdmd |  |  |
Important Note On Timing
Watchdog timers work by having a constant count down running. The host has to periodically and reliably reset this timer. If the watchdog timer is allowed to expire, the host machine will be reset. This timeout is often measured in minutes.
Traditional fencing methods communicate with external devices that can report success as soon as the target node has been fenced. This process is usually measured in a small number of seconds.
When a cluster loses contact with a node, it blocks by design. It is not safe to proceed until all nodes are in a known state, so the users of the cluster services will notice a period of interruption until the lost node recovers or is fenced.
Putting this together, this timing difference means that any watchdog-based fencing will be much slower than traditional fencing. Your users will most likely experience an outage of several minutes while fence_sanlock works. For this reason, fence_sanlock should be used only when traditional fence methods are unavailable.
In short, watchdog fencing is not a replacement for traditional fencing. It is only a replacement for no fencing at all.
fence_sanlock
The fence_sanlock daemon works by combining locks on shared storage with watchdog devices under each node.
On start-up, each node "unfences", during which time the node takes a lock from the shared storage and begins resetting the watchdog timer at set intervals. So long as the node is healthy, this reset of the watchdog timer repeats indefinitely.
If there is a failure on the node itself, the node will self-fence by allowing its watchdog timer to expire. This will happen either a period of time after losing quorum or immediately if corosync fails.
When another node in the cluster wants to call a fence against a victim, it will try to take the victim's lock on the shared storage. If the victim is not already in the process of fencing itself, it will detect the attempt to take its lock and allow its watchdog timer to expire. Of course, if the victim has crashed entirely, its watchdog device will already be counting down. In any case, the victim will eventually reboot.
The cluster will know that the victim is gone when it is finally able to take the victim's lock. This is safe because, so long as the victim is alive, it will maintain its lock.
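The reasoning above can be illustrated with a toy model. This is only a sketch of the idea that a lock stays "held" while its owner keeps renewing it; it is not sanlock's actual lease protocol, and the LEASE_TIMEOUT value and the lease_state function are made up for illustration.

```shell
# Toy model only: sanlock's real lease handling is more involved.
# A lease counts as held while its owner renewed it within the timeout.
LEASE_TIMEOUT=80   # hypothetical value, in seconds

# lease_state LAST_RENEWAL NOW -> prints "held" or "free"
lease_state() {
    age=$(( $2 - $1 ))
    if [ "$age" -le "$LEASE_TIMEOUT" ]; then
        echo "held"
    else
        echo "free"
    fi
}

lease_state 100 150   # victim renewed 50 seconds ago
lease_state 100 200   # victim has been silent for 100 seconds
```

In this model, a fencing node that finds the lease "free" can conclude that the victim has stopped renewing it, which is the condition the cluster waits for before declaring the fence a success.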
Requirements
You will need;
- A hardware watchdog timer.
- External shared storage, 1 GiB or larger in size.
- Any shared storage will do. Here is a short tgtd tutorial for creating a SAN on a machine outside of the cluster if you don't have existing shared storage.
 
From the hardware end of things, you have to have a hardware watchdog timer. Many workstation and server mainboards offer this as a built-in feature which can be enabled in the BIOS. If your system does not have this, add-in and external watchdog timers can be used and are relatively inexpensive. To further save costs, some open-hardware watchdog timer designs are available for those handy with a soldering iron.
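A quick way to see whether the kernel has exposed a watchdog is to look for its device node. The sketch below assumes the driver presents the usual /dev/watchdog character device; the watchdog_status function name is only for illustration.

```shell
# watchdog_status [DEVICE] -> prints "present" or "missing"
# Assumption: the watchdog driver exposes a character device,
# normally /dev/watchdog.
watchdog_status() {
    dev="${1:-/dev/watchdog}"
    if [ -c "$dev" ]; then
        echo "present"
    else
        echo "missing"
    fi
}

watchdog_status
```

If this reports "missing", check that the watchdog is enabled in the BIOS and that the matching kernel module is loaded.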
You need to install;
- cman ver. 3.1.99 +
- wdmd ver. 2.6 + (available from the sanlock ver. 2.6 + package)
- fence_sanlock ver. 2.6 +
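The minimum versions above can be checked with a small helper. This is a sketch that relies on the version sort (-V) in GNU sort; the version_at_least function is a hypothetical name, not part of any package.

```shell
# version_at_least INSTALLED MINIMUM -> exit status 0 if INSTALLED >= MINIMUM
# Relies on GNU sort's -V (version sort) option.
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# On a real node the installed version could come from rpm, for example:
#   version_at_least "$(rpm -q --qf '%{VERSION}' cman)" 3.1.99
if version_at_least 3.1.99 3.1.99; then
    echo "cman is new enough"
fi
```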
Installation
To install fence_sanlock, run;
yum install cman fence-sanlock sanlock sanlock-lib
Any shared storage device can be used. For the purpose of this document, a SAN LUN, exported by a machine outside of the cluster and made available as /dev/sdb, will be used. If you don't currently have a shared storage device, below is a brief tutorial on setting up a tgtd based SAN:
You can use clustered or non-clustered LVM, an NFS export or almost any other type of shared storage. The only requirement is that it is at least 1 GiB in size.
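The 1 GiB requirement is easy to verify before initializing anything. The following is a minimal sketch; size_verdict is a hypothetical helper, and on a real node the byte count would come from blockdev as shown in the comment.

```shell
MIN_BYTES=$((1024 * 1024 * 1024))   # 1 GiB

# size_verdict BYTES -> prints "ok" or "too small"
size_verdict() {
    if [ "$1" -ge "$MIN_BYTES" ]; then
        echo "ok"
    else
        echo "too small"
    fi
}

# On a real node: size_verdict "$(blockdev --getsize64 /dev/sdb)"
size_verdict 1073741824   # exactly 1 GiB
size_verdict 536870912    # 512 MiB
```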
Once you have your shared storage, initialize it;
|  | Note: The "Initializing 128 sanlock host leases" step may take a while to complete; please be patient. | 
fence_sanlock -o sanlock_init -p /dev/sdb
Initializing fence sanlock lockspace on /dev/sdb: ok
Initializing 128 sanlock host leases on /dev/sdb: ok
Note the path, /dev/sdb in this case. You will use it in the next step.
Configuring cman
Below is an example cman configuration file. The main sections to note are the <fence>, <unfence> and <fencedevice> elements.
vim /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="2" name="an-cluster-01">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-c01n01.alteeve.ca" nodeid="1">
			<fence>
				<method name="wd">
					<device name="watchdog" host_id="1" />
				</method>
			</fence>
			<unfence>
				<device name="watchdog" host_id="1" action="on" />
			</unfence>
		</clusternode>
		<clusternode name="an-c01n02.alteeve.ca" nodeid="2">
			<fence>
				<method name="wd">
					<device name="watchdog" host_id="2" />
				</method>
			</fence>
			<unfence>
				<device name="watchdog" host_id="2" action="on" />
			</unfence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="watchdog" agent="fence_sanlock" path="/dev/sdb"/>
	</fencedevices>
	<fence_daemon post_join_delay="30"/>
	<totem rrp_mode="none" secauth="off"/>
</cluster>
The key attributes are:
- host_id="x"; This tells fence_sanlock which node ID to act on. Generally it matches the corresponding <clusternode ... nodeid="x"> value. It does not have to match, but it must be unique.
- path="/dev/x"; This tells the cluster where each node's lock can be found. Here we used /dev/sdb, which is the device we initialized in the previous step.
Once it passes ccs_config_validate, push it out to your other node(s).
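Because each host_id must be unique, it can be worth scanning the file for clashes before pushing it out. This is a rough sketch, not a replacement for ccs_config_validate. It assumes each <unfence> device sits on a single line and carries action="on", as in the example above; duplicate_host_ids is a hypothetical helper name.

```shell
# duplicate_host_ids FILE -> prints every host_id that appears on more than
# one unfence device line; prints nothing when all are unique.
duplicate_host_ids() {
    grep 'action="on"' "$1" \
        | grep -o 'host_id="[^"]*"' \
        | sort \
        | uniq -d
}

# Demonstration against a throwaway file with a deliberate clash.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<device name="watchdog" host_id="1" action="on" />
<device name="watchdog" host_id="1" action="on" />
EOF
duplicate_host_ids "$sample"
rm -f "$sample"
```

On a cluster node, the same check would be run as duplicate_host_ids /etc/cluster/cluster.conf. No output means no duplicates were found.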
Enabling fence_sanlock
We need to disable wdmd and sanlock from starting at boot. Then we need to enable the fence_sanlockd daemon to start on boot. The fence_sanlockd daemon will start the wdmd and sanlock daemons itself.
|  | Note: If you are using a pre-systemd OS, use chkconfig and init.d scripts instead of systemctl. | 
systemctl disable wdmd.service
systemctl disable sanlock.service
systemctl enable fence_sanlockd.service
If you started the daemons, stop them now.
systemctl stop wdmd.service
systemctl stop sanlock.service
systemctl stop fence_sanlockd.service
Now stop cman and then start fence_sanlockd. If you're not running the cluster yet, then stopping cman is not needed.
systemctl stop cman.service
systemctl start fence_sanlockd.service
Now we can start the cman service again.
systemctl start cman.service
Testing fence_sanlock
If everything is working, you should now be able to recover from a hard-crash on a node in the cluster. To test this, you can echo c to /proc/sysrq-trigger and the other node should, eventually, fence the lost node and restore cluster services.
echo c > /proc/sysrq-trigger
In the system log of the other node, you will see messages similar to those in the example below.
Oct 21 23:25:23 an-c01n02 corosync[1652]:   [TOTEM ] A processor failed, forming new configuration.
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] New Configuration:
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] #011r(0) ip(10.20.10.2) 
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] Members Left:
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] #011r(0) ip(10.20.10.1) 
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] Members Joined:
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [QUORUM] Members[1]: 2
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] New Configuration:
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] #011r(0) ip(10.20.10.2) 
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] Members Left:
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] Members Joined:
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CPG   ] chosen downlist: sender r(0) ip(10.20.10.2) ; members(old:2 left:1)
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 21 23:25:25 an-c01n02 kernel: [ 1075.915453] dlm: closing connection to node 1
Oct 21 23:25:25 an-c01n02 fenced[1708]: fencing node an-c01n01.alteeve.ca
Oct 21 23:25:26 an-c01n02 fence_sanlock: 1997 host_id 1 gen 1 ver 1 timestamp 391
Oct 21 23:27:20 an-c01n02 fenced[1708]: fence an-c01n01.alteeve.ca success
This shows that fence_sanlock is working properly.
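The log also shows how long the fence took: fenced began at 23:25:25 and reported success at 23:27:20. The small sketch below works out the gap from two such timestamps. The function names are made up for illustration, and both times are assumed to fall on the same day.

```shell
# to_seconds HH:MM:SS -> seconds since midnight
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# elapsed_seconds START END -> difference in seconds (same day assumed)
elapsed_seconds() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

elapsed_seconds 23:25:25 23:27:20   # prints 115
```

Just under two minutes passed between the fence request and its success, which matches the earlier warning that watchdog fencing is far slower than traditional fencing.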
checkquorum.wdmd
The checkquorum.wdmd script is not really a fence device in the traditional sense. The only way for a fence action to be considered a success is to have the failed node rejoin the cluster in a clean (freshly started) state. If a node fails in such a way that it cannot start back up, the cluster will remain blocked until cleared by an administrator issuing a fence_ack_manual on the surviving node.
The checkquorum.wdmd script works by having wdmd stop resetting the system's watchdog timer if the node loses quorum. For this reason, checkquorum.wdmd will not work if the 2-node (<cman expected_votes="1" two_node="1"/>) option is set in /etc/cluster/cluster.conf, because quorum is effectively disabled. When configured for 2-node clusters, a node will never lose quorum, so wdmd will never stop resetting the watchdog timer.
To get around this 2-node limitation, we can use qdisk on shared storage. This provides a third vote and allows quorum to be used properly.
|  | Note: The qdisk device must be in master_wins mode. Please see man 5 qdisk for more information. | 
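The effect of the third vote can be shown with simple majority arithmetic. The sketch below assumes each node and the qdisk contribute one vote and that quorum is a simple majority of the expected votes; the function names are hypothetical.

```shell
# quorum_needed EXPECTED_VOTES -> votes required for a simple majority
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# quorum_state VOTES_HELD EXPECTED_VOTES -> prints "quorate" or "inquorate"
quorum_state() {
    if [ "$1" -ge "$(quorum_needed "$2")" ]; then
        echo "quorate"
    else
        echo "inquorate"
    fi
}

quorum_state 2 3   # surviving node plus the qdisk vote
quorum_state 1 3   # isolated node without the qdisk vote
```

With three expected votes, the node that holds the qdisk vote stays quorate while the other node drops below the majority, which is what allows wdmd to stop resetting that node's watchdog timer.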
Requirements
You will need;
- A hardware watchdog timer.
- Two-Node clusters only; External shared storage, 10 MiB or larger in size, for a qdisk partition.
From the hardware end of things, you have to have a hardware watchdog timer. Many workstation and server mainboards offer this as a built-in feature which can be enabled in the BIOS. If your system does not have this, add-in and external watchdog timers can be used and are relatively inexpensive. To further save costs, some open-hardware watchdog timer designs are available for those handy with a soldering iron.
You need to install;
- cman ver. 3.1.99 +
- wdmd ver. 2.6 + (available from the sanlock ver. 2.6 + package)
- sanlock ver. 2.6 +
Installation
To use checkquorum.wdmd, install;
yum install cman sanlock sanlock-lib
Configuring qdisk (2-Node Only)
|  | Note: If you have a cluster with three or more nodes, you can skip this step. | 
|  | Warning: If you are already using qdisk, do not create a new qdisk device while the cluster is running! It will cause the cluster to malfunction. | 
In this document, the SAN device is mounted on each node as /dev/sdb. If you do not have an existing SAN, below is a brief tutorial on setting up a tgtd based SAN:
Create the quorum disk using the following command;
|  | Note: In the example below, the label an-cluster-01.wdmd is used. This is a free-form label between 1 and 128 characters in length. | 
mkqdisk -c /dev/sdb -l an-cluster-01.wdmd
mkqdisk v3.1.99
Writing new quorum disk label 'an-cluster-01.wdmd' to /dev/sdb.
WARNING: About to destroy all data on /dev/sdb; proceed [N/y] ? y
Initializing status block for node 1...
Initializing status block for node 2...
Initializing status block for node 3...
Initializing status block for node 4...
Initializing status block for node 5...
Initializing status block for node 6...
Initializing status block for node 7...
Initializing status block for node 8...
Initializing status block for node 9...
Initializing status block for node 10...
Initializing status block for node 11...
Initializing status block for node 12...
Initializing status block for node 13...
Initializing status block for node 14...
Initializing status block for node 15...
Initializing status block for node 16...
With this now created, edit your cluster.conf file to use the quorum disk. This involves removing the <cman expected_votes="1" two_node="1"/> entry and replacing it with the quorum configuration. The important value we will need to set is label="an-cluster-01.wdmd", which tells cman which device is its qdisk. For more information on the available qdisk attributes and values, read man 5 qdisk.
vim /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="3" name="an-cluster-01">
	<quorumd label="an-cluster-01.wdmd"/>
	<clusternodes>
		<clusternode name="an-c01n01.alteeve.ca" nodeid="1"/>
		<clusternode name="an-c01n02.alteeve.ca" nodeid="2"/>
	</clusternodes>
	<fencedevices/>
	<fence_daemon post_join_delay="30"/>
	<totem rrp_mode="none" secauth="off"/>
</cluster>
|  | Note: Fence methods and fence devices are not defined under each node when using checkquorum.wdmd. | 
If you want to configure checkquorum.wdmd, you can do so by creating or editing /etc/sysconfig/checkquorum.
Configuration options:
| Option | Value(s) | Description | 
|---|---|---|
| waittime | natural number | This is the number of seconds to wait after losing quorum before declaring a failure. What happens next depends on what action is set to. | 
| action | (See below) | This is the action taken either immediately if corosync crashes or after waittime seconds once quorum is lost. This delay is used in case the node is able to rejoin the cluster, thus regaining quorum. | 
|  | autodetect (default) | If kdump is running, an attempt to crash dump is made. If kdump is not running, an error will be returned to wdmd which in turn will allow watchdog to reboot the machine. | 
|  | hardreboot | This will trigger a kernel hard-reboot. | 
|  | crashdump | This will trigger a kdump action in the kernel. | 
|  | watchdog | This will return an error to wdmd which in turn will allow watchdog to reboot the machine. | 
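Before saving the file, the chosen values can be checked against the options listed above. This is a minimal sketch; the two helper functions are hypothetical, and "natural number" is taken here to mean a string made only of digits.

```shell
# valid_action NAME -> prints "valid" or "invalid"
valid_action() {
    case "$1" in
        autodetect|hardreboot|crashdump|watchdog) echo "valid" ;;
        *) echo "invalid" ;;
    esac
}

# valid_waittime VALUE -> prints "valid" when VALUE contains only digits
valid_waittime() {
    case "$1" in
        ''|*[!0-9]*) echo "invalid" ;;
        *) echo "valid" ;;
    esac
}

valid_action crashdump
valid_waittime 60
```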
If kdump is running, it is advised that you also use fence_kdump. This will allow the failed node to inform the other nodes that it has rebooted, speeding up recovery time. This is optional and should be considered an optimization.
|  | Warning: The fence_kdump agent is totally untested. Please use it cautiously. | 
There are limitations to using checkquorum.wdmd;
- Note that checkquorum.wdmd does not work when two_node=1 is set in /etc/cluster/cluster.conf unless it is used in combination with qdiskd in master_wins mode. Given the need for shared storage, however, it is better to just use fence_sanlock in this case.
- When using checkquorum.wdmd, fencing is considered complete only after the failed node has rebooted and rejoined the cluster. If this fails, or if there is a failure in the cluster's network, the cluster will hang indefinitely. Likewise, if the node has failed completely and does not restart, the cluster will also hang.
- If all communication is lost between the nodes, such as a failure in the core switch(es), all nodes will lose quorum and reboot.
Using checkquorum.wdmd
Before checkquorum.wdmd can be used, we must first copy the checkquorum.wdmd script into the watchdog script directory and make it executable. Then we will restart the wdmd daemon to enable the new script.
Copy checkquorum.wdmd and set up its permissions and ownership.
cp /usr/share/cluster/checkquorum.wdmd /etc/wdmd.d/
chown root:root /etc/wdmd.d/checkquorum.wdmd
chmod u+x /etc/wdmd.d/checkquorum.wdmd
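It is worth confirming that the script is in place and executable before restarting anything. The following sketch does only that; script_status is a hypothetical helper name.

```shell
# script_status PATH -> prints "ready", "not executable" or "missing"
script_status() {
    if [ ! -f "$1" ]; then
        echo "missing"
    elif [ ! -x "$1" ]; then
        echo "not executable"
    else
        echo "ready"
    fi
}

script_status /etc/wdmd.d/checkquorum.wdmd
```

Anything other than "ready" suggests that the copy or chmod step above needs to be repeated.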
Stop cman if it is running, then restart the wdmd daemon and start cman again.
systemctl stop cman.service
systemctl stop wdmd.service
systemctl start wdmd.service
systemctl start cman.service
Testing checkquorum.wdmd
If everything is working, you should now be able to recover from a hard-crash on a node in the cluster. To test this, you can echo c to /proc/sysrq-trigger and the other node should, eventually, fence the lost node. Remember, the cluster will remain blocked until either;
- The lost node reboots and rejoins the cluster in a clean state.
or;
- You manually clear the fence by issuing fence_ack_manual.
|  | Warning: You must be certain that the failed node has been manually powered off before manually clearing the fence. Failing to do so could cause a serious failure in your cluster! | 
echo c > /proc/sysrq-trigger
In the system log of the other node, you will see messages similar to those in the example below.
Oct 22 22:01:25 an-c01n01 qdiskd[2138]: Writing eviction notice for node 2
Oct 22 22:01:26 an-c01n01 qdiskd[2138]: Node 2 evicted
Oct 22 22:01:28 an-c01n01 corosync[2084]:   [TOTEM ] A processor failed, forming new configuration.
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] New Configuration:
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] #011r(0) ip(10.20.10.1) 
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] Members Left:
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] #011r(0) ip(10.20.10.2) 
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] Members Joined:
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [QUORUM] Members[1]: 1
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] New Configuration:
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] #011r(0) ip(10.20.10.1) 
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] Members Left:
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] Members Joined:
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CPG   ] chosen downlist: sender r(0) ip(10.20.10.1) ; members(old:2 left:1)
Oct 22 22:01:30 an-c01n01 kernel: [13038.313753] dlm: closing connection to node 2
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 22:01:30 an-c01n01 fenced[2218]: fencing node an-c01n02.alteeve.ca
Oct 22 22:01:30 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca dev 0.0 agent none result: error no method
Oct 22 22:01:30 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca failed
Oct 22 22:01:33 an-c01n01 fenced[2218]: fencing node an-c01n02.alteeve.ca
Oct 22 22:01:33 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca dev 0.0 agent none result: error no method
Oct 22 22:01:33 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca failed
Oct 22 22:01:36 an-c01n01 fenced[2218]: fencing node an-c01n02.alteeve.ca
Oct 22 22:01:37 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca dev 0.0 agent none result: error no method
Oct 22 22:01:37 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca failed
Note the errors;
Oct 22 22:01:30 an-c01n01 fenced[2218]: fencing node an-c01n02.alteeve.ca
Oct 22 22:01:30 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca dev 0.0 agent none result: error no method
Oct 22 22:01:30 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca failed
This is expected and can be safely ignored. After a minute or so, the hung node should reboot. Once it's back online and rejoins the cluster, the cluster should go back to normal operation.
Using fence_sanlock And checkquorum.wdmd
At this time, using both fence_sanlock and checkquorum.wdmd is not supported.
Permissions
I give unrestricted permission to Red Hat, Inc. to copy this document in whole or in part.
| © 2025 Alteeve. Intelligent Availability® is a registered trademark of Alteeve's Niche! Inc. 1997-2025 | |||
| legal stuff: All info is provided "As-Is". Do not use anything here unless you are willing and able to take responsibility for your own actions. | |||