Watchdog Recovery

Note: This tutorial is written using Fedora 16.

The new fence_sanlock and checkquorum.wdmd fence agents provide fencing options to users who may not have full out-of-band management, switched PDUs or similar traditional fence devices. They aim to provide a critical cluster function to users who otherwise would have no (affordable) options.

Warning: This technology is TechPreview! There is no support for this fence method yet. Feedback and bug reports are much appreciated.

About Fencing

Traditionally in clustering, all nodes must be in a known state. In practice, this meant that when a node stopped responding, the rest of the cluster could not safely proceed until the silent node was put into a known state.

The action of putting a node into a known state is called "fencing". Typically, this is done by one of the other nodes in the cluster either isolating or forcibly powering off the lost node.

  • With isolation, the lost node itself would not be touched, but its network link(s) would be disabled. This would ensure that even if the node recovered, it would no longer have access to the cluster or its shared storage. This form of fencing is called "fabric fencing".
  • The far more common form of fencing is to forcibly power off the lost node. This is done by using an external device, like a server's out-of-band management card (IPMI, iLO, etc) or by using a network-connected power bar, called a PDU.

In either case, the purpose of fencing is to ensure that the lost node can no longer access clustered resources, like shared storage, or provide clustered services in an uncoordinated manner. Skipping this crucial step could cause data loss, so it is critical to always use fencing in clusters.

Watchdog Timers

Many motherboards have "watchdog" timers built in. These timers will cause the host machine to reboot if the system appears to freeze for a period of time. The new fence_sanlock agent combines these with SAN storage to provide an alternative fence method.

Where "fabric fencing" can be thought of as a form of ostracism and "power fencing" can be thought of as a form of murder, watchdog fencing can be thought of as a form of suicide. Morbid, but accurate.

Options

There are currently two mechanisms for triggering a node recovery via a watchdog device:

  1. fence_sanlock: Preferred method, but always requires shared storage.
  2. checkquorum.wdmd: Only requires shared storage for 2-node clusters.

Important Note On Timing

Watchdog timers work by running a constant countdown. The host has to periodically and reliably reset this timer. If the watchdog timer is allowed to expire, the host machine will be reset. This timeout is often measured in minutes.
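
For the curious, below is a minimal sketch of the raw kernel watchdog interface that wdmd drives on your behalf. Exact behaviour varies by driver, so treat this as illustrative only and never run it on a production machine.

# DANGER: on most drivers, opening /dev/watchdog arms the timer immediately.
# If the writes stop, the machine will reset. Test machines only!

# Pet the watchdog every 10 seconds; any write counts as a keep-alive.
while true; do echo . > /dev/watchdog; sleep 10; done

# To disarm cleanly, write the magic character 'V' before closing. This only
# works if the driver supports "magic close" and was not built with nowayout.
echo V > /dev/watchdog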

Traditional fencing methods communicate with external devices, which can report success as soon as the target node has been fenced. This process is usually measured in a small number of seconds.

When a cluster loses contact with a node, it blocks by design. It is not safe to proceed until all nodes are in a known state, so the users of the cluster services will notice a period of interruption until the lost node recovers or is fenced.

Putting this together, the timing difference means that any watchdog-based fencing will be much slower than traditional fencing. Your users will most likely experience an outage of several minutes while fence_sanlock does its work. For this reason, fence_sanlock should be used only when traditional fence methods are unavailable.

In short; watchdog fencing is not a replacement for traditional fencing. It is only a replacement for no fencing at all.

fence_sanlock

The fence_sanlock agent works by combining locks on shared storage with a watchdog device on each node.

On start-up, each node "unfences", during which time the node takes a lock from the shared storage and begins resetting the watchdog timer at set intervals. So long as the node is healthy, this reset of the watchdog timer repeats indefinitely.

If there is a failure on the node itself, the node will self-fence by allowing its watchdog timer to expire. This will happen either a period of time after losing quorum or immediately if corosync fails.

When another node in the cluster wants to fence a victim, it will try to take the victim's lock on the shared storage. If the victim is not already in the process of fencing itself, it will detect the attempt to take its lock and allow its watchdog timer to expire. Of course, if the victim has crashed entirely, its watchdog device will already be counting down. In any case, the victim will eventually reboot.

The cluster will know the victim is gone when it is finally able to take the victim's lock. This is safe because, so long as the victim is alive, it will maintain its lock.
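
The agent can also be driven by hand, which is useful for testing. Below is a minimal sketch, assuming your fence_sanlock build accepts the standard on/off/status actions via -o, the lease device via -p and the host ID via -i; check the fence_sanlock man page for the options your version actually supports.

# Check whether host_id 2 currently holds its lease on the shared device.
fence_sanlock -o status -p /dev/sdb -i 2

# Fence host_id 2. DANGER: if the lease is held, node 2 will eventually reset.
fence_sanlock -o off -p /dev/sdb -i 2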

Requirements

You will need;

  • A hardware watchdog timer.
  • External shared storage, 1 GiB or larger in size.
    • Any shared storage will do. If you don't have existing shared storage, a tgtd based SAN built on a machine outside of the cluster will work; a brief sketch is shown in the Configuring Shared Storage section below.

On the hardware side, you must have a hardware watchdog timer. Many workstation and server mainboards offer this as a built-in feature which can be enabled in the BIOS. If your system does not have one, add-in and external watchdog timers are available and relatively inexpensive. To further save costs, some open-hardware watchdog timer designs are available for those handy with a soldering iron.

Note: Software watchdog timers exist, but they are not supported in production. They rely on the host functioning to at least some degree, which is a fatal design flaw. A simple test of issuing echo c > /proc/sysrq-trigger will demonstrate the flaw in using software watchdog timers.
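
To check whether your system already exposes a hardware watchdog to the kernel, the following is a reasonable starting point; the patterns in the grep are only common examples of watchdog driver names.

# A hardware watchdog driver will create this device node.
ls -l /dev/watchdog

# Look for a loaded watchdog driver (iTCO_wdt, ipmi_watchdog, hpwdt, etc).
lsmod | grep -iE 'wdt|watchdog'
dmesg | grep -i watchdog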

You need to install;

  • cman ver. 3.1.99 +
  • wdmd ver. 2.6 + (available from the sanlock ver. 2.6 + package)
  • fence_sanlock ver. 2.6 +

Installation

To install fence_sanlock, run;

yum install cman fence-sanlock sanlock sanlock-lib
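
To confirm that the installed packages meet the minimum versions listed above, query them:

rpm -q cman sanlock sanlock-lib fence-sanlock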

Configuring Shared Storage

Any shared storage device can be used. For the purpose of this document, a SAN LUN exported by a machine outside of the cluster and made available as /dev/sdb will be used. If you don't currently have a shared storage device, a brief sketch of setting up a tgtd based SAN is shown below.

You can use [non-]clustered LVM, an NFS export or most any other type of shared storage. The only requirement is that it be at least 1 GiB in size.
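
If you need to build that SAN from scratch, the following is a rough sketch only; the backing file path, iSCSI IQN and IP address are all examples to adapt to your environment.

# On the SAN host (not a cluster node); install and start the tgtd target daemon.
yum install scsi-target-utils
systemctl start tgtd.service
systemctl enable tgtd.service

# Create a 1 GiB backing file (example path).
mkdir -p /var/lib/tgtd
dd if=/dev/zero of=/var/lib/tgtd/fence_sanlock.img bs=1M count=1024

# Export it as LUN 1 of a new iSCSI target (example IQN), open to all initiators.
tgtadm --lld iscsi --mode target --op new --tid 1 --targetname iqn.2012-10.ca.alteeve:fence-sanlock
tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 --backing-store /var/lib/tgtd/fence_sanlock.img
tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address ALL

# On each cluster node, discover and log in to the target (example IP).
yum install iscsi-initiator-utils
iscsiadm -m discovery -t sendtargets -p 192.168.1.100
iscsiadm -m node -l

Note that tgtadm changes are not persistent across a reboot of the SAN host; tgt-admin --dump > /etc/tgt/targets.conf can be used to save the running configuration.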

Once you have your shared storage, initialize it;

Note: The "Initializing 128 sanlock host leases" step may take a while to complete; please be patient.
fence_sanlock -o sanlock_init -p /dev/sdb
Initializing fence sanlock lockspace on /dev/sdb: ok
Initializing 128 sanlock host leases on /dev/sdb: ok
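
If you want to confirm that the leases were actually written, sanlock can read them back from the device; this assumes your sanlock build provides the direct dump subcommand.

# Dump the on-disk lockspace and lease records (output trimmed with head).
sanlock direct dump /dev/sdb | head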

Note the path, /dev/sdb in this case. You will use it in the next step.

Configuring cman

Below is an example cman configuration file. The main sections to note are the <fence>, <unfence> and <fencedevice> elements.

vim /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="2" name="an-cluster-01">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-c01n01.alteeve.ca" nodeid="1">
			<fence>
				<method name="wd">
					<device name="watchdog" host_id="1" />
				</method>
			</fence>
			<unfence>
				<device name="watchdog" host_id="1" action="on" />
			</unfence>
		</clusternode>
		<clusternode name="an-c01n02.alteeve.ca" nodeid="2">
			<fence>
				<method name="wd">
					<device name="watchdog" host_id="2" />
				</method>
			</fence>
			<unfence>
				<device name="watchdog" host_id="2" action="on" />
			</unfence>
		</clusternode>
	</clusternodes>
	<fencedevices>
	<fencedevice name="watchdog" agent="fence_sanlock" path="/dev/sdb"/>
	</fencedevices>
	<fence_daemon post_join_delay="30"/>
	<totem rrp_mode="none" secauth="off"/>
</cluster>

The key attributes are:

  • host_id="x"; This tells fence_sanlock which node ID to act on. It must match the corresponding <clusternode ... nodeid="x"> value.
  • path="/dev/x"; This tells the cluster where each node's lock can be found. Here we used /dev/sdb, the device we initialized in the previous step.
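
Before starting the cluster, copy cluster.conf to both nodes and check that it is valid; ccs_config_validate ships with cman and checks the file against the cluster schema. The short host name below is an example, adjust it for your nodes.

# Copy the configuration to the other node.
rsync -av /etc/cluster/cluster.conf root@an-c01n02:/etc/cluster/
# Validate on each node; it should report that the configuration validates.
ccs_config_validate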

Enabling fence_sanlock

Note: fence_sanlock fencing and unfencing operations can take up to several minutes to complete. This is normal and expected behaviour. Other than this, fencing works as with any other fence device; from a user's perspective there are no operational differences.

We need to disable the wdmd and sanlock daemons and then enable the fence_sanlockd daemon.

systemctl disable wdmd.service
systemctl disable sanlock.service
systemctl enable fence_sanlockd.service
Note: If you are using a pre-systemd OS, use chkconfig instead of systemctl.

Now stop cman and then start fence_sanlockd. If you're not running the cluster yet, then stopping cman is not needed.

systemctl stop cman.service
systemctl start fence_sanlockd.service

Now we can start the cman service again.

systemctl start cman.service
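
Once cman is running on both nodes, a couple of quick checks will confirm that both members joined the fence domain. The fence_node call is a destructive live test, so only run it when you are ready for node 2 to reboot.

# Both nodes should show as cluster members.
cman_tool nodes
# Both nodes should be listed in the fence domain.
fence_tool ls

# Optional, destructive test; expect the delay described above before the reboot.
fence_node an-c01n02.alteeve.ca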

How checkquorum.wdmd Works

ToDo.

checkquorum.wdmd

Setting up checkquorum.wdmd:

Copy checkquorum.wdmd into place and set up its permissions and ownership.

cp /usr/share/cluster/checkquorum.wdmd /etc/wdmd.d/
chown root:root /etc/wdmd.d/checkquorum.wdmd
chmod u+x /etc/wdmd.d/checkquorum.wdmd

Enable the WDMDOPTS line in /etc/sysconfig/wdmd so that wdmd starts with the -S 1 (test script) option.

cp /etc/sysconfig/wdmd /etc/sysconfig/wdmd.orig
sed -i 's/#WDMDOPTS/WDMDOPTS/' /etc/sysconfig/wdmd
diff -u /etc/sysconfig/wdmd.orig /etc/sysconfig/wdmd
--- /etc/sysconfig/wdmd.orig	2012-10-11 18:01:50.935041431 -0400
+++ /etc/sysconfig/wdmd	2012-10-11 18:05:53.101049589 -0400
@@ -4,5 +4,5 @@
 # Include "-G sanlock" in the option string.
 #
 # To enable use of test scripts
-#WDMDOPTS="-G sanlock -S 1"
+WDMDOPTS="-G sanlock -S 1"

Stop cman if it is running, then restart the wdmd daemon and start cman again.

systemctl stop cman.service
systemctl stop wdmd.service
systemctl start wdmd.service
systemctl start cman.service
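
To confirm that wdmd came back up with the test script option enabled:

systemctl status wdmd.service
ps axww | grep '[w]dmd'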

There is no need to define any fence device in /etc/cluster/cluster.conf for wdmd fencing to work. If you want to configure checkquorum.wdmd, you can do so by editing /etc/sysconfig/checkquorum.

Configuration options:

waittime (natural number)
    The number of seconds to wait after losing quorum before declaring a failure. What this means depends on what action is set to.

action (one of the values below)
    The action taken, either immediately if corosync crashes or after waittime seconds once quorum has been lost. The delay is used in case the node is able to rejoin the cluster, thus regaining quorum.

Values for action:

  • autodetect (default): If kdump is running, a crash dump is attempted. If kdump is not running, an error is returned to wdmd, which in turn allows the watchdog to reboot the machine.
  • hardreboot: Triggers a kernel hard reboot.
  • crashdump: Triggers a kdump action in the kernel.
  • watchdog: Returns an error to wdmd, which in turn allows the watchdog to reboot the machine.
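
As an illustration, a minimal /etc/sysconfig/checkquorum might look like the following. This assumes the file uses the usual shell-style variable assignments found under /etc/sysconfig; check the comments shipped with checkquorum.wdmd for the exact names your version expects.

# /etc/sysconfig/checkquorum (example values)
# Wait 60 seconds after quorum is lost before acting.
waittime=60
# Hand the failure to wdmd and let the watchdog reboot the node.
action=watchdog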

If kdump is running, it is advised that you also use fence_kdump. This will allow the failed node to inform the other nodes that it has rebooted, speeding up recovery time. This is optional and should be considered an optimization.

Warning: The fence_kdump agent is totally untested. Please use it cautiously.

There are limitations to be aware of;

  • Note that checkquorum.wdmd does not work when two_node=1 is set in /etc/cluster/cluster.conf unless it is used in combination with qdiskd "master win" mode. Given the need for shared storage however, it is better to just use fence_sanlock in this case.
  • When using checkquorum.wdmd, fencing is considered complete only after the failed node has rebooted and rejoined the cluster. If this fails, or if there is a failure in the cluster's network, the cluster will hang indefinitely. Likewise, if the node has failed completely and does not restart, the cluster will also hang.
  • If all communication is lost between the nodes, such as a failure in the core switch(es), all nodes will lose quorum and reboot.

Using fence_sanlock And checkquorum.wdmd

At this time, using both fence_sanlock and checkquorum.wdmd is not supported.

Setting Up IPMI-Based Watchdog Timer

Different devices will have different methods for installing and setting up.

# Install and start the ipmi daemon
yum install freeipmi-bmc-watchdog watchdog OpenIPMI
systemctl enable ipmi.service
systemctl start ipmi.service

# See the current state of the watchdog timer
ipmitool mc watchdog get

# Disable the timer.
ipmitool mc watchdog off

# Reset the timer, restarting its countdown.
ipmitool mc watchdog reset
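
For wdmd to use the IPMI timer, the BMC must also be exposed to the kernel as /dev/watchdog. On most systems this is done with the ipmi_watchdog driver; the module and file names below are common defaults, so adjust them for your platform.

# Expose the BMC watchdog as /dev/watchdog.
modprobe ipmi_watchdog
ls -l /dev/watchdog

# Load the driver automatically at boot (systemd's modules-load.d).
echo ipmi_watchdog > /etc/modules-load.d/ipmi_watchdog.conf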



 
