Revision as of 19:41, 21 December 2013

AN!Wiki :: How To :: Anvil! Tutorial 3

Warning: This tutorial is incomplete, flawed and generally sucks at this time. Do not follow this and expect anything to work. In large part, it's a dumping ground for notes and little else. This warning will be removed when the tutorial is completed.

Warning: This tutorial is built on Red Hat's Enterprise Linux 7 beta. Red Hat never confirms what a future release will contain until it is actually released, so there is a real chance that what is in the beta will not be in the final release.

This is the third AN!Cluster tutorial, built on Red Hat's Enterprise Linux 7. It improves on the RHEL 5, RHCS stable 2 and RHEL 6, RHCS stable 3 tutorials.

As with the previous tutorials, the end goal of this tutorial is a 2-node cluster providing a platform for high-availability virtual servers. Its design attempts to remove all single points of failure from the system. Power and networking are made fully redundant in this version, and the number of node failures that would lead to a service interruption is minimized. This tutorial also covers the AN!Utilities: AN!Cluster Dashboard, AN!Cluster Monitor and AN!Safe Cluster Shutdown.

As in the previous tutorial, KVM will be the hypervisor used for running virtual machines. The old cman and rgmanager tools are replaced by pacemaker for resource management.

Before We Begin

This tutorial does not require prior cluster experience, but it does expect familiarity with Linux and a low-intermediate understanding of networking. Where possible, steps are explained in detail and rationale is provided for why certain decisions are made.

For those with cluster experience;

Please be careful not to skip too much. There are some major and some subtle changes from previous tutorials.

OS Setup

Warning: I used Fedora 19 at this point. Obviously things will change, possibly a lot, once RHEL 7 is released.

Install

Not all of these are required, but most are used at one point or another in this tutorial.

yum install bridge-utils corosync net-tools ntp pacemaker pcs rsync syslinux wget fence-agents-all

Optional stuff:

yum install gpm man vim screen mlocate

If you want to use your mouse at the node's terminal, run the following;

systemctl enable gpm.service
systemctl start gpm.service

Setting the Hostname

Fedora 19 is very different from EL6.

Note: The '--pretty' line currently doesn't work as there is a bug (rhbz#895299) with single-quotes.

Note: The '--static' option is currently needed to prevent the '.' from being removed. See this bug (rhbz#896756).

Use a format that works for you. For the tutorial, node names are based on the following;

A two-letter prefix identifying the company/user (an, for "Alteeve's Niche!")
A sequential cluster ID number in the form of cXX (c01 for "Cluster 01", c02 for Cluster 02, etc)
A sequential node ID number in the form of nYY

In my case, this is my third cluster and I use the company prefix an, so my two nodes will be;

an-c03n01 - node 1
an-c03n02 - node 2
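The naming scheme above can be sketched as a small shell function. This is only an illustration; node_name is a hypothetical helper made up for this example, not part of any AN! tool, and it assumes the cluster and node numbers are passed without leading zeros.

```bash
# Build a node name from the three parts described above.
# node_name is a hypothetical helper used only for this sketch.
node_name() {
    prefix="$1"     # two-letter company/user prefix, e.g. "an"
    cluster="$2"    # sequential cluster number, e.g. 3 (no leading zero)
    node="$3"       # sequential node number, e.g. 1 (no leading zero)
    printf '%s-c%02dn%02d\n' "$prefix" "$cluster" "$node"
}

node_name an 3 1    # prints an-c03n01
node_name an 3 2    # prints an-c03n02
```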

Folks who've read my earlier tutorials will note that this is a departure in naming. I find this method spans and scales much better. Further, it is simply required in order to use the AN! Cluster Dashboard.

hostnamectl set-hostname an-c03n01.alteeve.ca --static
hostnamectl set-hostname --pretty "Alteeve's Niche! - Cluster 03, Node 01"

If you want the new host name to take effect immediately, you can use the traditional hostname command:

hostname an-c03n01.alteeve.ca

Alternatively

If you have trouble with those commands, you can directly edit the files that contain the host names.

The host name is stored in /etc/hostname:

echo an-c03n01.alteeve.ca > /etc/hostname 
cat /etc/hostname

an-c03n01.alteeve.ca

The "pretty" host name is stored in /etc/machine-info as the unquoted value of the PRETTY_HOSTNAME variable.

vim /etc/machine-info

PRETTY_HOSTNAME=Alteeves Niche! - Cluster 03, Node 01

If you can't get the hostname command to work for some reason, you can reboot to have the system read the new values.

Optional - Video Problems

On my servers, Fedora 19 doesn't detect or use the video card properly. To resolve this, I need to add nomodeset to the kernel line when installing and again after the install is complete.

Once installed

Edit /etc/default/grub and append nomodeset to the end of the GRUB_CMDLINE_LINUX variable.

vim /etc/default/grub

GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_CMDLINE_LINUX="nomodeset rd.md=0 rd.lvm=0 rd.dm=0 $([ -x /usr/sbin/rhcrashkernel-param ] && /usr/sbin/rhcrashkernel-param || :) rd.luks=0 vconsole.keymap=us nomodeset"
GRUB_DISABLE_RECOVERY="true"
GRUB_THEME="/boot/grub2/themes/system/theme.txt"

Save that, and then rewrite the grub2 configuration file.

grub2-mkconfig -o /boot/grub2/grub.cfg

Next time you reboot, you should get a stock 80x25 character display. It's not much, but it will work on esoteric video cards or weird monitors.

What Security?

This section will be re-added at the end. For now;

setenforce 0
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
systemctl disable firewalld.service
systemctl stop firewalld.service

Network

We want static, named network devices. Follow this;

Changing Ethernet Device Names in EL7 and Fedora 15+

Then, use these configuration files;

Build the bridge;

vim /etc/sysconfig/network-scripts/ifcfg-ifn-vbr1

# Internet-Facing Network - Bridge
DEVICE="ifn-vbr1"
TYPE="Bridge"
BOOTPROTO="none"
IPADDR="10.255.10.1"
NETMASK="255.255.0.0"
GATEWAY="10.255.255.254"
DNS1="8.8.8.8"
DNS2="8.8.4.4"
DEFROUTE="yes"

Now build the bonds;

vim /etc/sysconfig/network-scripts/ifcfg-ifn-bond1

# Internet-Facing Network - Bond
DEVICE="ifn-bond1"
BRIDGE="ifn-vbr1"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=ifn1"

vim /etc/sysconfig/network-scripts/ifcfg-sn-bond1

# Storage Network - Bond
DEVICE="sn-bond1"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=sn1"
IPADDR="10.10.10.1"
NETMASK="255.255.0.0"

vim /etc/sysconfig/network-scripts/ifcfg-bcn-bond1

# Back-Channel Network - Bond
DEVICE="bcn-bond1"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=bcn1"
IPADDR="10.20.10.1"
NETMASK="255.255.0.0"

Now tell the interfaces to be slaves to their bonds;

Internet-Facing Network;

vim /etc/sysconfig/network-scripts/ifcfg-ifn1

# Internet-Facing Network - Link 1
DEVICE="ifn1"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
SLAVE="yes"
MASTER="ifn-bond1"

vim /etc/sysconfig/network-scripts/ifcfg-ifn2

# Internet-Facing Network - Link 2
DEVICE="ifn2"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
SLAVE="yes"
MASTER="ifn-bond1"

Storage Network;

vim /etc/sysconfig/network-scripts/ifcfg-sn1

# Storage Network - Link 1
DEVICE="sn1"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
SLAVE="yes"
MASTER="sn-bond1"

vim /etc/sysconfig/network-scripts/ifcfg-sn2

# Storage Network - Link 2
DEVICE="sn2"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
SLAVE="yes"
MASTER="sn-bond1"

Back-Channel Network;

vim /etc/sysconfig/network-scripts/ifcfg-bcn1

# Back-Channel Network - Link 1
DEVICE="bcn1"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
SLAVE="yes"
MASTER="bcn-bond1"

vim /etc/sysconfig/network-scripts/ifcfg-bcn2

# Back-Channel Network - Link 2
DEVICE="bcn2"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
SLAVE="yes"
MASTER="bcn-bond1"
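The six slave files differ only in the device name, the bond they belong to and the comment on the first line, so they can also be generated in a loop. The following is a minimal sketch, not the method used in this tutorial. OUT_DIR and write_slave_cfg are names invented for the example, and the files are written to a scratch directory rather than /etc/sysconfig/network-scripts so that nothing on the node is touched until you have reviewed them.

```bash
# Write one slave-interface file per link into a scratch directory.
# OUT_DIR and write_slave_cfg are illustrative names assumed for this sketch.
OUT_DIR="$(mktemp -d)"

write_slave_cfg() {
    iface="$1"; bond="$2"; desc="$3"
    cat > "$OUT_DIR/ifcfg-$iface" <<EOF
# $desc
DEVICE="$iface"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
SLAVE="yes"
MASTER="$bond"
EOF
}

# Each entry is "short name:long name" for one of the three networks.
for net in "ifn:Internet-Facing Network" "sn:Storage Network" "bcn:Back-Channel Network"; do
    short="${net%%:*}"
    long="${net#*:}"
    for link in 1 2; do
        write_slave_cfg "$short$link" "$short-bond1" "$long - Link $link"
    done
done

# Review one of the generated files before copying anything into place.
cat "$OUT_DIR/ifcfg-bcn2"
```

Once the generated files look right, they can be copied into /etc/sysconfig/network-scripts/ by hand.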

Now restart the network, confirm that the bonds and bridge are up and you are ready to proceed.

Setup The hosts File

You can use DNS if you prefer. For now, let's use /etc/hosts for node name resolution.

vim /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

# AN!Cluster 03, Node 01
10.255.10.1     an-c03n01.ifn
10.10.10.1      an-c03n01.sn
10.20.10.1      an-c03n01.bcn an-c03n01 an-c03n01.alteeve.ca
10.20.31.1      an-c03n01.ipmi

# AN!Cluster 03, Node 02
10.255.10.2     an-c03n02.ifn
10.10.10.2      an-c03n02.sn
10.20.10.2      an-c03n02.bcn an-c03n02 an-c03n02.alteeve.ca
10.20.31.2      an-c03n02.ipmi

# Foundation Pack
10.20.2.7       an-p03 an-p03.alteeve.ca

Setup SSH

Same as before.

Populating And Pushing ~/.ssh/known_hosts

Same as before.

ssh root@an-c03n01.alteeve.ca

The authenticity of host 'an-c03n01.alteeve.ca (10.20.30.1)' can't be established.
RSA key fingerprint is 7b:dd:0d:aa:c5:f5:9e:a6:b6:4d:40:69:d6:80:4d:09.
Are you sure you want to continue connecting (yes/no)?

Type yes

Are you sure you want to continue connecting (yes/no)? yes

Warning: Permanently added 'an-c03n01.alteeve.ca,10.20.30.1' (RSA) to the list of known hosts.
Last login: Thu Feb 14 15:18:33 2013 from 10.20.5.100

You will now be logged into the an-c03n01 node, which in this case is the same machine on a new session in the same terminal.

[root@an-c03n01 ~]#

You can logout by typing exit.

exit

logout
Connection to an-c03n01.alteeve.ca closed.

Now we have to repeat the steps for all the other variations on the names of the hosts. This is annoying and tedious, sorry.

ssh root@an-c03n01
ssh root@an-c03n01.bcn
ssh root@an-c03n01.sn
ssh root@an-c03n01.ifn
ssh root@an-c03n02.alteeve.ca
ssh root@an-c03n02
ssh root@an-c03n02.bcn
ssh root@an-c03n02.sn
ssh root@an-c03n02.ifn
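To avoid missing one of the name variations, the list can be generated rather than typed. This is a small sketch only; list_node_names is a hypothetical helper, and the loop just prints the ssh commands so that you can still run each one by hand and check the fingerprint before accepting it.

```bash
# Print every name by which the two nodes can be reached.
# list_node_names is a hypothetical helper used only for this sketch.
list_node_names() {
    for node in an-c03n01 an-c03n02; do
        for suffix in .alteeve.ca "" .bcn .sn .ifn; do
            printf '%s%s\n' "$node" "$suffix"
        done
    done
}

# Show the ssh command for each of the ten names.
for name in $(list_node_names); do
    echo "ssh root@$name"
done
```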

Your ~/.ssh/known_hosts file will now be populated with both nodes' ssh fingerprints. Copy it over to the second node to save all that typing a second time.

rsync -av ~/.ssh/known_hosts root@an-c03n02:/root/.ssh/

Keeping Time in Sync

It's not as critical as it used to be to keep the clocks on the nodes in sync, but it's still a good idea.

systemctl start ntpd.service
systemctl enable ntpd.service

Configuring IPMI

F19 specifics based on the IPMI tutorial.

yum -y install ipmitool OpenIPMI
systemctl start ipmi.service
systemctl enable ipmi.service

ln -s '/usr/lib/systemd/system/ipmi.service' '/etc/systemd/system/multi-user.target.wants/ipmi.service'

Our servers use lan channel 2, yours might be 1 or something else. Experiment.

ipmitool lan print 2

Set in Progress         : Set Complete
Auth Type Support       : NONE MD5 PASSWORD 
Auth Type Enable        : Callback : NONE MD5 PASSWORD 
                        : User     : NONE MD5 PASSWORD 
                        : Operator : NONE MD5 PASSWORD 
                        : Admin    : NONE MD5 PASSWORD 
                        : OEM      : NONE MD5 PASSWORD 
IP Address Source       : BIOS Assigned Address
IP Address              : 10.20.51.1
Subnet Mask             : 255.255.0.0
MAC Address             : 00:19:99:9a:d8:e8
SNMP Community String   : public
IP Header               : TTL=0x40 Flags=0x40 Precedence=0x00 TOS=0x10
Default Gateway IP      : 10.20.255.254
802.1q VLAN ID          : Disabled
802.1q VLAN Priority    : 0
RMCP+ Cipher Suites     : 0,1,2,3,6,7,8,17
Cipher Suite Priv Max   : OOOOOOOOXXXXXXX
                        :     X=Cipher Suite Unused
                        :     c=CALLBACK
                        :     u=USER
                        :     o=OPERATOR
                        :     a=ADMIN
                        :     O=OEM

I need to set the IPs to 10.20.31.1/16 and 10.20.31.2/16 for nodes 1 and 2, respectively. I also want to set the password to secret for the admin user.

Node 01 IP;

ipmitool lan set 2 ipsrc static
ipmitool lan set 2 ipaddr 10.20.31.1
ipmitool lan set 2 netmask 255.255.0.0
ipmitool lan set 2 defgw ipaddr 10.20.255.254
ipmitool lan print 2

Set in Progress         : Set Complete
Auth Type Support       : NONE MD5 PASSWORD 
Auth Type Enable        : Callback : NONE MD5 PASSWORD 
                        : User     : NONE MD5 PASSWORD 
                        : Operator : NONE MD5 PASSWORD 
                        : Admin    : NONE MD5 PASSWORD 
                        : OEM      : NONE MD5 PASSWORD 
IP Address Source       : Static Address
IP Address              : 10.20.31.1
Subnet Mask             : 255.255.0.0
MAC Address             : 00:19:99:9a:d8:e8
SNMP Community String   : public
IP Header               : TTL=0x40 Flags=0x40 Precedence=0x00 TOS=0x10
Default Gateway IP      : 10.20.255.254
802.1q VLAN ID          : Disabled
802.1q VLAN Priority    : 0
RMCP+ Cipher Suites     : 0,1,2,3,6,7,8,17
Cipher Suite Priv Max   : OOOOOOOOXXXXXXX
                        :     X=Cipher Suite Unused
                        :     c=CALLBACK
                        :     u=USER
                        :     o=OPERATOR
                        :     a=ADMIN
                        :     O=OEM

Node 02 IP;

ipmitool lan set 2 ipsrc static
ipmitool lan set 2 ipaddr 10.20.31.2
ipmitool lan set 2 netmask 255.255.0.0
ipmitool lan set 2 defgw ipaddr 10.20.255.254
ipmitool lan print 2

Set in Progress         : Set Complete
Auth Type Support       : NONE MD5 PASSWORD 
Auth Type Enable        : Callback : NONE MD5 PASSWORD 
                        : User     : NONE MD5 PASSWORD 
                        : Operator : NONE MD5 PASSWORD 
                        : Admin    : NONE MD5 PASSWORD 
                        : OEM      : NONE MD5 PASSWORD 
IP Address Source       : Static Address
IP Address              : 10.20.31.2
Subnet Mask             : 255.255.0.0
MAC Address             : 00:19:99:9a:b1:78
SNMP Community String   : public
IP Header               : TTL=0x40 Flags=0x40 Precedence=0x00 TOS=0x10
Default Gateway IP      : 10.20.255.254
802.1q VLAN ID          : Disabled
802.1q VLAN Priority    : 0
RMCP+ Cipher Suites     : 0,1,2,3,6,7,8,17
Cipher Suite Priv Max   : OOOOOOOOXXXXXXX
                        :     X=Cipher Suite Unused
                        :     c=CALLBACK
                        :     u=USER
                        :     o=OPERATOR
                        :     a=ADMIN
                        :     O=OEM
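The same five ipmitool calls are used on each node, with only the address changing. As a rough sketch, they can be wrapped in a function that prints the commands for a given LAN channel and address. ipmi_lan_cmds is a hypothetical helper invented for this example. It only echoes the commands, so nothing is changed on the BMC until you run them yourself.

```bash
# Print the ipmitool commands for one node's IPMI interface.
# ipmi_lan_cmds is a hypothetical helper; it only echoes, it changes nothing.
ipmi_lan_cmds() {
    channel="$1"; ip="$2"; netmask="$3"; gateway="$4"
    echo "ipmitool lan set $channel ipsrc static"
    echo "ipmitool lan set $channel ipaddr $ip"
    echo "ipmitool lan set $channel netmask $netmask"
    echo "ipmitool lan set $channel defgw ipaddr $gateway"
    echo "ipmitool lan print $channel"
}

# Node 1, using LAN channel 2 as on our servers.
ipmi_lan_cmds 2 10.20.31.1 255.255.0.0 10.20.255.254
```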

Set the password.

ipmitool user list 2

ID  Name	     Callin  Link Auth	IPMI Msg   Channel Priv Limit
1                    true    true       true       Unknown (0x00)
2   admin            true    true       true       OEM
Get User Access command failed (channel 2, user 3): Unknown (0x32)

(ignore the error, it's harmless... *BOOM*)

We want to set admin's password, so we do:

Note: The 2 below is the ID number, not the LAN channel.

ipmitool user set password 2 secret

Done!

Configuring the Cluster

Now we're getting down to business!

For this section, we will be working on an-c03n01 and using ssh to perform tasks on an-c03n02.

Note: TODO: explain what this is and how it works.

Enable the pcs Daemon

Note: Most of this section comes more or less verbatim from the main Clusters from Scratch tutorial.

We will use pcs, the Pacemaker Configuration System, to configure our cluster.

systemctl start pcsd.service
systemctl enable pcsd.service

ln -s '/usr/lib/systemd/system/pcsd.service' '/etc/systemd/system/multi-user.target.wants/pcsd.service'

Now we need to set a password for the hacluster user. This is the account used by pcs on one node to talk to the pcs daemon on the other node. For this tutorial, we will use the password secret. You will want to use a stronger password, of course.

echo secret | passwd --stdin hacluster

Changing password for user hacluster.
passwd: all authentication tokens updated successfully.

Initializing the Cluster

One of the biggest reasons we're using the pcs tool, over something like crm, is that it has been written to simplify the setup of clusters on Red Hat style operating systems. It will configure corosync automatically.

First, we need to know what hostname we will need to use for pcs.

Node 01:

hostname

an-c03n01.alteeve.ca

Node 02:

hostname

an-c03n02.alteeve.ca

Next, authenticate against the cluster nodes.

Both nodes:

pcs cluster auth an-c03n01.alteeve.ca an-c03n02.alteeve.ca -u hacluster

Because we passed -u hacluster, this will only ask for the password. We set the password to secret earlier.

Password: 
an-c03n02.alteeve.ca: Authorized

Do this on one node only:

Now to initialize the cluster's communication and membership layer.

pcs cluster setup --name an-cluster-03 an-c03n01.alteeve.ca an-c03n02.alteeve.ca

an-c03n01.alteeve.ca: Succeeded
an-c03n02.alteeve.ca: Succeeded

This will create the corosync configuration file /etc/corosync/corosync.conf;

cat /etc/corosync/corosync.conf

totem {
version: 2
secauth: off
cluster_name: an-cluster-03
transport: udpu
}

nodelist {
  node {
        ring0_addr: an-c03n01.alteeve.ca
        nodeid: 1
       }
  node {
        ring0_addr: an-c03n02.alteeve.ca
        nodeid: 2
       }
}

quorum {
provider: corosync_votequorum
}

logging {
to_syslog: yes
}

Start the Cluster For the First Time

This starts the cluster communication and membership layer for the first time.

On one node only;

pcs cluster start --all

an-c03n01.alteeve.ca: Starting Cluster...
an-c03n02.alteeve.ca: Starting Cluster...

After a few moments, you should be able to check the status;

pcs status

Cluster name: an-cluster-03
WARNING: no stonith devices and stonith-enabled is not false
Last updated: Mon Jun 24 23:28:29 2013
Last change: Mon Jun 24 23:28:10 2013 via crmd on an-c03n01.alteeve.ca
Current DC: NONE
2 Nodes configured, unknown expected votes
0 Resources configured.


Node an-c03n01.alteeve.ca (1): UNCLEAN (offline)
Node an-c03n02.alteeve.ca (2): UNCLEAN (offline)

Full list of resources:

The other node should show almost identical output.

Warning: We only disable stonith long enough to configure it. You should NEVER run a cluster without fencing, no matter how simple it is and certainly not because "it's just a test cluster". Fencing is always, always required. Without it, your cluster will hang, crash and fail in unexpected and hard to debug ways.

The main thing to note here is the warning about stonith being unconfigured. We will fix this very shortly but, for just this moment, we will disable stonith, and then quorum.

pcs property set stonith-enabled=false
pcs status

Cluster name: an-cluster-03
Last updated: Tue Jun 25 00:12:21 2013
Last change: Tue Jun 25 01:42:04 2013 via cibadmin on an-c03n01.alteeve.ca
Stack: corosync
Current DC: an-c03n02.alteeve.ca (2) - partition with quorum
Version: 1.1.10-3.1670.377aefd.git.el7-377aefd
2 Nodes configured, unknown expected votes
0 Resources configured.


Online: [ an-c03n01.alteeve.ca an-c03n02.alteeve.ca ]

Full list of resources:

Disabling Quorum

Note: Show the math.

With quorum enabled, a two node cluster will lose quorum once either node fails. So we have to disable quorum.
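The arithmetic behind this is short. Quorum requires more than half of the total votes, and the sketch below assumes each node has exactly one vote. quorum_needed is a hypothetical helper written for this example, not a corosync or pacemaker command.

```bash
# Votes needed for quorum: more than half of the total votes.
# quorum_needed is a hypothetical helper; one vote per node is assumed.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

total=2                           # two nodes, one vote each
needed=$(quorum_needed "$total")  # 2 / 2 + 1 = 2
surviving=$(( total - 1 ))        # one node has failed

if [ "$surviving" -ge "$needed" ]; then
    echo "quorate"
else
    echo "inquorate: $surviving vote(s) left, $needed needed"
fi
```

With two nodes, both votes are needed, so the loss of either node leaves the survivor inquorate. With three nodes only two votes are needed, which is why quorum starts to be useful at three nodes and up.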

By default, pacemaker uses quorum. You don't see this initially though;

pcs property

Cluster Properties:
 dc-version: 1.1.9-0.1318.a7966fb.git.fc18-a7966fb
 cluster-infrastructure: corosync

To disable it, we set no-quorum-policy=ignore.

pcs property set no-quorum-policy=ignore
pcs property

Cluster Properties:
 dc-version: 1.1.9-0.1318.a7966fb.git.fc18-a7966fb
 cluster-infrastructure: corosync
 no-quorum-policy: ignore

Enabling and Configuring Fencing

We will use IPMI and PDU based fence devices for redundancy.

You can see the list of available fence agents with pcs stonith list. You will need to find the one for your hardware fence devices.

pcs stonith list

fence_alom - Fence agent for Sun ALOM
fence_apc - Fence agent for APC over telnet/ssh
fence_apc_snmp - Fence agent for APC over SNMP
fence_baytech - I/O Fencing agent for Baytech RPC switches in combination with a Cyclades Terminal
                Server
fence_bladecenter - Fence agent for IBM BladeCenter
fence_brocade - Fence agent for Brocade over telnet
fence_bullpap - I/O Fencing agent for Bull FAME architecture controlled by a PAP management console.
fence_cisco_mds - Fence agent for Cisco MDS
fence_cisco_ucs - Fence agent for Cisco UCS
fence_cpint - I/O Fencing agent for GFS on s390 and zSeries VM clusters
fence_drac - fencing agent for Dell Remote Access Card
fence_drac5 - Fence agent for Dell DRAC CMC/5
fence_eaton_snmp - Fence agent for Eaton over SNMP
fence_egenera - I/O Fencing agent for the Egenera BladeFrame
fence_eps - Fence agent for ePowerSwitch
fence_hpblade - Fence agent for HP BladeSystem
fence_ibmblade - Fence agent for IBM BladeCenter over SNMP
fence_idrac - Fence agent for IPMI over LAN
fence_ifmib - Fence agent for IF MIB
fence_ilo - Fence agent for HP iLO
fence_ilo2 - Fence agent for HP iLO
fence_ilo3 - Fence agent for IPMI over LAN
fence_ilo_mp - Fence agent for HP iLO MP
fence_imm - Fence agent for IPMI over LAN
fence_intelmodular - Fence agent for Intel Modular
fence_ipdu - Fence agent for iPDU over SNMP
fence_ipmilan - Fence agent for IPMI over LAN
fence_kdump - Fence agent for use with kdump
fence_ldom - Fence agent for Sun LDOM
fence_lpar - Fence agent for IBM LPAR
fence_mcdata - I/O Fencing agent for McData FC switches
fence_rackswitch - fence_rackswitch - I/O Fencing agent for RackSaver RackSwitch
fence_rhevm - Fence agent for RHEV-M REST API
fence_rsa - Fence agent for IBM RSA
fence_rsb - I/O Fencing agent for Fujitsu-Siemens RSB
fence_sanbox2 - Fence agent for QLogic SANBox2 FC switches
fence_scsi - fence agent for SCSI-3 persistent reservations
fence_virsh - Fence agent for virsh
fence_vixel - I/O Fencing agent for Vixel FC switches
fence_vmware - Fence agent for VMWare
fence_vmware_soap - Fence agent for VMWare over SOAP API
fence_wti - Fence agent for WTI
fence_xcat - I/O Fencing agent for xcat environments
fence_xenapi - XenAPI based fencing for the Citrix XenServer virtual machines.
fence_zvm - I/O Fencing agent for GFS on s390 and zSeries VM clusters

We will use fence_ipmilan and fence_apc_snmp.

Configuring IPMI Fencing

Every fence agent has a possibly unique subset of options that can be used. You can see a brief description of these options with the pcs stonith describe fence_X command. Let's look at the options available for fence_ipmilan.

pcs stonith describe fence_ipmilan

Stonith options for: fence_ipmilan
  auth: IPMI Lan Auth type (md5, password, or none)
  ipaddr: IPMI Lan IP to talk to
  passwd: Password (if required) to control power on IPMI device
  passwd_script: Script to retrieve password (if required)
  lanplus: Use Lanplus
  login: Username/Login (if required) to control power on IPMI device
  action: Operation to perform. Valid operations: on, off, reboot, status, list, diag, monitor or metadata
  timeout: Timeout (sec) for IPMI operation
  cipher: Ciphersuite to use (same as ipmitool -C parameter)
  method: Method to fence (onoff or cycle)
  power_wait: Wait X seconds after on/off operation
  delay: Wait X seconds before fencing is started
  privlvl: Privilege level on IPMI device
  verbose: Verbose mode

One of the nice things about pcs is that it allows us to create a test file to prepare all our changes in. Then, when we're happy with the changes, we can merge them into the running cluster. So let's make a copy called stonith_cfg.

pcs cluster cib stonith_cfg

Now add IPMI fencing.

#                  unique name    fence agent   target node                           device addr             options
pcs stonith create fence_n01_ipmi fence_ipmilan pcmk_host_list="an-c03n01.alteeve.ca" ipaddr="an-c03n01.ipmi" action="reboot" login="admin" passwd="secret" delay=15 op monitor interval=60s
pcs stonith create fence_n02_ipmi fence_ipmilan pcmk_host_list="an-c03n02.alteeve.ca" ipaddr="an-c03n02.ipmi" action="reboot" login="admin" passwd="secret" op monitor interval=60s

Note that fence_n01_ipmi has delay=15 set but fence_n02_ipmi does not. If the network connection breaks between the two nodes, they will both try to fence each other at the same time. If acpid is running, the slower node will not die right away. It will continue to run for up to four more seconds, ample time for it to also initiate a fence against the faster node. The end result is that both nodes get fenced. The fifteen-second delay protects against this by causing an-c03n02 to pause for 15 seconds before initiating a fence against an-c03n01. If both nodes are alive, an-c03n02 will power off before the 15 seconds pass, so it will never fence an-c03n01. However, if an-c03n01 really is dead, fencing will proceed as normal once the 15 seconds have elapsed.
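The race can be reasoned about with simple arithmetic. The sketch below is a simplified model, not a description of how pacemaker works internally. It assumes both nodes start fencing at the same instant and that a node told to power off keeps running for at most a fixed grace period (the four seconds mentioned above). fence_outcome is a hypothetical helper written for this example.

```bash
# fence_outcome DELAY_N01 DELAY_N02 GRACE
# DELAY_Nxx: delay (seconds) set on the fence device that targets that node.
# GRACE:     worst-case seconds a node keeps running after being told to power off.
# fence_outcome is a hypothetical helper; the model is deliberately simplified.
fence_outcome() {
    d1="$1"; d2="$2"; grace="$3"
    # The call against n01 fires at t=d1, the call against n02 fires at t=d2.
    if [ $(( d2 + grace )) -lt "$d1" ]; then
        echo "an-c03n01 survives"    # n02 is gone before its call against n01 fires
    elif [ $(( d1 + grace )) -lt "$d2" ]; then
        echo "an-c03n02 survives"    # n01 is gone before its call against n02 fires
    else
        echo "both nodes may be fenced"
    fi
}

fence_outcome 15 0 4    # delay=15 on fence_n01_ipmi only: an-c03n01 survives
fence_outcome 0 0 4     # no delay on either device: both nodes may be fenced
```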

Note: At the time of writing, pcmk_reboot_action is needed to override pacemaker's global fence action and pcmk_reboot_action is not recognized by pcs. Both of these issues will be resolved shortly; Pacemaker will honour action="..." in v1.1.10 and pcs will recognize pcmk_* special attributes "real soon now". Until then, the --force switch is needed.

Next, add the PDU fencing. This requires distinct "off" and "on" actions for each outlet on each PDU. With two nodes, each with two PSUs, this translates to eight commands. The "off" commands will be monitored to alert us if the PDU fails for some reason. There is no reason to monitor the "on" actions (it would be redundant). Note also that we don't bother using a "delay". The IPMI fence method will go first, before the PDU actions, so the PDU is already delayed.

# Node 1 - off
pcs stonith create fence_n01_pdu1_off fence_apc_snmp pcmk_host_list="an-c03n01.alteeve.ca" ipaddr="an-p01" action="off" port="1" op monitor interval="60s"
pcs stonith create fence_n01_pdu2_off fence_apc_snmp pcmk_host_list="an-c03n01.alteeve.ca" ipaddr="an-p02" action="off" port="1" power_wait="5" op monitor interval="60s"

# Node 1 - on
pcs stonith create fence_n01_pdu1_on fence_apc_snmp pcmk_host_list="an-c03n01.alteeve.ca" ipaddr="an-p01" action="on" port="1"
pcs stonith create fence_n01_pdu2_on fence_apc_snmp pcmk_host_list="an-c03n01.alteeve.ca" ipaddr="an-p02" action="on" port="1"

# Node 2 - off
pcs stonith create fence_n02_pdu1_off fence_apc_snmp pcmk_host_list="an-c03n02.alteeve.ca" ipaddr="an-p01" action="off" port="2" op monitor interval="60s"
pcs stonith create fence_n02_pdu2_off fence_apc_snmp pcmk_host_list="an-c03n02.alteeve.ca" ipaddr="an-p02" action="off" port="2" power_wait="5" op monitor interval="60s"

# Node 2 - on
pcs stonith create fence_n02_pdu1_on fence_apc_snmp pcmk_host_list="an-c03n02.alteeve.ca" ipaddr="an-p01" action="on" port="2"
pcs stonith create fence_n02_pdu2_on fence_apc_snmp pcmk_host_list="an-c03n02.alteeve.ca" ipaddr="an-p02" action="on" port="2"
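The eight commands follow a regular pattern, so they can be generated in a loop if you prefer. This is a sketch only. pdu_fence_cmds is a hypothetical helper made up for this example, and it assumes, as above, that node N is plugged into outlet N on both an-p01 and an-p02. It only prints the commands so that they can be reviewed before being run.

```bash
# Print the eight "pcs stonith create" commands for the PDU outlets.
# pdu_fence_cmds is a hypothetical helper; it only echoes the commands.
pdu_fence_cmds() {
    for n in 1 2; do
        host="an-c03n0$n.alteeve.ca"
        for action in off on; do
            for p in 1 2; do
                extra=""
                # Only the second PDU's "off" call waits after switching the outlet.
                [ "$action" = "off" ] && [ "$p" = "2" ] && extra=' power_wait="5"'
                # Only the "off" primitives are monitored.
                [ "$action" = "off" ] && extra="$extra op monitor interval=\"60s\""
                echo "pcs stonith create fence_n0${n}_pdu${p}_${action} fence_apc_snmp pcmk_host_list=\"$host\" ipaddr=\"an-p0$p\" action=\"$action\" port=\"$n\"$extra"
            done
        done
    done
}

pdu_fence_cmds
```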

We can check the new configuration now;

pcs status

Cluster name: an-cluster-03
Last updated: Tue Jul  2 16:41:55 2013
Last change: Tue Jul  2 16:41:44 2013 via cibadmin on an-c03n01.alteeve.ca
Stack: corosync
Current DC: an-c03n01.alteeve.ca (1) - partition with quorum
Version: 1.1.9-3.fc19-781a388
2 Nodes configured, unknown expected votes
10 Resources configured.


Online: [ an-c03n01.alteeve.ca an-c03n02.alteeve.ca ]

Full list of resources:

 fence_n01_ipmi	(stonith:fence_ipmilan):	Started an-c03n01.alteeve.ca 
 fence_n02_ipmi	(stonith:fence_ipmilan):	Started an-c03n02.alteeve.ca 
 fence_n01_pdu1_off	(stonith:fence_apc_snmp):	Started an-c03n01.alteeve.ca 
 fence_n01_pdu2_off	(stonith:fence_apc_snmp):	Started an-c03n02.alteeve.ca 
 fence_n02_pdu1_off	(stonith:fence_apc_snmp):	Started an-c03n01.alteeve.ca 
 fence_n02_pdu2_off	(stonith:fence_apc_snmp):	Started an-c03n02.alteeve.ca 
 fence_n01_pdu1_on	(stonith:fence_apc_snmp):	Started an-c03n01.alteeve.ca 
 fence_n01_pdu2_on	(stonith:fence_apc_snmp):	Started an-c03n02.alteeve.ca 
 fence_n02_pdu1_on	(stonith:fence_apc_snmp):	Started an-c03n01.alteeve.ca 
 fence_n02_pdu2_on	(stonith:fence_apc_snmp):	Started an-c03n02.alteeve.ca

Before we proceed, we need to tell pacemaker to use fencing;

pcs property set stonith-enabled=true
pcs property

Cluster Properties:
 cluster-infrastructure: corosync
 dc-version: 1.1.9-3.fc19-781a388
 no-quorum-policy: ignore
 stonith-enabled: true

Excellent!

Configuring Fence Levels

The goal of fence levels is to tell pacemaker that there are "fence methods" to try and to impose an order on those methods. Each method is composed of one or more fence primitives and, when two or more primitives are tied together, all of them must succeed for the overall method to succeed.

So in our case, the order we want is;

IPMI -> PDUs

The reason is that when IPMI fencing succeeds, we can be very certain the node is truly fenced. When PDU fencing succeeds, it only confirms that the power outlets were cycled. If someone moved a node's power cables to another outlet, we'll get a false positive. On that topic, tie down the node's PSU cables to the PDU's cable tray when possible, clearly label the power cables and rap the fingers of anyone who might move them around.

The PDU fencing needs to be implemented using four steps;

PDU 1, outlet X -> off
PDU 2, outlet X -> off
- The power_wait="5" setting for the fence_n0X_pdu2_off primitives will cause a 5 second delay here, giving ample time to ensure the nodes lose power
PDU 1, outlet X -> on
PDU 2, outlet X -> on

This is to ensure that both outlets are off at the same time, ensuring that the node loses power. This works because fencing_topology acts serially.

Putting all this together, we issue these commands;

pcs stonith level add 1 an-c03n01.alteeve.ca fence_n01_ipmi
pcs stonith level add 1 an-c03n02.alteeve.ca fence_n02_ipmi

The 1 tells pacemaker that this is our highest priority fence method. We can see that this was set using pcs;

pcs stonith level

 Node: an-c03n01.alteeve.ca
  Level 1 - fence_n01_ipmi
 Node: an-c03n02.alteeve.ca
  Level 1 - fence_n02_ipmi

Now we'll tell pacemaker to use the PDUs as the second fence method. Here we tie together the two off calls and the two on calls into a single method.

pcs stonith level add 2 an-c03n01.alteeve.ca fence_n01_pdu1_off,fence_n01_pdu2_off,fence_n01_pdu1_on,fence_n01_pdu2_on
pcs stonith level add 2 an-c03n02.alteeve.ca fence_n02_pdu1_off,fence_n02_pdu2_off,fence_n02_pdu1_on,fence_n02_pdu2_on

Check again and we'll see that the new methods were added.

pcs stonith level

 Node: an-c03n01.alteeve.ca
  Level 1 - fence_n01_ipmi
  Level 2 - fence_n01_pdu1_off,fence_n01_pdu2_off,fence_n01_pdu1_on,fence_n01_pdu2_on
 Node: an-c03n02.alteeve.ca
  Level 1 - fence_n02_ipmi
  Level 2 - fence_n02_pdu1_off,fence_n02_pdu2_off,fence_n02_pdu1_on,fence_n02_pdu2_on
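The order of the devices inside the level 2 method matters: both "off" calls come first, then both "on" calls. A minimal sketch of building that list for either node follows. pdu_level_devices is a hypothetical helper written for this example, and the loop only prints the pcs commands rather than running them.

```bash
# Build the comma-separated device list for a node's PDU fence level.
# pdu_level_devices is a hypothetical helper used only for this sketch.
pdu_level_devices() {
    num="$1"    # node number, 1 or 2
    list=""
    for action in off on; do        # both "off" calls first, then both "on" calls
        for p in 1 2; do
            list="${list:+$list,}fence_n0${num}_pdu${p}_${action}"
        done
    done
    echo "$list"
}

for n in 1 2; do
    echo "pcs stonith level add 2 an-c03n0$n.alteeve.ca $(pdu_level_devices "$n")"
done
```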

For those of us who are XML fans, this is what the cib looks like now:

cat /var/lib/pacemaker/cib/cib.xml

<cib epoch="18" num_updates="0" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Thu Jul 18 13:15:53 2013" update-origin="an-c03n01.alteeve.ca" update-client="cibadmin" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="1">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.9-dde1c52"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
        <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="1" uname="an-c03n01.alteeve.ca"/>
      <node id="2" uname="an-c03n02.alteeve.ca"/>
    </nodes>
    <resources>
      <primitive class="stonith" id="fence_n01_ipmi" type="fence_ipmilan">
        <instance_attributes id="fence_n01_ipmi-instance_attributes">
          <nvpair id="fence_n01_ipmi-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="an-c03n01.alteeve.ca"/>
          <nvpair id="fence_n01_ipmi-instance_attributes-ipaddr" name="ipaddr" value="an-c03n01.ipmi"/>
          <nvpair id="fence_n01_ipmi-instance_attributes-action" name="action" value="reboot"/>
          <nvpair id="fence_n01_ipmi-instance_attributes-login" name="login" value="admin"/>
          <nvpair id="fence_n01_ipmi-instance_attributes-passwd" name="passwd" value="secret"/>
          <nvpair id="fence_n01_ipmi-instance_attributes-delay" name="delay" value="15"/>
        </instance_attributes>
        <operations>
          <op id="fence_n01_ipmi-monitor-interval-60s" interval="60s" name="monitor"/>
        </operations>
      </primitive>
      <primitive class="stonith" id="fence_n02_ipmi" type="fence_ipmilan">
        <instance_attributes id="fence_n02_ipmi-instance_attributes">
          <nvpair id="fence_n02_ipmi-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="an-c03n02.alteeve.ca"/>
          <nvpair id="fence_n02_ipmi-instance_attributes-ipaddr" name="ipaddr" value="an-c03n02.ipmi"/>
          <nvpair id="fence_n02_ipmi-instance_attributes-action" name="action" value="reboot"/>
          <nvpair id="fence_n02_ipmi-instance_attributes-login" name="login" value="admin"/>
          <nvpair id="fence_n02_ipmi-instance_attributes-passwd" name="passwd" value="secret"/>
        </instance_attributes>
        <operations>
          <op id="fence_n02_ipmi-monitor-interval-60s" interval="60s" name="monitor"/>
        </operations>
      </primitive>
      <primitive class="stonith" id="fence_n01_pdu1_off" type="fence_apc_snmp">
        <instance_attributes id="fence_n01_pdu1_off-instance_attributes">
          <nvpair id="fence_n01_pdu1_off-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="an-c03n01.alteeve.ca"/>
          <nvpair id="fence_n01_pdu1_off-instance_attributes-ipaddr" name="ipaddr" value="an-p01"/>
          <nvpair id="fence_n01_pdu1_off-instance_attributes-action" name="action" value="off"/>
          <nvpair id="fence_n01_pdu1_off-instance_attributes-port" name="port" value="1"/>
        </instance_attributes>
        <operations>
          <op id="fence_n01_pdu1_off-monitor-interval-60s" interval="60s" name="monitor"/>
        </operations>
      </primitive>
      <primitive class="stonith" id="fence_n01_pdu2_off" type="fence_apc_snmp">
        <instance_attributes id="fence_n01_pdu2_off-instance_attributes">
          <nvpair id="fence_n01_pdu2_off-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="an-c03n01.alteeve.ca"/>
          <nvpair id="fence_n01_pdu2_off-instance_attributes-ipaddr" name="ipaddr" value="an-p02"/>
          <nvpair id="fence_n01_pdu2_off-instance_attributes-action" name="action" value="off"/>
          <nvpair id="fence_n01_pdu2_off-instance_attributes-port" name="port" value="1"/>
          <nvpair id="fence_n01_pdu2_off-instance_attributes-power_wait" name="power_wait" value="5"/>
        </instance_attributes>
        <operations>
          <op id="fence_n01_pdu2_off-monitor-interval-60s" interval="60s" name="monitor"/>
        </operations>
      </primitive>
      <primitive class="stonith" id="fence_n01_pdu1_on" type="fence_apc_snmp">
        <instance_attributes id="fence_n01_pdu1_on-instance_attributes">
          <nvpair id="fence_n01_pdu1_on-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="an-c03n01.alteeve.ca"/>
          <nvpair id="fence_n01_pdu1_on-instance_attributes-ipaddr" name="ipaddr" value="an-p01"/>
          <nvpair id="fence_n01_pdu1_on-instance_attributes-action" name="action" value="on"/>
          <nvpair id="fence_n01_pdu1_on-instance_attributes-port" name="port" value="1"/>
        </instance_attributes>
        <operations>
          <op id="fence_n01_pdu1_on-monitor-interval-60s" interval="60s" name="monitor"/>
        </operations>
      </primitive>
      <primitive class="stonith" id="fence_n01_pdu2_on" type="fence_apc_snmp">
        <instance_attributes id="fence_n01_pdu2_on-instance_attributes">
          <nvpair id="fence_n01_pdu2_on-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="an-c03n01.alteeve.ca"/>
          <nvpair id="fence_n01_pdu2_on-instance_attributes-ipaddr" name="ipaddr" value="an-p02"/>
          <nvpair id="fence_n01_pdu2_on-instance_attributes-action" name="action" value="on"/>
          <nvpair id="fence_n01_pdu2_on-instance_attributes-port" name="port" value="1"/>
        </instance_attributes>
        <operations>
          <op id="fence_n01_pdu2_on-monitor-interval-60s" interval="60s" name="monitor"/>
        </operations>
      </primitive>
      <primitive class="stonith" id="fence_n02_pdu1_off" type="fence_apc_snmp">
        <instance_attributes id="fence_n02_pdu1_off-instance_attributes">
          <nvpair id="fence_n02_pdu1_off-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="an-c03n02.alteeve.ca"/>
          <nvpair id="fence_n02_pdu1_off-instance_attributes-ipaddr" name="ipaddr" value="an-p01"/>
          <nvpair id="fence_n02_pdu1_off-instance_attributes-action" name="action" value="off"/>
          <nvpair id="fence_n02_pdu1_off-instance_attributes-port" name="port" value="2"/>
        </instance_attributes>
        <operations>
          <op id="fence_n02_pdu1_off-monitor-interval-60s" interval="60s" name="monitor"/>
        </operations>
      </primitive>
      <primitive class="stonith" id="fence_n02_pdu2_off" type="fence_apc_snmp">
        <instance_attributes id="fence_n02_pdu2_off-instance_attributes">
          <nvpair id="fence_n02_pdu2_off-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="an-c03n02.alteeve.ca"/>
          <nvpair id="fence_n02_pdu2_off-instance_attributes-ipaddr" name="ipaddr" value="an-p02"/>
          <nvpair id="fence_n02_pdu2_off-instance_attributes-action" name="action" value="off"/>
          <nvpair id="fence_n02_pdu2_off-instance_attributes-port" name="port" value="2"/>
          <nvpair id="fence_n02_pdu2_off-instance_attributes-power_wait" name="power_wait" value="5"/>
        </instance_attributes>
        <operations>
          <op id="fence_n02_pdu2_off-monitor-interval-60s" interval="60s" name="monitor"/>
        </operations>
      </primitive>
      <primitive class="stonith" id="fence_n02_pdu1_on" type="fence_apc_snmp">
        <instance_attributes id="fence_n02_pdu1_on-instance_attributes">
          <nvpair id="fence_n02_pdu1_on-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="an-c03n02.alteeve.ca"/>
          <nvpair id="fence_n02_pdu1_on-instance_attributes-ipaddr" name="ipaddr" value="an-p01"/>
          <nvpair id="fence_n02_pdu1_on-instance_attributes-action" name="action" value="on"/>
          <nvpair id="fence_n02_pdu1_on-instance_attributes-port" name="port" value="2"/>
        </instance_attributes>
        <operations>
          <op id="fence_n02_pdu1_on-monitor-interval-60s" interval="60s" name="monitor"/>
        </operations>
      </primitive>
      <primitive class="stonith" id="fence_n02_pdu2_on" type="fence_apc_snmp">
        <instance_attributes id="fence_n02_pdu2_on-instance_attributes">
          <nvpair id="fence_n02_pdu2_on-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="an-c03n02.alteeve.ca"/>
          <nvpair id="fence_n02_pdu2_on-instance_attributes-ipaddr" name="ipaddr" value="an-p02"/>
          <nvpair id="fence_n02_pdu2_on-instance_attributes-action" name="action" value="on"/>
          <nvpair id="fence_n02_pdu2_on-instance_attributes-port" name="port" value="2"/>
        </instance_attributes>
        <operations>
          <op id="fence_n02_pdu2_on-monitor-interval-60s" interval="60s" name="monitor"/>
        </operations>
      </primitive>
    </resources>
    <constraints/>
    <fencing-topology>
      <fencing-level devices="fence_n01_ipmi" id="fl-an-c03n01.alteeve.ca-1" index="1" target="an-c03n01.alteeve.ca"/>
      <fencing-level devices="fence_n02_ipmi" id="fl-an-c03n02.alteeve.ca-1" index="1" target="an-c03n02.alteeve.ca"/>
      <fencing-level devices="fence_n01_pdu1_off,fence_n01_pdu2_off,fence_n01_pdu1_on,fence_n01_pdu2_on" id="fl-an-c03n01.alteeve.ca-2" index="2" target="an-c03n01.alteeve.ca"/>
      <fencing-level devices="fence_n02_pdu1_off,fence_n02_pdu2_off,fence_n02_pdu1_on,fence_n02_pdu2_on" id="fl-an-c03n02.alteeve.ca-2" index="2" target="an-c03n02.alteeve.ca"/>
    </fencing-topology>
  </configuration>
</cib>

Fencing using fence_virsh

Note: To write this section, I used two virtual machines called pcmk1 and pcmk2.

If you are trying to learn fencing using KVM or Xen virtual machines, you can use fence_virsh. You can also use fence_virtd, which many people recommend, but I have found it to be rather unreliable.

To use fence_virsh, first install it.

yum -y install fence-agents-virsh

Resolving Dependencies
--> Running transaction check
---> Package fence-agents-virsh.x86_64 0:4.0.3-1.fc19 will be installed
--> Processing Dependency: /usr/bin/virsh for package: fence-agents-virsh-4.0.3-1.fc19.x86_64
--> Running transaction check
---> Package libvirt-client.x86_64 0:1.0.5.5-1.fc19 will be installed
--> Processing Dependency: pm-utils for package: libvirt-client-1.0.5.5-1.fc19.x86_64
--> Processing Dependency: nc for package: libvirt-client-1.0.5.5-1.fc19.x86_64
--> Processing Dependency: libnuma.so.1(libnuma_1.2)(64bit) for package: libvirt-client-1.0.5.5-1.fc19.x86_64
--> Processing Dependency: libnuma.so.1(libnuma_1.1)(64bit) for package: libvirt-client-1.0.5.5-1.fc19.x86_64
--> Processing Dependency: gnutls-utils for package: libvirt-client-1.0.5.5-1.fc19.x86_64
--> Processing Dependency: cyrus-sasl-md5 for package: libvirt-client-1.0.5.5-1.fc19.x86_64
--> Processing Dependency: libyajl.so.2()(64bit) for package: libvirt-client-1.0.5.5-1.fc19.x86_64
--> Processing Dependency: libwsman_curl_client_transport.so.1()(64bit) for package: libvirt-client-1.0.5.5-1.fc19.x86_64
--> Processing Dependency: libwsman_client.so.1()(64bit) for package: libvirt-client-1.0.5.5-1.fc19.x86_64
--> Processing Dependency: libwsman.so.1()(64bit) for package: libvirt-client-1.0.5.5-1.fc19.x86_64
--> Processing Dependency: libnuma.so.1()(64bit) for package: libvirt-client-1.0.5.5-1.fc19.x86_64
--> Running transaction check
---> Package cyrus-sasl-md5.x86_64 0:2.1.26-9.fc19 will be installed
---> Package gnutls-utils.x86_64 0:3.1.11-1.fc19 will be installed
---> Package libwsman1.x86_64 0:2.3.6-6.fc19 will be installed
---> Package nmap-ncat.x86_64 2:6.40-2.fc19 will be installed
---> Package numactl-libs.x86_64 0:2.0.8-4.fc19 will be installed
---> Package pm-utils.x86_64 0:1.4.1-24.fc19 will be installed
---> Package yajl.x86_64 0:2.0.4-2.fc19 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

=========================================================================================================
 Package                        Arch               Version                     Repository           Size
=========================================================================================================
Installing:
 fence-agents-virsh             x86_64             4.0.3-1.fc19                updates             7.7 k
Installing for dependencies:
 cyrus-sasl-md5                 x86_64             2.1.26-9.fc19               updates              54 k
 gnutls-utils                   x86_64             3.1.11-1.fc19               fedora              261 k
 libvirt-client                 x86_64             1.0.5.5-1.fc19              updates             4.9 M
 libwsman1                      x86_64             2.3.6-6.fc19                fedora              120 k
 nmap-ncat                      x86_64             2:6.40-2.fc19               updates             198 k
 numactl-libs                   x86_64             2.0.8-4.fc19                fedora               28 k
 pm-utils                       x86_64             1.4.1-24.fc19               updates             139 k
 yajl                           x86_64             2.0.4-2.fc19                fedora               38 k

Transaction Summary
=========================================================================================================
Install  1 Package (+8 Dependent packages)

Total download size: 5.7 M
Installed size: 23 M
Downloading packages:
(1/9): fence-agents-virsh-4.0.3-1.fc19.x86_64.rpm                                 | 7.7 kB  00:00:01     
(2/9): cyrus-sasl-md5-2.1.26-9.fc19.x86_64.rpm                                    |  54 kB  00:00:01     
(3/9): libwsman1-2.3.6-6.fc19.x86_64.rpm                                          | 120 kB  00:00:01     
(4/9): numactl-libs-2.0.8-4.fc19.x86_64.rpm                                       |  28 kB  00:00:00     
(5/9): pm-utils-1.4.1-24.fc19.x86_64.rpm                                          | 139 kB  00:00:00     
(6/9): nmap-ncat-6.40-2.fc19.x86_64.rpm                                           | 198 kB  00:00:01     
(7/9): yajl-2.0.4-2.fc19.x86_64.rpm                                               |  38 kB  00:00:00     
(8/9): libvirt-client-1.0.5.5-1.fc19.x86_64.rpm                                   | 4.9 MB  00:00:12     
(9/9): gnutls-utils-3.1.11-1.fc19.x86_64.rpm                                      | 261 kB  00:01:28     
---------------------------------------------------------------------------------------------------------
Total                                                                     66 kB/s | 5.7 MB     01:28     
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Warning: RPMDB altered outside of yum.
  Installing : cyrus-sasl-md5-2.1.26-9.fc19.x86_64                                                   1/9 
  Installing : yajl-2.0.4-2.fc19.x86_64                                                              2/9 
  Installing : 2:nmap-ncat-6.40-2.fc19.x86_64                                                        3/9 
  Installing : libwsman1-2.3.6-6.fc19.x86_64                                                         4/9 
  Installing : numactl-libs-2.0.8-4.fc19.x86_64                                                      5/9 
  Installing : gnutls-utils-3.1.11-1.fc19.x86_64                                                     6/9 
  Installing : pm-utils-1.4.1-24.fc19.x86_64                                                         7/9 
  Installing : libvirt-client-1.0.5.5-1.fc19.x86_64                                                  8/9 
  Installing : fence-agents-virsh-4.0.3-1.fc19.x86_64                                                9/9 
  Verifying  : pm-utils-1.4.1-24.fc19.x86_64                                                         1/9 
  Verifying  : gnutls-utils-3.1.11-1.fc19.x86_64                                                     2/9 
  Verifying  : numactl-libs-2.0.8-4.fc19.x86_64                                                      3/9 
  Verifying  : libwsman1-2.3.6-6.fc19.x86_64                                                         4/9 
  Verifying  : 2:nmap-ncat-6.40-2.fc19.x86_64                                                        5/9 
  Verifying  : yajl-2.0.4-2.fc19.x86_64                                                              6/9 
  Verifying  : fence-agents-virsh-4.0.3-1.fc19.x86_64                                                7/9 
  Verifying  : libvirt-client-1.0.5.5-1.fc19.x86_64                                                  8/9 
  Verifying  : cyrus-sasl-md5-2.1.26-9.fc19.x86_64                                                   9/9 

Installed:
  fence-agents-virsh.x86_64 0:4.0.3-1.fc19                                                               

Dependency Installed:
  cyrus-sasl-md5.x86_64 0:2.1.26-9.fc19                gnutls-utils.x86_64 0:3.1.11-1.fc19              
  libvirt-client.x86_64 0:1.0.5.5-1.fc19               libwsman1.x86_64 0:2.3.6-6.fc19                  
  nmap-ncat.x86_64 2:6.40-2.fc19                       numactl-libs.x86_64 0:2.0.8-4.fc19               
  pm-utils.x86_64 0:1.4.1-24.fc19                      yajl.x86_64 0:2.0.4-2.fc19                       

Complete!

Now test it from the command line. To do this, we need to know a few things;

The VM host is at IP 192.168.122.1
The username and password (-l and -p respectively) are the credentials used to log into the VM host over SSH.
The name of the target VM, as shown by virsh, is the node (-n) value.

fence_virsh -a 192.168.122.1 -l root -p "secret" -n pcmk2 -o status

Status: ON

Excellent! Now to configure it in pacemaker;

pcs stonith create fence_pcmk1_virsh fence_virsh ipaddr="192.168.122.1" login="root" passwd="secret" port="pcmk1" delay="15" pcmk_host_list="pcmk1.alteeve.ca"
pcs stonith create fence_pcmk2_virsh fence_virsh ipaddr="192.168.122.1" login="root" passwd="secret" port="pcmk2" pcmk_host_list="pcmk2.alteeve.ca"
pcs status

Cluster name: an-pcmk-01
Last updated: Mon Sep 30 12:08:30 2013
Last change: Sun Sep 29 22:18:40 2013 via cibadmin on pcmk2.alteeve.ca
Stack: corosync
Current DC: pcmk1.alteeve.ca (1) - partition with quorum
Version: 1.1.10-1.fc19-368c726
2 Nodes configured
4 Resources configured


Online: [ pcmk1.alteeve.ca pcmk2.alteeve.ca ]

Full list of resources:

 fence_pcmk1_virsh	(stonith:fence_virsh):	Started pcmk1.alteeve.ca 
 fence_pcmk2_virsh	(stonith:fence_virsh):	Started pcmk2.alteeve.ca
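
To confirm from a script that every fence device is running, the resource lines can be filtered. The sketch below is an illustration only; `unstarted_stonith` is a made-up helper name, and it assumes the resource line layout shown in the `pcs status` output above (name, type, state, node).

```shell
#!/bin/bash
# Print the name of every stonith resource that is not "Started", reading
# 'pcs status' output on stdin. Exit status is 0 only if all are started.
# (hypothetical helper; assumes the resource line layout shown above)
unstarted_stonith() {
    awk '
        $2 ~ /^\(stonith:/ && $3 != "Started" { print $1; bad = 1 }
        END { exit bad + 0 }
    '
}

# Resource lines modelled on the 'pcs status' output above.
sample=' fence_pcmk1_virsh (stonith:fence_virsh): Started pcmk1.alteeve.ca
 fence_pcmk2_virsh (stonith:fence_virsh): Started pcmk2.alteeve.ca'

if printf '%s\n' "$sample" | unstarted_stonith; then
    echo "all fence devices are started"
else
    echo "some fence devices are not started"
fi
```

On a real node this would be run as `pcs status | unstarted_stonith`.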

Shared Storage

DRBD

We will use DRBD 8.4.

yum -y install drbd drbd-pacemaker drbd-bash-completion

Configure global_common.conf;

vim /etc/drbd.d/global_common.conf

# These options configure DRBD globally and set the default values for
# resources.
global {
	# This tells DRBD that you allow it to report this installation to 
	# LINBIT for statistical purposes. If you have privacy concerns, set
	# this to 'no'. The default is 'ask' which will prompt you each time
	# DRBD is updated. Set to 'yes' to allow it without being prompted.
	usage-count no;

	# minor-count dialog-refresh disable-ip-verification
}

common {
	handlers {
		pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
		pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
		local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
		# split-brain "/usr/lib/drbd/notify-split-brain.sh root";
		# out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
		# before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
		# after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
		
		# Hook into Pacemaker's fencing.
		fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
	}

	startup {
		# wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
	}

	options {
		# cpu-mask on-no-data-accessible
	}

	disk {
		# size max-bio-bvecs on-io-error fencing disk-barrier disk-flushes
		# disk-drain md-flushes resync-rate resync-after al-extents
		# c-plan-ahead c-delay-target c-fill-target c-max-rate
		# c-min-rate disk-timeout
		fencing resource-and-stonith;
	}

	net {
		# protocol timeout max-epoch-size max-buffers unplug-watermark
		# connect-int ping-int sndbuf-size rcvbuf-size ko-count
		# allow-two-primaries cram-hmac-alg shared-secret after-sb-0pri
		# after-sb-1pri after-sb-2pri always-asbp rr-conflict
		# ping-timeout data-integrity-alg tcp-cork on-congestion
		# congestion-fill congestion-extents csums-alg verify-alg
		# use-rle

		# Protocol "C" tells DRBD not to tell the operating system that
		# the write is complete until the data has reached persistent
		# storage on both nodes. This is the slowest option, but it is
		# also the only one that guarantees consistency between the
		# nodes. It is also required for dual-primary, which we will 
		# be using.
		protocol C;

		# Tell DRBD to allow dual-primary. This is needed to enable 
		# live-migration of our servers.
		allow-two-primaries yes;

		# This tells DRBD what to do in the case of a split-brain when
		# neither node was primary, when one node was primary and when
		# both nodes are primary. In our case, we'll be running
		# dual-primary, so we can not safely recover automatically. The
		# only safe option is for the nodes to disconnect from one
		# another and let a human decide which node to invalidate.
		after-sb-0pri discard-zero-changes;
		after-sb-1pri discard-secondary;
		after-sb-2pri disconnect;
	}
}
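
The `fencing resource-and-stonith;` policy and the `fence-peer` handler work as a pair, so it is worth checking that both are active and not commented out. The following is a rough sketch under that assumption; `check_drbd_fencing` is a hypothetical helper, not a DRBD tool, and it only does simple pattern matching on the file contents.

```shell
#!/bin/bash
# Read a DRBD configuration on stdin and confirm that both halves of the
# pacemaker fencing hook are active (not commented out).
# (hypothetical helper; simple text matching, not a real config parser)
check_drbd_fencing() {
    # Drop comments so that option names listed in '#' lines do not match.
    active=$(sed 's/#.*//')
    if ! printf '%s\n' "$active" | grep -Eq '^[[:space:]]*fencing[[:space:]]+resource-and-stonith;'; then
        echo "fencing policy is not set"
        return 1
    fi
    if ! printf '%s\n' "$active" | grep -Eq '^[[:space:]]*fence-peer[[:space:]]+"[^"]+";'; then
        echo "fence-peer handler is not set"
        return 1
    fi
    echo "fencing policy and fence-peer handler are both set"
}

# A trimmed-down copy of the relevant lines from the file above.
check_drbd_fencing <<'EOF'
common {
	handlers {
		fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
	}
	disk {
		# size max-bio-bvecs on-io-error fencing disk-barrier disk-flushes
		fencing resource-and-stonith;
	}
}
EOF
```

Against the real file, this would be run as `check_drbd_fencing < /etc/drbd.d/global_common.conf`.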

And now configure the first resource;

vim /etc/drbd.d/r0.res

# This is the first DRBD resource. It will store the shared file systems and
# the servers designed to run on node 01.
resource r0 {
	# These options here are common to both nodes. If for some reason you
	# need to set unique values per node, you can move these to the
	# 'on <name> { ... }' section.
	
	# This sets the device name of this DRBD resource.
	device /dev/drbd0;

	# This tells DRBD what the backing device is for this resource.
	disk /dev/sda5;

	# This controls the location of the metadata. When "internal" is used,
	# as we use here, a little space at the end of the backing device is
	# set aside (roughly 32 MB per 1 TB of raw storage). External metadata
	# can be used to put the metadata on another partition when converting
	# existing file systems to be DRBD backed, when there is no extra space
	# available for the metadata.
	meta-disk internal;

	# NOTE: This is not required or even recommended with pacemaker. Remove
	# 	this option as soon as pacemaker is set up.
	startup {
		# This tells DRBD to promote both nodes to 'primary' when this
		# resource starts. However, we will let pacemaker control this
		# so we comment it out, which tells DRBD to leave both nodes
		# as secondary when drbd starts.
		#become-primary-on both;
	}

	# NOTE: Later, make it an option in the dashboard to trigger a manual
	# 	verify and/or schedule periodic automatic runs
	net {
		# TODO: Test performance differences between sha1 and md5
		# This tells DRBD how to do a block-by-block verification of
		# the data stored on the backing devices. Any verification
		# failures will result in the affected block being marked
		# out-of-sync.
		verify-alg md5;

		# TODO: Test the performance hit of this being enabled.
		# This tells DRBD to generate a checksum for each transmitted
		# packet. If the received data doesn't generate the same
		# sum, a retransmit request is generated. This protects against
		# otherwise-undetected errors in transmission, like 
		# bit-flipping. See:
		# http://www.drbd.org/users-guide/s-integrity-check.html
		data-integrity-alg md5;
	}

	# WARNING: Confirm that these are safe when the controller's BBU is
	#          depleted/failed and the controller enters write-through 
	#          mode.
	disk {
		# TODO: Test the real-world performance differences gained with
		#       these options.
		# This tells DRBD not to bypass the write-back caching on the
		# RAID controller. Normally, DRBD forces the data to be flushed
		# to disk, rather than allowing the write-back caching to
		# handle it. Disabling flushes is normally dangerous, but with
		# BBU-backed caching, it is safe. The first option disables
		# disk flushing and the second disables metadata flushes.
		disk-flushes no;
		md-flushes no;
	}

	# This sets up the resource on node 01. The name used below must be the
	# name returned by "uname -n".
	on an-c03n01.alteeve.ca {
		# This is the address and port to use for DRBD traffic on this
		# node. Multiple resources can use the same IP but the ports
		# must differ. By convention, the first resource uses 7788, the
		# second uses 7789 and so on, incrementing by one for each
		# additional resource. 
		address 10.10.30.1:7788;
	}
	on an-c03n02.alteeve.ca {
		address 10.10.30.2:7788;
	}
}
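
Two rules of thumb from the comments above lend themselves to quick arithmetic: the port convention (7788 for the first resource, incrementing by one per resource) and the metadata estimate (roughly 32 MB per 1 TB). The sketch below is only an illustration; the function names are invented, 1 TB is taken as 1024 GB, and the metadata figure is the rough estimate from the comment, not DRBD's exact formula.

```shell
#!/bin/bash
# Port for the Nth DRBD resource (r0, r1, ...), following the convention
# described in the r0.res comments: r0 -> 7788, r1 -> 7789, and so on.
# (hypothetical helper name)
drbd_port() {
    echo $(( 7788 + $1 ))
}

# Rough internal metadata size in MB for a backing device of the given size
# in GB, using the "roughly 32 MB per 1 TB" rule of thumb. Assumes
# 1 TB = 1024 GB. This is an estimate only, not DRBD's exact calculation.
drbd_meta_mb() {
    echo $(( ($1 * 32 + 1023) / 1024 ))
}

drbd_port 0         # port for r0
drbd_port 1         # port for r1
drbd_meta_mb 2048   # estimate for a 2 TB backing device
```

Run as written, this prints 7788, 7789 and 64, one per line.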

Prevent drbd from starting on boot.

systemctl disable drbd.service

drbd.service is not a native service, redirecting to /sbin/chkconfig.
Executing /sbin/chkconfig drbd off

Load the DRBD kernel module;

modprobe drbd

Now check the config;

drbdadm dump

  --==  Thank you for participating in the global usage survey  ==--
The server's response is:

you are the 69th user to install this version
/etc/drbd.d/r0.res:3: in resource r0:
become-primary-on is set to both, but allow-two-primaries is not set.

Ignore that error. It has been reported and does not affect operation.

Create the DRBD metadata;

drbdadm create-md r0

Writing meta data...
initializing activity log
NOT initializing bitmap
New drbd meta data block successfully created.
success

Start the DRBD resource on both nodes;

drbdadm up r0

Once /proc/drbd shows both nodes connected, force one node to primary and it will sync its data over to the second node.

drbdadm primary --force r0

You should see the resource syncing now. Promote the second node as well, so that both nodes are primary;

drbdadm primary r0
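
The "both nodes connected" check mentioned above can be scripted by looking at the `cs:` (connection state) field in /proc/drbd. This is a minimal sketch; `drbd_all_connected` is a made-up helper name, and the sample line is illustrative of the DRBD 8.4 status format, not captured from these nodes.

```shell
#!/bin/bash
# Read /proc/drbd-style text on stdin. Exit 0 only if at least one resource
# is listed and every resource has connection state "Connected".
# (hypothetical helper; assumes the DRBD 8.4 /proc/drbd layout)
drbd_all_connected() {
    awk '
        / cs:/ { seen = 1; if ($0 !~ / cs:Connected /) bad = 1 }
        END    { if (seen && !bad) exit 0; exit 1 }
    '
}

# Illustrative status line for a freshly connected, not yet synced resource.
sample=' 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----'

if printf '%s\n' "$sample" | drbd_all_connected; then
    echo "connected"
else
    echo "not connected yet"
fi

# On a node you might poll until the peer connects, for example:
#   until drbd_all_connected < /proc/drbd; do sleep 1; done
```

Note that once the initial sync starts, the connection state changes to a sync state instead of "Connected", so this check is meant for the step before forcing a node to primary.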

DLM, Clustered LVM and GFS2

Install DLM, clustered LVM and GFS2;

yum -y install dlm dlm-lib lvm2-cluster gfs2-utils

Prevent dlm and clvmd from starting on boot.

systemctl disable clvmd.service

clvmd.service is not a native service, redirecting to /sbin/chkconfig.
Executing /sbin/chkconfig clvmd off

systemctl disable dlm.service

rm '/etc/systemd/system/multi-user.target.wants/dlm.service'

Edit lvm.conf;

diff -u /etc/lvm/lvm.conf.orig /etc/lvm/lvm.conf

--- /etc/lvm/lvm.conf.orig	2013-07-08 16:38:45.603780083 -0500
+++ /etc/lvm/lvm.conf	2013-07-08 16:47:34.434591848 -0500
@@ -65,7 +65,7 @@
 
 
     # By default we accept every block device:
-    filter = [ "a/.*/" ]
+    filter = [ "a|/dev/drbd*|", "r/.*/" ]
 
     # Exclude the cdrom drive
     # filter = [ "r|/dev/cdrom|" ]
@@ -405,7 +405,7 @@
     # Type 3 uses built-in clustered locking.
     # Type 4 uses read-only locking which forbids any operations that might 
     # change metadata.
-    locking_type = 1
+    locking_type = 3
 
     # Set to 0 to fail when a lock request cannot be satisfied immediately.
     wait_for_locks = 1
@@ -421,7 +421,7 @@
     # to 1 an attempt will be made to use local file-based locking (type 1).
     # If this succeeds, only commands against local volume groups will proceed.
     # Volume Groups marked as clustered will be ignored.
-    fallback_to_local_locking = 1
+    fallback_to_local_locking = 0
 
     # Local non-LV directory that holds file-based locks while commands are
     # in progress.  A directory like /tmp that may get wiped on reboot is OK.
@@ -508,7 +508,7 @@
     #
     # If lvmetad has been running while use_lvmetad was 0, it MUST be stopped
     # before changing use_lvmetad to 1 and started again afterwards.
-    use_lvmetad = 1
+    use_lvmetad = 0
 
     # Full path of the utility called to check that a thin metadata device
     # is in a state that allows it to be used.
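
Because the same three settings must be changed on both nodes, a quick check can catch a node that was missed. The sketch below is an illustration only; `check_lvm_conf` is an invented name, and it matches the exact `name = value` spacing used in the diff above, so a line written as `locking_type=3` would be reported as missing.

```shell
#!/bin/bash
# Read lvm.conf text on stdin and report whether the three clustered-LVM
# settings from the diff above are active (uncommented).
# (hypothetical helper; matches the 'name = value' spacing literally)
check_lvm_conf() {
    conf=$(cat)
    rc=0
    for want in 'locking_type = 3' 'fallback_to_local_locking = 0' 'use_lvmetad = 0'; do
        if printf '%s\n' "$conf" | grep -Eq "^[[:space:]]*${want}[[:space:]]*\$"; then
            echo "ok:      $want"
        else
            echo "missing: $want"
            rc=1
        fi
    done
    return $rc
}

# The three lines as they appear after the edit.
check_lvm_conf <<'EOF'
    locking_type = 3
    fallback_to_local_locking = 0
    use_lvmetad = 0
EOF
```

On each node, this would be run as `check_lvm_conf < /etc/lvm/lvm.conf`.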

Disable lvmetad as it's not cluster-aware.

systemctl disable lvm2-lvmetad.service
systemctl disable lvm2-lvmetad.socket
systemctl stop lvm2-lvmetad.service

Note: This will be moved to pacemaker

Start DLM and clvmd;

systemctl start dlm.service
systemctl start clvmd.service

Create the PV, VG and the /shared LV;

pvcreate /dev/drbd0 
vgcreate an-c03n01_vg0 /dev/drbd0
lvcreate -L 40G -n shared an-c03n01_vg0

Format /dev/an-c03n01_vg0/shared as GFS2;

mkfs.gfs2 -j 2 -p lock_dlm -t an-cluster-03:shared /dev/an-c03n01_vg0/shared

/dev/an-c03n01_vg0/shared is a symlink to /dev/dm-0
This will destroy any data on /dev/dm-0.
It appears to contain: data
Are you sure you want to proceed? [y/n]y
Device:                    /dev/an-c03n01_vg0/shared
Blocksize:                 4096
Device Size                40.00 GB (10485760 blocks)
Filesystem Size:           40.00 GB (10485758 blocks)
Journals:                  2
Resource Groups:           160
Locking Protocol:          "lock_dlm"
Lock Table:                "an-cluster-03:shared"
UUID:                      e96dbbec-add4-c291-083b-381a866a773d
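
The first half of the `-t` lock table value has to match the cluster's name, so it can be safer to read the name from the cluster configuration than to type it. The following is a sketch under the assumption that corosync.conf carries a `cluster_name:` line in its `totem` section; the helper names are made up and the sample configuration is illustrative.

```shell
#!/bin/bash
# Print the cluster name found in corosync.conf-style text on stdin.
# (hypothetical helper; assumes a 'cluster_name: <name>' line exists)
cluster_name() {
    awk -F': *' '/^[ \t]*cluster_name:/ { print $2; exit }'
}

# Build the GFS2 lock table value "<cluster name>:<file system name>".
# $1 = file system name; corosync.conf text is read from stdin.
gfs2_lock_table() {
    name=$(cluster_name)
    [ -n "$name" ] || { echo "cluster_name not found" >&2; return 1; }
    echo "${name}:$1"
}

# Illustrative configuration fragment, not copied from these nodes.
gfs2_lock_table shared <<'EOF'
totem {
    version: 2
    cluster_name: an-cluster-03
}
EOF
```

With the sample input this prints `an-cluster-03:shared`, which is the value passed to `-t` above. On a node, the real file could be used with `gfs2_lock_table shared < /etc/corosync/corosync.conf`.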

Create the mount point and mount the new file system on both nodes;

mkdir /shared
mount /dev/an-c03n01_vg0/shared /shared

Odds and Sods

This is a section for random notes. The stuff here will be integrated into the finished tutorial or removed.

Determine multicast Address

Useful if you need to ensure that your switch has persistent multicast addresses set.

corosync-cmapctl | grep mcastaddr

totem.interface.0.mcastaddr (str) = 239.192.122.199
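
If the address is needed in a script, for example to compare it against the switch configuration, it can be pulled out of that line. This is a small sketch with invented helper names, assuming the `key (type) = value` layout shown above.

```shell
#!/bin/bash
# Extract multicast addresses from 'corosync-cmapctl' output on stdin.
# (hypothetical helper; assumes the 'key (type) = value' layout shown above)
mcast_addrs() {
    awk -F' = ' '/\.mcastaddr / { print $2 }'
}

# Return 0 if the IPv4 address falls in the multicast range 224.0.0.0/4,
# judged by its first octet only.
is_multicast() {
    first=${1%%.*}
    [ "$first" -ge 224 ] 2>/dev/null && [ "$first" -le 239 ]
}

addr=$(echo 'totem.interface.0.mcastaddr (str) = 239.192.122.199' | mcast_addrs)
if is_multicast "$addr"; then
    echo "$addr is a multicast address"
fi
```

On a running node, the input would come from `corosync-cmapctl | mcast_addrs`.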

Notes

Pacemaker Logging

Thanks

This list will certainly grow as this tutorial progresses;

Olivier Allart, RCHE for doing a lot of the heavy lifting on the fencing_topology configuration.

Any questions, feedback, advice, complaints or meanderings are welcome.
© Alteeve's Niche! Inc. 1997-2024		Anvil! "Intelligent Availability®" Platform
`legal stuff: All info is provided "As-Is". Do not use anything here unless you are willing and able to take responsibility for your own actions.`

@@ Line 64: / Line 64: @@
 <source lang="bash">
 hostnamectl set-hostname an-c03n01.alteeve.ca --static
-hostnamectl set-hostname --pretty "Alteeve's Niche! - Cluster 01, Node 01"
+hostnamectl set-hostname --pretty "Alteeve's Niche! - Cluster 03, Node 01"
 </source>

Anvil! Tutorial 3: Difference between revisions