AN!Cluster Tutorial 3 (using crm)


Warning: This tutorial is incomplete, flawed and generally sucks at this time. Do not follow this and expect anything to work. In large part, it's a dumping ground for notes and little else. This warning will be removed when the tutorial is completed.
Warning: This tutorial is built on a guess of what Red Hat's Enterprise Linux 7 will offer, based on what the author sees happening in Fedora upstream. Red Hat never confirms what a future release will contain until it is actually released. As such, this tutorial may turn out to be inappropriate for the final release of RHEL 7. In such a case, the warning above will remain in place until the tutorial is updated to reflect the final release.
Note: This is the crmsh version of the main tutorial, AN!Cluster Tutorial 3.

This is the third AN!Cluster tutorial built on Red Hat's Enterprise Linux 7. It improves on the RHEL 5, RHCS stable 2 and RHEL 6, RHCS stable 3 tutorials.

As with the previous tutorials, the end goal of this tutorial is a 2-node cluster providing a platform for high-availability virtual servers. Its design attempts to remove all single points of failure from the system. Power and networking are made fully redundant in this version, and the node failures that would lead to a service interruption are kept to a minimum. This tutorial also covers the AN!Utilities; AN!Cluster Dashboard, AN!Cluster Monitor and AN!Safe Cluster Shutdown.

As with the previous tutorial, KVM will be the hypervisor used to host the virtual machines. The old cman and rgmanager tools are replaced by pacemaker for resource management.

Before We Begin

This tutorial does not require prior cluster experience, but it does expect familiarity with Linux and a low-intermediate understanding of networking. Where possible, steps are explained in detail and rationale is provided for why certain decisions are made.

For those with cluster experience;

Please be careful not to skip too much. There are some major and some subtle changes from previous tutorials.

OS Setup

Warning: I used Fedora 18 at this point, obviously things will change, possibly a lot, once RHEL 7 is released.

Install

Not all of these are required, but most are used at one point or another in this tutorial.

yum install bridge-utils corosync crmsh gpm man net-tools network ntp pacemaker rsync syslinux vim wget

If you want to use your mouse at the node's terminal, run the following;

systemctl enable gpm.service
systemctl start gpm.service

Make the Network Configuration Static

We don't want NetworkManager in our cluster as it tries to dynamically manage the network and we need our network to be static.

yum remove NetworkManager
Note: This assumes that systemd will be used in RHEL7. This may not be the case come release day.

Now to ensure the static network service starts on boot.

systemctl enable network.service

Setting the Hostname

Fedora 18 is very different from EL6.

Note: The '--pretty' line currently doesn't work as there is a bug (rhbz#895299) with single-quotes.
Note: The '--static' option is currently needed to prevent the '.' from being removed. See this bug (rhbz#896756).

Use a format that works for you. For the tutorial, node names are based on the following;

  • A two-letter prefix identifying the company/user (an, for "Alteeve's Niche!")
  • A sequential cluster ID number in the form of cXX (c01 for "Cluster 01", c02 for Cluster 02, etc)
  • A sequential node ID number in the form of nYY

In my case, this is my third cluster and I use the company prefix an, so my two nodes will be;

  • an-c03n01 - node 1
  • an-c03n02 - node 2

Folks who've read my earlier tutorials will note that this is a departure in naming. I find this method spans and scales much better. Further, it is simply required in order to use the AN!Cluster Dashboard.

hostnamectl set-hostname an-c03n01.alteeve.ca --static
hostnamectl set-hostname --pretty "Alteeve's Niche! - Cluster 01, Node 01"

If you want the new host name to take effect immediately, you can use the traditional hostname command:

hostname an-c03n01.alteeve.ca

Alternatively

If you have trouble with those commands, you can directly edit the files that contain the host names.

The host name is stored in /etc/hostname:

echo an-c03n01.alteeve.ca > /etc/hostname 
cat /etc/hostname
an-c03n01.alteeve.ca

The "pretty" host name is stored in /etc/machine-info as the unquoted value for the PRETTY_HOSTNAME value.

vim /etc/machine-info
PRETTY_HOSTNAME=Alteeves Niche! - Cluster 01, Node 01

If you can't get the hostname command to work for some reason, you can reboot to have the system read the new values.
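Either way, you can confirm that the names were stored properly by asking systemd's hostname tool to print them back;

hostnamectl status

The 'Static hostname' and 'Pretty hostname' lines should match the values set above.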

Optional - Video Problems

On my servers, Fedora 18 doesn't detect or use the video card properly. To resolve this, I need to add nomodeset to the kernel line when installing and again after the install is complete.

Once installed

Edit the /etc/default/grub and append nomodeset to the end of the GRUB_CMDLINE_LINUX variable.

vim /etc/default/grub
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_CMDLINE_LINUX="rd.md=0 rd.lvm=0 rd.dm=0 $([ -x /usr/sbin/rhcrashkernel-param ] && /usr/sbin/rhcrashkernel-param || :) rd.luks=0 vconsole.keymap=us nomodeset"
GRUB_DISABLE_RECOVERY="true"
GRUB_THEME="/boot/grub2/themes/system/theme.txt"

Save that, and then rewrite the grub2 configuration file.

grub2-mkconfig -o /boot/grub2/grub.cfg

Next time you reboot, you should get a stock 80x25 character display. It's not much, but it will work on esoteric video cards or weird monitors.
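After that reboot, you can confirm that the option made it onto the running kernel's command line;

cat /proc/cmdline

The nomodeset flag should appear in the output.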

What Security?

This section will be re-added at the end. For now;

setenforce 0
sed -i 's/SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
systemctl disable firewalld.service
systemctl stop firewalld.service
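A quick sanity check that neither will get in the way later;

getenforce
systemctl is-active firewalld.service

getenforce should report 'Permissive' (or 'Disabled' after a reboot) and firewalld should report 'inactive'.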

Network

We want static, named network devices. Once the interfaces have been renamed, use these configuration files;

Build the bridge;

vim /etc/sysconfig/network-scripts/ifcfg-ifn-vbr1
# Internet-Facing Network - Bridge
DEVICE="ifn-vbr1"
TYPE="Bridge"
BOOTPROTO="none"
IPADDR="10.255.10.1"
NETMASK="255.255.0.0"
GATEWAY="10.255.255.254"
DNS1="8.8.8.8"
DNS2="8.8.4.4"
DEFROUTE="yes"

Now build the bonds;

vim /etc/sysconfig/network-scripts/ifcfg-ifn-bond1
# Internet-Facing Network - Bond
DEVICE="ifn-bond1"
BRIDGE="ifn-vbr1"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=ifn1"
vim /etc/sysconfig/network-scripts/ifcfg-sn-bond1
# Storage Network - Bond
DEVICE="sn-bond1"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=sn1"
IPADDR="10.10.10.1"
NETMASK="255.255.0.0"
vim /etc/sysconfig/network-scripts/ifcfg-bcn-bond1
# Back-Channel Network - Bond
DEVICE="bcn-bond1"
BOOTPROTO="none"
NM_CONTROLLED="no"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=bcn1"
IPADDR="10.20.10.1"
NETMASK="255.255.0.0"

Now tell the interfaces to be slaves to their bonds;

Internet-Facing Network;

vim /etc/sysconfig/network-scripts/ifcfg-ifn1
# Internet-Facing Network - Link 1
DEVICE="ifn1"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
SLAVE="yes"
MASTER="ifn-bond1"
vim /etc/sysconfig/network-scripts/ifcfg-ifn2
# Internet-Facing Network - Link 2
DEVICE="ifn2"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
SLAVE="yes"
MASTER="ifn-bond1"

Storage Network;

vim /etc/sysconfig/network-scripts/ifcfg-sn1
# Storage Network - Link 1
DEVICE="sn1"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
SLAVE="yes"
MASTER="sn-bond1"
vim /etc/sysconfig/network-scripts/ifcfg-sn2
# Storage Network - Link 2
DEVICE="sn2"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
SLAVE="yes"
MASTER="sn-bond1"

Back-Channel Network

vim /etc/sysconfig/network-scripts/ifcfg-bcn1
# Back-Channel Network - Link 1
DEVICE="bcn1"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
SLAVE="yes"
MASTER="bcn-bond1"
vim /etc/sysconfig/network-scripts/ifcfg-bcn2
# Back-Channel Network - Link 2
DEVICE="bcn2"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
SLAVE="yes"
MASTER="bcn-bond1"

Now restart the network and confirm that the bonds and bridge are up; one way to check is shown below. Once they look good, you are ready to proceed.
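The bonding driver reports its state under /proc and brctl comes from the bridge-utils package installed earlier, so these checks should work as-is; adjust the device names if yours differ.

systemctl restart network.service
cat /proc/net/bonding/ifn-bond1
brctl show ifn-vbr1

The bonding file should list both slave interfaces with their link status, and brctl should show ifn-bond1 attached to the ifn-vbr1 bridge. Repeat the cat for sn-bond1 and bcn-bond1.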

Setup The hosts File

You can use DNS if you prefer. For now, let's use /etc/hosts for node name resolution.

vim /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

# AN!Cluster 03, Node 01
10.255.10.1     an-c03n01.ifn
10.10.10.1      an-c03n01.sn
10.20.10.1      an-c03n01.bcn an-c03n01 an-c03n01.alteeve.ca
10.20.11.1      an-c03n01.ipmi

# AN!Cluster 03, Node 02
10.255.10.2     an-c03n02.ifn
10.10.10.2      an-c03n02.sn
10.20.10.2      an-c03n02.bcn an-c03n02 an-c03n02.alteeve.ca
10.20.11.2      an-c03n02.ipmi

# Foundation Pack
10.20.2.7       an-p03 an-p03.alteeve.ca
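A quick way to make sure the names resolve as expected is getent, which uses the same resolver path the cluster software will;

getent hosts an-c03n01.bcn
getent hosts an-c03n02.ifn

Each call should print the IP address from the file above, followed by the name.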

Setup SSH

Same as before.

Populating And Pushing ~/.ssh/known_hosts

Same as before.

ssh root@an-c03n01.alteeve.ca
The authenticity of host 'an-c03n01.alteeve.ca (10.20.30.1)' can't be established.
RSA key fingerprint is 7b:dd:0d:aa:c5:f5:9e:a6:b6:4d:40:69:d6:80:4d:09.
Are you sure you want to continue connecting (yes/no)?

Type yes

Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'an-c03n01.alteeve.ca,10.20.30.1' (RSA) to the list of known hosts.
Last login: Thu Feb 14 15:18:33 2013 from 10.20.5.100

You will now be logged into the an-c03n01 node, which in this case is the same machine on a new session in the same terminal.

[root@an-c03n01 ~]#

You can logout by typing exit.

exit
logout
Connection to an-c03n01.alteeve.ca closed.

Now we have to repeat the steps for all the other variations on the names of the hosts. This is annoying and tedious, sorry.

ssh root@an-c03n01
ssh root@an-c03n01.bcn
ssh root@an-c03n01.sn
ssh root@an-c03n01.ifn
ssh root@an-c03n02.alteeve.ca
ssh root@an-c03n02
ssh root@an-c03n02.bcn
ssh root@an-c03n02.sn
ssh root@an-c03n02.ifn

Your ~/.ssh/known_hosts file will now be populated with both nodes' ssh fingerprints. Copy it over to the second node to save all that typing a second time.

rsync -av ~/.ssh/known_hosts root@an-c03n02:/root/.ssh/
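If you would rather not answer the prompt for every variation, something like ssh-keyscan can populate known_hosts in one pass. This is only a sketch; it skips the manual fingerprint check, so use it only on a network you trust.

for node in an-c03n01 an-c03n02
do
    for suffix in "" .alteeve.ca .bcn .sn .ifn
    do
        ssh-keyscan ${node}${suffix} >> ~/.ssh/known_hosts
    done
done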

Keeping Time in Sync

It's not as critical as it used to be to keep the clocks on the nodes in sync, but it's still a good idea.

systemctl start ntpd.service
systemctl enable ntpd.service
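To confirm that the daemon has found its time sources, ntpq can list the peers it is talking to;

ntpq -p

After a few minutes, one of the listed servers should gain an asterisk ('*') beside it, marking the clock the node is synchronized against.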

Configuring the Cluster

Now we're getting down to business!

For this section, we will be working on an-c03n01 and using ssh to perform tasks on an-c03n02.

Note: TODO: explain what this is and how it works.

Using crmsh

Note: Most of this section comes more or less verbatim from the main Clusters from Scratch (crmsh) tutorial.

We will use crmsh to configure the cluster.

Choosing a Multicast Address

More on Multicast.

Pick a Multicast address and ensure that it doesn't conflict with any other multicast group used on your subnet. Pay particular attention to the ambiguous bits and how multicast IPs overlap with one another.

For this tutorial, we will use multicast group 239.255.1.1 and port 5405.

Configuring Corosync

First, copy the example corosync.conf.

cp /etc/corosync/corosync.conf.example /etc/corosync/corosync.conf

Now let's edit it.

The three main values to check/change are;

  • mcastaddr; Set the multicast address to use in this cluster. We will use 239.255.1.1, which is the one already set in the example config.
  • mcastport; Set the multicast port to use in this cluster. We will use 5405, which is the one already set in the example config.
  • bindnetaddr; This tells corosync which network to use. Our BCN is 10.20.0.0/255.255.0.0, so we will set this to 10.20.0.0 (see the example just below this list for deriving the network address). This is the only value we need to change from the default configuration.
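If you are not sure of the network address for your BCN, the ipcalc tool that ships with initscripts can derive it from any host address and netmask on that subnet (shown here with this tutorial's BCN values; the output format may differ between ipcalc versions);

ipcalc -n 10.20.10.1 255.255.0.0

It should report NETWORK=10.20.0.0, which is the value to put in bindnetaddr.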

We also need to enable the quorum provider corosync_votequorum and tell it how many votes to expect (one vote per node, two nodes).

vim /etc/corosync/corosync.conf
# Please read the corosync.conf.5 manual page
totem {
	version: 2

	# crypto_cipher and crypto_hash: Used for mutual node authentication.
	# If you choose to enable this, then do remember to create a shared
	# secret with "corosync-keygen".
	# enabling crypto_cipher, requires also enabling of crypto_hash.
	crypto_cipher: none
	crypto_hash: none

	# interface: define at least one interface to communicate
	# over. If you define more than one interface stanza, you must
	# also set rrp_mode.
	interface {
                # Rings must be consecutively numbered, starting at 0.
		ringnumber: 0
		# This is normally the *network* address of the
		# interface to bind to. This ensures that you can use
		# identical instances of this configuration file
		# across all your cluster nodes, without having to
		# modify this option.
		bindnetaddr: 10.20.0.0
		# However, if you have multiple physical network
		# interfaces configured for the same subnet, then the
		# network address alone is not sufficient to identify
		# the interface Corosync should bind to. In that case,
		# configure the *host* address of the interface
		# instead:
		# bindnetaddr: 192.168.1.1
		# When selecting a multicast address, consider RFC
		# 2365 (which, among other things, specifies that
		# 239.255.x.x addresses are left to the discretion of
		# the network administrator). Do not reuse multicast
		# addresses across multiple Corosync clusters sharing
		# the same network.
		mcastaddr: 239.255.1.1
		# Corosync uses the port you specify here for UDP
		# messaging, and also the immediately preceding
		# port. Thus if you set this to 5405, Corosync sends
		# messages over UDP ports 5405 and 5404.
		mcastport: 5405
		# Time-to-live for cluster communication packets. The
		# number of hops (routers) that this ring will allow
		# itself to pass. Note that multicast routing must be
		# specifically enabled on most network routers.
		ttl: 1
	}
}

logging {
	# Log the source file and line where messages are being
	# generated. When in doubt, leave off. Potentially useful for
	# debugging.
	fileline: off
	# Log to standard error. When in doubt, set to no. Useful when
	# running in the foreground (when invoking "corosync -f")
	to_stderr: no
	# Log to a log file. When set to "no", the "logfile" option
	# must not be set.
	to_logfile: yes
	logfile: /var/log/cluster/corosync.log
	# Log to the system log daemon. When in doubt, set to yes.
	to_syslog: yes
	# Log debug messages (very verbose). When in doubt, leave off.
	debug: off
	# Log messages with time stamps. When in doubt, set to on
	# (unless you are only logging to syslog, where double
	# timestamps can be annoying).
	timestamp: on
	logger_subsys {
		subsys: QUORUM
		debug: off
	}
}

quorum {
	# Enable and configure quorum subsystem (default: off)
	# see also corosync.conf.5 and votequorum.5
	provider: corosync_votequorum
	expected_votes: 2
}

Copy the corosync.conf file to the other node.

rsync -av /etc/corosync/corosync.conf root@an-c03n02:/etc/corosync/
sending incremental file list
corosync.conf

sent 2978 bytes  received 31 bytes  6018.00 bytes/sec
total size is 2897  speedup is 0.96

Now start the corosync service on both nodes to start the cluster's communication and membership for the first time!

On both nodes;

systemctl start corosync.service

In syslog, you should see messages like this;

Feb 17 15:59:43 an-c03n01 systemd[1]: Starting Corosync Cluster Engine...
Feb 17 15:59:43 an-c03n01 corosync[1306]:   [MAIN  ] Corosync Cluster Engine ('2.3.0'): started and ready to provide service.
Feb 17 15:59:43 an-c03n01 corosync[1306]:   [MAIN  ] Corosync built-in features: dbus rdma systemd xmlconf snmp pie relro bindnow
Feb 17 15:59:43 an-c03n01 corosync[1307]:   [TOTEM ] Initializing transport (UDP/IP Multicast).
Feb 17 15:59:43 an-c03n01 corosync[1307]:   [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
Feb 17 15:59:44 an-c03n01 corosync[1307]:   [TOTEM ] The network interface [10.20.30.1] is now up.
Feb 17 15:59:44 an-c03n01 corosync[1307]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Feb 17 15:59:44 an-c03n01 corosync[1307]:   [QB    ] server name: cmap
Feb 17 15:59:44 an-c03n01 corosync[1307]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Feb 17 15:59:44 an-c03n01 corosync[1307]:   [QB    ] server name: cfg
Feb 17 15:59:44 an-c03n01 corosync[1307]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Feb 17 15:59:44 an-c03n01 corosync[1307]:   [QB    ] server name: cpg
Feb 17 15:59:44 an-c03n01 corosync[1307]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Feb 17 15:59:44 an-c03n01 corosync[1307]:   [QUORUM] Using quorum provider corosync_votequorum
Feb 17 15:59:44 an-c03n01 corosync[1307]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Feb 17 15:59:44 an-c03n01 corosync[1307]:   [QB    ] server name: votequorum
Feb 17 15:59:44 an-c03n01 corosync[1307]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Feb 17 15:59:44 an-c03n01 corosync[1307]:   [QB    ] server name: quorum
Feb 17 15:59:44 an-c03n01 corosync[1307]:   [QUORUM] Members[1]: 18748426
Feb 17 15:59:44 an-c03n01 corosync[1307]:   [TOTEM ] A processor joined or left the membership and a new membership (10.20.30.1:100) was formed.
Feb 17 15:59:44 an-c03n01 corosync[1307]:   [QUORUM] Members[2]: 18748426 35525642
Feb 17 15:59:44 an-c03n01 corosync[1307]:   [TOTEM ] A processor joined or left the membership and a new membership (10.20.30.1:108) was formed.
Feb 17 15:59:44 an-c03n01 corosync[1307]:   [QUORUM] This node is within the primary component and will provide service.
Feb 17 15:59:44 an-c03n01 corosync[1307]:   [QUORUM] Members[2]: 18748426 35525642
Feb 17 15:59:44 an-c03n01 corosync[1307]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 17 15:59:44 an-c03n01 corosync[1300]: Starting Corosync Cluster Engine (corosync): [  OK  ]
Feb 17 15:59:44 an-c03n01 systemd[1]: Started Corosync Cluster Engine.

Once corosync is running on both nodes, you should be able to see that the cluster has formed.

corosync-cfgtool -s
Printing ring status.
Local node ID 18748426
RING ID 0
	id	= 10.20.30.1
	status	= ring 0 active with no faults

On the other node;

corosync-cfgtool -s
Printing ring status.
Local node ID 35525642
RING ID 0
	id	= 10.20.30.2
	status	= ring 0 active with no faults

From either node, we can verify that both nodes are in the same cluster;

corosync-cmapctl  | grep members
runtime.totem.pg.mrp.srp.members.18748426.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.18748426.ip (str) = r(0) ip(10.20.30.1) 
runtime.totem.pg.mrp.srp.members.18748426.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.18748426.status (str) = joined
runtime.totem.pg.mrp.srp.members.35525642.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.35525642.ip (str) = r(0) ip(10.20.30.2) 
runtime.totem.pg.mrp.srp.members.35525642.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.35525642.status (str) = joined

Another way to confirm, and one that is a little easier to read;

corosync-quorumtool -l
Membership information
----------------------
    Nodeid      Votes Name
  18748426          1 an-c03n01.bcn (local)
  35525642          1 an-c03n02.bcn

Voila!

Configuring Pacemaker

The remainder of this tutorial will revolve around configuring pacemaker.

Installing crmsh

cd /etc/yum.repos.d/
wget -c http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/Fedora_19/network:ha-clustering:Stable.repo
cd ~
yum install crmsh
network_ha-clustering_Stable                                                      | 1.6 kB  00:00:00     
updates/19/x86_64/metalink                                                        |  19 kB  00:00:00     
network_ha-clustering_Stable/primary                                              | 8.1 kB  00:00:01     
network_ha-clustering_Stable                                                                       35/35
Resolving Dependencies
--> Running transaction check
---> Package crmsh.x86_64 0:1.2.6-4.3 will be installed
--> Processing Dependency: python-dateutil for package: crmsh-1.2.6-4.3.x86_64
--> Processing Dependency: pssh for package: crmsh-1.2.6-4.3.x86_64
--> Running transaction check
---> Package pssh.noarch 0:2.3.1-4.fc19 will be installed
---> Package python-dateutil.noarch 0:1.5-6.fc19 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

=========================================================================================================
 Package                 Arch           Version               Repository                            Size
=========================================================================================================
Installing:
 crmsh                   x86_64         1.2.6-4.3             network_ha-clustering_Stable         469 k
Installing for dependencies:
 pssh                    noarch         2.3.1-4.fc19          fedora                                49 k
 python-dateutil         noarch         1.5-6.fc19            fedora                                85 k

Transaction Summary
=========================================================================================================
Install  1 Package (+2 Dependent packages)

Total download size: 603 k
Installed size: 2.1 M
Is this ok [y/d/N]: y
Downloading packages:
(1/3): python-dateutil-1.5-6.fc19.noarch.rpm                                      |  85 kB  00:00:00     
warning: /var/cache/yum/x86_64/19/network_ha-clustering_Stable/packages/crmsh-1.2.6-4.3.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID 17280ddf: NOKEY
Public key for crmsh-1.2.6-4.3.x86_64.rpm is not installed
(2/3): crmsh-1.2.6-4.3.x86_64.rpm                                                 | 469 kB  00:00:02     
(3/3): pssh-2.3.1-4.fc19.noarch.rpm                                               |  49 kB  00:00:19     
---------------------------------------------------------------------------------------------------------
Total                                                                     32 kB/s | 603 kB     00:19

You will be prompted to verify the GPG key; do so.

Retrieving key from http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/Fedora_19/repodata/repomd.xml.key
Importing GPG key 0x17280DDF:
 Userid     : "network OBS Project <network@build.opensuse.org>"
 Fingerprint: 0080 689b e757 a876 cb7d c269 62eb 1a09 1728 0ddf
 From       : http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/Fedora_19/repodata/repomd.xml.key
Is this ok [y/N]: y
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : pssh-2.3.1-4.fc19.noarch                                                              1/3 
  Installing : python-dateutil-1.5-6.fc19.noarch                                                     2/3 
  Installing : crmsh-1.2.6-4.3.x86_64                                                                3/3 
  Verifying  : python-dateutil-1.5-6.fc19.noarch                                                     1/3 
  Verifying  : pssh-2.3.1-4.fc19.noarch                                                              2/3 
  Verifying  : crmsh-1.2.6-4.3.x86_64                                                                3/3 

Installed:
  crmsh.x86_64 0:1.2.6-4.3                                                                               

Dependency Installed:
  pssh.noarch 0:2.3.1-4.fc19                     python-dateutil.noarch 0:1.5-6.fc19                    

Complete!

Initializing Pacemaker

First, start pacemakerd on both nodes.

systemctl start pacemaker.service

If you watch syslog, you will see messages like:

Feb 19 13:17:16 an-c03n01 systemd[1]: Starting Pacemaker High Availability Cluster Manager...
Feb 19 13:17:16 an-c03n01 systemd[1]: Started Pacemaker High Availability Cluster Manager.
Feb 19 13:17:16 an-c03n01 pacemakerd[1387]: Could not establish pacemakerd connection: Connection refused (111)
Feb 19 13:17:16 an-c03n01 pacemakerd[1387]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Feb 19 13:17:16 an-c03n01 pacemakerd[1387]:   notice: main: Starting Pacemaker 1.1.9-0.1318.a7966fb.git.fc18 (Build: a7966fb):  generated-manpages agent-manpages ncurses libqb-logging libqb-ipc upstart systemd nagios  corosync-native
Feb 19 13:17:16 an-c03n01 pacemakerd[1387]:   notice: corosync_node_name: Unable to get node name for nodeid 0
Feb 19 13:17:16 an-c03n01 pacemakerd[1387]:   notice: get_local_node_name: Defaulting to uname -n for the local corosync node name
Feb 19 13:17:16 an-c03n01 pacemakerd[1387]:   notice: update_node_processes: 0x1602f40 Node 186520586 now known as an-c03n01.alteeve.ca, was:
Feb 19 13:17:16 an-c03n01 pengine[1393]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Feb 19 13:17:16 an-c03n01 attrd[1392]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Feb 19 13:17:16 an-c03n01 attrd[1392]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Feb 19 13:17:16 an-c03n01 stonith-ng[1390]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Feb 19 13:17:16 an-c03n01 stonith-ng[1390]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Feb 19 13:17:16 an-c03n01 cib[1388]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Feb 19 13:17:16 an-c03n01 cib[1388]:   notice: main: Using new config location: /var/lib/pacemaker/cib
Feb 19 13:17:16 an-c03n01 cib[1388]:  warning: retrieveCib: Cluster configuration not found: /var/lib/pacemaker/cib/cib.xml
Feb 19 13:17:16 an-c03n01 cib[1388]:  warning: readCibXmlFile: Primary configuration corrupt or unusable, trying backup...
Feb 19 13:17:16 an-c03n01 cib[1388]:  warning: readCibXmlFile: Continuing with an empty configuration.
Feb 19 13:17:16 an-c03n01 pacemakerd[1387]:   notice: update_node_processes: 0x1908a10 Node 203297802 now known as an-c03n02.alteeve.ca, was:
Feb 19 13:17:16 an-c03n01 attrd[1392]:   notice: corosync_node_name: Unable to get node name for nodeid 186520586
Feb 19 13:17:16 an-c03n01 attrd[1392]:   notice: get_local_node_name: Defaulting to uname -n for the local corosync node name
Feb 19 13:17:16 an-c03n01 stonith-ng[1390]:   notice: corosync_node_name: Unable to get node name for nodeid 186520586
Feb 19 13:17:16 an-c03n01 attrd[1392]:   notice: main: Starting mainloop...
Feb 19 13:17:16 an-c03n01 stonith-ng[1390]:   notice: get_local_node_name: Defaulting to uname -n for the local corosync node name
Feb 19 13:17:16 an-c03n01 lrmd[1391]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Feb 19 13:17:16 an-c03n01 crmd[1394]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Feb 19 13:17:16 an-c03n01 crmd[1394]:   notice: main: CRM Git Version: a7966fb
Feb 19 13:17:16 an-c03n01 cib[1388]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Feb 19 13:17:16 an-c03n01 cib[1388]:   notice: corosync_node_name: Unable to get node name for nodeid 186520586
Feb 19 13:17:16 an-c03n01 cib[1388]:   notice: get_local_node_name: Defaulting to uname -n for the local corosync node name
Feb 19 13:17:17 an-c03n01 stonith-ng[1390]:   notice: setup_cib: Watching for stonith topology changes
Feb 19 13:17:17 an-c03n01 crmd[1394]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Feb 19 13:17:17 an-c03n01 crmd[1394]:   notice: corosync_node_name: Unable to get node name for nodeid 186520586
Feb 19 13:17:17 an-c03n01 crmd[1394]:   notice: get_local_node_name: Defaulting to uname -n for the local corosync node name
Feb 19 13:17:17 an-c03n01 crmd[1394]:   notice: init_quorum_connection: Quorum acquired
Feb 19 13:17:17 an-c03n01 crmd[1394]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node an-c03n01.alteeve.ca[186520586] - state is now member
Feb 19 13:17:17 an-c03n01 crmd[1394]:   notice: corosync_node_name: Unable to get node name for nodeid 203297802
Feb 19 13:17:17 an-c03n01 crmd[1394]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[203297802] - state is now member
Feb 19 13:17:17 an-c03n01 crmd[1394]:   notice: do_started: The local CRM is operational

You can see that pacemakerd is running using ps.

ps axf
...
 1336 ?        Ssl    0:10 corosync
 1387 ?        Ssl    0:00 /usr/sbin/pacemakerd -f
 1388 ?        Ssl    0:00  \_ /usr/libexec/pacemaker/cib
 1390 ?        Ss     0:00  \_ /usr/libexec/pacemaker/stonithd
 1391 ?        Ss     0:00  \_ /usr/libexec/pacemaker/lrmd
 1392 ?        Ss     0:00  \_ /usr/libexec/pacemaker/attrd
 1393 ?        Ss     0:00  \_ /usr/libexec/pacemaker/pengine
 1394 ?        Ss     0:00  \_ /usr/libexec/pacemaker/crmd

You can see the initial cib.xml using cibadmin.

cibadmin --query --local
<cib epoch="4" num_updates="6" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.7" cib-last-written="Tue Feb 19 13:17:38 2013" update-origin="an-c03n02.alteeve.ca" update-client="crmd" have-quorum="1" dc-uuid="203297802">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.9-0.1318.a7966fb.git.fc18-a7966fb"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="203297802" uname="an-c03n02.alteeve.ca"/>
      <node id="186520586" uname="an-c03n01.alteeve.ca"/>
    </nodes>
    <resources/>
    <constraints/>
  </configuration>
  <status>
    <node_state id="203297802" uname="an-c03n02.alteeve.ca" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
      <lrm id="203297802">
        <lrm_resources/>
      </lrm>
      <transient_attributes id="203297802">
        <instance_attributes id="status-203297802">
          <nvpair id="status-203297802-probe_complete" name="probe_complete" value="true"/>
        </instance_attributes>
      </transient_attributes>
    </node_state>
    <node_state id="186520586" uname="an-c03n01.alteeve.ca" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
      <lrm id="186520586">
        <lrm_resources/>
      </lrm>
      <transient_attributes id="186520586">
        <instance_attributes id="status-186520586">
          <nvpair id="status-186520586-probe_complete" name="probe_complete" value="true"/>
        </instance_attributes>
      </transient_attributes>
    </node_state>
  </status>
</cib>

Disable Quorum

In a two-node cluster, the loss of either node also costs the cluster quorum, so the default no-quorum policy would halt all services after a single node failure. Instead, we tell pacemaker to ignore the loss of quorum and rely on fencing (configured next) to protect against split-brain.

crm configure property no-quorum-policy=ignore
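You can confirm the property took hold by dumping the configuration back out of the CIB;

crm configure show

The property no-quorum-policy=ignore should appear in the cib-bootstrap-options property set.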

Configuring Fencing aka STONITH

Fencing/STONITH kills nodes dead; when a node's state is unknown, it is forcibly powered off (or cut off from the network and storage) so that it cannot touch shared resources while the surviving node recovers its services.

We'll be using two fence methods; IPMI and switched PDUs.

To see the list of available fence agents;

stonith_admin --list-installed
 fence_zvm
 fence_xenapi
 fence_xcat
 fence_wti
 fence_vmware_soap
 fence_vmware_helper
 fence_vmware
 fence_vixel
 fence_virsh
 fence_scsi
 fence_sanbox2
 fence_rsb
 fence_rsa
 fence_rhevm
 fence_rackswitch
 fence_pcmk
 fence_nss_wrapper
 fence_na
 fence_mcdata
 fence_lpar
 fence_legacy
 fence_ldom
 fence_kdump_send
 fence_kdump
 fence_ipmilan
 fence_ipdu
 fence_intelmodular
 fence_imm
 fence_ilo_mp
 fence_ilo3
 fence_ilo2
 fence_ilo
 fence_ifmib
 fence_idrac
 fence_ibmblade
 fence_hpblade
 fence_eps
 fence_egenera
 fence_eaton_snmp
 fence_drac5
 fence_drac
 fence_cpint
 fence_cisco_ucs
 fence_cisco_mds
 fence_bullpap
 fence_brocade
 fence_bladecenter
 fence_baytech
 fence_apc_snmp
 fence_apc
 fence_alom
 fence_ack_manual
52 devices found

Note that each agent will share some options and have unique options of its own. Be sure to check the man page and metadata for each agent when preparing to configure it.

Configuring IPMI Fencing

For this, we will use fence_ipmilan.

Check the agent's metadata;

stonith_admin --metadata --agent fence_ipmilan
<resource-agent name="fence_ipmilan" shortdesc="Fence agent for IPMI over LAN">
  <symlink name="fence_ilo3" shortdesc="Fence agent for HP iLO2"/>
  <symlink name="fence_idrac" shortdesc="Fence agent for Dell iDRAC"/>
  <symlink name="fence_imm" shortdesc="Fence agent for IBM Integrated Management Module"/>
  <longdesc>
  </longdesc>
  <parameters>
    <parameter name="auth" unique="0">
      <getopt mixed="-A"/>
      <content type="string"/>
      <shortdesc lang="en">
      </shortdesc>
    </parameter>
    <parameter name="ipaddr" unique="0">
      <getopt mixed="-a"/>
      <content type="string"/>
      <shortdesc lang="en">
      </shortdesc>
    </parameter>
    <parameter name="passwd" unique="0">
      <getopt mixed="-p"/>
      <content type="string"/>
      <shortdesc lang="en">
      </shortdesc>
    </parameter>
    <parameter name="passwd_script" unique="0">
      <getopt mixed="-S"/>
      <content type="string"/>
      <shortdesc lang="en">
      </shortdesc>
    </parameter>
    <parameter name="lanplus" unique="0">
      <getopt mixed="-P"/>
      <content type="boolean"/>
      <shortdesc lang="en">
      </shortdesc>
    </parameter>
    <parameter name="login" unique="0">
      <getopt mixed="-l"/>
      <content type="string"/>
      <shortdesc lang="en">
      </shortdesc>
    </parameter>
    <parameter name="action" unique="0">
      <getopt mixed="-o"/>
      <content type="string" default="reboot"/>
      <shortdesc lang="en">
      </shortdesc>
    </parameter>
    <parameter name="timeout" unique="0">
      <getopt mixed="-t"/>
      <content type="string"/>
      <shortdesc lang="en">
      </shortdesc>
    </parameter>
    <parameter name="cipher" unique="0">
      <getopt mixed="-C"/>
      <content type="string"/>
      <shortdesc lang="en">
      </shortdesc>
    </parameter>
    <parameter name="method" unique="0">
      <getopt mixed="-M"/>
      <content type="string" default="onoff"/>
      <shortdesc lang="en">
      </shortdesc>
    </parameter>
    <parameter name="power_wait" unique="0">
      <getopt mixed="-T"/>
      <content type="string" default="2"/>
      <shortdesc lang="en">
      </shortdesc>
    </parameter>
    <parameter name="delay" unique="0">
      <getopt mixed="-f"/>
      <content type="string"/>
      <shortdesc lang="en">
      </shortdesc>
    </parameter>
    <parameter name="privlvl" unique="0">
      <getopt mixed="-L"/>
      <content type="string"/>
      <shortdesc lang="en">
      </shortdesc>
    </parameter>
    <parameter name="verbose" unique="0">
      <getopt mixed="-v"/>
      <content type="boolean"/>
      <shortdesc lang="en">
      </shortdesc>
    </parameter>
  </parameters>
  <actions>
    <action name="on"/>
    <action name="off"/>
    <action name="reboot"/>
    <action name="status"/>
    <action name="diag"/>
    <action name="list"/>
    <action name="monitor"/>
    <action name="metadata"/>
    <action name="stop" timeout="20s"/>
    <action name="start" timeout="20s"/>
  </actions>
</resource-agent>

For the next little bit, we will be working in the CRM shell. Open the crm shell;

crm
crm(live)#

To be safe, we'll create a shadow CIB to work in. When we're happy, we'll push it into the live CIB.

cib new stonith
crm(live)# cib new stonith
INFO: stonith shadow CIB created

Now configure pacemaker to call the fence agent with the arguments needed to kill an-c03n01 and an-c03n02;

configure primitive fence_an-c03n01 stonith::fence_ipmilan params pcmk_host_list="an-c03n01" ipaddr="an-c03n01.ipmi" login="admin" passwd="admin" pcmk_host_check="static-list" op monitor interval="60"
configure primitive fence_an-c03n02 stonith::fence_ipmilan params pcmk_host_list="an-c03n02" ipaddr="an-c03n02.ipmi" login="admin" passwd="admin" pcmk_host_check="static-list" op monitor interval="60"

Next, we have to enable stonith;

configure property stonith-enabled="true"

Lastly, push the stonith shadow CIB over to the active one.

cib commit stonith
INFO: commited 'stonith' shadow CIB to the cluster

Now we can exit back to the shell.

quit
bye
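With the new configuration committed, it is worth checking that both fence primitives started, and then actually testing a fence call before trusting it. This is only a sketch, assuming the IPMI details above are correct, and be warned; the second command really will power-cycle the target node.

crm status
stonith_admin --reboot an-c03n02

crm status should list fence_an-c03n01 and fence_an-c03n02 as Started, and the stonith_admin call should force an-c03n02 to reboot. Repeat the test from an-c03n02 against an-c03n01.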


 

Any questions, feedback, advice, complaints or meanderings are welcome.