RHCS Stable 3 Tutorial - Multinode VM Cluster
Warning: This document is old, abandoned and very out of date. DON'T USE ANYTHING HERE! Consider it only as historical note-taking.
Overview
This tutorial will walk you through building two distinct clusters:
- A 2-node cluster using DRBD for real-time replicated storage backing an iSCSI SAN, with Pacemaker providing high availability.
- A 5-node cluster hosting KVM virtual servers, each VM hosted on a dedicated LUN from the SAN cluster, with Pacemaker providing high availability.
Pacemaker
This is a condensed adaptation of beekhof's http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch. Credit goes to him; any errors are mine.
Installing from git
# Dependencies
yum -y groupinstall development
yum -y install cluster-glue-libs-devel glib2-devel libxml2-devel libxslt-devel cman-devel corosync-devel libtool-ltdl-devel bzip2-devel fence-agents cluster-glue
# libqb
git clone git://github.com/asalkeld/libqb.git
cd libqb
./autogen.sh
./configure
make
make install
cd ~
# pacemaker
git clone git://github.com/beekhof/pacemaker.git
cd pacemaker
./autogen.sh
./configure --with-corosync --with-cman
make
make install
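After the install, it is worth a quick sanity check that the new binaries can be found and report the expected versions. This is only a sketch; depending on the configure prefix, pacemakerd may land in /usr/local/sbin rather than /usr/sbin, and the --features option is assumed to be available in this Pacemaker build.
# Confirm the binaries can be found
which corosync pacemakerd
# Print version / build information
corosync -v
pacemakerd --features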
Base Cluster
Notes:
- Create a multicast calculator (a toy sketch follows this list).
- ais_addr is set in beekhof's scriptlet to use the last interface on the system. Manually choose the BCN interface IP instead.
- Check the ais_* values with `env | grep ais_`.
- Does pacemaker 1.1 in EL6 support a second ring? Pacemaker doesn't care; rings are handled by corosync.
- To copy the configuration to the other node: for f in /etc/corosync/corosync.conf /etc/corosync/service.d/pcmk /etc/hosts; do scp $f pcmk-2:$f ; done
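As a toy take on the "multicast calculator" note above, the sketch below shows the Ethernet MAC address a multicast IP maps to. Only the low 23 bits of the IP survive the mapping, so two clusters whose mcastaddr values differ only in the high bits (for example 226.94.1.1 and 225.222.1.1) collide on the wire. The script and example address are illustrative only.
#!/bin/bash
# Toy multicast "calculator": show the Ethernet MAC a multicast IPv4 address maps to.
# The MAC is 01:00:5e plus the low 23 bits of the IP, so addresses that differ
# only in the top bits of the second octet map to the same MAC.
ip=${1:-226.94.1.1}
IFS=. read -r o1 o2 o3 o4 <<< "$ip"
printf '%s -> 01:00:5e:%02x:%02x:%02x\n' "$ip" $((o2 & 0x7f)) "$o3" "$o4"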
cp /etc/corosync/corosync.conf.example /etc/corosync/corosync.conf
vim /etc/corosync/corosync.conf
Add/edit the following in the 'interface { }' section; the three values that matter are bindnetaddr, mcastaddr and mcastport:
interface {
ringnumber: 0
# Interface to use for cluster comms (BCN). Must be the
# lowest IP in the subnet. We use /16, so '.0.0'.
bindnetaddr: 10.20.0.0
# Multicast IP used for CPG. Must be unique per cluster.
mcastaddr: 226.94.1.1
# Multicast UDP port. Must be unique per ring.
mcastport: 5405
ttl: 1
}
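The bindnetaddr value is simply the network address of the BCN subnet. If you would rather not work it out by hand, ipcalc (assumed here to be the version shipped in EL6's initscripts package) can derive it from any node's BCN IP and prefix; a sketch:
# Derive the network address to use for bindnetaddr
ipcalc -n 10.20.0.1/16    # should print NETWORK=10.20.0.0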
Create the pacemaker service file.
vim /etc/corosync/service.d/pcmk
service {
# Load the Pacemaker Cluster Resource Manager
name: pacemaker
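# 'ver: 1' tells the corosync plugin not to start the Pacemaker daemons itself;
# they are started separately via the pacemaker init script (see below).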
ver: 1
}
Copy the two files to the other node.
rsync -av /etc/corosync/service.d/pcmk root@an-node02:/etc/corosync/service.d/
rsync -av /etc/corosync/corosync.conf root@an-node02:/etc/corosync/
Make the log directory /var/log/cluster writable by members of the root group, which the pacemaker user belongs to, and fix ownership of the Pacemaker state directories:
chmod g+rwx /var/log/cluster
chown hacluster:haclient /usr/var/lib/pengine
chown hacluster:haclient /usr/var/lib/heartbeat/crm
chown hacluster:haclient /var/lib/heartbeat/cores/hacluster
Start the cluster:
/etc/init.d/corosync start
Starting Corosync Cluster Engine (corosync): [ OK ]
In the log file of the first node to start (the second machine joining can be seen near the end):
Jan 10 00:13:20 an-node01 corosync[6637]: [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
Jan 10 00:13:20 an-node01 corosync[6637]: [MAIN ] Corosync built-in features: nss dbus rdma snmp
Jan 10 00:13:20 an-node01 corosync[6637]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
Jan 10 00:13:20 an-node01 corosync[6637]: [TOTEM ] Initializing transport (UDP/IP Multicast).
Jan 10 00:13:20 an-node01 corosync[6637]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jan 10 00:13:21 an-node01 corosync[6637]: [TOTEM ] The network interface [10.20.0.1] is now up.
Jan 10 00:13:21 an-node01 corosync[6637]: [pcmk ] Logging: Initialized pcmk_startup
Jan 10 00:13:21 an-node01 corosync[6637]: [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.1.6
Jan 10 00:13:21 an-node01 corosync[6637]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
Jan 10 00:13:21 an-node01 corosync[6637]: [SERV ] Service engine loaded: corosync configuration service
Jan 10 00:13:21 an-node01 corosync[6637]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
Jan 10 00:13:21 an-node01 corosync[6637]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
Jan 10 00:13:21 an-node01 corosync[6637]: [SERV ] Service engine loaded: corosync profile loading service
Jan 10 00:13:21 an-node01 corosync[6637]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Jan 10 00:13:21 an-node01 corosync[6637]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Jan 10 00:13:21 an-node01 corosync[6637]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 10 00:13:21 an-node01 corosync[6637]: [MAIN ] discarded unknown message 552 for service 0 (max id 1)
Jan 10 00:13:21 an-node01 corosync[6637]: [CPG ] chosen downlist: sender r(0) ip(10.20.0.1) ; members(old:0 left:0)
Jan 10 00:13:21 an-node01 corosync[6637]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 10 00:13:22 an-node01 corosync[6637]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 10 00:13:22 an-node01 corosync[6637]: [MAIN ] discarded unknown message 552 for service 0 (max id 1)
Jan 10 00:13:22 an-node01 corosync[6637]: [MAIN ] discarded unknown message 552 for service 0 (max id 1)
Jan 10 00:13:22 an-node01 corosync[6637]: [MAIN ] discarded unknown message 552 for service 0 (max id 1)
Jan 10 00:13:22 an-node01 corosync[6637]: [CPG ] chosen downlist: sender r(0) ip(10.20.0.1) ; members(old:1 left:0)
Jan 10 00:13:22 an-node01 corosync[6637]: [MAIN ] Completed service synchronization, ready to provide service.
In the log file of the second node to start (joins the existing cluster):
Jan 10 00:13:20 an-node02 corosync[12221]: [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
Jan 10 00:13:20 an-node02 corosync[12221]: [MAIN ] Corosync built-in features: nss dbus rdma snmp
Jan 10 00:13:20 an-node02 corosync[12221]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
Jan 10 00:13:20 an-node02 corosync[12221]: [TOTEM ] Initializing transport (UDP/IP Multicast).
Jan 10 00:13:20 an-node02 corosync[12221]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jan 10 00:13:20 an-node02 corosync[12221]: [TOTEM ] The network interface [10.20.0.2] is now up.
Jan 10 00:13:20 an-node02 corosync[12221]: [pcmk ] Logging: Initialized pcmk_startup
Jan 10 00:13:20 an-node02 corosync[12221]: [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.1.6
Jan 10 00:13:20 an-node02 corosync[12221]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
Jan 10 00:13:20 an-node02 corosync[12221]: [SERV ] Service engine loaded: corosync configuration service
Jan 10 00:13:20 an-node02 corosync[12221]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
Jan 10 00:13:20 an-node02 corosync[12221]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
Jan 10 00:13:20 an-node02 corosync[12221]: [SERV ] Service engine loaded: corosync profile loading service
Jan 10 00:13:20 an-node02 corosync[12221]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Jan 10 00:13:20 an-node02 corosync[12221]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Jan 10 00:13:20 an-node02 corosync[12221]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 10 00:13:20 an-node02 corosync[12221]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 10 00:13:20 an-node02 corosync[12221]: [MAIN ] discarded unknown message 552 for service 0 (max id 1)
Jan 10 00:13:20 an-node02 corosync[12221]: [MAIN ] discarded unknown message 552 for service 0 (max id 1)
Jan 10 00:13:20 an-node02 corosync[12221]: [MAIN ] discarded unknown message 552 for service 0 (max id 1)
Jan 10 00:13:20 an-node02 corosync[12221]: [CPG ] chosen downlist: sender r(0) ip(10.20.0.1) ; members(old:1 left:0)
Jan 10 00:13:20 an-node02 corosync[12221]: [MAIN ] Completed service synchronization, ready to provide service.
Start pacemaker:
/etc/init.d/pacemaker start
Starting Pacemaker Cluster Manager: [ OK ]
In the log files:
May 28 15:56:54 an-node01 pacemakerd: [3527]: info: Invoked: pacemakerd
May 28 15:56:54 an-node01 pacemakerd: [3527]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
May 28 15:56:54 an-node01 pacemakerd: [3527]: info: read_config: Reading configure
May 28 15:56:54 an-node01 pacemakerd: [3527]: info: config_find_next: Processing additional service options...
May 28 15:56:54 an-node01 pacemakerd: [3527]: info: get_config_opt: Found 'pacemaker' for option: name
May 28 15:56:54 an-node01 pacemakerd: [3527]: info: get_config_opt: Found '1' for option: ver
May 28 15:56:54 an-node01 pacemakerd: [3527]: info: get_config_opt: Defaulting to 'no' for option: use_logd
May 28 15:56:54 an-node01 pacemakerd: [3527]: info: get_config_opt: Defaulting to 'no' for option: use_mgmtd
May 28 15:56:54 an-node01 pacemakerd: [3527]: info: config_find_next: No additional configuration supplied for: service
May 28 15:56:54 an-node01 pacemakerd: [3527]: info: config_find_next: Processing additional logging options...
May 28 15:56:54 an-node01 pacemakerd: [3527]: info: get_config_opt: Found 'off' for option: debug
May 28 15:56:54 an-node01 pacemakerd: [3527]: info: get_config_opt: Found 'yes' for option: to_logfile
May 28 15:56:54 an-node01 pacemakerd: [3527]: info: get_config_opt: Found '/var/log/cluster/corosync.log' for option: logfile
May 28 15:56:54 an-node01 pacemakerd: [3527]: info: get_config_opt: Found 'yes' for option: to_syslog
May 28 15:56:54 an-node01 pacemakerd: [3527]: info: get_config_opt: Defaulting to 'daemon' for option: syslog_facility
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: main: Starting Pacemaker 1.1.5-5.el6 (Build: 01e86afaaa6d4a8c4836f68df80ababd6ca3902f): manpages docbook-manpages publican ncurses cman cs-quorum corosync snmp libesmtp
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: main: Maximum core file size is: 18446744073709551615
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: update_node_processes: 0x13f7a20 Node 1191422144 now known as an-node01.alteeve.com (was: (null))
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: update_node_processes: Node an-node01.alteeve.com now has process list: 00000000000000000000000000000002 (was 00000000000000000000000000000000)
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: G_main_add_SignalHandler: Added signal handler for signal 17
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: start_child: Forked child 3534 for process stonith-ng
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: update_node_processes: Node an-node01.alteeve.com now has process list: 00000000000000000000000000100002 (was 00000000000000000000000000000002)
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: start_child: Forked child 3535 for process cib
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: update_node_processes: Node an-node01.alteeve.com now has process list: 00000000000000000000000000100102 (was 00000000000000000000000000100002)
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: start_child: Forked child 3536 for process lrmd
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: update_node_processes: Node an-node01.alteeve.com now has process list: 00000000000000000000000000100112 (was 00000000000000000000000000100102)
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: start_child: Forked child 3537 for process attrd
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: update_node_processes: Node an-node01.alteeve.com now has process list: 00000000000000000000000000101112 (was 00000000000000000000000000100112)
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: start_child: Forked child 3538 for process pengine
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: update_node_processes: Node an-node01.alteeve.com now has process list: 00000000000000000000000000111112 (was 00000000000000000000000000101112)
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: start_child: Forked child 3539 for process crmd
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: update_node_processes: Node an-node01.alteeve.com now has process list: 00000000000000000000000000111312 (was 00000000000000000000000000111112)
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: main: Starting mainloop
May 28 15:56:54 an-node01 cib: [3535]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
May 28 15:56:54 an-node01 stonith-ng: [3534]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
May 28 15:56:54 an-node01 cib: [3535]: info: G_main_add_TriggerHandler: Added signal manual handler
May 28 15:56:54 an-node01 cib: [3535]: info: G_main_add_SignalHandler: Added signal handler for signal 17
May 28 15:56:54 an-node01 stonith-ng: [3534]: info: G_main_add_SignalHandler: Added signal handler for signal 17
May 28 15:56:54 an-node01 cib: [3535]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
May 28 15:56:54 an-node01 stonith-ng: [3534]: info: get_cluster_type: Cluster type is: 'openais'.
May 28 15:56:54 an-node01 cib: [3535]: info: validate_with_relaxng: Creating RNG parser context
May 28 15:56:54 an-node01 stonith-ng: [3534]: info: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
May 28 15:56:54 an-node01 stonith-ng: [3534]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
May 28 15:56:54 an-node01 lrmd: [3536]: info: G_main_add_SignalHandler: Added signal handler for signal 15
May 28 15:56:54 an-node01 crmd: [3539]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
May 28 15:56:54 an-node01 attrd: [3537]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
May 28 15:56:54 an-node01 crmd: [3539]: info: main: CRM Hg Version: 01e86afaaa6d4a8c4836f68df80ababd6ca3902f
May 28 15:56:54 an-node01 crmd: [3539]: info: crmd_init: Starting crmd
May 28 15:56:54 an-node01 crmd: [3539]: info: G_main_add_SignalHandler: Added signal handler for signal 17
May 28 15:56:54 an-node01 attrd: [3537]: info: main: Starting up
May 28 15:56:54 an-node01 attrd: [3537]: info: get_cluster_type: Cluster type is: 'openais'.
May 28 15:56:54 an-node01 attrd: [3537]: info: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
May 28 15:56:54 an-node01 attrd: [3537]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
May 28 15:56:54 an-node01 stonith-ng: [3534]: info: init_ais_connection_classic: AIS connection established
May 28 15:56:54 an-node01 corosync[3019]: [pcmk ] info: pcmk_ipc: Recorded connection 0x18aa430 for stonith-ng/0
May 28 15:56:54 an-node01 stonith-ng: [3534]: info: get_ais_nodeid: Server details: id=1191422144 uname=an-node01.alteeve.com cname=pcmk
May 28 15:56:54 an-node01 stonith-ng: [3534]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
May 28 15:56:54 an-node01 stonith-ng: [3534]: info: crm_new_peer: Node an-node01.alteeve.com now has id: 1191422144
May 28 15:56:54 an-node01 stonith-ng: [3534]: info: crm_new_peer: Node 1191422144 is now known as an-node01.alteeve.com
May 28 15:56:54 an-node01 stonith-ng: [3534]: info: main: Starting stonith-ng mainloop
May 28 15:56:54 an-node01 lrmd: [3536]: info: G_main_add_SignalHandler: Added signal handler for signal 17
May 28 15:56:54 an-node01 stonith-ng: [3534]: info: crm_update_peer: Node an-node01.alteeve.com: id=1191422144 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000111312 (new)
May 28 15:56:54 an-node01 attrd: [3537]: info: init_ais_connection_classic: AIS connection established
May 28 15:56:54 an-node01 lrmd: [3536]: info: enabling coredumps
May 28 15:56:54 an-node01 corosync[3019]: [pcmk ] info: pcmk_ipc: Recorded connection 0x18ae790 for attrd/0
May 28 15:56:54 an-node01 lrmd: [3536]: info: G_main_add_SignalHandler: Added signal handler for signal 10
May 28 15:56:54 an-node01 attrd: [3537]: info: get_ais_nodeid: Server details: id=1191422144 uname=an-node01.alteeve.com cname=pcmk
May 28 15:56:54 an-node01 lrmd: [3536]: info: G_main_add_SignalHandler: Added signal handler for signal 12
May 28 15:56:54 an-node01 attrd: [3537]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
May 28 15:56:54 an-node01 lrmd: [3536]: info: Started.
May 28 15:56:54 an-node01 attrd: [3537]: info: crm_new_peer: Node an-node01.alteeve.com now has id: 1191422144
May 28 15:56:54 an-node01 attrd: [3537]: info: crm_new_peer: Node 1191422144 is now known as an-node01.alteeve.com
May 28 15:56:54 an-node01 attrd: [3537]: info: main: Cluster connection active
May 28 15:56:54 an-node01 attrd: [3537]: info: main: Accepting attribute updates
May 28 15:56:54 an-node01 attrd: [3537]: info: main: Starting mainloop...
May 28 15:56:54 an-node01 attrd: [3537]: info: crm_update_peer: Node an-node01.alteeve.com: id=1191422144 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000111312 (new)
May 28 15:56:54 an-node01 cib: [3535]: info: startCib: CIB Initialization completed successfully
May 28 15:56:54 an-node01 cib: [3535]: info: get_cluster_type: Cluster type is: 'openais'.
May 28 15:56:54 an-node01 cib: [3535]: info: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
May 28 15:56:54 an-node01 cib: [3535]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
May 28 15:56:54 an-node01 cib: [3535]: info: init_ais_connection_classic: AIS connection established
May 28 15:56:54 an-node01 corosync[3019]: [pcmk ] info: pcmk_ipc: Recorded connection 0x18b2af0 for cib/0
May 28 15:56:54 an-node01 corosync[3019]: [pcmk ] info: pcmk_ipc: Sending membership update 32 to cib
May 28 15:56:54 an-node01 cib: [3535]: info: get_ais_nodeid: Server details: id=1191422144 uname=an-node01.alteeve.com cname=pcmk
May 28 15:56:54 an-node01 cib: [3535]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
May 28 15:56:54 an-node01 cib: [3535]: info: crm_new_peer: Node an-node01.alteeve.com now has id: 1191422144
May 28 15:56:54 an-node01 cib: [3535]: info: crm_new_peer: Node 1191422144 is now known as an-node01.alteeve.com
May 28 15:56:54 an-node01 cib: [3535]: info: cib_init: Starting cib mainloop
May 28 15:56:54 an-node01 cib: [3535]: notice: ais_dispatch_message: Membership 32: quorum acquired
May 28 15:56:54 an-node01 cib: [3535]: info: crm_update_peer: Node an-node01.alteeve.com: id=1191422144 state=member (new) addr=r(0) ip(192.168.3.71) (new) votes=1 (new) born=32 seen=32 proc=00000000000000000000000000000000
May 28 15:56:54 an-node01 cib: [3535]: info: crm_new_peer: Node an-node02.alteeve.com now has id: 1208199360
May 28 15:56:54 an-node01 cib: [3535]: info: crm_new_peer: Node 1208199360 is now known as an-node02.alteeve.com
May 28 15:56:54 an-node01 cib: [3535]: info: crm_update_peer: Node an-node02.alteeve.com: id=1208199360 state=member (new) addr=r(0) ip(192.168.3.72) votes=1 born=32 seen=32 proc=00000000000000000000000000000000
May 28 15:56:54 an-node01 cib: [3535]: info: crm_update_peer: Node an-node01.alteeve.com: id=1191422144 state=member addr=r(0) ip(192.168.3.71) votes=1 born=32 seen=32 proc=00000000000000000000000000111312 (new)
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: update_node_processes: 0x13fc1e0 Node 1208199360 now known as an-node02.alteeve.com (was: (null))
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: update_node_processes: Node an-node02.alteeve.com now has process list: 00000000000000000000000000000002 (was 00000000000000000000000000000000)
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: update_node_processes: Node an-node02.alteeve.com now has process list: 00000000000000000000000000100002 (was 00000000000000000000000000000002)
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: update_node_processes: Node an-node02.alteeve.com now has process list: 00000000000000000000000000100102 (was 00000000000000000000000000100002)
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: update_node_processes: Node an-node02.alteeve.com now has process list: 00000000000000000000000000100112 (was 00000000000000000000000000100102)
May 28 15:56:54 an-node01 stonith-ng: [3534]: info: crm_new_peer: Node 0 is now known as an-node02.alteeve.com
May 28 15:56:54 an-node01 stonith-ng: [3534]: info: crm_update_peer: Node an-node02.alteeve.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000100002 (new)
May 28 15:56:54 an-node01 attrd: [3537]: info: crm_new_peer: Node 0 is now known as an-node02.alteeve.com
May 28 15:56:54 an-node01 stonith-ng: [3534]: info: crm_update_peer: Node an-node02.alteeve.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000100102 (new)
May 28 15:56:54 an-node01 stonith-ng: [3534]: info: crm_update_peer: Node an-node02.alteeve.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000100112 (new)
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: update_node_processes: Node an-node02.alteeve.com now has process list: 00000000000000000000000000101112 (was 00000000000000000000000000100112)
May 28 15:56:54 an-node01 cib: [3535]: info: crm_update_peer: Node an-node02.alteeve.com: id=1208199360 state=member addr=r(0) ip(192.168.3.72) votes=1 born=32 seen=32 proc=00000000000000000000000000000002 (new)
May 28 15:56:54 an-node01 attrd: [3537]: info: crm_update_peer: Node an-node02.alteeve.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000100002 (new)
May 28 15:56:54 an-node01 stonith-ng: [3534]: info: crm_update_peer: Node an-node02.alteeve.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000101112 (new)
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: update_node_processes: Node an-node02.alteeve.com now has process list: 00000000000000000000000000111112 (was 00000000000000000000000000101112)
May 28 15:56:54 an-node01 attrd: [3537]: info: crm_update_peer: Node an-node02.alteeve.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000100102 (new)
May 28 15:56:54 an-node01 stonith-ng: [3534]: info: crm_update_peer: Node an-node02.alteeve.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000111112 (new)
May 28 15:56:54 an-node01 pacemakerd: [3530]: info: update_node_processes: Node an-node02.alteeve.com now has process list: 00000000000000000000000000111312 (was 00000000000000000000000000111112)
May 28 15:56:54 an-node01 attrd: [3537]: info: crm_update_peer: Node an-node02.alteeve.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000100112 (new)
May 28 15:56:54 an-node01 stonith-ng: [3534]: info: crm_update_peer: Node an-node02.alteeve.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000111312 (new)
May 28 15:56:54 an-node01 attrd: [3537]: info: crm_update_peer: Node an-node02.alteeve.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000101112 (new)
May 28 15:56:54 an-node01 cib: [3535]: info: crm_update_peer: Node an-node02.alteeve.com: id=1208199360 state=member addr=r(0) ip(192.168.3.72) votes=1 born=32 seen=32 proc=00000000000000000000000000100002 (new)
May 28 15:56:54 an-node01 cib: [3535]: info: crm_update_peer: Node an-node02.alteeve.com: id=1208199360 state=member addr=r(0) ip(192.168.3.72) votes=1 born=32 seen=32 proc=00000000000000000000000000100102 (new)
May 28 15:56:54 an-node01 cib: [3535]: info: crm_update_peer: Node an-node02.alteeve.com: id=1208199360 state=member addr=r(0) ip(192.168.3.72) votes=1 born=32 seen=32 proc=00000000000000000000000000100112 (new)
May 28 15:56:54 an-node01 attrd: [3537]: info: crm_update_peer: Node an-node02.alteeve.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000111112 (new)
May 28 15:56:54 an-node01 cib: [3535]: info: crm_update_peer: Node an-node02.alteeve.com: id=1208199360 state=member addr=r(0) ip(192.168.3.72) votes=1 born=32 seen=32 proc=00000000000000000000000000101112 (new)
May 28 15:56:54 an-node01 cib: [3535]: info: crm_update_peer: Node an-node02.alteeve.com: id=1208199360 state=member addr=r(0) ip(192.168.3.72) votes=1 born=32 seen=32 proc=00000000000000000000000000111112 (new)
May 28 15:56:54 an-node01 cib: [3535]: info: crm_update_peer: Node an-node02.alteeve.com: id=1208199360 state=member addr=r(0) ip(192.168.3.72) votes=1 born=32 seen=32 proc=00000000000000000000000000111312 (new)
May 28 15:56:54 an-node01 attrd: [3537]: info: crm_update_peer: Node an-node02.alteeve.com: id=0 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000111312 (new)
May 28 15:56:55 an-node01 crmd: [3539]: info: do_cib_control: CIB connection established
May 28 15:56:55 an-node01 crmd: [3539]: info: get_cluster_type: Cluster type is: 'openais'.
May 28 15:56:55 an-node01 crmd: [3539]: info: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
May 28 15:56:55 an-node01 crmd: [3539]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
May 28 15:56:55 an-node01 crmd: [3539]: info: init_ais_connection_classic: AIS connection established
May 28 15:56:55 an-node01 corosync[3019]: [pcmk ] info: pcmk_ipc: Recorded connection 0x18b6e50 for crmd/0
May 28 15:56:55 an-node01 corosync[3019]: [pcmk ] info: pcmk_ipc: Sending membership update 32 to crmd
May 28 15:56:55 an-node01 crmd: [3539]: info: get_ais_nodeid: Server details: id=1191422144 uname=an-node01.alteeve.com cname=pcmk
May 28 15:56:55 an-node01 crmd: [3539]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
May 28 15:56:55 an-node01 crmd: [3539]: info: crm_new_peer: Node an-node01.alteeve.com now has id: 1191422144
May 28 15:56:55 an-node01 crmd: [3539]: info: crm_new_peer: Node 1191422144 is now known as an-node01.alteeve.com
May 28 15:56:55 an-node01 crmd: [3539]: info: ais_status_callback: status: an-node01.alteeve.com is now unknown
May 28 15:56:55 an-node01 crmd: [3539]: info: do_ha_control: Connected to the cluster
May 28 15:56:55 an-node01 crmd: [3539]: info: do_started: Delaying start, no membership data (0000000000100000)
May 28 15:56:55 an-node01 crmd: [3539]: info: crmd_init: Starting crmd's mainloop
May 28 15:56:55 an-node01 crmd: [3539]: notice: ais_dispatch_message: Membership 32: quorum acquired
May 28 15:56:55 an-node01 crmd: [3539]: info: ais_status_callback: status: an-node01.alteeve.com is now member (was unknown)
May 28 15:56:55 an-node01 crmd: [3539]: info: crm_update_peer: Node an-node01.alteeve.com: id=1191422144 state=member (new) addr=r(0) ip(192.168.3.71) (new) votes=1 (new) born=32 seen=32 proc=00000000000000000000000000000000
May 28 15:56:55 an-node01 crmd: [3539]: info: crm_new_peer: Node an-node02.alteeve.com now has id: 1208199360
May 28 15:56:55 an-node01 crmd: [3539]: info: crm_new_peer: Node 1208199360 is now known as an-node02.alteeve.com
May 28 15:56:55 an-node01 crmd: [3539]: info: ais_status_callback: status: an-node02.alteeve.com is now unknown
May 28 15:56:55 an-node01 crmd: [3539]: info: ais_status_callback: status: an-node02.alteeve.com is now member (was unknown)
May 28 15:56:55 an-node01 crmd: [3539]: info: crm_update_peer: Node an-node02.alteeve.com: id=1208199360 state=member (new) addr=r(0) ip(192.168.3.72) votes=1 born=32 seen=32 proc=00000000000000000000000000000000
May 28 15:56:55 an-node01 crmd: [3539]: notice: crmd_peer_update: Status update: Client an-node01.alteeve.com/crmd now has status [online] (DC=<null>)
May 28 15:56:55 an-node01 crmd: [3539]: info: crm_update_peer: Node an-node01.alteeve.com: id=1191422144 state=member addr=r(0) ip(192.168.3.71) votes=1 born=32 seen=32 proc=00000000000000000000000000111312 (new)
May 28 15:56:55 an-node01 crmd: [3539]: notice: crmd_peer_update: Status update: Client an-node02.alteeve.com/crmd now has status [online] (DC=<null>)
May 28 15:56:55 an-node01 crmd: [3539]: info: crm_update_peer: Node an-node02.alteeve.com: id=1208199360 state=member addr=r(0) ip(192.168.3.72) votes=1 born=32 seen=32 proc=00000000000000000000000000111312 (new)
May 28 15:56:55 an-node01 crmd: [3539]: info: do_started: Delaying start, Config not read (0000000000000040)
May 28 15:56:55 an-node01 crmd: [3539]: info: do_started: Delaying start, Config not read (0000000000000040)
May 28 15:56:55 an-node01 crmd: [3539]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
May 28 15:56:55 an-node01 crmd: [3539]: info: config_query_callback: Checking for expired actions every 900000ms
May 28 15:56:55 an-node01 crmd: [3539]: info: config_query_callback: Sending expected-votes=2 to corosync
May 28 15:56:55 an-node01 crmd: [3539]: info: do_started: The local CRM is operational
May 28 15:56:55 an-node01 crmd: [3539]: info: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
May 28 15:56:56 an-node01 crmd: [3539]: info: ais_dispatch_message: Membership 32: quorum retained
May 28 15:56:56 an-node01 crmd: [3539]: info: te_connect_stonith: Attempting connection to fencing daemon...
May 28 15:56:57 an-node01 crmd: [3539]: info: te_connect_stonith: Connected
May 28 15:56:59 an-node01 attrd: [3537]: info: cib_connect: Connected to the CIB after 1 signon attempts
May 28 15:56:59 an-node01 attrd: [3537]: info: cib_connect: Sending full refresh
May 28 15:57:56 an-node01 crmd: [3539]: info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped! (60000ms)
May 28 15:57:56 an-node01 crmd: [3539]: WARN: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
May 28 15:57:56 an-node01 crmd: [3539]: info: do_state_transition: State transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped ]
May 28 15:57:56 an-node01 crmd: [3539]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
May 28 15:57:56 an-node01 crmd: [3539]: info: do_te_control: Registering TE UUID: 492bee5d-5336-4981-b74f-db6eb6e04f38
May 28 15:57:56 an-node01 crmd: [3539]: WARN: cib_client_add_notify_callback: Callback already present
May 28 15:57:56 an-node01 crmd: [3539]: info: set_graph_functions: Setting custom graph functions
May 28 15:57:56 an-node01 crmd: [3539]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
May 28 15:57:56 an-node01 crmd: [3539]: info: do_dc_takeover: Taking over DC status for this partition
May 28 15:57:56 an-node01 cib: [3535]: info: cib_process_readwrite: We are now in R/W mode
May 28 15:57:56 an-node01 cib: [3535]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/5, version=0.5.1): ok (rc=0)
May 28 15:57:56 an-node01 cib: [3535]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/6, version=0.5.2): ok (rc=0)
May 28 15:57:56 an-node01 cib: [3535]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/8, version=0.5.3): ok (rc=0)
May 28 15:57:56 an-node01 crmd: [3539]: info: join_make_offer: Making join offers based on membership 32
May 28 15:57:56 an-node01 crmd: [3539]: info: do_dc_join_offer_all: join-1: Waiting on 2 outstanding join acks
May 28 15:57:56 an-node01 cib: [3535]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/10, version=0.5.4): ok (rc=0)
May 28 15:57:56 an-node01 crmd: [3539]: info: ais_dispatch_message: Membership 32: quorum retained
May 28 15:57:56 an-node01 crmd: [3539]: info: crmd_ais_dispatch: Setting expected votes to 2
May 28 15:57:56 an-node01 crmd: [3539]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
May 28 15:57:56 an-node01 crmd: [3539]: info: config_query_callback: Checking for expired actions every 900000ms
May 28 15:57:56 an-node01 cib: [3535]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/13, version=0.5.5): ok (rc=0)
May 28 15:57:56 an-node01 crmd: [3539]: info: config_query_callback: Sending expected-votes=2 to corosync
May 28 15:57:56 an-node01 crmd: [3539]: info: update_dc: Set DC to an-node01.alteeve.com (3.0.5)
May 28 15:57:56 an-node01 crmd: [3539]: info: ais_dispatch_message: Membership 32: quorum retained
May 28 15:57:56 an-node01 crmd: [3539]: info: crmd_ais_dispatch: Setting expected votes to 2
May 28 15:57:56 an-node01 cib: [3535]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/16, version=0.5.6): ok (rc=0)
May 28 15:57:56 an-node01 crmd: [3539]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
May 28 15:57:56 an-node01 crmd: [3539]: info: do_state_transition: All 2 cluster nodes responded to the join offer.
May 28 15:57:56 an-node01 crmd: [3539]: info: do_dc_join_finalize: join-1: Syncing the CIB from an-node01.alteeve.com to the rest of the cluster
May 28 15:57:56 an-node01 cib: [3535]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/17, version=0.5.6): ok (rc=0)
May 28 15:57:56 an-node01 cib: [3535]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/18, version=0.5.7): ok (rc=0)
May 28 15:57:56 an-node01 cib: [3535]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/19, version=0.5.8): ok (rc=0)
May 28 15:57:56 an-node01 crmd: [3539]: info: update_attrd: Connecting to attrd...
May 28 15:57:56 an-node01 cib: [3535]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='an-node01.alteeve.com']/transient_attributes (origin=local/crmd/20, version=0.5.9): ok (rc=0)
May 28 15:57:56 an-node01 cib: [3535]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='an-node02.alteeve.com']/transient_attributes (origin=an-node02.alteeve.com/crmd/7, version=0.5.10): ok (rc=0)
May 28 15:57:56 an-node01 attrd: [3537]: info: find_hash_entry: Creating hash entry for terminate
May 28 15:57:56 an-node01 attrd: [3537]: info: find_hash_entry: Creating hash entry for shutdown
May 28 15:57:56 an-node01 crmd: [3539]: info: erase_xpath_callback: Deletion of "//node_state[@uname='an-node01.alteeve.com']/transient_attributes": ok (rc=0)
May 28 15:57:56 an-node01 crmd: [3539]: info: do_dc_join_ack: join-1: Updating node state to member for an-node02.alteeve.com
May 28 15:57:56 an-node01 crmd: [3539]: info: do_dc_join_ack: join-1: Updating node state to member for an-node01.alteeve.com
May 28 15:57:56 an-node01 cib: [3535]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='an-node02.alteeve.com']/lrm (origin=local/crmd/21, version=0.5.11): ok (rc=0)
May 28 15:57:56 an-node01 cib: [3535]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='an-node01.alteeve.com']/lrm (origin=local/crmd/23, version=0.5.13): ok (rc=0)
May 28 15:57:56 an-node01 attrd: [3537]: info: crm_get_peer: Node an-node02.alteeve.com now has id: 1208199360
May 28 15:57:56 an-node01 crmd: [3539]: info: erase_xpath_callback: Deletion of "//node_state[@uname='an-node02.alteeve.com']/lrm": ok (rc=0)
May 28 15:57:56 an-node01 crmd: [3539]: info: erase_xpath_callback: Deletion of "//node_state[@uname='an-node01.alteeve.com']/lrm": ok (rc=0)
May 28 15:57:56 an-node01 crmd: [3539]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
May 28 15:57:56 an-node01 crmd: [3539]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
May 28 15:57:56 an-node01 cib: [3535]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/25, version=0.5.15): ok (rc=0)
May 28 15:57:56 an-node01 crmd: [3539]: info: do_dc_join_final: Ensuring DC, quorum and node attributes are up-to-date
May 28 15:57:56 an-node01 crmd: [3539]: info: crm_update_quorum: Updating quorum status to true (call=27)
May 28 15:57:56 an-node01 crmd: [3539]: info: abort_transition_graph: do_te_invoke:173 - Triggered transition abort (complete=1) : Peer Cancelled
May 28 15:57:56 an-node01 crmd: [3539]: info: do_pe_invoke: Query 28: Requesting the current CIB: S_POLICY_ENGINE
May 28 15:57:56 an-node01 cib: [3535]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/27, version=0.5.17): ok (rc=0)
May 28 15:57:56 an-node01 attrd: [3537]: info: attrd_local_callback: Sending full refresh (origin=crmd)
May 28 15:57:56 an-node01 attrd: [3537]: info: attrd_trigger_update: Sending flush op to all hosts for: shutdown (<null>)
May 28 15:57:56 an-node01 attrd: [3537]: info: attrd_trigger_update: Sending flush op to all hosts for: terminate (<null>)
May 28 15:57:56 an-node01 pengine: [3538]: ERROR: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
May 28 15:57:56 an-node01 pengine: [3538]: ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
May 28 15:57:56 an-node01 pengine: [3538]: ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
May 28 15:57:56 an-node01 pengine: [3538]: notice: stage6: Delaying fencing operations until there are resources to manage
May 28 15:57:56 an-node01 crmd: [3539]: info: do_pe_invoke_callback: Invoking the PE: query=28, ref=pe_calc-dc-1306612676-9, seq=32, quorate=1
May 28 15:57:56 an-node01 crmd: [3539]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
May 28 15:57:56 an-node01 crmd: [3539]: info: unpack_graph: Unpacked transition 0: 2 actions in 2 synapses
May 28 15:57:56 an-node01 crmd: [3539]: info: do_te_invoke: Processing graph 0 (ref=pe_calc-dc-1306612676-9) derived from /var/lib/pengine/pe-input-16.bz2
May 28 15:57:56 an-node01 crmd: [3539]: info: te_rsc_command: Initiating action 3: probe_complete probe_complete on an-node01.alteeve.com (local) - no waiting
May 28 15:57:56 an-node01 attrd: [3537]: info: find_hash_entry: Creating hash entry for probe_complete
May 28 15:57:56 an-node01 attrd: [3537]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
May 28 15:57:56 an-node01 crmd: [3539]: info: te_rsc_command: Initiating action 2: probe_complete probe_complete on an-node02.alteeve.com - no waiting
May 28 15:57:56 an-node01 crmd: [3539]: info: run_graph: ====================================================
May 28 15:57:56 an-node01 crmd: [3539]: notice: run_graph: Transition 0 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-16.bz2): Complete
May 28 15:57:56 an-node01 attrd: [3537]: info: attrd_perform_update: Sent update 10: probe_complete=true
May 28 15:57:56 an-node01 crmd: [3539]: info: te_graph_trigger: Transition 0 is now complete
May 28 15:57:56 an-node01 crmd: [3539]: info: notify_crmd: Transition 0 status: done - <null>
May 28 15:57:56 an-node01 crmd: [3539]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
May 28 15:57:56 an-node01 crmd: [3539]: info: do_state_transition: Starting PEngine Recheck Timer
Check that the process is running:
ps axf
PID TTY STAT TIME COMMAND
...
3019 ? Ssl 0:07 corosync
3530 pts/0 S 0:00 pacemakerd
3534 ? Ss 0:00 \_ /usr/lib64/heartbeat/stonithd
3535 ? Ss 0:00 \_ /usr/lib64/heartbeat/cib
3536 ? Ss 0:00 \_ /usr/lib64/heartbeat/lrmd
3537 ? Ss 0:00 \_ /usr/lib64/heartbeat/attrd
3538 ? Ss 0:00 \_ /usr/lib64/heartbeat/pengine
3539 ? Ss 0:00 \_ /usr/lib64/heartbeat/crmd
Check that there were no errors:
grep ERROR: /var/log/messages | grep -v unpack_resources
# If anything is returned, address it, clear /var/log/messages and try starting pacemaker again. Repeat until no errors are returned.
Tools
The core tool is crm, the "cluster resource manager" shell. Run by itself it starts an interactive shell; it can also be given commands as arguments, read commands from a file, or have commands redirected in on STDIN.
The main tool for monitoring the cluster is crm_mon, which is a variant of crm status.
Pacemaker tools accept --help for the main usage information. The same help is available via man pages.
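For example, the following are equivalent ways of asking for the cluster status. This is only a sketch: the temporary file name is arbitrary, and the -f option is assumed to be available in this version of the shell.
# Interactively
crm
crm(live)# status
# As a single command line
crm status
# From a file of commands, or via redirected STDIN
echo status > /tmp/crm-commands
crm -f /tmp/crm-commands
crm < /tmp/crm-commands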
Check the cluster's status:
crm_mon -1
============
Last updated: Sun May 29 12:26:26 2011
Stack: openais
Current DC: an-node01.alteeve.com - partition with quorum
Version: 1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
0 Resources configured.
============
Online: [ an-node02.alteeve.com an-node01.alteeve.com ]
Building an Active/Passive Cluster
Show the current cluster config:
crm configure show
node an-node01.alteeve.com
node an-node02.alteeve.com
property $id="cib-bootstrap-options" \
dc-version="1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
cluster-infrastructure="openais" \
expected-quorum-votes="2"
The same, but in all its XML glory (wrapped manually for readability):
crm configure show xml
<?xml version="1.0" ?>
<cib admin_epoch="0" cib-last-written="Sat May 28 16:36:06 2011" crm_feature_set="3.0.5"
dc-uuid="an-node01.alteeve.com" epoch="5" have-quorum="1" num_updates="21"
validate-with="pacemaker-1.2">
<configuration>
<crm_config>
<cluster_property_set id="cib-bootstrap-options">
<nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
value="1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f"/>
<nvpair id="cib-bootstrap-options-cluster-infrastructure"
name="cluster-infrastructure" value="openais"/>
<nvpair id="cib-bootstrap-options-expected-quorum-votes"
name="expected-quorum-votes" value="2"/>
</cluster_property_set>
</crm_config>
<nodes>
<node id="an-node02.alteeve.com" type="normal" uname="an-node02.alteeve.com"/>
<node id="an-node01.alteeve.com" type="normal" uname="an-node01.alteeve.com"/>
</nodes>
<resources/>
<constraints/>
</configuration>
</cib>
Confirm the configuration:
crm_verify -L
crm_verify[2709]: 2011/05/29_13:13:33 ERROR: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
crm_verify[2709]: 2011/05/29_13:13:33 ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
crm_verify[2709]: 2011/05/29_13:13:33 ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
-V may provide more details
We see that there are errors because we have not yet configured STONITH, a.k.a. fencing. Let's fix that now.
Configuring STONITH
STONITH is another name for fencing, and is an acronym for "Shoot The Other Node In The Head".
We will use the Node Assassin fence device here. Please set it up as per the Node Assassin page's instructions and then return here.
Alternatively, please configure IPMI, iLO or your preferred fence device and then return here.
Now, RHEL does not support the heartbeat stonith agents, so we will need to configure fencing using the RHCS fence agents (fence_*).
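Before handing the fence device to Pacemaker, it is worth confirming the agent can talk to it from the command line. A sketch using the same APC PDU address, credentials and outlet numbers as the configuration below, and the standard fence agent short options; adjust for your own device:
# Query the state of outlet 1 (an-node01) on the APC PDU
fence_apc -a 192.168.1.6 -l apc -p secret -n 1 -o status
# A full fence test; this really will power-cycle the node on outlet 2!
fence_apc -a 192.168.1.6 -l apc -p secret -n 2 -o reboot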
ToDo: Submit the cib commit stonith argument to the docs.
crm
crm(live)# cib new stonith
INFO: stonith shadow CIB created
crm(stonith)# configure
crm(stonith)configure# primitive apc-fencing stonith:fence_apc \
params \
pcmk_host_map="an-node01.alteeve.com:1 an-node02.alteeve.com:2" \
pcmk_host_list="an-node01.alteeve.com an-node02.alteeve.com" \
pcmk_host_check="static-list" \
ipaddr="192.168.1.6" \
action="reboot" \
login="apc" \
passwd="secret" \
port="dynamic"
WARNING: apc-fencing: action start not advertised in meta-data, it may not be supported by the RA
WARNING: apc-fencing: action stop not advertised in meta-data, it may not be supported by the RA
crm(stonith)configure# clone Fencing apc-fencing
crm(stonith)configure# property stonith-enabled="true"
crm(stonith)configure# show
node an-node01.alteeve.com \
attributes standby="off"
node an-node02.alteeve.com \
attributes standby="off"
primitive ClusterIP ocf:heartbeat:IPaddr2 \
params ip="192.168.122.101" cidr_netmask="32" \
op monitor interval="30s" \
meta target-role="Stopped"
primitive apc-fencing stonith:fence_apc \
params pcmk_host_map="an-node01.alteeve.com:1,an-node02.alteeve.com:2" pcmk_host_list="an-node01.alteeve.com an-node02.alteeve.com" pcmk_host_check="static-list" ipaddr="192.168.1.6" action="reboot" login="apc" passwd="stone1983" port="TBA"
clone Fencing apc-fencing
property $id="cib-bootstrap-options" \
dc-version="1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="true"
crm(stonith)configure# cib commit stonith
INFO: commited 'stonith' shadow CIB to the cluster
cib use live
crm(live)# cib delete stonith
INFO: stonith shadow CIB deleted
crm(live)# configure property stonith-enabled="true"
crm(live)# configure show
node an-node01.alteeve.com \
attributes standby="off"
node an-node02.alteeve.com \
attributes standby="off"
primitive ClusterIP ocf:heartbeat:IPaddr2 \
params ip="192.168.122.101" cidr_netmask="32" \
op monitor interval="30s" \
meta target-role="Stopped"
primitive apc-fencing stonith:fence_apc \
params pcmk_host_map="an-node01.alteeve.com:1,an-node02.alteeve.com:2" pcmk_host_list="an-node01.alteeve.com an-node02.alteeve.com" pcmk_host_check="static-list" ipaddr="192.168.1.6" action="reboot" login="apc" passwd="stone1983" port="TBA"
property $id="cib-bootstrap-options" \
dc-version="1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="true"
Note: See rhbz#720214 for the port=TBA requirement, and rhbz#720218 for the two WARNING messages.
Notes:
- Check man stonithd for more information on stonith options.
- Use priority to control which fence devices are used in what order. The lower the number, the higher the priority.
- This is not yet supported.
- Currently, priority goes to the node with the most available fence devices, and then the devices are used in the order they are found in crm configure show.
- The node names in pcmk_host_map must match the name used by corosync or cman. Verify with crm_mon -p.
- The port=TBA prevents the shell from complaining when the primitive has been configured for multiple nodes.
- Other notes:
Look into CTS (cluster test suite), top level dir in pacemaker.
- man stonithd
- Q. Does 'priority = integer [0]' provide a mechanism for controlling the order in which fence devices are tried?
- A. Yes, but it is not yet implemented.
Configure An Active/Passive Cluster
CFS notes:
- http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch09s03.html;
- stonith -L should be stonith_admin -L? If so, this is returning something odd.
- No, it should be crm ra ....
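A sketch of the commands being referred to above; which of these are available depends on the Pacemaker version installed:
# List the stonith resource agents the crm shell knows about
crm ra list stonith
# List stonith devices currently registered with the running cluster
stonith_admin -L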
cmirror
cmirror - DOC-55285 - Jonathan Brassow - visegrips - #lvm
Notes
Reference Linbit Tech Guides when citing "Highly Available iSCSI storage with DRBD and Pacemaker".
How Pacemaker Stores and Validates cib.xml
Pacemaker stores its CIB in /var/lib/pacemaker/cib/. The CIB itself is cib.xml.
When Pacemaker starts, it checks that the cib.xml file is sane by comparing its digest against cib.xml.sig. You can manually check that the hash matches using cibadmin -5 -x cib.xml. If you edit cib.xml by hand, you will need to either delete the cib.xml.sig file completely, or replace its contents via cibadmin -5 -x cib.xml > cib.xml.sig.
Pacemaker keeps a cib-XX.raw and cib-XX.raw.sig, where XX is an integer, as backups of the cluster configuration.
The cib.last file contains the current CIB version (one value higher than the last backup).
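A sketch of the digest check described above. Note that the directory differs between Pacemaker versions; the logs earlier in this document show the CIB under /var/lib/heartbeat/crm/ instead.
cd /var/lib/pacemaker/cib
# Compute the digest of the current CIB and compare it to the stored signature
cibadmin -5 -x cib.xml
cat cib.xml.sig
# After hand-editing cib.xml, regenerate the signature
cibadmin -5 -x cib.xml > cib.xml.sig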
Adding Resources Started Outside of Pacemaker
11:22 < DanFrincu> in pacemaker if the resource is started outside the cluster and you want to add it
11:22 < DanFrincu> it's a two-step process
11:23 < DanFrincu> first add the resource to the cluster config via whatever means (I prefer, crm configure, then edit (that brings up the editor) add the resource, save and exit, verify, then apply cluster wide)
11:23 < DanFrincu> with the is-managed=false meta param
11:25 < DanFrincu> then add the target-role=Started and set the is-managed to true
11:25 < DanFrincu> pacemaker takes over, runs the probes on the resource, sees it matches in expected output, et voila
11:25 < DanFrincu> just FYI
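A sketch of that two-step process in crm shell terms; the resource name and IP address here are made up for illustration:
# 1. Add the already-running resource to the configuration, but unmanaged,
#    so Pacemaker records it without touching it.
crm configure primitive ExampleIP ocf:heartbeat:IPaddr2 \
        params ip="192.168.122.200" cidr_netmask="24" \
        op monitor interval="30s" \
        meta is-managed="false"
crm configure verify

# 2. Hand it over: Pacemaker probes the resource, sees it running as expected, and takes control.
crm resource meta ExampleIP set target-role Started
crm resource meta ExampleIP set is-managed true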