2x5 Scalable Cluster Tutorial


Overview

This paper has one goal;

  • Creating a 2-node, high-availability cluster hosting Xen virtual machines.

Technologies We Will Use

We will introduce and use the following technologies:

  • RHCS, Red Hat Cluster Services version 3, aka "Cluster 3", running on Red Hat Enterprise Linux 6.0 x86_64.
    • RHCS implements:
      • cman; The cluster manager.
      • corosync; The cluster engine, implementing the totem protocol and cpg and other core cluster services.
      • rgmanager; The resource group manager handles restoring and failing over services in the cluster, including our Xen VMs.
  • Fencing devices needed to keep a cluster safe.
    • Two fencing types are discussed;
      • IPMI; The most common fence method used in servers.
      • Node Assassin; A home-brew fence device ideal for learning or as a backup to IPMI.
  • Xen; The virtual server hypervisor.
    • Converting the host OS into the special access dom0 virtual machine.
    • Provisioning domU VMs.
  • Putting all cluster-related daemons under the control of rgmanager.
    • Making the VMs highly available.

Prerequisites

It is expected that you are already comfortable with the Linux command line, specifically bash, and that you are familiar with general administrative tasks in Red Hat based distributions. You will also need to be comfortable using editors like vim, nano, gedit, kate or similar. This paper uses vim in examples. Simply substitute your favourite editor in its place.

You are also expected to be comfortable with networking concepts. You will be expected to understand TCP/IP, multicast, broadcast, subnets and netmasks, routing and other relatively basic networking concepts. Please take the time to become familiar with these concepts before proceeding.

Where feasible, as much detail as is possible will be provided. For example, all configuration file locations will be shown and functioning sample files will be provided.

Platform

Red Hat Cluster Service version 3, also known as "Cluster Stable 3" or "RHCS 3", entered the server distribution world with the release of RHEL 6. It is used by downstream distributions like CentOS and Scientific Linux. This tutorial should be easily adapted to any Red Hat derivative distribution. It is expected that most users will have 64-bit CPUs, thus, we will use the x86_64 distribution and packages.

If you are on a 32-bit system, you should be able to follow along fine. Simply replace x86_64 with i386 or i686 in package names. Be aware though that issues arising from the need for PAE will not be discussed.

If you do not have a Red Hat Network account, you can download CentOS or another derivative of the same release, currently 6.0.

Note: When last checked, down-stream distributions have not yet been released. It is expected that they will be available around mid to late December.

Focus and Goal

Clusters can serve to solve three problems; Reliability, Performance and Scalability.

This paper will build a cluster designed to be more reliable, also known as a High-Availability cluster or simply HA Cluster. At the end of this paper, you should have a fully functioning two-node cluster capable of hosting "floating" virtual servers; that is, VMs that exist on one node and can be easily moved to the other node with minimal or no down time.

Base System Setup

This paper is admittedly long-winded. There is a "cheat-sheet" version planned, but it will be written only after this main tutorial is complete. Please be patient! Clustering is not inherently difficult, but there are a lot of pieces that need to work together for anything to work. Grab a coffee or tea and settle in.

Hardware

We will need two physical servers each with the following hardware:

  • One or more multi-core CPUs with Virtualization support.
  • Three network cards; At least one should be gigabit or faster.
  • One or more hard drives.
  • You will need some form of a fence device. This can be an IPMI-enabled server, a Node Assassin, a fenceable PDU or similar.

This paper uses the following hardware, and would suggest these be "minimum specifications":

  • ASUS M4A78L-M
  • AMD Athlon II x2 250
  • 2GB Kingston DDR2 KVR800D2N6K2/4G (4GB kit split between the two nodes)
  • 1x Intel 82540 PCI NIC
  • 1x D-Link DGE-560T
  • Node Assassin

This is not an endorsement of the above hardware. I bought what was within my budget that would serve the purposes of creating this document. What you purchase shouldn't matter, so long as the minimum requirements are met.

Note: I use three physical NICs, but you can get away with fewer by using VLANs or by simply re-using a given interface. Neither appealed to me given the minimal cost of add-in network cards and the relative complexity of VLANs. If you wish to alter your network setup, please do so.

Pre-Assembly Information

With multiple NICs, it is quite likely that the mapping of physical devices to logical ethX devices may not be ideal. This is a particular issue if you decide to network boot your install media.

There is no requirement, from a clustering point of view, that any given network card be mapped to any given ethX device. However, you will be jumping between servers fairly often and having various setups adds one more level of complexity. For this reason, I strongly recommend you follow this section.

Before you assemble your servers, record their network cards' MAC addresses. I like to keep simple text files like these:

cat an-node01.mac
90:E6:BA:71:82:EA	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
00:21:91:19:96:53	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter
00:0E:0C:59:46:E4	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller
cat an-node02.mac
90:E6:BA:71:82:D8	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
00:21:91:19:96:5A	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter
00:0E:0C:59:45:78	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller

This will prove very handy later.
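
If you didn't record the MAC addresses before assembly, you can still gather them from a running system. As a simple sketch using standard RHEL 6 tools:

# List all interfaces along with their hardware (MAC) addresses.
ifconfig -a | grep HWaddr

# Alternatively, read a given interface's MAC directly from sysfs.
cat /sys/class/net/eth0/address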

OS Install

There is no hard and fast rule on how you install the host operating systems. Ultimately, it's a question of what you prefer. There are some things that you should keep in mind though.

  • Balance the desire for tools against the reality that all programs have bugs.
    • Bugs could be exploited to gain access to your server. If the host is compromised, all of the virtual servers are compromised.
  • The host operating system, known as dom0 in Xen, should do nothing but run the hypervisor.
  • If you install a graphical interface, like Xorg and Gnome, consider disabling it.
    • This paper takes this approach and will cover disabling the graphical interface.

Below is the kickstart script used by the nodes for this paper. You should be able to adapt it easily to suit your needs. All options are documented.

Post OS Install

There are a handful of changes we will want to make now that the install is complete. Some of these are optional and you may skip them if you prefer. However, the remainder of this paper assumes these changes have been made. If you used the kickstart script, then some of these steps will have already been completed.

Disable selinux

Given the complexity of clustering, we will disable selinux so as not to add yet another layer of complexity. Obviously, this introduces security concerns that you may not be comfortable with.

To disable selinux, edit /etc/selinux/config and change SELINUX=enforcing to SELINUX=permissive. You will need to reboot in order for the changes to take effect, but don't do it yet as some changes to come may also need a reboot.
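
If you prefer to make the change from the command line, the sketch below does the same edit; setenforce 0 also relaxes enforcement for the current boot, though it is the config file change that makes it stick across reboots.

# Switch selinux from enforcing to permissive in the config file.
sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config

# Stop enforcing right away for the current boot.
setenforce 0

# Confirm; this should now report 'Permissive'.
getenforce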

Change the Default Run-Level

This is an optional step intended to improve performance.

If you don't plan to work on your nodes directly, it makes sense to switch the default run level from 5 to 3. This prevents the window manager, like Gnome or KDE, from starting at boot. This frees up a fair bit of memory and system resources and reduces the possible attack vectors.

To do this, edit /etc/inittab, change the id:5:initdefault: line to id:3:initdefault: and then switch to run level 3:

vim /etc/inittab
id:3:initdefault:
init 3

Make Boot Messages Visible

This is another optional step that disables the rhgb (Red Hat Graphical Boot) and quiet kernel arguments. These options provide the nice boot-time splash screen. I like to turn them off though as they also hide a lot of boot messages that can be helpful.

To make this change, edit the grub menu and remove the rhgb quiet arguments from the kernel /vmlinuz... line.

vim /boot/grub/menu.lst

Change:

title Red Hat Enterprise Linux (2.6.32-71.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-71.el6.x86_64 ro root=UUID=ef8ebd1b-8c5f-4bc8-b683-ead5f4603fec rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=auto rhgb quiet
        initrd /initramfs-2.6.32-71.el6.x86_64.img

To:

title Red Hat Enterprise Linux (2.6.32-71.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-71.el6.x86_64 ro root=UUID=ef8ebd1b-8c5f-4bc8-b683-ead5f4603fec rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=auto
        initrd /initramfs-2.6.32-71.el6.x86_64.img
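
If you'd rather not edit the file by hand, a one-liner like the following should do the same thing; review the file afterwards to be sure. Note that on some installs /boot/grub/menu.lst is a symlink to grub.conf.

# Strip the 'rhgb quiet' arguments from every kernel line.
sed -i 's/ rhgb quiet//' /boot/grub/menu.lst

# Verify the change.
grep kernel /boot/grub/menu.lst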

Setup Inter-Node Networking

This is the first stage of our network setup. Here we will walk through setting up the three networks between our two nodes. Later we will revisit networking to tie the virtual machines together.

Warning About Managed Switches

WARNING: Please pay attention to this warning! The vast majority of cluster problems end up being network related. The hardest ones to diagnose are usually multicast issues.

If you use a managed switch, be careful about enabling Multicast IGMP Snooping or Spanning Tree Protocol. They have been known to cause problems by not allowing multicast packets to reach all nodes. This can cause somewhat random break-downs in communication between your nodes, leading to seemingly random fences and DLM lock timeouts. If your switches support PIM Routing, be sure to use it!

If you have problems with your cluster not forming, or seemingly random fencing, try using a cheap unmanaged switch. If the problem goes away, you are most likely dealing with a managed switch configuration problem.

Network Layout

This setup expects you to have three physical network cards connected to three independent networks. Each network serves a purpose:

  • Network connected to the Internet and thus has untrusted traffic.
  • Storage network used for keeping data between the nodes in sync.
  • Back-channel network used for secure internode communication.

These are the networks and names that will be used in this tutorial. Please note that, inside VMs, device names will not match the list below. This table is valid for the operating systems running the hypervisors, known as dom0 in Xen or as the host in other virtualized environments.

  • Back-Channel Network (BCN); device eth0, suggested subnet 192.168.1.0/24.
    • NICs with IPMI piggy-back must be used here.
    • The second-fastest NIC should be used here.
    • If using a PXE server, this should be a bootable NIC.
  • Storage Network (SN); device eth1, suggested subnet 192.168.2.0/24.
    • The fastest NIC should be used here.
  • Internet-Facing Network (IFN); device eth2, suggested subnet 192.168.3.0/24.
    • The remaining NIC should be used here.

Take note of these concerns when planning which NIC to use on each subnet. These issues are presented in the order that they must be addressed in:

  1. If your nodes have IPMI piggy-backing on a normal NIC, that NIC must be used on BCN subnet. Having your fence device accessible on a subnet that can be remotely accessed can pose a major security risk.
  2. The fastest NIC should be used for your SN subnet. Be sure to know which NICs support the largest jumbo frames when considering this.
  3. If you still have two NICs to choose from, use the fastest remaining NIC for your BCN subnet. This will minimize the time it takes to perform tasks like hot-migration of live virtual machines.
  4. The final NIC should be used for the IFN subnet.

Node IP Addresses

Obviously, the IP addresses you give to your nodes should be ones that suit you best. In this example, the following IP addresses are used:

              Internet-Facing Network (IFN)   Storage Network (SN)   Back-Channel Network (BCN)
  an-node01   192.168.1.71                    192.168.2.71           192.168.3.71
  an-node02   192.168.1.72                    192.168.2.72           192.168.3.72

Disable The NetworkManager Daemon

Some cluster software will not start with NetworkManager running! This is because NetworkManager is designed to be a highly-adaptive, easy to use network configuration system that can adapt to frequent changes in a network. For workstations and laptops, this is wonderful. For clustering, this can be disastrous. We need to ensure that, once set, the network will not change.

Disable NetworkManager from starting with the system.

chkconfig NetworkManager off
chkconfig --list NetworkManager
NetworkManager 	0:off	1:off	2:off	3:off	4:off	5:off	6:off

The second command shows us that NetworkManager is now disabled in all run-levels.
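
You will likely also want to stop the running daemon now rather than waiting for the next reboot:

/etc/init.d/NetworkManager stop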

Enable the network Daemon

The first step is to map your physical interfaces to the desired ethX name. There is an existing tutorial that will show you how to do this.

There are a few ways to configure the network in RHEL 6:

  • system-config-network (graphical)
  • system-config-network-tui (ncurses)
  • Directly editing the /etc/sysconfig/network-scripts/ifcfg-eth* files.

If you decide that you want to hand-craft your network interfaces, take a look at the tutorial above. In it are example configuration files that are compatible with this tutorial. There are also links to documentation on what options are available in the network configuration files.
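
Whichever method you use, make sure the classic network service is enabled and that each interface has a static configuration. Below is a rough sketch of what an-node01's eth0 might look like, following the subnet plan in the table above; the HWADDR shown is the one recorded earlier and will of course differ on your hardware.

chkconfig network on

vim /etc/sysconfig/network-scripts/ifcfg-eth0
# Example only; adjust the device, MAC address and IP to your own plan.
DEVICE=eth0
HWADDR=90:E6:BA:71:82:EA
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.1.71
NETMASK=255.255.255.0

Once all three interfaces are configured, restart networking:

/etc/init.d/network restart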

WARNING: Do not proceed until your node's networking is fully configured! This may be a small sub-section, but it is critical that you have everything setup properly before going any further!

Update the Hosts File

Some applications expect to be able to call nodes by their name. To accommodate this, and to ensure that inter-node communication takes place on the back-channel subnet, we remove any existing hostname entries and then add the following to the /etc/hosts file:

Note: Any pre-existing entries matching the name returned by uname -n must be removed from /etc/hosts. There is a good chance there will be an entry that resolves to 127.0.0.1 which would cause problems later.

Obviously, adapt the names and IPs to match your nodes and subnets. The only critical thing is to make sure that the name returned by uname -n is resolvable to the back-channel subnet. I like to add entries for all networks, but this is optional.

The updated /etc/hosts file should look something like this:

vim /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

# Internet Facing Network
192.168.1.71    an-node01 an-node01.alteeve.com an-node01.ifn
192.168.1.72    an-node02 an-node02.alteeve.com an-node02.ifn

# Storage Network
192.168.2.71    an-node01.sn
192.168.2.72    an-node02.sn

# Back Channel Network
192.168.3.71    an-node01.bcn
192.168.3.72    an-node02.bcn

# Node Assassins
192.168.3.61    fence_na01 fence_na01.alteeve.com
192.168.3.62    motoko motoko.alteeve.com

Now to test this, ping both nodes by their name, as returned by uname -n, and make sure the ping packets are sent on the back channel network (192.168.1.0/24).

ping -c 5 an-node01.alteeve.com
PING an-node01 (192.168.1.71) 56(84) bytes of data.
64 bytes from an-node01 (192.168.1.71): icmp_seq=1 ttl=64 time=0.399 ms
64 bytes from an-node01 (192.168.1.71): icmp_seq=2 ttl=64 time=0.403 ms
64 bytes from an-node01 (192.168.1.71): icmp_seq=3 ttl=64 time=0.413 ms
64 bytes from an-node01 (192.168.1.71): icmp_seq=4 ttl=64 time=0.365 ms
64 bytes from an-node01 (192.168.1.71): icmp_seq=5 ttl=64 time=0.428 ms

--- an-node01 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4001ms
rtt min/avg/max/mdev = 0.365/0.401/0.428/0.030 ms
ping -c 5 an-node02.alteeve.com
PING an-node02 (192.168.1.72) 56(84) bytes of data.
64 bytes from an-node02 (192.168.1.72): icmp_seq=1 ttl=64 time=0.419 ms
64 bytes from an-node02 (192.168.1.72): icmp_seq=2 ttl=64 time=0.405 ms
64 bytes from an-node02 (192.168.1.72): icmp_seq=3 ttl=64 time=0.416 ms
64 bytes from an-node02 (192.168.1.72): icmp_seq=4 ttl=64 time=0.373 ms
64 bytes from an-node02 (192.168.1.72): icmp_seq=5 ttl=64 time=0.396 ms

--- an-node02 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4001ms
rtt min/avg/max/mdev = 0.373/0.401/0.419/0.030 ms

If you did name your other nodes in /etc/hosts, now is a good time to make sure that everything is working by pinging each interface by name and also pinging the fence devices.

From an-node01

ping -c 5 an-node02
ping -c 5 an-node02.ifn
ping -c 5 an-node02.sn
ping -c 5 an-node02.bcn
ping -c 5 fence_na01
ping -c 5 fence_na01.alteeve.com
ping -c 5 motoko
ping -c 5 motoko.alteeve.com

Then repeat the set of pings from an-node02 to the an-node01 networks and the fence devices.

From an-node02

ping -c 5 an-node01
ping -c 5 an-node01.ifn
ping -c 5 an-node01.sn
ping -c 5 an-node01.bcn
ping -c 5 fence_na01
ping -c 5 fence_na01.alteeve.com
ping -c 5 motoko
ping -c 5 motoko.alteeve.com

If your fence device is referenced by name, be sure to include entries to resolve it as well. You can see how I've done this with the two Node Assassin devices I use. The same applies to IPMI or other devices, if you plan to reference them by name.

Fencing will be discussed in more detail later on in this HowTo.

Disable Firewalls

In the spirit of keeping things simple, and understanding that this is a test cluster, we will flush netfilter tables and disable iptables and ip6tables from starting on our nodes.

chkconfig --level 2345 iptables off
/etc/init.d/iptables stop
chkconfig --level 2345 ip6tables off
/etc/init.d/ip6tables stop

What I like to do in production clusters is disable the IP address on the internet-facing interfaces on the dom0 machines. The only real connection to the interface is inside a VM designed to be a firewall running Shorewall. That VM will have two virtual interfaces connected to eth0 and eth2. With that VM in place, and with all other VMs only having a virtual interface connected to eth0, all Internet traffic is forced through the one firewall VM.

When you are finished building your cluster, you may want to check out the Shorewall tutorial below.

Setup SSH Shared Keys

This is an optional step. Setting up shared keys will allow your nodes to pass files between one another and execute commands remotely without needing to enter a password. This is obviously somewhat risky from a security point of view. As such, it is up to you whether you do this or not. Keep in mind, this tutorial assumes that you are building a test cluster, so there is a focus on ease of use.

If you're a little new to SSH, it can be a bit confusing keeping connections straight in your head. When you connect to a remote machine, you start the connection on your machine as the user you are logged in as. This is the source user. When you call the remote machine, you tell it what user you want to log in as. This is the remote user.

You will need to create an SSH key for each source user, and then you will need to copy its newly generated public key to each remote user's home directory that you want to connect to. In this example, we want to connect to either node, from either node, as the root user. So we will create a key for each node's root user and then copy the generated public key to the other node's root user's directory.

Here, simply, is what we will do.

  • Log in to an-node01 as root
    • Generate an ssh key
    • Copy the contents from /root/.ssh/id_rsa.pub
  • Log in to an-node02 as root
    • Edit the file /root/.ssh/authorized_keys
    • Paste in the contents of root@an-node01's public key.
    • Generate an ssh key
    • Copy the contents from /root/.ssh/id_rsa.pub
  • Log back in to an-node01 as root
    • Edit the file /root/.ssh/authorized_keys
    • Paste in the contents of root@an-node02's public key.

Here are the detailed steps.

For each user, on each machine you want to connect from, run:

# The '2047' is just to screw with brute-force attempts a bit.
ssh-keygen -t rsa -N "" -b 2047 -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
08:d8:ed:72:38:61:c5:0e:cf:bf:dc:28:e5:3c:a7:88 root@an-node01.alteeve.com
The key's randomart image is:
+--[ RSA 2047]----+
|     ..          |
|   o.o.          |
|  . ==.          |
|   . =+.         |
|    + +.S        |
|     +  o        |
|       = +       |
|     ...B o      |
|    E ...+       |
+-----------------+

This will create two files: the private key called ~/.ssh/id_rsa and the public key called ~/.ssh/id_rsa.pub. The private key must never be group or world readable! That is, it should be set to mode 0600.
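
ssh-keygen normally sets this mode for you, but it doesn't hurt to check:

# The private key must be readable only by its owner.
chmod 600 ~/.ssh/id_rsa
ls -l ~/.ssh/id_rsa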

The two files should look like:

Private key:

cat ~/.ssh/id_rsa
-----BEGIN RSA PRIVATE KEY-----
MIIEoQIBAAKCAQBlL42DC+NJVpJ0rdrWQ1rxGEbPrDoe8j8+RQx3QYiB014R7jY5
EaTenThxG/cudgbLluFxq6Merfl9Tq2It3k9Koq9nV9ZC/vXBcl4MC7pGSQaUw2h
DVwI7OCSWtnS+awR/1d93tANXRwy7K5ic1pcviJeN66dPuuPqJEF/SKE7yEBapMq
sN28G4IiLdimsV+UYXPQLOiMy5stmyGhFhQGH3kxYzJPOgiwZEFPZyXinGVoV+qa
9ERSjSKAL+g21zbYB/XFK9jLNSJqDIPa//wz0T+73agZ0zNlxygmXcJvapEsFGDG
O6tcy/3XlatSxjEZvvfdOnC310gJVp0bcyWDAgMBAAECggEAMZd0y91vr+n2Laln
r8ujLravPekzMyeXR3Wf/nLn7HkjibYubRnwrApyNz11kBfYjL+ODqAIemjZ9kgx
VOhXS1smVHhk2se8zk3PyFAVLblcsGo0K9LYYKd4CULtrzEe3FNBFje10FbqEytc
7HOMvheR0IuJ0Reda/M54K2H1Y6VemtMbT+aTcgxOSOgflkjCTAeeOajqP5r0TRg
1tY6/k46hLiBka9Oaj+QHHoWp+aQkb+ReHUBcUihnz3jcw2u8HYrQIO4+v4Ud2kr
C9QHPW907ykQTMAzhMvZ3DIOcqTzA0r857ps6FANTM87tqpse5h2KfdIjc0Ok/AY
eKgYAQKBgQDm/P0RygIJl6szVhOb5EsQU0sBUoMT3oZKmPcjHSsyVFPuEDoq1FG7
uZYMESkVVSYKvv5hTkRuVOqNE/EKtk5bwu4mM0S3qJo99cLREKB6zNdBp9z2ACDn
0XIIFIalXAPwYpoFYi1YfG8tFfSDvinLI6JLDT003N47qW1cC5rmgQKBgHAkbfX9
8u3LiT8JqCf1I+xoBTwH64grq/7HQ+PmwRqId+HyyDCm9Y/mkAW1hYQB+cL4y3OO
kGL60CZJ4eFiTYrSfmVa0lTbAlEfcORK/HXZkLRRW03iuwdAbZ7DIMzTvY2HgFlU
L1CfemtmzEC4E6t5/nA4Ytk9kPSlzbzxfXIDAoGAY/WtaqpZ0V7iRpgEal0UIt94
wPy9HrcYtGWX5Yk07VXS8F3zXh99s1hv148BkWrEyLe4i9F8CacTzbOIh1M3e7xS
pRNgtH3xKckV4rVoTVwh9xa2p3qMwuU/jMGdNygnyDpTXusKppVK417x7qU3nuIv
1HzJNPwz6+u5GLEo+oECgYAs++AEKj81dkzytXv3s1UasstOvlqTv/j5dZNdKyZQ
72cvgsUdBwxAEhu5vov1XRmERWrPSuPOYI/4m/B5CYbTZgZ/v8PZeBTg17zgRtgo
qgJq4qu+fXHKweR3KAzTPSivSiiJLMTiEWb5CD5sw6pYQdJ3z5aPUCwChzQVU8Wf
YwKBgQCvoYG7gwx/KGn5zm5tDpeWb3GBJdCeZDaj1ulcnHR0wcuBlxkw/TcIadZ3
kqIHlkjll5qk5EiNGNlnpHjEU9X67OKk211QDiNkg3KAIDMKBltE2AHe8DhFsV8a
Mc/t6vHYZ632hZ7b0WNuudB4GHJShOumXD+NfJgzxqKJyfGkpQ==
-----END RSA PRIVATE KEY-----

Public key:

cat ~/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAGUvjYML40lWknSt2tZDWvEYRs+sOh7yPz5FDHdBiIHTXhHuNjkRpN6dOHEb9y52BsuW4XGrox6t+X1OrYi3eT0qir2dX1kL+9cFyXgwLukZJBpTDaENXAjs4JJa2dL5rBH/V33e0A1dHDLsrmJzWly+Il43rp0+64+okQX9IoTvIQFqkyqw3bwbgiIt2KaxX5Rhc9As6IzLmy2bIaEWFAYfeTFjMk86CLBkQU9nJeKcZWhX6pr0RFKNIoAv6DbXNtgH9cUr2Ms1ImoMg9r//DPRP7vdqBnTM2XHKCZdwm9qkSwUYMY7q1zL/deVq1LGMRm+9906cLfXSAlWnRtzJYM= root@an-node01.alteeve.com

Copy the public key and then ssh normally into the remote machine as the root user. Create a file called ~/.ssh/authorized_keys and paste in the key.

From an-node01, type:

ssh root@an-node02
The authenticity of host 'an-node02 (192.168.1.72)' can't be established.
RSA key fingerprint is d4:1b:68:5f:fa:ef:0f:0c:16:e7:f9:0a:8d:69:3e:c5.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'an-node02,192.168.1.72' (RSA) to the list of known hosts.
root@an-node02's password: 
Last login: Fri Oct  1 20:07:01 2010 from 192.168.1.102

You will now be logged into an-node02 as the root user. Create the ~/.ssh/authorized_keys file and paste into it the public key from an-node01. If the remote machine's user hasn't used ssh yet, their ~/.ssh directory will not exist, so you will need to create it first.

cat ~/.ssh/authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAGUvjYML40lWknSt2tZDWvEYRs+sOh7yPz5FDHdBiIHTXhHuNjkRpN6dOHEb9y52BsuW4XGrox6t+X1OrYi3eT0qir2dX1kL+9cFyXgwLukZJBpTDaENXAjs4JJa2dL5rBH/V33e0A1dHDLsrmJzWly+Il43rp0+64+okQX9IoTvIQFqkyqw3bwbgiIt2KaxX5Rhc9As6IzLmy2bIaEWFAYfeTFjMk86CLBkQU9nJeKcZWhX6pr0RFKNIoAv6DbXNtgH9cUr2Ms1ImoMg9r//DPRP7vdqBnTM2XHKCZdwm9qkSwUYMY7q1zL/deVq1LGMRm+9906cLfXSAlWnRtzJYM= root@an-node01.alteeve.com

Now log out and then log back into the remote machine. This time, the connection should succeed without you being asked for a password!
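
A quick way to confirm the keys are working is to run a remote command; if this prints the remote hostname without prompting for a password, you are set. If you are still prompted, the usual culprit is permissions on the remote ~/.ssh directory or authorized_keys file, which sshd insists be restrictive.

# From an-node01; this should return the remote hostname with no password prompt.
ssh root@an-node02 "uname -n"

# If you were prompted for a password, tighten the permissions on the remote node and retry.
ssh root@an-node02 "chmod 700 ~/.ssh; chmod 600 ~/.ssh/authorized_keys"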

Cluster Setup, Part 1

There will be two stages to setting up the cluster, with the storage and virtualization work done in between:

  • Part 1; Setting up the core cluster.
  • In between; Setting up DRBD, Clustered LVM, GFS2 and Xen, then provisioning a virtual server.
  • Part 2; Adding the DRBD and Xen domU VMs to the cluster.

A Word On Complexity

Clustering is not inherently hard, but it is inherently complex. Consider;

  • Any given program has N bugs.
    • RHCS uses; cman, corosync, totem, fenced, rgmanager, dlm, qdisk and GFS2.
    • We will be adding DRBD, CLVM and Xen.
    • Right there, we have N^11 possible bugs. We'll call this A.
  • A cluster has Y nodes.
    • In our case, 2 nodes, each with 3 networks.
    • The network infrastructure (Switches, routers, etc). If you use managed switches, add another layer of complexity.
    • This gives us another Y^(2*3), and then ^2 again for managed switches. We'll call this B.
  • Let's add the human factor. Let's say that a person needs roughly 5 years of cluster experience to be considered an expert. For each year less than this, add a Z "oops" factor, (5-Z)^2. We'll call this C.
  • So, finally, add up the complexity, using this tutorial's layout, 0-years of experience and managed switches.
    • (N^11) * (Y^(2*3)^2) * ((5-0)^2) == (A * B * C) == an-unknown-but-big-number.

This isn't meant to scare you away, but it is meant to be a sobering statement. Obviously, those numbers are somewhat artificial, but the point remains.

Any one piece is easy to understand, thus, clustering is inherently easy. However, given the large number of variables, you must really understand all the pieces and how they work together. DO NOT think that you will have this mastered and working in a month. Certainly don't try to sell clusters as a service without a lot of internal testing.

Clustering is kind of like chess. The rules are pretty straightforward, but the complexity can take some time to master.

An Overview Before We Begin

When looking at a cluster, there is a tendency to want to dive right into the configuration file. That is not very useful in clustering.

  • When you look at the configuration file, it is quite short.

It isn't like most applications or technologies though. Most of us learn by taking something, like a configuration file, and tweaking it this way and that to see what happens. I tried that with clustering and learned only what it was like to bang my head against the wall.

  • Understanding the parts and how they work together is critical.

You will find that the discussion on the components of clustering, and how those components and concepts interact, will be much longer than the initial configuration. It is true that we could talk very briefly about the actual syntax, but it would be a disservice. Please, don't rush through the next section or, worse, skip it and go right to the configuration. You will waste far more time than you will save.

  • Clustering is easy, but it has a complex web of inter-connectivity. You must grasp this network if you want to be an effective cluster administrator!

Component; cman

This was, traditionally, the cluster manager. In the 3.0 series, it acts mainly as a service manager, handling the starting and stopping of clustered services. In the 3.1 series, cman will be removed entirely.

Component; corosync

Corosync is the heart of the cluster. All other cluster components operate through it, and no cluster component can work without it. Further, it is shared between both Pacemaker and RHCS clusters.

In Red Hat clusters, corosync is configured via the central cluster.conf file. In Pacemaker clusters, it is configured directly in corosync.conf. As we will be building an RHCS, we will only use cluster.conf. That said, (almost?) all corosync.conf options are available in cluster.conf. This is important to note as you will see references to both configuration files when searching the Internet.

Concept; quorum

Quorum is defined as a collection of machines and devices in a cluster with a clear majority of votes.

The idea behind quorum is that whichever group of machines has it can safely start clustered services, even when other defined members are not accessible.

Take this scenario;

  • You have a cluster of four nodes, each with one vote.
    • The cluster's expected_votes is 4. A clear majority, in this case, is 3 because (4/2)+1 = 3.
    • Now imagine that there is a failure in the network equipment and one of the nodes disconnects from the rest of the cluster.
    • You now have two partitions; One partition contains three machines and the other partition has one.
    • The three machines will have quorum, and the other machine will lose quorum.
    • The partition with quorum will reconfigure and continue to provide cluster services.
    • The partition without quorum will withdraw from the cluster and shut down all cluster services.

This behaviour acts as a guarantee that the two partitions will never try to access the same clustered resources, like a shared filesystem, thus guaranteeing the safety of those shared resources.

This also helps explain why an even 50% is not enough to have quorum, a common question for people new to clustering. Using the above scenario, imagine if the split were 2 nodes and 2 nodes. Because neither side can be sure what the other would do, neither can safely proceed. If we allowed an even 50% to have quorum, both partitions might try to take over the clustered services and disaster would soon follow.

There is one, and only one, exception to this rule.

In the case of a two node cluster, as we will be building here, any failure results in a 50/50 split. If we enforced quorum in a two-node cluster, there would never be high availability because any failure would cause both nodes to withdraw. The risk with this exception is that we now place the entire safety of the cluster on fencing, a concept we will cover in a moment. Fencing is a second line of defense and something we are loath to rely on alone.

Even in a two-node cluster though, proper quorum can be maintained by using a quorum disk, called a qdisk. This is another topic we will touch on in a moment. This tutorial will implement a qdisk specifically so that we can get away from this two_node exception.
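
For reference, this is roughly how the two_node exception is expressed in cluster.conf when it is used; a sketch only, as we will be avoiding it in favour of a qdisk:

<cman two_node="1" expected_votes="1"/>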

Concept; Virtual Synchrony

All cluster operations have to occur in the same order across all nodes. This concept is called "virtual synchrony", and it is provided by corosync using "closed process groups", CPG.

Let's look at how locks are handled on clustered file systems as an example.

  • As various nodes want to work on files, they send a lock request to the cluster. When they are done, they send a lock release to the cluster.
    • Lock and unlock messages must arrive in the same order to all nodes, regardless of the real chronological order that they were issued.
  • Let's say one node sends out messages "a1 a2 a3 a4". Meanwhile, the other node sends out "b1 b2 b3 b4".
    • All of these messages go to corosync, which gathers them up and sorts them.
    • It is totally possible that corosync will get the messages as "a2 b1 b2 a1 b3 a3 a4 b4".
    • The corosync application will then ensure that all nodes get the messages in the above order, one at a time. All nodes must confirm that they got a given message before the next message is sent to any node.

This will tie into fencing and totem, as we'll see in the next sections.

Concept; Fencing

Fencing is an absolutely critical part of clustering. Without fully working fence devices, your cluster will fail.

Was that strong enough, or should I say that again? Let's be safe:

DO NOT BUILD A CLUSTER WITHOUT PROPER, WORKING AND TESTED FENCING.

Sorry, I promise that this will be the only time that I speak so strongly. Fencing really is critical, and explaining the need for fencing is nearly a weekly event. So then, let's discuss fencing.

When a node stops responding, an internal timeout and counter start ticking away. During this time, no messages are moving through the cluster and the cluster is, essentially, hung. If the node responds in time, the timeout and counter reset and the cluster begins operating properly again.

If, on the other hand, the node does not respond in time, the node will be declared dead. The cluster will take a "head count" to see which nodes it still has contact with and will determine then if there are enough to have quorum. If so, the cluster will issue a "fence" against the silent node. This is a call to a program called fenced, the fence daemon.

The fence daemon will look at the cluster configuration and get the fence devices configured for the dead node. Then, one at a time and in the order that they appear in the configuration, the fence daemon will call those fence devices, via their fence agents, passing to the fence agent any configured arguments like username, password, port number and so on. If the first fence agent returns a failure, the next fence agent will be called. If the second fails, the third will be called, then the fourth and so on. Once the last (or perhaps only) fence device fails, the fence daemon will retry again, starting back at the start of the list. It will do this indefinitely until one of the fence devices succeeds.

Here's the flow, in point form:

  • The corosync program collects messages and sends them off, one at a time, to all nodes.
  • All nodes respond, and the next message is sent. Repeat continuously during normal operation.
  • Suddenly, one node stops responding.
    • Communication freezes while the cluster waits for the silent node.
    • A timeout starts (300ms by default), and each time the timeout is hit, an error counter increments.
    • The silent node responds before the counter reaches the limit.
      • The counter is reset to 0
      • The cluster operates normally again.
  • Again, one node stops responding.
    • Again, the timeout begins and the error count increments each time the timeout is reached.
    • This time the error count exceeds the limit (10 is the default); three seconds have passed (300ms * 10).
    • The node is declared dead.
    • The cluster checks which members it still has, and if that provides enough votes for quorum.
      • If there are too few votes for quorum, the cluster software freezes and the node(s) withdraw from the cluster.
      • If there are enough votes for quorum, the silent node is declared dead.
        • corosync calls fenced, telling it to fence the node.
        • Which fence device(s) to use, that is, what fence_agent to call and what arguments to pass, is gathered.
        • For each configured fence device:
          • The agent is called and fenced waits for the fence_agent to exit.
          • The fence_agent's exit code is examined. If it's a success, recovery starts. If it failed, the next configured fence agent is called.
        • If all (or the only) configured fence devices fail, fenced will start over.
        • fenced will wait and loop forever until a fence agent succeeds. During this time, the cluster is hung.
    • Once a fence_agent succeeds, the cluster is reconfigured.
      • A new closed process group (cpg) is formed.
      • A new fence domain is formed.
      • Lost cluster resources are recovered as per rgmanager's configuration (including file system recovery as needed).
      • Normal cluster operation is restored.

This skipped a few key things, but the general flow of logic should be there.

This is why fencing is so important. Without a properly configured and tested fence device or devices, the cluster will never successfully fence and the cluster will stay hung forever.

Component; totem

The totem protocol defines message passing within the cluster, and it is used by corosync. A token is passed around all the nodes in the cluster, and the timeout discussed in fencing above is actually a token timeout. The counter, then, is the number of lost tokens that are allowed before a node is considered dead.

The totem protocol supports something called 'rrp', Redundant Ring Protocol. Through rrp, you can add a second backup ring on a separate network to take over in the event of a failure in the first ring. In RHCS, these rings are known as "ring 0" and "ring 1".

Component; rgmanager

When the cluster configuration changes, corosync calls rgmanager, the resource group manager. It will examine what changed and then will start, stop, migrate or recover cluster resources as needed.

Component; qdisk

If you have a cluster of 2 to 16 nodes, you can use a quorum disk. This is a small partition on a shared storage device, like a SAN or DRBD device, that the cluster can use to make much better decisions about which nodes should have quorum when a split in the network happens.

The way a qdisk works, at its most basic, is to have one or more votes in quorum. Generally, but not necessarily always, the qdisk device has one vote less than the total number of nodes (N-1).

  • In a two node cluster, the qdisk would have one vote.
  • In a seven node cluster, the qdisk would have six votes.

Imagine these two scenarios; first without qdisk, then revisited to see how qdisk helps.

  • First Scenario; A two node cluster, which we will implement here.

If the network connection on the totem ring(s) breaks, you will enter into a dangerous state called a "split-brain". Normally, this can't happen because quorum can only be held by one side at a time. In a two_node cluster though, this is allowed.

Without a qdisk, either node could potentially start the cluster resources. This is a disastrous possibility and it is avoided by a fence duel. Both nodes will try to fence the other at the same time, but only the fastest one wins. The idea behind this is that one will always live because the other will die before it can get its fence call out. In theory, this works fine. In practice though, there are cases where fence calls can be "queued", thus, in fact, allowing both nodes to die. This defeats the whole "high availability" thing, now doesn't it? Also, this possibility is why the two_node option is the only exception to the quorum rules.

So how does a qdisk help?

Two ways!

First;

The biggest way it helps is by getting away from the two_node exception. With the qdisk partition, you are back up to three votes, so there will never be a 50/50 split. If either node retains access to the quorum disk while the other loses access, then right there things are decided. The one with the disk has 2 votes, wins quorum and will fence the other. Meanwhile, the other will only have 1 vote, thus it will lose quorum, withdraw from the cluster and not try to fence the other node.

Second;

You can use heuristics with qdisk to have a more intelligent partition recovery mechanism. For example, let's look again at the scenario where the link(s) between the two nodes hosting the totem ring is cut. This time though, let's assume that the storage network link is still up, so both nodes have access to the qdisk partition. How would the qdisk act as a tie breaker?

One way is to have a heuristic test that checks to see if one of the nodes has access to a particular router. With this heuristic test, if only one node had access to that router, the qdisk would give its vote to that node and ensure that the "healthiest" node survived. Pretty cool, eh?

  • Second Scenario; A seven node cluster with six dead members.

Admittedly, this is an extreme scenario, but it serves to illustrate the point well. Remember how we said that the general rule is that the qdisk has N-1 votes?

With our seven node cluster, on its own, there would be a total of 7 votes, so normally quorum would require that 4 nodes be alive (((7/2)+1) = (3.5+1) = 4.5, rounded down is 4). With the death of the fourth node, all cluster services would fail. We understand now why this would be the case, but what if the nodes are, for example, serving up websites? In this case, 3 nodes are still sufficient to do the job. Heck, even 1 node is better than nothing. With the rules of quorum though, it just wouldn't happen.

Let's now look at how the qdisk can help.

By giving the qdisk partition 6 votes, you raise the cluster's total expected votes from 7 to 13. With this new count, the number of votes needed for quorum is 7 (((13/2)+1) = (6.5+1) = 7.5, rounded down is 7).

So looking back at the scenario where we've lost four of our seven nodes; the surviving nodes have 3 votes, but they can talk to the qdisk which provides another 6 votes, for a total of 9. With that, quorum is achieved and the three nodes are allowed to form a cluster and continue to provide services. Even if you lose all but one node, you are still in business because the one surviving node, which is still able to talk to the qdisk and thus win its 6 votes, has a total of 7 and thus has quorum!

There is another benefit. As we mentioned in the first scenario, we can add heuristics to the qdisk. Imagine that, rather than having six nodes die, they instead partition off because of a break in the network. Without qdisk, the six nodes would easily win quorum, fence the one other node and then reform the cluster. What if, though, the one lone node was the only one with access to a critical route to the Internet? The six nodes would be useless in a web-server environment. With the heuristics provided by qdisk, that one useful node would get the qdisk's 6 votes and win quorum over the other six nodes!

A little qdisk goes a long way.
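
To make the heuristics idea concrete, here is a rough sketch of how a qdisk definition with a single ping-based heuristic can appear in cluster.conf. The label and the router address are placeholders; the real qdisk configuration is covered later in this tutorial.

<quorumd interval="1" tko="10" votes="1" label="an_qdisk">
	<heuristic program="ping -c 1 192.168.1.254" score="1" interval="2" tko="3"/>
</quorumd>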

Component; DRBD

DRBD, the Distributed Replicated Block Device, is a technology that takes raw storage from two or more nodes and keeps their data synchronized in real time. It is sometimes described as "RAID 1 over nodes", and that is conceptually accurate. In this tutorial's cluster, DRBD will be used to provide the back-end storage as a cost-effective alternative to a traditional SAN or iSCSI device.

To help visualize DRBD's use and role, look at the map of our cluster's storage.
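
To make the idea a bit more concrete, here is a minimal sketch of what a DRBD resource definition can look like, using the storage network addresses from this tutorial. The resource name, backing partition and TCP port are placeholders for illustration only; the actual DRBD configuration is built later on.

resource r0 {
	# The DRBD block device the cluster will see.
	device    /dev/drbd0;
	# The backing partition on each node; adjust to your own disk layout.
	disk      /dev/sda5;
	meta-disk internal;

	on an-node01.alteeve.com {
		address 192.168.2.71:7789;
	}
	on an-node02.alteeve.com {
		address 192.168.2.72:7789;
	}
}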

Component; CLVM

With DRBD providing the raw storage for the cluster, we must now create partitions. This is where Clustered LVM, known as CLVM, comes into play.

CLVM is ideal in that it understands that it is clustered and therefore won't provide access to nodes outside of the formed cluster; that is, to any node that is not a member of corosync's closed process group which, in turn, requires quorum.

It is ideal because it can take one or more raw devices, known as "physical volumes", or simply as PVs, and combine their raw space into one or more "volume groups", known as VGs. These volume groups then act just like a typical hard drive and can be "partitioned" into one or more "logical volumes", known as LVs. These LVs are what will be formatted with a clustered file system.

LVM is particularly attractive because of how incredibly flexible it is. We can easily add new physical volumes later, and then grow an existing volume group to use the new space. This new space can then be given to existing logical volumes, or entirely new logical volumes can be created. This can all be done while the cluster is online, offering an upgrade path with no down time.
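
As a rough sketch of the workflow, assuming clvmd is running and LVM's locking_type has been set to 3, turning a DRBD device into clustered storage looks something like this. The device, volume group and logical volume names are placeholders; the real steps come later.

# Mark the DRBD device as an LVM physical volume.
pvcreate /dev/drbd0

# Create a clustered ('-c y') volume group on it.
vgcreate -c y an-vg0 /dev/drbd0

# Carve out a 20 GiB logical volume for a shared file system.
lvcreate -L 20G -n shared01 an-vg0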

Component; GFS2

With DRBD providing the cluster's raw storage space, and Clustered LVM providing the logical partitions, we can now look at the clustered file system. This is the role of the Global File System version 2, known simply as GFS2.

It works much like a standard filesystem, with mkfs.gfs2, fsck.gfs2 and so on. The major difference is that it and clvmd use the cluster's distributed locking mechanism, provided by dlm_controld. Once formatted, the GFS2 partition can be mounted and used by any node in the cluster's closed process group. All nodes can then safely read from and write to the data on the partition simultaneously.
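
As a sketch of the syntax, formatting and mounting a GFS2 file system on a logical volume like the one sketched above might look like this. The '-t' argument must be the cluster name followed by a unique file system name, '-j' sets the number of journals (one per node that will mount it), and the logical volume path and mount point are placeholders.

mkfs.gfs2 -p lock_dlm -t an-cluster-01:shared01 -j 2 /dev/an-vg0/shared01
mkdir /shared01
mount /dev/an-vg0/shared01 /shared01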

Component; DLM

One of the major roles of a cluster is to provide distributed locking on clustered storage. In fact, storage software can not be clustered without using DLM, as provided by the dlm_controld daemon, using corosync's virtual synchrony.

Through DLM, all nodes accessing clustered storage are guaranteed to get POSIX locks, called plocks, in the same order across all nodes. Both CLVM and GFS2 rely on DLM, though other clustered storage, like OCFS2, use it as well.

Component; Xen

There are two major open-source virtualization platforms available in the Linux world today; Xen and KVM. The former is maintained by Citrix and the latter by Red Hat. It would be difficult to say which is "better", as they're both very good. Xen can be argued to be more mature, while KVM is the "official" solution supported by Red Hat directly.

We will be using the Xen hypervisor and a "host" virtual server called dom0. In Xen, every machine is a virtual server, including the system you installed when you built the server. This is possible thanks to a small Xen micro-operating system that initially boots, then starts up your original installed operating system as a virtual server with special access to the underlying hardware and hypervisor management tools.

The rest of the virtual servers in a Xen environment are collectively called "domU" virtual servers. These will be the highly-available resource that will migrate between nodes during failure events.

A Little History

In the RHCS version 2 days (RHEL 5.x and derivatives), there was a component called openais which handled totem. The OpenAIS project was designed to be the heart of the cluster and was based around the Service Availability Forum's Application Interface Specification. AIS is an open API designed to provide inter-operable high availability services.

In 2008, it was decided that the AIS specification was overkill for clustering and a duplication of effort in the existing and easier to maintain corosync project. OpenAIS was then split off as a separate project specifically designed to act as an optional add-on to corosync for users who wanted AIS functionality.

You will see a lot of references to OpenAIS while searching the web for information on clustering. Understanding its evolution will hopefully help you avoid confusion.

Finally; Begin Configuration

At the heart of Red Hat Cluster Services is the /etc/cluster/cluster.conf configuration file.

This is an XML configuration file that stores and controls all of the cluster configuration, including node setups, resource management, fault tolerances, fence devices and their use and so on.

The goal of this tutorial is to introduce you to clustering, so only four components will be shown here, configured in two stages:

  • Stage 1
    • Node definitions
    • Fence device setup
  • Stage 2
    • Quorum Disk setup
    • Resource Management

There is a tremendous number of options that allow for extremely fine-grained control of the cluster. To discuss all of the options would require a dedicated article and would distract quite a bit at this stage. There is an ongoing project to do just that, but it is not complete yet.

It is strongly advised that, after this tutorial, you take the time to review these options. For now though, let's keep it simple.

What Do We Need To Start?

  • First

You need a name for your cluster. It's important that it be unique if the cluster will be on a network with other clusters. This paper will use an-cluster-01.

  • Second

We need the hostname of each cluster node. You can get this by using the uname program.

uname -n
an-node01.alteeve.com

In the example above, the first node's hostname is an-node01.alteeve.com. This will be used shortly when defining the node.

  • Third

You need to know what fence devices you will be using and how to use them. Exactly how you do this will depend on what fence device you have available to you. There are many possibilities, but IPMI is probably the most common and Node Assassin is available to everyone, so those two will be discussed in detail shortly.

The main thing to know about your fence device(s) are:

  • What is the fence agent called? IPMI uses fence_ipmilan and Node Assassin uses fence_na.
  • Does one device support multiple nodes? If so, what port number is each node on?
  • What IP address or resolvable hostname is the device available at?
  • Does the device require a user name and password? If so, what are they?
  • What are the supported fence options and which do you want to use? The most common being reboot, but off is another popular option.
  • Does the fence device support or require other options? If so, what are they and what values do you want to use?
  • Summary

For this tutorial, we will have the following information now:

  • Cluster name: an-cluster-01
  • Node: an-node01
    • First fence device: IPMI
      • IPMI is per-node, so no port number is needed.
      • The IPMI interface is at 192.168.3.51.
      • Username is admin and the password is secret.
      • No special arguments are needed.
    • Second fence device: Node Assassin
      • Node Assassin supports four nodes, and an-node01 is connected to port 1.
      • The Node Assassin interface is at 192.168.3.61.
      • Username is admin and the password is sauce.
      • We want to use the quiet option (quiet="true").
  • Node: an-node02
    • First fence device: IPMI
      • IPMI is per-node, so no port number is needed.
      • The IPMI interface is at 192.168.3.52.
      • Username is admin and the password is secret.
      • No special arguments are needed.
    • Second fence device: Node Assassin
      • Node Assassin supports four nodes, and an-node02 is connected to port 2.
      • The Node Assassin interface is at 192.168.3.61.
      • Username is admin and the password is sauce.
      • We want to use the quiet option (quiet="true").

Note that:

  • IPMI is per-node, and thus has different IP addresses and no ports.
  • Node Assassin supports multiple nodes, so has a common IP address and different ports per node.

We now know enough to write the first version of the cluster.conf file!
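
To give a sense of where we are headed, the skeleton below shows roughly how the information above maps into cluster.conf. Treat it as an illustrative sketch only; the device names are made up, the exact attributes each fence agent expects are confirmed when we configure fencing, and the real file will be built and explained step by step in the sections that follow.

<?xml version="1.0"?>
<cluster name="an-cluster-01" config_version="1">
	<clusternodes>
		<clusternode name="an-node01.alteeve.com" nodeid="1">
			<fence>
				<method name="ipmi">
					<device name="ipmi_an01" action="reboot"/>
				</method>
				<method name="node_assassin">
					<device name="fence_na01" port="1" action="reboot"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node02.alteeve.com" nodeid="2">
			<fence>
				<method name="ipmi">
					<device name="ipmi_an02" action="reboot"/>
				</method>
				<method name="node_assassin">
					<device name="fence_na01" port="2" action="reboot"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="192.168.3.51" login="admin" passwd="secret"/>
		<fencedevice name="ipmi_an02" agent="fence_ipmilan" ipaddr="192.168.3.52" login="admin" passwd="secret"/>
		<fencedevice name="fence_na01" agent="fence_na" ipaddr="192.168.3.61" login="admin" passwd="sauce" quiet="true"/>
	</fencedevices>
</cluster>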

Install The Cluster Software

If you are using Red Hat Enterprise Linux, you will need to add the RHEL Server Optional (v. 6 64-bit x86_64) channel for each node in your cluster. You can do this in RHN by going to your subscription management page, clicking on each server, clicking on "Alter Channel Subscriptions", clicking to enable the RHEL Server Optional (v. 6 64-bit x86_64) channel and then clicking on "Change Subscription".

The actual installation is simple; just use yum to install cman.

yum install cman

This will pull in a good number of dependencies.

For now, we do not want the cluster manager to start on boot. We'll turn it off and disable the service until we are finished testing.

chkconfig cman off
/etc/init.d/cman stop
   Leaving fence domain...                                 [  OK  ]
   Stopping gfs_controld...                                [  OK  ]
   Stopping dlm_controld...                                [  OK  ]
   Stopping fenced...                                      [  OK  ]
   Stopping cman...                                        [  OK  ]
   Unloading kernel modules...                             [  OK  ]
   Unmounting configfs...                                  [  OK  ]

Configuring And Testing Fence Devices

Before we can look at the cluster configuration, we first need to make sure the fence devices are set up and working properly. We're going to set up IPMI on both nodes and a Node Assassin fence device. If you only use one, you can safely ignore the other for now. If you use a different fence device, please consult the manufacturer's documentation. Better yet, contribute the setup to this document!

Configure And Test IPMI

IPMI requires having a system board with an IPMI baseboard management controller, known as a BMC. IPMI is often the underlying technology used by many OEM-branded remote access tools. If you don't see a specific fence agent for your server's remote access application, experiment with generic IPMI tools.

Most manufacturers provide a method of configuring the BMC. Many provide a menu in the BIOS or at boot time. Modern IPMI-enabled systems offer a dedicated web interface that you can access using a browser. There is a third option as well, which we will show here, and that is by using a command line tool called, conveniently, ipmitool.

Configuring IPMI From The Command Line

To start, we need to install the IPMI user software.

ToDo: confirm this is valid for RHEL6

yum install ipmitool freeipmi freeipmi-bmc-watchdog freeipmi-ipmidetectd OpenIPMI

Once installed and the daemon has been started, you should be able to check the local IPMI BMC using ipmitool.

/etc/init.d/ipmi start
Starting ipmi drivers:                                     [  OK  ]
ipmitool chassis status
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : false
Power Control Fault  : false
Power Restore Policy : always-off
Last Power Event     : command
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false
Front Panel Control  : none

If you see something similar, you're up and running. You can now check the current configuration using the following command.

ipmitool -I open lan print 1
Set in Progress         : Set Complete
Auth Type Support       : NONE MD2 MD5 OEM 
Auth Type Enable        : Callback : NONE MD2 MD5 OEM 
                        : User     : NONE MD2 MD5 OEM 
                        : Operator : NONE MD2 MD5 OEM 
                        : Admin    : NONE MD2 MD5 OEM 
                        : OEM      : 
IP Address Source       : Static Address
IP Address              : 192.168.3.51
Subnet Mask             : 255.255.0.0
MAC Address             : 00:e0:81:aa:bb:cc
SNMP Community String   : AMI
IP Header               : TTL=0x00 Flags=0x00 Precedence=0x00 TOS=0x00
BMC ARP Control         : ARP Responses Disabled, Gratuitous ARP Disabled
Gratituous ARP Intrvl   : 0.0 seconds
Default Gateway IP      : 0.0.0.0
Default Gateway MAC     : 00:00:00:00:00:00
Backup Gateway IP       : 0.0.0.0
Backup Gateway MAC      : 00:00:00:00:00:00
802.1q VLAN ID          : Disabled
802.1q VLAN Priority    : 0
RMCP+ Cipher Suites     : 1,2,3,6,7,8,11,12,0,0,0,0,0,0,0,0
Cipher Suite Priv Max   : aaaaXXaaaXXaaXX
                        :     X=Cipher Suite Unused
                        :     c=CALLBACK
                        :     u=USER
                        :     o=OPERATOR
                        :     a=ADMIN
                        :     O=OEM

You can change the MAC address, but this isn't advised without a good reason to do so.

Below is an example set of commands that will configure the IPMI BMC, save the new settings and then check that the new settings took. Adapt the values to suit your environment and preferences.

# Don't change the MAC without a good reason. If you need to though, this should work.
#ipmitool -I open lan set 1 macaddr 00:e0:81:aa:bb:cd

# Set the IP to be static (instead of DHCP)
ipmitool -I open lan set 1 ipsrc static

# Set the IP, default gateway and subnet mask address of the IPMI interface.
ipmitool -I open lan set 1 ipaddr 192.168.3.51
ipmitool -I open lan set 1 defgw ipaddr 0.0.0.0
ipmitool -I open lan set 1 netmask 255.255.255.0

# Set the password.
ipmitool -I open lan set 1 password secret
ipmitool -I open user set password 2 secret

# Set the snmp community string, if appropriate
ipmitool -I open lan set 1 snmp alteeve

# Enable access
ipmitool -I open lan set 1 access on

# Reset the IPMI BMC to make sure the changes took effect.
ipmitool mc reset cold

# Wait a few seconds and then re-run the call that dumped the setup to ensure
# it is now what we want.
sleep 5
ipmitool -I open lan print 1

If all went well, you should see the same output as above, but now with your new configuration.

Testing IPMI

The ipmitool tool needs to be installed on the workstation that you want to run the tests from. You will certainly want to test fencing from each node against all other nodes! If you want to test from your personal computer though, be sure to install ipmitool beforehand. Note that the example below is for RPM based distributions. Please check your distribution for the availability of ipmitool if you can't use yum.

yum install ipmitool

The following commands only work against remote servers. You must use the example commands in the previous section when checking the local server.

The least invasive test is to simply check the remote machine's chassis power status. Until this check works, there is no sense in trying to actually reboot the remote servers.

Let's check an-node01 from an-node02. Note that here we use the IP address directly, but in practice I like to use a name that resolves to the IP address of the IPMI interface (denoted by a .ipmi suffix after the normal short hostname).

ipmitool -I lan -H 192.168.3.51 -U admin -P secret chassis power status
Chassis Power is on

Once this works, you can test a reboot, power off or power on event by replacing status with cycle, off and on, respectively. This is, in fact, what the fence_ipmilan fence agent does!
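
For example, using the same address and credentials as above, the power control variants look like this. Be warned that cycle and off will really take the target machine down, so only run them against a node you can safely bounce. The last call is an optional extra check that the fence_ipmilan agent itself can reach the BMC; its status action is non-destructive.

# Power control variants of the status check above. These WILL reboot or power
# off the target!
ipmitool -I lan -H 192.168.3.51 -U admin -P secret chassis power cycle
ipmitool -I lan -H 192.168.3.51 -U admin -P secret chassis power off
ipmitool -I lan -H 192.168.3.51 -U admin -P secret chassis power on

# Optional; confirm that the fence agent can also reach the BMC.
fence_ipmilan -a 192.168.3.51 -l admin -p secret -o status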

Lastly, make sure that the ipmi daemon starts with the server.

chkconfig ipmi on
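
If you would rather not wait for a reboot, you can also start the daemon right away and confirm the runlevels it is enabled in; this is just a quick optional check.

# Start the ipmi service now and verify that it will start on boot.
service ipmi start
chkconfig --list ipmi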

There are a few more options than those mentioned here, which you can read about in man ipmitool.

Configure And Test Node Assassin

The Node Assassin fence agent was added to RHCS's fence-agents package in version 3.0.16. RHEL 6.0 ships with version 3.0.12 though, so we'll need to install it manually first.

cd ~
wget -c http://nodeassassin.org/files/node_assassin/node_assassin-1.1.6.tar.gz
tar -xvzf node_assassin-1.1.6.tar.gz
cd node_assassin-1.1.6
./install
Ready to start the Node Assassin install:
I will now copy: [usr/share/man/man8/fence_na.8.gz] -> [/usr/share/man/man8/fence_na.8.gz] - Copied.
I will now copy: [usr/share/fence/fence_na.lib] -> [/usr/share/fence/fence_na.lib] - Copied.
I need to create: [/etc/cluster/node_assassin/]
doesn't exist... creating.
 - Created: [/etc/cluster/node_assassin/]
I will now copy: [fence_na.pod] -> [/etc/cluster/node_assassin/fence_na.pod] - Copied.
I will now copy: [etc/cluster/fence_na.conf] -> [/etc/cluster/fence_na.conf] - Copied.
I will now copy: [usr/sbin/fence_na] -> [/usr/sbin/fence_na] - Copied.
I will now copy: [README] -> [/etc/cluster/node_assassin/README] - Copied.
I will now copy: [INSTALL] -> [/etc/cluster/node_assassin/INSTALL] - Copied.
I will now copy: [CHANGES] -> [/etc/cluster/node_assassin/CHANGES] - Copied.
Backing up the original 'cluster.rng' file.
I will now copy: [/usr/share/cluster/cluster.rng] -> [/usr/share/cluster/cluster.rng.pre-node_assassin] - Copied.
Reading: [/usr/share/cluster/cluster.rng]
 - Read.
Injecting NA support arguments.
 - Injected.
Writing out: [/usr/share/cluster/cluster.rng]
 - Written.

Installation completed!

IMPORTANT! Please read!
 - Before you can use Node Assassin, you must edit
   /etc/cluster/fence_na.conf
Read: 'INSTALL' for details on how to complete the above steps.

If you want to remove the Node Assassin fence agent, simply run ./uninstall from the same directory. Do note that it will delete the configuration file, too.

Complete the install on both nodes before editing the config file.
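
If you have root SSH access from an-node01 to an-node02, one way to handle the second node is to push the tarball over and run the installer remotely, as sketched below; logging into an-node02 and repeating the wget, tar and ./install steps by hand works just as well.

# Copy the tarball to the second node and run the installer there.
scp ~/node_assassin-1.1.6.tar.gz root@an-node02:~/
ssh root@an-node02 "tar -xvzf node_assassin-1.1.6.tar.gz && cd node_assassin-1.1.6 && ./install"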

Now that the fence agent is installed, you will need to configure it. The configuration file is pretty well documented, so we will just look at the specific lines that need to be edited. For reference, let's edit this configuration file on an-node01 and then copy it to an-node02.

vim /etc/cluster/fence_na.conf

Set the username and password that you want to use with your devices. These must match the values entered in the cluster.conf file.

# This is the authentication information... It is currently a simple plain text
# compare, but this will change prior to first release.
system::username        =       admin
system::password        =       sauce

After programming your Node Assassin, copy the name you gave it into the variable below.

# The node assassin name. This must match exactly with the name programmed into
# the given node.
na::1::na_name          =       fence_na01

Now define the individual nodes connected to your device.

# These are aliases for each Node Assassin port. They should match the name or
# URI of the node connected to the given port. This is optional but will make
# the fenced 'list' argument more accurate and sane. If a port is listed here,
# then the 'list' action will return '<node_id>,<value>'. If a port is not
# defined, 'list' will return '<node_id>,<node::X::name-node_id>'. If a port is
# set to 'unused', it will be skipped when replying to a 'list'.
na::1::alias::1         =       an-node01.alteeve.com
na::1::alias::2         =       an-node02.alteeve.com
na::1::alias::3         =       unused
na::1::alias::4         =       unused

Now copy it over to an-node02.

rsync -av /etc/cluster/fence_na.conf root@an-node02:/etc/cluster/
sending incremental file list
fence_na.conf

sent 1508 bytes  received 67 bytes  3150.00 bytes/sec
total size is 3537  speedup is 2.25
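
As a quick sanity check, you can compare checksums on the two nodes to confirm that the copies are identical; this again assumes root SSH access to an-node02.

# The two sums should match.
md5sum /etc/cluster/fence_na.conf
ssh root@an-node02 "md5sum /etc/cluster/fence_na.conf"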

Finally, test that you can fence either node. First, fence an-node02 from an-node01.

fence_na -a fence_na01.alteeve.com -l admin -p sauce -n 2 -o reboot
Node Assassin: . [fence_na01.alteeve.com].
TCP Port: ...... [238].
Node: .......... [02].
Login: ......... [admin].
Password: ...... [sauce].
Action: ........ [reboot].
Version Request: [no].
Done reading args.
Processing: [02:1,sleep,02:0,sleep,02:off,02:2,sleep,02:on]
Fencing node 02:
 - Reset fenced.
 - Reset released.
 - Power fenced.
 - SUCCESS!
 - Node's front-panel switches locked.
Sleeping: 1, Done.
Releasing 02
 - Power released.
 - Reset released.
 - Status: SUCCESS!
Sleeping: 1, Done.
Booting node
 - Power button closed.
 - Power button opened.
Sleeping: 1, Done.

Now, fence an-node01 from an-node02.

fence_na -a fence_na01.alteeve.com -l admin -p sauce -n 1 -o reboot
Node Assassin: . [fence_na01.alteeve.com].
TCP Port: ...... [238].
Node: .......... [01].
Login: ......... [admin].
Password: ...... [sauce].
Action: ........ [reboot].
Version Request: [no].
Done reading args.
Processing: [01:1,sleep,01:0,sleep,01:off,01:2,sleep,01:on]
Fencing node 01:
 - Reset fenced.
 - Reset released.
 - Power fenced.
 - SUCCESS!
 - Node's front-panel switches locked.
Sleeping: 1, Done.
Releasing 01
 - Power released.
 - Reset released.
 - Status: SUCCESS!
Sleeping: 1, Done.
Booting node
 - Power button closed.
 - Power button opened.
Sleeping: 1, Done.

Note: when you use the quiet="true" option in Node Assassin's /etc/cluster/cluster.conf entry, all of the output above is suppressed.

The First cluster.conf File

Before we begin, let's discuss briefly a few things about the cluster we will build.

  • The totem communication will be over private and secure networks. This means that we can skip encrypting our cluster communications, which will improve performance and simplify our configuration.
  • To start, we will use the special two_node option. We will remove this when we add in the quorum disk in stage 2.
  • We will not implement redundant ring protocol just now, but will add it later.

With these decisions and the information gathered, here is what our first /etc/cluster/cluster.conf file will look like.

touch /etc/cluster/cluster.conf
vim /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="an-cluster" config_version="1">
	<cman two_node="1" expected_votes="1" />
	<totem secauth="off" rrp_mode="none" />
	<clusternodes>
		<clusternode name="an-node01.alteeve.com" nodeid="1">
			<fence>
				<method name="ipmi">
					<device name="fence_ipmi01" action="reboot" />
				</method>
				<method name="node_assassin">
					<device name="fence_na01" port="01" action="reboot" />
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node02.alteeve.com" nodeid="2">
			<fence>
				<method name="ipmi">
					<device name="fence_ipmi02" action="reboot" />
				</method>
				<method name="node_assassin">
					<device name="fence_na01" port="02" action="reboot" />
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="fence_ipmi01" agent="fence_ipmilan" ipaddr="192.168.3.51" login="admin" passwd="secret" />
		<fencedevice name="fence_ipmi02" agent="fence_ipmilan" ipaddr="192.168.3.52" login="admin" passwd="secret" />
		<fencedevice name="fence_na01"   agent="fence_na"      ipaddr="192.168.3.61" login="admin" passwd="sauce" quiet="true" />
	</fencedevices>
</cluster>

Save the file, then validate it using the xmllint program. If it validates, the contents will be printed followed by a success message. If it fails, address the errors and try again.

xmllint --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="an-cluster" config_version="1">
	<cman two_node="1" expected_votes="1"/>
	<totem secauth="off" rrp_mode="none"/>
	<clusternodes>
		<clusternode name="an-node01.alteeve.com" nodeid="1">
			<fence>
				<method name="ipmi">
					<device name="fence_ipmi01" action="reboot"/>
				</method>
				<method name="node_assassin">
					<device name="fence_na01" port="01" action="reboot"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node02.alteeve.com" nodeid="2">
			<fence>
				<method name="ipmi">
					<device name="fence_ipmi02" action="reboot"/>
				</method>
				<method name="node_assassin">
					<device name="fence_na01" port="02" action="reboot"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="fence_ipmi01" agent="fence_ipmilan" ipaddr="192.168.3.51" login="admin" passwd="secret"/>
		<fencedevice name="fence_ipmi02" agent="fence_ipmilan" ipaddr="192.168.3.52" login="admin" passwd="secret"/>
		<fencedevice name="fence_na01" agent="fence_na" ipaddr="192.168.3.61" login="admin" passwd="sauce" quiet="true"/>
	</fencedevices>
</cluster>
/etc/cluster/cluster.conf validates

DO NOT PROCEED UNTIL YOUR cluster.conf FILE VALIDATES!

Unless you have it perfect, your cluster will fail.

Once it validates, proceed.
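
Remember that both nodes need an identical copy of this file before the cluster is started. Since we've been editing on an-node01, one simple way to push it over, mirroring the rsync call used for fence_na.conf earlier, is shown below; re-running the xmllint check on an-node02 afterwards is a cheap bit of extra insurance.

# Push the validated cluster.conf to the second node and re-validate it there.
rsync -av /etc/cluster/cluster.conf root@an-node02:/etc/cluster/
ssh root@an-node02 "xmllint --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf"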

Starting The Cluster For The First Time

By default, if you start only one node with the <cman two_node="1" expected_votes="1"/> option enabled, as we have done, the lone node will gain quorum on its own. It will try to contact the other node but, finding no existing cluster to join, it will fence the other node after a timeout period. This timeout is 6 seconds by default.

For now, we will leave the default as it is. If you're interested in changing it though, the argument you are looking for is post_join_delay.
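
As a rough sketch, the delay is set with the fence_daemon element in cluster.conf; the value of 30 seconds below is only an example. Remember that any change to cluster.conf must be accompanied by an increment of the config_version attribute.

<cluster name="an-cluster" config_version="2">
	<!-- ... -->
	<fence_daemon post_join_delay="30"/>
	<!-- ... -->
</cluster>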

This behaviour means that we'll want to start both nodes well within six seconds of one another, lest the slower node be needlessly fenced.
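
In practice, that just means opening a terminal on each node and starting the cluster manager on both at roughly the same time. A minimal sketch of that, along with a quick membership check once both are up, is below.

# Run this on an-node01 and an-node02 within a few seconds of one another.
service cman start

# Once both nodes are up, confirm that both appear as cluster members.
cman_tool status
cman_tool nodes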

Left off here

Note: to help minimize dual-fences:

  • You could add FENCED_OPTS="-f 5" to /etc/sysconfig/cman on *one* node; iLO fence devices may need this (see the sketch below).
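
For reference, a one-line way to add that option on the chosen node looks like the following; run it on one node only.

# Append the fenced option noted above to cman's sysconfig file (one node only).
echo 'FENCED_OPTS="-f 5"' >> /etc/sysconfig/cman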

 

Any questions, feedback, advice, complaints or meanderings are welcome.