Abandoned - Two Node Fedora 13 Cluster: Difference between revisions

From Alteeve Wiki
Revision as of 22:10, 8 October 2010


Notice, Sep. 08, 2010 - Draft 1: This is the first draft of this HowTo. It has not been peer-reviewed yet, so please proceed with caution and on a test cluster. Any and all feedback is much appreciated.

Overview

This paper has one goal:

  1. How to assemble the simplest cluster possible, a 2-Node Cluster, which you can then expand on for your own needs.

With this completed, you can then jump into "step 2" papers that will show various uses of a two node cluster:

  1. How to create a "floating" virtual machine that can move between the two nodes in the event of a node failure, maximizing up time.
  2. How to create simple resources that can move between nodes. Examples will be a simple PostgreSQL database, DHCP, DNS and web servers.

Prerequisites

It is expected that you are already comfortable with the Linux command line, specifically bash, and that you are familiar with general administrative tasks in Red Hat based distributions, specifically Fedora. You will also need to be comfortable using editors like vim, nano or similar. This paper uses vim in examples. Simply substitute your favourite editor in its place.

You are also expected to be comfortable with networking concepts. You will be expected to understand TCP/IP, multicast, broadcast, subnets and netmasks, routing and other relatively basic networking concepts. Please take the time to become familiar with these concepts before proceeding.

This said, where feasible, as much detail as is possible will be provided. For example, all configuration file locations will be shown and functioning sample files will be provided.

Platform

This paper will implement the Red Hat Cluster Suite using the Fedora v13 distribution. This paper uses the x86_64 repositories. However, if you are on an i386 (32 bit) system, you should be able to follow along fine. Simply replace x86_64 with i386 or i686 in package names.

You can either download the stock Fedora 13 DVD ISO (currently at version 5.4 which is used in this document), or you can try out the alpha AN!Cluster Install DVD. (4.3GB iso). If you use the latter, please test it out on a development or test cluster. If you have any problems with the AN!Cluster variant Fedora distro, please contact me and let me know what your trouble was.

Why Fedora 13?

Generally speaking, I much prefer to use a server-oriented Linux distribution like CentOS, Debian or similar. However, there have been many recent changes in the Linux-Clustering world that have made all of the currently available server-class distributions obsolete. With luck, once Red Hat Enterprise Linux and CentOS version 6 are released, this will change.

Until then, Fedora version 13 provides the most up to date binary releases of the new implementation of the clustering stack available. For this reason, FC13 is the best choice in clustering, if you want to be current. To mitigate some of the issues introduced by using a workstation distribution, many packages will be stripped out of the default install.

Focus

Clusters can serve to solve three problems: Reliability, Performance and Scalability.

The focus of the cluster described in this paper is primarily reliability. Second to this, scalability will be the priority, leaving performance to be addressed only when it does not impact the first two criteria. This is not to say that performance is not a valid priority; it simply isn't the priority of this paper.

Goal

At the end of this paper, you should have a fully functioning two-node array capable of hosting "floating" resources; that is, resources that exist on one node and can be easily moved to the other node with minimal effort and down time. This should leave you with a solid foundation for adding more virtual servers up to the limit of your cluster's resources.

This paper should also serve to show how to build the foundation of any other cluster configuration. This paper has a core focus of introducing the main issues that come with clustering and hopes to serve as a foundation for any cluster configuration outside the scope of this paper.

Begin

Let's begin!

Hardware

We will need two physical servers each with the following hardware:

  • One or more multi-core CPUs with Virtualization support.
  • Three network cards; At least one should be gigabit or faster.
  • One or more hard drives.
  • You will need some form of a fence device. This can be an IPMI-enabled server, a Node Assassin, a fenceable PDU or similar.

This paper uses the following hardware:

  • ASUS M4A78L-M
  • AMD Athlon II x2 250
  • 2GB Kingston DDR2 KVR800D2N6K2/4G (4GB kit split between the two nodes)
  • 1x Intel 82540 PCI NIC
  • 1x D-Link DGE-560T
  • Node Assassin

This is not an endorsement of the above hardware. I bought what was within my budget that would serve the purposes of creating this document. What you purchase shouldn't matter, so long as the minimum requirements are met.

Pre-Assembly

With multiple NICs, it is quite likely that the mapping of physical devices to logical ethX devices may not be ideal. This is a particular issue if you decide to network boot your install media. In that case, if the wrong NIC is chosen for eth0, you will be presented with a list of MAC addresses to attempt setup with.

Before you assemble your servers, record their network cards' MAC addresses. I like to keep simple text files like these:

cat an-node01.mac
90:E6:BA:71:82:EA	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
00:21:91:19:96:53	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter
00:0E:0C:59:46:E4	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller
cat an-node02.mac
90:E6:BA:71:82:D8	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
00:21:91:19:96:5A	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter
00:0E:0C:59:45:78	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller

Feel free to record the information in any way that suits you the best.
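
If a node is already booted into Linux (a live CD will do), a small shell function can gather this list for you. This is only a sketch: the name list_macs is made up for this example, and it assumes the kernel exposes each interface's address under /sys/class/net. The directory is a parameter purely so that the function can be tried against a scratch directory first.

```shell
# Hypothetical helper: print "MAC<TAB>device" for each network interface.
# The directory argument defaults to /sys/class/net.
list_macs() {
    local sysdir="${1:-/sys/class/net}"
    local dev name
    for dev in "$sysdir"/*; do
        name=$(basename "$dev")
        # Skip the loopback device and anything without an address file.
        [ "$name" = "lo" ] && continue
        [ -r "$dev/address" ] || continue
        # Upper-case the hex digits to match the notes shown above.
        printf '%s\t%s\n' "$(tr 'a-f' 'A-F' < "$dev/address")" "$name"
    done
}
```

Calling list_macs > an-node01.mac on the first node gives a starting file; the card descriptions still need to be added by hand.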

OS Install

Start with a stock Fedora 13 install. This How-To uses Fedora 13 x86_64; however, it should be fairly easy to adapt to other recent Fedora versions. This document is also meant to be easy to port to CentOS/RHEL version 6 once it is released. It will not adapt well to CentOS/RHEL version 5, though, as much of the cluster stack has changed dramatically since it was initially released.

These are the sample kickstart scripts used by this paper. Be sure to set your own password string and network settings.

Warning! These kickstart scripts will erase your hard drive! Adapt them, don't blindly use them.

Generic cluster node kickstart scripts.

AN!Cluster Install

If you are feeling brave, below is a link to a custom install DVD that contains the kickstart scripts to setup nodes and an an-cluster directory with all the configuration files.

  • Download the custom AN!Cluster v0.2.001 Install DVD. (4.5GiB iso). (Currently disabled - Reworking for F13)

Post OS Install

Once the OS is installed, we need to do a few things:

  1. Disable selinux
  2. Setup networking.
  3. Change the default run-level.

Disable 'selinux'

To allow this paper to focus on clustering, we will disable selinux. Obviously, this introduces security issues that you may not be comfortable with. If you wish to leave selinux enabled, it will be up to you to sort out issues that crop up.

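To do this, edit /etc/selinux/config and change SELINUX=enforcing to SELINUX=permissive. Strictly speaking, this puts selinux into permissive mode, where policy violations are logged but no longer blocked; use SELINUX=disabled if you want it turned off entirely. You will need to reboot in order for the change to take effect.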
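
For those who would rather script the edit than open an editor, here is a minimal sketch. The function name selinux_set_permissive is invented for this example, and the file path is an argument only so that you can rehearse on a copy before touching the real file.

```shell
# Hypothetical helper: switch SELINUX=enforcing to SELINUX=permissive.
# The path defaults to /etc/selinux/config.
selinux_set_permissive() {
    local cfg="${1:-/etc/selinux/config}"
    # Keep a backup copy next to the original before changing anything.
    cp -p "$cfg" "$cfg.orig" || return 1
    # Rewrite the file from the backup so that only the one line changes.
    sed 's/^SELINUX=enforcing[[:space:]]*$/SELINUX=permissive/' "$cfg.orig" > "$cfg"
}
```

Run as root with no argument, it edits the real file and leaves the original in /etc/selinux/config.orig.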

Change the Default Run-Level

This is an optional step. It only improves performance; it is not required.

If you don't plan to work on your nodes directly, it makes sense to switch the default run level from 5 to 3. This prevents the window manager, like Gnome or KDE, from starting at boot. This frees up a fair amount of memory and system resources and reduces the possible attack vectors.

To do this, edit /etc/inittab, change the id:5:initdefault: line to id:3:initdefault: and then switch to run level 3:

vim /etc/inittab
id:3:initdefault:
init 3
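
The same change can be made without an editor. The following is a sketch only; set_default_runlevel is a made-up name, and it assumes the file contains a single id:N:initdefault: line as described above.

```shell
# Hypothetical helper: set the default run level in an inittab-style file.
# Arguments: the new run level, then the file (defaults to /etc/inittab).
set_default_runlevel() {
    local level="$1" tab="${2:-/etc/inittab}"
    # Keep a backup copy, then rewrite the file from it.
    cp -p "$tab" "$tab.orig" || return 1
    sed "s/^id:[0-9]:initdefault:/id:${level}:initdefault:/" "$tab.orig" > "$tab"
}
```

For example, set_default_runlevel 3 as root, followed by init 3 to switch the running system.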

Setup Networking

We need to remove NetworkManager, enable network, configure the ifcfg-eth* files and then start the network daemon.

Managed Switch Warning

WARNING: Please pay attention to this warning. The vast majority of cluster problems end up being network related. The hardest ones to diagnose are usually multicast issues.

If you use a managed switch, be careful about enabling Multicast IGMP Snooping or Spanning Tree Protocol! They have been known to cause problems by not allowing multicast packets to reach all nodes. This can cause somewhat random break-downs in communication between your nodes, leading to fences and DLM lock timeouts.

Ensure that PIM routing is set up and working.

If you have problems with your cluster not forming, or seemingly random fencing, try using an unmanaged switch or use a cross-over cable. If the problem goes away, you are most likely dealing with a managed switch configuration problem.

Network Layout

This setup expects you to have three physical network cards connected to three independent networks. To have a common vernacular, let's use this table to describe them:

Network Description      Short Name  Device Name  Suggested Subnet  NIC Properties
Back-Channel Network     BCN         eth0         192.168.1.0/24    NICs with IPMI piggy-back must be used here. Second-fastest NIC should be used here.
Storage Network          SN          eth1         192.168.2.0/24    Fastest NIC should be used here.
Internet-Facing Network  IFN         eth2         192.168.3.0/24    Remaining NIC should be used here. If using a PXE server, this should be a bootable NIC.

Take note of these concerns when planning which NIC to use on each subnet. These issues are presented in the order that they must be addressed in:

  1. If your nodes have IPMI piggy-backing on a normal NIC, that NIC must be used on the BCN subnet. Having your fence device accessible on a subnet that can be remotely accessed can pose a major security risk.
  2. The fastest NIC should be used for your SN subnet. Be sure to know which NICs support the largest jumbo frames when considering this.
  3. If you still have two NICs to choose from, use the fastest remaining NIC for your BCN subnet. This will minimize the time it takes to perform tasks like hot-migration of live virtual machines.
  4. The final NIC should be used for the IFN subnet.

Node IP Addresses

Obviously, the IP addresses you give to your nodes should be ones that suit you best. In this example, the following IP addresses are used:

           Back-Channel Network (BCN)  Storage Network (SN)  Internet-Facing Network (IFN)
an-node01  192.168.1.71                192.168.2.71          192.168.3.71
an-node02  192.168.1.72                192.168.2.72          192.168.3.72

Remove NetworkManager

Some cluster software will not start with NetworkManager installed. This is because NetworkManager is designed to be a highly-adaptive network system that can accommodate frequent changes in the network. To simplify these network transitions for end-users, a lot of reconfiguration of the network is done behind the scenes.

For workstations, this is wonderful. For clustering, this can be disastrous. Transient network outages are already a risk to a cluster's stability!

So first up, make sure that NetworkManager is completely removed from your system. If you used the kickstart scripts, then it was not installed. Otherwise, run:

yum remove NetworkManager NetworkManager-gnome NetworkManager-openconnect NetworkManager-openvpn NetworkManager-pptp NetworkManager-vpnc cnetworkmanager knetworkmanager knetworkmanager-openvpn knetworkmanager-pptp knetworkmanager-vpnc libproxy-networkmanager yum-NetworkManager-dispatcher

Setup 'network'

Before proceeding with network configuration, check to see if your network cards are aligned to the proper ethX network names. If they need to be adjusted, please follow this How-To before proceeding:

There are a few ways to configure network in Fedora:

  • system-config-network (graphical)
  • system-config-network-tui (ncurses)
  • Directly editing the /etc/sysconfig/network-scripts/ifcfg-eth* files. (See: here for a full list of options)
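
If you go the direct-editing route, each interface needs only a handful of lines. Here is a minimal sketch that writes one such file. The function name write_ifcfg and its argument order are my own invention for this example; the keys it writes (DEVICE, HWADDR, BOOTPROTO, ONBOOT, IPADDR, NETMASK) are standard ifcfg options. The output directory is an argument so you can write to a scratch directory and inspect the result first.

```shell
# Hypothetical helper: write a static ifcfg file for one interface.
# Arguments: device, MAC address, IP address, netmask, output directory.
write_ifcfg() {
    local dev="$1" mac="$2" ip="$3" mask="$4"
    local dir="${5:-/etc/sysconfig/network-scripts}"
    # BOOTPROTO=none means no DHCP; the address below is used as-is.
    cat > "$dir/ifcfg-$dev" <<EOF
DEVICE=$dev
HWADDR=$mac
BOOTPROTO=none
ONBOOT=yes
IPADDR=$ip
NETMASK=$mask
EOF
}
```

For example, write_ifcfg eth0 90:E6:BA:71:82:EA 192.168.1.71 255.255.255.0 would describe an-node01's first interface using the sample addresses from this paper.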

Do not proceed until your node's networking is fully configured.

Update the Hosts File

Some applications expect to be able to call nodes by their name. To accommodate this, and to ensure that inter-node communication takes place on the back-channel subnet, we remove any existing hostname entries and then add the following to the /etc/hosts file:

Note: Any pre-existing entries matching the name returned by uname -n must be removed from /etc/hosts. There is a good chance there will be an entry that resolves to 127.0.0.1 which would cause problems later.

Obviously, adapt the names and IPs to match your nodes and subnets. The only critical thing is to make sure that the name returned by uname -n is resolvable to the back-channel subnet. I like to add a short-form name for convenience.

The updated /etc/hosts file should look something like this:

vim /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

# Back Channel Network
192.168.1.71    an-node01 an-node01.alteeve.com an-node01.bcn
192.168.1.72    an-node02 an-node02.alteeve.com an-node02.bcn

# Storage Network
192.168.2.71    an-node01.sn
192.168.2.72    an-node02.sn

# Internet Facing Network
192.168.3.71    an-node01.ifn
192.168.3.72    an-node02.ifn

# Node Assassins
192.168.3.61    batou batou.alteeve.com
192.168.3.62    motoko motoko.alteeve.com

Note: I use Node Assassins. If you use IPMI or other fence devices, alter the entries as appropriate.

Now to test this, ping both nodes by their name, as returned by uname -n, and make sure the ping packets are sent on the back channel network (192.168.1.0/24).

ping -c 5 an-node01.alteeve.com
PING an-node01 (192.168.1.71) 56(84) bytes of data.
64 bytes from an-node01 (192.168.1.71): icmp_seq=1 ttl=64 time=0.399 ms
64 bytes from an-node01 (192.168.1.71): icmp_seq=2 ttl=64 time=0.403 ms
64 bytes from an-node01 (192.168.1.71): icmp_seq=3 ttl=64 time=0.413 ms
64 bytes from an-node01 (192.168.1.71): icmp_seq=4 ttl=64 time=0.365 ms
64 bytes from an-node01 (192.168.1.71): icmp_seq=5 ttl=64 time=0.428 ms

--- an-node01 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4001ms
rtt min/avg/max/mdev = 0.365/0.401/0.428/0.030 ms
ping -c 5 an-node02.alteeve.com
PING an-node02 (192.168.1.72) 56(84) bytes of data.
64 bytes from an-node02 (192.168.1.72): icmp_seq=1 ttl=64 time=0.419 ms
64 bytes from an-node02 (192.168.1.72): icmp_seq=2 ttl=64 time=0.405 ms
64 bytes from an-node02 (192.168.1.72): icmp_seq=3 ttl=64 time=0.416 ms
64 bytes from an-node02 (192.168.1.72): icmp_seq=4 ttl=64 time=0.373 ms
64 bytes from an-node02 (192.168.1.72): icmp_seq=5 ttl=64 time=0.396 ms

--- an-node02 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4001ms
rtt min/avg/max/mdev = 0.373/0.401/0.419/0.030 ms
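
The same check can be made without sending any packets by reading the hosts file directly. This is a rough sketch; both function names are invented here, and it only looks at the hosts file, so it will not notice DNS or other name services.

```shell
# Hypothetical helper: print the first address that a hosts-format file
# assigns to the given name. Comments are ignored.
hosts_lookup() {
    local name="$1" file="${2:-/etc/hosts}"
    awk -v n="$name" '
        { sub(/#.*/, "") }   # strip comments; awk re-splits the fields
        { for (i = 2; i <= NF; i++) if ($i == n) { print $1; exit } }
    ' "$file"
}

# Hypothetical helper: succeed only if the name (uname -n by default)
# maps to an address that starts with the given prefix.
name_on_subnet() {
    local prefix="$1" name="${2:-$(uname -n)}" file="${3:-/etc/hosts}"
    case "$(hosts_lookup "$name" "$file")" in
        "$prefix"*) return 0 ;;
        *)          return 1 ;;
    esac
}
```

On each node, name_on_subnet 192.168.1. && echo ok should print "ok" if the node's own name resolves to the back channel network used in this paper.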

If your fence device uses a name, be sure to include entries to resolve it as well. You can see how I've done this with the two Node Assassin devices I use. The same applies to IPMI or other devices, if you plan to reference them by name.

Fencing will be discussed in more detail later on in this HowTo.

Disable Firewalls

Be sure to flush netfilter tables and disable iptables and ip6tables from starting on our nodes.

There will be enough potential sources of problems as it is. Disabling firewalls at this stage will minimize the chance of an errant iptables rule messing up our configuration. If, before launch, you wish to implement a firewall, feel free to do so but be sure to thoroughly test your cluster to ensure no problems were introduced.

chkconfig --level 2345 iptables off
/etc/init.d/iptables stop
chkconfig --level 2345 ip6tables off
/etc/init.d/ip6tables stop

Setup SSH Shared Keys

This is an optional step. Setting up shared keys will allow your nodes to pass files between one another and execute commands remotely without needing to enter a password. This is obviously somewhat risky from a security point of view. As such, it is up to you whether you do this or not. This is not meant to be a security-focused How-To, so please independently study the risks.

If you're a little new to SSH, it can be a bit confusing keeping connections straight in your head. When you connect to a remote machine, you start the connection on your machine as the user you are logged in as. This is the source user. When you call the remote machine, you tell the machine what user you want to log in as. This is the remote user.

You will need to create an SSH key for each source user, and then you will need to copy the newly generated public key to each remote machine's user directory that you want to connect to. In this example, we want to connect to either node, from either node, as the root user. So we will create a key for each node's root user and then copy the generated public key to the other node's root user's directory.

For each user, on each machine you want to connect from, run:

# The '2047' is just to screw with brute-forces a bit. :)
ssh-keygen -t rsa -N "" -b 2047 -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
08:d8:ed:72:38:61:c5:0e:cf:bf:dc:28:e5:3c:a7:88 root@an-node01.alteeve.com
The key's randomart image is:
+--[ RSA 2047]----+
|     ..          |
|   o.o.          |
|  . ==.          |
|   . =+.         |
|    + +.S        |
|     +  o        |
|       = +       |
|     ...B o      |
|    E ...+       |
+-----------------+

This will create two files: the private key called ~/.ssh/id_rsa and the public key called ~/.ssh/id_rsa.pub. The private key must never be group or world readable! That is, it should be set to mode 0600.
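
A quick way to confirm the mode of the private key file generated above is sketched below. The function name key_mode_ok is made up for this example, and it relies on the -c option of GNU stat, which is what Fedora ships.

```shell
# Hypothetical helper: succeed only if the file's mode is exactly 600.
key_mode_ok() {
    [ "$(stat -c '%a' "$1" 2>/dev/null)" = "600" ]
}
```

If the check fails for your private key file, chmod 600 on that file will correct it.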

The two files should look like:

Private key:

cat ~/.ssh/id_rsa
-----BEGIN RSA PRIVATE KEY-----
MIIEoQIBAAKCAQBlL42DC+NJVpJ0rdrWQ1rxGEbPrDoe8j8+RQx3QYiB014R7jY5
EaTenThxG/cudgbLluFxq6Merfl9Tq2It3k9Koq9nV9ZC/vXBcl4MC7pGSQaUw2h
DVwI7OCSWtnS+awR/1d93tANXRwy7K5ic1pcviJeN66dPuuPqJEF/SKE7yEBapMq
sN28G4IiLdimsV+UYXPQLOiMy5stmyGhFhQGH3kxYzJPOgiwZEFPZyXinGVoV+qa
9ERSjSKAL+g21zbYB/XFK9jLNSJqDIPa//wz0T+73agZ0zNlxygmXcJvapEsFGDG
O6tcy/3XlatSxjEZvvfdOnC310gJVp0bcyWDAgMBAAECggEAMZd0y91vr+n2Laln
r8ujLravPekzMyeXR3Wf/nLn7HkjibYubRnwrApyNz11kBfYjL+ODqAIemjZ9kgx
VOhXS1smVHhk2se8zk3PyFAVLblcsGo0K9LYYKd4CULtrzEe3FNBFje10FbqEytc
7HOMvheR0IuJ0Reda/M54K2H1Y6VemtMbT+aTcgxOSOgflkjCTAeeOajqP5r0TRg
1tY6/k46hLiBka9Oaj+QHHoWp+aQkb+ReHUBcUihnz3jcw2u8HYrQIO4+v4Ud2kr
C9QHPW907ykQTMAzhMvZ3DIOcqTzA0r857ps6FANTM87tqpse5h2KfdIjc0Ok/AY
eKgYAQKBgQDm/P0RygIJl6szVhOb5EsQU0sBUoMT3oZKmPcjHSsyVFPuEDoq1FG7
uZYMESkVVSYKvv5hTkRuVOqNE/EKtk5bwu4mM0S3qJo99cLREKB6zNdBp9z2ACDn
0XIIFIalXAPwYpoFYi1YfG8tFfSDvinLI6JLDT003N47qW1cC5rmgQKBgHAkbfX9
8u3LiT8JqCf1I+xoBTwH64grq/7HQ+PmwRqId+HyyDCm9Y/mkAW1hYQB+cL4y3OO
kGL60CZJ4eFiTYrSfmVa0lTbAlEfcORK/HXZkLRRW03iuwdAbZ7DIMzTvY2HgFlU
L1CfemtmzEC4E6t5/nA4Ytk9kPSlzbzxfXIDAoGAY/WtaqpZ0V7iRpgEal0UIt94
wPy9HrcYtGWX5Yk07VXS8F3zXh99s1hv148BkWrEyLe4i9F8CacTzbOIh1M3e7xS
pRNgtH3xKckV4rVoTVwh9xa2p3qMwuU/jMGdNygnyDpTXusKppVK417x7qU3nuIv
1HzJNPwz6+u5GLEo+oECgYAs++AEKj81dkzytXv3s1UasstOvlqTv/j5dZNdKyZQ
72cvgsUdBwxAEhu5vov1XRmERWrPSuPOYI/4m/B5CYbTZgZ/v8PZeBTg17zgRtgo
qgJq4qu+fXHKweR3KAzTPSivSiiJLMTiEWb5CD5sw6pYQdJ3z5aPUCwChzQVU8Wf
YwKBgQCvoYG7gwx/KGn5zm5tDpeWb3GBJdCeZDaj1ulcnHR0wcuBlxkw/TcIadZ3
kqIHlkjll5qk5EiNGNlnpHjEU9X67OKk211QDiNkg3KAIDMKBltE2AHe8DhFsV8a
Mc/t6vHYZ632hZ7b0WNuudB4GHJShOumXD+NfJgzxqKJyfGkpQ==
-----END RSA PRIVATE KEY-----

Public key:

cat ~/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAGUvjYML40lWknSt2tZDWvEYRs+sOh7yPz5FDHdBiIHTXhHuNjkRpN6dOHEb9y52BsuW4XGrox6t+X1OrYi3eT0qir2dX1kL+9cFyXgwLukZJBpTDaENXAjs4JJa2dL5rBH/V33e0A1dHDLsrmJzWly+Il43rp0+64+okQX9IoTvIQFqkyqw3bwbgiIt2KaxX5Rhc9As6IzLmy2bIaEWFAYfeTFjMk86CLBkQU9nJeKcZWhX6pr0RFKNIoAv6DbXNtgH9cUr2Ms1ImoMg9r//DPRP7vdqBnTM2XHKCZdwm9qkSwUYMY7q1zL/deVq1LGMRm+9906cLfXSAlWnRtzJYM= root@an-node01.alteeve.com

Copy the public key and then ssh normally into the remote machine as the root user. Create a file called ~/.ssh/authorized_keys and paste in the key.

From an-node01, type:

ssh root@an-node02
The authenticity of host 'an-node02 (192.168.1.72)' can't be established.
RSA key fingerprint is d4:1b:68:5f:fa:ef:0f:0c:16:e7:f9:0a:8d:69:3e:c5.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'an-node02,192.168.1.72' (RSA) to the list of known hosts.
root@an-node02's password: 
Last login: Fri Oct  1 20:07:01 2010 from 192.168.1.102

You will now be logged into an-node02 as the root user. Create the ~/.ssh/authorized_keys file and paste into it the public key from an-node01. If the remote machine's user hasn't used ssh yet, their ~/.ssh directory will not exist.

cat ~/.ssh/authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAGUvjYML40lWknSt2tZDWvEYRs+sOh7yPz5FDHdBiIHTXhHuNjkRpN6dOHEb9y52BsuW4XGrox6t+X1OrYi3eT0qir2dX1kL+9cFyXgwLukZJBpTDaENXAjs4JJa2dL5rBH/V33e0A1dHDLsrmJzWly+Il43rp0+64+okQX9IoTvIQFqkyqw3bwbgiIt2KaxX5Rhc9As6IzLmy2bIaEWFAYfeTFjMk86CLBkQU9nJeKcZWhX6pr0RFKNIoAv6DbXNtgH9cUr2Ms1ImoMg9r//DPRP7vdqBnTM2XHKCZdwm9qkSwUYMY7q1zL/deVq1LGMRm+9906cLfXSAlWnRtzJYM= root@an-node01.alteeve.com
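
If you have the public key available as a file on the remote machine, the create-and-paste step can be scripted. The following is a sketch under that assumption; install_pubkey is an invented name, and the home directory is an argument so that it can be tried somewhere harmless first. It creates ~/.ssh when it is missing and sets the usual restrictive modes.

```shell
# Hypothetical helper: append a public key to a user's authorized_keys.
# Arguments: path to the .pub file, then the target home directory.
install_pubkey() {
    local pub="$1" home="${2:-$HOME}"
    mkdir -p "$home/.ssh" || return 1
    chmod 700 "$home/.ssh"
    touch "$home/.ssh/authorized_keys"
    # Only append if this exact line is not already present.
    grep -qxF -f "$pub" "$home/.ssh/authorized_keys" \
        || cat "$pub" >> "$home/.ssh/authorized_keys"
    chmod 600 "$home/.ssh/authorized_keys"
}
```

Running it twice with the same key leaves a single entry in the file.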

Now log out and then log back into the remote machine. This time, the connection should succeed without having entered a password!

Initial Cluster Setup

Before we get into specifics, let's take a minute to talk about the major components used in our cluster.

Core Program Overviews

These are the core programs that may be new to you that we will use to build our cluster. Before we configure them, let's take a minute to understand their roles.

Cluster Manager

The cluster manager, cman, takes the /etc/cluster/cluster.conf configuration file and uses it to configure and start the various cluster elements. This is where nodes are defined, fence devices are configured and various cluster tolerances are set.

Corosync

Corosync is, essentially, the clustering kernel which provides the essential services for cluster-aware applications and other cluster services. At its core is the totem protocol which provides cluster membership and ordered messaging. On top of this, it implements a number of core clustering services like closed process group messaging, quorum and confdb (an object database).

Its goal is to provide a substantially simpler and more flexible set of APIs to facilitate clustering in Linux. It manages which nodes are in the cluster, it triggers error messages when something fails, manages cluster locking and so on. Most other clustered applications rely on corosync to know when something has happened or to announce when a cluster-related action has taken place.

Please note that the corosync_overview man page is considered out of date at the time of this writing.

cpg; Closed Process Group

Note: Only install the following if you wish to review the man page.

yum install corosynclib-devel
man cpg_overview

The closed process group is a small process layer on top of the totem protocol provided by corosync. It handles the sending and delivery of messages amongst nodes in a consistent order. It adds PIDs and group names to the membership layer. Only members of the group get the messages, thus it is a "closed" group.

OpenAIS

Note: We will not use OpenAIS. It is included here to explain its role to folks coming from the Cluster stable 2 environment.

OpenAIS is now an extension to Corosync that adds an open-source implementation of the Service Availability (SA) Forum's 'Application Interface Specification' (AIS). It is an API and policy designed to be used by applications concerned with maintaining services during faults. AIS implements the 'Availability Management Framework' (AMF) which, in turn, provides for application fail over, cluster management (CLM), Checkpointing (CKPT), Eventing (EVT), Messaging (MSG), and Distributed Locking (DLOCK).

It does this by implementing pluggable shared libraries that get loaded into corosync at runtime. These libraries then can make use of corosync's internal API. Both corosync and openais services provide IPC interfaces so that application programmers can make use of their behaviour. In short, applications can use OpenAIS to be cluster-aware. Its libraries are used by some applications, including Pacemaker. In our application, we will only be using its libraries indirectly.

Please note that the openais_overview man page is considered out of date at the time of this writing.

Note: Only install the following if you wish to review the man page.

yum install openais

Setup Core Programs

We now need to edit and create the configuration files for our cluster.

Setup cman

This may be the only section you need to configure.

First, install the cluster manager:

yum -y install cman

Setting Up /etc/cluster/cluster.conf

Once installed, we need to create and fill in the /etc/cluster/cluster.conf file. This is the core configuration file of our cluster and is an XML formatted file.

Notice

This is, in many ways, the "core" of your cluster. This section is short, but expect to spend some time getting the right setup for your environment. Please pay careful attention to all options and thoroughly test your configuration before going into production!

vim /etc/cluster/cluster.conf

The example in the cluster.conf article details many of the common options you will want to use. For a more complete discussion of this file, please run man 5 cluster.conf. Also, please be aware that some arguments can overlap between corosync and this file.

Here is a real-world example configuration used in the development two-node AN!Cluster. This will likely not work for you, but should act as a good base to adapt and expand on.

<?xml version="1.0"?>
<cluster name="an-cluster" config_version="1">
    <cman two_node="1" expected_votes="1"/>
    <totem secauth="off" rrp_mode="active"/>
    <clusternodes>
        <clusternode name="an-node01.alteeve.com" nodeid="1">
            <fence>
                <method name="node_assassin">
                    <device name="batou" port="01" action="reboot"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="an-node02.alteeve.com" nodeid="2">
            <fence>
                <method name="node_assassin">
                    <device name="batou" port="02" action="reboot"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <fencedevices>
        <fencedevice name="batou" agent="fence_na" ipaddr="batou.alteeve.com" login="username" passwd="secret" quiet="1"/>
    </fencedevices>
</cluster>

Once your version of the cluster.conf is done, you need to verify it. Be sure to validate it against the cluster.rng validation file.

xmllint --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf

If there are errors, fix them. Once it is formatted properly, the contents of your cluster.conf file will be returned followed by "/etc/cluster/cluster.conf validates". Once this is the case, you can move on.
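
If you find yourself re-validating often, the command can be wrapped in a small function. This is just a convenience sketch; validate_cluster_conf is a made-up name. It adds xmllint's --noout switch so that the file itself is not echoed and only the verdict or the errors are shown, and it returns non-zero on any failure so it can be used in scripts.

```shell
# Hypothetical wrapper: validate a cluster.conf against the RELAX NG schema.
# Arguments: the config file, then the schema (both have defaults).
validate_cluster_conf() {
    local conf="${1:-/etc/cluster/cluster.conf}"
    local rng="${2:-/usr/share/cluster/cluster.rng}"
    # Fail early with a clear message if either file cannot be read.
    if [ ! -r "$conf" ] || [ ! -r "$rng" ]; then
        echo "cannot read $conf or $rng" >&2
        return 1
    fi
    xmllint --noout --relaxng "$rng" "$conf"
}
```

For example, validate_cluster_conf && echo "safe to continue".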

Note: Normally at this stage, you'd use chkconfig to enable cman to start at boot. However, depending on how you use your cluster, you may need to modify the boot order. For now, let's leave it off until we've got all the components enabled. This is particularly important if you will be using DRBD, CLVM or Xen.

chkconfig cman off

To see if it's working though, have a terminal open to both nodes and manually start cman. I suggest having two terminal windows open on each node. In one, start tailing /var/log/messages:

clear; tail -f -n 0 /var/log/messages

Now, with both log files being watched, start cman on both nodes.

Note: You MUST execute the following on both nodes within the time limit set by post_join_delay (default is 6 seconds). If you wait longer than this, the first node started will fence the other node, assuming you have fencing configured properly. If fencing is not configured properly, your first node will hang.

/etc/init.d/cman start

Examine the log files to verify there were no errors. If there were, fix them. If there were none, fantastic!

Warning: This last step will forcefully reboot your nodes. Do not do this on a production cluster! If it must be done on a production server, stop all non-essential services and be sure you have a good backup, just in case.

Lastly, test that your fencing is configured and working properly. We do this by using a program called fence_node. With both nodes up and running cman, call a fence against the other node. Be sure to be watching the log files.

From an-node01.alteeve.com

fence_node an-node02.alteeve.com
fence an-node02.alteeve.com success

If it worked, when the node comes back up, repeat the process from an-node02.alteeve.com against an-node01.alteeve.com.

Setup Corosync

Note: In our cluster, we will not need to configure Corosync.

When using cman, as we are here, Corosync's configuration file is ignored.

Most of the settings available in corosync.conf can be found in the cluster.conf file. Please see the cluster.conf article for a comprehensive list of what is and is not supported.

If you want to configure corosync and not use cman, please see the following article:

Fencing

Before proceeding with the cluster.conf file, you must understand what fencing is, how it is used in Red Hat/CentOS clusters and why it is so important.

  • The Cluster Admin's Mantra:
    • The only thing you don't know is what you don't know.

Just because one node loses communication with another node, it cannot assume that the silent node is dead!

What is it?

"Fencing" is the act of isolating a malfunctioning node. The goal is to prevent a split-brain condition where two nodes think the other member is dead and continue to use a shared resource. When this happens, file system corruption is almost guaranteed. Another dangerous scenario would be if one node paused while writing to a disk, the other node decides it's dead and starts to replay the journal, then the first node recovers and completes the write. The results would be equally disastrous. If you are lucky enough to not lose the shared file system, you will be faced with the task of determining what data got written to which node, merging that data and/or overwriting the node you trust the least. This 'best case' is still pretty lousy.

Fencing, isolating a node from altering shared disks, can be accomplished in a couple of ways:

  • Power
    • Power fencing is where a device is used to cut the power to a malfunctioning node. This is probably the most common type.
  • Blocking
    • Blocking is often implemented at the network level. This type of fencing leaves the node alone, but disconnects it from the storage network. Often this is done by a switch which prevents traffic coming from the fenced node.

Those familiar with heartbeat clusters will know "fencing" by the HA Linux term "STONITH", literally, Shoot The Other Node In The Head.

Misconception

It is a very common mistake to ignore fencing when first starting to learn about clustering. Often people think "It's just for production systems, I don't need to worry about it yet because I don't care what happens to my test cluster."

Wrong!

First, and most practically: the cluster software will block all I/O transactions when it can't guarantee a fence operation succeeded. The result is that your cluster will essentially "lock up". Likewise, cman and related daemons will fail if they can't find a fence agent to use.

Secondly: testing our cluster will involve inducing errors. Without proper fencing, there is a high probability that our shared file system will be corrupted. That would force the need to start over, making your learning take a lot longer than it needs to.

Fence Devices

Many major OEMs have their own remote management devices that can serve as fence devices. Examples are Dell's 'DRAC' (Dell Remote Access Controller), HP's 'iLO' (Integrated Lights-Out), IBM's 'RSA' (Remote Supervisor Adapter), Sun's 'SSP' (System Service Processor) and so on. Smaller manufacturers implement remote management via IPMI, the Intelligent Platform Management Interface.

In the above devices, fencing is implemented via a built-in or integrated device inside the server. These devices are usually accessible even when the host server is powered off or hard locked. Via these devices, the host server can be powered off, reset and powered on remotely, regardless of the state of the host server.

Block fencing is possible when the device connecting a node to shared resources, like a fiber-channel SAN switch, provides a method of logically "unplugging" a defective node from the shared resource, leaving the node itself alone.

Implementation

How you implement fencing depends largely on what kind of fence device(s) you have. This is where I need help from you, dear reader. I only have access to Node Assassin fence devices on my lab's cluster and IPMI at work. I will show examples below using these devices. If you have other fence devices, like addressable PDUs and so on, please let me know so that I can expand this section.

Using Node Assassin To Demonstrate Fencing

In Red Hat's cluster software, the fence device(s) are configured in the main cluster configuration file, /etc/cluster/cluster.conf. This configuration is then acted on by the fence daemon, fenced.

When the cluster determines that a node needs to be fenced, the fenced daemon will consult the cluster.conf file for information on how to access the fence device. Given this cluster.conf snippet:

<cluster name="an-cluster" config_version="1">
	<clusternodes>
		<clusternode name="an-node02.alteeve.com" nodeid="2">
			<fence>
				<method name="node_assassin">
					<device name="motoko" port="02" action="off"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="motoko" agent="fence_na" quiet="true"
		ipaddr="motoko.alteeve.com" login="motoko" passwd="secret">
		</fencedevice>
	</fencedevices>
</cluster>

If the cluster manager (corosync, specifically) determines that the node an-node02.alteeve.com needs to be fenced, fenced looks at the first (and only, in this case) <method> in that node's <fence> section and reads the name of its <device> entry, which is motoko in this case. It gathers the other variables from that entry and then looks in the <fencedevices> section for the device with the matching name. With these two sections, it now has all the variable-value pairs needed to pass to the fence agent script set in the agent variable.

So in this example, fenced looks up the details on the motoko Node Assassin fence device. It calls the fence_na program, called a fence agent, and passes the following arguments as a series of lines to the agent via STDIN.

ipaddr=motoko.alteeve.com
login=motoko
passwd=secret
quiet=true
port=02
action=off

How the fence agent acts on these arguments varies depending on the fence device itself. To continue using Node Assassin as an example, the fence_na fence agent will create a connection to the device at the IP address (or resolvable name, as in this case) specified in the ipaddr argument. Once connected, it will authenticate using the login and passwd arguments. Once authenticated, it tells the device what port to act on. Finally, it tells the device what action to take. Here, the action is off which, internally, translates to pressing the reset switch, then pressing and holding the power switch long enough to force a power down. As a last step, it checks whether there is still power coming from the node to determine whether the fence succeeded or not. In other devices, the port could be a power jack, a network switch port and so on.
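
As a rough illustration of the STDIN hand-off described above, the sketch below shows how an agent written in shell might pull one value out of the name=value lines it receives. This is not the real fence_na code; the fence_arg function name is hypothetical and exists only for this example.

```shell
#!/bin/sh
# Hypothetical helper, not part of fence_na or any shipped fence agent.
# Reads "name=value" lines on STDIN and prints the value for the name in $1.
# The body runs in a subshell so its variables do not leak to the caller.
fence_arg() (
    wanted=$1
    result=""
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            # Match only lines that begin with "<wanted>=".
            "$wanted="*) result=${line#*=} ;;
        esac
    done
    printf '%s\n' "$result"
)

# Example: feed the helper the same kind of lines fenced would send.
printf 'ipaddr=motoko.alteeve.com\nlogin=motoko\naction=off\n' | fence_arg action
```

Only the text up to the first = is treated as the name, so a value that itself contains an = sign (a password, for example) comes through intact.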

Once the agent completes, it reports success or failure in the form of a prescribed exit code. If the first attempt fails, fenced will try the next <method>, if a second exists. It will keep trying methods in the order they are found in the cluster.conf file until it runs out of them. If it fails to fence the node, most daemons will "block", that is, lock up and stop responding until the issue is resolved. The logic for this is that a locked up cluster is better than a corrupted one.

If any of the fence devices succeed though, the cluster will know that it is safe to proceed and will reconfigure the cluster without the defective node.
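
The try-in-order behaviour described above can be sketched in a few lines of shell. This is an illustration only; fenced implements the real logic internally, and the try_fence_methods name and the stand-in commands (true and false) are assumptions made for the example.

```shell
#!/bin/sh
# Illustration only: fenced implements this logic internally.
# Each argument is the name of a command standing in for one fence method.
try_fence_methods() {
    for method in "$@"; do
        # An exit code of 0 from the method means the fence succeeded.
        if "$method"; then
            echo "fence succeeded using: $method"
            return 0
        fi
        echo "fence method failed: $method" >&2
    done
    # No method worked; the cluster has to block rather than risk corruption.
    return 1
}

# Stand-ins: the first method fails, the second succeeds.
try_fence_methods false true
```

The important design point is the final return 1: when every method fails, nothing is assumed about the target node, which is why the cluster blocks instead of carrying on.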

An Example IPMI Configuration

IPMI is perhaps the most common method used for fencing nodes in a cluster. It requires having a system board with an IPMI baseboard management controller (BMC). IPMI is often the underlying technology used by many OEM-branded remote access tools. If you don't see a specific fence agent for your server's remote access application, experiment with generic IPMI tools.

Given the ubiquity of IPMI, I'd like to take the time to walk through configuring and implementing IPMI in a cluster.

IPMI is generally configured using tools provided by the manufacturer of your server. Often there is a web interface with a default IP address, user name and password. In many cases though, you can also configure your IPMI device from the command line using user-space tools. The next section will walk you through an example setup of an IPMI device. If you've already configured yours though, note the IP addresses, user names and passwords that you assigned and skip down one section.

Configuring IPMI From The Command Line

To start, we need to install the IPMI user software.

yum install ipmitool freeipmi.x86_64 freeipmi-bmc-watchdog.x86_64 freeipmi-ipmidetectd.x86_64 OpenIPMI.x86_64

Once installed, you should be able to check the local IPMI BMC using ipmitool once you've started the ipmi daemon.

/etc/init.d/ipmi start
Starting ipmi drivers:                                     [  OK  ]
ipmitool chassis status
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : false
Power Control Fault  : false
Power Restore Policy : always-off
Last Power Event     : command
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false
Front Panel Control  : none

If you see that, you're doing well. You can now check the current configuration using the following command.

ipmitool -I open lan print 1
Set in Progress         : Set Complete
Auth Type Support       : NONE MD2 MD5 OEM 
Auth Type Enable        : Callback : NONE MD2 MD5 OEM 
                        : User     : NONE MD2 MD5 OEM 
                        : Operator : NONE MD2 MD5 OEM 
                        : Admin    : NONE MD2 MD5 OEM 
                        : OEM      : 
IP Address Source       : Static Address
IP Address              : 10.255.128.1
Subnet Mask             : 255.255.0.0
MAC Address             : 00:e0:81:aa:bb:cc
SNMP Community String   : AMI
IP Header               : TTL=0x00 Flags=0x00 Precedence=0x00 TOS=0x00
BMC ARP Control         : ARP Responses Disabled, Gratuitous ARP Disabled
Gratituous ARP Intrvl   : 0.0 seconds
Default Gateway IP      : 192.168.1.1
Default Gateway MAC     : 00:00:00:00:00:00
Backup Gateway IP       : 0.0.0.0
Backup Gateway MAC      : 00:00:00:00:00:00
802.1q VLAN ID          : Disabled
802.1q VLAN Priority    : 0
RMCP+ Cipher Suites     : 1,2,3,6,7,8,11,12,0,0,0,0,0,0,0,0
Cipher Suite Priv Max   : aaaaXXaaaXXaaXX
                        :     X=Cipher Suite Unused
                        :     c=CALLBACK
                        :     u=USER
                        :     o=OPERATOR
                        :     a=ADMIN
                        :     O=OEM

You can change the MAC address, but this isn't advised without a good reason. That said, here is an example set of commands that will configure, save and then check that the settings took. Adapt the values to suit your environment and preferences.

# Don't change the MAC without a good reason. If you need to though, this should work.
#ipmitool -I open lan set 1 macaddr 00:e0:81:aa:bb:cd

# Set the IP to be static (instead of DHCP)
ipmitool -I open lan set 1 ipsrc static

# Set the IP, default gateway and subnet mask address of the IPMI interface.
ipmitool -I open lan set 1 ipaddr 192.168.1.171
ipmitool -I open lan set 1 defgw ipaddr 192.168.1.1
ipmitool -I open lan set 1 netmask 255.255.255.0

# Set the password.
ipmitool -I open lan set 1 password secret
ipmitool -I open user set password 2 secret

# Set the snmp community string, if appropriate
ipmitool -I open lan set 1 snmp alteeve

# Enable access
ipmitool -I open lan set 1 access on

# Reset the IPMI BMC to make sure the changes took effect.
ipmitool mc reset cold

# Wait a few seconds and then re-run the call that dumped the setup to ensure
# it is now what we want.
sleep 5
ipmitool -I open lan print 1

If all went well, you should see the same output as above, but now with your new configuration.

Testing IPMI

If you skipped the previous step and just want the minimal setup that works, you only need to install one package.

yum install ipmitool

Regardless of how you configured your IPMI devices, you will now want to test that you can check the power state of your nodes using the IPMI interface.

You can only query the status of the remote node(s) this way. You can't use the following command to check your local node.

In the example below, we will check the state of an-node01 from an-node02. Note that here we use the IP address directly, but in practice I like to use a name that resolves to the IP address of the IPMI interface (denoted by a .ipmi suffix).

ipmitool -I lan -H 192.168.1.171 -U admin -P secret chassis power status
Chassis Power is on

You could replace status with on, off or cycle to remotely boot, power off or reboot (power cycle) the node. This is, in fact, what the fence_ipmilan fence agent does. If you can afford to stop your servers, it's a good idea to play with the various power options to see how they work. There are a few more options than the ones I mentioned here, which you can read about in man ipmitool.
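
If you want to script this check, a small wrapper can reduce ipmitool's answer to a single word. The sketch below is a hypothetical helper (power_state is not part of ipmitool or any fence agent) and assumes the one-line "Chassis Power is ..." answer shown above.

```shell
#!/bin/sh
# Hypothetical helper: turns the one-line answer from
# "ipmitool ... chassis power status" into on, off or unknown.
power_state() (
    # Read the first line of input; an empty input leaves $line empty.
    IFS= read -r line || :
    case "$line" in
        "Chassis Power is on")  echo on ;;
        "Chassis Power is off") echo off ;;
        *)                      echo unknown ;;
    esac
)

# Example with canned input; in real use, pipe ipmitool's output in instead:
#   ipmitool -I lan -H 192.168.1.171 -U admin -P secret chassis power status | power_state
echo "Chassis Power is on" | power_state
```

Anything other than a clear on or off is reported as unknown, so a timeout or an authentication error is never mistaken for a powered-off node.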

Configuring /etc/cluster/cluster.conf

Lastly, we need to add the IPMI configuration to our cluster.conf file. Note that, unlike with the Node Assassin setup, there is no port value in the <device> tag. This is because there is only one device controlled by each IPMI BMC: the node itself. The other change is that there are now two <fencedevice ... /> entries, one for each node's IPMI device. Finally, pay attention to the <device name="..." /> field. Here the name matters: each node's <device> must name the <fencedevice> entry for that node's own BMC, otherwise the wrong machine would be fenced.

Here is an example...

<cluster name="an-cluster" config_version="1">
    <clusternodes>
        <clusternode name="an-node01.alteeve.com" nodeid="1" votes="1">
            <fence>
                <method name="ipmi">
                    <device name="fence_an01" action="reboot" />
                </method>
            </fence>
        </clusternode>
        <clusternode name="an-node02.alteeve.com" nodeid="2" votes="1">
            <fence>
                <method name="ipmi">
                    <device name="fence_an02" action="reboot" />
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <fencedevices>
        <fencedevice name="fence_an01" agent="fence_ipmilan" ipaddr="192.168.3.61" login="admin" passwd="secret" />
        <fencedevice name="fence_an02" agent="fence_ipmilan" ipaddr="192.168.3.62" login="admin" passwd="secret" />
    </fencedevices>
</cluster>
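
Because a mismatched name is easy to miss, a quick sanity check can help before you load the file. The sketch below is a hypothetical helper, not a cluster tool. It uses plain text matching rather than a real XML parser, and it assumes that name="..." is the first attribute of each <device> and <fencedevice> tag, as in the examples in this paper.

```shell
#!/bin/sh
# Hypothetical sanity check, not a cluster tool. It uses plain text matching,
# so it assumes name="..." is the first attribute, as in the examples here.
check_fence_names() (
    conf=$1
    status=0
    # Collect every name referenced by a <device name="..."> entry.
    names=$(grep -o '<device name="[^"]*"' "$conf" | sed 's/.*name="\([^"]*\)"/\1/' | sort -u)
    for name in $names; do
        # Each one must have a <fencedevice> with the same name.
        if ! grep -qF "<fencedevice name=\"$name\"" "$conf"; then
            echo "no <fencedevice> named $name" >&2
            status=1
        fi
    done
    exit "$status"
)

# Example use, run from the directory holding your copy of the file:
#   check_fence_names cluster.conf && echo "fence device names match"
```

The function returns 0 when every referenced name has a matching <fencedevice> and 1 otherwise. It does not check that the IP addresses or credentials are correct; only a real fence test can do that.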

Node Assassin

A cheap alternative is the Node Assassin, an open-hardware, open-source fence device. It was built to allow the use of commodity system boards that lack the remote management support found on more expensive, server-class hardware.

Full Disclosure: Node Assassin was created by me, with much help from others, for this paper.

Clean Up

Some daemons may have been installed or dragged in during the setup of the cluster. Most notable is heartbeat, which is not compatible with RHCS. The following command will prevent several such daemons from starting at boot.

The command is safe to run even when the daemons aren't installed or have already been removed. Of course, skip the ones you actually want to use. This assumes that the nodes have been built according to this HowTo.

chkconfig heartbeat off; chkconfig iscsid off; chkconfig iptables off; chkconfig ip6tables off

So You Have a Cluster - Now What?

Building the cluster infrastructure was pretty easy, wasn't it?

But what will you do with it?

Choose Your Own Adventure Cluster
  Xen-Based Virtual Machine Host on DRBD+CLVM
High Availability VM Host
 

Thanks

  • A huge thanks to Interlink Connectivity! They hire me as a contractor and have allowed me to extend these docs while working on their clusters. Development of these How-Tos would be much slower if not for them. If you need hosting or colo services, drop them a line. Their website is a bit out of date though, so disregard the prices. :)
  • To sdake of corosync for helping me sort out the plock component and corosync in general.
  • To Angus Salkeld for helping me nail down the Corosync and OpenAIS differences.
  • To HJ Lee from the OpenAIS list for helping me understand the mechanisms controlling the Redundant Ring Protocol's failure detection types.
  • To Steven Dake for clarifying the to_x vs. logoutput: x arguments in openais.conf.
  • To Fabio Massimo Di Nitto for helping me get caught up with clustering and VMs.

 

Any questions, feedback, advice, complaints or meanderings are welcome.
© Alteeve's Niche! Inc. 1997-2024   Anvil! "Intelligent Availability®" Platform
legal stuff: All info is provided "As-Is". Do not use anything here unless you are willing and able to take responsibility for your own actions.