Abandoned - Two Node Fedora 14 Cluster

From Alteeve Wiki

Notice, Oct. 25, 2010 - Draft 3: I've decided to merge the original cluster paper with the Xen configuration paper. The reason for this is that I expect my next couple of uses of clustering will differ enough to warrant a dedicated paper, rather than branching off of this one.

Overview

This paper has one goal:

  • Creating a 2-node, high-availability cluster hosting Xen virtual machines.

We will introduce and use the following technologies:

  • RHCS, Red Hat Cluster Suite version 3, aka "Cluster 3", running on Fedora 14 x86_64.
    • RHCS implements:
      • cman; The cluster manager.
      • corosync; The cluster engine, implementing the totem protocol and cpg and other core cluster services.
      • rgmanager; The resource group manager handles restoring and failing over services in the cluster, including our Xen VMs.
  • Fencing devices needed to keep a cluster safe.
    • Two fencing types are discussed;
      • IPMI; The most common fence method used in servers.
      • Node Assassin; A home-brew fence device ideal for learning or as a backup to IPMI.
  • Xen; The virtual server hypervisor.
    • Converting the host OS into the special access dom0 virtual machine.
    • Provisioning domU VMs.
    • Making the VMs highly available.

Prerequisites

It is expected that you are already comfortable with the Linux command line, specifically bash, and that you are familiar with general administrative tasks in Red Hat based distributions, specifically Fedora. You will also need to be comfortable using editors like vim, nano or similar. This paper uses vim in examples. Simply substitute your favourite editor in its place.

You are also expected to be comfortable with networking concepts. You will be expected to understand TCP/IP, multicast, broadcast, subnets and netmasks, routing and other relatively basic networking concepts. Please take the time to become familiar with these concepts before proceeding.

This said, where feasible, as much detail as is possible will be provided. For example, all configuration file locations will be shown and functioning sample files will be provided.

Platform

This paper will implement the Red Hat Cluster Suite, known as RHCS, using the Fedora 14 x86_64 distribution. If you are on an i386 (32 bit) system, you should be able to follow along fine. Simply replace x86_64 with i386 or i686 in package names.

You can either download the stock Fedora 14 DVD ISO, or you can try out the alpha AN!Cluster Install DVD (4.3GB iso). If you use the latter, please test it out on a development or test cluster. If you have any problems with the AN!Cluster variant of the Fedora distro, please contact me and let me know what your trouble was.

Why Fedora 14?

Generally speaking, I much prefer to use a server-oriented Linux distribution like CentOS, Debian or similar. However, there have been many recent changes in the Linux-Clustering world that have made all of the currently available server-class distributions obsolete. With luck, once Red Hat Enterprise Linux and CentOS version 6 are released, this will change.

Until then, Fedora version 14 provides the most up to date binary releases of the new implementation of the clustering stack available. For this reason, F14 is the best choice in clustering, if you want to be current. To mitigate some of the issues introduced by using a workstation distribution, many packages will be stripped out of the default install.

Focus and Goal

Clusters can serve to solve three problems; Reliability, Performance and Scalability.

This paper will build a cluster designed to be more reliable, also known as a High-Availability cluster or simply HA Cluster.

At the end of this paper, you should have a fully functioning two-node cluster capable of hosting "floating" virtual servers. That is, VMs that exist on one node that can be easily moved to the other node with minimal or no downtime.

Begin

Let's begin!

Hardware

We will need two physical servers each with the following hardware:

  • One or more multi-core CPUs with Virtualization support.
  • Three network cards; At least one should be gigabit or faster.
  • One or more hard drives.
  • You will need some form of a fence device. This can be an IPMI-enabled server, a Node Assassin, a fenceable PDU or similar.

This paper uses the following hardware:

  • ASUS M4A78L-M
  • AMD Athlon II x2 250
  • 2GB Kingston DDR2 KVR800D2N6K2/4G (4GB kit split between the two nodes)
  • 1x Intel 82540 PCI NIC
  • 1x D-Link DGE-560T
  • Node Assassin

This is not an endorsement of the above hardware. I bought what was within my budget that would serve the purposes of creating this document. What you purchase shouldn't matter, so long as the minimum requirements are met.

Pre-Assembly

With multiple NICs, it is quite likely that the mapping of physical devices to logical ethX devices may not be ideal. This is a particular issue if you decide to network boot your install media. In that case, if the wrong NIC is chosen for eth0, you will be presented with a list of MAC addresses to attempt setup with.

Before you assemble your servers, record their network cards' MAC addresses. I like to keep simple text files like these:

cat an-node01.mac
90:E6:BA:71:82:EA	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
00:21:91:19:96:53	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter
00:0E:0C:59:46:E4	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller
cat an-node02.mac
90:E6:BA:71:82:D8	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
00:21:91:19:96:5A	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter
00:0E:0C:59:45:78	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller

Feel free to record the information in any way that suits you the best.
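
If a node can already be booted into any Linux environment, the same list can be pulled from sysfs instead of being copied off the cards by hand. This is only a rough sketch and not part of the original procedure; the function name list_macs and its optional directory argument are my own inventions for illustration.

```shell
#!/bin/bash
# list_macs [SYSFS_NET_DIR]
# Print "MAC<TAB>device" for every network device except loopback.
# SYSFS_NET_DIR defaults to /sys/class/net; the argument exists only so the
# function can be pointed at a test directory (an assumption of this sketch).
list_macs() {
    local net_dir="${1:-/sys/class/net}"
    local dev name
    for dev in "$net_dir"/*; do
        name="${dev##*/}"
        # Skip loopback and anything without a readable address file.
        [ "$name" = "lo" ] && continue
        [ -r "$dev/address" ] || continue
        # Upper-case the MAC to match the format used in the .mac files.
        printf '%s\t%s\n' "$(tr 'a-f' 'A-F' < "$dev/address")" "$name"
    done
}
```

Running list_macs > an-node01.mac on the first node would give a starting file that you can then annotate with the card descriptions.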

OS Install

Start with a stock Fedora 14 install. This How-To uses Fedora 14 x86_64; however, it should be fairly easy to adapt to other recent Fedora versions. This document also aims to be easily ported to CentOS/RHEL version 6 once it is released. No effort is made to support CentOS or RHEL version 5, though, as they use cluster stable 2 and much has changed between the two versions.

Kickstart Files

This is a sample kickstart script I use to create per-node kickstart scripts. To adapt them, the main things to change are the root user's password hash, the name and password hash of the first non-root user and the hostname set in the eth0 device name. The adapted kickstart scripts can be rolled into a Fedora 14 DVD ISO or used on a PXE server.

Warning! This kickstart script will erase your hard drive! Adapt it, don't blindly use it!

Generic cluster node kickstart script.

AN!Cluster Install

If you are feeling brave, below is a link to a custom install DVD that contains the kickstart scripts to setup nodes and an an-cluster directory with all the configuration files.

  • Download the custom AN!Cluster v0.2.002 Install DVD. (4.5GiB iso). (Currently disabled - Reworking for F14)

Post OS Install

There are a handful of changes we will want to make now that the install is complete. Some of these are optional and you may skip them if you prefer. However, the remainder of this paper assumes these changes have been made.

Disable selinux

To allow this paper to focus on clustering, we will disable selinux. Obviously, this introduces security issues that you may not be comfortable with. If you wish to leave selinux enabled, it will be up to you to sort out issues that crop up.

To disable selinux enforcement, edit /etc/selinux/config and change SELINUX=enforcing to SELINUX=permissive. In permissive mode, policy violations are logged but not blocked. You will need to reboot in order for the change to take effect, but don't do it yet, as some of the changes to come may also need a reboot.
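
If you would rather script the edit than open the file by hand, something like the following should do it. Treat it as a sketch: the function name set_selinux_mode and the optional file argument are assumptions made for this example, and it relies on GNU sed's -i option.

```shell
#!/bin/bash
# set_selinux_mode MODE [CONFIG_FILE]
# Rewrite the SELINUX= line in the config file, keeping a .bak copy.
# CONFIG_FILE defaults to /etc/selinux/config (the argument is only there
# so the function can be tried on a scratch copy first).
set_selinux_mode() {
    local mode="$1"
    local conf="${2:-/etc/selinux/config}"
    case "$mode" in
        enforcing|permissive|disabled) ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    # The pattern is anchored on "SELINUX=" so SELINUXTYPE= is left alone.
    cp -p "$conf" "$conf.bak" &&
    sed -i "s/^SELINUX=.*/SELINUX=$mode/" "$conf"
}
```

Calling set_selinux_mode permissive as root makes the same change described above.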

Change the Default Run-Level

This is an optional step. It only improves performance; it is not required.

If you don't plan to work on your nodes directly, it makes sense to switch the default run level from 5 to 3. This prevents the window manager, like Gnome or KDE, from starting at boot. This frees up a fair amount of memory and system resources and reduces the possible attack vectors.

To do this, edit /etc/inittab, change the id:5:initdefault: line to id:3:initdefault: and then switch to run level 3:

vim /etc/inittab
id:3:initdefault:
init 3

Make Boot Messages Visible

Another optional step, in-line with the change above, is to disable the rhgb (Red Hat Graphical Boot) and quiet kernel arguments. These options provide the clean boot screen you normally see with Fedora, but they also hide a lot of boot messages that we may find helpful.

To make this change, edit the grub menu and remove the rhgb quiet arguments from the kernel /vmlinuz... line. These arguments are usually the last ones on the line. If you leave this until later you may see two or more kernel entries. Delete these arguments wherever they are found.

vim /boot/grub/menu.lst

Change:

title Fedora (2.6.34.7-56.fc14.x86_64)
	root (hd0,0)
	kernel /vmlinuz-2.6.34.7-56.fc14.x86_64 ro root=UUID=<uuid> rd_MD_UUID=<uuid> rd_MD_UUID=<uuid> <options> rhgb quiet
	initrd /initramfs-2.6.34.7-56.fc14.x86_64.img

To:

title Fedora (2.6.34.7-56.fc14.x86_64)
	root (hd0,0)
	kernel /vmlinuz-2.6.34.7-56.fc14.x86_64 ro root=UUID=<uuid> rd_MD_UUID=<uuid> rd_MD_UUID=<uuid> <options>
	initrd /initramfs-2.6.34.7-56.fc14.x86_64.img
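
When there are several kernel entries, the same edit can be scripted. The following is a sketch only, assuming GNU sed; the function name strip_quiet_boot and its optional file argument are made up for this example. Keep the backup it creates until you have rebooted successfully.

```shell
#!/bin/bash
# strip_quiet_boot [GRUB_CONF]
# Remove the standalone words "rhgb" and "quiet" from every kernel line.
# GRUB_CONF defaults to /boot/grub/menu.lst; a .bak copy is kept.
strip_quiet_boot() {
    local conf="${1:-/boot/grub/menu.lst}"
    cp -p "$conf" "$conf.bak" &&
    # --follow-symlinks matters if menu.lst is a symlink to grub.conf;
    # without it, sed -i would replace the link with a regular file.
    sed -i --follow-symlinks -E '
        /^[[:space:]]*kernel[[:space:]]/ {
            s/[[:space:]]+rhgb([[:space:]]|$)/\1/g
            s/[[:space:]]+quiet([[:space:]]|$)/\1/g
        }' "$conf"
}
```

Only lines beginning with the word kernel are touched, so title and initrd lines are left exactly as they were.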

Setup Networking

There are several things to do with regard to networking before we can proceed with our cluster setup. The next few sections will walk through them, one at a time.

Warning About Managed Switches

WARNING: Please pay attention to this warning. The vast majority of cluster problems end up being network related. The hardest ones to diagnose are usually multicast issues.

If you use a managed switch, be careful about enabling Multicast IGMP Snooping or Spanning Tree Protocol. They have been known to cause problems by not allowing multicast packets to reach all nodes. This can cause somewhat random break-downs in communication between your nodes, leading to fences and DLM lock timeouts.

If your switches support PIM Routing, be sure to use it.

If you have problems with your cluster not forming, or seemingly random fencing, try using an unmanaged switch or use a cross-over cable. If the problem goes away, you are most likely dealing with a managed switch configuration problem.

Network Layout

This setup expects you to have three physical network cards connected to three independent networks. If you don't have the ability of using three physical network interfaces, you can replicate this setup logically using VLANs.

To have a common vernacular, let's use this table to describe our networks.

Network Description      Short Name  Device Name  Suggested Subnet  NIC Properties
Back-Channel Network     BCN         eth0         192.168.1.0/24    NICs with IPMI piggy-back must be used here. Second-fastest NIC should be used here. If using a PXE server, this should be a bootable NIC.
Storage Network          SN          eth1         192.168.2.0/24    Fastest NIC should be used here.
Internet-Facing Network  IFN         eth2         192.168.3.0/24    Remaining NIC should be used here.

Take note of these concerns when planning which NIC to use on each subnet. These issues are presented in the order that they must be addressed in:

  1. If your nodes have IPMI piggy-backing on a normal NIC, that NIC must be used on the BCN subnet. Having your fence device accessible on a subnet that can be remotely accessed can pose a major security risk.
  2. The fastest NIC should be used for your SN subnet. Be sure to know which NICs support the largest jumbo frames when considering this.
  3. If you still have two NICs to choose from, use the fastest remaining NIC for your BCN subnet. This will minimize the time it takes to perform tasks like hot-migration of live virtual machines.
  4. The final NIC should be used for the IFN subnet.

Node IP Addresses

Obviously, the IP addresses you give to your nodes should be ones that suit you best. In this example, the following IP addresses are used:

            Back-Channel Network (BCN)  Storage Network (SN)  Internet-Facing Network (IFN)
an-node01   192.168.1.71                192.168.2.71          192.168.3.71
an-node02   192.168.1.72                192.168.2.72          192.168.3.72

Remove NetworkManager

Some cluster software will not start with NetworkManager installed. This is because it is designed to be a highly-adaptive network system that can accommodate frequent changes in the network. To simplify these network transitions for end-users, a lot of reconfiguration of the network is done behind the scenes.

For workstations, this is wonderful. For clustering, this can be disastrous. Transient network outages are already a risk to a cluster's stability!

So first up, make sure that NetworkManager is completely removed from your system. If you used the kickstart scripts, then it was not installed. Otherwise, run:

yum remove NetworkManager NetworkManager-gnome NetworkManager-openconnect NetworkManager-openvpn NetworkManager-pptp NetworkManager-vpnc cnetworkmanager knetworkmanager knetworkmanager-openvpn knetworkmanager-pptp knetworkmanager-vpnc libproxy-networkmanager yum-NetworkManager-dispatcher

Setup 'network'

Before proceeding with network configuration, check to see if your network cards are aligned to the proper ethX network names. If they need to be adjusted, please follow this How-To before proceeding:

There are a few ways to configure network in Fedora:

  • system-config-network (graphical)
  • system-config-network-tui (ncurses)
  • Directly editing the /etc/sysconfig/network-scripts/ifcfg-eth* files. (See: here for a full list of options)

Do not proceed until your node's networking is fully configured.
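
For those editing the ifcfg-eth* files directly, here is a rough idea of what a static configuration might look like, written out by a small helper. This is a sketch, not the configuration used on the AN!Cluster nodes: the function name write_ifcfg and the optional directory argument are assumptions, and only the most common keys are shown. The example values are an-node01's eth0 MAC and BCN address from this paper.

```shell
#!/bin/bash
# write_ifcfg DEVICE MAC IP NETMASK [DIR]
# Write a minimal static ifcfg file for one interface.
# DIR defaults to /etc/sysconfig/network-scripts (overridable for testing).
write_ifcfg() {
    local dev="$1" mac="$2" ip="$3" mask="$4"
    local dir="${5:-/etc/sysconfig/network-scripts}"
    # HWADDR ties the ethX name to a specific physical card.
    cat > "$dir/ifcfg-$dev" <<EOF
# Static configuration for $dev
DEVICE=$dev
HWADDR=$mac
BOOTPROTO=static
ONBOOT=yes
IPADDR=$ip
NETMASK=$mask
EOF
}
```

For example, write_ifcfg eth0 90:E6:BA:71:82:EA 192.168.1.71 255.255.255.0 would write the BCN interface file for an-node01, after which the network service needs to be restarted.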

Update the Hosts File

Some applications expect to be able to call nodes by their name. To accommodate this, and to ensure that inter-node communication takes place on the back-channel subnet, we remove any existing hostname entries and then add the following to the /etc/hosts file:

Note: Any pre-existing entries matching the name returned by uname -n must be removed from /etc/hosts. There is a good chance there will be an entry that resolves to 127.0.0.1 which would cause problems later.

Obviously, adapt the names and IPs to match your nodes and subnets. The only critical thing is to make sure that the name returned by uname -n is resolvable to the back-channel subnet. I like to add a short-form name for convenience.

The updated /etc/hosts file should look something like this:

vim /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

# Back Channel Network
192.168.1.71    an-node01 an-node01.alteeve.com an-node01.bcn
192.168.1.72    an-node02 an-node02.alteeve.com an-node02.bcn

# Storage Network
192.168.2.71    an-node01.sn
192.168.2.72    an-node02.sn

# Internet Facing Network
192.168.3.71    an-node01.ifn
192.168.3.72    an-node02.ifn

# Node Assassins
192.168.3.61    batou batou.alteeve.com
192.168.3.62    motoko motoko.alteeve.com

Note: I use Node Assassins. If you use IPMI or other fence devices, alter the entries as appropriate.

Now to test this, ping both nodes by their name, as returned by uname -n, and make sure the ping packets are sent on the back channel network (192.168.1.0/24).

ping -c 5 an-node01.alteeve.com
PING an-node01 (192.168.1.71) 56(84) bytes of data.
64 bytes from an-node01 (192.168.1.71): icmp_seq=1 ttl=64 time=0.399 ms
64 bytes from an-node01 (192.168.1.71): icmp_seq=2 ttl=64 time=0.403 ms
64 bytes from an-node01 (192.168.1.71): icmp_seq=3 ttl=64 time=0.413 ms
64 bytes from an-node01 (192.168.1.71): icmp_seq=4 ttl=64 time=0.365 ms
64 bytes from an-node01 (192.168.1.71): icmp_seq=5 ttl=64 time=0.428 ms

--- an-node01 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4001ms
rtt min/avg/max/mdev = 0.365/0.401/0.428/0.030 ms
ping -c 5 an-node02.alteeve.com
PING an-node02 (192.168.1.72) 56(84) bytes of data.
64 bytes from an-node02 (192.168.1.72): icmp_seq=1 ttl=64 time=0.419 ms
64 bytes from an-node02 (192.168.1.72): icmp_seq=2 ttl=64 time=0.405 ms
64 bytes from an-node02 (192.168.1.72): icmp_seq=3 ttl=64 time=0.416 ms
64 bytes from an-node02 (192.168.1.72): icmp_seq=4 ttl=64 time=0.373 ms
64 bytes from an-node02 (192.168.1.72): icmp_seq=5 ttl=64 time=0.396 ms

--- an-node02 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4001ms
rtt min/avg/max/mdev = 0.373/0.401/0.419/0.030 ms
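
The same check can be made without reading ping output, by asking the resolver directly. This is a sketch under a few assumptions: the function names are hypothetical, the back-channel subnet is taken to be 192.168.1.0/24 as in this paper, and getent is expected to consult /etc/hosts, which is typical but depends on your nsswitch.conf.

```shell
#!/bin/bash
# in_subnet_prefix IP PREFIX
# Succeed if IP starts with PREFIX followed by a dot, e.g. 192.168.1.71
# against 192.168.1. Only suitable for /24-style prefixes.
in_subnet_prefix() {
    case "$1" in
        "$2".*) return 0 ;;
        *)      return 1 ;;
    esac
}

# check_node_name [NAME] [PREFIX]
# Resolve NAME (default: uname -n) and report whether it lands on the
# back-channel prefix (default: 192.168.1).
check_node_name() {
    local name="${1:-$(uname -n)}"
    local prefix="${2:-192.168.1}"
    local ip
    ip=$(getent hosts "$name" | awk '{ print $1; exit }')
    if [ -z "$ip" ]; then
        echo "$name does not resolve" >&2
        return 1
    fi
    if in_subnet_prefix "$ip" "$prefix"; then
        echo "$name -> $ip (ok)"
    else
        echo "$name -> $ip (not on $prefix.0/24)" >&2
        return 1
    fi
}
```

Run check_node_name with no arguments on each node; a result of 127.0.0.1 means a stale entry is still in /etc/hosts.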

If your fence device is referenced by name, be sure to include entries to resolve it as well. You can see how I've done this with the two Node Assassin devices I use. The same applies to IPMI or other devices, if you plan to reference them by name.

Fencing will be discussed in more detail later on in this HowTo.

Disable Firewalls

Be sure to flush netfilter tables and disable iptables and ip6tables from starting on our nodes.

There will be enough potential sources of problems as it is. Disabling firewalls at this stage will minimize the chance of an errant iptables rule messing up our configuration. If, before launch, you wish to implement a firewall, feel free to do so but be sure to thoroughly test your cluster to ensure no problems were introduced.

chkconfig --level 2345 iptables off
/etc/init.d/iptables stop
chkconfig --level 2345 ip6tables off
/etc/init.d/ip6tables stop

Setup SSH Shared Keys

This is an optional step. Setting up shared keys will allow your nodes to pass files between one another and execute commands remotely without needing to enter a password. This is obviously somewhat risky from a security point of view. As such, it is up to you whether you do this or not. This is not meant to be a security-focused How-To, so please independently study the risks.

If you're a little new to SSH, it can be a bit confusing keeping connections straight in your head. When you connect to a remote machine, you start the connection on your machine as the user you are logged in as. This is the source user. When you call the remote machine, you tell the machine what user you want to log in as. This is the remote user.

You will need to create an SSH key for each source user, and then you will need to copy the newly generated public key to each remote machine's user directory that you want to connect to. In this example, we want to connect to either node, from either node, as the root user. So we will create a key for each node's root user and then copy the generated public key to the other node's root user's directory.

For each user, on each machine you want to connect from, run:

# The '2047' is just to screw with brute-forces a bit. :)
ssh-keygen -t rsa -N "" -b 2047 -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
08:d8:ed:72:38:61:c5:0e:cf:bf:dc:28:e5:3c:a7:88 root@an-node01.alteeve.com
The key's randomart image is:
+--[ RSA 2047]----+
|     ..          |
|   o.o.          |
|  . ==.          |
|   . =+.         |
|    + +.S        |
|     +  o        |
|       = +       |
|     ...B o      |
|    E ...+       |
+-----------------+

This will create two files: the private key called ~/.ssh/id_rsa and the public key called ~/.ssh/id_rsa.pub. The private key must never be group or world readable! That is, it should be set to mode 0600.

The two files should look like:

Private key:

cat ~/.ssh/id_rsa
-----BEGIN RSA PRIVATE KEY-----
MIIEoQIBAAKCAQBlL42DC+NJVpJ0rdrWQ1rxGEbPrDoe8j8+RQx3QYiB014R7jY5
EaTenThxG/cudgbLluFxq6Merfl9Tq2It3k9Koq9nV9ZC/vXBcl4MC7pGSQaUw2h
DVwI7OCSWtnS+awR/1d93tANXRwy7K5ic1pcviJeN66dPuuPqJEF/SKE7yEBapMq
sN28G4IiLdimsV+UYXPQLOiMy5stmyGhFhQGH3kxYzJPOgiwZEFPZyXinGVoV+qa
9ERSjSKAL+g21zbYB/XFK9jLNSJqDIPa//wz0T+73agZ0zNlxygmXcJvapEsFGDG
O6tcy/3XlatSxjEZvvfdOnC310gJVp0bcyWDAgMBAAECggEAMZd0y91vr+n2Laln
r8ujLravPekzMyeXR3Wf/nLn7HkjibYubRnwrApyNz11kBfYjL+ODqAIemjZ9kgx
VOhXS1smVHhk2se8zk3PyFAVLblcsGo0K9LYYKd4CULtrzEe3FNBFje10FbqEytc
7HOMvheR0IuJ0Reda/M54K2H1Y6VemtMbT+aTcgxOSOgflkjCTAeeOajqP5r0TRg
1tY6/k46hLiBka9Oaj+QHHoWp+aQkb+ReHUBcUihnz3jcw2u8HYrQIO4+v4Ud2kr
C9QHPW907ykQTMAzhMvZ3DIOcqTzA0r857ps6FANTM87tqpse5h2KfdIjc0Ok/AY
eKgYAQKBgQDm/P0RygIJl6szVhOb5EsQU0sBUoMT3oZKmPcjHSsyVFPuEDoq1FG7
uZYMESkVVSYKvv5hTkRuVOqNE/EKtk5bwu4mM0S3qJo99cLREKB6zNdBp9z2ACDn
0XIIFIalXAPwYpoFYi1YfG8tFfSDvinLI6JLDT003N47qW1cC5rmgQKBgHAkbfX9
8u3LiT8JqCf1I+xoBTwH64grq/7HQ+PmwRqId+HyyDCm9Y/mkAW1hYQB+cL4y3OO
kGL60CZJ4eFiTYrSfmVa0lTbAlEfcORK/HXZkLRRW03iuwdAbZ7DIMzTvY2HgFlU
L1CfemtmzEC4E6t5/nA4Ytk9kPSlzbzxfXIDAoGAY/WtaqpZ0V7iRpgEal0UIt94
wPy9HrcYtGWX5Yk07VXS8F3zXh99s1hv148BkWrEyLe4i9F8CacTzbOIh1M3e7xS
pRNgtH3xKckV4rVoTVwh9xa2p3qMwuU/jMGdNygnyDpTXusKppVK417x7qU3nuIv
1HzJNPwz6+u5GLEo+oECgYAs++AEKj81dkzytXv3s1UasstOvlqTv/j5dZNdKyZQ
72cvgsUdBwxAEhu5vov1XRmERWrPSuPOYI/4m/B5CYbTZgZ/v8PZeBTg17zgRtgo
qgJq4qu+fXHKweR3KAzTPSivSiiJLMTiEWb5CD5sw6pYQdJ3z5aPUCwChzQVU8Wf
YwKBgQCvoYG7gwx/KGn5zm5tDpeWb3GBJdCeZDaj1ulcnHR0wcuBlxkw/TcIadZ3
kqIHlkjll5qk5EiNGNlnpHjEU9X67OKk211QDiNkg3KAIDMKBltE2AHe8DhFsV8a
Mc/t6vHYZ632hZ7b0WNuudB4GHJShOumXD+NfJgzxqKJyfGkpQ==
-----END RSA PRIVATE KEY-----

Public key:

cat ~/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAGUvjYML40lWknSt2tZDWvEYRs+sOh7yPz5FDHdBiIHTXhHuNjkRpN6dOHEb9y52BsuW4XGrox6t+X1OrYi3eT0qir2dX1kL+9cFyXgwLukZJBpTDaENXAjs4JJa2dL5rBH/V33e0A1dHDLsrmJzWly+Il43rp0+64+okQX9IoTvIQFqkyqw3bwbgiIt2KaxX5Rhc9As6IzLmy2bIaEWFAYfeTFjMk86CLBkQU9nJeKcZWhX6pr0RFKNIoAv6DbXNtgH9cUr2Ms1ImoMg9r//DPRP7vdqBnTM2XHKCZdwm9qkSwUYMY7q1zL/deVq1LGMRm+9906cLfXSAlWnRtzJYM= root@an-node01.alteeve.com

Copy the public key and then ssh normally into the remote machine as the root user. Create a file called ~/.ssh/authorized_keys and paste in the key.

From an-node01, type:

ssh root@an-node02
The authenticity of host 'an-node02 (192.168.1.72)' can't be established.
RSA key fingerprint is d4:1b:68:5f:fa:ef:0f:0c:16:e7:f9:0a:8d:69:3e:c5.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'an-node02,192.168.1.72' (RSA) to the list of known hosts.
root@an-node02's password: 
Last login: Fri Oct  1 20:07:01 2010 from 192.168.1.102

You will now be logged into an-node02 as the root user. Create the ~/.ssh/authorized_keys file and paste into it the public key from an-node01. If the remote machine's user hasn't used ssh yet, their ~/.ssh directory will not exist.

cat ~/.ssh/authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAGUvjYML40lWknSt2tZDWvEYRs+sOh7yPz5FDHdBiIHTXhHuNjkRpN6dOHEb9y52BsuW4XGrox6t+X1OrYi3eT0qir2dX1kL+9cFyXgwLukZJBpTDaENXAjs4JJa2dL5rBH/V33e0A1dHDLsrmJzWly+Il43rp0+64+okQX9IoTvIQFqkyqw3bwbgiIt2KaxX5Rhc9As6IzLmy2bIaEWFAYfeTFjMk86CLBkQU9nJeKcZWhX6pr0RFKNIoAv6DbXNtgH9cUr2Ms1ImoMg9r//DPRP7vdqBnTM2XHKCZdwm9qkSwUYMY7q1zL/deVq1LGMRm+9906cLfXSAlWnRtzJYM= root@an-node01.alteeve.com
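
If you would rather not paste the key by hand, the steps above can be wrapped in a small helper run on the remote machine. This is a sketch; the function name install_pubkey and its optional home-directory argument are assumptions for illustration. It creates ~/.ssh if needed, tightens the permissions and appends the key only if it is not already there. (OpenSSH also ships ssh-copy-id, which does much the same from the source machine.)

```shell
#!/bin/bash
# install_pubkey PUBKEY_FILE [HOME_DIR]
# Append the public key in PUBKEY_FILE to HOME_DIR/.ssh/authorized_keys.
# HOME_DIR defaults to $HOME (overridable so this can be tried in a scratch dir).
install_pubkey() {
    local pubkey="$1"
    local home="${2:-$HOME}"
    local auth="$home/.ssh/authorized_keys"
    # sshd may refuse keys when these are group or world writable.
    mkdir -p "$home/.ssh" && chmod 700 "$home/.ssh" || return 1
    touch "$auth" && chmod 600 "$auth" || return 1
    # -x whole line, -F fixed string, -f read the pattern from the key file.
    grep -qxFf "$pubkey" "$auth" || cat "$pubkey" >> "$auth"
}
```

For example, copy an-node01's id_rsa.pub over to an-node02 with scp, then run install_pubkey against the copied file as root on an-node02.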

Now log out and then log back into the remote machine. This time, the connection should succeed without having entered a password!

Initial Cluster Setup

Before we get into specifics, let's take a minute to talk about the major components used in our cluster.

Core Program Overviews

These are the core programs that may be new to you that we will use to build our cluster. Before we configure them, let's take a minute to understand their roles.

Cluster Manager

The cluster manager, cman, takes the /etc/cluster/cluster.conf configuration file and uses it to configure and start the various cluster elements. This is where nodes are defined, fence devices are configured and various cluster tolerances are set.

Corosync

Corosync is, essentially, the clustering kernel which provides the essential services for cluster-aware applications and other cluster services. At its core is the totem protocol which provides cluster membership and ordered messaging. On top of this, it implements a number of core clustering services like closed process group messaging, quorum and confdb (an object database).

Its goal is to provide a substantially simpler and more flexible set of APIs to facilitate clustering in Linux. It manages which nodes are in the cluster, it triggers error messages when something fails, manages cluster locking and so on. Most other clustered applications rely on corosync to know when something has happened or to announce when a cluster-related action has taken place.

Please note that the corosync_overview man page is considered out of date at the time of this writing.

cpg; Closed Process Group

Note: Only install the following if you wish to review the man page.

yum install corosynclib-devel
man cpg_overview

The closed process group is a small process layer on top of the totem protocol provided by corosync. It handles the sending and delivery of messages amongst nodes in a consistent order. It adds PIDs and group names to the membership layer. Only members of the group get the messages, thus it is a "closed" group.

OpenAIS

Note: We will not use OpenAIS. It is included here to explain its role to folks coming from the Cluster stable 2 environment.

OpenAIS is now an extension to Corosync that adds an open-source implementation of the Service Availability (SA) Forum's 'Application Interface Specification' (AIS). It is an API and policy designed to be used by applications concerned with maintaining services during faults. AIS implements the 'Availability Management Framework' (AMF) which, in turn, provides for application fail over, cluster management (CLM), Checkpointing (CKPT), Eventing (EVT), Messaging (MSG), and Distributed Locking (DLOCK).

It does this by implementing pluggable shared libraries that get loaded into corosync at runtime. These libraries then can make use of corosync's internal API. Both corosync and openais services provide IPC interfaces so that application programmers can make use of their behaviour. In short: applications can use OpenAIS to be cluster-aware. Its libraries are used by some applications, including Pacemaker. In our application, we will only be using its libraries indirectly.

Setup Core Programs

We now need to edit and create the configuration files for our cluster.

Setup cman

You may only need to configure this section.

First, install the cluster manager:

yum -y install cman

Setting Up /etc/cluster/cluster.conf

Once installed, we need to create and fill in the /etc/cluster/cluster.conf file. This is the core configuration file of our cluster and is an XML formatted file.

Notice

This is, in many ways, the "core" of your cluster. This section is short, but expect to spend some time getting the right setup for your environment. Please pay careful attention to all options and thoroughly test your configuration before going into production!

DO NOT SKIP THIS SECTION!

Sorry for yelling, but I've heard of too many people missing this. The link below is very, very important to building your cluster. It is its own article only because of its size. Including it directly would come close to doubling the size of this tutorial, and most of the options are safe to leave undefined as the defaults are sane. Do read through them though, as the notes in the comments will help you understand how this file, and in turn your cluster, works.

vim /etc/cluster/cluster.conf

The example in that article details many of the common options you will want to use. For a more complete discussion of this file, please run man 5 cluster.conf. Also, please be aware that some arguments can overlap between corosync and this file.

Here is a real-world example configuration used in the development two-node AN!Cluster. This will likely not work for you, but should act as a good base to adapt and expand on.

<?xml version="1.0"?>
<cluster name="an-cluster" config_version="1">
    <cman two_node="1" expected_votes="1"/>
    <totem secauth="off" rrp_mode="active"/>
    <clusternodes>
        <clusternode name="an-node01.alteeve.com" nodeid="1">
            <fence>
                <method name="node_assassin">
                    <device name="batou" port="01" action="reboot"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="an-node02.alteeve.com" nodeid="2">
            <fence>
                <method name="node_assassin">
                    <device name="batou" port="02" action="reboot"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <fencedevices>
        <fencedevice name="batou" agent="fence_na" ipaddr="batou.alteeve.com" login="username" passwd="secret" quiet="1"/>
    </fencedevices>
</cluster>

Once your version of the cluster.conf is done, you need to verify it. Be sure to validate it against the cluster.rng validation file.

xmllint --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf

If there are errors, fix them. Once it is formatted properly, the contents of your cluster.conf file will be returned followed by "/etc/cluster/cluster.conf validates". Once this is the case, you can move on.
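
A file can validate and still fail to start if the node names in it don't match the nodes themselves. As far as I know, cman normally finds the local node's entry by matching the hostname against the clusternode name attributes, so a quick cross-check against uname -n is worthwhile. The sketch below is my own helper, not part of cman; the function names are hypothetical and the sed pattern assumes one clusternode element per line, as in the example above.

```shell
#!/bin/bash
# cluster_node_names [CLUSTER_CONF]
# Print the name attribute of every <clusternode ...> element, one per line.
cluster_node_names() {
    local conf="${1:-/etc/cluster/cluster.conf}"
    sed -n 's/.*<clusternode[[:space:]][^>]*name="\([^"]*\)".*/\1/p' "$conf"
}

# this_node_in_cluster [CLUSTER_CONF] [NODE_NAME]
# Succeed if NODE_NAME (default: uname -n) is one of the configured nodes.
this_node_in_cluster() {
    local conf="${1:-/etc/cluster/cluster.conf}"
    local me="${2:-$(uname -n)}"
    cluster_node_names "$conf" | grep -qxF "$me"
}
```

Run this_node_in_cluster && echo ok on each node before trying to start cman.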

Note: Normally at this stage, you'd use chkconfig to enable cman to start at boot. However, depending on how you use your cluster, you may need to modify the boot order. For now, let's leave it off until we've got all the components enabled. This is particularly important if you will be using DRBD, CLVM or Xen.

chkconfig cman off

To see if it's working though, have a terminal open to both nodes and manually start cman. I suggest having two terminal windows open on each node. In one, start tailing /var/log/messages:

clear; tail -f -n 0 /var/log/messages

Now, with both log files being watched, start cman on both nodes.

Note: You MUST execute the following on both nodes within the time limit set by post_join_delay (default is 6 seconds). If you wait longer than this, the first node started will fence the other node, assuming you have fencing configured properly. If fencing is not configured properly, your first node will hang.

/etc/init.d/cman start
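
If typing fast enough in two terminals is a problem, the start can be fired at both nodes at once over SSH. This is a sketch only. It assumes the machine you run it from can log into each node as root without a password (see the shared keys section), and the RUNNER variable is a hypothetical hook that exists purely so the function can be dry-run with echo.

```shell
#!/bin/bash
# start_cman_on NODE [NODE...]
# Start cman on every listed node in parallel so they all join within the
# post_join_delay window, then wait for all of the starts to return.
# RUNNER defaults to ssh; set RUNNER=echo to print the commands instead.
start_cman_on() {
    local runner="${RUNNER:-ssh}"
    local node
    for node in "$@"; do
        "$runner" "root@$node" /etc/init.d/cman start &
    done
    # Block until every background start has finished.
    wait
}
```

For example: start_cman_on an-node01 an-node02. Keep the log tails running while you do this.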

Examine the log files to verify there were no errors. If there were, fix them. If there were none, fantastic!

Warning: This last step will forcefully reboot your nodes. Do not do this on a production cluster! If it must be done on a production server, stop all non-essential services and be sure you have a good backup, just in case.

Lastly, test that your fencing is configured and working properly. We do this by using a program called fence_node. With both nodes up and running cman, call a fence against the other node. Be sure to be watching the log files.

From an-node01.alteeve.com

fence_node an-node02.alteeve.com
fence an-node02.alteeve.com success

If it worked, when the node comes back up, repeat the process from an-node02.alteeve.com against an-node01.alteeve.com.

Setup Corosync

Note: In our cluster, we will not need to configure Corosync.

When using cman, as we are here, Corosync's configuration file is ignored.

Most of the settings available in corosync.conf can be found in the cluster.conf file. Please see the cluster.conf article for a comprehensive list of what is and is not supported.

If you want to configure corosync and not use cman, please see the following article:

Fencing

Before proceeding with the cluster.conf file, you must understand what fencing is, how it is used in Red Hat/CentOS clusters and why it is so important.

  • The Cluster Admin's Mantra:
    • The only thing you don't know is what you don't know.

Just because one node loses communication with another node, it cannot assume that the silent node is dead!

What is it?

"Fencing" is the act of isolating a malfunctioning node. The goal is to prevent a split-brain condition where two nodes think the other member is dead and continue to use a shared resource. When this happens, file system corruption is almost guaranteed. Another dangerous scenario would be if one node paused while writing to a disk, the other node decides it's dead and starts to replay the journal, then the first node recovers and completes the write. The results would be equally disastrous. If you are lucky enough to not lose the shared file system, you will be faced with the task of determining what data got written to which node, merging that data and/or overwriting the node you trust the least. This 'best case' is still pretty lousy.

Fencing, isolating a node from altering shared disks, can be accomplished in a couple of ways:

  • Power
    • Power fencing is where a device is used to cut the power to a malfunctioning node. This is probably the most common type.
  • Blocking
    • Blocking is often implemented at the network level. This type of fencing leaves the node alone, but disconnects it from the storage network. Often this is done by a switch which prevents traffic coming from the fenced node.

Those familiar with heartbeat clusters will know "fencing" by the HA Linux term "STONITH", literally, Shoot The Other Node In The Head.

Misconception

It is a very common mistake to ignore fencing when first starting to learn about clustering. Often people think "It's just for production systems, I don't need to worry about it yet because I don't care what happens to my test cluster."

Wrong!

First, and most practically: the cluster software will block all I/O transactions when it can't guarantee a fence operation succeeded. The result is that your cluster will essentially "lock up". Likewise, cman and related daemons will fail if they can't find a fence agent to use.

Secondly, testing your cluster will involve inducing errors. Without proper fencing, there is a high probability that your shared file system will be corrupted. That would force you to start over, making your learning take a lot longer than it needs to.

Fence Devices

Many major OEMs have their own remote management devices that can serve as fence devices. Examples are Dell's 'DRAC' (Dell Remote Access Controller), HP's 'iLO' (Integrated Lights-Out), IBM's 'RSA' (Remote Supervisor Adapter), Sun's 'SSP' (System Service Processor) and so on. Smaller manufacturers implement remote management via IPMI, the Intelligent Platform Management Interface.

In the above devices, fencing is implemented via a built-in or integrated device inside the server. These devices are usually accessible even when the host server is powered off or hard locked. Via these devices, the host server can be powered off, reset and powered on remotely, regardless of the state of the host server.

Block fencing is possible when the device connecting a node to shared resources, like a fiber-channel SAN switch, provides a method of logically "unplugging" a defective node from the shared resource, leaving the node itself alone.

Implementation

How you implement fencing depends largely on what kind of fence device(s) you have. This is where I need help from you, dear reader. I only have access to Node Assassin fence devices on my lab's cluster and IPMI at work. I will show examples below using these devices. If you have other fence devices, like an addressable PDU or so on, please let me know so that I can expand this section.

Using Node Assassin To Demonstrate Fencing

In Red Hat's cluster software, the fence device(s) are configured in the main /etc/cluster/cluster.conf cluster configuration file. This configuration is then acted on by the fence daemon, fenced.

When the cluster determines that a node needs to be fenced, the fenced daemon will consult the cluster.conf file for information on how to access the fence device. Given this cluster.conf snippet:

<cluster name="an-cluster" config_version="1">
	<clusternodes>
		<clusternode name="an-node02.alteeve.com" nodeid="2">
			<fence>
				<method name="node_assassin">
					<device name="motoko" port="02" action="off"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="motoko" agent="fence_na" quiet="true"
		ipaddr="motoko.alteeve.com" login="motoko" passwd="secret">
		</fencedevice>
	</fencedevices>
</cluster>

If the cluster manager (corosync, specifically) determines that the node an-node02.alteeve.com needs to be fenced, fenced looks at the first (and only, in this case) <method> entry inside that node's <fence> element and reads the name of its <device>, which is motoko in this case. It gathers the other variables set there and then looks in the <fencedevices> section for the <fencedevice> with the matching name. With these two sections, it now has all the variable-value pairs needed to pass to the fence agent script set in the agent variable.

So in this example, fenced looks up the details on the motoko Node Assassin fence device. It calls the fence_na program, called a fence agent, and passes the following arguments as a series of lines to the agent via STDIN.

ipaddr=motoko.alteeve.com
login=motoko
passwd=secret
quiet=true
port=02
action=off

How the fence agent acts on these arguments varies depending on the fence device itself. To continue using Node Assassin as an example, the fence_na fence agent will create a connection to the device at the IP address (or resolvable name, as in this case) specified in the ipaddr argument. Once connected, it will authenticate using the login and passwd arguments. Once authenticated, it tells the device what port to act on. Finally, it tells the device what action to take. Here, the action is off which, internally, translates as hitting the reset switch, then pressing and holding the power switch long enough to force a power down. As a last step, it checks to see if there is power still coming from the node to determine whether the fence succeeded or not. In other devices, the port could be a power jack, a network switch port and so on.
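
To make the STDIN hand-off a little more concrete, here is a minimal sketch of how an agent might pick one option out of the name=value lines it is sent. This is not how fence_na itself is written; the get_fence_arg helper is a name I made up for illustration, and the arguments are simply the ones from the example above.

```shell
# Hypothetical helper: read "name=value" lines from STDIN, as fenced sends
# them, and print the value of the one option asked for in $1.
# This only illustrates the hand-off; it is not how fence_na is written.
get_fence_arg() {
    wanted=$1
    while IFS= read -r line; do
        case $line in
            ''|'#'*) continue ;;                      # skip blank lines and comments
            "$wanted"=*) printf '%s\n' "${line#*=}"   # print what follows the first "="
                         return 0 ;;
        esac
    done
    return 1                                          # the option was not supplied
}

# Arguments like those in the example, one per line, as they would arrive.
args='ipaddr=motoko.alteeve.com
login=motoko
passwd=secret
quiet=true
port=02
action=off'

printf '%s\n' "$args" | get_fence_arg action
```

A real agent would read all of the options in one pass and validate them before touching the device. The sketch only shows that the interface is nothing more exotic than text lines on STDIN.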

Once the agent completes, it reports success or failure in the form of a prescribed exit code. If the first attempt fails, fenced will try the next <method> under the node's <fence> element, if a second exists. It will keep trying methods in the order they are found in the cluster.conf file until it runs out of them. If it fails to fence the node, most daemons will "block", that is, lock up and stop responding until the issue is resolved. The logic for this is that a locked up cluster is better than a corrupted one.

If any of the fence devices succeed though, the cluster will know that it is safe to proceed and will reconfigure the cluster without the defective node.
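
The "try each one in order until something works" behaviour described above can be sketched in a few lines. This is only an illustration of the logic, not the fence daemon's actual code. The try_fence_methods name is hypothetical, and each argument is just a command standing in for one fence method.

```shell
# Hypothetical sketch of the "try each method in order" logic.
# Each argument is a command standing in for one fence method; a zero exit
# status means that method fenced the node.
try_fence_methods() {
    for method in "$@"; do
        if $method; then
            echo "fenced via: $method"
            return 0          # safe to reconfigure the cluster
        fi
    done
    echo "all fence methods failed; the cluster must block" >&2
    return 1
}

# 'false' and 'true' stand in for a failing and a working fence method.
try_fence_methods false true
```

The important part is the last branch. When every method fails there is no "give up and carry on" path, which is exactly why a cluster without working fencing locks up.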

An Example IPMI Configuration

IPMI is perhaps the most common method used for fencing nodes in a cluster. It requires having a system board with an IPMI baseboard management controller (BMC). IPMI is often the underlying technology used by many OEM-branded remote access tools. If you don't see a specific fence agent for your server's remote access application, experiment with generic IPMI tools.

Given the ubiquity of IPMI, I'd like to take the time to walk through configuring and implementing IPMI in a cluster.

IPMI is generally configured using tools provided by the manufacturer of your server. Often there is a web interface with a default IP address, user name and password. In many cases though, you can also configure your IPMI device from the command line using user-space tools. The next section will walk you through an example setup of an IPMI device. If you've already configured yours though, note the IP addresses, user names and passwords that you assigned and skip down one section.

Configuring IPMI From The Command Line

To start, we need to install the IPMI user software.

yum install ipmitool freeipmi.x86_64 freeipmi-bmc-watchdog.x86_64 freeipmi-ipmidetectd.x86_64 OpenIPMI.x86_64

Once installed, you should be able to check the local IPMI BMC using ipmitool once you've started the ipmi daemon.

/etc/init.d/ipmi start
Starting ipmi drivers:                                     [  OK  ]
ipmitool chassis status
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : false
Power Control Fault  : false
Power Restore Policy : always-off
Last Power Event     : command
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false
Front Panel Control  : none

If you see that, you're doing well. You can now check the current configuration using the following command.

ipmitool -I open lan print 1
Set in Progress         : Set Complete
Auth Type Support       : NONE MD2 MD5 OEM 
Auth Type Enable        : Callback : NONE MD2 MD5 OEM 
                        : User     : NONE MD2 MD5 OEM 
                        : Operator : NONE MD2 MD5 OEM 
                        : Admin    : NONE MD2 MD5 OEM 
                        : OEM      : 
IP Address Source       : Static Address
IP Address              : 10.255.128.1
Subnet Mask             : 255.255.0.0
MAC Address             : 00:e0:81:aa:bb:cc
SNMP Community String   : AMI
IP Header               : TTL=0x00 Flags=0x00 Precedence=0x00 TOS=0x00
BMC ARP Control         : ARP Responses Disabled, Gratuitous ARP Disabled
Gratituous ARP Intrvl   : 0.0 seconds
Default Gateway IP      : 192.168.1.1
Default Gateway MAC     : 00:00:00:00:00:00
Backup Gateway IP       : 0.0.0.0
Backup Gateway MAC      : 00:00:00:00:00:00
802.1q VLAN ID          : Disabled
802.1q VLAN Priority    : 0
RMCP+ Cipher Suites     : 1,2,3,6,7,8,11,12,0,0,0,0,0,0,0,0
Cipher Suite Priv Max   : aaaaXXaaaXXaaXX
                        :     X=Cipher Suite Unused
                        :     c=CALLBACK
                        :     u=USER
                        :     o=OPERATOR
                        :     a=ADMIN
                        :     O=OEM

You can change the MAC address, but this isn't advised without a good reason. That said, here is an example set of commands that will configure, save and then check that the settings took. Adapt the values to suit your environment and preferences.

# Don't change the MAC without a good reason. If you need to though, this should work.
#ipmitool -I open lan set 1 macaddr 00:e0:81:aa:bb:cd

# Set the IP to be static (instead of DHCP)
ipmitool -I open lan set 1 ipsrc static

# Set the IP, default gateway and subnet mask address of the IPMI interface.
ipmitool -I open lan set 1 ipaddr 192.168.1.171
ipmitool -I open lan set 1 defgw ipaddr 192.168.1.1
ipmitool -I open lan set 1 netmask 255.255.255.0

# Set the password.
ipmitool -I open lan set 1 password secret
ipmitool -I open user set password 2 secret

# Set the snmp community string, if appropriate
ipmitool -I open lan set 1 snmp alteeve

# Enable access
ipmitool -I open lan set 1 access on

# Reset the IPMI BMC to make sure the changes took effect.
ipmitool mc reset cold

# Wait a few seconds and then re-run the call that dumped the setup to ensure
# it is now what we want.
sleep 5
ipmitool -I open lan print 1

If all went well, you should see the same output as above, but now with your new configuration.
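
If you have more than one node to set up, it can help to print the network-related calls for each BMC and read them over before running anything. The sketch below does only that. The ipmi_lan_plan name is hypothetical, the ipmitool calls are the same ones used in the walkthrough above, and nothing is executed against the BMC.

```shell
# Hypothetical dry-run helper: print the ipmitool calls for one BMC so they
# can be reviewed before being run by hand. Nothing is executed here.
ipmi_lan_plan() {
    ip=$1 gw=$2 mask=$3
    echo "ipmitool -I open lan set 1 ipsrc static"
    echo "ipmitool -I open lan set 1 ipaddr $ip"
    echo "ipmitool -I open lan set 1 defgw ipaddr $gw"
    echo "ipmitool -I open lan set 1 netmask $mask"
}

# The same values as the example above.
ipmi_lan_plan 192.168.1.171 192.168.1.1 255.255.255.0
```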

Testing IPMI

If you skipped the previous step and just want the minimal setup that works, you only need to install one package.

yum install ipmitool

Regardless of how you configured your IPMI devices, you will want to test now that you can check the power state of your nodes using the IPMI interface.

You can only query the status of the remote node(s) this way. You can't use the following command to check your local node.

In the example below, we will check the state of an-node01 from an-node02. Note that here we use the IP address directly, but in practice I like to use a name that resolves to the IP address of the IPMI interface (denoted by a .ipmi suffix).

ipmitool -I lan -H 192.168.1.171 -U admin -P secret chassis power status
Chassis Power is on

You could replace status with on, off and cycle to remotely boot, power off and reboot (power cycle) the node. This is, in fact, what the fence_ipmilan fence agent does. If you can afford to stop your servers, it's a good idea to play with the various power options to see how they work. There are a few more options than what I mentioned here, which you can read about in man ipmitool.
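
If you want to use the power state in a script of your own, the one-line answer from ipmitool is easy to reduce to a bare on or off. The sketch below assumes the "off" answer has the same shape as the "on" answer shown above. The power_state name is hypothetical.

```shell
# Hypothetical helper: reduce the "Chassis Power is on|off" line that
# 'ipmitool ... chassis power status' prints to a bare "on" or "off".
power_state() {
    case $1 in
        "Chassis Power is on")  echo on ;;
        "Chassis Power is off") echo off ;;
        *) echo unknown; return 1 ;;      # anything else is treated as a failure
    esac
}

# With a reachable BMC this could be fed the real output, for example:
#   power_state "$(ipmitool -I lan -H 192.168.1.171 -U admin -P secret chassis power status)"
power_state "Chassis Power is on"
```

Treating anything unexpected as a failure is deliberate. When it comes to fencing, "I don't know" must never be read as "it's off".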

Once you have completely tested your IPMI settings, enable the service to start on boot.

chkconfig ipmi on

Configuring /etc/cluster/cluster.conf

Lastly, we need to add the IPMI configuration to our cluster.conf file. Note that, unlike with the Node Assassin setup, there is no port value in the <device> tag. This is because there is only one device controlled by each IPMI BMC; the node itself. The other change is that there are now two <fencedevice ... /> entries; one for each node's IPMI device. Finally, pay attention to the <device name="..." field. With more than one fence device defined, each node's <device> name must match the name of that node's own <fencedevice> entry.

Here is an example...

<cluster name="an-cluster" config_version="1">
    <clusternodes>
        <clusternode name="an-node01.alteeve.com" nodeid="1" votes="1">
            <fence>
                <method name="ipmi">
                    <device name="fence_an01" action="reboot" />
                </method>
            </fence>
        </clusternode>
        <clusternode name="an-node02.alteeve.com" nodeid="2" votes="1">
            <fence>
                <method name="ipmi">
                    <device name="fence_an02" action="reboot" />
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <fencedevices>
        <fencedevice name="fence_an01" agent="fence_ipmilan" ipaddr="192.168.3.61" login="admin" passwd="secret" />
        <fencedevice name="fence_an02" agent="fence_ipmilan" ipaddr="192.168.3.62" login="admin" passwd="secret" />
    </fencedevices>
</cluster>
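
Because a mismatched name is such an easy mistake to make, a quick cross-check can be worth running after editing the file. The sketch below is a rough one. It uses plain text matching with grep and sed rather than a real XML parser, and it assumes name is the first attribute of each tag, as in the examples here. The check_fence_names name is hypothetical.

```shell
# Hypothetical check: every <device name="..."> used by a node should have a
# <fencedevice name="..."> with the same name. Plain text matching with grep
# and sed, not a real XML parser, so treat the result as a hint only.
check_fence_names() {
    conf=$1
    status=0
    for dev in $(grep -o '<device name="[^"]*"' "$conf" |
                 sed 's/.*name="\([^"]*\)"/\1/' | sort -u); do
        if ! grep -q "<fencedevice name=\"$dev\"" "$conf"; then
            echo "no <fencedevice> named '$dev'"
            status=1
        fi
    done
    return $status
}

# Usage, against your own file:
#   check_fence_names /etc/cluster/cluster.conf && echo "fence names match"
```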

Node Assassin

A cheap alternative is the Node Assassin, an open-hardware, open source fence device. It was built to allow the use of commodity system boards that lack the remote management support found on more expensive, server class hardware.

Full Disclosure: Node Assassin was created by me, with much help from others, for this paper.

Clean Up

Some daemons may have been installed or dragged in during the setup of the cluster. Most notable is heartbeat, which is not compatible with RHCS. The command below will stop it, and a few other daemons, from starting at boot.

The command is safe to run even when some of these daemons aren't installed or have already been removed. Of course, skip the ones you actually want to use. This assumes that the nodes have been built according to this HowTo.

chkconfig heartbeat off; chkconfig iscsid off; chkconfig iptables off; chkconfig ip6tables off

Creating a Clustered Xen Host

This tutorial will cover several topics; DRBD, CLVM, GFS2, Xen dom0 and domU VMs and rgmanager. Their relationship is thus:

  • DRBD provides a mechanism to replicate data across both nodes in real time and guarantees a consistent view of that data from either node. Think of it like RAID level 1, but across machines.
  • CLVM sits on the DRBD partition and provides the underlying mechanism for allowing both nodes to access shared data in a clustered environment. It will host a shared filesystem by way of GFS2 as well as LVs that Xen's domU VMs will use as their disk space.
  • GFS2 will be the clustered file system used on one of the DRBD-backed, CLVM-managed partitions. Files that need to be shared between nodes, like the Xen VM configuration files, will exist on this partition.
  • Xen will be the hypervisor in use that will manage the various virtual machines. Each virtual machine will exist in an LVM LV.
    • Xen's dom0 is the special "host" virtual machine. In this case, dom0 will be the OS installed in the first HowTo.
    • Xen's domU virtual machines will be the "floating", highly available servers.
  • Lastly, rgmanager will be the component of cman that will be configured to manage the automatic migration of the virtual machines when failures occur and when nodes recover.

Setting Up Xen

It may seem odd to start with Xen at this stage, but it is going to rather fundamentally alter each node's "host" operating system.

At this point, each node's host OS is a traditional operating system operating on the bare metal. When we install a dom0 kernel though, we tell Xen to boot a mini operating system first, and then to boot our "host" operating system. In effect, this converts the host node's operating system into just another virtual machine, albeit with a special view of the underlying hardware and Xen hypervisor.

This conversion is somewhat disruptive, so I like to get it out of the way before proceeding with the cluster configuration. We will then do the rest of the setup before returning to Xen later on to create the floating virtual machines.

A Note On The State Of Xen dom0 Support In Fedora

As of Fedora 8, support for Xen dom0 has been removed, but support for the hypervisor and domU virtual machines remains. Red Hat's position is that KVM will be the supported platform going forward. That said, this page seems to indicate that PV Ops dom0 kernels will be supported in the future. Specifically, when dom0 support is merged into the mainline Linux kernel. When this will be is open to speculation, though "by Fedora 16" seems to be a reasonable educated guess.

What this means for us is that we need to use a non-standard dom0 kernel. Specifically, we will use a kernel created by myoung (Michael Young) for Fedora 12. This kernel does not directly support DRBD, so be aware that we will need to build new DRBD kernel modules for his kernel and then rebuild the DRBD modules each time his kernel is updated.

Install The Hypervisor

With the release of Fedora 14, the Xen hypervisor version 4.x is available directly from the main yum repositories. This drastically simplifies installation over prior versions of Fedora.

We want to install a handful of other programs at the same time as the Xen hypervisor.

yum -y install xen.x86_64 xen-doc.x86_64 libvirt.x86_64 virt-top.x86_64 virt-mem.x86_64 qemu.x86_64 qemu-img.x86_64

Installing The AN!Cluster dom0 Kernel

Notice!

The kernel provided here was recompiled on Fedora 14 and is a slightly modified version of Michael Young's kernel available below. I was originally driven to recompile in an effort to solve a DRBD-related kernel oops. For now, unless you have the same DRBD kernel oops, I'd strongly recommend against using the AN!Cluster dom0 kernel until it has been tested much more thoroughly.

With that warning out of the way...

This kernel was compiled on Fedora 14, x86_64. The DRBD RPMs available a little later were compiled against this dom0 kernel.

Note: The --force is required because the current kernel is newer than 2.6.32 used here. Without this switch, the RPM would not install.

cd ~
wget -c https://alteeve.com/files/an-cluster/kernel-2.6.32.25-172.rc1.dom0_an1.fc14.x86_64.rpm
rpm -ivh --force kernel-2.6.32.25-172.rc1.dom0_an1.fc14.x86_64.rpm

If you would like to install the debug, devel and/or header RPMs for this kernel, they are available below.

Note: The kernel-debuginfo-2.6.32.25-172.rc1.dom0_an1.fc14.x86_64.rpm RPM is 218 MiB.

wget -c https://alteeve.com/files/an-cluster/kernel-2.6.32.25-172.rc1.dom0_an1.fc14.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/kernel-debuginfo-2.6.32.25-172.rc1.dom0_an1.fc14.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/kernel-debuginfo-common-x86_64-2.6.32.25-172.rc1.dom0_an1.fc14.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/kernel-devel-2.6.32.25-172.rc1.dom0_an1.fc14.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/kernel-headers-2.6.32.25-172.rc1.dom0_an1.fc14.x86_64.rpm
rpm -ivh --force kernel-de*-2.6.32.25-172.rc1.dom0_an1.fc14.x86_64.rpm kernel-headers-2.6.32.25-172.rc1.dom0_an1.fc14.x86_64.rpm

Post AN!Cluster dom0 Install Configuration

The entry the RPM adds to grub's /boot/grub/menu.lst won't work as it is. You will need to edit it so that it boots the Xen hypervisor and calls the existing installed operating system as a module.

Note: Copy and modify the entry created by the RPM. Simply copying the example entry below will almost certainly not work! Your root= is likely different and your rd_MD_UUID= will definitely be different, even on the same machine across installs. Generally speaking, what follows the kernel /vmlinuz-2.6.32.25-172.rc1.dom0_an1.fc14.x86_64 ... entry made by the dom0 kernel can be copied after the module /vmlinuz-2.6.32.25-172.rc1.dom0_an1.fc14.x86_64 ... entry in the example below.

vim /boot/grub/menu.lst

Copy the existing entry that looks like the example below (note that the examples below are truncated for ease of reading).

title Fedora (2.6.32.25-172.rc1.dom0_an1.fc14.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32.25-172.rc1.dom0_an1.fc14.x86_64 ro root=...
        initrd /initramfs-2.6.32.25-172.rc1.dom0_an1.fc14.x86_64.img

Now edit it to look like this new example entry below.

title Fedora (2.6.32.25-172.rc1.dom0_an1.fc14.x86_64)
        root (hd0,0)
        kernel /xen.gz dom0_mem=1024M
        module /vmlinuz-2.6.32.25-172.rc1.dom0_an1.fc14.x86_64 ro root=...
        module /initramfs-2.6.32.25-172.rc1.dom0_an1.fc14.x86_64.img
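
The edit above is mechanical enough to sketch as a filter. The kernel line becomes a module line, a new kernel line pointing at xen.gz goes in front of it, and the initrd line becomes a second module line. The xenify_stanza name below is hypothetical, and the dom0_mem=1024M value is simply the one used in the example. It reads a stanza on STDIN and prints the result, so nothing is changed on disk.

```shell
# Hypothetical filter: turn a stock "kernel/initrd" stanza read from STDIN
# into the Xen form, with xen.gz as the kernel and the Linux kernel and
# initramfs as modules. dom0_mem=1024M is taken from the example above.
xenify_stanza() {
    awk '
        /^[ \t]*kernel[ \t]/ {
            match($0, /^[ \t]*/)                 # remember the indentation
            indent = substr($0, 1, RLENGTH)
            sub(/^[ \t]*kernel[ \t]+/, "")       # keep only the path and arguments
            print indent "kernel /xen.gz dom0_mem=1024M"
            print indent "module " $0
            next
        }
        /^[ \t]*initrd[ \t]/ { sub(/initrd/, "module") }
        { print }
    '
}

# Usage: paste the stanza created by the RPM on STDIN, review the output,
# then copy it into /boot/grub/menu.lst by hand.
```

Don't feed it a stanza that has already been converted, as it would happily rewrite the xen.gz line as well. I would still make the final change to menu.lst in an editor, where you can see what you are doing.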

Installing Michael Young's dom0 Kernel

This uses a kernel built for Fedora 12, but it works on Fedora 14. This step involves either installing it directly over HTTP or adding and enabling his repository and then installing it from there.

Installing Via myoung's Repository

This is almost always the preferred method. However, do note that when myoung updates his kernel, there will be a lag where the dom0 dependent RPMs provided here will no longer be compatible.

To add the repository, download the myoung.dom0.repo into the /etc/yum.repos.d/ directory.

cd /etc/yum.repos.d/
wget -c http://myoung.fedorapeople.org/dom0/myoung.dom0.repo

To enable his repository, edit the repository file and change the two enabled=0 entries to enabled=1.

vim /etc/yum.repos.d/myoung.dom0.repo
[myoung-dom0]
name=myoung's repository of Fedora based dom0 kernels - $basearch
baseurl=http://fedorapeople.org/~myoung/dom0/$basearch/
enabled=1
gpgcheck=0

[myoung-dom0-source]
name=myoung's repository of Fedora based dom0 kernels - Source
baseurl=http://fedorapeople.org/~myoung/dom0/src/
enabled=1
gpgcheck=0

Install the Xen dom0 kernel (edit the version number with the updated version if it has changed).

yum install kernel-2.6.32.25-172.xendom0.fc12.x86_64

Post Michael Young's dom0 Install Configuration

The entry in grub's /boot/grub/menu.lst won't work. You will need to edit it so that it calls the existing installed operating system as a module.

Note: Copy and modify the entry created by the RPM. Simply copying this entry will almost certainly not work! Your root= is likely different and your rd_MD_UUID= will definitely be different, even on the same machine across installs. Generally speaking, what follows the kernel /vmlinuz-2.6.32.25-172.xendom0.fc12.x86_64 ... entry made by the dom0 kernel can be copied after the module /vmlinuz-2.6.32.25-172.xendom0.fc12.x86_64 ... entry in the example below.

vim /boot/grub/menu.lst
title Xen 4.0.x, Linux kernel 2.6.32.25-172.xendom0.fc12.x86_64
	root   (hd0,0)
	kernel /xen.gz dom0_mem=1024M
	module /vmlinuz-2.6.32.25-172.xendom0.fc12.x86_64 ...
	module /initramfs-2.6.32.25-172.xendom0.fc12.x86_64.img

Disabling Automatic Kernel Updates

Seeing as we're using an older kernel, yum will want to replace it whenever there is an updated kernel* package available. Likewise if myoung updates his kernel. In the latter case, the updated kernel from Mr. Young would break compatibility with our DRBD module. So to be safe, we want to tell yum to never update the kernel.

To do this, we need to add exclude=kernel* to the /etc/yum.conf file.

echo "exclude=kernel*" >> /etc/yum.conf
cat /etc/yum.conf
[main]
cachedir=/var/cache/yum/$basearch/$releasever
keepcache=0
debuglevel=2
logfile=/var/log/yum.log
exactarch=1
obsoletes=1
gpgcheck=1
plugins=1
installonly_limit=3
color=never

#  This is the default, if you make this bigger yum won't see if the metadata
# is newer on the remote and so you'll "gain" the bandwidth of not having to
# download the new metadata and "pay" for it by yum not having correct
# information.
#  It is esp. important, to have correct metadata, for distributions like
# Fedora which don't keep old packages around. If you don't like this checking
# interupting your command line usage, it's much better to have something
# manually check the metadata once an hour (yum-updatesd will do this).
# metadata_expire=90m

# PUT YOUR REPOS HERE OR IN separate files named file.repo
# in /etc/yum.repos.d

exclude=kernel*
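
One catch with the echo ... >> approach is that running it a second time adds a second copy of the line. A small guard avoids that. The add_kernel_exclude name below is hypothetical, and the sketch only checks for the exact line exclude=kernel*; it does not try to merge with an exclude= line you may already have for other packages.

```shell
# Hypothetical helper: append exclude=kernel* only if the exact line is not
# already present, so running it twice does not duplicate the entry.
add_kernel_exclude() {
    conf=$1
    grep -qxF 'exclude=kernel*' "$conf" || echo 'exclude=kernel*' >> "$conf"
}

# Usage (as root): add_kernel_exclude /etc/yum.conf
```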

Make xend play nice with clustering

By default under Fedora 14, cman will start before xend. This is a problem because xend takes the network down as part of its setup. This causes totem communication to fail, which leads to fencing.

Note: xenstored and xenconsoled need to move to earlier start positions along with xend, and each daemon needs to depend on the one before it.

To avoid this, edit the initialization script /etc/init.d/xend and those of its dependents, xenconsoled and xenstored, to have a lower minimum start position. We need to maintain the start order of xenstored first, xenconsoled second and lastly xend. By default, their minimum start positions are 96, 97 and 98 respectively. We will change these to 10, 11 and 12, again, respectively.

Note that we are not altering the start position of xendomains! This is intentional as this daemon will start the domU VMs. This can not happen until all other cluster related daemons have started.

To change the start order, we will edit the chkconfig: 2345 9x ... line in each script, replacing the start position 9x with the new 1x value listed in the recap below. Further, we'll make sure that xenstored begins first by adding it to xenconsoled's Required-Start line. We'll then make sure that xenconsoled starts before xend by adding it to xend's Required-Start line.

To recap the changes;

  • xenstored will start first.
    • We'll change its start position from 96 to 10.
    • We will not add anything to its Required-Start as it must be the first daemon to come up.
  • xenconsoled will start second.
    • We'll change its start position from 97 to 11.
    • We will add xenstored to its Required-Start line.
  • xend will start third.
    • We'll change its start position from 98 to 12.
    • We will add xenconsoled to its Required-Start line.

When done, the three initialization scripts should look like the examples below.

vim /etc/init.d/xenstored
#!/bin/bash
#
# xenstored     Script to start and stop the Xen control daemon.
#
# Author:       Daniel Berrange <berrange@redhat.com
#
# chkconfig: 2345 10 01
# description: Starts and stops the Xen xenstored daemon.
### BEGIN INIT INFO
# Provides:          xenstored
# Required-Start:    $syslog $remote_fs
# Should-Start:
# Required-Stop:     $syslog $remote_fs
# Should-Stop:
# Default-Start:     3 4 5
# Default-Stop:      0 1 2 6
# Default-Enabled:   yes
# Short-Description: Start/stop xenstored
# Description:       Starts and stops the Xen xenstored daemon.
### END INIT INFO
vim /etc/init.d/xenconsoled
#!/bin/bash
#
# xenconsoled   Script to start and stop the Xen xenconsoled daemon
#
# Author:       Daniel P. Berrange <berrange@redhat.com>
#
# chkconfig: 2345 11 01
# description: Starts and stops the Xen control daemon.
### BEGIN INIT INFO
# Provides:          xenconsoled
# Required-Start:    $syslog $remote_fs xenstored
# Should-Start:
# Required-Stop:     $syslog $remote_fs
# Should-Stop:
# Default-Start:     3 4 5
# Default-Stop:      0 1 2 6
# Default-Enabled:   yes
# Short-Description: Start/stop xenconsoled
# Description:       Starts and stops the Xen xenconsoled daemon.
### END INIT INFO
vim /etc/init.d/xend
#!/bin/bash
#
# xend          Script to start and stop the Xen control daemon.
#
# Author:       Keir Fraser <keir.fraser@cl.cam.ac.uk>
#
# chkconfig: 2345 12 98
# description: Starts and stops the Xen control daemon.
### BEGIN INIT INFO
# Provides:          xend
# Required-Start:    $syslog $remote_fs xenconsoled
# Should-Start:
# Required-Stop:     $syslog $remote_fs
# Should-Stop:
# Default-Start:     3 4 5
# Default-Stop:      0 1 2 6
# Default-Enabled:   yes
# Short-Description: Start/stop xend
# Description:       Starts and stops the Xen control daemon.
### END INIT INFO
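
If you would rather script these edits than make them by hand, the two kinds of change can be sketched as small helpers. Both names below are hypothetical. They rewrite whatever file you pass in, so try them on a copy of the init script first. They only change the header lines; chkconfig still needs to be told to re-read them afterwards.

```shell
# Hypothetical helpers for the two kinds of edit described above. They work
# on whatever file is passed in; try them on a copy of the init script first.

# Replace the start position (the second field) on the "# chkconfig:" line.
set_start_priority() {
    file=$1 prio=$2
    sed "s/^\(# chkconfig:[[:space:]]\{1,\}[-0-9]\{1,\}[[:space:]]\{1,\}\)[0-9]\{1,\}/\1$prio/" \
        "$file" > "$file.tmp" && cat "$file.tmp" > "$file" && rm -f "$file.tmp"
}

# Append a service to the "# Required-Start:" line unless it is already there.
add_required_start() {
    file=$1 dep=$2
    if grep -Eq "^# Required-Start:.*[[:space:]]$dep([[:space:]]|\$)" "$file"; then
        return 0
    fi
    sed "s/^\(# Required-Start:.*\)\$/\1 $dep/" \
        "$file" > "$file.tmp" && cat "$file.tmp" > "$file" && rm -f "$file.tmp"
}

# Usage, on a copy of xenconsoled's script:
#   set_start_priority ./xenconsoled 11
#   add_required_start ./xenconsoled xenstored
```

The result is written back with cat rather than mv so that the original file keeps its owner and permissions.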

With xend set to start at a position lower than 98, we now have room for chkconfig to put other daemons after it in the start order, which will be needed a little later. First and foremost, we now need to tell cman to not start until after xend is up.

As above, we will now edit cman's /etc/init.d/cman script. This time though, we will not edit its chkconfig line. Instead, we will simply add xend to the Required-Start line.

vim /etc/init.d/cman
#!/bin/bash
#
# cman - Cluster Manager init script
#
# chkconfig: - 21 79
# description: Starts and stops cman
#
#
### BEGIN INIT INFO
# Provides:             cman
# Required-Start:       $network $time xend
# Required-Stop:        $network $time
# Default-Start:
# Default-Stop:
# Short-Description:    Starts and stops cman
# Description:          Starts and stops the Cluster Manager set of daemons
### END INIT INFO

Finally, remove and re-add the three Xen daemons and cman to re-order them in the start list:

chkconfig xenstored off; chkconfig xenconsoled off; chkconfig xend off; chkconfig cman off; 
chkconfig xenstored on; chkconfig xenconsoled on; chkconfig xend on; chkconfig cman on

Confirm that the order has changed so that xend is earlier in the boot sequence than cman. Assuming you've switched to run-level 3, run:

ls -lah /etc/rc3.d/

Your start sequence should now look like:

lrwxrwxrwx.  1 root root   19 Sep 15 19:29 S26xenstored -> ../init.d/xenstored
lrwxrwxrwx.  1 root root   21 Sep 15 19:29 S27xenconsoled -> ../init.d/xenconsoled
lrwxrwxrwx.  1 root root   14 Sep 15 19:29 S28xend -> ../init.d/xend
lrwxrwxrwx.  1 root root   14 Sep 15 19:29 S29cman -> ../init.d/cman
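
Rather than reading the listing by eye, the order can be checked with a small helper that compares the S-numbers of two services. The starts_before name is hypothetical, and the sketch assumes the usual SNNname link names shown above.

```shell
# Hypothetical check: succeed only if service $2 has a lower S-number than
# service $3 in the run-level directory $1 (for example /etc/rc3.d).
starts_before() {
    dir=$1
    first=$(ls "$dir" | sed -n "s/^S\([0-9][0-9]*\)$2\$/\1/p" | head -n 1)
    second=$(ls "$dir" | sed -n "s/^S\([0-9][0-9]*\)$3\$/\1/p" | head -n 1)
    # Both links must exist; a leading zero is dropped before comparing.
    [ -n "$first" ] && [ -n "$second" ] && [ "${first#0}" -lt "${second#0}" ]
}

# Usage:
#   starts_before /etc/rc3.d xend cman && echo "xend starts before cman"
```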

Booting Into The New dom0

If everything went well, you should be able to boot the new dom0 operating system. If you watch the boot process closely, you will see that the boot process is different. You should now see the Xen hypervisor boot prior to handing off to the "host" operating system. This can be confirmed once the dom0 operating system has booted by checking that the file /proc/xen/capabilities exists. What it contains doesn't matter at this stage, only that it exists at all.
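
The check itself is a one-liner. The sketch below wraps it in a function, with a hypothetical name, that takes an optional path purely so that it can be tried against a test file.

```shell
# Hypothetical check: report whether we appear to be running under Xen.
# The path is a parameter only so the function can be tried on a test file.
running_under_xen() {
    [ -e "${1:-/proc/xen/capabilities}" ]
}

if running_under_xen; then
    echo "Xen capabilities file found"
else
    echo "No /proc/xen/capabilities; probably not booted under Xen"
fi
```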

Configure Networking

Networking in Xen, particularly in a cluster, can be confusing. If you are not familiar with networking in Xen, please review the following article before proceeding.

A note on a major change from previous layouts. In Xen 3.x, ethX would be copied to a virtual interface called vethX. Then the real ethX would be renamed to pethX and the virtual interface vethX would be renamed to ethX to take its place. Finally, a bridge called xenbrX would be created and the real pethX and virtual ethX would be connected to it.

This has been changed somewhat in that now, by default, ethX is left alone and a simple bridge called virbrX is created. We'll be changing this to be somewhat similar to the old style.

Specifically, the real ethX will be renamed to pethX. Then a bridge will be created called ethX, which plays the role of dom0's interface and bridges connections from VMs through pethX and out into the real world.

This is explained in more detail, and with diagrams, in the article below.

Adding New NICs to Xen

By default, xend manages eth0 only. We need to add eth2. Personally, I don't like to put the storage network ethernet devices (eth1) under Xen's control as this potentially can cause DRBD problems on xend restart. Whether you add it or not I will leave to your preferences.

You can see which, if any, network devices are under Xen's control by running ifconfig and checking to see if there is a virbrX corresponding to a given ethX device.

ifconfig
eth0      Link encap:Ethernet  HWaddr 48:5B:39:3C:53:14  
          inet addr:192.168.1.74  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::4a5b:39ff:fe3c:5314/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:224261 errors:0 dropped:0 overruns:0 frame:0
          TX packets:55174 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:319384110 (304.5 MiB)  TX bytes:27348739 (26.0 MiB)
          Interrupt:225 Base address:0x8000 

eth1      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:5A  
          inet addr:192.168.2.74  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe72:9b5a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2 errors:0 dropped:0 overruns:0 frame:0
          TX packets:36 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:832 (832.0 b)  TX bytes:6234 (6.0 KiB)
          Memory:feae0000-feb00000 

eth2      Link encap:Ethernet  HWaddr 00:1B:21:72:96:EA  
          inet addr:192.168.3.74  Bcast:192.168.3.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe72:96ea/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2 errors:0 dropped:0 overruns:0 frame:0
          TX packets:34 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:818 (818.0 b)  TX bytes:6081 (5.9 KiB)
          Memory:fe9e0000-fea00000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

virbr0    Link encap:Ethernet  HWaddr 02:23:C8:98:31:17  
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:15 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:4014 (3.9 KiB)

In the above example, eth0 has a corresponding virbr0 bridge with its own subnet. In non-clustered systems, this is fine. For our purposes though, it will not do.
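
If you would rather not read through the full ifconfig output, the same check can be sketched against sysfs, where the kernel exposes a bridge directory for every bridge device. This is only a minimal sketch; the function name list_bridges and its optional directory argument are my own additions for illustration, not something provided by Xen.

```shell
# List bridge devices by looking for the "bridge" directory that the kernel
# exposes under /sys/class/net/<device>/ for every bridge.
# The optional first argument (an alternate sysfs root) is an assumption
# added only so the function can be pointed at a test directory.
list_bridges() {
    net_root="${1:-/sys/class/net}"
    for dev in "$net_root"/*; do
        # Only bridge devices have a "bridge" sub-directory.
        if [ -d "$dev/bridge" ]; then
            basename "$dev"
        fi
    done
}
```

Running list_bridges with no arguments on the node shown above should print virbr0 and nothing else, as no ethX device has been turned into a bridge yet.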

Removing The qemu virbr0 Bridge

By default, QEMU creates a bridge called virbr0 designed to connect virtual machines to the first eth0 interface. Our system will not need this, so we will remove it. This bridge is configured in the /etc/libvirt/qemu/networks/default.xml file, so to remove this bridge, simply delete the contents of the file.

cat /dev/null >/etc/libvirt/qemu/networks/default.xml

The next time you reboot, that bridge will be gone.

Madi: Put in the command to delete the bridge before a reboot.
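
Until that is filled in properly, here is a hedged sketch of how the bridge could be taken down by hand without a reboot, using the standard ifconfig and brctl tools. The DRY_RUN switch and the run and remove_bridge helpers are hypothetical names used only in this sketch, and I have not tested this on a node where libvirtd is actively managing the bridge.

```shell
# Print (or run) the commands that take a bridge down and then delete it.
# DRY_RUN is a hypothetical safety switch for this sketch. It defaults to 1
# so that nothing is changed unless you explicitly set DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

remove_bridge() {
    bridge="$1"
    # A bridge must be down before brctl will delete it.
    run ifconfig "$bridge" down
    run brctl delbr "$bridge"
}
```

Call remove_bridge virbr0 to preview the two commands, then set DRY_RUN=0 and call it again once you are happy with what it will do.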

Create /etc/xen/scripts/an-network-script

This script will be used by Xen to turn the dom0 ethX interfaces into bridges. All traffic to the bridge, be it from dom0 or domU VMs, will be routeable out of the corresponding pethX device. As domU VMs come online, a hotplug script will create virtual interfaces between this new bridge and the domU's interface(s). Think of the vifX.Y devices as being the network cables you'd normally run between a server and a switch.

Before we proceed, please note three things;

  1. You don't need to use the file name an-network-script. I suggest this name mainly to keep in line with the rest of the 'AN!x' naming used here.
  2. If you install convirt or other hypervisor tools, they will likely create their own bridge script.
  3. Adding eth1 is optional, as we know ahead of time that eth1 will not be made available to any virtual machines as it is dedicated to DRBD. I'm adding it here because I like having things consistent; Do whichever makes more sense to you.

First, touch the file and then chmod it to be executable.

touch /etc/xen/scripts/an-network-script
chmod 755 /etc/xen/scripts/an-network-script

Now edit it to contain the following:

vim /etc/xen/scripts/an-network-script
#!/bin/sh
dir=$(dirname "$0")
"$dir/network-bridge" "$@" vifnum=0 netdev=eth0 bridge=eth0
"$dir/network-bridge" "$@" vifnum=2 netdev=eth2 bridge=eth2

Now tell Xen to execute that script by editing the /etc/xen/xend-config.sxp file and changing the network-script argument to point to this new script (this is line 158 in the default xend-config.sxp file):

vim /etc/xen/xend-config.sxp
#(network-script network-bridge)
#(network-script /bin/true)
(network-script an-network-script)
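
It is easy to leave two network-script lines uncommented by accident. A small sketch like the following prints only the active ones so you can confirm there is exactly one. The function name active_network_script is my own, and the default path is simply the one used in this paper.

```shell
# Print the uncommented (network-script ...) line(s) from a xend config file.
# The file argument defaults to the path used in this paper.
active_network_script() {
    config="${1:-/etc/xen/xend-config.sxp}"
    # Active directives start with "("; commented ones start with "#".
    grep -E '^[[:space:]]*\(network-script[[:space:]]' "$config"
}
```

With the edit above in place, active_network_script should print the single line (network-script an-network-script).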

Warning: The next step may trigger fencing of the nodes! As such, be sure that you're not running anything critical. If unsure, please stop cman or reboot the nodes.

/etc/init.d/cman stop

Now restart xend.

/etc/init.d/xend restart

If everything worked, you should now be able to run ifconfig and see that all the ethX devices have matching pethX, virtual and bridge devices.

ifconfig
eth0      Link encap:Ethernet  HWaddr 48:5B:39:3C:53:14  
          inet addr:192.168.1.74  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::4a5b:39ff:fe3c:5314/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:78 errors:0 dropped:0 overruns:0 frame:0
          TX packets:57 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:9796 (9.5 KiB)  TX bytes:12574 (12.2 KiB)

eth1      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:5A  
          inet addr:192.168.2.74  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe72:9b5a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2 errors:0 dropped:0 overruns:0 frame:0
          TX packets:36 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:832 (832.0 b)  TX bytes:6234 (6.0 KiB)
          Memory:feae0000-feb00000 

eth2      Link encap:Ethernet  HWaddr 00:1B:21:72:96:EA  
          inet addr:192.168.3.74  Bcast:192.168.3.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe72:96ea/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:32 errors:0 dropped:0 overruns:0 frame:0
          TX packets:29 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:5471 (5.3 KiB)  TX bytes:5867 (5.7 KiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

peth0     Link encap:Ethernet  HWaddr 48:5B:39:3C:53:14  
          inet6 addr: fe80::4a5b:39ff:fe3c:5314/64 Scope:Link
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
          RX packets:224486 errors:0 dropped:0 overruns:0 frame:0
          TX packets:55349 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:319406626 (304.6 MiB)  TX bytes:27384681 (26.1 MiB)
          Interrupt:225 Base address:0x8000 

peth2     Link encap:Ethernet  HWaddr 00:1B:21:72:96:EA  
          inet6 addr: fe80::21b:21ff:fe72:96ea/64 Scope:Link
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
          RX packets:35 errors:0 dropped:0 overruns:0 frame:0
          TX packets:70 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:6827 (6.6 KiB)  TX bytes:12470 (12.1 KiB)
          Memory:fe9e0000-fea00000

Note: The virbr0 may remain until you reboot your nodes.
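
The same check can be scripted. The sketch below treats an interface as properly set up when ethX is itself a bridge and a matching pethX device exists, which is the layout described earlier. The NET_ROOT variable and the check_xen_bridges name are assumptions made for this example only.

```shell
# NET_ROOT is a hypothetical override so the check can be pointed at a test
# directory. On a real node it should stay at /sys/class/net.
NET_ROOT="${NET_ROOT:-/sys/class/net}"

# For each interface named on the command line, report whether it is now a
# bridge with a matching "p"-prefixed physical device.
check_xen_bridges() {
    status=0
    for dev in "$@"; do
        if [ -d "$NET_ROOT/$dev/bridge" ] && [ -e "$NET_ROOT/p$dev" ]; then
            echo "$dev: ok"
        else
            echo "$dev: not bridged"
            status=1
        fi
    done
    return "$status"
}
```

On the node shown above, check_xen_bridges eth0 eth2 should report both devices as ok. eth1 would be reported as not bridged, which is expected as it was left out of the network script.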

If you see something like this, then you are ready to proceed! Now start your cluster back up.

/etc/init.d/cman start

We're done for now. There is more to do in Xen, but this was all we needed to do in order to proceed with the next several steps. Once we have the clustered storage online, we'll come back to Xen for the domU setup.

Building the DRBD Array

Building the DRBD array requires a few steps. First, raw space on either node must be prepared. Next, DRBD must be told that it is to create a resource using this newly configured raw space. Finally, the new array must be initialized.

A Map of the Cluster's Storage

The layout of the storage in the cluster can quickly become difficult to follow. Below is an ASCII drawing which should help you see how DRBD will tie in to the rest of the cluster's storage. This map assumes a simple RAID level 1 array underlying each node. If your node has a single hard drive, simply collapse the first two layers into one. Similarly, if your underlying storage is a more complex RAID array, simply expand the number of physical devices at the top level.

               Node1                                Node2
           _____   _____                        _____   _____
          | sda | | sdb |                      | sda | | sdb |
          |_____| |_____|                      |_____| |_____|
             |_______|                            |_______|
     _______ ____|___ _______             _______ ____|___ _______
  __|__   __|__    __|__   __|__       __|__   __|__    __|__   __|__
 | md0 | | md1 |  | md2 | | md3 |     | md3 | | md2 |  | md1 | | md0 |
 |_____| |_____|  |_____| |_____|     |_____| |_____|  |_____| |_____|
    |       |        |       |           |       |        |       |
 ___|___   _|_   ____|____   |___________|   ____|____   _|_   ___|___
| /boot | | / | | <swap>  |        |        | <swap>  | | / | | /boot |
|_______| |___| |_________|  ______|______  |_________| |___| |_______|
                            | /dev/drbd0  |
                            |_____________|
                                   |
                               ____|______
                              | clvm PV   |
                              |___________|
                                   |
                              _____|_____
                             | drbd0_vg0 |
                             |___________|
                                   |
                              _____|_____ ___...____
                             |           |          |
                          ___|___     ___|___    ___|___
                         | lv_X  |   | lv_Y  |  | lv_N  |
                         |_______|   |_______|  |_______|

Install The DRBD Tools

DRBD has two components: the user-land application and tools, and the kernel module.

There are two options for installing the DRBD user-land tools at this point; AN!Cluster-built RPMs or using the ones shipped with Fedora. Regardless of which method you choose, you will need to either install the AN!Cluster DRBD kernel module RPMs or else rebuild the source RPMs referenced.

Install The AN!Cluster DRBD User-Land Tools

I am currently experimenting with ways to solve a DRBD triggered kernel oops in the Xen pvops 2.6.32 kernel. For this reason, I've recompiled the following user-land RPMs under the AN!Cluster variant dom0 kernel RPMs referenced earlier in this paper. If you used the AN! RPMs, then I suggest giving these RPMs a try. However, if you are using myoung's dom0, I recommend sticking to the Fedora-provided user-land DRBD tools.

yum -y install bash-completion heartbeat pacemaker
cd ~
wget -c https://alteeve.com/files/an-cluster/drbd-8.3.7-2.fc14.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/drbd-utils-8.3.7-2.fc14.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/drbd-xen-8.3.7-2.fc14.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/drbd-bash-completion-8.3.7-2.fc14.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/drbd-udev-8.3.7-2.fc14.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/drbd-heartbeat-8.3.7-2.fc14.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/drbd-pacemaker-8.3.7-2.fc14.x86_64.rpm
rpm -ivh drbd*8.3.7-2.fc14.x86_64.rpm

Install The Stock Fedora DRBD User-Land Tools

You will need to install the following tools:

yum install drbd.x86_64 drbd-xen.x86_64 drbd-utils.x86_64

Disable heartbeat

These packages require that the heartbeat packages be installed. This is for a different cluster platform which we are not using here, so we will disable it from starting with the system.

chkconfig heartbeat off

Install The DRBD Kernel Module

The kernel module must match the dom0 kernel that is running. If you update the kernel and neglect to update the DRBD kernel module, the DRBD array will not start.
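
One way to catch a mismatch before DRBD refuses to start is to compare the running kernel's release against the "vermagic" string recorded in the module. This is a minimal sketch; the function name module_matches_kernel is my own, and it assumes the kernel release is the first word of the vermagic string.

```shell
# Succeeds when the first word of a module's "vermagic" string equals the
# given kernel release.
module_matches_kernel() {
    kernel_release="$1"
    vermagic="$2"
    # vermagic looks like "<kernel release> SMP mod_unload ..."; keep word 1.
    module_release="${vermagic%% *}"
    [ "$kernel_release" = "$module_release" ]
}
```

Once the module is installed, it could be used like this:

module_matches_kernel "$(uname -r)" "$(modinfo -F vermagic drbd)" || echo "drbd module does not match the running kernel"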

To help simplify things, links to pre-compiled DRBD kernel modules are provided. If the pre-compiled module doesn't match the kernel you have installed, instructions on recompiling the DRBD kernel module from the source RPM are provided as well.

Install Pre-Compiled DRBD Kernel Module RPMs

Warning: The RPM provided here will only work with the kernel-2.6.32.21-168.xendom0_an1.fc14.x86_64.rpm kernel. If you are using Michael Young's dom0 kernel, please skip to the next section.

This RPM provides the DRBD kernel module. Note that these RPMs are compiled against the AN!Cluster variant of myoung's 2.6.32.21_168 dom0 kernel.

cd ~
wget -c https://alteeve.com/files/an-cluster/drbd-km-2.6.32.23_170.dom0_an1.fc14.x86_64-8.3.7-12.fc14.x86_64.rpm
rpm -ivh drbd-km-2.6.32.23_170.dom0_an1.fc14.x86_64-8.3.7-12.fc14.x86_64.rpm

If you would like to install the debuginfo package as well:

cd ~
wget -c https://alteeve.com/files/an-cluster/drbd-km-debuginfo-8.3.7-12.fc14.x86_64.rpm
rpm -ivh drbd-km-debuginfo-8.3.7-12.fc14.x86_64.rpm

Building DRBD Kernel Module RPMs From Source

If the above RPMs don't work or if the dom0 kernel you are using in any way differs, please follow the steps here to create a DRBD kernel module matched to your running dom0.

First, install the build environment.

yum -y groupinstall "Development Libraries"
yum -y groupinstall "Development Tools"

Install the kernel headers and development library for the dom0 kernel:

Note: The following commands use --force to get past the fact that the headers for the 2.6.33 kernel are already installed, which makes RPM think that these are too old and will conflict. Please proceed with caution.

Building On Michael Young's Kernel

If you are using Michael Young's kernel:

cd ~
wget -c http://fedorapeople.org/~myoung/dom0/x86_64/kernel-headers-2.6.32.25-172.xendom0.fc12.x86_64.rpm
wget -c http://fedorapeople.org/~myoung/dom0/x86_64/kernel-devel-2.6.32.25-172.xendom0.fc12.x86_64.rpm
rpm -ivh --force kernel*2.6.32.25-172.xendom0.fc12.x86_64.rpm

Building On The AN!Cluster Kernel

If you are using the AN!Cluster dom0 kernel:

cd ~
wget -c https://alteeve.com/files/an-cluster/kernel-devel-2.6.32.25-172.dom0_an1.fc14.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/kernel-headers-2.6.32.25-172.dom0_an1.fc14.x86_64.rpm
rpm -ivh --force kernel-devel-2.6.32.25-172.dom0_an1.fc14.x86_64.rpm kernel-headers-2.6.32.25-172.dom0_an1.fc14.x86_64.rpm

Building The DRBD RPMs

Now you need to download, prepare, build and install the source RPM.

yum install -y drbd-utils bash-completion heartbeat pacemaker
wget -c https://alteeve.com/files/an-cluster/drbd-8.3.7-2.fc13.src.rpm
rpm -ivh drbd-8.3.7-2.fc13.src.rpm
cd /root/rpmbuild/SPECS/
rpmbuild -bp drbd.spec 
cd /root/rpmbuild/BUILD/drbd-8.3.7/
./configure --enable-spec --with-km
cp /root/rpmbuild/BUILD/drbd-8.3.7/drbd-km.spec /root/rpmbuild/SPECS/
cd /root/rpmbuild/SPECS/
rpmbuild -ba drbd-km.spec
rpmbuild -ba drbd.spec 
cd ~/rpmbuild/RPMS/x86_64/
rpm -Uvh drbd-*
chkconfig heartbeat off

You should be good to go now!

Allocating Raw Space For DRBD On Each Node

If you followed the setup steps provided for in "Two Node Fedora 14 Cluster", you will have a set amount of unconfigured hard drive space. This is what we will use for the DRBD space on either node. If you've got a different setup, you will need to allocate some raw space before proceeding.

Create a Simple Partition

If you do not have two drives, please follow the next section's steps, but pay attention to the "note"s. In short, you will need to create one partition, leave the default type of the partition as 83, write the changes to disk and then proceed to the DRBD Configuration Files section.

Creating a RAID level 1 'md' Device

This assumes that you have two raw drives, /dev/sda and /dev/sdb. It further assumes that you've created three partitions which have been assigned to three existing /dev/mdX devices. With these assumptions, we will create /dev/sda4 and /dev/sdb4 and, using them, create a new /dev/md3 device that will host the DRBD partition.

If you have multiple drives and plan to use a different RAID level, please adjust the following commands accordingly.

Creating The New Partitions

Warning: The next steps will have you directly accessing your server's hard drive configuration. Please do not proceed on a live server until you've had a chance to work through these steps on a test server. One mistake can blow away all your data.

Start the fdisk shell for the first hard drive; /dev/sda.

fdisk /dev/sda
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
         switch off the mode (command 'c') and change display units to
         sectors (command 'u').

Command (m for help):

Note: Depending on your configuration, you may not see the above warning or you may see a different warning. Note it, but it is likely nothing to worry about.

View the current configuration with the print option

p
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000c6fe1

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1        5100    40960000   fd  Linux raid autodetect
/dev/sda2            5100        5622     4194304   fd  Linux raid autodetect
/dev/sda3   *        5622        5654      256000   fd  Linux raid autodetect

Command (m for help):

Now we know for sure that the next free partition number is 4. We will now create the new partition.

n
Command action
   e   extended
   p   primary partition (1-4)

We will make it a primary partition

p
Selected partition 4
First cylinder (5654-60801, default 5654):

Then we simply hit <enter> to select the default starting block.

<enter>
Using default value 5654
Last cylinder, +cylinders or +size{K,M,G} (5654-60801, default 60801):

Once again we will press <enter> to select the default ending block.

<enter>
Using default value 60801

Command (m for help):

Note: If you only have one drive and are not creating a RAID array, you do not need to change the type of the partition, so you can skip the next few steps. Continue at the step where you write the changes.

Now we need to change the type of partition that it is.

t
Partition number (1-4):

We know that we are modifying partition number 4.

4
Hex code (type L to list codes):

Now we need to set the hex code for the partition type to set. We want to set fd, which defines Linux raid autodetect.

fd
Changed system type of partition 4 to fd (Linux raid autodetect)

Now check that everything went as expected by once again printing the partition table.

p
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000c6fe1

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1        5100    40960000   fd  Linux raid autodetect
/dev/sda2            5100        5622     4194304   fd  Linux raid autodetect
/dev/sda3   *        5622        5654      256000   fd  Linux raid autodetect
/dev/sda4            5654       60801   442972704+  fd  Linux raid autodetect

Command (m for help):

Note: If you only have one drive, your partitions will be 83 Linux or 82 Linux swap / Solaris, instead of fd Linux raid autodetect.

There it is. So finally, we need to write the changes to the disk.

w
The partition table has been altered!

Calling ioctl() to re-read partition table.

WARNING: Re-reading the partition table failed with error 16: Device or resource busy.
The kernel still uses the old table. The new table will be used at
the next reboot or after you run partprobe(8) or kpartx(8)
Syncing disks.

Note: If you only have one drive, reboot now if you got the message above and then skip forward to the "DRBD Configuration Files" section.

If you see the above message, do not reboot yet. Repeat these steps for the second drive, /dev/sdb, and then reboot.

Creating The New /dev/mdX Device

If you only have one drive, skip this step.

Now we need to use mdadm to create the new RAID level 1 device. This will be used as the device that DRBD will directly access.

mdadm --create /dev/md3 --homehost=localhost.localdomain --raid-devices=2 --level=1 /dev/sda4 /dev/sdb4
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90

Seeing as /boot doesn't exist on this device, we can safely ignore this warning.

y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/md4 started.

You can now cat /proc/mdstat to verify that it indeed built. If you're interested, you could open a new terminal window and use watch cat /proc/mdstat and watch the array build.

cat /proc/mdstat
md3 : active raid1 sdb4[1] sda4[0]
      442971544 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.8% (3678976/442971544) finish=111.0min speed=65920K/sec
      
md2 : active raid1 sda2[0] sdb2[1]
      4193272 blocks super 1.1 [2/2] [UU]
      
md1 : active raid1 sda1[0] sdb1[1]
      40958908 blocks super 1.1 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md0 : active raid1 sda3[0] sdb3[1]
      255988 blocks super 1.0 [2/2] [UU]
      
unused devices: <none>
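
If you want to check on the array from a script instead of watching the output by eye, the relevant bits can be pulled out of /proc/mdstat with awk. This is a rough sketch that assumes the layout shown above; the function names md_status and md_resync are hypothetical.

```shell
# Print the state and RAID level of one md device from mdstat-formatted
# input on stdin, for example "active raid1".
md_status() {
    md="$1"
    awk -v md="$md" '$1 == md && $2 == ":" { print $3, $4 }'
}

# Print the resync (or recovery) percentage for one md device, or nothing
# if that device is not currently syncing.
md_resync() {
    md="$1"
    awk -v md="$md" '
        $1 == md                      { found = 1; next }
        found && /resync|recovery/    { print $4; exit }
        found && NF == 0              { exit }'
}
```

For example, md_status md3 < /proc/mdstat should print "active raid1", and md_resync md3 < /proc/mdstat should print the current percentage while the array is still building.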

Finally, we need to make sure that the new array will start when the system boots. To do this, we'll again use mdadm, but with different options that will have it output data in a format suitable for the /etc/mdadm.conf file. We'll redirect this output to that config file, thus updating it.

mdadm --detail --scan | grep md3 >> /etc/mdadm.conf
cat /etc/mdadm.conf
# mdadm.conf written out by anaconda
MAILADDR root
AUTO +imsm +1.x -all
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=b58df6d0:d925e7bb:c156168d:47c01718
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=ac2cf39c:77cd0314:fedb8407:9b945bb5
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=4e514936:4a966f4e:0dd8402e:6403d10d
ARRAY /dev/md3 metadata=1.2 name=localhost.localdomain:3 UUID=f0b6d0c1:490d47e7:91c7e63a:f8dacc21

You'll note that the last line, which we just added, is different from the previous lines. This isn't a concern, but you are welcome to re-write it to match the existing format if you wish.
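
Because the command above appends to the file, running it a second time would add a duplicate ARRAY line. A small guard, sketched below, avoids that. The function name has_array_entry is my own, and it assumes the device is written as /dev/md3 in the file, as it is in the output above.

```shell
# Succeeds if the config file already has an ARRAY line for the given device.
has_array_entry() {
    conf="$1"
    device="$2"
    grep -q "^ARRAY $device " "$conf"
}
```

It could then be combined with the original command like this:

has_array_entry /etc/mdadm.conf /dev/md3 || mdadm --detail --scan | grep md3 >> /etc/mdadm.conf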

Before you proceed, it is strongly advised that you reboot each node and then verify that the new array did in fact start with the system. You do not need to wait for the sync to finish before rebooting. It will pick up where you left off once rebooted.

Note: You'll notice we did not format a file system on this RAID array. This is intentional. DRBD uses the raw device and does not need a file system on it.

DRBD Configuration Files

DRBD uses a global configuration file, /etc/drbd.d/global_common.conf, and one or more resource files. The resource files need to be created in the /etc/drbd.d/ directory and must have the suffix .res. For this example, we will create a single resource called r0 which we will configure in /etc/drbd.d/r0.res.

/etc/drbd.d/global_common.conf

The stock /etc/drbd.d/global_common.conf is sane, so we won't bother altering it here.

Full details on all the drbd.conf configuration file directives and arguments can be found here. Note: That link doesn't show this new configuration format. Please see Novell's link.

/etc/drbd.d/r0.res

This is the important part. This defines the resource to use, and must reflect the IP addresses and storage devices that DRBD will use for this resource.

vim /etc/drbd.d/r0.res
# This is the name of the resource and its settings. Generally, 'r0' is used
# as the name of the first resource. This is by convention only, though.
resource r0
{
        # This tells DRBD which device to make the new resource available as
        # on each node. This is, again, by convention only.
        device    /dev/drbd0;

        # The main argument here tells DRBD that we will have proper locking 
        # and fencing, and as such, to allow both nodes to set the resource to
        # 'primary' simultaneously.
        net
        {
                allow-two-primaries;
        }

        # This tells DRBD to automatically set both nodes to 'primary' when the
        # nodes start.
        startup
        {
                become-primary-on both;
        }

        # This tells DRBD to look for and store its meta-data on the resource
        # itself.
        meta-disk       internal;

        # The name below must match the output from `uname -n` on each node.
        on an-node01.alteeve.com
        {
                # This must be the IP address of the interface on the storage 
                # network (an-node01.sn, in this case).
                address         192.168.2.71:7789;

                # This is the underlying partition to use for this resource on 
                # this node.
                disk            /dev/md3;
        }

        # Repeat as above, but for the other node.
        on an-node02.alteeve.com
        {
                address         192.168.2.72:7789;
                disk            /dev/md3;
        }
}

This file must be copied to BOTH nodes and must match before you proceed.
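
Both requirements, that the two copies match and that the "on" names match `uname -n`, are easy to get wrong by a single character, so here is a hedged sketch of how they could be checked. The function names configs_match and host_in_resource are hypothetical, and fetching the other node's copy (with scp, for example) is left to you.

```shell
# Succeeds when two copies of a file are byte-for-byte identical.
configs_match() {
    cmp -s "$1" "$2"
}

# Succeeds when the resource file has an "on <host>" section for the host.
host_in_resource() {
    res_file="$1"
    host="$2"
    awk -v host="$host" '
        $1 == "on" && $2 == host { found = 1 }
        END { exit !found }' "$res_file"
}
```

For example, after copying the peer's file to /tmp/r0.res.peer (a path I made up for this example), run configs_match /etc/drbd.d/r0.res /tmp/r0.res.peer on one node and host_in_resource /etc/drbd.d/r0.res "$(uname -n)" on each node. Both should succeed silently.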

Starting The DRBD Resource

For the rest of this section, pay attention to whether you see:

  • Node1
  • Node2
  • Both

These indicate which node to run the following commands on. There is no functional difference between the nodes, so just choose one at random to be Node1 and the other will be Node2. Once you've chosen which is which, be consistent with which node you run the commands on. Of course, if a command block is preceded by Both, run it on both nodes.

Loading the 'drbd' Module

Both

Normally, we'd load the drbd module by simply starting the /etc/init.d/drbd daemon. However, if we did that at this stage, we'd generate errors because there isn't an UpToDate disk in the array. To get around this, we'll manually load the drbd kernel module using modprobe.

modprobe drbd

This won't return any output, but if you check, you should now see the special /proc/drbd file.

Monitoring Progress

Both

I find it very useful to monitor DRBD while running the rest of the setup. To do this, open a second terminal on each node and use watch to keep an eye on /proc/drbd. This way you will be able to monitor the progress of the array in near-real time.

Both

watch cat /proc/drbd

At this stage, it should look like this:

version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6145808fe2917 build by root@xenmaster002.iplink.net, 2010-09-07 16:02:46
 0: cs:Unconfigured

Initialize The Resource

Both

This step creates the DRBD meta-data on the new DRBD resource's backing devices. It is only needed when creating new DRBD partitions.

drbdadm create-md r0
  --==  Thank you for participating in the global usage survey  ==--
The server's response is:

you are the 9507th user to install this version
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success

The /proc/drbd output should not have changed at this stage.

Starting the Resource

Both

This will attach the backing device, /dev/md3 in our case, and then start the new resource r0.

drbdadm up r0

There will be no output at the command line. If you are watching /proc/drbd though, you should now see something like this:

version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6145808fe2917 build by root@xenmaster002.iplink.net, 2010-09-07 16:02:46
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:442957988

That it is Secondary/Secondary and Inconsistent/Inconsistent is expected.
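
The cs (connection state), ro (roles) and ds (disk states) fields are the ones to watch through the remaining steps. If you want to read them from a script, something like the following sketch would work. The function name drbd_field is hypothetical and the parsing assumes the DRBD 8.3 /proc/drbd layout shown above.

```shell
# Print the value of one "key:value" field (cs, ro or ds) for a DRBD minor
# number, reading /proc/drbd formatted input on stdin.
drbd_field() {
    minor="$1"
    key="$2"
    awk -v minor="$minor:" -v key="$key" '
        $1 == minor {
            for (i = 2; i <= NF; i++) {
                # Split "cs:Connected" into name and value at the first colon.
                n = index($i, ":")
                if (n > 0 && substr($i, 1, n - 1) == key) {
                    print substr($i, n + 1)
                }
            }
        }'
}
```

At this stage, drbd_field 0 ro < /proc/drbd should print Secondary/Secondary and drbd_field 0 ds < /proc/drbd should print Inconsistent/Inconsistent.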

Setting the First Primary Node

Node1

As this is a totally new resource, DRBD doesn't know which side of the array is "more valid" than the other. In reality, neither is as there was no existing data of note on either node. This means that we now need to choose a node and tell DRBD to treat it as the "source" node. This step will also tell DRBD to make the "source" node primary. Once set, DRBD will begin sync'ing in the background.

drbdadm -- --overwrite-data-of-peer primary r0

As before, there will be no output at the command line, but /proc/drbd will change to show the following:

GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@xenmaster002.iplink.net, 2010-09-07 16:02:46
 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
    ns:69024 nr:0 dw:0 dr:69232 al:0 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:442888964
        [>....................] sync'ed:  0.1% (432508/432576)M
        finish: 307:33:42 speed: 320 (320) K/sec

If you're watching the secondary node, the /proc/drbd will show ro:Secondary/Primary ds:Inconsistent/UpToDate. This is, as you can guess, simply a reflection of it being the "over-written" node.

Setting the Second Node to Primary

Node2

The last step to complete the array is to tell the second node to also become primary.

drbdadm primary r0

As with many drbdadm commands, nothing will be printed to the console. If you're watching the /proc/drbd though, you should see something like Primary/Primary ds:UpToDate/Inconsistent. The Inconsistent flag will remain until the sync is complete.

A Note On sync Speed

You will notice in the previous step that the sync speed seems awfully slow at 320 (320) K/sec.

In most DRBD applications, this is fine. As actual data is written to either side of the array, that data will be immediately copied to both nodes. As such, both nodes will always contain up to date copies of the real data. Given this, the syncer is intentionally set low so as to not put too much load on the underlying disks that could cause slow downs.

In clustered VM environments though, this is a problem. The reason is that, until the sync completes, the node whose DRBD resource is Inconsistent can not be used for redundancy. If the node that is UpToDate fails, DRBD will stop on the Inconsistent node. As a result, any VMs running on the DRBD will lose access to their storage and thus fail. Similarly, VMs lost on the other node will not be able to restart on the surviving node.

For this reason, we will push the sync speed up to about two-thirds of the disk's maximum write speed. For example, if your node can write at the rate of 60 MiB/sec, you will want to sync at about 40 MiB/sec. We don't want to set it too high, so as to not risk timing out applications that access the drives outside of the DRBD partition itself.

drbdsetup /dev/drbd0 syncer -r 40M

The speed-up will not be instant. It will take a little while for the speed to pick up. Once the sync is finished, it is a good idea to revert to the default sync rate.

drbdadm syncer r0

You can proceed with configuration, but pause at the stage where you provision VMs if the sync has not completed.
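
The two-thirds rule of thumb, and a rough idea of how long the sync will take at that rate, can be worked out with simple shell arithmetic. This is only a sketch; the function names are my own, and the estimate assumes the oos value in /proc/drbd is in KiB and that the sync actually sustains the rate you set.

```shell
# Print two-thirds of a measured write speed, rounded down, in MiB/sec.
sync_rate_mib() {
    write_speed_mib="$1"
    echo $(( write_speed_mib * 2 / 3 ))
}

# Print the approximate number of seconds needed to sync the given amount
# of out-of-sync data (in KiB) at the given rate (in MiB/sec).
sync_eta_seconds() {
    oos_kib="$1"
    rate_mib="$2"
    echo $(( oos_kib / (rate_mib * 1024) ))
}
```

For the 60 MiB/sec example, sync_rate_mib 60 gives 40. Using the oos value shown earlier, sync_eta_seconds 442957988 40 gives 10814 seconds, or roughly three hours.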

Setting Up CLVM

The goal of DRBD in the cluster is to provide clustered LVM, referred to as CLVM, to the nodes. This is done by turning the DRBD partition into a CLVM physical volume.

So now we will create a PV on top of the new DRBD partition, /dev/drbd0, that we created in the previous step. Since this new LVM PV will exist on top of the shared DRBD partition, whatever gets written to its logical volumes will be immediately available on either node, regardless of which node actually initiated the write.

This capability is the underlying reason for creating this cluster; Neither machine is truly needed so if one machine dies, anything on top of the DRBD partition will still be available. When the failed machine returns, the surviving node will have a list of what blocks changed while the other node was gone and can use this list to quickly re-sync the other server.

Making LVM Cluster-Aware

Normally, LVM is run on a single server. This means that at any time, LVM can write data to the underlying drive without needing to worry that any other device might change anything. In clusters, this isn't the case. The other node could try to write to the shared storage, so the nodes need to enable "locking" to prevent the two nodes from trying to work on the same bit of data at the same time.

The process of enabling this locking is known as making LVM "cluster-aware".

LVM has a tool called lvmconf that can be used to enable LVM locking. This is provided as part of the lvm2-cluster package.

yum -y install lvm2-cluster.x86_64

Now, to enable cluster awareness in LVM, run the following command.

lvmconf --enable-cluster

By default, clvmd, the cluster LVM daemon, is stopped and not set to run on boot. Now that we've enabled LVM locking, we need to start it. First, check its status:

/etc/init.d/clvmd status
clvmd is stopped
active volumes: (none)

As expected, it is stopped, so let's start it:

Note: At this point cman is still set to not start at boot. Since we rebooted after creating the partitions that make up /dev/md3, cman will likely be off, as it was in my case. clvmd will fail to start if the cluster manager (cman) is not running. --SRSullivan 17:40, 18 October 2010 (UTC)

/etc/init.d/clvmd start
Activating VGs:   No volume groups found
                                                           [  OK  ]

Note: On a few occasions, I've seen clvmd time out when starting and, occasionally, fences being issued. I've not sorted out why, but I have usually been able to resolve this by stopping clvmd and cman, then restarting cman and, finally, restarting clvmd. If I can sort out a way to reliably trigger this problem, I will submit a bug report.

Filtering Out Devices

ToDo: Find a less-aggressive filter.

With the stock /etc/lvm/lvm.conf configuration, all devices on the system will be checked for LVM volumes. This can cause a problem as LVM will give preference to the LVM data on the RAID device over the DRBD device. It sees a duplicate as both are, effectively, one and the same.

To work around this, we need to alter the filter = [] entry. At the time of writing, simply rejecting the underlying /dev/md3 device as a candidate wasn't enough. So for now, we will tell LVM to accept DRBD devices and reject all other devices. To do this, we'll insert "a|/dev/drbd*|" as the first array entry and change the existing entry to "r/.*/".

Note: I would love feedback on a filter argument that successfully ignored just /dev/md3, if anyone can suggest one.

vim /etc/lvm/lvm.conf
    # By default we accept every block device:
    #filter = [ "a/.*/" ]
    filter = [ "a|/dev/drbd*|", "r/.*/" ]

Now delete the existing cache file so that LVM is forced to rescan the system.

rm -f /etc/lvm/cache/.cache

The changes take effect immediately.
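To see why the two-entry filter behaves as it does, it helps to remember that LVM tries the filter patterns in order, that the first pattern to match decides, and that a device matching no pattern is accepted. The sketch below only emulates that logic with grep -E; it is not LVM's own matcher, and the function name filter_decision is made up for this example.

```shell
# Rough emulation of the two-entry filter; NOT LVM's own matcher.
# The first pattern that matches decides; a device matching nothing is accepted.
filter_decision() {
    dev="$1"
    if printf '%s\n' "$dev" | grep -Eq '/dev/drbd*'; then
        echo "accept"      # corresponds to "a|/dev/drbd*|"
    elif printf '%s\n' "$dev" | grep -Eq '.*'; then
        echo "reject"      # corresponds to "r/.*/"
    else
        echo "accept"      # no pattern matched
    fi
}

filter_decision /dev/drbd0
filter_decision /dev/md3
```

Note that the patterns are regular expressions, not shell globs, so drbd* means "drb followed by zero or more d". It still matches /dev/drbd0 because the pattern is not anchored.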

Creating a new PV using the DRBD Partition

Node1

We can now proceed with setting up the new DRBD-based LVM physical volume. Once the PV is created, we can create a new volume group and start allocating space to logical volumes.

Note: As we will be using our DRBD device, and as it is a shared block device, most of the following commands only need to be run on one node. Once the block device changes in any way, those changes will near-instantly appear on the other node. For this reason, unless explicitly stated to do so, only run the following commands on one node.

To set up the DRBD partition as an LVM PV, run pvcreate:

pvcreate /dev/drbd0
  Physical volume "/dev/drbd0" successfully created

Both

Now, on both nodes, check that the new physical volume is visible by using pvdisplay:

pvdisplay
  "/dev/drbd0" is a new physical volume of "422.44 GiB"
  --- NEW Physical volume ---
  PV Name               /dev/drbd0
  VG Name               
  PV Size               422.44 GiB
  Allocatable           NO
  PE Size               0   
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               YHmdip-SuJN-KIEv-2tbK-BT9Q-wfOo-OuQuaW

If you see PV Name /dev/drbd0 (or your underlying partition) on both nodes, then your DRBD setup and LVM configuration changes are working perfectly!

Creating a VG on the new PV

Node1

Now we need to create the volume group using the vgcreate command:

vgcreate -c y drbd0_vg0 /dev/drbd0
  Clustered volume group "drbd0_vg0" successfully created

Both

Now we'll check that the new VG is visible on both nodes using vgdisplay:

vgdisplay
  --- Volume group ---
  VG Name               drbd0_vg0
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  Clustered             yes
  Shared                no
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               422.43 GiB
  PE Size               4.00 MiB
  Total PE              108143
  Alloc PE / Size       0 / 0   
  Free  PE / Size       108143 / 422.43 GiB
  VG UUID               Bb8l9e-es2z-PhaF-Gg3o-2is2-DZ1S-V2RsBF

If the new VG is visible on both nodes, we are ready to create our first logical volume using the lvcreate tool.

Creating the First LV on the new VG

Node1

Now we'll create a simple 20 GiB logical volume. We will use it as a shared GFS2 store for shared files and to store our Xen domU config files later on.

lvcreate -L 20G -n xen_shared drbd0_vg0
  Logical volume "xen_shared" created

Both

As before, we will check that the new logical volume is visible from both nodes by using the lvdisplay command:

lvdisplay
  --- Logical volume ---
  LV Name                /dev/drbd0_vg0/xen_shared
  VG Name                drbd0_vg0
  LV UUID                AqQizc-KBpX-2scN-WFLb-jIeF-QDcM-PlQW84
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                20.00 GiB
  Current LE             5120
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0

Again, if this is visible from both nodes, we're set! Repeat this process for all future LVs you will want to create. We will do this a little later to create LVs for Xen VMs.
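When planning future LVs, it can be handy to know how many extents a given size will consume. A minimal sketch, assuming the 4 MiB PE size shown in the vgdisplay output (the function name lv_extents is hypothetical):

```shell
# Hypothetical helper: logical extents needed for a size in GiB,
# given the PE size in MiB (defaults to the 4 MiB shown by vgdisplay).
lv_extents() {
    size_gib="$1"
    pe_mib="${2:-4}"
    echo $(( size_gib * 1024 / pe_mib ))
}

lv_extents 20
```

For the 20 GiB xen_shared volume this prints 5120, which agrees with the "Current LE" value reported by lvdisplay. A 40 GiB VM volume would take 10240 of the 108143 extents in the VG.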

Creating A Shared GFS FileSystem

GFS2 is a cluster-aware file system that can be mounted on two or more nodes at once. We will use it as a place to store ISOs that we'll use to provision our virtual machines.

Install The GFS2 Utilities

Start by installing the GFS2 tools:

yum -y install gfs2-utils.x86_64

Format Our CLVM LV With The GFS2 File System

Node1

Note: The following example is designed for the cluster used in the prerequisite HowTo.

  • If you have more than 2 nodes, increase the -j 2 to the number of nodes you want to mount this file system on.
  • If your cluster is named something other than an-cluster (as set in the cluster.conf file), change -t an-cluster:xen_shared to match your cluster's name. The xen_shared can be whatever you like, but it must be unique in the cluster. I tend to use a name that matches the LV name, but this is my own preference and is not required.
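If you are unsure of the cluster name, you can read it out of cluster.conf rather than typing it by hand. This is only a sketch: the function names cluster_name and lock_table are made up, the default path /etc/cluster/cluster.conf is an assumption about your setup, and the sed expression expects the <cluster ... name="..."> tag to sit on one line.

```shell
# Hypothetical helper: print the cluster name from a cluster.conf file.
cluster_name() {
    conf="${1:-/etc/cluster/cluster.conf}"
    # Match only the <cluster ...> tag, not <clusternodes> or <clusternode>.
    sed -n 's/.*<cluster[[:space:]][^>]*name="\([^"]*\)".*/\1/p' "$conf" | head -n 1
}

# Hypothetical helper: build the value for mkfs.gfs2's -t option.
# $1 = path to cluster.conf, $2 = file system name (unique in the cluster).
lock_table() {
    echo "$(cluster_name "$1"):$2"
}

# Example (not run here):
# lock_table /etc/cluster/cluster.conf xen_shared
```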

To format the partition run:

mkfs.gfs2 -p lock_dlm -j 2 -t an-cluster:xen_shared /dev/drbd0_vg0/xen_shared
This will destroy any data on /dev/drbd0_vg0/xen_shared.
It appears to contain: symbolic link to `../dm-0'

Are you sure you want to proceed? [y/n]

Acknowledge the warning, if any, and then press y if you are ready to proceed.

y
Device:                    /dev/drbd0_vg0/xen_shared
Blocksize:                 4096
Device Size                20.00 GB (5242880 blocks)
Filesystem Size:           20.00 GB (5242878 blocks)
Journals:                  2
Resource Groups:           80
Locking Protocol:          "lock_dlm"
Lock Table:                "an-cluster:xen_shared"
UUID:                      A1487063-2A3F-43B1-3A36-44936B0B4D1E

Once the format completes, you can mount /dev/drbd0_vg0/xen_shared as you would a normal file system.

Both:

To complete the example, let's mount the GFS2 partition we made just now on /xen_shared and then use df -h to verify.

mkdir /xen_shared
mount /dev/drbd0_vg0/xen_shared /xen_shared
df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md1               39G  2.8G   34G   8% /
tmpfs                 466M   29M  438M   7% /dev/shm
/dev/md0              243M   70M  161M  31% /boot
xenstore              466M   32K  466M   1% /var/lib/xenstored
/dev/dm-0              20G  259M   20G   2% /xen_shared

You may have noticed that it shows /dev/dm-0 instead of /dev/drbd0_vg0/xen_shared. If you look at the latter, you will see that it is simply a symlink to the former.

ls -lah /dev/drbd0_vg0/xen_shared
lrwxrwxrwx. 1 root root 7 Sep  9 13:24 /dev/drbd0_vg0/xen_shared -> ../dm-0

Add An Entry To /etc/fstab

The last step is to add an entry for this new partition to each node's /etc/fstab file.

Reference The GFS2 Partition By Device Path

This is the more traditional method of referencing the GFS2 partition by using its device path directly.

Warning: An incorrect edit of the /etc/fstab file can leave your system unable to boot! Please review the line you add to make sure it is accurate and compatible with your setup before proceeding.

vim /etc/fstab
#
# /etc/fstab
# Created by anaconda on Tue Sep  7 06:29:51 2010
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=865690db-85d0-44c5-9f32-ffb6fdf47060 /                       ext3    defaults        1 1
UUID=8b9822b6-a92e-48c9-96b5-f8943142319e /boot                   ext3    defaults        1 2
UUID=94b03547-a7e3-45bb-b2d5-837498b370f4 swap                    swap    defaults        0 0
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
xenfs                   /proc/xen               xenfs   defaults        0 0
/dev/drbd0_vg0/xen_shared /xen_shared           gfs2    rw,suid,dev,exec,nouser,async    0 0

Reference The GFS2 Partition By UUID

It is sometimes preferable to create an fstab entry that locates the device via its UUID. To do this, you can run the following command which, though a bit cryptic, will print out an /etc/fstab compatible string.

Warning: The same warnings apply here as above.

echo `gfs2_edit -p sb /dev/drbd0_vg0/xen_shared | grep sb_uuid | sed -e "s/.*sb_uuid  *\(.*\)/UUID=\L\1\E \/xen_shared\t\tgfs2\trw,suid,dev,exec,nouser,async\t0 0/"`
UUID=a1487063-2a3f-43b1-3a36-44936b0b4d1e /xen_shared gfs2 rw,suid,dev,exec,nouser,async 0 0

Note: EL5 doesn't recognize the UUID= argument. You must instead use the actual path (ie: /dev/drbd0_vg0/xen_shared /xen_shared gfs2 rw,suid,dev,exec,nouser,async 0 0).

You may have noticed that defaults isn't used. Rather, all but the auto option are manually set. This is because the system will drop to single-user mode at boot if it can't mount an auto partition at boot time (auto being implied by defaults). Given that our GFS2 partition sits on top of DRBD and the cluster, there is no way to make it available that early in the boot process.

Further, the gfs2 init script specifically excludes entries in /etc/fstab that have the noauto option set. For this reason, we can't simply specify that option, as we need the init script to see the partition so that it is mounted when GFS2 starts and unmounted when it stops.

Now add this string to /etc/fstab.

vim /etc/fstab
#
# /etc/fstab
# Created by anaconda on Tue Sep  7 06:29:51 2010
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=865690db-85d0-44c5-9f32-ffb6fdf47060 /                       ext3    defaults        1 1
UUID=8b9822b6-a92e-48c9-96b5-f8943142319e /boot                   ext3    defaults        1 2
UUID=94b03547-a7e3-45bb-b2d5-837498b370f4 swap                    swap    defaults        0 0
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
xenfs                   /proc/xen               xenfs   defaults        0 0
UUID=a1487063-2a3f-43b1-3a36-44936b0b4d1e /xen_shared gfs2 rw,suid,dev,exec,nouser,async 0 0

Please note; At the time of writing this HowTo, there is a bug in findfs and mount. According to RFC 4122, programs should accept a UUID in either upper or lower case. However, this is not currently the case, so you must pass the UUID in lower-case. Please see bugs 632373 and 632385.
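If you ever need to lower-case a UUID by hand, for example one copied from the mkfs.gfs2 output, tr will do it. A minimal sketch (the function name uuid_lower is made up for this example):

```shell
# Hypothetical helper: lower-case the hex digits of a UUID.
uuid_lower() {
    printf '%s\n' "$1" | tr 'A-F' 'a-f'
}

uuid_lower A1487063-2A3F-43B1-3A36-44936B0B4D1E
```

This prints a1487063-2a3f-43b1-3a36-44936b0b4d1e, the same form used in the fstab entry above.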

Testing The gfs2 Initialization Script

To verify that the new entry is valid, check gfs2's status.

/etc/init.d/gfs2 status
Configured GFS2 mountpoints: 
/xen_shared
Active GFS2 mountpoints: 
/xen_shared
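To preview which fstab entries the init script is likely to pick up, you can filter the file yourself. The sketch below approximates the behaviour described earlier (gfs2 entries that are not commented out and do not carry noauto); it is not a copy of the init script, and the function name gfs2_mountpoints is hypothetical.

```shell
# Hypothetical helper: list mount points of gfs2 entries in an fstab file,
# skipping commented lines and anything with the noauto option.
gfs2_mountpoints() {
    fstab="${1:-/etc/fstab}"
    awk '!/^#/ && $3 == "gfs2" && $4 !~ /noauto/ { print $2 }' "$fstab"
}

# Example (not run here):
# gfs2_mountpoints /etc/fstab
```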

Now test stopping and restarting to ensure that the GFS2 partition unmounts and mounts properly.

Stop gfs2.

/etc/init.d/gfs2 stop
Unmounting GFS2 filesystem (/xen_shared):                   [  OK  ]

Check with df -h to ensure that the mount is gone.

df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md1               39G  2.0G   35G   6% /
tmpfs                 466M   23M  444M   5% /dev/shm
/dev/md0              243M   70M  161M  31% /boot
xenstore              466M   32K  466M   1% /var/lib/xenstored

Start gfs2 again.

/etc/init.d/gfs2 start
Mounting GFS2 filesystem (/xen_shared):                     [  OK  ]

Again, check with df -h that it has been remounted.

df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md1               39G  2.0G   35G   6% /
tmpfs                 466M   29M  438M   7% /dev/shm
/dev/md0              243M   70M  161M  31% /boot
xenstore              466M   32K  466M   1% /var/lib/xenstored
/dev/dm-0              20G  259M   20G   2% /xen_shared

Done!

Further Reading

Altering Daemon Start Order

It is important that the various daemons in use by our cluster start in the right order. Most daemons will rely on services provided by another daemon to be running, and will not start or will not operate reliably otherwise.

We need to make sure that xend starts so that the network is stable. Then cman needs to start so that fencing and dlm are available. Next, drbd starts so that the clustered storage is available. Then clvmd must start so that the data on the DRBD resource is accessible. Now gfs2 needs to start so that the Xen domU configuration files can be found and finally xendomains must start to boot up the actual domU virtual machines.

To restate as a list, the start order must be:

  • xend
  • cman
  • drbd
  • clvmd
  • gfs2
  • xendomains

To make sure the start order is sane, we'll edit each of the six daemons' init scripts and alter their Required-Start lines. To make the changes take effect, we will use chkconfig to remove and re-add them to the various start levels.

Altering xend

This should already be done. If it isn't, please see "Making xend play nice with clustering" above. If you are revisiting that section, you can skip the cman edit as we will need to make another change in the next step.

Altering cman

This should already be done. If it isn't, please see "Making xend play nice with clustering" above. If you are revisiting that section, you can skip the cman edit as we will need to make another change in the next step.

Altering drbd

Now we will tell drbd to start after cman.

This requires the additional step of altering the chkconfig: - 70 08 line to instead read chkconfig: - 20 08. This isn't strictly needed, but will give more room for chkconfig to order the dependent daemons by allowing DRBD to be started as low as position 20, rather than waiting until position 70. This is somewhat more compatible with cman and clvmd, which normally start at positions 21 and 24, respectively.

vim /etc/init.d/drbd
#!/bin/bash
#
# chkconfig: - 20 08
# description: Loads and unloads the drbd module
#
# Copright 2001-2008 LINBIT Information Technologies
# Philipp Reisner, Lars Ellenberg
#
### BEGIN INIT INFO
# Provides: drbd
# Required-Start: $local_fs $network $syslog cman
# Required-Stop:  $local_fs $network $syslog
# Should-Start:   sshd multipathd
# Should-Stop:    sshd multipathd
# Default-Start:
# Default-Stop:
# Short-Description:    Control drbd resources.
### END INIT INFO
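If you would rather script the two drbd edits than make them by hand, something along these lines may work. It is a sketch only: the function name patch_drbd_init is made up, and it assumes the stock header contains the exact chkconfig: - 70 08 line. Try it on a copy of the script first.

```shell
# Hypothetical helper: apply the two drbd header edits to the file in $1.
patch_drbd_init() {
    script="$1"
    tmp="${script}.tmp.$$"
    # Start priority 70 -> 20; the stop priority stays at 08.
    sed 's/^# chkconfig: - 70 08$/# chkconfig: - 20 08/' "$script" > "$tmp" || return 1
    # Append cman to Required-Start unless it is already listed.
    if ! grep -q '^# Required-Start:.* cman' "$tmp"; then
        sed 's/^\(# Required-Start:.*\)$/\1 cman/' "$tmp" > "$tmp.2" || return 1
        mv "$tmp.2" "$tmp"
    fi
    # Write back in place so the script keeps its mode and ownership.
    cat "$tmp" > "$script" && rm -f "$tmp"
}

# Example (not run here):
# cp /etc/init.d/drbd /tmp/drbd.test && patch_drbd_init /tmp/drbd.test
```

Writing back with cat rather than mv is deliberate; replacing the file with mv would drop the executable bit that the init script needs.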

Altering clvmd

Now we will tell clvmd to start after drbd.

Note: There is currently a minor bug with lvm2-cluster version 2.02.73-2 in that /etc/init.d/clvmd is set by default to mode 0555. This is easily corrected by running the following command. Please check bug 636066 to see if this has been resolved.

chmod u+w /etc/init.d/clvmd

Once you've got write access, edit the file.

vim /etc/init.d/clvmd
#!/bin/bash
#
# chkconfig: - 24 76
# description: Starts and stops clvmd
#
# For Red-Hat-based distributions such as Fedora, RHEL, CentOS.
#              
### BEGIN INIT INFO
# Provides: clvmd
# Required-Start: $local_fs drbd
# Required-Stop: $local_fs
# Default-Start:
# Default-Stop: 0 1 6
# Short-Description: Clustered LVM Daemon
### END INIT INFO

Altering gfs2

Now we will tell gfs2 to start after clvmd. You will notice that cman is already listed under Required-Start and Required-Stop. It's true that cman must be started, but we've created a chain here so we can safely replace it with clvmd in the start line.

vim /etc/init.d/gfs2
#!/bin/bash
#
# gfs2 mount/unmount helper
#
# chkconfig: - 26 74
# description: mount/unmount gfs2 filesystems configured in /etc/fstab

### BEGIN INIT INFO
# Provides:             gfs2
# Required-Start:       $network clvmd
# Required-Stop:        $network
# Default-Start:
# Default-Stop:
# Short-Description:    mount/unmount gfs2 filesystems configured in /etc/fstab
# Description:          mount/unmount gfs2 filesystems configured in /etc/fstab
### END INIT INFO

Altering xendomains

Finally, we will alter xendomains so that it starts last, after gfs2.

vim /etc/init.d/xendomains
#!/bin/bash
#
# /etc/init.d/xendomains
# Start / stop domains automatically when domain 0 boots / shuts down.
#
# chkconfig: 345 99 00
# description: Start / stop Xen domains.
#
# This script offers fairly basic functionality.  It should work on Redhat
# but also on LSB-compliant SuSE releases and on Debian with the LSB package
# installed.  (LSB is the Linux Standard Base)
#
# Based on the example in the "Designing High Quality Integrated Linux
# Applications HOWTO" by Avi Alkalay
# <http://www.tldp.org/HOWTO/HighQuality-Apps-HOWTO/>
#
### BEGIN INIT INFO
# Provides:          xendomains
# Required-Start:    $syslog $remote_fs xend gfs2
# Should-Start:
# Required-Stop:     $syslog $remote_fs xend
# Should-Stop:
# Default-Start:     3 4 5
# Default-Stop:      0 1 2 6
# Default-Enabled:   yes
# Short-Description: Start/stop secondary xen domains
# Description:       Start / stop domains automatically when domain 0 
#                    boots / shuts down.
### END INIT INFO

Applying The Changes

Change the start order by removing and re-adding all cluster-related daemons using chkconfig.

chkconfig xenstored off; chkconfig xenconsoled off; chkconfig xend off; chkconfig cman off; chkconfig drbd off; chkconfig clvmd off; chkconfig gfs2 off; chkconfig xendomains off
chkconfig xendomains on; chkconfig gfs2 on; chkconfig clvmd on; chkconfig drbd on; chkconfig cman on; chkconfig xend on; chkconfig xenconsoled on; chkconfig xenstored on

Now verify that the start order is as we want it.

ls -lah /etc/rc3.d/
lrwxrwxrwx.  1 root root   19 Sep 20 13:37 S26xenstored -> ../init.d/xenstored
lrwxrwxrwx.  1 root root   21 Sep 20 13:37 S27xenconsoled -> ../init.d/xenconsoled
lrwxrwxrwx.  1 root root   14 Sep 20 13:37 S28xend -> ../init.d/xend
lrwxrwxrwx.  1 root root   14 Sep 20 13:37 S29cman -> ../init.d/cman
lrwxrwxrwx.  1 root root   14 Sep 20 13:37 S70drbd -> ../init.d/drbd
lrwxrwxrwx.  1 root root   15 Sep 20 13:37 S71clvmd -> ../init.d/clvmd
lrwxrwxrwx.  1 root root   14 Sep 20 13:37 S72gfs2 -> ../init.d/gfs2
lrwxrwxrwx.  1 root root   20 Sep 20 13:37 S99xendomains -> ../init.d/xendomains
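Rather than reading the listing by eye, you can reduce it to just the six daemons we care about and compare the result against the required order. A minimal sketch, assuming the usual SNNname link naming in the runlevel directory (the function name cluster_start_order is hypothetical):

```shell
# Hypothetical helper: print the six cluster daemons in the order their
# S-links sort in the given runlevel directory (default /etc/rc3.d).
cluster_start_order() {
    rcdir="${1:-/etc/rc3.d}"
    LC_ALL=C ls "$rcdir" \
        | sed -n 's/^S[0-9][0-9]\(.*\)$/\1/p' \
        | grep -Ex 'xend|cman|drbd|clvmd|gfs2|xendomains' \
        | tr '\n' ' ' \
        | sed 's/ $//'
}

# Example (not run here):
# [ "$(cluster_start_order)" = "xend cman drbd clvmd gfs2 xendomains" ] \
#     && echo "Start order OK" || echo "Start order is wrong"
```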

Setting Up Xen

WARNING: Everything below here is pretty seriously screwed up.

Note: This is not meant to be an extensive tutorial on Xen itself. It covers enough to get domU VMs provisioned in a manner that will take advantage of the cluster. As such, there is minimal explanation of configuration file options. If you need further help, please drop by the ##xen (yes, two ##) IRC channel on freenode.org.

Install The Hypervisor Tools

These tools are very useful in provisioning and managing domU VMs.

yum -y install virt-install virt-viewer

Install The HVM/KVM Tools

For hvm (Hardware Virtual Machines), which is required for fully virtualized guests such as Microsoft Windows VMs, you must install the following packages as well.

yum -y install qemu-kvm.x86_64 qemu-kvm-tools.x86_64

Ensure That Virtualization Is Enabled

Many motherboards disable hvm by default in their BIOS. Assuming that you've got a dom0 kernel running at this stage, you can check if this is the case by checking the xm info output.

xm info
host                   : an-node04.alteeve.com
release                : 2.6.32.23-170.dom0_an1.fc14.x86_64
version                : #1 SMP Sun Oct 10 20:39:19 EDT 2010
machine                : x86_64
nr_cpus                : 4
nr_nodes               : 1
cores_per_socket       : 4
threads_per_core       : 1
cpu_mhz                : 2209
hw_caps                : 178bf3ff:efd3fbff:00000000:00001310:00802001:00000000:000037ff:00000000
virt_caps              : hvm
total_memory           : 4063
free_memory            : 2987
node_to_cpu            : node0:0-3
node_to_memory         : node0:2987
node_to_dma32_mem      : node0:2928
max_node_id            : 0
xen_major              : 4
xen_minor              : 0
xen_extra              : .1
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64 
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
xen_commandline        : dom0_mem=1024M
cc_compiler            : gcc version 4.4.4 20100630 (Red Hat 4.4.4-10) (GCC) 
cc_compile_by          : root
cc_compile_domain      : 
cc_compile_date        : Mon Oct 11 01:10:38 EDT 2010
xend_config_format     : 4

Look at the virt_caps and xen_caps lines. Notice the hvm entries? This shows that HVM, also known as "secure virtualization", has been enabled. If you do not see this, please check your mainboard manual for information on enabling this on your system.

Note: The next paragraph applies only when running a vanilla kernel.

If you are running a vanilla kernel, you can check to see if your CPU has support for HVM guests by checking /proc/cpuinfo. What you're looking for depends on your CPU manufacturer. If you have an Intel CPU, you need to look for the vmx flag. Likewise, with AMD CPUs, you need to look for the svm flag.

For a more complete, if somewhat dated, paper on this topic, please see the Fedora 6 Xen Quickstart Guide, System Requirements.
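On a vanilla kernel, the flag check can be done in one line. A minimal sketch (the function name cpu_hvm_flag is made up; as noted above, this is not expected to be reliable under a dom0 kernel):

```shell
# Hypothetical helper: print the first HVM-related CPU flag found
# (vmx for Intel, svm for AMD), or nothing if neither is present.
cpu_hvm_flag() {
    cpuinfo="${1:-/proc/cpuinfo}"
    grep -Eow 'vmx|svm' "$cpuinfo" | head -n 1
}

# Example (not run here):
# flag=$(cpu_hvm_flag); [ -n "$flag" ] && echo "HVM capable ($flag)"
```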

Enabling Migration

By default, xend will not allow domU VMs to be migrated onto or off of a given dom0 host. Given that we've got a cluster though, we very much want this behaviour, so now we will enable it. This is done by making edits to /etc/xen/xend-config.sxp. Below is a concise list of options that must be set. Some exist already in the file and need to be commented out or altered.

Warning: The values below are very permissive. Please review each option and improve the security to fit your network before going into production!

vim /etc/xen/xend-config.sxp
(xend-http-server yes)
(xend-unix-server yes)
(xend-tcp-xmlrpc-server yes)
(xend-relocation-server yes)
(xend-udev-event-server yes)
(xend-port            8000)
(xend-relocation-port 8002)
(xend-address '')
(xend-relocation-address '')
(xend-relocation-hosts-allow '')

Once done, restart xend. It is usually safest to stop the cluster beforehand to avoid accidental fencing caused by the underlying network being reconfigured.

/etc/init.d/gfs2 stop
/etc/init.d/clvmd stop
/etc/init.d/cman stop
/etc/init.d/xend restart
/etc/init.d/cman start
/etc/init.d/clvmd start
/etc/init.d/gfs2 start

Virtual Machine Naming Convention

Note: This section acts as a recommendation only. Feel free to alter this to fit your style and needs.

Personally, I like to name my VMs similar to c5_shorewall_01. To elaborate, I like to use the format:

  • os_role_seq (<Operating System ID>_<Role of the VM>_<Sequence Integer>).

There are no (known) restrictions on virtual machine names, so feel free to use names that make sense for you. I do strongly recommend that you match the name of your domU VM to the name of its host LVM logical volume.
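If you script your provisioning, a small helper can keep names consistent with this convention. This is only an illustration; the function name vm_name is made up, and the sequence number should be passed as a plain integer without leading zeros.

```shell
# Hypothetical helper: build an os_role_seq name with a two-digit sequence.
vm_name() {
    printf '%s_%s_%02d\n' "$1" "$2" "$3"
}

vm_name c5 shorewall 1
```

This prints c5_shorewall_01. The same string can then be used for both the domU name and the logical volume name, for example /dev/drbd0_vg0/c5_shorewall_01.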

Provisioning domU VMs

There are two ways to provision new VMs that we will cover (there are many others); Using virt-install and using xm create -f /path/to/domain.cfg where domain.cfg is a hand-crafted python script.

Provisioning with virt-install

This uses a long command line argument that provisions the VM and loads it into libvirt. Where possible, this is probably the best way to provision a domU. However, there are occasions where it may not work.

Here is an example where a domU is provisioned for a Fedora 14, x86_64 VM with a dedicated logical volume on the clustered LVM. The command to create the LV precedes the command to provision the VM. Please review and adjust values as you need. Consult man virt-install for a more complete list of available options and their uses.

Following the provision command are two examples of how to back up the configuration to a flat file. The first directly dumps the configuration into the libvirt format. The second backs up the configuration to an XML file and then shows an example of how that file can be converted into a standard python script.

# Fedora 14 x86_64 RPM builder VM
lvcreate -L 40G -n f14_builder_01 drbd0_vg0
virt-install --connect xen \
             --name f14_builder_01 \
             --ram 1024 \
             --arch x86_64 \
             --vcpus 1 \
             --cpuset 1 \
             --location http://192.168.1.10/f14/x86_64/img/ \
             --os-type linux \
             --os-variant fedora14 \
             --disk path=/dev/drbd0_vg0/f14_builder_01 \
             --network bridge=eth0,mac=00:16:3e:00:10:01 \
             --vnc \
             --paravirt
# Backup the domU configuration to the libvirt native format.
xm list -l f14_builder_01 > /xen_shared/domU_config/f14_builder.cfg
# Backup the config to an XML file.
virsh dumpxml f14_builder_01 > /xen_shared/domU_config/f14_builder_01.xml
# Convert it to a "traditional" python script. Be sure to edit the resulting .cfg file to remove the 'vifname=' sections.
virsh -c xen:/// domxml-to-native xen-xm f14_builder_01.xml > f14_builder_01.cfg

The next two examples show the provisioning of CentOS v5.5 and RHEL v6.0 beta 2 machines. The dump and backup methods above should be easily adapted to work with these VMs.

# A CentOS test server
lvcreate -L 40G -n c5_test_01 drbd0_vg0
virt-install --connect xen \
             --name c5_test_01 \
             --ram 1024 \
             --arch x86_64 \
             --vcpus 1 \
             --cpuset 1-3 \
             --location http://192.168.1.10/c5/x86_64/img/ \
             --os-type linux \
             --os-variant rhel5.4 \
             --disk path=/dev/drbd0_vg0/c5_test_01 \
             --network bridge=eth0,mac=00:16:3e:00:10:02 \
             --vnc \
             --paravirt
# Red Hat Enterprise Linux 6 beta 2 test server
lvcreate -L 40G -n rh6b2_test_01 drbd0_vg0
virt-install --connect xen \
             --name rh6b2_test_01 \
             --ram 1024 \
             --arch x86_64 \
             --vcpus 1 \
             --cpuset 1 \
             --location http://192.168.1.10/rhel6/x86_64/img/ \
             --os-type linux \
             --os-variant rhel6 \
             --disk path=/dev/drbd0_vg0/rh6b2_test_01 \
             --network bridge=eth0,mac=00:16:3e:00:10:03 \
             --vnc \
             --paravirt

Provisioning With 'xm create'

At the time of writing this, I could not sort out the magical incantation for provisioning a Windows VM using virt-install. Instead, I used the "old" style of crafting a configuration file using a python script. This is useful to know as many templates exist on the web for various VMs. Following the steps below, you should be able to fairly easily adapt them.

In this example, we will provision a VM using HVM from an ISO image of the Windows 2008 Server installation DVD. In this case, there is a problem with virt-install finding the qemu-dm file.

This configuration file will be saved in the /xen_shared/domU_config directory, which exists on the shared GFS2 partition we created earlier.

mkdir /xen_shared/domU_config
vim /xen_shared/domU_config/win2008_sql_01.cfg
# This is the Windows 2008 Enterprise Server x86_64 hosting MS SQL Server 2008 Enterprise
kernel = "/usr/lib/xen/boot/hvmloader"
builder='hvm'
memory = 1024

# Should be at least 2KB per MB of domain memory, plus a few MB per vcpu.
shadow_memory = 8
name = "win2008_sql_01"
#vif = [ 'type=ioemu, bridge=xenbr0' ]
vif = [ 'type=ioemu, bridge=eth0,mac=00:16:3e:00:30:03' ]
acpi = 1
apic = 1
# Remove the 'file:...' entry (or change it to another ISO) after the install is complete.
disk = [ 'phy:/dev/drbd0_vg0/win2008_sql_01,hda,w', 'file:/xen_shared/iso/MS-Win2008-Ent-x86_64-SP2.iso,hdc:cdrom,r' ]

device_model = '/usr/lib/xen/bin/qemu-dm'

#-----------------------------------------------------------------------------
# boot on floppy (a), hard disk (c) or CD-ROM (d) 
# default: hard disk, cd-rom, floppy
boot="dc"
sdl=0
vnc=1
vncconsole=1
vncpasswd=''

serial='pty'
usbdevice='tablet'
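The comment above shadow_memory gives a rule of thumb: at least 2 KB per MB of domain memory, plus a few MB per vcpu. As a rough sketch of that arithmetic, assuming 4 MB per vcpu as a stand-in for "a few" (both that figure and the function name shadow_memory_mb are assumptions, not values from the Xen documentation):

```shell
# Hypothetical helper: estimate a minimum shadow_memory value in MB.
# $1 = domain memory in MB, $2 = number of vcpus,
# $3 = MB allowed per vcpu (assumed default of 4).
shadow_memory_mb() {
    mem_mb="$1"
    vcpus="$2"
    per_vcpu_mb="${3:-4}"
    # 2 KB per MB of RAM, rounded up to a whole MB.
    base_mb=$(( (mem_mb * 2 + 1023) / 1024 ))
    echo $(( base_mb + vcpus * per_vcpu_mb ))
}

shadow_memory_mb 1024 1
```

For the 1024 MB, single-vcpu domU above this prints 6, so the configured value of 8 leaves a little headroom under these assumptions.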

Now provision the VM using xm create.

xm create -f /xen_shared/domU_config/win2008_sql_01.cfg

At this point, the VM is not loaded into libvirt. This is a problem as, on boot, the node will not know that the VM exists. As a consequence, some tools like virt-manager will not see the VM until it is manually started with xm create -f /xen_shared/domU_config/domain.cfg. Further, and perhaps more troubling, changes made to the VM's configuration outside of the config file will be lost when you restart from the config file.

To fix this, we'll use a few hacks to load the config into libvirt. This needs to be done while the domU is loaded and it must be done on the node currently hosting the VM.

First, make sure that the domU's configuration is visible. Then, dump the config and filter it through grep and sed to pull out the UUID. Once you know that you get the proper UUID, create the directory under /var/lib/xend/domains/ and then create the config.sxp file with the domU's configuration.

NOTE: Be sure to change win2008_01 in the examples below to match the name of the domU that you want to set up.

xm list -l win2008_01
xm list -l win2008_01 | grep '^    (uuid' | sed -e "s/    (uuid \(.*\))/\1/"
mkdir /var/lib/xend/domains/`xm list -l win2008_01 | grep '^    (uuid' | sed -e "s/    (uuid \(.*\))/\1/"`
cd /var/lib/xend/domains/`xm list -l win2008_01 | grep '^    (uuid' | sed -e "s/    (uuid \(.*\))/\1/"`
xm list -l win2008_01 > config.sxp

There are several uuid strings in the output from xm list -l domain. Thankfully, the one we want is the only one indented by four spaces. This is how we can be fairly confident that the UUID returned by sed above is, in fact, the domU's own UUID.
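The same grep and sed filter can be wrapped in a function so the UUID is looked up once and reused, instead of running the pipeline for each command. A minimal sketch (the function name domu_uuid is made up; it reads xm list -l output on standard input):

```shell
# Hypothetical helper: read 'xm list -l <domU>' output on stdin and print
# the domain's own UUID (the only uuid line indented by exactly four spaces).
domu_uuid() {
    grep '^    (uuid' | sed -e 's/    (uuid \(.*\))/\1/'
}

# Example (not run here):
# uuid=$(xm list -l win2008_01 | domu_uuid)
# mkdir -p "/var/lib/xend/domains/$uuid"
# xm list -l win2008_01 > "/var/lib/xend/domains/$uuid/config.sxp"
```

It is worth echoing the captured value and checking that it is not empty before creating the directory.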

Making VMs Highly Available

Now this is the point, isn't it?

In this final step, we're going to move the startup of drbd, clvmd, gfs2 and xendomains out of init.d and into our cluster. The reason for this is that the cluster is much wiser about how to handle clustered services. We do not want a node that is not in the cluster, that is, a node without quorum, to try to connect to shared resources. The cluster manager handles this.

Note: This how-to is trying to keep things simple, so we will use rgmanager which is built in to cman. There is a compelling argument to use Pacemaker instead. That is a bit beyond the scope of the How-To though.

Disable All Cluster Software From Starting At Boot

Note: This will be moved to another section or possibly removed entirely before the final release.

There have been problems on booting both nodes at the same time causing DRBD split brain. For this reason, I am currently advising disabling drbd, clvmd and gfs2 in addition to the other services to be disabled in the next step. The reason for this is that, by manually starting these services, you can catch failures and correct them before they cascade to other daemons. By default, if the drbd daemon were to fail to come up, the system would move on and try to start clvmd and the rest, which would obviously fail given the lack of DRBD. I am working on merging the startup of these services into the cluster.conf file, but that will be a little while yet.

Install rgmanager

The cluster tool rgmanager will provide the cluster-related service management and restart VMs lost on a failed node. We will install it now.

yum -y install rgmanager

Removing Clustered Services From init.d

As stated, we will now stop the cluster-related services from starting with the OS.

chkconfig xendomains off; chkconfig gfs2 off; chkconfig clvmd off; chkconfig drbd off; chkconfig rgmanager off

Manual Startup

When manually starting the cluster, please do it in the following order, ensuring that each service did indeed start before moving on.
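The idea of starting each service in turn and halting at the first failure can be sketched as a small bash function. This is only an illustration: the function name start_in_order and the INIT_DIR variable are hypothetical, and the service order in the example is the one listed earlier in this section (drbd, clvmd, gfs2, xendomains).

```shell
#!/bin/bash
# Start init scripts one at a time and stop at the first failure.
# INIT_DIR is a hypothetical override used for testing; it defaults to
# /etc/init.d as used throughout this paper.
start_in_order() {
    local svc
    for svc in "$@"; do
        echo "Starting $svc..."
        if ! "${INIT_DIR:-/etc/init.d}/$svc" start; then
            echo "$svc failed to start; not continuing." >&2
            return 1
        fi
    done
}

# Example (not run here):
#   start_in_order drbd clvmd gfs2 xendomains
```

An init script returning success does not prove the service is healthy. DRBD, for example, can start and still sit in StandAlone, so the manual checks described below remain necessary.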

Starting drbd

Start DRBD on both nodes at close to the same time. Once started, check /proc/drbd to ensure that the nodes are syncing or connected.

/etc/init.d/drbd start
cat /proc/drbd

If the DRBD array is in StandAlone and fails to connect, then you likely have a split brain condition. To recover, you need to identify which node you trust to have the most recent view of the data. For this article, let's assume that an-node01 is the node we trust and an-node02 is the node we will discard.
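Rather than reading /proc/drbd by eye, the connection state can be pulled out with sed. This sketch assumes the DRBD 8.3 style layout, where each device line looks like " 0: cs:Connected ro:... ds:...". The function name drbd_conn_states is hypothetical.

```shell
#!/bin/bash
# Print the connection state (the cs: field) of each DRBD device found in
# /proc/drbd-style text on stdin, one per line.
drbd_conn_states() {
    sed -n 's/^ *[0-9][0-9]*: cs:\([A-Za-z]*\).*/\1/p'
}

# Example (not run here):
#   if drbd_conn_states < /proc/drbd | grep -qx StandAlone; then
#       echo "A resource is StandAlone; check for split brain." >&2
#   fi
```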

Warning: When you invalidate a node's DRBD array, any changes made to it since the split brain occurred will be lost. For this reason, be very careful about how you proceed. If you are at all in doubt, back up the DRBD device of the node to be invalidated before proceeding. Assuming that the DRBD backing device is /dev/md3, you could use dd to copy its contents to a destination at least as large as /dev/md3. If you do not have enough space, you may be able to pipe the output through a compression program like gzip or bzip2.

To back up the device to be invalidated, if you are not certain that it is safe to overwrite, run the following.

dd if=/dev/md3 of=/path/to/drbd_backup.img
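If space is tight, the compressed variant mentioned in the warning could look like the sketch below. The function name backup_compressed is hypothetical and the paths are the same placeholders used above. The pipefail option is set inside a subshell so that a dd read error is not hidden by a successful gzip.

```shell
#!/bin/bash
# Copy a block device (or file) through gzip so the backup needs less space.
# Returns non-zero if either dd or gzip fails.
backup_compressed() {
    local src=$1 dest=$2
    (
        set -o pipefail
        dd if="$src" bs=1048576 | gzip -c > "$dest"
    )
}

# Example (not run here):
#   backup_compressed /dev/md3 /path/to/drbd_backup.img.gz
# To restore later:
#   gunzip -c /path/to/drbd_backup.img.gz | dd of=/dev/md3 bs=1048576
```

How much space this saves depends entirely on how compressible the data on the device is.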

Once you are ready to proceed, and remembering that we have decided that an-node01 is the most up to date and that r0 is the name of the split-brain resource, run the following sets of commands on the appropriate machines.

Both

/etc/init.d/drbd stop
modprobe drbd
drbdadm attach r0

an-node02

drbdadm invalidate r0

Both

drbdadm connect r0
cat /proc/drbd

Ensure that both nodes are connected and that the array has begun syncing. Note that at this stage, both nodes will be Secondary, an-node01 will be UpToDate and an-node02 will be Inconsistent. If this is the case, then you are safe to proceed. If not, resolve the issue before going any further.
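The check described above can be expressed as a small test that must pass before promoting. This is a sketch under the assumption that /proc/drbd uses the DRBD 8.3 style "cs:... ro:... ds:..." device line; the function name drbd_ready_for_primary is hypothetical. It only inspects the first device.

```shell
#!/bin/bash
# Return 0 when the first DRBD device in /proc/drbd-style text on stdin is
# talking to its peer and at least one side holds UpToDate data.
drbd_ready_for_primary() {
    local line cs ds
    line=$(grep -m 1 -E '^ *[0-9]+: cs:') || return 1
    cs=$(printf '%s\n' "$line" | sed -n 's/.*cs:\([A-Za-z]*\).*/\1/p')
    ds=$(printf '%s\n' "$line" | sed -n 's/.*ds:\([A-Za-z/]*\).*/\1/p')
    # The peers must be connected or actively resyncing.
    case "$cs" in
        Connected|SyncSource|SyncTarget) ;;
        *) return 1 ;;
    esac
    # One side (local/peer) must have good data to sync from.
    case "$ds" in
        UpToDate/*|*/UpToDate) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (not run here):
#   drbd_ready_for_primary < /proc/drbd && drbdadm primary r0
```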

Both

drbdadm primary r0

Note: There is no need to run /etc/init.d/drbd start now as we've replicated what it does to get to this stage.

Warning: While DRBD is synchronizing, the Inconsistent node will shut down its DRBD array if the UpToDate node shuts down or is fenced!

Starting clvmd

Warning: In rare cases, I've seen clvmd spinlock (a kernel lock that kill -9 won't stop). In these cases, I've found that there is no recourse short of forcing down the node. Trying to reboot will often cause one of the two nodes to get fenced. For this reason, if clvmd appears to hang and ps aux | grep clvmd shows clvmd -T30 which cannot be stopped with kill -9, then force power down the node via a fence call or by pressing and holding the node's power button until the power is off.

/etc/init.d/clvmd start

Editing vm.sh

ToDo

SSH Setup

You need to make sure that each node's SSH public key is copied into its own authorized_keys file. Then you should be able to migrate your VMs using virsh.
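One way to add the key without creating duplicate entries is sketched below. The function name authorize_own_key is hypothetical, and the default paths assume an RSA key pair already exists for the current user; pass other paths as arguments if yours differ.

```shell
#!/bin/bash
# Append a public key to an authorized_keys file unless it is already there.
# Both paths are parameters; the defaults assume an RSA key for this user.
authorize_own_key() {
    local pub=${1:-$HOME/.ssh/id_rsa.pub}
    local auth=${2:-$HOME/.ssh/authorized_keys}
    [ -r "$pub" ] || { echo "No public key at $pub" >&2; return 1; }
    mkdir -p "$(dirname "$auth")"
    touch "$auth"
    chmod 600 "$auth"
    # -x matches whole lines and -F treats the key as a fixed string.
    grep -qxF -- "$(cat "$pub")" "$auth" || cat "$pub" >> "$auth"
}

# Example (not run here), run as the user that rgmanager will use on each node:
#   authorize_own_key
```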

ToFinish

Testing Live Migration

# Assuming that the VM 'f14_builder_01' is running on this node and 'an-node02' is the destination node.
virsh migrate --live f14_builder_01 xen+ssh:/// xenmigr://an-node02

http://libvirt.org/remote.html

Syncing domU Configuration

To make a VM available on both nodes at all times, we will use virsh on the original host to first export the XML configuration to a file on our GFS partition, and then we will use it on the second node to define that VM. This process must be repeated whenever the configuration of the domU changes.

In this example, we will copy the configuration for the domU called f14_builder_01 running on an-node01 so that it becomes available on an-node02.

an-node01

virsh dumpxml f14_builder_01 > /xen_shared/domU_config/f14_builder_01.xml
cat /xen_shared/domU_config/f14_builder_01.xml
<domain type='xen'>
  <name>f14_builder_01</name>
  <uuid>ad211e8f-a685-79fe-b217-cb94bea6d2bd</uuid>
  <memory>1048576</memory>
  <currentMemory>1048576</currentMemory>
  <vcpu cpuset='1'>1</vcpu>
  <bootloader>/usr/bin/pygrub</bootloader>
  <os>
    <type>linux</type>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/lib/xen/bin/qemu-dm</emulator>
    <disk type='block' device='disk'>
      <driver name='phy'/>
      <source dev='/dev/drbd0_vg0/f14_builder_01'/>
      <target dev='xvda' bus='xen'/>
    </disk>
    <interface type='bridge'>
      <mac address='00:16:3e:00:10:01'/>
      <source bridge='eth0'/>
      <script path='/etc/xen/scripts/vif-bridge'/>
      <target dev='vif-1.0'/>
    </interface>
    <console type='pty'>
      <target port='0'/>
    </console>
    <input type='mouse' bus='xen'/>
    <graphics type='vnc' port='-1' autoport='yes' keymap='en-us'/>
  </devices>
</domain>

an-node02

virsh define /xen_shared/domU_config/f14_builder_01.xml
Domain f14_builder_01 defined from /xen_shared/domU_config/f14_builder_01.xml
xm list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  1024     4     r-----    943.4
f14_builder_01                                  1024     1                 0.0
win2008_01                                      1024     1                66.6
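Because the export must be repeated after every configuration change, it may be convenient to wrap it in a function. The sketch below is an assumption-laden illustration: export_domu_xml and the VIRSH override are hypothetical names, and the default directory is the /xen_shared/domU_config path used above.

```shell
#!/bin/bash
# Dump a domU's libvirt XML into the shared config directory.
# VIRSH is a hypothetical override for testing; it defaults to virsh.
export_domu_xml() {
    local dom=$1 dir=${2:-/xen_shared/domU_config}
    local tmp="$dir/$dom.xml.tmp"
    # Write to a temporary name first so a failed dump never replaces
    # the last good copy.
    if "${VIRSH:-virsh}" dumpxml "$dom" > "$tmp"; then
        mv "$tmp" "$dir/$dom.xml"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Example (not run here):
#   on an-node01: export_domu_xml f14_builder_01
#   on an-node02: virsh define /xen_shared/domU_config/f14_builder_01.xml
```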

Adding The <rm> Section to cluster.conf

ToDo

Pushing The Updated cluster.conf To The Other Node

ToDo

Testing rgmanager

Testing rgmanager involves stopping the VM, freezing it, and then using rg_test to start the VM, check its status and stop it again. If this works, we'll then thaw the resource, manually start the domU and then do a test kill of the node hosting the VM to see if the second node will, in fact, start the lost VM.

Start the cluster, if it's not already running. Make sure that the VM is working by manually starting and stopping it before proceeding.

For this test, we will use the f14_builder_01 domU VM running on an-node01. Freeze the resource using clusvcadm. This only needs to be run on one node to freeze the service (VM) on both nodes.

clusvcadm -Z vm:f14_builder_01
Local machine freezing vm:f14_builder_01...Success

Check that it is in fact frozen with clustat. Note that at this stage the domU VM is stopped.

clustat
Cluster Status for an-cluster03 @ Wed Oct 20 23:48:14 2010
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online, Local, rgmanager
 an-node05.alteeve.com                       2 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 vm:f14_builder_01              (none)                         stopped    [Z]

Now that we know it's seen by rgmanager, is stopped and is frozen ([Z]), we can proceed with the test start.
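The same check can be scripted by reading the service line from clustat. This is a sketch that assumes the column layout shown above (service name, owner, state, optional flag); the function names are hypothetical.

```shell
#!/bin/bash
# Print the State column for one service from clustat output on stdin.
# Returns non-zero if the service is not listed.
clustat_service_state() {
    awk -v svc="$1" '$1 == svc { print $3; found = 1 } END { exit !found }'
}

# Return 0 if the service's line carries the frozen flag, [Z].
clustat_service_frozen() {
    awk -v svc="$1" '$1 == svc && $NF == "[Z]" { found = 1 } END { exit !found }'
}

# Example (not run here):
#   clustat | clustat_service_frozen vm:f14_builder_01 && echo "frozen"
```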

an-node01

rg_test test /etc/cluster/cluster.conf start vm f14_builder_01
Running in test mode.
Loading resource rule from /usr/share/cluster/oracledb.sh
Loading resource rule from /usr/share/cluster/lvm.sh
Loading resource rule from /usr/share/cluster/lvm_by_lv.sh
Loading resource rule from /usr/share/cluster/mysql.sh
Loading resource rule from /usr/share/cluster/service.sh
Loading resource rule from /usr/share/cluster/vm.sh
Loading resource rule from /usr/share/cluster/apache.sh
Loading resource rule from /usr/share/cluster/nfsexport.sh
Loading resource rule from /usr/share/cluster/drbd.sh
Loading resource rule from /usr/share/cluster/fs.sh
Loading resource rule from /usr/share/cluster/script.sh
Loading resource rule from /usr/share/cluster/SAPInstance
Loading resource rule from /usr/share/cluster/tomcat-6.sh
Loading resource rule from /usr/share/cluster/postgres-8.sh
Loading resource rule from /usr/share/cluster/nfsclient.sh
Loading resource rule from /usr/share/cluster/samba.sh
Loading resource rule from /usr/share/cluster/ip.sh
Loading resource rule from /usr/share/cluster/nfsserver.sh
Loading resource rule from /usr/share/cluster/ASEHAagent.sh
Loading resource rule from /usr/share/cluster/netfs.sh
Loading resource rule from /usr/share/cluster/tomcat-5.sh
Loading resource rule from /usr/share/cluster/lvm_by_vg.sh
Loading resource rule from /usr/share/cluster/SAPDatabase
Loading resource rule from /usr/share/cluster/clusterfs.sh
Loading resource rule from /usr/share/cluster/openldap.sh
Loading resource rule from /usr/share/cluster/ocf-shellfuncs
Loading resource rule from /usr/share/cluster/named.sh
Loading resource rule from /usr/share/cluster/svclib_nfslock
Loading resource rule from /usr/share/cluster/smb.sh
Starting f14_builder_01...
Hypervisor: xen
Management tool: virsh
Hypervisor URI: xen+ssh:///
Migration URI format: xenmigr://target_host/
Virtual machine f14_builder_01 is shut off
<debug>  virsh -c xen+ssh:/// start f14_builder_01
[vm] virsh -c xen+ssh:/// start f14_builder_01
Domain f14_builder_01 started

Start of f14_builder_01 complete
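Most of the output above is the list of resource rules being loaded. If you only want the lines about the VM itself, a simple filter helps. The function name rg_test_summary is hypothetical, and the example redirects standard error into the pipe on the assumption that some messages may be written there.

```shell
#!/bin/bash
# Drop the repeated "Loading resource rule" lines from rg_test output on
# stdin, leaving the lines that describe the action itself.
rg_test_summary() {
    grep -v '^Loading resource rule from '
}

# Example (not run here):
#   rg_test test /etc/cluster/cluster.conf start vm f14_builder_01 2>&1 | rg_test_summary
```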

Now confirm that it really did start.

an-node01

xm list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  1024     4     r-----   1218.4
f14_builder_01                               1  1024     1     ------      9.7
win2008_01                                      1024     1                 0.0

Wonderful!

Note that if you now run clustat on either node, the VM will not show as running. This is because it is frozen and this is expected.

Now check the status.

rg_test test /etc/cluster/cluster.conf status vm f14_builder_01
Running in test mode.
Loading resource rule from /usr/share/cluster/oracledb.sh
Loading resource rule from /usr/share/cluster/lvm.sh
Loading resource rule from /usr/share/cluster/lvm_by_lv.sh
Loading resource rule from /usr/share/cluster/mysql.sh
Loading resource rule from /usr/share/cluster/service.sh
Loading resource rule from /usr/share/cluster/vm.sh
Loading resource rule from /usr/share/cluster/apache.sh
Loading resource rule from /usr/share/cluster/nfsexport.sh
Loading resource rule from /usr/share/cluster/drbd.sh
Loading resource rule from /usr/share/cluster/fs.sh
Loading resource rule from /usr/share/cluster/script.sh
Loading resource rule from /usr/share/cluster/SAPInstance
Loading resource rule from /usr/share/cluster/tomcat-6.sh
Loading resource rule from /usr/share/cluster/postgres-8.sh
Loading resource rule from /usr/share/cluster/nfsclient.sh
Loading resource rule from /usr/share/cluster/samba.sh
Loading resource rule from /usr/share/cluster/ip.sh
Loading resource rule from /usr/share/cluster/nfsserver.sh
Loading resource rule from /usr/share/cluster/ASEHAagent.sh
Loading resource rule from /usr/share/cluster/netfs.sh
Loading resource rule from /usr/share/cluster/tomcat-5.sh
Loading resource rule from /usr/share/cluster/lvm_by_vg.sh
Loading resource rule from /usr/share/cluster/SAPDatabase
Loading resource rule from /usr/share/cluster/clusterfs.sh
Loading resource rule from /usr/share/cluster/openldap.sh
Loading resource rule from /usr/share/cluster/ocf-shellfuncs
Loading resource rule from /usr/share/cluster/named.sh
Loading resource rule from /usr/share/cluster/svclib_nfslock
Loading resource rule from /usr/share/cluster/smb.sh
Checking status of f14_builder_01...
Hypervisor: xen
Management tool: virsh
Hypervisor URI: xen+ssh:///
Migration URI format: xenmigr://target_host/
Virtual machine f14_builder_01 is idle
Status of f14_builder_01 is good

The last line is exactly what we want. Finally, stop the domU VM. If you're watching the VM itself over VNC, you should see it do a graceful shutdown in the next step.

rg_test test /etc/cluster/cluster.conf stop vm f14_builder_01
Running in test mode.
Loading resource rule from /usr/share/cluster/oracledb.sh
Loading resource rule from /usr/share/cluster/lvm.sh
Loading resource rule from /usr/share/cluster/lvm_by_lv.sh
Loading resource rule from /usr/share/cluster/mysql.sh
Loading resource rule from /usr/share/cluster/service.sh
Loading resource rule from /usr/share/cluster/vm.sh
Loading resource rule from /usr/share/cluster/apache.sh
Loading resource rule from /usr/share/cluster/nfsexport.sh
Loading resource rule from /usr/share/cluster/drbd.sh
Loading resource rule from /usr/share/cluster/fs.sh
Loading resource rule from /usr/share/cluster/script.sh
Loading resource rule from /usr/share/cluster/SAPInstance
Loading resource rule from /usr/share/cluster/tomcat-6.sh
Loading resource rule from /usr/share/cluster/postgres-8.sh
Loading resource rule from /usr/share/cluster/nfsclient.sh
Loading resource rule from /usr/share/cluster/samba.sh
Loading resource rule from /usr/share/cluster/ip.sh
Loading resource rule from /usr/share/cluster/nfsserver.sh
Loading resource rule from /usr/share/cluster/ASEHAagent.sh
Loading resource rule from /usr/share/cluster/netfs.sh
Loading resource rule from /usr/share/cluster/tomcat-5.sh
Loading resource rule from /usr/share/cluster/lvm_by_vg.sh
Loading resource rule from /usr/share/cluster/SAPDatabase
Loading resource rule from /usr/share/cluster/clusterfs.sh
Loading resource rule from /usr/share/cluster/openldap.sh
Loading resource rule from /usr/share/cluster/ocf-shellfuncs
Loading resource rule from /usr/share/cluster/named.sh
Loading resource rule from /usr/share/cluster/svclib_nfslock
Loading resource rule from /usr/share/cluster/smb.sh
Stopping f14_builder_01...
Hypervisor: xen
Management tool: virsh
Hypervisor URI: xen+ssh:///
Migration URI format: xenmigr://target_host/
<debug>  Virtual machine f14_builder_01 is idle
[vm] Virtual machine f14_builder_01 is idle
virsh shutdown f14_builder_01 ...
Domain f14_builder_01 is being shutdown

Stop of f14_builder_01 complete

If the domU VM shut down, then this stage of the testing completed successfully!

So now, the last step is to thaw the service and then check that it is, indeed, thawed.

clusvcadm -U vm:f14_builder_01
Local machine unfreezing vm:f14_builder_01...Success
clustat
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online, Local, rgmanager
 an-node05.alteeve.com                       2 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 vm:f14_builder_01              (none)                         stopped

Note that the [Z] is gone now.

Starting The domU VM Using clusvcadm

In order for rgmanager to know that a service is running, in this case our VM, we must start the service using clusvcadm.

clusvcadm -e vm:f14_builder_01
Local machine trying to enable vm:f14_builder_01...Success
vm:f14_builder_01 is now running on an-node04.alteeve.com

After a few moments, you should be able to see the domU VM listed as started on both nodes.

clustat
Cluster Status for an-cluster03 @ Thu Oct 21 00:34:10 2010
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online, Local, rgmanager
 an-node05.alteeve.com                       2 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 vm:f14_builder_01              an-node04.alteeve.com          started

With the VM running, we are now ready to do our destructive test. If this works, f14_builder_01 should start on an-node02 once an-node01 dies. To do this, we will killall -9 corosync on an-node01 while watching an-node02. Personally, I like to have terminals open running:

  • clear; tail -f -n 0 /var/log/messages
  • watch clustat
  • watch xm list
  • watch cat /proc/drbd
  • watch cman_tool status

If you have limited screen real estate, watch at least /var/log/messages and clustat, as they will be the most informative in this test.

an-node01

This next command will kill corosync. Within a second or two, an-node02 should declare an-node01 dead and then fence it. Once the fence succeeds, a new cluster configuration will form and rgmanager should start the VM on the surviving node.

killall -9 corosync
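On the surviving node, the recovery can also be confirmed with a polling loop instead of watching by eye. The sketch below is hypothetical throughout: wait_for_service and the CLUSTAT override are illustrative names, the column layout is assumed to match the clustat output shown earlier, and the owner must be given exactly as clustat prints it.

```shell
#!/bin/bash
# Poll clustat until the service is 'started' on the given owner, or give up.
# Arguments: service, owner, timeout in seconds, poll interval in seconds.
# CLUSTAT is a hypothetical override for testing; it defaults to clustat.
wait_for_service() {
    local svc=$1 owner=$2 timeout=${3:-120} interval=${4:-2} waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if "${CLUSTAT:-clustat}" 2>/dev/null |
            awk -v svc="$svc" -v owner="$owner" \
                '$1 == svc && $2 == owner && $3 == "started" { ok = 1 } END { exit !ok }'
        then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 1
}

# Example (not run here), on the surviving node after the kill:
#   wait_for_service vm:f14_builder_01 an-node05.alteeve.com 300 5 && echo "VM recovered"
```

The timeout needs to cover the fence delay plus the time rgmanager takes to start the domU, so a generous value is safer than a tight one.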


Thanks

  • A huge thanks to Interlink Connectivity! They hire me as a contractor and have allowed me to extend these docs while working on their clusters. Development of these How-Tos would be much slower if not for them. If you need hosting or colo services, drop them a line. Their website is a bit out of date though, so disregard the prices. :)
  • To sdake of corosync for helping me sort out the plock component and corosync in general.
  • To Angus Salkeld for helping me nail down the Corosync and OpenAIS differences.
  • To HJ Lee from the OpenAIS list for helping me understand the mechanisms controlling the Redundant Ring Protocol's failure detection types.
  • To Steven Dake for clarifying the to_x vs. logoutput: x arguments in openais.conf.
  • To Fabio Massimo Di Nitto for helping me get caught up with clustering and VMs.
  • To Lon Hohberger, lon at fedoraproject.org, for the rgmanager help.

 

Any questions, feedback, advice, complaints or meanderings are welcome.
© Alteeve's Niche! Inc. 1997-2024   Anvil! "Intelligent Availability®" Platform
legal stuff: All info is provided "As-Is". Do not use anything here unless you are willing and able to take responsibility for your own actions.