Alteeve Wiki :: How To :: Two Node Fedora 13 Cluster - Xen-Based Virtual Machine Host on DRBD+CLVM

Warning: This document is old, abandoned and very out of date. DON'T USE ANYTHING HERE! Consider it only as historical note taking.

This HowTo will walk you through setting up Xen VMs using DRBD and CLVM for high availability.

Prerequisite

This tutorial is an extension of the Two Node Fedora 13 Cluster HowTo. As such, you will be expected to have a freshly built two-node cluster with spare disk space on each node.

Please do not proceed until you have completed the first tutorial.

Overview

This tutorial will cover several topics: DRBD, CLVM, GFS2, Xen dom0 and domU VMs, and rgmanager. Their relationship is thus:

DRBD provides a mechanism to replicate data across both nodes in real time and guarantees a consistent view of that data from either node. Think of it like RAID level 1, but across machines.
CLVM sits on the DRBD partition and provides the underlying mechanism for allowing both nodes to access shared data in a clustered environment. It will host a shared filesystem by way of GFS2 as well as LVs that Xen's domU VMs will use as their disk space.
GFS2 will be the clustered file system used on one of the DRBD-backed, CLVM-managed partitions. Files that need to be shared between nodes, like the Xen VM configuration files, will exist on this partition.
Xen will be the hypervisor in use that will manage the various virtual machines. Each virtual machine will exist in an LVM LV.
- Xen's dom0 is the special "host" virtual machine. In this case, dom0 will be the OS installed in the first HowTo.
- Xen's domU virtual machines will be the "floating", highly available servers.
Lastly, rgmanager will be the component of cman that will be configured to manage the automatic migration of the virtual machines when failures occur and when nodes recover.

Setting Up Xen

It may seem odd to start with Xen at this stage, but it is going to rather fundamentally alter each node's "host" operating system.

At this point, each node's host OS is a traditional operating system operating on the bare metal. When we install a dom0 kernel though, we tell Xen to boot a mini operating system first, and then to boot our "host" operating system. In effect, this converts the host node's operating system into just another virtual machine, albeit with a special view of the underlying hardware and Xen hypervisor.

This conversion is somewhat disruptive, so I like to get it out of the way right away. We will then do the rest of the setup before returning to Xen later on to create the floating virtual machines.

A Note On The State Of Xen dom0 Support In Fedora

As of Fedora 8, support for Xen dom0 has been removed, but support for the hypervisor and domU virtual machines remains. Red Hat's position is that KVM will be the supported platform going forward. That said, this page seems to indicate that PV Ops dom0 kernels will be supported in the future. Specifically, when dom0 support is merged into the mainline Linux kernel. When this will be is open to speculation, though "by Fedora 16" seems to be a reasonable educated guess.

What this means for us is that we need to use a non-standard dom0 kernel. Specifically, we will use a kernel created by myoung (Michael Young) for Fedora 12. This kernel does not directly support DRBD, so be aware that we will need to build new DRBD kernel modules for his kernel and then rebuild the DRBD modules each time his kernel is updated.

A Note On Rolling Your Own RPMs

If you want to roll the source RPMs for both the hypervisor and the dom0 kernel, you will need to make both before you can boot into your dom0 for the first time. This is because the dom0 kernel needs the Xen microkernel provided by the xen hypervisor package to boot.

Install The Hypervisor

We will use Xen 4.0.1, as provided by Pasik. We'll use the source RPM to build our own RPM.

Regardless of whether you install from source RPMs or the pre-compiled ones, you will need to install the libvirt, qemu, SDL and PyXML packages from the standard repositories before you can proceed.

yum -y install libvirt PyXML.x86_64 qemu.x86_64 SDL.x86_64

Please don't remove existing xen utilities and libraries prior to installing the newer version. If you do, core clustering components may be removed. Instead, be sure to use the rpm -Uvh switches to upgrade the existing packages that may be installed already.

Installing Prebuilt RPMs

These are locally stored copies of the Xen RPMs built for x86_64 on Fedora 13. This requires some dependent packages be installed first.

Warning: These are all recompiled for this website against Fedora 13, x86_64. If you feel more comfortable, please use RPMs from a source you are familiar with.

cd ~
wget -c https://alteeve.com/files/an-cluster/xen-4.0.1-0.2.fc13.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/xen-doc-4.0.1-0.2.fc13.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/xen-hypervisor-4.0.1-0.2.fc13.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/xen-libs-4.0.1-0.2.fc13.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/xen-runtime-4.0.1-0.2.fc13.x86_64.rpm
rpm -Uvh xen*4.0.1-0.2.fc13.x86_64.rpm

Optionally, you may wish to install debug and/or devel RPMs as well.

wget -c https://alteeve.com/files/xen-debuginfo-4.0.1-0.2.fc13.x86_64.rpm
wget -c https://alteeve.com/files/xen-devel-4.0.1-0.2.fc13.x86_64.rpm
rpm -Uvh xen-d*4.0.1-0.2.fc13.x86_64.rpm

Building RPMs From Source

To build the RPMs from the source, you need to make sure that you have the build environment installed.

Note: If you are following these instructions after having installed a prior dom0 kernel, you will need to comment out exclude=kernel* in /etc/yum.conf so that the kernel-headers package can be installed.

vim /etc/yum.conf

# PUT YOUR REPOS HERE OR IN separate files named file.repo
# in /etc/yum.repos.d
#exclude=kernel*

Now install the development environment.

yum -y groupinstall "Development Libraries"
yum -y groupinstall "Development Tools"
yum install transfig texi2html SDL-devel libX11-devel tetex-latex gtk2-devel libaio-devel dev86 iasl xz-devel e2fsprogs-devel glibc-devel.i686 xmlto asciidoc elfutils-libelf-devel

Note: If you had to comment out the exclude=kernel* line earlier, uncomment it now before you forget.

vim /etc/yum.conf

# PUT YOUR REPOS HERE OR IN separate files named file.repo
# in /etc/yum.repos.d
exclude=kernel*
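The comment and uncomment steps around the build can also be done without opening an editor. The following is a minimal sketch using sed; the function name set_kernel_exclude and its arguments are illustrative only and are not part of yum.

```shell
#!/bin/bash
# Hypothetical helper: comment out or restore the exclude=kernel* line.
# Usage: set_kernel_exclude <on|off> [path-to-yum.conf]
set_kernel_exclude() {
    local mode="$1" conf="${2:-/etc/yum.conf}"
    case "$mode" in
        # "on" removes a leading '#' so yum skips kernel packages again.
        on)  sed -i 's/^#[[:space:]]*\(exclude=kernel\*\)/\1/' "$conf" ;;
        # "off" comments the line out so kernel-headers can be installed.
        off) sed -i 's/^\(exclude=kernel\*\)/#\1/' "$conf" ;;
        *)   echo "usage: set_kernel_exclude <on|off> [file]" >&2; return 1 ;;
    esac
}
```

Under these assumptions, you would run set_kernel_exclude off before installing the build dependencies and set_kernel_exclude on afterwards.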

Now we can build the RPMs from source.

cd ~
wget -c https://alteeve.com/files/an-cluster/xen-4.0.1-0.2.fc13.src.rpm
rpm -ivh xen-4.0.1-0.2.fc13.src.rpm
cd /root/rpmbuild/SPECS/
rpmbuild -ba xen.spec

Once this is done, you will have seven RPMs built. Two of them are not really needed (xen-debuginfo-4.0.1-0.2.fc13.x86_64.rpm and xen-devel-4.0.1-0.2.fc13.x86_64.rpm) and I will leave it up to you whether to install them or not. If you don't, modify the next command or move the debug and devel RPMs out of the way first.

cd ~/rpmbuild/RPMS/x86_64/
rpm -Uvh xen*4.0.1-0.2.fc13.x86_64.rpm

Installing The AN!Cluster dom0 Kernel

Notice!

The kernel provided here was recompiled on Fedora 13 and is a slightly modified version of Michael Young's kernel available below. I was originally driven to recompile in an effort to solve a DRBD-related kernel oops. For now, unless you have the same DRBD kernel oops, I'd strongly recommend against using the AN!Cluster dom0 kernel until it has been tested much more thoroughly.

With that warning out of the way...

This kernel was compiled on Fedora 13, x86_64. The DRBD RPMs available a little later were compiled against this dom0 kernel.

Note: The --force is required because the current kernel is newer than 2.6.32 used here. Without this switch, the RPM would not install.

cd ~
wget -c https://alteeve.com/files/an-cluster/kernel-2.6.32.23-170.dom0_an1.fc13.x86_64.rpm
rpm -ivh --force kernel-2.6.32.23-170.dom0_an1.fc13.x86_64.rpm

If you would like to install the debug, devel and/or header RPMs for this kernel, they are available below.

Note: The kernel-debuginfo-2.6.32.23-170.dom0_an1.fc13.x86_64.rpm RPM is 213 MiB.

wget -c https://alteeve.com/files/an-cluster/kernel-2.6.32.23-170.dom0_an1.fc13.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/kernel-debuginfo-2.6.32.23-170.dom0_an1.fc13.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/kernel-devel-2.6.32.23-170.dom0_an1.fc13.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/kernel-headers-2.6.32.23-170.dom0_an1.fc13.x86_64.rpm
rpm -ivh --force kernel-de*-2.6.32.23-170.dom0_an1.fc13.x86_64.rpm kernel-headers-2.6.32.23-170.dom0_an1.fc13.x86_64.rpm

Post AN!Cluster dom0 Install Configuration

The entry in grub's /boot/grub/menu.lst won't work. You will need to edit it so that it calls the existing installed operating system as a module.

Note: Copy and modify the entry created by the RPM. Simply copying this entry will almost certainly not work! Your root= is likely different and your rd_MD_UUID= will definitely be different, even on the same machine across installs. Generally speaking, what follows the kernel /vmlinuz-2.6.32.23-170.dom0_an1.fc13.x86_64 ... entry made by the dom0 kernel can be copied after the module /vmlinuz-2.6.32.23-170.dom0_an1.fc13.x86_64 ... entry in the example below.

vim /boot/grub/menu.lst

title Xen 4.0.x, Linux 2.6.32.23-170.dom0_an1.fc13.x86_64
        root   (hd0,0)
        kernel /xen.gz dom0_mem=1024M
        module /vmlinuz-2.6.32.23-170.dom0_an1.fc13.x86_64 ...
        module /initramfs-2.6.32.23-170.dom0_an1.fc13.x86_64.img
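The conversion described in the note above can be sketched as a small filter. This is only an illustration: it assumes the RPM-generated stanza uses the usual kernel and initrd keywords, and the function name to_xen_stanza and its default dom0_mem value are hypothetical. Whatever arguments follow the original kernel line (your own root= and rd_MD_UUID= values) are carried over unchanged.

```shell
#!/bin/bash
# Hypothetical filter: read a stock grub stanza on stdin, print a Xen stanza.
# - the original 'kernel' line becomes the first 'module' line, arguments kept
# - the original 'initrd' line becomes the second 'module' line
# - a new 'kernel /xen.gz' line is placed ahead of both
# Requires GNU sed (uses \n in the replacement text).
to_xen_stanza() {
    local dom0_mem="${1:-1024M}"
    sed -e "s|^\([[:space:]]*\)kernel[[:space:]]\{1,\}\(/vmlinuz.*\)|\1kernel /xen.gz dom0_mem=${dom0_mem}\n\1module \2|" \
        -e 's|^\([[:space:]]*\)initrd[[:space:]]\{1,\}|\1module |'
}
```

You could paste the stanza created by the RPM into a scratch file, run it through to_xen_stanza, and compare the result with the example before editing /boot/grub/menu.lst by hand.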

Installing Michael Young's dom0 Kernel

This uses a kernel built for Fedora 12, but it works on Fedora 13. This step involves either installing it over HTTP or adding and enabling his repository and then installing it from there.

Installing Via myoung's Repository

This is almost always the preferred method. However, do note that when myoung updates his kernel, there will be a lag where the dom0 dependent RPMs provided here will no longer be compatible.

To add the repository, download the myoung.dom0.repo into the /etc/yum.repos.d/ directory.

cd /etc/yum.repos.d/
wget -c http://myoung.fedorapeople.org/dom0/myoung.dom0.repo

To enable his repository, edit the repository file and change the two enabled=0 entries to enabled=1.

vim /etc/yum.repos.d/myoung.dom0.repo

[myoung-dom0]
name=myoung's repository of Fedora based dom0 kernels - $basearch
baseurl=http://fedorapeople.org/~myoung/dom0/$basearch/
enabled=1
gpgcheck=0

[myoung-dom0-source]
name=myoung's repository of Fedora based dom0 kernels - Source
baseurl=http://fedorapeople.org/~myoung/dom0/src/
enabled=1
gpgcheck=0

Install the Xen dom0 kernel (edit the version number with the updated version if it has changed).

yum install kernel-2.6.32.21-170.xendom0.fc12.x86_64

Post Michael Young's dom0 Install Configuration

The entry in grub's /boot/grub/menu.lst won't work. You will need to edit it so that it calls the existing installed operating system as a module.

Note: Copy and modify the entry created by the RPM. Simply copying this entry will almost certainly not work! Your root= is likely different and your rd_MD_UUID= will definitely be different, even on the same machine across installs. Generally speaking, what follows the kernel /vmlinuz-2.6.32.21-170.xendom0.fc12.x86_64 ... entry made by the dom0 kernel can be copied after the module /vmlinuz-2.6.32.21-170.xendom0.fc12.x86_64 ... entry in the example below.

vim /boot/grub/menu.lst

title Xen 4.0.x, Linux kernel 2.6.32.21-170.xendom0.fc12.x86_64
	root   (hd0,0)
	kernel /xen.gz dom0_mem=1024M
	module /vmlinuz-2.6.32.21-170.xendom0.fc12.x86_64 ...
	module /initramfs-2.6.32.21-170.xendom0.fc12.x86_64.img

Disabling Automatic Kernel Updates

Seeing as we're using an older kernel, yum will want to replace it whenever there is an updated kernel* package available. Likewise if myoung updates his kernel. In the latter case, the updated kernel from Mr. Young would break compatibility with our DRBD module. So to be safe, we want to tell yum to never update the kernel.

To do this, we need to add exclude=kernel* to the /etc/yum.conf file.

echo "exclude=kernel*" >> /etc/yum.conf
cat /etc/yum.conf

[main]
cachedir=/var/cache/yum/$basearch/$releasever
keepcache=0
debuglevel=2
logfile=/var/log/yum.log
exactarch=1
obsoletes=1
gpgcheck=1
plugins=1
installonly_limit=3
color=never

#  This is the default, if you make this bigger yum won't see if the metadata
# is newer on the remote and so you'll "gain" the bandwidth of not having to
# download the new metadata and "pay" for it by yum not having correct
# information.
#  It is esp. important, to have correct metadata, for distributions like
# Fedora which don't keep old packages around. If you don't like this checking
# interupting your command line usage, it's much better to have something
# manually check the metadata once an hour (yum-updatesd will do this).
# metadata_expire=90m

# PUT YOUR REPOS HERE OR IN separate files named file.repo
# in /etc/yum.repos.d

exclude=kernel*
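Note that running the echo command a second time appends a second copy of the line. A minimal sketch of an append that is safe to repeat follows; the function name add_kernel_exclude is illustrative only.

```shell
#!/bin/bash
# Hypothetical helper: append exclude=kernel* only when no active
# (uncommented) copy of the line is already present in the file.
add_kernel_exclude() {
    local conf="${1:-/etc/yum.conf}"
    grep -qxF 'exclude=kernel*' "$conf" || echo 'exclude=kernel*' >> "$conf"
}
```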

Make xend play nice with clustering

By default under Fedora 13, cman will start before xend. This is a problem because xend takes the network down as part of its setup. This causes totem communication to fail, which leads to fencing.

Note: Move xenconsoled and xenstore to 09 and 10 start positions and then make xend depend on them before starting.

To avoid this, edit the initialization scripts for /etc/init.d/xend and its dependents xenconsoled and xenstored to have a lower minimum start position. We need to maintain the start order of xenstored first, xenconsoled second and lastly xend. By default, their minimum start positions are 96, 97 and 98 respectively. We will change these to 10, 11 and 12, again, respectively.

Note that we are not altering the start position of xendomains! This is intentional as this daemon will start the domU VMs. This can not happen until all other cluster related daemons have started.

To change the start order, we will change each script's chkconfig: 2345 9x 01 line to chkconfig: 2345 1x 01, using the new start positions listed below. Further, we'll make sure that xenstored begins first by adding it to xenconsoled's Required-Start line. We'll then make sure that xenconsoled starts before xend by adding it to xend's Required-Start line.

To recap the changes:

xenstored will start first.
- We'll change its start position from 96 to 10.
- We will not add anything to its Required-Start as it must be the first daemon to come up.
xenconsoled will start second.
- We'll change its start position from 97 to 11.
- We will add xenstored to its Required-Start line.
xend will start third.
- We'll change its start position from 98 to 12.
- We will add xenconsoled to its Required-Start line.

When done, the three initialization scripts should look like the examples below.

vim /etc/init.d/xenstored

#!/bin/bash
#
# xenstored     Script to start and stop the Xen control daemon.
#
# Author:       Daniel Berrange <berrange@redhat.com>
#
# chkconfig: 2345 10 01
# description: Starts and stops the Xen xenstored daemon.
### BEGIN INIT INFO
# Provides:          xenstored
# Required-Start:    $syslog $remote_fs
# Should-Start:
# Required-Stop:     $syslog $remote_fs
# Should-Stop:
# Default-Start:     3 4 5
# Default-Stop:      0 1 2 6
# Default-Enabled:   yes
# Short-Description: Start/stop xenstored
# Description:       Starts and stops the Xen xenstored daemon.
### END INIT INFO

vim /etc/init.d/xenconsoled

#!/bin/bash
#
# xenconsoled   Script to start and stop the Xen xenconsoled daemon
#
# Author:       Daniel P. Berrange <berrange@redhat.com>
#
# chkconfig: 2345 11 01
# description: Starts and stops the Xen control daemon.
### BEGIN INIT INFO
# Provides:          xenconsoled
# Required-Start:    $syslog $remote_fs xenstored
# Should-Start:
# Required-Stop:     $syslog $remote_fs
# Should-Stop:
# Default-Start:     3 4 5
# Default-Stop:      0 1 2 6
# Default-Enabled:   yes
# Short-Description: Start/stop xenconsoled
# Description:       Starts and stops the Xen xenconsoled daemon.
### END INIT INFO

vim /etc/init.d/xend

#!/bin/bash
#
# xend          Script to start and stop the Xen control daemon.
#
# Author:       Keir Fraser <keir.fraser@cl.cam.ac.uk>
#
# chkconfig: 2345 12 01
# description: Starts and stops the Xen control daemon.
### BEGIN INIT INFO
# Provides:          xend
# Required-Start:    $syslog $remote_fs xenconsoled
# Should-Start:
# Required-Stop:     $syslog $remote_fs
# Should-Stop:
# Default-Start:     3 4 5
# Default-Stop:      0 1 2 6
# Default-Enabled:   yes
# Short-Description: Start/stop xend
# Description:       Starts and stops the Xen control daemon.
### END INIT INFO
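If you would rather not edit the three headers by hand, the same two changes (a new start position and an extra Required-Start entry) can be sketched with sed. This is a hedged illustration, not a tested procedure for every init script; the function name set_start_order is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: set the start position on the chkconfig line and,
# optionally, append one daemon to the LSB Required-Start line.
# Usage: set_start_order <init-script> <start-position> [required-daemon]
set_start_order() {
    local script="$1" prio="$2" dep="${3:-}"
    # Replace only the start field; run levels and the stop field are kept.
    sed -i "s/^\(# chkconfig:[[:space:]]*[-0-9]*[[:space:]]*\)[0-9]\{1,\}/\1${prio}/" "$script"
    # Append the dependency unless it is already listed.
    if [ -n "$dep" ] && ! grep -Eq "^# Required-Start:.*[[:space:]]${dep}([[:space:]]|\$)" "$script"; then
        sed -i "/^# Required-Start:/ s/[[:space:]]*\$/ ${dep}/" "$script"
    fi
}

# Example calls matching the recap in this section (left commented out):
#   set_start_order /etc/init.d/xenstored   10
#   set_start_order /etc/init.d/xenconsoled 11 xenstored
#   set_start_order /etc/init.d/xend        12 xenconsoled
```

Back up each script before running anything like this against it, and compare the result with the listings above.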

With xend set to start at a position lower than 98, we now have room for chkconfig to put other daemons after it in the start order, which will be needed a little later. First and foremost, we now need to tell cman to not start until after xend is up.

As above, we will now edit cman's /etc/init.d/cman script. This time though, we will not edit its chkconfig line. Instead, we will simply add xend to the Required-Start line.

vim /etc/init.d/cman

#!/bin/bash
#
# cman - Cluster Manager init script
#
# chkconfig: - 21 79
# description: Starts and stops cman
#
#
### BEGIN INIT INFO
# Provides:             cman
# Required-Start:       $network $time xend
# Required-Stop:        $network $time
# Default-Start:
# Default-Stop:
# Short-Description:    Starts and stops cman
# Description:          Starts and stops the Cluster Manager set of daemons
### END INIT INFO

Finally, remove and re-add the xend and cman daemons to re-order them in the start list:

chkconfig xenstored off; chkconfig xenconsoled off; chkconfig xend off; chkconfig cman off; 
chkconfig xenstored on; chkconfig xenconsoled on; chkconfig xend on; chkconfig cman on

Confirm that the order has changed so that xend is earlier in the boot sequence than cman. Assuming you've switched to run-level 3, run:

ls -lah /etc/rc3.d/

Your start sequence should now look like:

lrwxrwxrwx.  1 root root   19 Sep 15 19:29 S26xenstored -> ../init.d/xenstored
lrwxrwxrwx.  1 root root   21 Sep 15 19:29 S27xenconsoled -> ../init.d/xenconsoled
lrwxrwxrwx.  1 root root   14 Sep 15 19:29 S28xend -> ../init.d/xend
lrwxrwxrwx.  1 root root   14 Sep 15 19:29 S29cman -> ../init.d/cman
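Rather than reading the listing by eye, the ordering can be checked with a short function. This is a sketch only; the name starts_before is illustrative, and it assumes the usual SNNname link naming used in the rc directories.

```shell
#!/bin/bash
# Hypothetical check: succeed only if <first> has a lower start number than
# <second> in the given rc directory.
# Usage: starts_before <rc-dir> <first> <second>
starts_before() {
    local dir="$1" a b
    # Extract the two-digit start number from names like S28xend.
    a=$(ls "$dir" | sed -n "s/^S\([0-9]\{2\}\)$2\$/\1/p")
    b=$(ls "$dir" | sed -n "s/^S\([0-9]\{2\}\)$3\$/\1/p")
    # Strip one leading zero so values such as 08 are compared as decimal.
    [ -n "$a" ] && [ -n "$b" ] && [ "${a#0}" -lt "${b#0}" ]
}

# Example (left commented out):
#   starts_before /etc/rc3.d xend cman && echo "xend starts before cman"
```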

Booting Into The New dom0

If everything went well, you should be able to boot the new dom0 operating system. If you watch the boot process closely, you will see that the boot process is different. You should now see the Xen hypervisor boot prior to handing off to the "host" operating system. This can be confirmed once the dom0 operating system has booted by checking that the file /proc/xen/capabilities exists. What it contains doesn't matter at this stage, only that it exists at all.
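The existence check can be scripted as follows. This is a minimal sketch; the function name running_under_xen is illustrative, and the optional path argument exists only so the check can be exercised against another file.

```shell
#!/bin/bash
# Hypothetical check: succeed if the Xen capabilities file exists.
# Only existence is tested; the file's contents are not inspected.
running_under_xen() {
    [ -e "${1:-/proc/xen/capabilities}" ]
}

# Example (left commented out):
#   running_under_xen && echo "Booted under the Xen hypervisor"
```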

Configure Networking

Networking in Xen, particularly in a cluster, can be confusing. If you are not familiar with networking in Xen, please review the following article before proceeding.

A note on a major change from previous layouts: In Xen 3.x, ethX would be copied to a virtual interface called vethX. Then the real ethX would be renamed to pethX and the virtual interface vethX would be renamed to ethX to take its place. Finally, a bridge called xenbrX would be created and the real pethX and virtual ethX would be connected to it.

This has been changed somewhat in that now, by default, ethX is left alone and a simple bridge called virbrX is created. We'll be changing this to be somewhat similar to the old style.

Specifically, the real ethX will be renamed to pethX. Then a bridge will be created called ethX, which plays the role of dom0's interface and bridges connections from VMs through pethX and out into the real world.

This is explained in more detail, and with diagrams, in the article below.

Networking in Xen

Adding New NICs to Xen

By default, xend manages eth0 only. We need to add eth2. Personally, I don't like to put the storage network ethernet devices (eth1) under Xen's control as this potentially can cause DRBD problems on xend restart. Whether you add it or not I will leave to your preferences.

You can see which, if any, network devices are under Xen's control by running ifconfig and checking to see if there is a virbrX corresponding to a given ethX device.

ifconfig

eth0      Link encap:Ethernet  HWaddr 48:5B:39:3C:53:14  
          inet addr:192.168.1.74  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::4a5b:39ff:fe3c:5314/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:224261 errors:0 dropped:0 overruns:0 frame:0
          TX packets:55174 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:319384110 (304.5 MiB)  TX bytes:27348739 (26.0 MiB)
          Interrupt:225 Base address:0x8000 

eth1      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:5A  
          inet addr:192.168.2.74  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe72:9b5a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2 errors:0 dropped:0 overruns:0 frame:0
          TX packets:36 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:832 (832.0 b)  TX bytes:6234 (6.0 KiB)
          Memory:feae0000-feb00000 

eth2      Link encap:Ethernet  HWaddr 00:1B:21:72:96:EA  
          inet addr:192.168.3.74  Bcast:192.168.3.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe72:96ea/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2 errors:0 dropped:0 overruns:0 frame:0
          TX packets:34 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:818 (818.0 b)  TX bytes:6081 (5.9 KiB)
          Memory:fe9e0000-fea00000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

virbr0    Link encap:Ethernet  HWaddr 02:23:C8:98:31:17  
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:15 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:4013 (3.9 KiB)

In the above example, eth0 has a corresponding virbr0 bridge having its own subnet. In non-clustered systems, this is fine. For our purposes though, it will not do.

Removing The qemu virbr0 Bridge

By default, QEMU creates a bridge called virbr0 designed to connect virtual machines to the first eth0 interface. Our system will not need this, so we will remove it. This bridge is configured in the /etc/libvirt/qemu/networks/default.xml file, so to remove this bridge, simply delete the contents of the file.

cat /dev/null >/etc/libvirt/qemu/networks/default.xml

The next time you reboot, that bridge will be gone.

Madi: Put in the command to delete the bridge before a reboot.
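One way to drop the bridge without waiting for a reboot is to ask libvirt to stop its "default" network. The sketch below assumes the virsh client is installed and that virbr0 belongs to libvirt's default network; the function name remove_default_bridge is illustrative, and this has not been verified on the Fedora 13 setup described here.

```shell
#!/bin/bash
# Hypothetical helper: stop libvirt's "default" network now and keep it from
# being started again at boot. Requires the virsh client from libvirt.
remove_default_bridge() {
    if ! command -v virsh >/dev/null 2>&1; then
        echo "virsh not found; skipping" >&2
        return 1
    fi
    virsh net-destroy default              # tears down the running network
    virsh net-autostart default --disable  # do not start it on boot
}
```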

Create /etc/xen/scripts/an-network-script

This script will be used by Xen to turn the dom0 ethX interfaces into bridges. All traffic to the bridge, be it from dom0 or domU VMs, will be routable out of the corresponding pethX device. As domU VMs come online, a hotplug script will create virtual interfaces between this new bridge and the domU's interface(s). Think of the vifX.Y devices as being the network cables you'd normally run between a server and a switch.

Before we proceed, please note three things:

You don't need to use the file name an-network-script. I suggest this name mainly to keep in line with the rest of the 'AN!x' naming used here.
If you install convirt or other hypervisor tools, they will likely create their own bridge script.
Adding eth1 is optional, as we know ahead of time that eth1 will not be made available to any virtual machines as it is dedicated to DRBD. I'm adding it here because I like having things consistent; do whichever makes more sense to you.

First, touch the file and then chmod it to be executable.

touch /etc/xen/scripts/an-network-script
chmod 755 /etc/xen/scripts/an-network-script

Now edit it to contain the following:

vim /etc/xen/scripts/an-network-script

#!/bin/sh
dir=$(dirname "$0")
"$dir/network-bridge" "$@" vifnum=0 netdev=eth0 bridge=eth0
"$dir/network-bridge" "$@" vifnum=2 netdev=eth2 bridge=eth2

Now tell Xen to execute that script by editing the /etc/xen/xend-config.sxp file and changing the network-script argument to point to this new script (this is line 158 in the default xend-config.sxp file):

vim /etc/xen/xend-config.sxp

#(network-script network-bridge)
#(network-script /bin/true)
(network-script an-network-script)
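The same change can be sketched non-interactively. This is an illustration only: the function name set_network_script is hypothetical, and it assumes any active entry starts at the beginning of a line, as in the example above.

```shell
#!/bin/bash
# Hypothetical helper: comment out any active (network-script ...) line and
# append one that points at the given script name.
# Usage: set_network_script <script-name> [path-to-xend-config.sxp]
set_network_script() {
    local name="$1" conf="${2:-/etc/xen/xend-config.sxp}"
    sed -i 's/^(network-script /#&/' "$conf"
    echo "(network-script ${name})" >> "$conf"
}

# Example (left commented out):
#   set_network_script an-network-script
```

The new entry lands at the end of the file rather than next to the original line, which makes it easy to spot later.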

Warning: The next step may trigger fencing of the nodes! As such, be sure that you're not running anything critical. If unsure, please stop cman or reboot the nodes.

/etc/init.d/cman stop

Now restart xend.

/etc/init.d/xend restart

If everything worked, you should now be able to run ifconfig and see that each ethX device placed under Xen's control has a matching pethX device.

ifconfig

eth0      Link encap:Ethernet  HWaddr 48:5B:39:3C:53:14  
          inet addr:192.168.1.74  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::4a5b:39ff:fe3c:5314/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:78 errors:0 dropped:0 overruns:0 frame:0
          TX packets:57 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:9796 (9.5 KiB)  TX bytes:12574 (12.2 KiB)

eth1      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:5A  
          inet addr:192.168.2.74  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe72:9b5a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2 errors:0 dropped:0 overruns:0 frame:0
          TX packets:36 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:832 (832.0 b)  TX bytes:6234 (6.0 KiB)
          Memory:feae0000-feb00000 

eth2      Link encap:Ethernet  HWaddr 00:1B:21:72:96:EA  
          inet addr:192.168.3.74  Bcast:192.168.3.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe72:96ea/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:32 errors:0 dropped:0 overruns:0 frame:0
          TX packets:29 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:5471 (5.3 KiB)  TX bytes:5867 (5.7 KiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

peth0     Link encap:Ethernet  HWaddr 48:5B:39:3C:53:14  
          inet6 addr: fe80::4a5b:39ff:fe3c:5314/64 Scope:Link
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
          RX packets:224486 errors:0 dropped:0 overruns:0 frame:0
          TX packets:55349 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:319406626 (304.6 MiB)  TX bytes:27384681 (26.1 MiB)
          Interrupt:225 Base address:0x8000 

peth2     Link encap:Ethernet  HWaddr 00:1B:21:72:96:EA  
          inet6 addr: fe80::21b:21ff:fe72:96ea/64 Scope:Link
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:1500  Metric:1
          RX packets:35 errors:0 dropped:0 overruns:0 frame:0
          TX packets:70 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:6827 (6.6 KiB)  TX bytes:12470 (12.1 KiB)
          Memory:fe9e0000-fea00000

Note: The virbr0 may remain until you reboot your nodes.

If you see something like this, then you are ready to proceed! Now start your cluster back up.

/etc/init.d/cman start

We're done for now. There is more to do in Xen, but this was all we needed to do in order to proceed with the next several steps. Once we have the clustered storage online, we'll come back to Xen for the domU setup.

Building the DRBD Array

Building the DRBD array requires a few steps. First, raw space on each node must be prepared. Next, DRBD must be told that it is to create a resource using this newly configured raw space. Finally, the new array must be initialized.

A Map of the Cluster's Storage

The layout of the storage in the cluster can quickly become difficult to follow. Below is an ASCII drawing which should help you see how DRBD will tie in to the rest of the cluster's storage. This map assumes a simple RAID level 1 array underlying each node. If your node has a single hard drive, simply collapse the first two layers into one. Similarly, if your underlying storage is a more complex RAID array, simply expand the number of physical devices at the top level.

               Node1                                Node2
           _____   _____                        _____   _____
          | sda | | sdb |                      | sda | | sdb |
          |_____| |_____|                      |_____| |_____|
             |_______|                            |_______|
     _______ ____|___ _______             _______ ____|___ _______
  __|__   __|__    __|__   __|__       __|__   __|__    __|__   __|__
 | md0 | | md1 |  | md2 | | md3 |     | md3 | | md2 |  | md1 | | md0 |
 |_____| |_____|  |_____| |_____|     |_____| |_____|  |_____| |_____|
    |       |        |       |           |       |        |       |
 ___|___   _|_   ____|____   |___________|   ____|____   _|_   ___|___
| /boot | | / | | <swap>  |        |        | <swap>  | | / | | /boot |
|_______| |___| |_________|  ______|______  |_________| |___| |_______|
                            | /dev/drbd0  |
                            |_____________|
                                   |
                               ____|______
                              | clvm PV   |
                              |___________|
                                   |
                              _____|_____
                             | drbd0_vg0 |
                             |___________|
                                   |
                              _____|_____ ___...____
                             |           |          |
                          ___|___     ___|___    ___|___
                         | lv_X  |   | lv_Y  |  | lv_N  |
                         |_______|   |_______|  |_______|

Install The DRBD Tools

DRBD has two components: the user-land application and tools, and the kernel module.

There are two options for installing the DRBD user-land tools at this point: the AN!Cluster-built RPMs or the ones shipped with Fedora. Regardless of which method you choose, you will need to either install the AN!Cluster DRBD kernel module RPMs or else rebuild the source RPMs referenced.

Install The AN!Cluster DRBD User-Land Tools

I am currently experimenting with ways to solve a DRBD-triggered kernel oops in the Xen pvops 2.6.32 kernel. For this reason, I've recompiled the following user-land RPMs under the AN!Cluster variant dom0 kernel RPMs referenced earlier in this HowTo. If you used the AN! RPMs, then I suggest giving these RPMs a try. However, if you are using myoung's dom0, I recommend sticking to the Fedora-provided user-land DRBD tools.

yum -y install bash-completion heartbeat pacemaker
cd ~
wget -c https://alteeve.com/files/an-cluster/drbd-8.3.7-2.fc13.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/drbd-utils-8.3.7-2.fc13.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/drbd-xen-8.3.7-2.fc13.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/drbd-bash-completion-8.3.7-2.fc13.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/drbd-udev-8.3.7-2.fc13.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/drbd-heartbeat-8.3.7-2.fc13.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/drbd-pacemaker-8.3.7-2.fc13.x86_64.rpm
rpm -ivh drbd*8.3.7-2.fc13.x86_64.rpm

Install The Stock Fedora DRBD User-Land Tools

You will need to install the following tools:

yum install drbd.x86_64 drbd-xen.x86_64 drbd-utils.x86_64

Disable heartbeat

These packages require that the heartbeat packages be installed. This is for a different cluster platform which we are not using here, so we will disable it from starting with the system.

chkconfig heartbeat off

Install The DRBD Kernel Module

The kernel module must match the dom0 kernel that is running. If you update the kernel and neglect to update the DRBD kernel module, the DRBD array will not start.
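As a rough way to check which module package a given kernel needs, the sketch below derives the expected drbd-km package name prefix from a kernel release string. It assumes, based only on the file names used in this HowTo, that the package name embeds the kernel release with the dash replaced by an underscore. The function name drbd_km_prefix is made up for this example.

```shell
#!/bin/sh
# Print the drbd-km package name prefix expected for a kernel release.
# Assumption: the package name embeds the kernel release with '-' replaced
# by '_', as seen in the RPM file names used in this HowTo.
# With no argument, the running kernel's release (uname -r) is used.
drbd_km_prefix() {
    release="${1:-$(uname -r)}"
    printf 'drbd-km-%s\n' "$(printf '%s' "$release" | tr '-' '_')"
}

# Example using the AN!Cluster dom0 kernel release string.
drbd_km_prefix "2.6.32.23-170.dom0_an1.fc13.x86_64"
```

On a node, something like rpm -qa | grep -F "$(drbd_km_prefix)" should then show whether a module package for the running kernel is installed.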

To help simplify things, links to pre-compiled DRBD kernel modules are provided. If the pre-compiled module doesn't match your installed kernel, instructions for recompiling the DRBD kernel module from the source RPM are provided as well.

Install Pre-Compiled DRBD Kernel Module RPMs

Warning: The RPM provided here will only work with the kernel-2.6.32.21-168.xendom0_an1.fc13.x86_64.rpm kernel. If you are using Michael Young's dom0 kernel, please skip to the next section.

This RPM provides the DRBD kernel module. Note that these RPMs are compiled against the AN!Cluster variant of myoung's 2.6.32.21_168 dom0 kernel.

cd ~
wget -c https://alteeve.com/files/an-cluster/drbd-km-2.6.32.23_170.dom0_an1.fc13.x86_64-8.3.7-12.fc13.x86_64.rpm
rpm -ivh drbd-km-2.6.32.23_170.dom0_an1.fc13.x86_64-8.3.7-12.fc13.x86_64.rpm

If you would like to install the debuginfo package as well:

cd ~
wget -c https://alteeve.com/files/an-cluster/drbd-km-debuginfo-8.3.7-12.fc13.x86_64.rpm
rpm -ivh drbd-km-debuginfo-8.3.7-12.fc13.x86_64.rpm

Building DRBD Kernel Module RPMs From Source

If the above RPMs don't work or if the dom0 kernel you are using in any way differs, please follow the steps here to create a DRBD kernel module matched to your running dom0.

First, install the build environment.

yum -y groupinstall "Development Libraries"
yum -y groupinstall "Development Tools"

Install the kernel headers and development library for the dom0 kernel:

Note: The following commands use --force to get past the fact that the headers for the 2.6.33 kernel are already installed, which makes RPM think that these are too old and will conflict. Please proceed with caution.

If you are using Michael Young's kernel:

cd ~
wget -c http://fedorapeople.org/~myoung/dom0/x86_64/kernel-headers-2.6.32.21-170.xendom0.fc12.x86_64.rpm
wget -c http://fedorapeople.org/~myoung/dom0/x86_64/kernel-devel-2.6.32.21-170.xendom0.fc12.x86_64.rpm
rpm -ivh --force kernel*2.6.32.21-170.xendom0.fc12.x86_64.rpm

If you are using the AN!Cluster dom0 kernel:

wget -c https://alteeve.com/files/an-cluster/kernel-devel-2.6.32.23-170.dom0_an1.fc13.x86_64.rpm
wget -c https://alteeve.com/files/an-cluster/kernel-headers-2.6.32.23-170.dom0_an1.fc13.x86_64.rpm
rpm -ivh --force kernel-devel-2.6.32.23-170.dom0_an1.fc13.x86_64.rpm kernel-headers-2.6.32.23-170.dom0_an1.fc13.x86_64.rpm

Now you need to download, prepare, build and install the source RPM.

rpm -ivh http://fedora.mirror.iweb.ca/releases/13/Everything/source/SRPMS/drbd-8.3.7-2.fc13.src.rpm
cd /root/rpmbuild/SPECS/
rpmbuild -bp drbd.spec 
cd /root/rpmbuild/BUILD/drbd-8.3.7/
./configure --enable-spec --with-km
cp /root/rpmbuild/BUILD/drbd-8.3.7/drbd-km.spec /root/rpmbuild/SPECS/
cd /root/rpmbuild/SPECS/
rpmbuild -ba drbd-km.spec
cd /root/rpmbuild/RPMS/x86_64
rpm -Uvh drbd-km-*

The next step may be needed if drbd-utils, the user-land DRBD tools package, is listed as a requirement when trying to install the drbd-km* RPMs. It will build all of the DRBD tools RPMs.

yum install bash-completion heartbeat pacemaker
cd /root/rpmbuild/SPECS/
rpmbuild -ba drbd.spec 
cd ~/rpmbuild/RPMS/x86_64/
rpm -Uvh drbd-*
chkconfig heartbeat off

You should be good to go now!

Allocating Raw Space For DRBD On Each Node

If you followed the setup steps provided for in "Two Node Fedora 13 Cluster", you will have a set amount of unconfigured hard drive space. This is what we will use for the DRBD space on either node. If you've got a different setup, you will need to allocate some raw space before proceeding.

Create a Simple Partition

If you do not have two drives, please follow the next section's steps, but pay attention to the notes. In short, you will need to create one partition, leave the default type of the partition as 83, write the changes to disk and then proceed to the DRBD Configuration Files section.

Creating a RAID level 1 'md' Device

This assumes that you have two raw drives, /dev/sda and /dev/sdb. It further assumes that you've created three partitions which have been assigned to three existing /dev/mdX devices. With these assumptions, we will create /dev/sda4 and /dev/sdb4 and, using them, create a new /dev/md3 device that will host the DRBD partition.

If you have multiple drives and plan to use a different RAID level, please adjust the following commands accordingly.

Creating The New Partitions

Warning: The next steps will have you directly accessing your server's hard drive configuration. Please do not proceed on a live server until you've had a chance to work through these steps on a test server. One mistake can blow away all your data.

Start the fdisk shell for the first hard drive; /dev/sda.

fdisk /dev/sda

WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
         switch off the mode (command 'c') and change display units to
         sectors (command 'u').

Command (m for help):

Note: Depending on your configuration, you may not see the above warning or you may see a different warning. Note it, but it is likely nothing to worry about.

View the current configuration with the print option by entering p.

Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000c6fe1

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1        5100    40960000   fd  Linux raid autodetect
/dev/sda2            5100        5622     4194304   fd  Linux raid autodetect
/dev/sda3   *        5622        5654      256000   fd  Linux raid autodetect

Command (m for help):

Now we know for sure that the next free partition number is 4. We will now create the new partition by entering n.

Command action
   e   extended
   p   primary partition (1-4)

We will make it a primary partition by entering p.

Selected partition 4
First cylinder (5654-60801, default 5654):

Then we simply hit <enter> to select the default starting cylinder.

<enter>

Using default value 5654
Last cylinder, +cylinders or +size{K,M,G} (5654-60801, default 60801):

Once again we will press <enter> to select the default ending cylinder.

<enter>

Using default value 60801

Command (m for help):

Note: If you only have one drive and are not creating a RAID array, you do not need to change the type of the partition, so you can skip the next few steps. Continue at the step where you write the changes.

Now we need to change the partition's type by entering t.

Partition number (1-4):

We know that we are modifying partition number 4, so enter 4.

Hex code (type L to list codes):

Now we need to enter the hex code for the partition type. We want fd, which defines Linux raid autodetect.

fd

Changed system type of partition 4 to fd (Linux raid autodetect)

Now check that everything went as expected by once again printing the partition table.

Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000c6fe1

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1        5100    40960000   fd  Linux raid autodetect
/dev/sda2            5100        5622     4194304   fd  Linux raid autodetect
/dev/sda3   *        5622        5654      256000   fd  Linux raid autodetect
/dev/sda4            5654       60801   442972704+  fd  Linux raid autodetect

Command (m for help):

Note: If you only have one drive, your partitions will be 83 Linux or 82 Linux swap / Solaris, instead of fd Linux raid autodetect.

There it is. So finally, we need to write the changes to the disk by entering w.

The partition table has been altered!

Calling ioctl() to re-read partition table.

WARNING: Re-reading the partition table failed with error 16: Device or resource busy.
The kernel still uses the old table. The new table will be used at
the next reboot or after you run partprobe(8) or kpartx(8)
Syncing disks.

Note: If you only have one drive, reboot now if you got the message above and then skip forward to the "DRBD Configuration Files" section.

If you see the above message, do not reboot yet. Repeat these steps for the second drive, /dev/sdb, and then reboot.

Creating The New /dev/mdX Device

If you only have one drive, skip this step.

Now we need to use mdadm to create the new RAID level 1 device. This will be used as the device that DRBD will directly access.

mdadm --create /dev/md3 --homehost=localhost.localdomain --raid-devices=2 --level=1 /dev/sda4 /dev/sdb4

mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90

Seeing as /boot doesn't exist on this device, we can safely ignore this warning.

mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/md4 started.

You can now cat /proc/mdstat to verify that it indeed built. If you're interested, you could open a new terminal window and use watch cat /proc/mdstat and watch the array build.

cat /proc/mdstat

md3 : active raid1 sdb4[1] sda4[0]
      442971544 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.8% (3678976/442971544) finish=111.0min speed=65920K/sec
      
md2 : active raid1 sda2[0] sdb2[1]
      4193272 blocks super 1.1 [2/2] [UU]
      
md1 : active raid1 sda1[0] sdb1[1]
      40958908 blocks super 1.1 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md0 : active raid1 sda3[0] sdb3[1]
      255988 blocks super 1.0 [2/2] [UU]
      
unused devices: <none>

Finally, we need to make sure that the new array will start when the system boots. To do this, we'll again use mdadm, but with different options that will have it output data in a format suitable for the /etc/mdadm.conf file. We'll redirect this output to that config file, thus updating it.

mdadm --detail --scan | grep md3 >> /etc/mdadm.conf
cat /etc/mdadm.conf

# mdadm.conf written out by anaconda
MAILADDR root
AUTO +imsm +1.x -all
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=b58df6d0:d925e7bb:c156168d:47c01718
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=ac2cf39c:77cd0314:fedb8407:9b945bb5
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=4e513936:4a966f4e:0dd8402e:6403d10d
ARRAY /dev/md3 metadata=1.2 name=localhost.localdomain:3 UUID=f0b6d0c1:490d47e7:91c7e63a:f8dacc21

You'll note that the last line, which we just added, is different from the previous lines. This isn't a concern, but you are welcome to re-write it to match the existing format if you wish.

Before you proceed, it is strongly advised that you reboot each node and then verify that the new array did in fact start with the system. You do not need to wait for the sync to finish before rebooting. It will pick up where you left off once rebooted.
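As a small sketch of the post-reboot check suggested above, the function below looks for a named md device marked active in mdstat-style text. The function name md_is_active is hypothetical, and the sample data is trimmed from the /proc/mdstat output shown earlier so that the example can run anywhere.

```shell
#!/bin/sh
# Return 0 if the named md device is listed as active in mdstat-style input.
# The second argument defaults to /proc/mdstat; a sample file is used below.
md_is_active() {
    grep -q "^$1 : active" "${2:-/proc/mdstat}"
}

# Sample data trimmed from the /proc/mdstat output shown earlier.
sample=$(mktemp)
cat > "$sample" <<'EOF'
md3 : active raid1 sdb4[1] sda4[0]
      442971544 blocks super 1.2 [2/2] [UU]
md0 : active raid1 sda3[0] sdb3[1]
      255988 blocks super 1.0 [2/2] [UU]
EOF

if md_is_active md3 "$sample"; then
    echo "md3 is active"
else
    echo "md3 is NOT active"
fi
rm -f "$sample"
```

On a real node you would call md_is_active md3 with no second argument so that it reads /proc/mdstat directly.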

Note: You'll notice we did not format a file system on this RAID array. This is intentional. DRBD uses the raw device and does not need a file system on it.

DRBD Configuration Files

DRBD uses a global configuration file, /etc/drbd.d/global_common.conf, and one or more resource files. The resource files need to be created in the /etc/drbd.d/ directory and must have the suffix .res. For this example, we will create a single resource called r0 which we will configure in /etc/drbd.d/r0.res.

/etc/drbd.d/global_common.conf

The stock /etc/drbd.d/global_common.conf is sane, so we won't bother altering it here.

Full details on all the drbd.conf configuration file directives and arguments can be found here. Note: That link doesn't show this new configuration format. Please see Novell's link.

/etc/drbd.d/r0.res

This is the important part. This defines the resource to use, and must reflect the IP addresses and storage devices that DRBD will use for this resource.

vim /etc/drbd.d/r0.res

# This is the name of the resource and its settings. Generally, 'r0' is used
# as the name of the first resource. This is by convention only, though.
resource r0
{
        # This tells DRBD where to make the new resource available on each
        # node. This is, again, by convention only.
        device    /dev/drbd0;

        # The main argument here tells DRBD that we will have proper locking 
        # and fencing, and as such, to allow both nodes to set the resource to
        # 'primary' simultaneously.
        net
        {
                allow-two-primaries;
        }

        # This tells DRBD to automatically set both nodes to 'primary' when the
        # nodes start.
        startup
        {
                become-primary-on both;
        }

        # This tells DRBD to look for and store its meta-data on the resource
        # itself.
        meta-disk       internal;

        # The name below must match the output from `uname -n` on each node.
        on an-node01.alteeve.com
        {
                # This must be the IP address of the interface on the storage 
                # network (an-node01.sn, in this case).
                address         192.168.2.71:7789;

                # This is the underlying partition to use for this resource on 
                # this node.
                disk            /dev/md3;
        }

        # Repeat as above, but for the other node.
        on an-node02.alteeve.com
        {
                address         192.168.2.72:7789;
                disk            /dev/md3;
        }
}

This file must be copied to BOTH nodes and must match before you proceed.
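One simple way to confirm that the two copies match is a byte-for-byte comparison. The sketch below uses cmp for this; the function name same_config and the temporary files are only for illustration. In practice you would first fetch the other node's copy (with scp, for example) and compare it against the local /etc/drbd.d/r0.res.

```shell
#!/bin/sh
# Return 0 if two copies of a resource file are byte-for-byte identical.
same_config() {
    cmp -s "$1" "$2"
}

# Demonstration using two temporary stand-in files.
a=$(mktemp)
b=$(mktemp)
printf 'resource r0 { }\n' > "$a"
printf 'resource r0 { }\n' > "$b"

if same_config "$a" "$b"; then
    echo "copies match"
else
    echo "copies differ"
fi
rm -f "$a" "$b"
```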

Starting The DRBD Resource

For the rest of this section, pay attention to whether you see

Node1
Node2
Both

These indicate which node to run the following commands on. There is no functional difference between the nodes, so just randomly choose one to be Node1 and the other will be Node2. Once you've chosen which is which, be consistent with which node you run the commands on. Of course, if a command block is preceded by Both, run the following code block on both nodes.

Loading the 'drbd' Module

Both

Normally, we'd load the drbd module by simply starting the /etc/init.d/drbd daemon. However, if we did that at this stage, we'd generate errors because there isn't an UpToDate disk in the array. To get around this, we'll manually load the drbd kernel module using modprobe.

modprobe drbd

This won't return any output, but if you check, you should now see the special /proc/drbd file.
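If you would like to script that check, a minimal sketch follows. It only tests that the status file is readable; the function name drbd_proc_present is made up for this example, and the optional path argument exists only so the function can be tried against a stand-in file.

```shell
#!/bin/sh
# Return 0 if the DRBD status file exists and is readable.
# The path defaults to /proc/drbd.
drbd_proc_present() {
    [ -r "${1:-/proc/drbd}" ]
}

if drbd_proc_present; then
    # The first line of /proc/drbd carries the module version.
    head -n 1 /proc/drbd
else
    echo "drbd module does not appear to be loaded"
fi
```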

Monitoring Progress

Both

I find it very useful to monitor DRBD while running the rest of the setup. To do this, open a second terminal on each node and use watch to keep an eye on /proc/drbd. This way you will be able to monitor the progress of the array in near-real time.

Both

watch cat /proc/drbd

At this stage, it should look like this:

version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@xenmaster002.iplink.net, 2010-09-07 16:02:46
 0: cs:Unconfigured

Initialize The Resource

Both

This step creates the DRBD meta-data on the new DRBD resource's backing devices. It is only needed when creating new DRBD partitions.

drbdadm create-md r0

  --==  Thank you for participating in the global usage survey  ==--
The server's response is:

you are the 9507th user to install this version
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success

The /proc/drbd output should not have changed at this stage.

Starting the Resource

Both

This will attach the backing device, /dev/md3 in our case, and then start the new resource r0.

drbdadm up r0

There will be no output at the command line. If you are watching /proc/drbd though, you should now see something like this:

version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@xenmaster002.iplink.net, 2010-09-07 16:02:46
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:442957988

That it is Secondary/Secondary and Inconsistent/Inconsistent is expected.

Setting the First Primary Node

Node1

As this is a totally new resource, DRBD doesn't know which side of the array is "more valid" than the other. In reality, neither is as there was no existing data of note on either node. This means that we now need to choose a node and tell DRBD to treat it as the "source" node. This step will also tell DRBD to make the "source" node primary. Once set, DRBD will begin sync'ing in the background.

drbdadm -- --overwrite-data-of-peer primary r0

As before, there will be no output at the command line, but /proc/drbd will change to show the following:

GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@xenmaster002.iplink.net, 2010-09-07 16:02:46
 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
    ns:69024 nr:0 dw:0 dr:69232 al:0 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:442888964
        [>....................] sync'ed:  0.1% (432508/432576)M
        finish: 307:33:42 speed: 320 (320) K/sec

If you're watching the secondary node, the /proc/drbd will show ro:Secondary/Primary ds:Inconsistent/UpToDate. This is, as you can guess, simply a reflection of it being the "over-written" node.

Setting the Second Node to Primary

Node2

The last step to complete the array is to tell the second node to also become primary.

drbdadm primary r0

As with many drbdadm commands, nothing will be printed to the console. If you're watching the /proc/drbd though, you should see something like Primary/Primary ds:UpToDate/Inconsistent. The Inconsistent flag will remain until the sync is complete.
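When checking states on both nodes, it can help to pull a single field out of the /proc/drbd status line rather than reading the whole thing. The following is a rough sketch that assumes the line format shown in the examples above; the function name drbd_field is hypothetical, and it only looks at minor number 0.

```shell
#!/bin/sh
# Print the value of one status field (cs, ro or ds) for minor 0 from
# /proc/drbd-style input read on stdin.
drbd_field() {
    sed -n "s/^ *0: .*$1:\([^ ]*\).*/\1/p"
}

# Status line copied from the SyncSource example shown earlier.
line=' 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----'

printf '%s\n' "$line" | drbd_field cs
printf '%s\n' "$line" | drbd_field ro
printf '%s\n' "$line" | drbd_field ds
```

On a node, drbd_field ro < /proc/drbd would print the local and peer roles separated by a slash.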

A Note On sync Speed

You will notice in the previous step that the sync speed seems awfully slow at 320 (320) K/sec.

In most DRBD applications, this is fine. As actual data is written to either side of the array, that data will be immediately copied to both nodes. As such, both nodes will always contain up to date copies of the real data. Given this, the syncer is intentionally set low so as to not put too much load on the underlying disks that could cause slow downs.

In clustered VM environments though, this is a problem. The reason is that, until the sync completes, the node whose DRBD resource is Inconsistent can not be used for redundancy. If the node that is UpToDate fails, DRBD will stop on the Inconsistent node. As a result, any VMs running on the DRBD will lose access to their storage and thus fail. Similarly, VMs lost on the other node will not be able to restart on the surviving node.

For this reason, we will push the sync speed up to about two-thirds of the disk's maximum write speed. For example, if your node can write at a rate of 60 MiB/sec, you will want to sync at about 40 MiB/sec. We don't want to set it too high, as applications that access the drives outside of the DRBD partition could then time out.

drbdsetup /dev/drbd0 syncer -r 40M
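The two-thirds rule of thumb is simple integer arithmetic. As a worked sketch, the helper below takes a measured write speed in MiB/sec and prints a value in the form passed to -r above. The function name sync_rate is made up, and the 60 MiB/sec figure is just the example number from the text.

```shell
#!/bin/sh
# Print roughly two-thirds of the given write speed (in MiB/sec), with an
# 'M' suffix as used with the -r option above. Integer arithmetic only.
sync_rate() {
    echo "$(( $1 * 2 / 3 ))M"
}

# 60 MiB/sec measured write speed gives 40M.
sync_rate 60
```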

The speed-up will not be instant. It will take a little while for the speed to pick up. Once the sync is finished, it is a good idea to revert to the default sync rate.

drbdadm syncer r0

You can proceed with configuration, but pause at the stage where you provision VMs if the sync has not completed.
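If you want a scripted gate before provisioning, a minimal sketch is shown below. It treats the sync as complete once minor 0 reports ds:UpToDate/UpToDate. The function name drbd_sync_done is hypothetical, and the "finished" sample line is constructed from the field values described above rather than captured from a real node.

```shell
#!/bin/sh
# Return 0 when minor 0 reports both disks UpToDate in /proc/drbd-style
# input. The path defaults to /proc/drbd.
drbd_sync_done() {
    grep -q '^ *0: .*ds:UpToDate/UpToDate' "${1:-/proc/drbd}"
}

# Sample files: one mid-sync (from the earlier example), one finished
# (constructed for illustration).
syncing=$(mktemp)
finished=$(mktemp)
echo ' 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----' > "$syncing"
echo ' 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----' > "$finished"

for f in "$syncing" "$finished"; do
    if drbd_sync_done "$f"; then
        echo "sync complete"
    else
        echo "still syncing"
    fi
done
rm -f "$syncing" "$finished"
```

On a node, a loop such as until drbd_sync_done; do sleep 60; done would block until the sync finishes.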

Setting Up CLVM

The goal of DRBD in the cluster is to provide clustered LVM, referred to as CLVM, to the nodes. This is done by turning the DRBD partition into a CLVM physical volume.

So now we will create a PV on top of the new DRBD partition, /dev/drbd0, that we created in the previous step. Since this new LVM PV will exist on top of the shared DRBD partition, whatever gets written to its logical volumes will be immediately available on either node, regardless of which node actually initiated the write.

This capability is the underlying reason for creating this cluster. Neither machine is truly needed, so if one machine dies, anything on top of the DRBD partition will still be available. When the failed machine returns, the surviving node will have a list of what blocks changed while the other node was gone and can use this list to quickly re-sync the other server.

Making LVM Cluster-Aware

Normally, LVM is run on a single server. This means that at any time, LVM can write data to the underlying drive and not need to worry that any other machine might change anything. In clusters, this isn't the case. The other node could try to write to the shared storage, so the nodes need to enable "locking" to prevent the two nodes from trying to work on the same bit of data at the same time.

The process of enabling this locking is known as making LVM "cluster-aware".

LVM has a tool called lvmconf that can be used to enable LVM locking. This is provided as part of the lvm2-cluster package.

yum -y install lvm2-cluster.x86_64

Now, to enable cluster awareness in LVM, run the following command.

lvmconf --enable-cluster

By default, clvmd, the cluster lvm daemon, is stopped and not set to run on boot. Now that we've enabled LVM locking, we need to start it:

/etc/init.d/clvmd status

clvmd is stopped
active volumes: (none)

As expected, it is stopped, so let's start it:

Note: At this point cman is still set to not start at boot. Since we rebooted after creating the partitions that make up /dev/md3, cman will likely be off, as it was in my case. clvmd will fail to start if the cluster manager (cman) is not running, so start cman first. --SRSullivan 17:40, 18 October 2010 (UTC)

/etc/init.d/clvmd start

Activating VGs:   No volume groups found
                                                           [  OK  ]

Note: I've seen on a few occasions where starting clvmd will time out and, on occasion, fences will be issued. I've not sorted out why, but I have usually been able to resolve this by stopping clvmd and cman, then restarting cman and, finally, restarting clvmd. If I can sort out a way to reliably trigger this problem, I will submit a bug report.

Filtering Out Devices

ToDo: Find a less-aggressive filter.

With the stock /etc/lvm/lvm.conf configuration, all devices on the system will be checked for LVM volumes. This can cause a problem as LVM will give preference to the LVM data on the RAID device over the DRBD device. It sees a duplicate as both are, effectively, one and the same.

To work around this, we need to alter the filter = [] entry. At the time of writing, simply rejecting the underlying /dev/md3 device as a candidate wasn't enough. So for now, we will tell LVM to accept DRBD devices and reject all other devices. To do this, we'll insert "a|/dev/drbd*|" as the first array entry and change the existing entry to "r/.*/".

Note: I would love feedback on a filter argument that successfully ignored just /dev/md3, if anyone can suggest one.

vim /etc/lvm/lvm.conf

    # By default we accept every block device:
    #filter = [ "a/.*/" ]
    filter = [ "a|/dev/drbd*|", "r/.*/" ]

Now delete the existing cache file so that LVM is forced to rescan the system.

rm -f /etc/lvm/cache/.cache

The changes take effect immediately.
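To see how the two filter entries behave, the sketch below emulates the decision for a single device path: patterns are tried in order and the first one that matches decides. This is a simplification for illustration only. It does not call LVM, it ignores the fact that a device can be known by several path names, and the function name filter_decision is made up.

```shell
#!/bin/sh
# Emulate the accept/reject decision of the two-entry filter used above
# for one device path. First matching pattern wins.
filter_decision() {
    if printf '%s\n' "$1" | grep -Eq '/dev/drbd*'; then
        echo accept     # matched "a|/dev/drbd*|"
    else
        echo reject     # everything else matches "r/.*/"
    fi
}

for dev in /dev/drbd0 /dev/md3 /dev/sda4; do
    printf '%s: %s\n' "$dev" "$(filter_decision "$dev")"
done
```

Note that the pattern is a regular expression, not a shell glob, so drbd* means "drb" followed by zero or more "d" characters. It still matches /dev/drbd0 because the pattern is not anchored at the end.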

Creating a new PV using the DRBD Partition

Node1

We can now proceed with setting up the new DRBD-based LVM physical volume. Once the PV is created, we can create a new volume group and start allocating space to logical volumes.

Note: As we will be using our DRBD device, and as it is a shared block device, most of the following commands only need to be run on one node. Once the block device changes in any way, those changes will near-instantly appear on the other node. For this reason, unless explicitly stated to do so, only run the following commands on one node.

To setup the DRBD partition as an LVM PV, run pvcreate:

pvcreate /dev/drbd0

  Physical volume "/dev/drbd0" successfully created

Both

Now, on both nodes, check that the new physical volume is visible by using pvdisplay:

pvdisplay

  "/dev/drbd0" is a new physical volume of "422.44 GiB"
  --- NEW Physical volume ---
  PV Name               /dev/drbd0
  VG Name               
  PV Size               422.44 GiB
  Allocatable           NO
  PE Size               0   
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               YHmdip-SuJN-KIEv-2tbK-BT9Q-wfOo-OuQuaW

If you see PV Name /dev/drbd0 (or your underlying partition) on both nodes, then your DRBD setup and LVM configuration changes are working perfectly!

Creating a VG on the new PV

Node1

Now we need to create the volume group using the vgcreate command:

vgcreate -c y drbd0_vg0 /dev/drbd0

  Clustered volume group "drbd0_vg0" successfully created

Both

Now we'll check that the new VG is visible on both nodes using vgdisplay:

vgdisplay

  --- Volume group ---
  VG Name               drbd0_vg0
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  Clustered             yes
  Shared                no
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               422.43 GiB
  PE Size               4.00 MiB
  Total PE              108143
  Alloc PE / Size       0 / 0   
  Free  PE / Size       108143 / 422.43 GiB
  VG UUID               Bb8l9e-es2z-PhaF-Gg3o-2is2-DZ1S-V2RsBF

If the new VG is visible on both nodes, we are ready to create our first logical volume using the lvcreate tool.

Creating the First LV on the new VG

Node1

Now we'll create a simple 20 GiB logical volume. We will use it as a shared GFS2 store for shared files and to store our Xen domU config files later on.

lvcreate -L 20G -n xen_shared drbd0_vg0

  Logical volume "xen_shared" created

Both

As before, we will check that the new logical volume is visible from both nodes by using the lvdisplay command:

lvdisplay

  --- Logical volume ---
  LV Name                /dev/drbd0_vg0/xen_shared
  VG Name                drbd0_vg0
  LV UUID                AqQizc-KBpX-2scN-WFLb-jIeF-QDcM-PlQW84
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                20.00 GiB
  Current LE             5120
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0

Again, if this is visible from both nodes, we're set! Repeat this process for all future LVs you will want to create. We will do this a little later to create LVs for Xen VMs.
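When planning further LVs, it can be handy to know how many extents a given size will consume. As a worked sketch, the helper below divides a size in GiB by the physical extent size in MiB; with the 4.00 MiB PE size reported by vgdisplay, the 20 GiB volume works out to the 5120 extents shown by lvdisplay. The function name lv_extents is made up for this example.

```shell
#!/bin/sh
# Print the number of extents needed for an LV.
#   $1 = LV size in GiB
#   $2 = physical extent size in MiB (4 in the vgdisplay output above)
lv_extents() {
    echo $(( $1 * 1024 / $2 ))
}

# 20 GiB at 4 MiB per extent.
lv_extents 20 4
```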

Creating A Shared GFS FileSystem

GFS2 is a cluster-aware file system that can be mounted on two or more nodes at once. We will use it as a place to store ISOs that we'll use to provision our virtual machines.

Install The GFS2 Utilities

Start by installing the GFS2 tools:

yum -y install gfs2-utils.x86_64

Format Our CLVM LV With The GFS2 File System

Node1

Note: The following example is designed for the cluster used in the prerequisite HowTo.

If you have more than 2 nodes, increase the -j 2 to the number of nodes you want to mount this file system on.
If your cluster is named something other than an-cluster (as set in the cluster.conf file), change -t an-cluster:xen_shared to match your cluster's name. The xen_shared can be whatever you like, but it must be unique in the cluster. I tend to use a name that matches the LV name, but this is my own preference and is not required.
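If you are unsure of the cluster name, it can be read from the name attribute of the cluster tag in cluster.conf. The sketch below does this with sed and builds the value for -t. It assumes the cluster tag and its name attribute sit on one line, as in a typical cluster.conf; the function name cluster_name is hypothetical, and the sample file is a cut-down stand-in written only for this example.

```shell
#!/bin/sh
# Print the cluster name from a cluster.conf-style file.
# The path defaults to /etc/cluster/cluster.conf.
cluster_name() {
    sed -n 's/.*<cluster [^>]*name="\([^"]*\)".*/\1/p' \
        "${1:-/etc/cluster/cluster.conf}" | head -n 1
}

# Cut-down stand-in for cluster.conf, for illustration only.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<?xml version="1.0"?>
<cluster name="an-cluster" config_version="1">
        <clusternodes>
                <clusternode name="an-node01.alteeve.com" nodeid="1"/>
        </clusternodes>
</cluster>
EOF

# Build the lock table name for mkfs.gfs2's -t option.
echo "$(cluster_name "$conf"):xen_shared"
rm -f "$conf"
```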

To format the partition run:

mkfs.gfs2 -p lock_dlm -j 2 -t an-cluster:xen_shared /dev/drbd0_vg0/xen_shared

This will destroy any data on /dev/drbd0_vg0/xen_shared.
It appears to contain: symbolic link to `../dm-0'

Are you sure you want to proceed? [y/n]

Acknowledge the warning, if any, and then press y if you are ready to proceed.

Device:                    /dev/drbd0_vg0/xen_shared
Blocksize:                 4096
Device Size                20.00 GB (5242880 blocks)
Filesystem Size:           20.00 GB (5242878 blocks)
Journals:                  2
Resource Groups:           80
Locking Protocol:          "lock_dlm"
Lock Table:                "an-cluster:xen_shared"
UUID:                      A1487063-2A3F-43B1-3A36-44936B0B4D1E

Once the format completes, you can mount /dev/drbd0_vg0/xen_shared as you would a normal file system.

Both:

To complete the example, let's mount the GFS2 partition we made just now on /xen_shared and then use df -h to verify.

mkdir /xen_shared
mount /dev/drbd0_vg0/xen_shared /xen_shared
df -h

Filesystem            Size  Used Avail Use% Mounted on
/dev/md1               39G  2.8G   34G   8% /
tmpfs                 466M   29M  438M   7% /dev/shm
/dev/md0              243M   70M  161M  31% /boot
xenstore              466M   32K  466M   1% /var/lib/xenstored
/dev/dm-0              20G  259M   20G   2% /xen_shared

You may have noticed that it shows /dev/dm-0 instead of /dev/drbd0_vg0/xen_shared. If you look at the latter, you will see that it is simply a symlink to the former.

ls -lah /dev/drbd0_vg0/xen_shared

lrwxrwxrwx. 1 root root 7 Sep  9 13:24 /dev/drbd0_vg0/xen_shared -> ../dm-0

Add An Entry To /etc/fstab

The last step is to add an entry for this new partition to each node's /etc/fstab file.

Reference The GFS2 Partition By Device Path

This is the more traditional method of referencing the GFS2 partition by using its device path directly.

Warning: An incorrect edit of the /etc/fstab file can leave your system unable to boot! Please review the line generated above to make sure it is accurate and compatible with your setup before proceeding.

vim /etc/fstab

#
# /etc/fstab
# Created by anaconda on Tue Sep  7 06:29:51 2010
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=865690db-85d0-44c5-9f32-ffb6fdf47060 /                       ext3    defaults        1 1
UUID=8b9822b6-a92e-48c9-96b5-f8943142319e /boot                   ext3    defaults        1 2
UUID=94b03547-a7e3-45bb-b2d5-837498b370f4 swap                    swap    defaults        0 0
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
xenfs                   /proc/xen               xenfs   defaults        0 0
/dev/drbd0_vg0/xen_shared /xen_shared           gfs2    rw,suid,dev,exec,nouser,async    0 0

Reference The GFS2 Partition By UUID

It is sometimes preferable to create an fstab entry that locates the device via its UUID. To do this, you can run the following command which, though a bit cryptic, will print out an /etc/fstab compatible string.

Warning: The same warnings apply here as above.

echo `gfs2_edit -p sb /dev/drbd0_vg0/xen_shared | grep sb_uuid | sed -e "s/.*sb_uuid  *\(.*\)/UUID=\L\1\E \/xen_shared\t\tgfs2\trw,suid,dev,exec,nouser,async\t0 0/"`

UUID=a1487063-2a3f-43b1-3a36-44936b0b4d1e /xen_shared gfs2 rw,suid,dev,exec,nouser,async 0 0

You may have noticed that defaults isn't used. Rather, every option that defaults implies, except auto, is set manually. This is because the system will drop to single-user mode if it can't mount an auto partition at boot time. Given that our GFS2 partition sits on top of DRBD and the cluster, there is no way to make it available that early in the boot process.

Further, the gfs2 init script specifically excludes entries in /etc/fstab that have the noauto option set. For this reason, we can't simply specify noauto, as we need the init script to see the partition so that it is mounted when GFS2 starts and unmounted when it stops.

Now add this string to /etc/fstab.

vim /etc/fstab

#
# /etc/fstab
# Created by anaconda on Tue Sep  7 06:29:51 2010
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=865690db-85d0-44c5-9f32-ffb6fdf47060 /                       ext3    defaults        1 1
UUID=8b9822b6-a92e-48c9-96b5-f8943142319e /boot                   ext3    defaults        1 2
UUID=94b03547-a7e3-45bb-b2d5-837498b370f4 swap                    swap    defaults        0 0
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
xenfs                   /proc/xen               xenfs   defaults        0 0
UUID=a1487063-2a3f-43b1-3a36-44936b0b4d1e /xen_shared gfs2 rw,suid,dev,exec,nouser,async 0 0
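Because the choice of mount options matters so much here, it can be worth checking the file after editing it. The sketch below scans an fstab-style file and flags any gfs2 entry that uses defaults, auto or noauto. The function name check_gfs2_fstab is hypothetical, and the check only covers the three options discussed above.

```shell
#!/bin/bash
# Scan an fstab-style file and report gfs2 entries whose options would either
# be mounted too early ('defaults' or 'auto') or be skipped by the gfs2 init
# script ('noauto'). 'check_gfs2_fstab' is a hypothetical helper name.
# Returns 0 when no problem is found, non-zero otherwise.
check_gfs2_fstab() {
    awk '
        BEGIN { bad = 0 }
        /^[[:space:]]*#/ || NF < 4 { next }      # skip comments and short lines
        $3 == "gfs2" {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) {
                if (opts[i] == "defaults" || opts[i] == "auto" || opts[i] == "noauto") {
                    printf "%s: problematic option \"%s\"\n", $2, opts[i]
                    bad = 1
                }
            }
        }
        END { exit bad }
    ' "$1"
}

# Example:
#   check_gfs2_fstab /etc/fstab && echo "gfs2 entries look sane"
```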

Please note: At the time of writing this HowTo, there is a bug in findfs and mount. According to RFC 4122, programs should accept a UUID in either upper or lower case. However, this is not currently the case, so you must pass the UUID in lower case. Please see bugs 632373 and 632385.
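If you get a UUID from some other source and it comes back in upper case, tr will convert it for you. This is a minimal sketch; uuid_to_lower is a made-up name.

```shell
#!/bin/bash
# Convert the hexadecimal letters of a UUID to lower case so that findfs and
# mount will accept it. 'uuid_to_lower' is a hypothetical helper name.
uuid_to_lower() {
    printf '%s\n' "$1" | tr 'A-F' 'a-f'
}

# Example:
#   uuid_to_lower A1487063-2A3F-43B1-3A36-44936B0B4D1E
```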

Testing The gfs2 Initialization Script

To verify that the new entry is valid, check gfs2's status.

/etc/init.d/gfs2 status

Configured GFS2 mountpoints: 
/xen_shared
Active GFS2 mountpoints: 
/xen_shared

Now test stopping and restarting to ensure that the GFS2 partition unmounts and mounts properly.

Stop gfs2.

/etc/init.d/gfs2 stop

Unmounting GFS2 filesystem (/xen_shared):                   [  OK  ]

Check with df -h to ensure that the mount is gone.

df -h

Filesystem            Size  Used Avail Use% Mounted on
/dev/md1               39G  2.0G   35G   6% /
tmpfs                 466M   23M  444M   5% /dev/shm
/dev/md0              243M   70M  161M  31% /boot
xenstore              466M   32K  466M   1% /var/lib/xenstored

Start gfs2 again.

/etc/init.d/gfs2 start

Mounting GFS2 filesystem (/xen_shared):                     [  OK  ]

Again, check with df -h that it has been remounted.

df -h

Filesystem            Size  Used Avail Use% Mounted on
/dev/md1               39G  2.0G   35G   6% /
tmpfs                 466M   29M  438M   7% /dev/shm
/dev/md0              243M   70M  161M  31% /boot
xenstore              466M   32K  466M   1% /var/lib/xenstored
/dev/dm-0              20G  259M   20G   2% /xen_shared

Done!

Altering Daemon Start Order

It is important that the various daemons in use by our cluster start in the right order. Most daemons will rely on services provided by another daemon to be running, and will not start or will not operate reliably otherwise.

We need to make sure that xend starts first so that the network is stable. Then cman needs to start so that fencing and dlm are available. Next, drbd starts so that the clustered storage is available. Then clvmd must start so that the data on the DRBD resource is accessible. After that, gfs2 needs to start so that the Xen domU configuration files can be found, and finally xendomains must start to boot up the actual domU virtual machines.

To restate as a list, the start order must be:

xend
cman
drbd
clvmd
gfs2
xendomains

To make sure the start order is sane then, we'll edit each of the six daemons' init scripts and alter their Required-Start lines. To make the changes take effect, we will use chkconfig to remove and re-add them to the various start levels.

Altering xend

This should already be done. If it isn't, please see "Making xend play nice with clustering" above. If you are revisiting that section, you can skip the cman edit as we will need to make another change in the next step.

Altering cman

This should already be done. If it isn't, please see "Making xend play nice with clustering" above. If you are revisiting that section, you can skip the cman edit as we will need to make another change in the next step.

Altering drbd

Now we will tell drbd to start after cman.

This requires the additional step of altering the chkconfig: - 70 08 line to instead read chkconfig: - 20 08. This isn't strictly needed, but will give more room for chkconfig to order the dependent daemons by allowing DRBD to be started as low as position 20, rather than waiting until position 70. This is somewhat more compatible with cman and clvmd, which normally start at positions 21 and 24, respectively.

vim /etc/init.d/drbd

#!/bin/bash
#
# chkconfig: - 20 08
# description: Loads and unloads the drbd module
#
# Copright 2001-2008 LINBIT Information Technologies
# Philipp Reisner, Lars Ellenberg
#
### BEGIN INIT INFO
# Provides: drbd
# Required-Start: $local_fs $network $syslog cman
# Required-Stop:  $local_fs $network $syslog
# Should-Start:   sshd multipathd
# Should-Stop:    sshd multipathd
# Default-Start:
# Default-Stop:
# Short-Description:    Control drbd resources.
### END INIT INFO
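If you need to make the same chkconfig change on both nodes, it can be scripted instead of done by hand in vim. The sketch below rewrites only the start priority on the "# chkconfig:" header line and leaves the run levels and stop priority alone. The function name set_start_priority is hypothetical, and the Required-Start line still needs to be edited separately.

```shell
#!/bin/bash
# Rewrite the start priority on the "# chkconfig:" header line of an init
# script, keeping the run level and stop priority fields as they are.
# 'set_start_priority' is a hypothetical helper name.
#   $1 = path to the init script, $2 = new start priority
set_start_priority() {
    local script="$1" prio="$2" tmp
    tmp=$(mktemp) || return 1
    if sed -e "s/^\(# chkconfig: *[^ ]* *\)[0-9][0-9]*\( .*\)$/\1${prio}\2/" "$script" > "$tmp"; then
        # Overwrite the contents in place so the file's mode is preserved.
        cat "$tmp" > "$script"
    fi
    rm -f "$tmp"
}

# Example:
#   set_start_priority /etc/init.d/drbd 20
```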

Altering clvmd

Now we will tell clvmd to start after drbd.

Note: There is currently a minor bug with lvm2-cluster version 2.02.73-2 in that /etc/init.d/clvmd is set by default to mode 0555. This is easily corrected by running the following command. Please check bug 636066 to see if this has been resolved.

chmod u+w /etc/init.d/clvmd

Once you've got write access, edit the file.

vim /etc/init.d/clvmd

#!/bin/bash
#
# chkconfig: - 24 76
# description: Starts and stops clvmd
#
# For Red-Hat-based distributions such as Fedora, RHEL, CentOS.
#              
### BEGIN INIT INFO
# Provides: clvmd
# Required-Start: $local_fs drbd
# Required-Stop: $local_fs
# Default-Start:
# Default-Stop: 0 1 6
# Short-Description: Clustered LVM Daemon
### END INIT INFO

Altering gfs2

Now we will tell gfs2 to start after clvmd. You will notice that cman is already listed under Required-Start and Required-Stop. It's true that cman must be started, but we've created a chain here so we can safely replace it with clvmd in the start line.

vim /etc/init.d/gfs2

#!/bin/bash
#
# gfs2 mount/unmount helper
#
# chkconfig: - 26 74
# description: mount/unmount gfs2 filesystems configured in /etc/fstab

### BEGIN INIT INFO
# Provides:             gfs2
# Required-Start:       $network clvmd
# Required-Stop:        $network
# Default-Start:
# Default-Stop:
# Short-Description:    mount/unmount gfs2 filesystems configured in /etc/fstab
# Description:          mount/unmount gfs2 filesystems configured in /etc/fstab
### END INIT INFO

Altering xendomains

Finally, we will alter xendomains so that it starts last, after gfs2.

vim /etc/init.d/xendomains

#!/bin/bash
#
# /etc/init.d/xendomains
# Start / stop domains automatically when domain 0 boots / shuts down.
#
# chkconfig: 345 99 00
# description: Start / stop Xen domains.
#
# This script offers fairly basic functionality.  It should work on Redhat
# but also on LSB-compliant SuSE releases and on Debian with the LSB package
# installed.  (LSB is the Linux Standard Base)
#
# Based on the example in the "Designing High Quality Integrated Linux
# Applications HOWTO" by Avi Alkalay
# <http://www.tldp.org/HOWTO/HighQuality-Apps-HOWTO/>
#
### BEGIN INIT INFO
# Provides:          xendomains
# Required-Start:    $syslog $remote_fs xend gfs2
# Should-Start:
# Required-Stop:     $syslog $remote_fs xend
# Should-Stop:
# Default-Start:     3 4 5
# Default-Stop:      0 1 2 6
# Default-Enabled:   yes
# Short-Description: Start/stop secondary xen domains
# Description:       Start / stop domains automatically when domain 0 
#                    boots / shuts down.
### END INIT INFO

Applying The Changes

Change the start order by removing and re-adding all cluster-related daemons using chkconfig.

chkconfig xenstored off; chkconfig xenconsoled off; chkconfig xend off; chkconfig cman off; chkconfig drbd off; chkconfig clvmd off; chkconfig gfs2 off; chkconfig xendomains off
chkconfig xendomains on; chkconfig gfs2 on; chkconfig clvmd on; chkconfig drbd on; chkconfig cman on; chkconfig xend on; chkconfig xenconsoled on; chkconfig xenstored on

Now verify that the start order is as we want it.

ls -lah /etc/rc3.d/

lrwxrwxrwx.  1 root root   19 Sep 20 13:37 S26xenstored -> ../init.d/xenstored
lrwxrwxrwx.  1 root root   21 Sep 20 13:37 S27xenconsoled -> ../init.d/xenconsoled
lrwxrwxrwx.  1 root root   14 Sep 20 13:37 S28xend -> ../init.d/xend
lrwxrwxrwx.  1 root root   14 Sep 20 13:37 S29cman -> ../init.d/cman
lrwxrwxrwx.  1 root root   14 Sep 20 13:37 S70drbd -> ../init.d/drbd
lrwxrwxrwx.  1 root root   15 Sep 20 13:37 S71clvmd -> ../init.d/clvmd
lrwxrwxrwx.  1 root root   14 Sep 20 13:37 S72gfs2 -> ../init.d/gfs2
lrwxrwxrwx.  1 root root   20 Sep 20 13:37 S99xendomains -> ../init.d/xendomains
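Rather than eyeballing the listing, the order can also be checked with a few lines of shell. This sketch lists the S-links in a run level directory, sorts them the way init would run them, and compares the result against the six-daemon order given earlier. The name check_start_order is made up for this example.

```shell
#!/bin/bash
# Succeed only if the S-links in the given rc directory start the six cluster
# daemons in the order: xend cman drbd clvmd gfs2 xendomains.
# 'check_start_order' is a hypothetical helper name.
check_start_order() {
    local rcdir="$1" want="xend cman drbd clvmd gfs2 xendomains" got
    # Sort the S-links as init would run them, strip the "SNN" prefix and
    # keep only the six daemons we care about.
    got=$(ls "$rcdir" | grep '^S[0-9][0-9]' | LC_ALL=C sort \
          | sed 's/^S[0-9][0-9]//' \
          | grep -x -E 'xend|cman|drbd|clvmd|gfs2|xendomains' \
          | tr '\n' ' ')
    [ "${got% }" = "$want" ]
}

# Example:
#   check_start_order /etc/rc3.d && echo "start order is sane"
```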

Setting Up Xen

WARNING: Everything below here is pretty seriously screwed up.

Note: This is not meant to be an extensive tutorial on Xen itself. It covers enough to get domU VMs provisioned in a manner that will take advantage of the cluster. As such, there is minimal explanation of configuration file options. If you need further help, please drop by the ##xen (yes, two ##) IRC channel on freenode.org.

Install The Hypervisor Tools

These tools are very useful in provisioning and managing domU VMs.

yum -y install virt-install virt-viewer

Install The HVM/KVM Tools

For hvm (Hardware Virtual Machines), which is required for fully virtualized guests such as Microsoft Windows VMs, you must install the following packages as well.

yum -y install qemu-kvm.x86_64 qemu-kvm-tools.x86_64

Ensure That Virtualization Is Enabled

Many motherboards disable hvm by default in their BIOS. Assuming that you've got a dom0 kernel running at this stage, you can check if this is the case by checking the xm info output.

xm info

host                   : an-node04.alteeve.com
release                : 2.6.32.23-170.dom0_an1.fc13.x86_64
version                : #1 SMP Sun Oct 10 20:39:19 EDT 2010
machine                : x86_64
nr_cpus                : 4
nr_nodes               : 1
cores_per_socket       : 4
threads_per_core       : 1
cpu_mhz                : 2209
hw_caps                : 178bf3ff:efd3fbff:00000000:00001310:00802001:00000000:000037ff:00000000
virt_caps              : hvm
total_memory           : 4063
free_memory            : 2987
node_to_cpu            : node0:0-3
node_to_memory         : node0:2987
node_to_dma32_mem      : node0:2928
max_node_id            : 0
xen_major              : 4
xen_minor              : 0
xen_extra              : .1
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64 
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
xen_commandline        : dom0_mem=1024M
cc_compiler            : gcc version 4.4.4 20100630 (Red Hat 4.4.4-10) (GCC) 
cc_compile_by          : root
cc_compile_domain      : 
cc_compile_date        : Mon Oct 11 01:10:38 EDT 2010
xend_config_format     : 4

Look at the virt_caps and xen_caps lines. Notice the hvm entries? This shows that HVM, also known as "secure virtualization", has been enabled. If you do not see this, please check your mainboard manual for information on enabling this on your system.
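The same check can be scripted by piping xm info through grep. This is a small sketch; xm_has_hvm is a made-up name and it only looks at the virt_caps line.

```shell
#!/bin/bash
# Read "xm info" output on stdin and succeed if the virt_caps line lists hvm.
# 'xm_has_hvm' is a hypothetical helper name.
xm_has_hvm() {
    grep -E -q '^virt_caps[[:space:]]*:.*hvm'
}

# Example:
#   xm info | xm_has_hvm && echo "HVM is enabled"
```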

Note: The next paragraph applies only when running a vanilla kernel.

If you are running a vanilla kernel, you can check to see if your CPU has support for HVM guests by checking /proc/cpuinfo. What you're looking for depends on your CPU manufacturer. If you have an Intel CPU, you need to look for the vmx flag. Likewise, with AMD CPUs, you need to look for the svm flag.
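A quick way to do that lookup is to grep the flags line. The sketch below prints vmx, svm or none for a cpuinfo-style file. The function name cpu_virt_flag is hypothetical; the file defaults to /proc/cpuinfo when no argument is given.

```shell
#!/bin/bash
# Print which hardware virtualization flag a cpuinfo-style file advertises:
# "vmx" (Intel), "svm" (AMD) or "none". Returns non-zero when neither is found.
# 'cpu_virt_flag' is a hypothetical helper name.
cpu_virt_flag() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    if grep -E -q '^flags[[:space:]]*:.*[[:space:]]vmx([[:space:]]|$)' "$cpuinfo"; then
        echo "vmx"
    elif grep -E -q '^flags[[:space:]]*:.*[[:space:]]svm([[:space:]]|$)' "$cpuinfo"; then
        echo "svm"
    else
        echo "none"
        return 1
    fi
}

# Example:
#   cpu_virt_flag
```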

For a more complete, if somewhat dated, paper on this topic, please see the Fedora 6 Xen Quickstart Guide, System Requirements.

Enabling Migration

By default, xend will not allow domU VMs to be migrated onto or off of a given dom0 host. Given that we've got a cluster though, we very much want this behaviour, so now we will enable it. This is done by making edits to /etc/xen/xend-config.sxp. Below is a concise list of options that must be set. Some exist already in the file and need to be commented out or altered.

Warning: The values below are very permissive. Please review each option and improve the security to fit your network before going into production!

vim /etc/xen/xend-config.sxp

(xend-http-server yes)
(xend-unix-server yes)
(xend-tcp-xmlrpc-server yes)
(xend-relocation-server yes)
(xend-udev-event-server yes)
(xend-port            8000)
(xend-relocation-port 8002)
(xend-address '')
(xend-relocation-address '')
(xend-relocation-hosts-allow '')

Once done, restart xend. It is usually safest to stop the cluster beforehand to avoid accidental fencing caused by the underlying network being reconfigured.

/etc/init.d/gfs2 stop
/etc/init.d/clvmd stop
/etc/init.d/cman stop
/etc/init.d/xend restart
/etc/init.d/cman start
/etc/init.d/clvmd start
/etc/init.d/gfs2 start

Virtual Machine Naming Convention

Note: This section acts as a recommendation only. Feel free to alter this to fit your style and needs.

Personally, I like to name my VMs similar to c5_shorewall_01. To elaborate, I like to use the format:

os_role_seq (<Operating System ID>_<Role of the VM>_<Sequence Integer>).

There are no (known) restrictions on virtual machine names, so feel free to use names that make sense to you. I do strongly recommend that you match the name of your domU VM to the name of its host LVM logical volume.
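If you do adopt the os_role_seq format, a tiny check like the one below can catch typos before an LV or domU is created with the wrong name. Note that the exact character classes (lower-case letters and digits for the first two parts, digits only for the sequence) are my assumption; the convention itself doesn't spell them out. The name is_valid_vm_name is made up.

```shell
#!/bin/bash
# Succeed if a name looks like <os>_<role>_<sequence>, e.g. c5_shorewall_01.
# The allowed characters are an assumption made for this example only.
# 'is_valid_vm_name' is a hypothetical helper name.
is_valid_vm_name() {
    printf '%s\n' "$1" | grep -E -q '^[a-z0-9]+_[a-z0-9]+_[0-9]+$'
}

# Example:
#   is_valid_vm_name c5_shorewall_01 && echo "name follows the convention"
```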

Provisioning domU VMs

There are two ways to provision new VMs that we will cover (there are many others): using virt-install and using xm create -f /path/to/domain.cfg, where domain.cfg is a hand-crafted python script.

Provisioning with virt-install

This uses a long command line argument that provisions the VM and loads it into libvirt. Where possible, this is probably the best way to provision a domU. However, there are occasions where it may not work.

Here is an example where a domU is provisioned for a Fedora 13, x86_64 VM with a dedicated logical volume on the clustered LVM. The command to create the LV precedes the command to provision the VM. Please review and adjust values as you need. Consult man virt-install for a more complete list of available options and their uses.

Following the provision command are two examples of how to back up the configuration to a flat file. The first directly dumps the configuration into the libvirt format. The second backs up the configuration to an XML file and then shows an example of how that file can be converted into a standard python script.

# Fedora 13 x86_64 RPM builder VM
lvcreate -L 40G -n f13_builder_01 drbd0_vg0
virt-install --connect xen \
             --name f13_builder_01 \
             --ram 1024 \
             --arch x86_64 \
             --vcpus 1 \
             --cpuset 1 \
             --location http://192.168.1.10/f13/x86_64/img/ \
             --os-type linux \
             --os-variant fedora13 \
             --disk path=/dev/drbd0_vg0/f13_builder_01 \
             --network bridge=eth0,mac=00:16:3e:00:10:01 \
             --vnc \
             --paravirt
# Back up the domU configuration to the libvirt native format.
xm list -l f13_builder_01 > /xen_shared/domU_config/f13_builder.cfg
# Back up the config to an XML file.
virsh dumpxml f13_builder_01 > /xen_shared/domU_config/f13_builder_01.xml
# Convert it to a "traditional" python script. Be sure to edit the resulting .cfg file to remove the 'vifname=' sections.
virsh -c xen:/// domxml-to-native xen-xm f13_builder_01.xml > f13_builder_01.cfg

The next two examples show the provisioning of CentOS v5.5 and RHEL v6.0 beta 2 machines. The dump and backup methods above should be easily adapted to work with these VMs.

# A CentOS test server
lvcreate -L 40G -n c5_test_01 drbd0_vg0
virt-install --connect xen \
             --name c5_test_01 \
             --ram 1024 \
             --arch x86_64 \
             --vcpus 1 \
             --cpuset 1-3 \
             --location http://192.168.1.10/c5/x86_64/img/ \
             --os-type linux \
             --os-variant rhel5.4 \
             --disk path=/dev/drbd0_vg0/c5_test_01 \
             --network bridge=eth0,mac=00:16:3e:00:10:02 \
             --vnc \
             --paravirt

# Red Hat Enterprise Linux 6 beta 2 test server
lvcreate -L 40G -n rh6b2_test_01 drbd0_vg0
virt-install --connect xen \
             --name rh6b2_test_01 \
             --ram 1024 \
             --arch x86_64 \
             --vcpus 1 \
             --cpuset 1 \
             --location http://192.168.1.10/rhel6/x86_64/img/ \
             --os-type linux \
             --os-variant rhel6 \
             --disk path=/dev/drbd0_vg0/rh6b2_test_01 \
             --network bridge=eth0,mac=00:16:3e:00:10:03 \
             --vnc \
             --paravirt
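All of the MAC addresses in these examples start with 00:16:3e, which is the prefix reserved for Xen guests. If you want to generate the rest consistently, printf can do the hex formatting. How you pick the last three octets is up to you; the three-integer scheme below is only an assumption for illustration, and xen_mac is a made-up name.

```shell
#!/bin/bash
# Build a MAC address in the Xen prefix (00:16:3e) from three decimal
# integers in the range 0-255. Pass the numbers without leading zeros, as
# printf would otherwise read them as octal.
# 'xen_mac' is a hypothetical helper name.
xen_mac() {
    printf '00:16:3e:%02x:%02x:%02x\n' "$1" "$2" "$3"
}

# Example: reproduces the address used for f13_builder_01 above.
#   xen_mac 0 16 1
```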

Provisioning With 'xm create'

At the time of writing this, I could not sort out the magical incantation for provisioning a Windows VM using virt-install. Instead, I used the "old" style of crafting a configuration file using a python script. This is useful to know as many templates exist on the web for various VMs. Following the steps below, you should be able to fairly easily adapt them.

In this example, we will provision a VM using HVM from an ISO image of the Windows 2008 Server installation DVD. In this case, there is a problem with virt-install finding the qemu-dm file.

This configuration file will be saved in the /xen_shared/domU_config directory, which exists on the shared GFS2 partition we created earlier.

mkdir /xen_shared/domU_config
vim /xen_shared/domU_config/win2008_sql_01.cfg

# This is the Windows 2008 Enterprise Server x86_64 hosting MS SQL Server 2008 Enterprise
kernel = "/usr/lib/xen/boot/hvmloader"
builder='hvm'
memory = 1024

# Should be at least 2KB per MB of domain memory, plus a few MB per vcpu.
shadow_memory = 8
name = "win2008_sql_01"
#vif = [ 'type=ioemu, bridge=xenbr0' ]
vif = [ 'type=ioemu, bridge=eth0,mac=00:16:3e:00:30:03' ]
acpi = 1
apic = 1
# Remove the 'file:...' entry (or change it to another ISO) after the install is complete.
disk = [ 'phy:/dev/drbd0_vg0/win2008_sql_01,hda,w', 'file:/xen_shared/iso/MS-Win2008-Ent-x86_64-SP2.iso,hdc:cdrom,r' ]

device_model = '/usr/lib/xen/bin/qemu-dm'

#-----------------------------------------------------------------------------
# boot on floppy (a), hard disk (c) or CD-ROM (d) 
# default: hard disk, cd-rom, floppy
boot="dc"
sdl=0
vnc=1
vncconsole=1
vncpasswd=''

serial='pty'
usbdevice='tablet'

Now provision the VM using xm create.

xm create -f /xen_shared/domU_config/win2008_sql_01.cfg

At this point, the VM is not loaded into libvirt. This is a problem as, on boot, the node will not know that the VM exists. As a consequence, some tools like virt-manager will not see the VM until it is manually started with xm create -f /xen_shared/domU_config/domain.cfg. Further, and perhaps more troubling, changes made to the VM's configuration outside of the config file will be lost when you restart from the config file.

To fix this, we'll use a few hacks to load the config into libvirt. This needs to be done while the domU is loaded and it must be done on the node currently hosting the VM.

First, make sure that the domU's configuration is visible. Then, dump the config and filter it through grep and sed to pull out the UUID. Once you know that you get the proper UUID, create the directory under /var/lib/xend/domains/ and then create the config.sxp file with the domU's configuration.

NOTE: Be sure to change win2008_01 in the examples below to match the name of the domU that you want to setup.

xm list -l win2008_01
xm list -l win2008_01 | grep '^    (uuid' | sed -e "s/    (uuid \(.*\))/\1/"
mkdir /var/lib/xend/domains/`xm list -l win2008_01 | grep '^    (uuid' | sed -e "s/    (uuid \(.*\))/\1/"`
cd /var/lib/xend/domains/`xm list -l win2008_01 | grep '^    (uuid' | sed -e "s/    (uuid \(.*\))/\1/"`
xm list -l win2008_01 > config.sxp

There are several uuid strings in the output from xm list -l domain. Thankfully, the one we want is the only one indented by four spaces. This is how we can be fairly confident that the UUID returned by sed above is, in fact, the domU's UUID.
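The grep and sed pair is repeated three times in the commands above, so it may be tidier to wrap it in a function and store the result in a variable. The sketch below uses exactly the same filter; the function name domu_uuid is hypothetical.

```shell
#!/bin/bash
# Read "xm list -l <domain>" output on stdin and print the domU's UUID, which
# is the only uuid line indented by exactly four spaces.
# 'domu_uuid' is a hypothetical helper name.
domu_uuid() {
    grep '^    (uuid' | sed -e 's/    (uuid \(.*\))/\1/'
}

# Example, on the node currently hosting the domU:
#   uuid=$(xm list -l win2008_01 | domu_uuid)
#   mkdir "/var/lib/xend/domains/$uuid"
#   xm list -l win2008_01 > "/var/lib/xend/domains/$uuid/config.sxp"
```

Storing the UUID once also means the directory name can't change between the mkdir and the cd if the output happens to differ between calls.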

Making VMs Highly Available

Now this is the point, isn't it?

In this final step, we're going to move the startup of drbd, clvmd, gfs2 and xendomains out of init.d and move them into our cluster. The reason for this is that the cluster is much wiser about how to handle clustered services. We do not want a node that is not in the cluster, that is, a node without quorum, to try and connect to shared resources. The cluster manager handles this.

Note: This how-to is trying to keep things simple, so we will use rgmanager which is built in to cman. There is a compelling argument to use Pacemaker instead. That is a bit beyond the scope of the How-To though.

Disable All Cluster Software From Starting At Boot

Note: This will be moved to another section or possibly removed entirely before the final release.

There have been problems on booting both nodes at the same time causing DRBD split brain. For this reason, I am currently advising disabling drbd, clvmd and gfs2 in addition to the other services to be disabled in the next step. The reason for this is that, by manually starting these services, you can catch failures and correct them before they cascade to other daemons. By default, if the drbd daemon were to fail to come up, the system would move on and try to start clvmd and the rest, which would obviously fail given the lack of DRBD. I am working on merging the startup of these services into the cluster.conf file, but that will be a little while yet.

Install rgmanager

The cluster tool rgmanager will provide the cluster-related service management and restart VMs lost on a failed node. We will install it now.

yum -y install rgmanager

Removing Clustered Services From initd

As stated, we will now remove the cluster-related services from starting with the OS.

yum -y install rgmanager
chkconfig xendomains off; chkconfig gfs2 off; chkconfig clvmd off; chkconfig drbd off; chkconfig rgmanager off

Manual Startup

When manually starting the cluster, please do it in the following order, ensuring that each service did indeed start before moving on.

Starting drbd

Start DRBD on both nodes at close to the same time. Once started, check /proc/drbd to ensure that the nodes are syncing or connected.

/etc/init.d/drbd start
cat /proc/drbd

If the DRBD array is in StandAlone and fails to connect, then you likely have a split brain condition. To recover, you need to identify which node you trust has the most recent view of the data. For this article, let's assume that an-node01 is the node we trust and an-node02 is the node we will discard.

Warning: When you invalidate a node's DRBD array, any changes made to it since the split brain occurred will be lost. For this reason, be very careful about how you proceed. If you are at all in doubt, back up the DRBD device of the node to be invalidated before proceeding. Assuming that the DRBD backing device is /dev/md3, then to back it up you could use dd (disk duplicate) to copy the contents to a destination of equal or greater size than the /dev/md3 device. If you do not have enough space, you may be able to pipe the output through a compression program like gzip or bzip2.

To back up the device to be invalidated, if uncertain that it is safe to overwrite, run the following.

dd if=/dev/md3 of=/path/to/drbd_backup.img
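If space is tight, the compressed variant mentioned in the warning could look something like the sketch below. The function name backup_device_gz and the block size are my own choices for illustration; adjust the paths to match your setup.

```shell
#!/bin/bash
# Copy a block device (or any file) through gzip into a compressed image.
#   $1 = source device, $2 = destination .gz file
# 'backup_device_gz' is a hypothetical helper name.
backup_device_gz() {
    # dd's transfer statistics go to stderr; they are discarded here.
    dd if="$1" bs=1048576 2>/dev/null | gzip -c > "$2"
}

# Example:
#   backup_device_gz /dev/md3 /path/to/drbd_backup.img.gz
# To restore later:
#   gzip -dc /path/to/drbd_backup.img.gz | dd of=/dev/md3 bs=1048576
```

How much space this saves depends entirely on how compressible the data on the device is.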

Once you are ready to proceed, and remembering that we have decided that an-node01 is the most up to date and that r0 is the name of the split-brain resource, run the following sets of commands on the appropriate machines.

Both

/etc/init.d/drbd stop
modprobe drbd
drbdadm attach r0

an-node02

drbdadm invalidate r0

Both

drbdadm connect r0
cat /proc/drbd

Ensure that both nodes are connected and that the array has begun syncing. Note that at this stage, both nodes will be Secondary, an-node01 will be UpToDate and an-node02 will be Inconsistent. If this is the case, then you are safe to proceed. If not, resolve the issue before going any further.
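When checking /proc/drbd, the fields of interest are cs (connection state), ro (roles) and ds (disk states). The sketch below pulls one of those fields out for the first resource listed. It assumes the "key:value" layout that /proc/drbd used at the time of writing, so treat it as an illustration rather than something to rely on across DRBD versions. The name drbd_field is made up.

```shell
#!/bin/bash
# Print the value of one "key:value" field (cs, ro or ds) from the first
# resource line of a /proc/drbd style file.
#   $1 = key, $2 = file to read (defaults to /proc/drbd)
# 'drbd_field' is a hypothetical helper name; the line layout is an assumption.
drbd_field() {
    awk -v key="$1" '
        $1 ~ /^[0-9]+:$/ {                       # resource lines start with "N:"
            for (i = 2; i <= NF; i++) {
                if (index($i, key ":") == 1) {
                    print substr($i, length(key) + 2)
                    exit
                }
            }
        }
    ' "${2:-/proc/drbd}"
}

# Example:
#   drbd_field ds      # e.g. UpToDate/Inconsistent while an-node02 resyncs
```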

Both

drbdadm primary r0

Note: There is no need to run /etc/init.d/drbd start now as we've replicated what it does to get to this stage.

Warning: While DRBD is synchronizing, the Inconsistent node will shut down its DRBD array if the UpToDate node shuts down or is fenced!

Starting clvmd

Warning: In rare cases, I've seen clvmd spinlock (a kernel lock that kill -9 won't stop). In these cases, I've found that there is no recourse short of forcing down the node. Trying to reboot will often cause one of the two nodes to get fenced. For this reason, if clvmd appears to hang and ps aux | grep clvmd shows clvmd -T30 which cannot be stopped with kill -9, then force power down the node via a fence call or by pressing and holding the node's power button until the power is off.

/etc/init.d/clvmd start

Editing vm.sh

ToDo

SSH Setup

You need to make sure that each node's SSH public key is copied into its own authorized_keys file. Then you should be able to migrate your VMs using virsh.
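One way to add the key without ending up with duplicate lines is sketched below. The function name authorize_key is hypothetical, and the paths in the example assume root's default RSA key; use whatever key file you actually generated.

```shell
#!/bin/bash
# Append a public key to an authorized_keys file unless it is already there.
#   $1 = public key file, $2 = authorized_keys file
# 'authorize_key' is a hypothetical helper name.
authorize_key() {
    local pubkey="$1" auth="$2" key
    key=$(cat "$pubkey") || return 1
    touch "$auth" && chmod 600 "$auth" || return 1
    # -F: fixed string, -x: match the whole line, -q: no output
    grep -q -x -F -- "$key" "$auth" || printf '%s\n' "$key" >> "$auth"
}

# Example (paths are assumptions):
#   authorize_key /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys
```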

ToFinish

Testing Live Migration Using virsh

# Assuming that the VM 'f13_builder_01' is running on this node and 'an-node02' is the destination node.
virsh migrate --live f13_builder_01 xen+ssh:/// xenmigr://an-node02

http://libvirt.org/remote.html

Syncing domU Configuration

To make a VM available on both nodes at all times, we will use virsh on the original host to first export the XML configuration to a file on our GFS partition, and then we will use it on the second node to define that VM. This process must be repeated whenever the configuration of the domU changes.

In this example, we will copy the configuration for the domU called f13_builder_01 running on an-node01 so that it becomes available on an-node02.

an-node01

virsh dumpxml f13_builder_01 > /xen_shared/domU_config/f13_builder_01.xml
cat /xen_shared/domU_config/f13_builder_01.xml

<domain type='xen'>
  <name>f13_builder_01</name>
  <uuid>ad211e8f-a685-79fe-b217-cb94bea6d2bd</uuid>
  <memory>1048576</memory>
  <currentMemory>1048576</currentMemory>
  <vcpu cpuset='1'>1</vcpu>
  <bootloader>/usr/bin/pygrub</bootloader>
  <os>
    <type>linux</type>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/lib/xen/bin/qemu-dm</emulator>
    <disk type='block' device='disk'>
      <driver name='phy'/>
      <source dev='/dev/drbd0_vg0/f13_builder_01'/>
      <target dev='xvda' bus='xen'/>
    </disk>
    <interface type='bridge'>
      <mac address='00:16:3e:00:10:01'/>
      <source bridge='eth0'/>
      <script path='/etc/xen/scripts/vif-bridge'/>
      <target dev='vif-1.0'/>
    </interface>
    <console type='pty'>
      <target port='0'/>
    </console>
    <input type='mouse' bus='xen'/>
    <graphics type='vnc' port='-1' autoport='yes' keymap='en-us'/>
  </devices>
</domain>

an-node02

virsh define /xen_shared/domU_config/f13_builder_01.xml

Domain f13_builder_01 defined from /xen_shared/domU_config/f13_builder_01.xml

xm list

Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  1024     4     r-----    943.4
f13_builder_01                                  1024     1                 0.0
win2008_01                                      1024     1                66.6

Adding The <rm> Section to cluster.conf

ToDo

Pushing The Updated cluster.conf To The Other Node

ToDo

Testing rgmanager

Testing rgmanager involves stopping the VM, freezing it, using rg_test to start the VM, checking its status and stopping the VM again. If this works, we'll then thaw the resource, manually start the domU and then do a test kill of the node hosting the VM to see if the second node will, in fact, start the lost VM.

Start the cluster, if it's not already running. Make sure that the VM is working by manually starting and stopping it before proceeding.

For this test, we will use the f13_builder_01 domU VM running on an-node01. Freeze the resource using clusvcadm. This only needs to be run on one node to freeze the service (VM) on both nodes.

clusvcadm -Z vm:f13_builder_01

Local machine freezing vm:f13_builder_01...Success

Check that it is in fact frozen with clustat. Note that at this stage the domU VM is stopped.

clustat

Cluster Status for an-cluster03 @ Wed Oct 20 23:48:14 2010
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online, Local, rgmanager
 an-node05.alteeve.com                       2 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 vm:f13_builder_01              (none)                         stopped    [Z]

Now that we know it's seen by rgmanager, is stopped and is frozen ([Z]), we can proceed with the test start.
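If you want to script this check, the service line in the clustat output is easy to pick apart with awk. The sketch below prints the state of a service and whether the [Z] flag is present. It assumes the column layout shown above; service_state is a made-up name.

```shell
#!/bin/bash
# Read clustat output on stdin and print "<state> frozen" or
# "<state> not-frozen" for the named service.
#   $1 = service name, e.g. vm:f13_builder_01
# 'service_state' is a hypothetical helper name.
service_state() {
    awk -v svc="$1" '
        $1 == svc {
            if ($NF == "[Z]") print $(NF - 1), "frozen"
            else              print $NF, "not-frozen"
        }'
}

# Example:
#   clustat | service_state vm:f13_builder_01
```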

an-node01

rg_test test /etc/cluster/cluster.conf start vm f13_builder_01

Running in test mode.
Loading resource rule from /usr/share/cluster/oracledb.sh
Loading resource rule from /usr/share/cluster/lvm.sh
Loading resource rule from /usr/share/cluster/lvm_by_lv.sh
Loading resource rule from /usr/share/cluster/mysql.sh
Loading resource rule from /usr/share/cluster/service.sh
Loading resource rule from /usr/share/cluster/vm.sh
Loading resource rule from /usr/share/cluster/apache.sh
Loading resource rule from /usr/share/cluster/nfsexport.sh
Loading resource rule from /usr/share/cluster/drbd.sh
Loading resource rule from /usr/share/cluster/fs.sh
Loading resource rule from /usr/share/cluster/script.sh
Loading resource rule from /usr/share/cluster/SAPInstance
Loading resource rule from /usr/share/cluster/tomcat-6.sh
Loading resource rule from /usr/share/cluster/postgres-8.sh
Loading resource rule from /usr/share/cluster/nfsclient.sh
Loading resource rule from /usr/share/cluster/samba.sh
Loading resource rule from /usr/share/cluster/ip.sh
Loading resource rule from /usr/share/cluster/nfsserver.sh
Loading resource rule from /usr/share/cluster/ASEHAagent.sh
Loading resource rule from /usr/share/cluster/netfs.sh
Loading resource rule from /usr/share/cluster/tomcat-5.sh
Loading resource rule from /usr/share/cluster/lvm_by_vg.sh
Loading resource rule from /usr/share/cluster/SAPDatabase
Loading resource rule from /usr/share/cluster/clusterfs.sh
Loading resource rule from /usr/share/cluster/openldap.sh
Loading resource rule from /usr/share/cluster/ocf-shellfuncs
Loading resource rule from /usr/share/cluster/named.sh
Loading resource rule from /usr/share/cluster/svclib_nfslock
Loading resource rule from /usr/share/cluster/smb.sh
Starting f13_builder_01...
Hypervisor: xen
Management tool: virsh
Hypervisor URI: xen+ssh:///
Migration URI format: xenmigr://target_host/
Virtual machine f13_builder_01 is shut off
<debug>  virsh -c xen+ssh:/// start f13_builder_01
[vm] virsh -c xen+ssh:/// start f13_builder_01
Domain f13_builder_01 started

Start of f13_builder_01 complete

Now confirm that it really did start.

an-node01

xm list

Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  1024     4     r-----   1218.4
f13_builder_01                               1  1024     1     ------      9.7
win2008_01                                      1024     1                 0.0

Wonderful!
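If you would rather script this check than read the table by eye, here is a minimal sketch. It assumes the xm list column layout shown above, where a running domU has an ID and a six-character State field and a defined-but-stopped domU has neither. The function name domu_is_running is hypothetical, not part of Xen.

```shell
# Hypothetical helper: reads 'xm list' output on stdin and succeeds only if
# the named domain has a numeric ID and a six-character state field.
# A domain that is defined but not running (like win2008_01 above) has
# blank ID and State columns, so its line has fewer fields and fails.
domu_is_running() {
    awk -v name="$1" '
        $1 == name && $2 ~ /^[0-9]+$/ && length($5) == 6 && $5 ~ /^[rbpscd-]+$/ {
            found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

It could be used as: xm list | domu_is_running f13_builder_01 && echo "running"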

Note that if you now run clustat on either node, the VM will not show as running. This is expected, because the service is frozen.

Now check the status.

rg_test test /etc/cluster/cluster.conf status vm f13_builder_01

Running in test mode.
Loading resource rule from /usr/share/cluster/oracledb.sh
Loading resource rule from /usr/share/cluster/lvm.sh
Loading resource rule from /usr/share/cluster/lvm_by_lv.sh
Loading resource rule from /usr/share/cluster/mysql.sh
Loading resource rule from /usr/share/cluster/service.sh
Loading resource rule from /usr/share/cluster/vm.sh
Loading resource rule from /usr/share/cluster/apache.sh
Loading resource rule from /usr/share/cluster/nfsexport.sh
Loading resource rule from /usr/share/cluster/drbd.sh
Loading resource rule from /usr/share/cluster/fs.sh
Loading resource rule from /usr/share/cluster/script.sh
Loading resource rule from /usr/share/cluster/SAPInstance
Loading resource rule from /usr/share/cluster/tomcat-6.sh
Loading resource rule from /usr/share/cluster/postgres-8.sh
Loading resource rule from /usr/share/cluster/nfsclient.sh
Loading resource rule from /usr/share/cluster/samba.sh
Loading resource rule from /usr/share/cluster/ip.sh
Loading resource rule from /usr/share/cluster/nfsserver.sh
Loading resource rule from /usr/share/cluster/ASEHAagent.sh
Loading resource rule from /usr/share/cluster/netfs.sh
Loading resource rule from /usr/share/cluster/tomcat-5.sh
Loading resource rule from /usr/share/cluster/lvm_by_vg.sh
Loading resource rule from /usr/share/cluster/SAPDatabase
Loading resource rule from /usr/share/cluster/clusterfs.sh
Loading resource rule from /usr/share/cluster/openldap.sh
Loading resource rule from /usr/share/cluster/ocf-shellfuncs
Loading resource rule from /usr/share/cluster/named.sh
Loading resource rule from /usr/share/cluster/svclib_nfslock
Loading resource rule from /usr/share/cluster/smb.sh
Checking status of f13_builder_01...
Hypervisor: xen
Management tool: virsh
Hypervisor URI: xen+ssh:///
Migration URI format: xenmigr://target_host/
Virtual machine f13_builder_01 is idle
Status of f13_builder_01 is good

The last line is exactly what we want. Finally, stop the domU VM. If you're watching the VM itself over VNC, you should see it shut down gracefully in the next step.

rg_test test /etc/cluster/cluster.conf stop vm f13_builder_01

Running in test mode.
Loading resource rule from /usr/share/cluster/oracledb.sh
Loading resource rule from /usr/share/cluster/lvm.sh
Loading resource rule from /usr/share/cluster/lvm_by_lv.sh
Loading resource rule from /usr/share/cluster/mysql.sh
Loading resource rule from /usr/share/cluster/service.sh
Loading resource rule from /usr/share/cluster/vm.sh
Loading resource rule from /usr/share/cluster/apache.sh
Loading resource rule from /usr/share/cluster/nfsexport.sh
Loading resource rule from /usr/share/cluster/drbd.sh
Loading resource rule from /usr/share/cluster/fs.sh
Loading resource rule from /usr/share/cluster/script.sh
Loading resource rule from /usr/share/cluster/SAPInstance
Loading resource rule from /usr/share/cluster/tomcat-6.sh
Loading resource rule from /usr/share/cluster/postgres-8.sh
Loading resource rule from /usr/share/cluster/nfsclient.sh
Loading resource rule from /usr/share/cluster/samba.sh
Loading resource rule from /usr/share/cluster/ip.sh
Loading resource rule from /usr/share/cluster/nfsserver.sh
Loading resource rule from /usr/share/cluster/ASEHAagent.sh
Loading resource rule from /usr/share/cluster/netfs.sh
Loading resource rule from /usr/share/cluster/tomcat-5.sh
Loading resource rule from /usr/share/cluster/lvm_by_vg.sh
Loading resource rule from /usr/share/cluster/SAPDatabase
Loading resource rule from /usr/share/cluster/clusterfs.sh
Loading resource rule from /usr/share/cluster/openldap.sh
Loading resource rule from /usr/share/cluster/ocf-shellfuncs
Loading resource rule from /usr/share/cluster/named.sh
Loading resource rule from /usr/share/cluster/svclib_nfslock
Loading resource rule from /usr/share/cluster/smb.sh
Stopping f13_builder_01...
Hypervisor: xen
Management tool: virsh
Hypervisor URI: xen+ssh:///
Migration URI format: xenmigr://target_host/
<debug>  Virtual machine f13_builder_01 is idle
[vm] Virtual machine f13_builder_01 is idle
virsh shutdown f13_builder_01 ...
Domain f13_builder_01 is being shutdown

Stop of f13_builder_01 complete

If the domU VM shut down, then this stage of the testing completed successfully!
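Each rg_test run above prints the same block of "Loading resource rule" lines before the part that matters. As a rough sketch, the filter below hides that block and checks for the closing line you expect. The function name rg_test_expect is hypothetical; the expected lines are the ones shown in the output above.

```shell
# Hypothetical helper: reads rg_test output on stdin, drops the
# "Loading resource rule" lines, prints everything else, and succeeds
# only if a line exactly matching the expected text was seen.
rg_test_expect() {
    awk -v want="$1" '
        /^Loading resource rule from / { next }
        { print }
        $0 == want { ok = 1 }
        END { exit (ok ? 0 : 1) }
    '
}
```

For example, the status check could be run as: rg_test test /etc/cluster/cluster.conf status vm f13_builder_01 | rg_test_expect "Status of f13_builder_01 is good". The start and stop tests would use "Start of f13_builder_01 complete" and "Stop of f13_builder_01 complete" instead. If rg_test writes some of its messages to stderr on your system, add 2>&1 before the pipe.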

The last step is to thaw the service and then check that it is, indeed, thawed.

clusvcadm -U vm:f13_builder_01

Local machine unfreezing vm:f13_builder_01...Success

clustat

Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online, Local, rgmanager
 an-node05.alteeve.com                       2 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 vm:f13_builder_01              (none)                         stopped

Note that the [Z] is gone now.
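The same check can be scripted. This is only a sketch: it assumes that clustat prints the [Z] flag on the same line as the service name, and service_is_frozen is a hypothetical name.

```shell
# Hypothetical helper: reads clustat output on stdin and succeeds only if
# the line for the named service carries the frozen flag, [Z].
# Assumption: the flag is printed on the service's own line.
service_is_frozen() {
    awk -v svc="$1" '
        $1 == svc && index($0, "[Z]") > 0 { frozen = 1 }
        END { exit (frozen ? 0 : 1) }
    '
}
```

After the thaw, clustat | service_is_frozen vm:f13_builder_01 should return a non-zero exit code.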

Starting The domU VM Using clusvcadm

In order for rgmanager to know that a service is running, in this case our VM, we must start the service using clusvcadm.

clusvcadm -e vm:f13_builder_01

Local machine trying to enable vm:f13_builder_01...Success
vm:f13_builder_01 is now running on an-node04.alteeve.com

After a few moments, clustat on either node should show the domU VM as started.

clustat

Cluster Status for an-cluster03 @ Thu Oct 21 00:34:10 2010
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online, Local, rgmanager
 an-node05.alteeve.com                       2 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 vm:f13_builder_01              an-node04.alteeve.com          started

With the VM running, we are now ready to do our destructive test. If this works, f13_builder_01 should be restarted on an-node02 once an-node01 dies. To do this, we will run killall -9 corosync on an-node01 while watching an-node02. Personally, I like to have terminals open, each running one of:

clear; tail -f -n 0 /var/log/messages
watch clustat
watch xm list
watch cat /proc/drbd
watch cman_tool status

If you have limited screen real estate, watch at least /var/log/messages and clustat, as they will be the most informative in this test.

an-node01

This next command will kill corosync. Within a second or two, an-node02 should declare an-node01 dead and then fence it. Once the fence succeeds, a new cluster configuration will form and rgmanager should start the VM on the surviving node.

killall -9 corosync
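On the surviving node, the recovery described above can also be confirmed with a small polling loop rather than by eye. This is a sketch under assumptions: both function names are hypothetical, the parsing relies on the clustat service line layout shown earlier (name, owner, state), and the 2-second interval and 60 tries are arbitrary.

```shell
# Hypothetical helper: reads clustat output on stdin and prints
# "<owner> <state>" for the named service, e.g. "(none) stopped".
service_owner_state() {
    awk -v svc="$1" '$1 == svc { print $2, $3 }'
}

# Hypothetical helper: polls clustat until the service is reported as
# started on the given node. The node name must match what clustat
# prints. Returns 1 if that does not happen within the allowed tries.
wait_for_service() {
    svc="$1"
    node="$2"
    tries="${3:-60}"
    while [ "$tries" -gt 0 ]; do
        if [ "$(clustat | service_owner_state "$svc")" = "$node started" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 2
    done
    return 1
}
```

Run wait_for_service on the surviving node, passing vm:f13_builder_01 and that node's name exactly as clustat lists it, before killing corosync on the other node.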

Any questions, feedback, advice, complaints or meanderings are welcome.
© 2025 Alteeve. Intelligent Availability® is a registered trademark of Alteeve's Niche! Inc. 1997-2025
`legal stuff: All info is provided "As-Is". Do not use anything here unless you are willing and able to take responsibility for your own actions.`
