Red Hat Cluster Service 2 Tutorial - Archive

From Alteeve Wiki
Revision as of 22:58, 1 May 2011


This paper has one goal:

  • Creating a 2-node, high-availability cluster hosting Xen virtual machines using RHCS "stable 2" with DRBD and clustered LVM for synchronizing storage data.

We'll create a dedicated firewall VM to isolate and protect the VM network, discuss provisioning and maintaining Xen VMs, explore some basics of daily administration of a VM cluster and test various failures and how to recover from them.

Grab a coffee, a comfy chair, put on some nice music and settle in for some geekly fun.

The Task Ahead

Before we start, let's take a few minutes to discuss clustering and its complexities.

Technologies We Will Use

  • Enterprise Linux 5; specifically we will be using CentOS v5.6.
  • Red Hat Cluster Services "Stable" version 2. This describes the following core components:
    • OpenAIS; Provides cluster communications using the totem protocol.
    • Cluster Manager (cman); Manages the starting and stopping of the cluster and its membership.
    • Resource Manager (rgmanager); Manages cluster resources and services. Handles service recovery during failures.
    • Cluster Logical Volume Manager (clvm); Cluster-aware (disk) volume manager. Backs GFS2 filesystems and Xen virtual machines.
    • Global File Systems version 2 (gfs2); Cluster-aware, concurrently mountable file system.
  • Distributed Replicated Block Device (DRBD); Keeps shared data synchronized across cluster nodes.
  • Xen; Hypervisor that controls and supports virtual machines.

A Note on Patience

There is nothing inherently hard about clustering. However, there are many components that you need to understand before you can begin. The result is that clustering has an inherently steep learning curve.

You must have patience. Lots of it.

Many technologies can be learned by creating a very simple base and then building on it. The classic "Hello, World!" script created when first learning a programming language is an example of this. Unfortunately, there is no real analog to this in clustering. Even the most basic cluster requires several pieces be in place and working together. If you try to rush by ignoring pieces you think are not important, you will almost certainly waste time. A good example is setting aside fencing, thinking that your test cluster's data isn't important. The cluster software has no concept of "test". It treats everything as critical all the time and will shut down if anything goes wrong.

Take your time, work through these steps, and you will have the foundation cluster sooner than you realize. Clustering is fun because it is a challenge.

Prerequisites

It is assumed that you are familiar with Linux systems administration, specifically Red Hat Enterprise Linux and its derivatives. You will need to have somewhat advanced networking experience as well. You should be comfortable working in a terminal (directly or over ssh). Familiarity with XML will help, but is not strictly required, as its use here is fairly self-evident.

If you feel a little out of depth at times, don't hesitate to set this tutorial aside. Branch over to the components you feel the need to study more, then return and continue on. Finally, and perhaps most importantly, you must have patience! If you have a manager asking you to "go live" with a cluster in a month, tell him or her that it simply won't happen. If you rush, you will skip important points and you will fail. Patience is vastly more important than any pre-existing skill.

Focus and Goal

There is a different cluster for every problem. Generally speaking though, there are two main problems that clusters try to resolve: performance and high availability. Performance clusters are generally tailored to the application requiring the performance increase. There are some general tools for performance clustering, like Red Hat's LVS (Linux Virtual Server) for load-balancing common applications like the Apache web server.

This tutorial will focus on High Availability clustering, often shortened to simply HA and not to be confused with the HA Linux "heartbeat" cluster suite, which we will not be using here. The cluster will provide shared file systems and high availability for Xen-based virtual servers. The goal will be to have the virtual servers live-migrate during planned node outages and automatically restart on a surviving node when the original host node fails.

A very brief overview:

High Availability clusters like ours have two main parts: cluster management and resource management.

The cluster itself is responsible for maintaining the cluster nodes in a group. This group is part of a "Closed Process Group", or CPG. When a node fails, the cluster manager must detect the failure, reliably eject the node from the cluster using fencing and then reform the CPG. Each time the cluster changes, or "re-forms", the resource manager is called. The resource manager checks to see how the cluster changed, consults its configuration and determines what to do, if anything.

The details of all this will be discussed in detail a little later on. For now, it's sufficient to have in mind these two major roles and understand that they are somewhat independent entities.

Platform

This tutorial was written using CentOS version 5.6, x86_64. No attempt was made to test on i686 or other EL5 derivatives. That said, there is no reason to believe that this tutorial will not apply to any variant. As much as possible, the language will be distro-agnostic. For reasons of memory constraints, it is advised that you use an x86_64 (64-bit) platform if at all possible.

Do note that as of EL5.4 and above, significant changes were made to how RHCS is supported. It is strongly advised that you use at least version 5.4 or newer while working with this tutorial.

A Word On Complexity

Clustering is not inherently hard, but it is inherently complex. Consider:

  • Any given program has N bugs.
    • RHCS uses cman, openais, totem, fenced, rgmanager, dlm and GFS2.
    • We will be adding DRBD, CLVM and Xen.
    • Right there, we have N^10 possible bugs. We'll call this A.
  • A cluster has Y nodes.
    • In our case, 2 nodes, each with 3 networks.
    • The network infrastructure (Switches, routers, etc). If you use managed switches, add another layer of complexity.
    • This gives us another Y^(2*3), and then ^2 again for managed switches. We'll call this B.
  • Let's add the human factor. Let's say that a person needs roughly 5 years of cluster experience to be considered an expert. For each year less than this, add a Z "oops" factor, (5-Z)^2. We'll call this C.
  • So, finally, add up the complexity, using this tutorial's layout, 0-years of experience and managed switches.
    • (N^10) * (Y^(2*3)^2) * ((5-0)^2) == (A * B * C) == an-unknown-but-big-number.

This isn't meant to scare you away, but it is meant to be a sobering statement. Obviously, those numbers are somewhat artificial, but the point remains.

Any one piece is easy to understand, thus, clustering is inherently easy. However, given the large number of variables, you must really understand all the pieces and how they work together. DO NOT think that you will have this mastered and working in a month. Certainly don't try to sell clusters as a service without a lot of internal testing.

Clustering is kind of like chess. The rules are pretty straight forward, but the complexity can take some time to master.

Overview of Components

When looking at a cluster, there is a tendency to want to dive right into the configuration file. That is not very useful in clustering.

  • When you look at the configuration file, it is quite short.

It isn't like most applications or technologies though. Most of us learn by taking something, like a configuration file, and tweaking it this way and that to see what happens. I tried that with clustering and learned only what it was like to bang my head against the wall.

  • Understanding the parts and how they work together is critical.

You will find that the discussion on the components of clustering, and how those components and concepts interact, will be much longer than the initial configuration. It is true that we could talk very briefly about the actual syntax, but it would be a disservice. Please, don't rush through the next section or, worse, skip it and go right to the configuration. You will waste far more time than you will save.

  • Clustering is easy, but it has a complex web of inter-connectivity. You must grasp this network if you want to be an effective cluster administrator!

Component; cman

This was, traditionally, the cluster manager. In the 3.0 series, it acts mainly as a service manager, handling the starting and stopping of clustered services. In the 3.1 series, cman will be removed entirely.

Component; openais / corosync

OpenAIS is the heart of the cluster. All other components operate through it, and no cluster component can work without it. Further, it is shared between both Pacemaker and RHCS clusters.

In Red Hat clusters, openais is configured via the central cluster.conf file. In Pacemaker clusters, it is configured directly in openais.conf. As we will be building an RHCS cluster, we will only use cluster.conf. That said, (almost?) all openais.conf options are available in cluster.conf. This is important to note as you will see references to both configuration files when searching the Internet.
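For orientation, here is a minimal sketch of what a cluster.conf skeleton looks like. The cluster name and node names are simply this tutorial's examples, and the file will be built up properly, step by step, later on:

```xml
<?xml version="1.0"?>
<cluster name="an-cluster" config_version="1">
        <!-- two_node relaxes quorum for two-node clusters; see the quorum
             discussion below. -->
        <cman two_node="1" expected_votes="1"/>
        <clusternodes>
                <clusternode name="an-node04" nodeid="1"/>
                <clusternode name="an-node05" nodeid="2"/>
        </clusternodes>
        <fencedevices/>
        <rm/>
</cluster>
```

Note how short it is; this is exactly why understanding the components matters more than the syntax.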

A Little History

There were significant changes between RHCS version 2, which we are using, and version 3 available on EL6 and recent Fedoras.

In RHCS version 2, there was a component called openais which handled totem. The OpenAIS project was designed to be the heart of the cluster and was based around the Service Availability Forum's Application Interface Specification. AIS is an open API designed to provide inter-operable high availability services.

In 2008, it was decided that the AIS specification was overkill for RHCS clustering and a duplication of effort in the existing and easier to maintain corosync project. OpenAIS was then split off as a separate project specifically designed to act as an optional add-on to corosync for users who wanted AIS functionality.

You will see a lot of references to OpenAIS while searching the web for information on clustering. Understanding its evolution will hopefully help you avoid confusion.

Concept; quorum

Quorum is defined as a collection of machines and devices in a cluster with a simple majority (50% + 1) of votes.

The idea behind quorum is that whichever group of machines has it can safely start clustered services, even when other defined members are not accessible.

Take this scenario;

  • You have a cluster of four nodes, each with one vote.
    • The cluster's expected_votes is 4. A clear majority, in this case, is 3, because (4/2)+1 = 3.
    • Now imagine that there is a failure in the network equipment and one of the nodes disconnects from the rest of the cluster.
    • You now have two partitions; One partition contains three machines and the other partition has one.
    • The three machines will have quorum, and the other machine will lose quorum.
    • The partition with quorum will reconfigure and continue to provide cluster services.
    • The partition without quorum will withdraw from the cluster and shut down all cluster services.

This behaviour acts as a guarantee that the two partitions will never try to access the same clustered resources, like a shared filesystem, thus guaranteeing the safety of those shared resources.

This also helps explain why an even 50% is not enough to have quorum, a common question for people new to clustering. Using the above scenario, imagine if the split were 2 and 2 nodes. Because either can't be sure what the other would do, neither can safely proceed. If we allowed an even 50% to have quorum, both partitions might try to take over the clustered services and disaster would soon follow.
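The arithmetic above can be sketched in a couple of lines of shell. This is an illustration only, not a cluster tool; expected_votes=4 matches the four-node scenario:

```shell
#!/bin/sh
# Quorum requires a simple majority: floor(expected_votes / 2) + 1.
expected_votes=4
needed=$(( expected_votes / 2 + 1 ))
echo "votes needed for quorum: ${needed}"
# Check each possible partition size against that threshold.
for partition_votes in 1 2 3 4; do
    if [ "${partition_votes}" -ge "${needed}" ]; then
        echo "partition with ${partition_votes} vote(s): quorate"
    else
        echo "partition with ${partition_votes} vote(s): inquorate"
    fi
done
```

Note that a 2-vote partition falls below the threshold of 3, which is exactly the 50/50 problem described above.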

There is one, and only one, exception to this rule.

In the case of a two node cluster, as we will be building here, any failure results in a 50/50 split. If we enforced quorum in a two-node cluster, there would never be high availability because any failure would cause both nodes to withdraw. The risk with this exception is that we now place the entire safety of the cluster on fencing, a concept we will cover shortly. Fencing is a second line of defense and something we are loath to rely on alone.

Even in a two-node cluster though, proper quorum can be maintained by using a quorum disk, called a qdisk. Unfortunately, qdisk on a DRBD resource comes with its own problems, so we will not be able to use it here.

Concept; Virtual Synchrony

All cluster operations, like fencing, distributed locking and so on, have to occur in the same order across all nodes. This concept is called "virtual synchrony".

This is provided by openais using "closed process groups", CPG. A closed process group is simply a private group of nodes in a cluster. Within this closed group, all messages are ordered and consistent.

Let's look at how locks are handled on clustered file systems as an example.

  • As various nodes want to work on files, they send a lock request to the cluster. When they are done, they send a lock release to the cluster.
    • Lock and unlock messages must arrive in the same order to all nodes, regardless of the real chronological order that they were issued.
  • Let's say one node sends out messages "a1 a2 a3 a4". Meanwhile, the other node sends out "b1 b2 b3 b4".
    • All of these messages go to openais which gathers them up, puts them into an order and then sends them out in that order.
    • It is totally possible that openais will get the messages as "a2 b1 b2 a1 b3 a3 a4 b4". Which order is used is not important, only that the order is consistent across all nodes.
    • The openais application will then ensure that all nodes get the messages in the above order, one at a time. All nodes must confirm that they got a given message before the next message is sent to any node.

All of this ordering, within the closed process group, is "virtual synchrony".
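As a toy sketch of the idea (the node and message names are just the a/b examples above; the real ordering is done inside openais, not in shell):

```shell
#!/bin/sh
# One total order is chosen for the merged message stream; every node then
# delivers the messages in exactly that same order.
total_order="a2 b1 b2 a1 b3 a3 a4 b4"
for node in an-node04 an-node05; do
    for msg in ${total_order}; do
        echo "${node} delivers ${msg}"
    done
done
```

Both nodes print the identical eight-message sequence; that shared, agreed order is the whole point.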

This will tie into fencing and totem, as we'll see in the next sections.

Concept; Fencing

Fencing is an absolutely critical part of clustering. Without fully working fence devices, your cluster will fail.

Was that strong enough, or should I say that again? Let's be safe:

DO NOT BUILD A CLUSTER WITHOUT PROPER, WORKING AND TESTED FENCING.

Sorry, I promise that this will be the only time that I speak so strongly. Fencing really is critical, and explaining the need for fencing is nearly a weekly event.

So then, let's discuss fencing.

When a node stops responding, an internal timeout and counter start ticking away. During this time, no messages are moving through the cluster because virtual synchrony is no longer possible and the cluster is, essentially, hung. If the node responds in time, the timeout and counter reset and the cluster begins operating properly again.

If, on the other hand, the node does not respond in time, the node will be declared dead and the process of ejecting it from the cluster begins.

The cluster will take a "head count" to see which nodes it still has contact with and will then determine if there are enough votes from those nodes to have quorum. If you are using qdisk, its heuristics will run and then its votes will be added. If there are sufficient votes for quorum, the cluster will issue a "fence" against the lost node. A fence action is a call sent to fenced, the fence daemon.

Which physical node sends the fence call is somewhat random and irrelevant. What matters is that the call comes from the CPG which has quorum.

The fence daemon will look at the cluster configuration and get the fence devices configured for the dead node. Then, one at a time and in the order that they appear in the configuration, the fence daemon will call those fence devices, via their fence agents, passing to the fence agent any configured arguments like username, password, port number and so on. If the first fence agent returns a failure, the next fence agent will be called. If the second fails, the third will be called, then the fourth and so on. Once the last (or perhaps only) fence device fails, the fence daemon will retry, starting back at the top of the list. It will do this indefinitely until one of the fence devices succeeds.

Here's the flow, in point form:

  • The openais program collects messages and sends them off, one at a time, to all nodes.
  • All nodes respond, and the next message is sent. Repeat continuously during normal operation.
  • Suddenly, one node stops responding.
    • Communication freezes while the cluster waits for the silent node.
    • A timeout starts (300ms by default), and each time the timeout is hit, an error counter increments.
    • The silent node responds before the counter reaches the limit.
      • The counter is reset to 0
      • The cluster operates normally again.
  • Again, one node stops responding.
    • Again, the timeout begins and the error count increments each time the timeout is reached.
    • The error count exceeds the limit (10 is the default); three seconds have passed (300ms * 10).
    • The node is declared dead.
    • The cluster checks which members it still has, and if that provides enough votes for quorum.
      • If there are too few votes for quorum, the cluster software freezes and the node(s) withdraw from the cluster.
      • If there are enough votes for quorum, the silent node is declared dead.
        • openais calls fenced, telling it to fence the node.
        • Which fence device(s) to use, that is, what fence_agent to call and what arguments to pass, is gathered.
        • For each configured fence device:
          • The agent is called and fenced waits for the fence_agent to exit.
          • The fence_agent's exit code is examined. If it's a success, recovery starts. If it failed, the next configured fence agent is called.
        • If all (or the only) configured fence devices fail, fenced will start over.
        • fenced will wait and loop forever until a fence agent succeeds. During this time, the cluster is hung.
    • Once a fence_agent succeeds, the cluster is reconfigured.
      • A new closed process group (cpg) is formed.
      • A new fence domain is formed.
      • Lost cluster resources are recovered as per rgmanager's configuration (including file system recovery as needed).
      • Normal cluster operation is restored.

This skipped a few key things, but the general flow of logic should be there.

This is why fencing is so important. Without a properly configured and tested fence device or devices, the cluster will never successfully fence and the cluster will stay hung forever.
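To make that concrete, here is a hedged sketch of how a fence device is wired into cluster.conf. The device name, address and credentials below are invented placeholders; fence_ipmilan is one real fence agent among many, and multiple <method> blocks would be tried in the order listed, exactly as described above:

```xml
<clusternode name="an-node04" nodeid="1">
        <fence>
                <method name="ipmi">
                        <device name="ipmi_an04"/>
                </method>
        </fence>
</clusternode>
<fencedevices>
        <fencedevice name="ipmi_an04" agent="fence_ipmilan"
                     ipaddr="192.168.3.61" login="admin" passwd="secret"/>
</fencedevices>
```

Always test each fence device by hand before trusting the cluster to it.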

Component; totem

The totem protocol defines message passing within the cluster and it is used by openais. A token is passed around all the nodes in the cluster, and the timeout discussed in fencing above is actually a token timeout. The counter, then, is the number of lost tokens that are allowed before a node is considered dead.

The totem protocol supports something called 'rrp', Redundant Ring Protocol. Through rrp, you can add a second backup ring on a separate network to take over in the event of a failure in the first ring. In RHCS, these rings are known as "ring 0" and "ring 1".
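In cluster.conf, these tunables live on a totem element. A sketch only; the attribute names are real totem options, but the values here simply mirror the timeout-and-counter discussion above, and the defaults are normally fine:

```xml
<totem token="3000" token_retransmits_before_loss_const="10" rrp_mode="passive"/>
```

Here token is the token loss timeout in milliseconds, and rrp_mode would only be set if a second ring is actually configured.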

Component; rgmanager

When the cluster configuration changes, openais tells the cluster that it needs to recheck its resources. This causes rgmanager, the resource group manager, to run. It will examine what changed and then will start, stop, migrate or recover cluster resources as needed.

Within rgmanager, one or more resources are brought together as a service. This service is then optionally assigned to a failover domain, a subset of nodes that can have preferential ordering.
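A hedged sketch of what that looks like in the rm section of cluster.conf; the domain, service and script names here are invented purely for illustration:

```xml
<rm>
        <failoverdomains>
                <!-- ordered="1": prefer nodes by priority; restricted="0":
                     allow the service to run outside the domain if needed. -->
                <failoverdomain name="an4_primary" ordered="1" restricted="0">
                        <failoverdomainnode name="an-node04" priority="1"/>
                        <failoverdomainnode name="an-node05" priority="2"/>
                </failoverdomain>
        </failoverdomains>
        <service name="example_svc" domain="an4_primary" autostart="1">
                <script name="httpd" file="/etc/init.d/httpd"/>
        </service>
</rm>
```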

Component; qdisk

Note: qdisk does not work reliably on a DRBD resource, so we will not be using it in this tutorial.

A quorum disk, known as a qdisk, is a small partition on SAN storage used to enhance quorum. It generally carries enough votes to allow even a single node to take quorum during a cluster partition. It does this by using configured heuristics, that is, custom tests, to decide which node or partition is best suited for providing clustered services during a cluster reconfiguration. These heuristics can be simple, like testing which partition has access to a given router, or they can be as complex as the administrator wishes using custom scripts.

Though we won't be using it here, it is well worth knowing about when you move to a cluster with SAN storage.

Component; DRBD

DRBD, the Distributed Replicated Block Device, is a technology that takes raw storage from two or more nodes and keeps their data synchronized in real time. It is sometimes described as "RAID 1 over nodes", and that is conceptually accurate. In this tutorial's cluster, DRBD will be used to provide the back-end storage as a cost-effective alternative to a traditional SAN or iSCSI device.

To help visualize DRBD's use and role, take a look at how we will implement our cluster's storage. Don't worry if this seems illogical at this stage. The main things to look at are the drbdX devices and how they each tie back to a corresponding sdaY device on either node.

         [ an-node04 ]
  ______   ______    ______     __[sda4]__
 | sda1 | | sda2 |  | sda3 |   |  ______  |       _______    ______________    ______________________________
 |______| |______|  |______|   | | sda5 |-+------| drbd0 |--| drbd_sh0_vg0 |--| /dev/drbd_sh0_vg0/xen_shared |
     |        |         |      | |______| |   /--|_______|  |______________|  |______________________________|
  ___|___    _|_    ____|____  |  ______  |   |     _______    ______________    ____________________________
 | /boot |  | / |  | <swap>  | | | sda6 |-+---+----| drbd1 |--| drbd_an4_vg0 |--| /dev/drbd_an4_vg0/vm0001_1 |
 |_______|  |___|  |_________| | |______| |   | /--|_______|  |______________|  |____________________________|
                               |  ______  |   | |     _______    ______________    ____________________________
                                | | sda7 |-+---+-+----| drbd2 |--| drbd_an5_vg0 |--| /dev/drbd_an5_vg0/vm0002_1 | 
                               | |______| |   | | /--|_______|  |______________|  |____________________________|
                               |  ______  |   | | |                         | |    _______________________
                               | | sda8 |-+---+-+-+--\                      | \---| Example LV for 2nd VM |
                               | |______| |   | | |  |                      |     |_______________________|
                               |__________|   | | |  |                      |      _______________________
         [ an-node05 ]                        | | |  |                      \-----| Example LV for 3rd VM |
  ______   ______    ______     __[sda4]__    | | |  |                            |_______________________|
 | sda1 | | sda2 |  | sda3 |   |  ______  |   | | |  |                   
 |______| |______|  |______|   | | sda5 |-+---/ | |  |   _______    __________________
     |        |         |      | |______| |     | |  \--| drbd3 |--| Spare PV for     |
  ___|___    _|_    ____|____  |  ______  |     | |  /--|_______|  | future expansion |
 | /boot |  | / |  | <swap>  | | | sda6 |-+-----/ |  |             |__________________|
 |_______|  |___|  |_________| | |______| |       |  |
                               |  ______  |       |  |
                               | | sda7 |-+-------/  |
                               | |______| |          |
                               |  ______  |          |
                               | | sda8 |-+----------/
                               | |______| |
                               |__________|

Component; CLVM

With DRBD providing the raw storage for the cluster, we must next consider partitions. This is where Clustered LVM, known as CLVM, comes into play.

CLVM is ideal in that, by using DLM, the distributed lock manager, it won't allow access to clustered volumes from nodes outside of openais's closed process group, which, in turn, requires quorum.

It is also ideal because it can take one or more raw devices, known as "physical volumes", or simply PVs, and combine their raw space into one or more "volume groups", known as VGs. These volume groups then act just like a typical hard drive and can be "partitioned" into one or more "logical volumes", known as LVs. These LVs are where Xen's domU virtual machines will exist and where we will create our GFS2 clustered file system.

LVM is particularly attractive because of how incredibly flexible it is. We can easily add new physical volumes later, and then grow an existing volume group to use the new space. This new space can then be given to existing logical volumes, or entirely new logical volumes can be created. This can all be done while the cluster is online offering an upgrade path with no down time.
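As a sketch of that workflow, echoed rather than executed, since these commands would format real devices. The device and VG names follow the diagram above, and -c y marks the volume group as clustered:

```shell
#!/bin/sh
# Print, don't run: the basic CLVM workflow from raw DRBD device to LV.
echo "pvcreate /dev/drbd0"                            # turn the DRBD device into a PV
echo "vgcreate -c y drbd_sh0_vg0 /dev/drbd0"          # create a clustered volume group
echo "lvcreate -L 20G -n xen_shared drbd_sh0_vg0"     # carve out a logical volume
echo "lvextend -L +10G /dev/drbd_sh0_vg0/xen_shared"  # grow it later, online
```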

Component; GFS2

With DRBD providing the cluster's raw storage space, and Clustered LVM providing the logical partitions, we can now look at the clustered file system. This is the role of the Global File System version 2, known simply as GFS2.

It works much like a standard filesystem, with user-land tools like mkfs.gfs2, fsck.gfs2 and so on. The major difference is that it and clvmd use the cluster's distributed locking mechanism provided by the dlm_controld daemon. Once formatted, the GFS2-formatted partition can be mounted and used by any node in the cluster's closed process group. All nodes can then safely read from and write to the data on the partition simultaneously.
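A sketch of the formatting step, echoed rather than executed, since it would destroy data. The cluster name an-cluster is an assumed example; -t must be <clustername>:<fsname>, -p lock_dlm selects cluster-wide locking, and -j 2 creates one journal per node:

```shell
#!/bin/sh
# Print, don't run: formatting a clustered LV with GFS2.
echo "mkfs.gfs2 -p lock_dlm -t an-cluster:xen_shared -j 2 /dev/drbd_sh0_vg0/xen_shared"
```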

Component; DLM

One of the major roles of a cluster is to provide distributed locking on clustered storage. In fact, storage software can not be clustered without using DLM, as provided by the dlm_controld daemon and using openais's virtual synchrony via CPG.

Through DLM, all nodes accessing clustered storage are guaranteed to get POSIX locks, called plocks, in the same order across all nodes. Both CLVM and GFS2 rely on DLM, though other clustered storage, like OCFS2, use it as well.

Component; Xen

Two of the most popular open-source virtualization platforms available in the Linux world today are Xen and KVM. The former is maintained by Citrix and the latter by Red Hat. It would be difficult to say which is "better", as they're both very good. Xen can be argued to be more mature, where KVM is the "official" solution supported by Red Hat in EL6.

We will be using the Xen hypervisor and a "host" virtual server called dom0. In Xen, every machine is a virtual server, including the system you installed when you built the server. This is possible thanks to a small Xen micro-operating system that initially boots, then starts up your original installed operating system as a virtual server with special access to the underlying hardware and hypervisor management tools.

The rest of the virtual servers in a Xen environment are collectively called "domU" virtual servers. These will be the highly-available resource that will migrate between nodes during failure events in our cluster.
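Day to day, domUs are managed with Xen's xm tool. A sketch, echoed rather than executed; vm0001 is a hypothetical domU name and an-node05 the peer node:

```shell
#!/bin/sh
# Print, don't run: common Xen administration commands.
echo "xm list"                             # show dom0 and all running domUs
echo "xm migrate --live vm0001 an-node05"  # live-migrate a domU to the peer node
```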

Base Setup

Before we can look at the cluster, we must first build two cluster nodes and then install the operating system.

Hardware Requirements

The bare minimum requirements are;

  • All hardware must be supported by EL5. It is strongly recommended that you check compatibility before making any purchases.
  • A dual-core CPU with hardware virtualization support.
  • Three network cards; at least one should be gigabit or faster.
  • One hard drive.
  • 2 GiB of RAM
  • A fence device. This can be an IPMI-enabled server, a Node Assassin, a switched PDU or similar.

This tutorial was written using the following hardware:

This is not an endorsement of the above hardware. I put a heavy emphasis on minimizing power consumption and bought what was within my budget. This hardware was never meant to be put into production, but instead was chosen to serve the purpose of my own study and for creating this tutorial. What you ultimately choose to use, provided it meets the minimum requirements, is entirely up to you and your requirements.

Note: I use three physical NICs, but you can get away with two by merging the storage and back-channel networks, which we will discuss shortly. If you are really in a pinch, you could create three aliases on one interface and isolate them using VLANs. If you go this route, please ensure that your VLANs are configured and working before beginning this tutorial. Pay close attention to multicast traffic.

Pre-Assembly

Before you assemble your nodes, take a moment to record the MAC addresses of each network interface and then note where each interface is physically installed. This will help you later when configuring the networks. I generally create a simple text file with the MAC addresses, the interface I intend to assign to it and where it physically is located.

-=] an-node04
48:5B:39:3C:53:15   # eth0 - onboard interface
00:1B:21:72:9B:5A   # eth1 - right-most PCIe interface
00:1B:21:72:96:EA   # eth2 - left-most PCIe interface

-=] an-node05
48:5B:39:3C:53:13   # eth0 - onboard interface
00:1B:21:72:99:AB   # eth1 - right-most PCIe interface
00:1B:21:72:96:A6   # eth2 - left-most PCIe interface

OS Install

Later steps will include packages to install, so the initial OS install can be minimal. I like to change the default run-level to 3, remove rhgb quiet from the grub menu, disable the firewall and disable SELinux. In a production cluster, you will want to use the firewall and SELinux, but until you finish studying, leave them off to keep things simple.

Note: Before EL5.4, you could not use SELinux. It is now possible to use it, and it is recommended that you do so in any production cluster.
Note: Ports and protocols to open in a firewall will be discussed later in the networking section.

I like to minimize and automate my installs as much as possible. To that end, I run a little PXE server on my network and use a kickstart script to automate the install. Here is a simple one for use on a single-drive node:
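As a rough sketch of the sort of minimal, single-drive kickstart meant here (every value below is a generic placeholder to adapt, not a file to use as-is):

```
install
text
reboot
lang en_US.UTF-8
keyboard us
timezone America/Toronto
rootpw changeme
firewall --disabled
selinux --disabled
bootloader --location=mbr
clearpart --all --initlabel
part /boot --fstype ext3 --size=250
part swap --size=2048
part / --fstype ext3 --size=1 --grow
%packages
@base
```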

If you decide to manually install EL5 on your nodes, please try to keep the installation as small as possible. The fewer packages installed, the fewer sources of problems and vectors for attack.

Post Install OS Changes

This section discusses changes I recommend, but are not required. If you wish to adapt any of the steps below, please do so, but be sure to keep the changes consistent throughout the implementation of this tutorial.

Network Planning

The most important change that is recommended is to get your nodes into a consistent networking configuration. This will prove very handy when trying to keep track of your networks and where they're physically connected. This becomes exponentially more helpful as your cluster grows.

The first step is to understand the three networks we will be creating. Once you understand their role, you will need to decide which interface on the nodes will be used for each network.

Cluster Networks

The three networks are:

Network                   Acronym   Use
Back-Channel Network      BCN       Private cluster communications, virtual machine migrations and fence devices.
Storage Network           SN        Used exclusively for storage communications. Possible to use as totem's redundant ring.
Internet-Facing Network   IFN       Internet-polluted network. No cluster, storage or fence device communication.

Things To Consider

When planning which interfaces to connect to each network, consider the following, in order of importance:

  • If your nodes have IPMI and an interface sharing a physical RJ-45 connector, this must be on the Back-Channel Network. The reasoning is that having your fence device accessible on the Internet-Facing Network poses a major security risk. Having the IPMI interface on the Storage Network can cause problems if a fence is fired and the network is saturated with storage traffic.
  • The lowest-latency network interface should be used as the Back-Channel Network. The cluster is maintained by multicast messaging between the nodes using something called the totem protocol. Any delay in the delivery of these messages risks causing a failure and the ejection of affected nodes when no actual failure existed. This will be discussed in greater detail later.
  • The network with the most raw bandwidth should be used for the Storage Network. All disk writes must be sent across the network and committed to the remote nodes before the write is declared complete. This makes the network the disk I/O bottleneck. Using a network with jumbo frames and high raw throughput will help minimize this bottleneck.
  • During the live migration of virtual machines, the VM's RAM is copied to the other node using the BCN. For this reason, the second fastest network should be used for back-channel communication. However, these copies can saturate the network, so care must be taken to ensure that cluster communications get higher priority. This can be done using a managed switch. If you can not ensure priority for totem multicast, then be sure to configure Xen later to use the storage network for migrations.
  • The remaining, slowest interface should be used for the IFN.

Planning the Networks

This paper will use the following setup. Feel free to alter the interface-to-network mapping and the IP subnets used to best suit your needs. For reasons completely my own, I like to start my cluster IPs' final octet at 71 for node 1 and then increment up from there. This is entirely arbitrary, so please use whatever makes sense to you. The remainder of this tutorial will follow the convention below:

Network  Interface  Subnet
IFN      eth0       192.168.1.0/24
SN       eth1       192.168.2.0/24
BCN      eth2       192.168.3.0/24

This translates to the following per-node configuration:

an-node04:
Network  Interface  IP Address    Host Name(s)
IFN      eth0       192.168.1.74  an-node04.ifn
SN       eth1       192.168.2.74  an-node04.sn
BCN      eth2       192.168.3.74  an-node04, an-node04.alteeve.com, an-node04.bcn

an-node05:
Network  Interface  IP Address    Host Name(s)
IFN      eth0       192.168.1.75  an-node05.ifn
SN       eth1       192.168.2.75  an-node05.sn
BCN      eth2       192.168.3.75  an-node05, an-node05.alteeve.com, an-node05.bcn
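The convention above is mechanical enough to generate with a short loop. A sketch of the scheme (node N's final octet is 70 + N, reused on every subnet; the /tmp output path is just for illustration):

```shell
# Generate the address plan for nodes 4 and 5; each node's final
# octet is 70 plus its node number, reused across all three subnets.
for n in 4 5; do
    o=$((70 + n))
    printf 'an-node0%s: IFN 192.168.1.%s  SN 192.168.2.%s  BCN 192.168.3.%s\n' \
        "$n" "$o" "$o" "$o"
done > /tmp/net-plan.txt
cat /tmp/net-plan.txt
# → an-node04: IFN 192.168.1.74  SN 192.168.2.74  BCN 192.168.3.74
# → an-node05: IFN 192.168.1.75  SN 192.168.2.75  BCN 192.168.3.75
```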

Network Configuration

Now that we've planned the network, it is time to implement it.

Warning About Managed Switches

Warning: The vast majority of cluster problems end up being network related. The hardest ones to diagnose are usually multicast issues.

If you use a managed switch, be careful about enabling and configuring Multicast IGMP Snooping or Spanning Tree Protocol. They have been known to cause problems by not allowing multicast packets to reach all nodes fast enough or at all. This can cause somewhat random break-downs in communication between your nodes, leading to seemingly random fences and DLM lock timeouts. If your switches support PIM Routing, be sure to use it!

If you have problems with your cluster not forming, or seemingly random fencing, try using a cheap unmanaged switch. If the problem goes away, you are most likely dealing with a managed switch configuration problem.

Disable Firewalling

To "keep things simple", we will disable all firewalling on the cluster nodes. This is not recommended in production environments, obviously, so below will be a table of ports and protocols to open when you do get into production. Until then, we will simply use chkconfig to disable iptables and ip6tables.

Note: Cluster 2 does not support IPv6, so you can skip the ip6tables commands if you wish. I like to disable it anyway, just to be certain that it can't cause issues.
chkconfig iptables off
chkconfig ip6tables off
/etc/init.d/iptables stop
/etc/init.d/ip6tables stop

Now confirm that they are off by having iptables and ip6tables list their rules.

iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
ip6tables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

When you do prepare to go into production, these are the protocols and ports you need to open between cluster nodes. Remember to allow multicast communications as well!

Port                 Protocol  Component
5404, 5405           UDP       cman
8084                 TCP       luci
11111                TCP       ricci
14567                TCP       gnbd
16851                TCP       modclusterd
21064                TCP       dlm
50006, 50008, 50009  TCP       ccsd
50007                UDP       ccsd
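When you do re-enable the firewall, the table above translates into iptables rules along these lines. This is a hedged sketch, not from the original tutorial: the rules are written to a staging file so you can review them first, and interface or source restrictions (and the multicast address, which depends on your cluster configuration) are left for you to adapt.

```shell
# Stage the rules in a file for review; apply later with 'sh /tmp/cluster-iptables.sh'.
cat > /tmp/cluster-iptables.sh <<'EOF'
# cman (totem)
iptables -A INPUT -p udp -m udp --dport 5404:5405 -j ACCEPT
# luci web interface
iptables -A INPUT -p tcp -m tcp --dport 8084 -j ACCEPT
# ricci
iptables -A INPUT -p tcp -m tcp --dport 11111 -j ACCEPT
# gnbd
iptables -A INPUT -p tcp -m tcp --dport 14567 -j ACCEPT
# modclusterd
iptables -A INPUT -p tcp -m tcp --dport 16851 -j ACCEPT
# dlm
iptables -A INPUT -p tcp -m tcp --dport 21064 -j ACCEPT
# ccsd
iptables -A INPUT -p tcp -m tcp --dport 50006 -j ACCEPT
iptables -A INPUT -p tcp -m tcp --dport 50008:50009 -j ACCEPT
iptables -A INPUT -p udp -m udp --dport 50007 -j ACCEPT
# Remember multicast as well; the address depends on your cluster, e.g.:
# iptables -A INPUT -d 239.192.0.0/16 -j ACCEPT
EOF
```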

Disable NetworkManager, Enable network

The NetworkManager daemon is an excellent daemon in environments where a system connects to a variety of networks. The NetworkManager daemon handles changing the networking configuration whenever it senses a change in the network state, like when a cable is unplugged or a wireless network comes or goes. As useful as this is on laptops and workstations, it can be detrimental in a cluster.

To prevent the networking from changing once we've got it setup, we want to replace NetworkManager daemon with the network initialization script. The network script will start and stop networking, but otherwise it will leave the configuration alone. This is ideal in servers, and doubly-so in clusters given their sensitivity to transient network issues.

Start by removing NetworkManager:

yum remove NetworkManager NetworkManager-glib NetworkManager-gnome NetworkManager-devel NetworkManager-glib-devel

Now you want to ensure that network starts with the system.

chkconfig network on

Setup /etc/hosts

The /etc/hosts file, by default, resolves the hostname to the lo (127.0.0.1) interface. However, the cluster uses this name to determine which interface to use for the totem protocol (and thus all cluster communications). To this end, we will remove the hostname from 127.0.0.1 and instead put it on the IP of our BCN-connected interface. At the same time, we will add entries for all networks for each node in the cluster, plus entries for the fence devices. Once done, the edited /etc/hosts file should be suitable for copying to all nodes in the cluster.

vim /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1	localhost.localdomain localhost
::1		localhost6.localdomain6 localhost6

192.168.1.74	an-node04.ifn
192.168.2.74	an-node04.sn
192.168.3.74	an-node04 an-node04.bcn an-node04.alteeve.com

192.168.1.75	an-node05.ifn
192.168.2.75	an-node05.sn
192.168.3.75	an-node05 an-node05.bcn an-node05.alteeve.com

192.168.3.61	batou.alteeve.com	# Node Assassin
192.168.3.62	motoko.alteeve.com	# Switched PDU

Mapping Interfaces to ethX Names

Chances are good that the assignment of ethX interface names to your physical network cards is not ideal. There is no strict technical reason to change the mapping, but it will make your life a lot easier if all nodes use the same ethX names for the same subnets.

The actual process of changing the mapping is a little involved. For this reason, there is a dedicated mini-tutorial which you can find below. Please jump to it and then return once your mapping is as you like it.

Set IP Addresses

The last step in setting up the network interfaces is to manually assign the IP addresses and define the subnets for the interfaces. This involves directly editing the /etc/sysconfig/network-scripts/ifcfg-ethX files. There are a large set of options that can be set in these configuration files, but most are outside the scope of this tutorial. To get a better understanding of the available options, please see:

Here are my three configuration files, which you can use as guides. Please do not copy these over your files! Doing so will cause your interfaces to fail outright, as every interface's MAC address is unique. Adapt these to suit your needs.

vim /etc/sysconfig/network-scripts/ifcfg-eth0
# Internet-Facing Network
HWADDR=48:5B:39:3C:53:15
DEVICE=eth0
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.1.74
NETMASK=255.255.255.0
GATEWAY=192.168.1.254
vim /etc/sysconfig/network-scripts/ifcfg-eth1
# Storage Network
HWADDR=00:1B:21:72:9B:5A
DEVICE=eth1
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.2.74
NETMASK=255.255.255.0
vim /etc/sysconfig/network-scripts/ifcfg-eth2
# Back Channel Network
HWADDR=00:1B:21:72:96:EA
DEVICE=eth2
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.3.74
NETMASK=255.255.255.0

You will also need to set up the /etc/resolv.conf file for DNS resolution. You can learn more about this file's purpose by reading its man page; man resolv.conf. The main thing is to set valid DNS server IP addresses in the nameserver entries. Here is mine, for reference:

vim /etc/resolv.conf
search alteeve.com
nameserver 192.139.81.117
nameserver 192.139.81.1

Finally, restart network and you should have your interfaces set up properly.

/etc/init.d/network restart
Shutting down interface eth0:                              [  OK  ]
Shutting down interface eth1:                              [  OK  ]
Shutting down interface eth2:                              [  OK  ]
Shutting down loopback interface:                          [  OK  ]
Bringing up loopback interface:                            [  OK  ]
Bringing up interface eth0:                                [  OK  ]
Bringing up interface eth1:                                [  OK  ]
Bringing up interface eth2:                                [  OK  ]

You can verify your configuration using the ifconfig tool.

ifconfig
eth0      Link encap:Ethernet  HWaddr 48:5B:39:3C:53:15  
          inet addr:192.168.1.74  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::92e6:baff:fe71:82ea/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1727 errors:0 dropped:0 overruns:0 frame:0
          TX packets:655 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:208916 (204.0 KiB)  TX bytes:133171 (130.0 KiB)
          Interrupt:252 Base address:0x2000 

eth1      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:5A  
          inet addr:192.168.2.74  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::221:91ff:fe19:9653/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:998 errors:0 dropped:0 overruns:0 frame:0
          TX packets:47 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:97702 (95.4 KiB)  TX bytes:6959 (6.7 KiB)
          Interrupt:16 

eth2      Link encap:Ethernet  HWaddr 00:1B:21:72:96:EA  
          inet addr:192.168.3.74  Bcast:192.168.3.255  Mask:255.255.255.0
          inet6 addr: fe80::20e:cff:fe59:46e4/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5241 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4439 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1714026 (1.6 MiB)  TX bytes:1624392 (1.5 MiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:42 errors:0 dropped:0 overruns:0 frame:0
          TX packets:42 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:6449 (6.2 KiB)  TX bytes:6449 (6.2 KiB)

Setting up SSH

Setting up SSH shared keys will allow your nodes to pass files between one another and execute commands remotely without needing to enter a password. This will be needed later when we want to enable applications like libvirtd and virt-manager.

SSH is, on its own, a very big topic. If you are not familiar with SSH, please take some time to learn about it before proceeding. A great first step is the Wikipedia entry on SSH, as well as the SSH man page; man ssh.

SSH can be a bit confusing when it comes to keeping the connections straight in your head. When you connect to a remote machine, you start the connection on your machine as the user you are logged in as. This is the source user. When you call the remote machine, you tell it what user you want to log in as. This is the remote user.

You will need to create an SSH key for each source user on each node, and then you will need to copy the newly generated public key to each remote machine's user directory that you want to connect to. In this example, we want to connect to either node, from either node, as the root user. So we will create a key for each node's root user and then copy the generated public key to the other node's root user's directory.

For each user, on each machine you want to connect from, run:

# The '2047' is just to screw with brute-forcers a bit. :)
ssh-keygen -t rsa -N "" -b 2047 -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
Created directory '/root/.ssh'.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
a1:65:a9:50:bb:15:ae:b1:6e:06:12:4a:29:d1:68:f3 root@an-node04.alteeve.com

This will create two files: the private key, ~/.ssh/id_rsa, and the public key, ~/.ssh/id_rsa.pub. The private key must never be group- or world-readable! That is, it should be set to mode 0600.

The two files should look like:

Private key:

cat ~/.ssh/id_rsa
-----BEGIN RSA PRIVATE KEY-----
MIIEnwIBAAKCAQBTNg6FZyDKm4GAm7c+F2enpLWy+t8ZZjm4Z3Q7EhX09ukqk/Qm
MqprtI9OsiRVjce+wGx4nZ8+Z0NHduCVuwAxG0XG7FpKkUJC3Qb8KhyeIpKEcfYA
tsDUFnWddVF8Tsz6dDOhb61tAke77d9E01NfyHp88QBxjJ7w+ZgB2eLPBFm6j1t+
K50JHwdcFfxrZFywKnAQIdH0NCs8VaW91fQZBupg4OGOMpSBnVzoaz2ybI9bQtbZ
4GwhCghzKx7Qjz20WiqhfPMfFqAZJwn0WXfjALoioMDWavTbx+J2HM8KJ8/YkSSK
dDEgZCItg0Q2fC35TDX+aJGu3xNfoaAe3lL1AgEjAoIBABVlq/Zq+c2y9Wo2q3Zd
yjJsLrj+rmWd8ZXRdajKIuc4LVQXaqq8kjjz6lYQjQAOg9H291I3KPLKGJ1ZFS3R
AAygnOoCQxp9H6rLHw2kbcJDZ4Eknlf0eroxqTceKuVzWUe3ev2gX8uS3z70BjZE
+C6SoydxK//w9aut5UJN+H5f42p95IsUIs0oy3/3KGPHYrC2Zgc2TIhe25huie/O
psKhHATBzf+M7tHLGia3q682JqxXru8zhtPOpEAmU4XDtNdL+Bjv+/Q2HMRstJXe
2PU3IpVBkirEIE5HlyOV1T802KRsSBelxPV5Y6y5TRq+cEwn0G2le1GiFBjd0xQd
0csCgYEA2BWkxSXhqmeb8dzcZnnuBZbpebuPYeMtWK/MMLxvJ50UCUfVZmA+yUUX
K9fAUvkMLd7V8/MP7GrdmYq2XiLv6IZPUwyS8yboovwWMb+72vb5QSnN6LAfpUEk
NRd5JkWgqRstGaUzxeCRfwfIHuAHikP2KeiLM4TfBkXzhm+VWjECgYBilQEBHvuk
LlY2/1v43zYQMSZNHBSbxc7R5mnOXNFgapzJeFKvaJbVKRsEQTX5uqo83jRXC7LI
t14pC23tpW1dBTi9bNLzQnf/BL9vQx6KFfgrXwy8KqXuajfv1ECH6ytqdttkUGZt
TE/monjAmR5EVElvwMubCPuGDk9zC7iQBQKBgG8hEukMKunsJFCANtWdyt5NnKUB
X66vWSZLyBkQc635Av11Zm8qLusq2Ld2RacDvR7noTuhkykhBEBV92Oc8Gj0ndLw
hhamS8GI9Xirv7JwYu5QA377ff03cbTngCJPsbYN+e/uj6eYEE/1X5rZnXpO1l6y
G7QYcrLE46Q5YsCrAoGAL+H5LG4idFEFTem+9Tk3hDUhO2VpGHYFXqMdctygNiUn
lQ6Oj7Z1JbThPJSz0RGF4wzXl/5eJvn6iPbsQDpoUcC1KM51FxGn/4X2lSCZzgqr
vUtslejUQJn96YRZ254cZulF/YYjHyUQ3byhDRcr9U2CwUBi5OcbFTomlvcQgHcC
gYEAtIpaEWt+Akz9GDJpKM7Ojpk8wTtlz2a+S5fx3WH/IVURoAzZiXzvonVIclrH
5RXFiwfoXlMzIulZcrBJZfTgRO9A2v9rE/ZRm6qaDrGe9RcYfCtxGGyptMKLdbwP
UW1emRl5celU9ZEZRBpIVTES5ZVWqD2RkkkNNJbPf5F/x+w=
-----END RSA PRIVATE KEY-----

Public key (wrapped to make it more readable):

cat ~/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQBTNg6FZyDKm4GAm7c+F2enpLWy+t8Z
Zjm4Z3Q7EhX09ukqk/QmMqprtI9OsiRVjce+wGx4nZ8+Z0NHduCVuwAxG0XG7FpK
kUJC3Qb8KhyeIpKEcfYAtsDUFnWddVF8Tsz6dDOhb61tAke77d9E01NfyHp88QBx
jJ7w+ZgB2eLPBFm6j1t+K50JHwdcFfxrZFywKnAQIdH0NCs8VaW91fQZBupg4OGO
MpSBnVzoaz2ybI9bQtbZ4GwhCghzKx7Qjz20WiqhfPMfFqAZJwn0WXfjALoioMDW
avTbx+J2HM8KJ8/YkSSKdDEgZCItg0Q2fC35TDX+aJGu3xNfoaAe3lL1 root@an
-node01.alteeve.com

Copy the public key and then ssh normally into the remote machine as the root user. Create a file called ~/.ssh/authorized_keys and paste in the key.

From an-node04, type:

ssh root@an-node05
The authenticity of host 'an-node05 (192.168.3.75)' can't be established.
RSA key fingerprint is 55:58:c3:32:e4:e6:5e:32:c1:db:5c:f1:36:e2:da:4b.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'an-node05,192.168.3.75' (RSA) to the list of known hosts.
Last login: Fri Mar 11 20:45:58 2011 from 192.168.1.202

You will now be logged into an-node05 as the root user. Create the ~/.ssh/authorized_keys file and paste into it the public key from an-node04. If the remote machine's user hasn't used ssh yet, their ~/.ssh directory will not exist.
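If that directory is missing, you can create it with the permissions sshd requires before pasting in the key. A sketch (run as root on the remote node; sshd silently ignores keys in group- or world-writable paths):

```shell
# Create ~/.ssh and authorized_keys with safe permissions.
mkdir -p ~/.ssh
chmod 700 ~/.ssh
touch ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```

If your openssh package provides the ssh-copy-id helper, it automates creating the directory and appending the key in one step; check with which ssh-copy-id.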

(Wrapped to make it more readable)

cat ~/.ssh/authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQBTNg6FZyDKm4GAm7c+F2enpLWy+t8Z
Zjm4Z3Q7EhX09ukqk/QmMqprtI9OsiRVjce+wGx4nZ8+Z0NHduCVuwAxG0XG7FpK
kUJC3Qb8KhyeIpKEcfYAtsDUFnWddVF8Tsz6dDOhb61tAke77d9E01NfyHp88QBx
jJ7w+ZgB2eLPBFm6j1t+K50JHwdcFfxrZFywKnAQIdH0NCs8VaW91fQZBupg4OGO
MpSBnVzoaz2ybI9bQtbZ4GwhCghzKx7Qjz20WiqhfPMfFqAZJwn0WXfjALoioMDW
avTbx+J2HM8KJ8/YkSSKdDEgZCItg0Q2fC35TDX+aJGu3xNfoaAe3lL1 root@an
-node01.alteeve.com

Now log out and then log back into the remote machine. This time, the connection should succeed without prompting for a password!

Various applications will connect to the other node using different methods and networks. Each connection, when first established, will prompt you to confirm that you trust the authentication, as we saw above. Many programs can't handle this prompt and will simply fail to connect. To get around this, I will ssh into both nodes using all of the hostnames. This will populate a file called ~/.ssh/known_hosts. Once you do this on one node, you can simply copy the known_hosts file to the other nodes' and users' ~/.ssh/ directories.

I simply paste this into a terminal, answering yes and then immediately exiting from the ssh session. This is a bit tedious, I admit. Take the time to check the fingerprints as they are displayed to you. It is a bad habit to blindly type yes.

Alter this to suit your host names.

ssh root@an-node04 && \
ssh root@an-node04.alteeve.com && \
ssh root@an-node04.bcn && \
ssh root@an-node04.sn && \
ssh root@an-node04.ifn && \
ssh root@an-node05 && \
ssh root@an-node05.alteeve.com && \
ssh root@an-node05.bcn && \
ssh root@an-node05.sn && \
ssh root@an-node05.ifn

Keeping Time In Sync

It is very important that time on both nodes be kept in sync. The way to do this is to set up NTP, the network time protocol. I like to use the tick.redhat.com time server, though you are free to substitute your preferred time source.

First, add the time server to the NTP configuration file by appending the following lines to the end of /etc/ntp.conf:

echo server tick.redhat.com$'\n'restrict tick.redhat.com mask 255.255.255.255 nomodify notrap noquery >> /etc/ntp.conf
tail -n 4 /etc/ntp.conf
# Specify the key identifier to use with the ntpq utility.
#controlkey 8
server tick.redhat.com
restrict tick.redhat.com mask 255.255.255.255 nomodify notrap noquery

Now make sure that the ntpd service starts on boot, then start it manually.

chkconfig ntpd on
/etc/init.d/ntpd start
Starting ntpd:                                             [  OK  ]

Altering Boot Up

Note: These steps are optional.

There are two changes I like to make on my nodes. These are not required, but I find it helps to keep things as simple as possible. Particularly in the earlier learning and testing stages.

Changing the Default Run-Level

If you choose not to implement this change, please alter any references to /etc/rc3.d to /etc/rc5.d later in this tutorial.

I prefer to minimize the daemons and applications running on my nodes for two reasons: performance and security. One of the simplest ways to minimize the number of running programs is to change the run-level to 3 by editing /etc/inittab. This tells the node not to start the graphical interface at boot and instead simply boot to a bash shell.

This change is actually quite simple. Simply edit /etc/inittab and change the line id:5:initdefault: to id:3:initdefault:.

vim /etc/inittab
# Default runlevel. The runlevels used by RHS are:
#   0 - halt (Do NOT set initdefault to this)
#   1 - Single user mode
#   2 - Multiuser, without NFS (The same as 3, if you do not have networking)
#   3 - Full multiuser mode
#   4 - unused
#   5 - X11
#   6 - reboot (Do NOT set initdefault to this)
# 
id:3:initdefault:

If you are still in a graphical environment and want to disable the GUI without rebooting, you can run init 3. Conversely, if you want to start the GUI for a certain task, you can do so by running init 5.
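If you prefer, the same change can be made with a sed one-liner. The sketch below runs against a scratch copy so you can preview the substitution; point it at the real /etc/inittab (after backing it up) once you are satisfied:

```shell
# Demonstrate the run-level substitution on a scratch file.
echo 'id:5:initdefault:' > /tmp/inittab.test
sed -i 's/^id:5:initdefault:$/id:3:initdefault:/' /tmp/inittab.test
cat /tmp/inittab.test
# → id:3:initdefault:
```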

Making Boot Messages Visible

Another optional step, in line with the change above, is to remove the rhgb (Red Hat Graphical Boot) and quiet kernel arguments. These options provide the clean boot screen you normally see with EL5, but they also hide a lot of boot messages that we may find helpful.

To make this change, edit the grub boot-loader menu and remove the rhgb quiet arguments from the kernel /vmlinuz... line. These arguments are usually the last ones on the line. If you make this change after a kernel update, you may see two or more kernel entries; delete these arguments wherever they are found.

vim /boot/grub/menu.lst

Change:

title CentOS (2.6.18-194.32.1.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-194.32.1.el5 ro root=LABEL=/ rhgb quiet
        initrd /initrd-2.6.18-194.32.1.el5.img

To:

title CentOS (2.6.18-194.32.1.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-194.32.1.el5 ro root=LABEL=/
        initrd /initrd-2.6.18-194.32.1.el5.img

There is nothing more to do now. Future reboots will be simple terminal displays.
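The grub edit can likewise be scripted. A sketch against a scratch copy of the kernel line; on the real system the target is /boot/grub/menu.lst (keep a backup first):

```shell
# Strip the 'rhgb quiet' arguments from a sample kernel line.
echo 'kernel /vmlinuz-2.6.18-194.32.1.el5 ro root=LABEL=/ rhgb quiet' > /tmp/menu.lst.test
sed -i 's/ rhgb quiet$//' /tmp/menu.lst.test
cat /tmp/menu.lst.test
# → kernel /vmlinuz-2.6.18-194.32.1.el5 ro root=LABEL=/
```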

Installing Packages We Will Use

There are several packages we will need. They can all be installed in one go with the following command.

If you have a slow or metered Internet connection, you may want to alter /etc/yum.conf and change keepcache=0 to keepcache=1 before installing packages. This way, you can run your updates and installs on one node and then rsync the downloaded files from the first node to the second. Once done, when you run the updates and installs on that second node, nothing more will be downloaded. To copy the cached RPMs, simply run rsync -av /var/cache/yum root@an-node05:/var/cache/ (assuming you did the initial downloads from an-node04).

yum install cman openais rgmanager lvm2-cluster gfs2-utils xen xen-libs kmod-xenpv \
            drbd83 kmod-drbd83-xen virt-manager virt-viewer libvirt libvirt-python \
            python-virtinst luci ricci

This will drag in a good number of dependencies, which is fine.

Setting Up Xen

It may seem premature to discuss Xen before the cluster itself. The reason we need to look at it now, before the cluster, is that Xen makes some fairly significant changes to the networking. Given how changes to networking can affect the cluster, we will want to get these changes out of the way.

We're not going to provision any virtual machines until the cluster is built.

A Brief Overview

Xen is a hypervisor that converts the installed operating system into a virtual machine running on a small Xen kernel. This same small kernel also runs all of the virtual machines you will add later. In this way, you will always be working in a virtual machine once you switch to booting a Xen kernel. In Xen terminology, virtual machines are known as domains.

The "host" operating system is known as dom0 (domain 0). It has a special view of the hardware and contains the configuration and control of Xen itself. All other Xen virtual machines are known as domU (domain U). This is a collective term reflecting the transient ID number assigned to each virtual machine. For example, when you boot the first virtual machine, it is known as dom1. The next will be dom2, then dom3 and so on. Note that if a domU shuts down, its ID is not reused. When it restarts, it will use the next free ID (ie: dom4 in this example, despite it having initially been, say, dom1).

This makes Xen somewhat unique in the virtualization world. Most others do not touch or alter the "host" OS, instead running the guest VMs fully within the context of the host operating system.

Understanding Networking in Xen

Xen uses a fairly complex networking system. This is, perhaps, its strongest point. The trade-off is that it can be a little tricky to wrap your head around. To help you become familiar, there is a short tutorial dedicated to this topic. Please read it over before proceeding if you are not familiar with Xen's networking.

Taking the time to read and understand the mini-paper below will save you a lot of heartache in the following stages.

Making Network Interfaces Available To Xen Clients

As discussed above, Xen makes some significant changes to the dom0 network, which happens to be where the cluster will operate. These changes include shutting down and moving around the interfaces. As we will discuss later, this behaviour can trigger cluster failures. This is the main reason for dealing with Xen now. Once the changes are in place, the network is stable and safe for running the cluster on.

A Brief Overview

By default, Xen only makes eth0 available to the virtual machines. We will want to add eth2 as well, as we will use the Back Channel Network for inter-VM communication. We do not want to add the Storage Network to Xen though! Doing so puts the DRBD link at risk. Should xend get shut down, it could trigger a split-brain in DRBD.

What Xen does, in brief, is move the "real" eth0 over to a new device called peth0. Then it creates a virtual "clone" of the network interface called eth0. Next, Xen creates a bridge called xenbr0. Finally, both the real peth0 and the new virtual eth0 are connected to the xenbr0 bridge.

The reasoning behind all this is to separate the traffic coming to and from dom0 from any traffic going to the various domUs. Think of it sort of like the bridge being a network switch, peth0 being an uplink cable to the outside world and the virtual eth0 being dom0's "port" on the switch. We want the same to be done to the interface on the Back-Channel Network, too. The Storage Network will never be exposed to the domU machines, so combined with the risk to the underlying storage, there is no reason to add eth1 to Xen's control.

Disable the 'qemu' Bridge

By default, libvirtd creates a bridge called virbr0 designed to connect virtual machines to the first eth0 interface. Our system will not need this, so we will remove it. This bridge is configured in the /etc/libvirt/qemu/networks/default.xml file, so to remove this bridge, simply delete the contents of the file.

cat /dev/null >/etc/libvirt/qemu/networks/default.xml

The next time you reboot, that bridge will be gone.

Create /etc/xen/scripts/an-network-script

We will create a script that Xen will be told to use for bringing up the "xenified" network interfaces.

Please note:

  1. You don't need to use the name 'an-network-script'. I suggest this name mainly to keep in line with the rest of the 'AN!x' naming used on this wiki.
  2. If you install convirt (not discussed further here), it will create its own bridge script called convirt-xen-multibridge. Other tools may do something similar.

First, touch the file and then chmod it to be executable.

touch /etc/xen/scripts/an-network-script
chmod 755 /etc/xen/scripts/an-network-script

Now edit it to contain the following:

vim /etc/xen/scripts/an-network-script
#!/bin/sh
dir=$(dirname "$0")
"$dir/network-bridge" "$@" vifnum=0 netdev=eth0 bridge=xenbr0
"$dir/network-bridge" "$@" vifnum=2 netdev=eth2 bridge=xenbr2

Let's move on to the main Xen configuration file now.

Editing the /etc/xen/xend-config.sxp Configuration File

We need to do three things here:

  • Tell Xen to use our new network script.
  • Tell Xen to enable its unix socket so that external tools can manage it.
  • Enable Live Migration of VMs between nodes.

Edit the /etc/xen/xend-config.sxp file and change the network-script argument to point to this new script. This is at about line 91.

vim /etc/xen/xend-config.sxp
#(network-script network-bridge)
(network-script an-network-script)

Next, tell Xen to enable its unix socket. This is how tools like virsh, which we will look at later, interact with Xen. To do this, change xend-unix-server, which is around line 19.

(xend-unix-server yes)

Finally, to enable live migration, we need to edit four values. Let's look at the new values, then we'll discuss what they affect and how their syntax works.

(xend-relocation-server yes)
(xend-relocation-port 8002)
(xend-relocation-address 'an-node04.bcn')
(xend-relocation-hosts-allow '')
  • xend-unix-server; When set to yes, this tells Xen to enable its unix socket. This is needed by management tools like virsh.
  • xend-relocation-server; When set to yes, this tells Xen to allow the migration of VMs.
  • xend-relocation-port; This sets the TCP port on which Xen listens for migration requests.
  • xend-relocation-address; This is an IP address or resolvable name that must match an IP address of an interface on the local machine. It binds Xen's migration listener to the given interface. If set to just '', Xen will listen for connections on all interfaces.
  • xend-relocation-hosts-allow; This is a space-separated list of host names, IP addresses and regular expressions of hosts that are allowed to be migration sources and targets. Some examples are: an-node04 an-node05 ^192\.168\.*$. If set to just '', Xen will allow migration to or from any host on the network. As we've already restricted migration to the BCN by way of xend-relocation-address 'an-node04.bcn', it is safe to leave this open to any host.
Note: Be sure that xend-relocation-address is set uniquely on each node.

Finally, save the file and check that it works by (re)starting xend:

/etc/init.d/xend restart
restart xend:                                              [  OK  ]

Now we'll use ifconfig to see the new network configuration (with a dash of creative grep to save screen space):

ifconfig |grep "Link encap" -A 1
eth0      Link encap:Ethernet  HWaddr 48:5B:39:3C:53:15
          inet addr:192.168.1.74  Bcast:192.168.1.255  Mask:255.255.255.0
--
eth1      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:5A
          inet addr:192.168.2.74  Bcast:192.168.2.255  Mask:255.255.255.0
--
eth2      Link encap:Ethernet  HWaddr 00:1B:21:72:96:EA
          inet addr:192.168.3.74  Bcast:192.168.3.255  Mask:255.255.255.0
--
lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
--
peth0     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
--
peth2     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
--
vif0.0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
--
vif0.2    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
--
xenbr0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1
--
xenbr2    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1

If you see this, the Xen networking is setup properly!

Altering When xend Starts

As was mentioned, xend rather dramatically modifies the networking when it starts. We now need to make sure that xend starts before cman, which is not the case by default. To do this, we will edit the /etc/init.d/xend script and change its default start position from 98 to 12.

Edit /etc/init.d/xend:

vim /etc/init.d/xend

And change the chkconfig: 2345 98 01 header from:

#!/bin/bash
#
# xend          Script to start and stop the Xen control daemon.
#
# Author:       Keir Fraser <keir.fraser@cl.cam.ac.uk>
#
# chkconfig: 2345 98 01
# description: Starts and stops the Xen control daemon.

To chkconfig: 2345 12 01:

#!/bin/bash
#
# xend          Script to start and stop the Xen control daemon.
#
# Author:       Keir Fraser <keir.fraser@cl.cam.ac.uk>
#
# chkconfig: 2345 12 01
# description: Starts and stops the Xen control daemon.

Now delete and re-add the xend initialization script using chkconfig.

chkconfig xend off
chkconfig xend on

If it worked, you should see it now higher up the start list than cman (ignore xendomains):

ls -lah /etc/rc3.d/ | grep -e cman -e xend
lrwxrwxrwx  1 root root   20 Mar  2 11:36 K00xendomains -> ../init.d/xendomains
lrwxrwxrwx  1 root root   14 Mar 14 11:38 S12xend -> ../init.d/xend
lrwxrwxrwx  1 root root   14 Mar 14 11:38 S21cman -> ../init.d/cman

That's it! The initial Xen configuration is done and we can start on the cluster configuration itself!

Cluster Setup

In Red Hat Cluster Services, the heart of the cluster is found in the /etc/cluster/cluster.conf XML configuration file.

There are three main ways of editing this file. Two are already well documented, so I won't discuss them beyond introducing them. The third way is by directly hand-crafting the cluster.conf file. This method is not very well documented, yet directly manipulating configuration files is my preferred method. As my boss loves to say; "The more computers do for you, the more they do to you". I've grudgingly come to agree with him.

The first two, well documented, graphical interface methods are:

  • system-config-cluster, older GUI tool run directly from one of the cluster nodes.
  • Conga, comprised of the ricci node-side client and the luci web-based server (can be run on machines outside the cluster).

I do like the tools above, but I often find issues that send me back to the command line, so I'd recommend setting them aside for now as well. Once you feel comfortable with the cluster.conf syntax, then by all means go back and use them. Just try not to come to rely on them, which can easily happen if you adopt them too early in your studies.

The First cluster.conf Foundation Configuration

The very first stage of building the cluster is to create a configuration file that is as minimal as possible. To do that, we need to define a few things;

  • The name of the cluster and the cluster file version.
    • Define cman options
    • The nodes in the cluster
      • The fence method for each node
    • Define fence devices
    • Define fenced options

That's it. Once we've defined this minimal amount, we will be able to start the cluster for the first time! So let's get to it, finally.

Name the Cluster and Set The Configuration Version

The cluster tag is the parent tag for the entire cluster configuration file.

<?xml version="1.0"?>
<cluster name="an-cluster" config_version="1">
</cluster>

This tag has two attributes that we need to set: name="" and config_version="".

The name="" attribute defines the name of the cluster. It must be unique amongst the clusters on your network. It should be descriptive, but you will not want to make it too long, either. You will see this name in the various cluster tools and you will enter it, for example, when creating a GFS2 partition later on. This tutorial uses the cluster name an-cluster.

The config_version="" attribute is an integer marking the version of the configuration file. Whenever you make a change to the cluster.conf file, you will need to increment this version number by 1. If you don't increment this number, then the cluster tools will not know that the file needs to be reloaded. As this is the first version of this configuration file, it will start with 1. Note that this tutorial will increment the version after every change, regardless of whether it is explicitly pushed out to the other nodes and reloaded. The reason is to help get into the habit of always increasing this value.
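
Because forgetting to bump config_version is such a common mistake, it can help to script the increment. The following is just an illustrative sketch; it works on a local sample file it creates itself, so substitute your real /etc/cluster/cluster.conf in practice:

```shell
# Create a sample copy to demonstrate on (use your real cluster.conf in practice).
conf=cluster.conf.sample
printf '<?xml version="1.0"?>\n<cluster name="an-cluster" config_version="1">\n</cluster>\n' > "$conf"

# Extract the current version number, increment it, and write it back in place.
cur=$(grep -o 'config_version="[0-9]*"' "$conf" | tr -dc '0-9')
new=$(( cur + 1 ))
sed -i "s/config_version=\"$cur\"/config_version=\"$new\"/" "$conf"

# Show the updated attribute.
grep config_version "$conf"
```

Note that this assumes the attribute appears exactly once in the file, which is the case for a valid cluster.conf.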

Configuring cman Options

We are going to setup a special case for our cluster; A 2-Node cluster.

This is a special case because traditional quorum will not be useful. With only two nodes, each having a vote of 1, the total votes is 2. Quorum needs 50% + 1, which means that a single node failure would shut down the cluster, as the remaining node's vote is 50% exactly. That rather defeats the purpose of having a cluster at all.

So to account for this special case, there is a special attribute called two_node="1". This tells the cluster manager to continue operating with only one vote. This option requires that the expected_votes="" attribute be set to 1. Normally, expected_votes is set automatically to the total sum of the defined cluster nodes' votes (which itself is a default of 1). This is the other half of the "trick", as a single node's vote of 1 now always provides quorum (that is, 1 meets the 50% + 1 requirement).
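
The quorum arithmetic above can be checked in a couple of lines of shell. This is illustrative only; the cluster manager does this math internally:

```shell
total_votes=2                      # two nodes, one vote each
quorum=$(( total_votes / 2 + 1 ))  # "50% + 1", using integer division
echo "quorum needed: $quorum"      # prints: quorum needed: 2

# If one node dies, the survivor holds only 1 vote, which is less than 2,
# so without two_node="1" the remaining node would be inquorate and stop.
surviving_votes=1
[ "$surviving_votes" -lt "$quorum" ] && echo "survivor is inquorate"
```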

<?xml version="1.0"?>
<cluster name="an-cluster" config_version="2">
	<cman expected_votes="1" two_node="1"/>
</cluster>

Take note of the self-closing <... /> tag. This is XML syntax telling the parser not to look for any child tags or a separate closing tag.

Defining Cluster Nodes

This example is a little artificial; please don't load it into your cluster yet, as we will need to add a few child tags. One thing at a time.

This actually introduces two tags.

The first is the parent clusternodes tag, which takes no attributes of its own. Its sole purpose is to contain the clusternode child tags.

<?xml version="1.0"?>
<cluster name="an-cluster" config_version="3">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-node04.alteeve.com" nodeid="1" />
		<clusternode name="an-node05.alteeve.com" nodeid="2" />
	</clusternodes>
</cluster>

The clusternode tag defines each cluster node. There are many attributes available, but we will look at just the two required ones.

The first is the name="" attribute. This should match the name given by uname -n ($HOSTNAME) when run on each node. The IP address that the name resolves to also sets the interface and subnet that the totem ring will run on. That is, the main cluster communications, which we are calling the Back-Channel Network. This is why it is so important to set up our /etc/hosts file correctly. Please see the clusternode's name attribute document for details on how name-to-interface mapping is resolved.

The second attribute is nodeid="". This must be a unique integer amongst the <clusternode ...> tags. It is used by the cluster to identify the node.
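
Before moving on, it is worth confirming on each node that the value you will put in name="" matches the node's hostname and resolves locally. A quick, hypothetical check:

```shell
# The clusternode name="" attribute must match this command's output:
uname -n

# And that name must resolve locally; in this tutorial it should map to the
# node's Back-Channel Network address (the 192.168.3.x subnet).
grep -q "$(uname -n)" /etc/hosts \
    && echo "hostname found in /etc/hosts" \
    || echo "hostname NOT in /etc/hosts; fix this before proceeding"
```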

Defining Fence Devices

Fence devices are designed to forcibly eject a node from a cluster. This is generally done by forcing it to power off or reboot. Some SAN switches can logically disconnect a node from the shared storage device, which has the same effect of guaranteeing that the defective node can not alter the shared storage. A common, third type of fence device is one that cuts the mains power to the server.

All fence devices are contained within the parent fencedevices tag. This parent tag has no attributes. Within this parent tag are one or more fencedevice child tags.

<?xml version="1.0"?>
<cluster name="an-cluster" config_version="4">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-node04.alteeve.com" nodeid="1" />
		<clusternode name="an-node05.alteeve.com" nodeid="2" />
	</clusternodes>
	<fencedevices>
		<fencedevice agent="fence_na" ipaddr="batou.alteeve.com" login="admin" name="batou" passwd="secret" quiet="1"/>
	</fencedevices>
</cluster>

Every fence device used in your cluster will have its own fencedevice tag. If you are using IPMI, this means you will have a fencedevice entry for each node, as each physical IPMI BMC is a unique fence device.

All fencedevice tags share two basic attributes; name="" and agent="".

  • The name attribute must be unique among all the fence devices in your cluster. As we will see in the next step, this name will be used within the <clusternode...> tag.
  • The agent attribute tells the cluster which fence agent to use when the fenced daemon needs to communicate with the physical fence device. A fence agent is simply a shell script that acts as a glue layer between the fenced daemon and the fence hardware. The agent takes the arguments from the daemon, like what port to act on and what action to take, and executes the fence against the node. The agent is responsible for ensuring that the action succeeded and for returning an appropriate success or failure exit code. For those curious, the full details are described in the FenceAgentAPI. If you have two or more of the same fence device, like IPMI, then you will use the same fence agent value a corresponding number of times.

Beyond these two attributes, each fence agent will have its own subset of attributes, the scope of which is outside this tutorial, though we will see examples for IPMI, a switched PDU and a Node Assassin. Most, if not all, fence agents have a corresponding man page that will show you what attributes it accepts and how they are used. The two fence agents we will see here have their attributes defined in the following man pages.

  • man fence_na - Node Assassin fence agent
  • man fence_ipmilan - IPMI fence agent

The example above is what this tutorial will use.

Example <fencedevice...> Tag For Node Assassin

This is the device used throughout this tutorial. It is for the open source, open hardware Node Assassin fence device that you can build yourself.

	<fencedevices>
		<fencedevice agent="fence_na" ipaddr="batou.alteeve.com" login="admin" name="batou" passwd="secret" quiet="1"/>
	</fencedevices>

Being a network-attached fence device, as most fence devices are, the attributes for fence_na include connection information. The attribute variable names are generally the same across fence agents, and they are:

  • ipaddr; This is the resolvable name or IP address of the device. If you use a resolvable name, it is strongly advised that you put the name in /etc/hosts as DNS is another layer of abstraction which could fail.
  • login; This is the login name to use when the fenced daemon connects to the device. This is configured in /etc/cluster/fence_na.conf.
  • passwd; This is the login password to use when the fenced daemon connects to the device. This is also configured in /etc/cluster/fence_na.conf.
  • name; This is the name of this particular fence device within the cluster which, as we will see shortly, is matched in the <clusternode...> element where appropriate.
  • quiet; This is a Node Assassin specific argument. It is used to generate no output to STDOUT when run, as there is no terminal to print to or user to view it.

Example <fencedevice...> Tag For IPMI

Here we will show what IPMI <fencedevice...> tags look like. We won't be using it ourselves, but it is quite popular as a fence device so I wanted to show an example of its use.

	<fencedevices>
		<fencedevice name="an01_ipmi" agent="fence_ipmilan" ipaddr="192.168.4.74" login="admin" passwd="secret" />
		<fencedevice name="an02_ipmi" agent="fence_ipmilan" ipaddr="192.168.4.75" login="admin" passwd="secret" />
	</fencedevices>
  • ipaddr; This is the resolvable name or IP address of the device. If you use a resolvable name, it is strongly advised that you put the name in /etc/hosts as DNS is another layer of abstraction which could fail.
  • login; This is the login name to use when the fenced daemon connects to the device. For IPMI, this is configured on each node's BMC.
  • passwd; This is the login password to use when the fenced daemon connects to the device. This, too, is configured on the BMC.
  • name; This is the name of this particular fence device within the cluster which, as we will see shortly, is matched in the <clusternode...> element where appropriate.
Note: We will see shortly that, unlike switched PDUs, Node Assassin or other network fence devices, IPMI does not have ports. This is because each IPMI BMC supports just its host system. More on that later.

Example <fencedevice...> Tag For HP's iLO

Getting iLO to work in the cluster is a little trickier as the RPMs used to enable iLO must be downloaded from HP's website and manually installed. There is a "quickie" tutorial that covers getting iLO working on EL5 below.

	<fencedevices>
		<fencedevice name="an01_ilo" agent="fence_ilo" ipaddr="192.168.4.74" login="Administrator" passwd="secret" />
		<fencedevice name="an02_ilo" agent="fence_ilo" ipaddr="192.168.4.75" login="Administrator" passwd="secret" />
	</fencedevices>

Using the Fence Devices

Now that we have nodes and fence devices defined, we will go back and tie them together. This is done by:

  • Defining a fence tag containing all fence methods and devices.
    • Defining one or more method tag(s) containing the device call(s) needed for each fence attempt.
      • Defining one or more device tag(s) containing attributes describing how to call the fence device to kill this node.

This tutorial will be using just a Node Assassin fence device. We'll look at an example adding IPMI in a moment though, as IPMI is a very common fence device and one you will very likely use.

<?xml version="1.0"?>
<cluster name="an-cluster" config_version="5">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-node04.alteeve.com" nodeid="1">
			<fence>
				<method name="node_assassin">
					<device name="batou" port="01" action="reboot"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node05.alteeve.com" nodeid="2">
			<fence>
				<method name="node_assassin">
					<device name="batou" port="02" action="reboot"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="batou" agent="fence_na" ipaddr="batou.alteeve.com" login="admin" passwd="secret" quiet="1"/>
	</fencedevices>
</cluster>

First, notice that the fence tag has no attributes. It's merely a container for the method(s).

The next level is the method named node_assassin. This name is merely a description and can be whatever you feel is most appropriate. Its purpose is simply to help you distinguish this method from other methods. The reason for method tags is that some fence device calls will take two or more steps. A classic example would be a node with redundant power supplies fed by a switched PDU acting as the fence device. In such a case, you will need to define multiple device tags, one for each power cable feeding the node, and the cluster will not consider the fence a success unless and until all contained device calls execute successfully.

The actual fence device configuration is the final piece of the puzzle. It is here that you specify per-node configuration options and link these attributes to a given fencedevice. Here, we see the link to the fencedevice via the name, batou in this example.

Let's step through an example fence call to help show how the per-cluster and fence device attributes are combined during a fence call.

  • The cluster manager decides that a node needs to be fenced. Let's say that the victim is an-node05.
  • The first method in the fence section under an-node05 is consulted. Within it there is just one device, named batou and having two attributes;
    • port; This tells the cluster that an-node05 is connected to the Node Assassin's port number 02.
    • action; This tells the cluster that the fence action to take is reboot. How this action is actually interpreted depends on the fence device in use, though the name certainly implies that the node will be forced off and then restarted.
  • The cluster searches in fencedevices for a fencedevice matching the name batou. This fence device has five attributes;
    • agent; This tells the cluster to call the fence_na fence agent script, as we discussed earlier.
    • ipaddr; This tells the fence agent where on the network to find this particular Node Assassin. This is how multiple fence devices of the same type can be used in the cluster.
    • login; This is the login user name to use when authenticating against the fence device.
    • passwd; This is the password to supply along with the login name when authenticating against the fence device.
    • quiet; This is a device-specific argument that Node Assassin uses (see man fence_na for details).
  • With this information collected and compiled, the fenced daemon will call the fence agent and pass it the attribute variable=value pairs, one per line. Thus, the fenced daemon will call:
/sbin/fence_na

Then it will pass to that agent the following arguments:

ipaddr=batou.alteeve.com
login=admin
passwd=secret
quiet=1
port=02
action=reboot

As you can see then, the first four arguments are from the fencedevice attributes and the last two are from the device attributes under an-node05's clusternode's fence tag.
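
Because the agent reads these variable=value pairs from its standard input, you can simulate fenced's call from the shell, which is handy when testing a new fence device. The sketch below only prints the pairs; the actual invocation is left commented out, and you should only run it against a device you can safely power-cycle (many agents also accept a harmless action=status for testing):

```shell
# Build the same variable=value pairs that fenced would pass, one per line.
args='ipaddr=batou.alteeve.com
login=admin
passwd=secret
quiet=1
port=02
action=reboot'

# Show what the agent would receive on its stdin.
printf '%s\n' "$args"

# To actually exercise the agent, pipe the pairs into it:
# printf '%s\n' "$args" | /sbin/fence_na
```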

When you have two or more method tags defined, the first in the list will be tried. If any of its device tags fail, then the method is considered to have failed and the next method is consulted. This will repeat until all method entries have been tried. At that point, the cluster goes back to the first method and tries again, repeating the walk through of all methods. This loop will continue until one method succeeds, regardless of how long that might take.

An Example Showing IPMI's Use

This is a full configuration file showing what it would look like if we were using IPMI and a Node Assassin for redundant fencing.

<?xml version="1.0"?>
<cluster name="an-cluster" config_version="6">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-node04.alteeve.com" nodeid="1">
			<fence>
				<method name="node_assassin">
					<device name="batou" port="01" action="reboot"/>
				</method>
				<method name="an-node04_ipmi">
					<device name="an01_ipmi" action="reboot"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node05.alteeve.com" nodeid="2">
			<fence>
				<method name="node_assassin">
					<device name="batou" port="02" action="reboot"/>
				</method>
				<method name="an-node05_ipmi">
					<device name="an02_ipmi" action="reboot"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="batou" agent="fence_na" ipaddr="batou.alteeve.com" login="admin" passwd="secret" quiet="1"/>
		<fencedevice name="an01_ipmi" agent="fence_ipmilan" ipaddr="192.168.4.74" login="admin" passwd="secret" />
		<fencedevice name="an02_ipmi" agent="fence_ipmilan" ipaddr="192.168.4.75" login="admin" passwd="secret" />
	</fencedevices>
</cluster>

We now see three elements in fencedevices; the original Node Assassin entry plus two IPMI entries, one for each node in the cluster. As we touched on earlier, this is because each node has its own IPMI BMC. In the same vein, we also now see that the device entries in each node's IPMI method element have no port setting.

Notice that the Node Assassin's method is above the IPMI method. This means that the Node Assassin is the primary fence device and the IPMI is the secondary. When deciding which order to assign the fence devices, consider each device's potential for failure and how that might affect cluster recovery time. For example, many IPMI BMCs rely on the node's power supply to operate. Thus, if a node's power supply fails and IPMI is the first fence device, recovery will be delayed as the cluster tries, then waits for the fence call to time out, before moving on to the networked fence device, Node Assassin in this instance.

Give Nodes More Time To Start

Clusters with three or more nodes must gain quorum before they can fence other nodes. As we saw earlier though, this is not the case when using the two_node="1" attribute in the cman tag. What this means in practice is that if you start the cluster on one node and then wait too long to start the cluster on the second node, the first will fence the second.

The logic behind this is: when the cluster starts, it will try to talk to its fellow node and fail. With the special two_node="1" attribute set, the cluster knows that it is allowed to start clustered services, but it has no way to say for sure what state the other node is in. It could well be online and hosting services for all it knows. So it has to proceed on the assumption that the other node is alive and using shared resources. Given that, and given that it can not talk to the other node, its only safe option is to fence the other node. Only then can it be confident that it is safe to start providing clustered services.

<?xml version="1.0"?>
<cluster name="an-cluster" config_version="7">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-node04.alteeve.com" nodeid="1">
			<fence>
				<method name="node_assassin">
					<device name="batou" port="01" action="reboot"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node05.alteeve.com" nodeid="2">
			<fence>
				<method name="node_assassin">
					<device name="batou" port="02" action="reboot"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="batou" agent="fence_na" ipaddr="batou.alteeve.com" login="admin" passwd="secret" quiet="1"/>
	</fencedevices>
        <fence_daemon post_join_delay="60"/>
</cluster>

The new tag is fence_daemon, seen near the bottom of the file above. The change is made using the post_join_delay="60" attribute. By default, the cluster will declare the other node dead after just 6 seconds. The trade-off is that the larger this value, the slower the start-up of cluster services will be. During testing and development though, I found the default to be far too short, frequently leading to unnecessary fencing. Once your cluster is set up and working, it's not a bad idea to reduce this value to the lowest value you are comfortable with.

Configuring Totem

This is almost a misnomer, as we're more or less not configuring the totem protocol in this cluster.

<?xml version="1.0"?>
<cluster name="an-cluster" config_version="8">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-node04.alteeve.com" nodeid="1">
			<fence>
				<method name="node_assassin">
					<device name="batou" port="01" action="reboot"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node05.alteeve.com" nodeid="2">
			<fence>
				<method name="node_assassin">
					<device name="batou" port="02" action="reboot"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="batou" agent="fence_na" ipaddr="batou.alteeve.com" login="admin" passwd="secret" quiet="1"/>
	</fencedevices>
        <fence_daemon post_join_delay="60"/>
        <totem rrp_mode="none" secauth="off"/>
</cluster>

In the spirit of "keeping it simple", we're not configuring the redundant ring protocol in this cluster. RRP is an optional second ring that can be used for cluster communication in the case of a breakdown in the first ring. It is not the simplest option to set up, as recovery must be done manually. However, if you wish to explore it further, please take a look at the clusternode element's child tag <altname...>. When altname is used, the rrp_mode attribute will need to be changed to either active or passive (the details of which are outside the scope of this tutorial).
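
For the curious, an altname setup would look roughly like the fragment below. This is a sketch only, as this tutorial does not use RRP; the .sn host name is a hypothetical second-network entry that would need to exist in /etc/hosts:

```
	<clusternode name="an-node04.alteeve.com" nodeid="1">
		<altname name="an-node04.sn"/>
		<!-- fence methods as before -->
	</clusternode>
	<!-- And in the totem tag: -->
	<totem rrp_mode="passive" secauth="off"/>
```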

The second option we're looking at here is the secauth="off" attribute. This controls whether cluster communications are encrypted. We can safely disable this because we're working on a known-private network, which yields two benefits: it's simpler to set up and it's a lot faster. If you must encrypt the cluster communications, then you can enable it here, though the details are also outside the scope of this tutorial.

Validating and Pushing the /etc/cluster/cluster.conf File

The cluster software validates the /etc/cluster/cluster.conf file against /usr/share/system-config-cluster/misc/cluster.ng using the xmllint program. If it fails to validate, the cluster will refuse to start.

So now that we've got the foundation of our cluster ready, the last step is to validate it. To do so, simply run:

xmllint --relaxng /usr/share/system-config-cluster/misc/cluster.ng /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="an-cluster" config_version="8">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-node04.alteeve.com" nodeid="1">
			<fence>
				<method name="node_assassin">
					<device name="batou" port="01" action="reboot"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node05.alteeve.com" nodeid="2">
			<fence>
				<method name="node_assassin">
					<device name="batou" port="02" action="reboot"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="batou" agent="fence_na" ipaddr="batou.alteeve.com" login="admin" passwd="secret" quiet="1"/>
	</fencedevices>
        <fence_daemon post_join_delay="60"/>
        <totem rrp_mode="none" secauth="off"/>
</cluster>
/etc/cluster/cluster.conf validates

If there was a problem, you need to go back and fix it. DO NOT proceed until your configuration validates. Once it does, we're ready to move on!

With it validated, we need to push it to the other node. As the cluster is not running yet, we will push it out using rsync.

rsync -av /etc/cluster/cluster.conf root@an-node05:/etc/cluster/
building file list ... done
cluster.conf

sent 891 bytes  received 66 bytes  638.00 bytes/sec
total size is 790  speedup is 0.83

Starting the Cluster For The First Time

At this point, we have the foundation of the cluster in place and we can start it up!

Keeping an Eye on Things

I've found a layout of four terminal windows, the left ones being 80 columns wide and the right ones filling the rest of the screen, works well. I personally run a tail -f -n 0 /var/log/messages in the right windows so that I can keep an eye on things.

The terminal layout I use to monitor and operate the two nodes in the cluster.

Of course, what you use is entirely up to you, your screen real-estate and your preferences.

A Note on Timing

Remember that you have post_join_delay seconds to start both nodes, which is 60 seconds in our configuration. So be sure that you can start the cman daemon quickly on both nodes. I generally ensure that both terminal windows have the start command typed in, so that I can quickly press <enter> on both nodes. Again, how you do this is entirely up to you.

All Systems Are Go!

Time to start cman on both nodes!

On both nodes, run the following command:

/etc/init.d/cman start
Starting cluster: 
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
                                                           [  OK  ]

If things went well, you should see something like this in the /var/log/messages terminal on both nodes:

Mar 27 22:10:30 an-node04 ccsd[6229]: Starting ccsd 2.0.115: 
Mar 27 22:10:30 an-node04 ccsd[6229]:  Built: Nov 11 2010 13:23:04 
Mar 27 22:10:30 an-node04 ccsd[6229]:  Copyright (C) Red Hat, Inc.  2004  All rights reserved. 
Mar 27 22:10:30 an-node04 ccsd[6229]: cluster.conf (cluster name = an-cluster, version = 8) found. 
Mar 27 22:10:31 an-node04 openais[6235]: [MAIN ] AIS Executive Service RELEASE 'subrev 1887 version 0.80.6' 
Mar 27 22:10:31 an-node04 openais[6235]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors. 
Mar 27 22:10:31 an-node04 openais[6235]: [MAIN ] Copyright (C) 2006 Red Hat, Inc. 
Mar 27 22:10:31 an-node04 openais[6235]: [MAIN ] AIS Executive Service: started and ready to provide service. 
Mar 27 22:10:31 an-node04 openais[6235]: [MAIN ] Using default multicast address of 239.192.122.47 
Mar 27 22:10:31 an-node04 openais[6235]: [TOTEM] Token Timeout (10000 ms) retransmit timeout (495 ms) 
Mar 27 22:10:31 an-node04 openais[6235]: [TOTEM] token hold (386 ms) retransmits before loss (20 retrans) 
Mar 27 22:10:31 an-node04 openais[6235]: [TOTEM] join (60 ms) send_join (0 ms) consensus (2000 ms) merge (200 ms) 
Mar 27 22:10:31 an-node04 openais[6235]: [TOTEM] downcheck (1000 ms) fail to recv const (50 msgs) 
Mar 27 22:10:31 an-node04 openais[6235]: [TOTEM] seqno unchanged const (30 rotations) Maximum network MTU 1402 
Mar 27 22:10:31 an-node04 openais[6235]: [TOTEM] window size per rotation (50 messages) maximum messages per rotation (17 messages) 
Mar 27 22:10:31 an-node04 openais[6235]: [TOTEM] missed count const (5 messages) 
Mar 27 22:10:31 an-node04 openais[6235]: [TOTEM] send threads (0 threads) 
Mar 27 22:10:31 an-node04 openais[6235]: [TOTEM] RRP token expired timeout (495 ms) 
Mar 27 22:10:31 an-node04 openais[6235]: [TOTEM] RRP token problem counter (2000 ms) 
Mar 27 22:10:31 an-node04 openais[6235]: [TOTEM] RRP threshold (10 problem count) 
Mar 27 22:10:31 an-node04 openais[6235]: [TOTEM] RRP mode set to none. 
Mar 27 22:10:31 an-node04 openais[6235]: [TOTEM] heartbeat_failures_allowed (0) 
Mar 27 22:10:31 an-node04 openais[6235]: [TOTEM] max_network_delay (50 ms) 
Mar 27 22:10:31 an-node04 openais[6235]: [TOTEM] HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0 
Mar 27 22:10:31 an-node04 openais[6235]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes). 
Mar 27 22:10:31 an-node04 openais[6235]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). 
Mar 27 22:10:32 an-node04 openais[6235]: [TOTEM] The network interface [192.168.3.74] is now up. 
Mar 27 22:10:32 an-node04 openais[6235]: [TOTEM] Created or loaded sequence id 552.192.168.3.74 for this ring. 
Mar 27 22:10:32 an-node04 openais[6235]: [TOTEM] entering GATHER state from 15. 
Mar 27 22:10:32 an-node04 openais[6235]: [CMAN ] CMAN 2.0.115 (built Nov 11 2010 13:23:08) started 
Mar 27 22:10:32 an-node04 openais[6235]: [MAIN ] Service initialized 'openais CMAN membership service 2.01' 
Mar 27 22:10:32 an-node04 openais[6235]: [SERV ] Service initialized 'openais extended virtual synchrony service' 
Mar 27 22:10:32 an-node04 openais[6235]: [SERV ] Service initialized 'openais cluster membership service B.01.01' 
Mar 27 22:10:32 an-node04 openais[6235]: [SERV ] Service initialized 'openais availability management framework B.01.01' 
Mar 27 22:10:32 an-node04 openais[6235]: [SERV ] Service initialized 'openais checkpoint service B.01.01' 
Mar 27 22:10:32 an-node04 openais[6235]: [SERV ] Service initialized 'openais event service B.01.01' 
Mar 27 22:10:32 an-node04 openais[6235]: [SERV ] Service initialized 'openais distributed locking service B.01.01' 
Mar 27 22:10:32 an-node04 openais[6235]: [SERV ] Service initialized 'openais message service B.01.01' 
Mar 27 22:10:32 an-node04 openais[6235]: [SERV ] Service initialized 'openais configuration service' 
Mar 27 22:10:32 an-node04 openais[6235]: [SERV ] Service initialized 'openais cluster closed process group service v1.01' 
Mar 27 22:10:32 an-node04 openais[6235]: [SERV ] Service initialized 'openais cluster config database access v1.01' 
Mar 27 22:10:32 an-node04 openais[6235]: [SYNC ] Not using a virtual synchrony filter. 
Mar 27 22:10:32 an-node04 openais[6235]: [TOTEM] Creating commit token because I am the rep. 
Mar 27 22:10:32 an-node04 openais[6235]: [TOTEM] Saving state aru 0 high seq received 0 
Mar 27 22:10:32 an-node04 openais[6235]: [TOTEM] Storing new sequence id for ring 22c 
Mar 27 22:10:32 an-node04 openais[6235]: [TOTEM] entering COMMIT state. 
Mar 27 22:10:32 an-node04 openais[6235]: [TOTEM] entering RECOVERY state. 
Mar 27 22:10:32 an-node04 openais[6235]: [TOTEM] position [0] member 192.168.3.74: 
Mar 27 22:10:32 an-node04 openais[6235]: [TOTEM] previous ring seq 552 rep 192.168.3.74 
Mar 27 22:10:32 an-node04 openais[6235]: [TOTEM] aru 0 high delivered 0 received flag 1 
Mar 27 22:10:32 an-node04 openais[6235]: [TOTEM] Did not need to originate any messages in recovery. 
Mar 27 22:10:32 an-node04 openais[6235]: [TOTEM] Sending initial ORF token 
Mar 27 22:10:32 an-node04 openais[6235]: [CLM  ] CLM CONFIGURATION CHANGE 
Mar 27 22:10:32 an-node04 openais[6235]: [CLM  ] New Configuration: 
Mar 27 22:10:32 an-node04 openais[6235]: [CLM  ] Members Left: 
Mar 27 22:10:32 an-node04 openais[6235]: [CLM  ] Members Joined: 
Mar 27 22:10:32 an-node04 openais[6235]: [CLM  ] CLM CONFIGURATION CHANGE 
Mar 27 22:10:32 an-node04 openais[6235]: [CLM  ] New Configuration: 
Mar 27 22:10:32 an-node04 openais[6235]: [CLM  ] 	r(0) ip(192.168.3.74)  
Mar 27 22:10:32 an-node04 openais[6235]: [CLM  ] Members Left: 
Mar 27 22:10:32 an-node04 openais[6235]: [CLM  ] Members Joined: 
Mar 27 22:10:32 an-node04 openais[6235]: [CLM  ] 	r(0) ip(192.168.3.74)  
Mar 27 22:10:32 an-node04 openais[6235]: [SYNC ] This node is within the primary component and will provide service. 
Mar 27 22:10:32 an-node04 openais[6235]: [TOTEM] entering OPERATIONAL state. 
Mar 27 22:10:32 an-node04 openais[6235]: [CMAN ] quorum regained, resuming activity 
Mar 27 22:10:32 an-node04 openais[6235]: [CLM  ] got nodejoin message 192.168.3.74 
Mar 27 22:10:32 an-node04 ccsd[6229]: Initial status:: Quorate 
Mar 27 22:10:33 an-node04 openais[6235]: [TOTEM] entering GATHER state from 11. 
Mar 27 22:10:33 an-node04 openais[6235]: [TOTEM] Creating commit token because I am the rep. 
Mar 27 22:10:33 an-node04 openais[6235]: [TOTEM] Saving state aru e high seq received e 
Mar 27 22:10:33 an-node04 openais[6235]: [TOTEM] Storing new sequence id for ring 234 
Mar 27 22:10:33 an-node04 openais[6235]: [TOTEM] entering COMMIT state. 
Mar 27 22:10:33 an-node04 openais[6235]: [TOTEM] entering RECOVERY state. 
Mar 27 22:10:33 an-node04 openais[6235]: [TOTEM] position [0] member 192.168.3.74: 
Mar 27 22:10:33 an-node04 openais[6235]: [TOTEM] previous ring seq 556 rep 192.168.3.74 
Mar 27 22:10:33 an-node04 openais[6235]: [TOTEM] aru e high delivered e received flag 1 
Mar 27 22:10:33 an-node04 openais[6235]: [TOTEM] position [1] member 192.168.3.75: 
Mar 27 22:10:33 an-node04 openais[6235]: [TOTEM] previous ring seq 560 rep 192.168.3.75 
Mar 27 22:10:33 an-node04 openais[6235]: [TOTEM] aru c high delivered c received flag 1 
Mar 27 22:10:33 an-node04 openais[6235]: [TOTEM] Did not need to originate any messages in recovery. 
Mar 27 22:10:33 an-node04 openais[6235]: [TOTEM] Sending initial ORF token 
Mar 27 22:10:33 an-node04 openais[6235]: [CLM  ] CLM CONFIGURATION CHANGE 
Mar 27 22:10:33 an-node04 openais[6235]: [CLM  ] New Configuration: 
Mar 27 22:10:33 an-node04 openais[6235]: [CLM  ] 	r(0) ip(192.168.3.74)  
Mar 27 22:10:33 an-node04 openais[6235]: [CLM  ] Members Left: 
Mar 27 22:10:33 an-node04 openais[6235]: [CLM  ] Members Joined: 
Mar 27 22:10:33 an-node04 openais[6235]: [CLM  ] CLM CONFIGURATION CHANGE 
Mar 27 22:10:33 an-node04 openais[6235]: [CLM  ] New Configuration: 
Mar 27 22:10:33 an-node04 openais[6235]: [CLM  ] 	r(0) ip(192.168.3.74)  
Mar 27 22:10:33 an-node04 openais[6235]: [CLM  ] 	r(0) ip(192.168.3.75)  
Mar 27 22:10:33 an-node04 openais[6235]: [CLM  ] Members Left: 
Mar 27 22:10:33 an-node04 openais[6235]: [CLM  ] Members Joined: 
Mar 27 22:10:33 an-node04 openais[6235]: [CLM  ] 	r(0) ip(192.168.3.75)  
Mar 27 22:10:33 an-node04 openais[6235]: [SYNC ] This node is within the primary component and will provide service. 
Mar 27 22:10:33 an-node04 openais[6235]: [TOTEM] entering OPERATIONAL state. 
Mar 27 22:10:33 an-node04 openais[6235]: [CLM  ] got nodejoin message 192.168.3.74 
Mar 27 22:10:33 an-node04 openais[6235]: [CLM  ] got nodejoin message 192.168.3.75 
Mar 27 22:10:33 an-node04 openais[6235]: [CPG  ] got joinlist message from node 1

What you see is:

  • The cluster configuration system daemon, ccsd, starts up and reads in /etc/cluster/cluster.conf. It reports the name of the cluster, an-cluster, and the configuration version, 8.
  • OpenAIS then starts up, reports the multicast address it will use, many of its variable values and the IP address it will use for cluster communications.
  • The Cluster Manager, cman, starts and reports the version of various services in use.
  • The totem protocol is started and it forms an initial configuration containing just itself. These messages have the prefix CLM, CLuster Membership.
    • Then it waits to see if the other node will join. On the other node's log, you will see it start off and immediately join with this first node.
  • The initial configuration is sufficient to gain quorum and declares that it will provide services.
  • The second node announces that it wants to join the first node's cluster membership and the cluster reconfigures.

If you see this, then your cluster is up and running. Congratulations!
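
As a quick sanity check, you can also confirm membership and quorum from the command line. This is a sketch using the cman_tool utility that ships with cman; the exact output will vary with your cluster name and node count.

cman_tool status
# Shows the cluster name, configuration version, quorum state and votes.

cman_tool nodes
# Lists the nodes and their membership state; 'M' means the node has joined.

If cman_tool status reports "Quorate" and both nodes show as members, you are in good shape to proceed.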

Setting Up Clustered Storage

The next few steps will cover setting up the DRBD resources, using them in clustered LVM and then creating a GFS2 partition. Next, we will add it all as cluster resources and then create a service for each node to start up all of the clustered storage.

Creating Our DRBD Resources

We're going to create four DRBD resources;

  • A resource to back our shared GFS2 partition which will hold shared files, like our virtual machine configuration files.
  • A resource to back the VMs running primarily on an-node04.
  • A resource to back the VMs running primarily on an-node05.
  • A final resource that will be left alone for future expansion. This is optional, of course.

The "Why" of Our Layout

The reason for this is to minimize the chance of data loss in a split-brain event.

A split-brain occurs when a DRBD resource loses its network link while in Primary/Primary mode. The problem is that, after the split, any write to either node is not replicated to the other node. Thus, after even one byte is written, the DRBD resource is out of sync. Once this happens, there is no real way to automate recovery. You will need to go in and manually flag one side of the resource to discard its changes, then manually re-connect the two sides before the resource will be usable again.
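
To make that manual recovery concrete, here is a rough sketch of what invalidating one side looks like with drbdadm. Treat this as an illustration only; r0 is used as an example resource name, and which node is the "victim" depends entirely on where your good data lives.

# On the node whose changes will be DISCARDED (the victim):
drbdadm secondary r0
drbdadm -- --discard-my-data connect r0

# On the node whose data will be KEPT (the survivor):
drbdadm connect r0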

We will take steps to prevent this, but it is always a possibility with shared storage.

Given then that there is no sure way to avoid this, we're going to mitigate risk by breaking up our DRBD resources so that we can be more selective in choosing what parts to invalidate after a split brain event.

  • The small GFS2 partition will be the hardest to manage. For this reason, it is on its own. For the same reason, we will use it as little as we can, and copies of files we care about will be stored on each node. The main things here are the VM configuration files. These should be written to rarely, so with luck, in a split-brain condition, nothing will have been written to either side and recovery should be arbitrary and simple.
  • The VMs that will primarily run on an-node04 will get their own resource. This way we can simply invalidate the DRBD device on the node that was not running the VMs during the split brain.
  • Likewise, the VMs primarily running on an-node05 will get their own resource. This way, if a split brain happens while VMs are running on both nodes, it should be easy to invalidate the opposing node for the respective DRBD resource.
  • The fourth DRBD resource will just contain free space. This can later be added whole to an existing LVM VG or further divided up as needed in the future.

Visualizing Storage

The layout of our storage is, on the surface, somewhat complex. To help follow what we'll be creating, here is an ASCII drawing showing what it will look like. Note that example VMs are shown, which we will not be creating. This is to help you see where extra VMs would exist if you ran two or more VMs per node.

If you are using RAID, then you can simply replace sdaX with mdX. You can find a tutorial on manually creating RAID devices here:

         [ an-node04 ]
  ______   ______    ______     __[sda4]__
 | sda1 | | sda2 |  | sda3 |   |  ______  |       _______    ______________    ______________________________
 |______| |______|  |______|   | | sda5 |-+------| drbd0 |--| drbd_sh0_vg0 |--| /dev/drbd_sh0_vg0/xen_shared |
     |        |         |      | |______| |   /--|_______|  |______________|  |______________________________|
  ___|___    _|_    ____|____  |  ______  |   |     _______    ______________    ____________________________
 | /boot |  | / |  | <swap>  | | | sda6 |-+---+----| drbd1 |--| drbd_an4_vg0 |--| /dev/drbd_an4_vg0/vm0001_1 |
 |_______|  |___|  |_________| | |______| |   | /--|_______|  |______________|  |____________________________|
                               |  ______  |   | |     _______    ______________    ____________________________
                               | | sda7 |-+---+-+----| drbd2 |--| drbd_an5_vg0 |--| /dev/drbd_an5_vg0/vm0002_1 | 
                               | |______| |   | | /--|_______|  |______________|  |____________________________|
                               |  ______  |   | | |                         | |    _______________________
                               | | sda8 |-+---+-+-+--\                      | \---| Example LV for 2nd VM |
                               | |______| |   | | |  |                      |     |_______________________|
                               |__________|   | | |  |                      |      _______________________
         [ an-node05 ]                        | | |  |                      \-----| Example LV for 3rd VM |
  ______   ______    ______     __[sda4]__    | | |  |                            |_______________________|
 | sda1 | | sda2 |  | sda3 |   |  ______  |   | | |  |                   
 |______| |______|  |______|   | | sda5 |-+---/ | |  |   _______    __________________
     |        |         |      | |______| |     | |  \--| drbd3 |--| Spare PV for     |
  ___|___    _|_    ____|____  |  ______  |     | |  /--|_______|  | future expansion |
 | /boot |  | / |  | <swap>  | | | sda6 |-+-----/ |  |             |__________________|
 |_______|  |___|  |_________| | |______| |       |  |
                               |  ______  |       |  |
                               | | sda7 |-+-------/  |
                               | |______| |          |
                               |  ______  |          |
                               | | sda8 |-+----------/
                               | |______| |
                               |__________|

Modifying the Physical Storage

Warning: Multiple assumptions ahead. If you are comfortable with fdisk (and possibly mdadm), you can largely skip this section. You will need to create four partitions; this tutorial uses a 10 GiB partition for shared files, two 100 GiB partitions and the remainder of the space in the last partition. These will be four logical partitions within an extended partition, /dev/sda5, /dev/sda6, /dev/sda7 and /dev/sda8 respectively.

This tutorial, in the interest of simplicity and not aiming to be a disk management tutorial, uses single-disk storage on each node. If you only have one disk, or if you have hardware RAID, this is sufficient. However, if you have multiple disks and want to use software RAID on your nodes, you will need to create /dev/mdX devices to match the layout we will be creating. Here is a tutorial on managing software RAID arrays, written with this tutorial in mind.

We will need four new partitions; a 10 GiB partition for the GFS2 resource, two 100 GiB partitions for the VMs on either node and the remainder of the disk's free space for the last partition. To do this, we will use the fdisk tool. Be aware: this tool directly edits the hard drive's partition table. This is obviously risky! All along, this tutorial has assumed that you are working on test nodes, but it bears repeating again. Do not do this on a machine with data you care about! At the very least, have a good backup.

Finally, this assumes that you used the kickstart script when setting up your nodes. More to the point, it assumes an existing fourth primary partition which we will delete, convert to an extended partition and then within that create the four usable partitions.

So first, delete the fourth partition.

fdisk /dev/sda
The number of cylinders for this disk is set to 60801.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Confirm that the layout is indeed four partitions.

Command (m for help): p
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          32      257008+  83  Linux
/dev/sda2              33        2643    20972857+  83  Linux
/dev/sda3            2644        3165     4192965   82  Linux swap / Solaris
/dev/sda4            3166       60801   462961170   83  Linux

It is, so let's delete /dev/sda4 and then confirm that it is gone.

Command (m for help): d
Partition number (1-4): 4

Command (m for help): p
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          32      257008+  83  Linux
/dev/sda2              33        2643    20972857+  83  Linux
/dev/sda3            2644        3165     4192965   82  Linux swap / Solaris

It is, so now we'll create the extended partition.

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
e
Selected partition 4
First cylinder (3166-60801, default 3166): <enter>
Using default value 3166
Last cylinder or +size or +sizeM or +sizeK (3166-60801, default 60801): <enter>
Using default value 60801

Again, a quick check to make sure the extended partition is now there.

Command (m for help): p
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          32      257008+  83  Linux
/dev/sda2              33        2643    20972857+  83  Linux
/dev/sda3            2644        3165     4192965   82  Linux swap / Solaris
/dev/sda4            3166       60801   462961170    5  Extended

Finally, let's create the four partitions.

Command (m for help): n
First cylinder (3166-60801, default 3166): 
Using default value 3166
Last cylinder or +size or +sizeM or +sizeK (3166-60801, default 60801): +10G
Command (m for help): n
First cylinder (4383-60801, default 4383): <enter>
Using default value 4383
Last cylinder or +size or +sizeM or +sizeK (4383-60801, default 60801): +100G
Command (m for help): n
First cylinder (16542-60801, default 16542): <enter>
Using default value 16542
Last cylinder or +size or +sizeM or +sizeK (16542-60801, default 60801): +100G
Command (m for help): n
First cylinder (28701-60801, default 28701): <enter>
Using default value 28701
Last cylinder or +size or +sizeM or +sizeK (28701-60801, default 60801): <enter>
Using default value 60801

Finally, check that the four new partitions exist.

Command (m for help): p
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          32      257008+  83  Linux
/dev/sda2              33        2643    20972857+  83  Linux
/dev/sda3            2644        3165     4192965   82  Linux swap / Solaris
/dev/sda4            3166       60801   462961170    5  Extended
/dev/sda5            3166        4382     9775521   83  Linux
/dev/sda6            4383       16541    97667136   83  Linux
/dev/sda7           16542       28700    97667136   83  Linux
/dev/sda8           28701       60801   257851251   83  Linux

We do! So now we'll commit the changes to disk and exit.

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.

WARNING: Re-reading the partition table failed with error 16: Device or resource busy.
The kernel still uses the old table.
The new table will be used at the next reboot.
Syncing disks.
Warning: Repeat the steps on the other node and double-check that the output of fdisk -l /dev/sda shows the same Start and End boundaries. If they do not match, fix this before proceeding.
Note: This was done on the same disk as the host OS, so we'll need to reboot before we can proceed.
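
One way to double-check the two partition tables is to diff the fdisk output directly. This sketch assumes root SSH access from an-node04 to an-node05, as configured earlier in this tutorial; no output means the tables match.

# Compare the local partition table against the peer's; any differing
# Start/End boundaries will show up as diff output.
diff <(fdisk -l /dev/sda) <(ssh root@an-node05 "fdisk -l /dev/sda")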

Creating the DRBD Resources

Now that we have both nodes' storage ready, we can configure and start the DRBD resources. DRBD uses "resource names", which are its internal references to each "array". These names are used whenever you work on a resource using drbdadm or similar tools. The tradition is to name the resources rX, with X being a sequence number starting at 0. The resource itself is made available as a normal /dev/ block device. The tradition is to name this device /dev/drbdX, where X matches the resource's sequence number.

The DRBD Fence Script

Red Hat's Lon Hohberger created a DRBD script called obliterate that allows DRBD to trigger a fence call through the cluster when it detects a split-brain condition. The goal behind this is to stop the resource(s) from being flagged as "split-brain" in the first place, thus avoiding manual recovery.

Download the script below and save it under your /sbin/ directory.

Then ensure that it is executable.

chmod 755 /sbin/obliterate
ls -lah /sbin/obliterate
-rwxr-xr-x 1 root root 2.1K Mar  4 23:44 /sbin/obliterate

Our Desired Layout in Detail

Let's review how we will bring the devices together.

an-node04   an-node05   DRBD Resource   DRBD Device   Size        Note
/dev/sda5   /dev/sda5   r0              /dev/drbd0    10 GB       GFS2 partition for VM configurations and shared files
/dev/sda6   /dev/sda6   r1              /dev/drbd1    100 GB      Host VMs that will primarily run on an-node04
/dev/sda7   /dev/sda7   r2              /dev/drbd2    100 GB      Host VMs that will primarily run on an-node05
/dev/sda8   /dev/sda8   r3              /dev/drbd3    Remainder   Free space that can later be allocated to an existing VG as-is or further divided up into two or more DRBD resources as future needs dictate.

Configuring /etc/drbd.conf

With this plan then, we can now create the /etc/drbd.conf configuration file.

The initial file is very sparse;

cat /etc/drbd.conf
#
# please have a a look at the example configuration file in
# /usr/share/doc/drbd83/drbd.conf
#

Setting up the 'global' Directive

There are a lot of options available to you, many of which are outside the scope of this tutorial. You can get a good overview of all options by reading the man page; man drbd.conf.

The first section we will add is the global { } directive. There is only one argument we will set, which tells DRBD that it may count our install in Linbit's usage statistics. If you have privacy concerns, set this to no.

# The 'global' directive covers values that apply to DRBD in general.
global {
        # This tells Linbit that it's okay to count us as a DRBD user. If you
        # have privacy concerns, set this to 'no'.
        usage-count     yes;
}

Setting up the 'common' Directive

The next directive is common { }. This sets values to be used on all DRBD resources by default. You can override common values in any given resource directive later.

The example below is well documented, so please take a moment to read through the comments.

# The 'common' directive sets defaults values for all resources.
common {
        # Protocol 'C' tells DRBD to not report a disk write as complete until
        # it has been confirmed written to both nodes. This is required for
        # Primary/Primary use.
        protocol C;

        # This sets the default sync rate to 15 MiB/sec. Be careful about
        # setting this too high! High speed sync'ing can flog your drives and
        # push disk I/O times very high.
        syncer {
                rate 15M;
        }
        
        # This tells DRBD what policy to use when a fence is required.
        disk {
                # This tells DRBD to block I/O (resource) and then try to fence
                # the other node (stonith). The 'stonith' option requires that
                # we set a fence handler below. The name 'stonith' comes from
                # "Shoot The Other Node In The Head" and is a term used in
                # other clustering environments. It is synonymous with
                # 'fence'.
                fencing         resource-and-stonith;
        }

        # We set 'stonith' above, so here we tell DRBD how to actually fence
        # the other node.
        handlers {
                # The term 'outdate-peer' comes from other scripts that flag
                # the other node's resource backing device as 'Inconsistent'.
                # In our case though, we're flat-out fencing the other node,
                # which has the same effective result.
                outdate-peer    "/sbin/obliterate";
        }

        # Here we tell DRBD that we want to use Primary/Primary mode. It is
        # also where we define split-brain (sb) recovery policies. As we'll be
        # running all of our resources in Primary/Primary, only the
        # 'after-sb-2pri' really means anything to us.
        net {
                # Tell DRBD to allow dual-primary.
                allow-two-primaries;

                # Set the recovery policy for split-brain recovery when no device
                # in the resource was primary.
                after-sb-0pri   discard-zero-changes;

                # Now if one device was primary.
                after-sb-1pri   discard-secondary;

                # Finally, set the policy when both nodes were Primary. The
                # only viable option is 'disconnect', which tells DRBD to
                # simply tear-down the DRBD resource right away and wait for
                # the administrator to manually invalidate one side of the
                # resource.
                after-sb-2pri   disconnect;
        }

        # This tells DRBD what to do when the resource starts.
        startup {
                # In our case, we're telling DRBD to promote both devices in
                # our resource to Primary on start.
                become-primary-on       both;
        }
}

Let's stop for a moment and talk about DRBD synchronization.

A DRBD resource does not have to be synced before it can be made Primary/Primary. For this reason, the default sync rate for DRBD is very, very low (320 KiB/sec). This means that you can normally start your DRBD in Primary/Primary on both nodes and get to work while the synchronization putters along in the background.

However!

If the UpToDate node goes down, the surviving Inconsistent node will demote to Secondary, thus becoming unusable. In a high-availability environment like ours, this is pretty useless. So for this reason we will want to get the resources in sync as fast as possible. Likewise, while a node is sync'ing, we will not be able to run the VMs on the Inconsistent node.

The temptation then is to set the rate above to the maximum write speed of our disks. This is a bad idea!

We will have four separate resources sharing the same underlying disks. If you drive the sync rate very high, I/O on the other UpToDate resources will be severely impacted. So much so that I've seen crashes caused by this. So you will want to keep this value at a sane level. That is, you will want to set the rate as high as you can while still leaving the disks themselves sufficiently unburdened that other I/O remains feasible. I've personally found 15M on single-drive and simple RAID machines to be a good value. Feel free to experiment for yourself.
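
If you want to experiment with sync speeds without editing /etc/drbd.conf each time, DRBD lets you change the rate on a running resource. This is a sketch; the 30M figure is just an example value, and r0/drbd0 is our example resource.

# Temporarily raise the sync rate for resource r0; takes effect immediately.
drbdsetup /dev/drbd0 syncer -r 30M

# Revert to the rate defined in /etc/drbd.conf.
drbdadm adjust r0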

Setting up the Resource Directives

We now define the resources themselves. Each resource will be contained in a directive called resource x, where x is the actual resource name (r0, r1, r2 and r3 in our case). Within this directive, all resource-specific options are set.

The example below is well documented, so please take a moment to look at the example for r0.

# The 'resource' directive defines a given resource and must be followed by the
# resource's name.
# This will be used as the GFS2 partition for shared files.
resource r0 {
        # This is the /dev/ device to create to make available this DRBD
        # resource.
        device          /dev/drbd0;

        # This tells DRBD where to store its internal state information. We
        # will use 'internal', which tells DRBD to store the information at the
        # end of the resource's space.
        meta-disk       internal;

        # The next two 'on' directives setup each individual node's settings.
        # The value after the 'on' directive *MUST* match the output of
        # `uname -n` on each node.
        on an-node04.alteeve.com {
                # This is the network IP address on the network interface and
                # the TCP port to use for communication between the nodes. Note
                # that the IP address below is on our Storage Network. The TCP
                # port must be unique per resource, but the interface itself
                # can be shared. 
                # IPv6 is usable with 'address ipv6 [address]:port'.
                address         192.168.2.74:7789;

                # This is the node's storage device that will back this
                # resource.
                disk            /dev/sda5;
        }

        # Same as above, but altered to reflect the second node.
        on an-node05.alteeve.com {
                address         192.168.2.75:7789;
                disk            /dev/sda5;
        }
}

The r1, r2 and r3 resources should be nearly identical to the example above. The main differences will be the device value and the values within each node's on x { } directive. We will increment the TCP ports to 7790, 7791 and 7792 respectively. Likewise, we will alter the disk to /dev/sda6, /dev/sda7 and /dev/sda8 respectively. Finally, the device will be incremented to /dev/drbd1, /dev/drbd2 and /dev/drbd3 respectively.

Housekeeping Before Starting Our DRBD Resources

Let's take a look at the complete /etc/drbd.conf file, validate it for use and then push it to the second node.

The Finished /etc/drbd.conf File

The finished /etc/drbd.conf file should look more or less like this:

#
# please have a a look at the example configuration file in
# /usr/share/doc/drbd83/drbd.conf
#

# The 'global' directive covers values that apply to DRBD in general.
global {
	# This tells Linbit that it's okay to count us as a DRBD user. If you
	# have privacy concerns, set this to 'no'.
	usage-count	yes;
}

# The 'common' directive sets defaults values for all resources.
common {
	# Protocol 'C' tells DRBD to not report a disk write as complete until
	# it has been confirmed written to both nodes. This is required for
	# Primary/Primary use.
        protocol	C;

	# This sets the default sync rate to 15 MiB/sec. Be careful about
	# setting this too high! High speed sync'ing can flog your drives and
	# push disk I/O times very high.
        syncer {
                rate	15M;
        }
	
	# This tells DRBD what policy to use when a fence is required.
        disk {
		# This tells DRBD to block I/O (resource) and then try to fence
		# the other node (stonith). The 'stonith' option requires that
		# we set a fence handler below. The name 'stonith' comes from
		# "Shoot The Other Node In The Head" and is a term used in
		# other clustering environments. It is synonymous with
		# 'fence'.
                fencing		resource-and-stonith;
        }

	# We set 'stonith' above, so here we tell DRBD how to actually fence
	# the other node.
        handlers {
		# The term 'outdate-peer' comes from other scripts that flag
		# the other node's resource backing device as 'Inconsistent'.
		# In our case though, we're flat-out fencing the other node,
		# which has the same effective result.
                outdate-peer	"/sbin/obliterate";
        }
	
	# Here we tell DRBD that we want to use Primary/Primary mode. It is
	# also where we define split-brain (sb) recovery policies. As we'll be
	# running all of our resources in Primary/Primary, only the
	# 'after-sb-2pri' really means anything to us.
        net {
		# Tell DRBD to allow dual-primary.
                allow-two-primaries;

		# Set the recovery policy for split-brain recovery when no device
		# in the resource was primary.
                after-sb-0pri	discard-zero-changes;

		# Now if one device was primary.
                after-sb-1pri	discard-secondary;

		# Finally, set the policy when both nodes were Primary. The
		# only viable option is 'disconnect', which tells DRBD to
		# simply tear-down the DRBD resource right away and wait for
		# the administrator to manually invalidate one side of the
		# resource.
                after-sb-2pri	disconnect;
        }
	
	# This tells DRBD what to do when the resource starts.
        startup {
		# In our case, we're telling DRBD to promote both devices in
		# our resource to Primary on start.
                become-primary-on 	both;
        }
}

# The 'resource' directive defines a given resource and must be followed by the
# resource's name.
# This will be used as the GFS2 partition for shared files.
resource r0 {
	# This is the /dev/ device to create to make available this DRBD
	# resource.
        device 		/dev/drbd0;
	
	# This tells DRBD where to store its internal state information. We
	# will use 'internal', which tells DRBD to store the information at the
	# end of the resource's space.
        meta-disk 	internal;
	
	# The next two 'on' directives setup each individual node's settings.
	# The value after the 'on' directive *MUST* match the output of
	# `uname -n` on each node.
        on an-node04.alteeve.com {
		# This is the network IP address on the network interface and
		# the TCP port to use for communication between the nodes. Note
		# that the IP address below is on our Storage Network. The TCP
		# port must be unique per resource, but the interface itself
		# can be shared. 
		# IPv6 is usable with 'address ipv6 [address]:port'.
                address 	192.168.2.74:7789;
		
		# This is the node's storage device that will back this
		# resource.
                disk    	/dev/sda5;
        }
	
	# Same as above, but altered to reflect the second node.
        on an-node05.alteeve.com {
                address 	192.168.2.75:7789;
                disk    	/dev/sda5;
        }
}

# This will be used to host VMs running primarily on an-node04.
resource r1 {
        device          /dev/drbd1;

        meta-disk       internal;

        on an-node04.alteeve.com {
                address         192.168.2.74:7790;
                disk            /dev/sda6;
        }

        on an-node05.alteeve.com {
                address         192.168.2.75:7790;
                disk            /dev/sda6;
        }
}

# This will be used to host VMs running primarily on an-node05.
resource r2 {
        device          /dev/drbd2;

        meta-disk       internal;

        on an-node04.alteeve.com {
                address         192.168.2.74:7791;
                disk            /dev/sda7;
        }

        on an-node05.alteeve.com {
                address         192.168.2.75:7791;
                disk            /dev/sda7;
        }
}

# This will be set aside as free space for future expansion.
resource r3 {
        device          /dev/drbd3;

        meta-disk       internal;

        on an-node04.alteeve.com {
                address         192.168.2.74:7792;
                disk            /dev/sda8;
        }

        on an-node05.alteeve.com {
                address         192.168.2.75:7792;
                disk            /dev/sda8;
        }
}

Validating the /etc/drbd.conf Syntax

To check for errors, we will validate the /etc/drbd.conf file. To do this, run drbdadm dump. If there are syntax errors, fix them before proceeding. Once the file is correct, drbdadm will dump its view of the configuration to the screen with minimal commenting. Don't worry about slight differences (ie: meta-disk internal; being inside the on { } directives).

The first time you ever do this, you will also see a note telling you that you are the nth DRBD user.

drbdadm dump
  --==  Thank you for participating in the global usage survey  ==--
The server's response is:

you are the 8286th user to install this version
# /etc/drbd.conf
common {
    protocol               C;
    net {
        allow-two-primaries;
        after-sb-0pri    discard-zero-changes;
        after-sb-1pri    discard-secondary;
        after-sb-2pri    disconnect;
    }
    disk {
        fencing          resource-and-stonith;
    }
    syncer {
        rate             15M;
    }
    startup {
        become-primary-on both;
    }
    handlers {
        fence-peer       /sbin/obliterate;
    }
}

# resource r0 on an-node04.alteeve.com: not ignored, not stacked
resource r0 {
    on an-node04.alteeve.com {
        device           /dev/drbd0 minor 0;
        disk             /dev/sda5;
        address          ipv4 192.168.2.74:7789;
        meta-disk        internal;
    }
    on an-node05.alteeve.com {
        device           /dev/drbd0 minor 0;
        disk             /dev/sda5;
        address          ipv4 192.168.2.75:7789;
        meta-disk        internal;
    }
}

# resource r1 on an-node04.alteeve.com: not ignored, not stacked
resource r1 {
    on an-node04.alteeve.com {
        device           /dev/drbd1 minor 1;
        disk             /dev/sda6;
        address          ipv4 192.168.2.74:7790;
        meta-disk        internal;
    }
    on an-node05.alteeve.com {
        device           /dev/drbd1 minor 1;
        disk             /dev/sda6;
        address          ipv4 192.168.2.75:7790;
        meta-disk        internal;
    }
}

# resource r2 on an-node04.alteeve.com: not ignored, not stacked
resource r2 {
    on an-node04.alteeve.com {
        device           /dev/drbd2 minor 2;
        disk             /dev/sda7;
        address          ipv4 192.168.2.74:7791;
        meta-disk        internal;
    }
    on an-node05.alteeve.com {
        device           /dev/drbd2 minor 2;
        disk             /dev/sda7;
        address          ipv4 192.168.2.75:7791;
        meta-disk        internal;
    }
}

# resource r3 on an-node04.alteeve.com: not ignored, not stacked
resource r3 {
    on an-node04.alteeve.com {
        device           /dev/drbd3 minor 3;
        disk             /dev/sda8;
        address          ipv4 192.168.2.74:7792;
        meta-disk        internal;
    }
    on an-node05.alteeve.com {
        device           /dev/drbd3 minor 3;
        disk             /dev/sda8;
        address          ipv4 192.168.2.75:7792;
        meta-disk        internal;
    }
}

Copying The /etc/drbd.conf to the Second Node

Assuming you wrote the first /etc/drbd.conf file on an-node04, we now need to copy it to an-node05 before we can start things up.

rsync -av /etc/drbd.conf root@an-node05:/etc/
building file list ... done
drbd.conf

sent 5552 bytes  received 48 bytes  11200.00 bytes/sec
total size is 5454  speedup is 0.97

Loading the DRBD Module

By default, the /etc/init.d/drbd initialization script handles loading and unloading the drbd module. It's too early for us to start the DRBD resources using the initialization script, so we need to manually load the module ourselves. This will only need to be done once. After you get the DRBD resources up for the first time, you can safely use /etc/init.d/drbd.

To load the module, run:

modprobe drbd

You can verify that the module is loaded using lsmod.

lsmod |grep drbd
drbd                  277144  0

The module also creates a /proc file called drbd. By cat'ing this, we can watch the progress of our work. I'd recommend opening a terminal window for each node and tracking it using watch.

watch cat /proc/drbd
Every 2.0s: cat /proc/drbd                                                                     Tue Mar 29 13:03:44 2011

version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by mockbuild@builder10.centos.org, 2010-06-04 08:04:27

In the steps ahead, I will show what the output from watch'ing /proc/drbd will be.

Initializing Our Resources

Before we can start each resource, we must first initialize each of the backing device partitions. This is done by running drbdadm create-md x. We'll run this on both nodes, replacing x with the four resource names.

The first time you do this, the command will execute right away.

drbdadm create-md r0
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success

If you've ever used the partition in a DRBD device before though, you will need to confirm that you want to over-write the existing meta-data.

drbdadm create-md r0

Type yes when prompted.

You want me to create a v08 style flexible-size internal meta data block.
There appears to be a v08 flexible-size internal meta data block
already in place on /dev/sda5 at byte offset 10010128384
Do you really want to overwrite the existing v08 meta-data?
[need to type 'yes' to confirm] yes
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.

Repeat for all four resource names, then do the same on the other node.
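If you'd rather not type the command four times per node, the repetition can be scripted. This is a minimal sketch that only prints the commands so you can review them before running anything; it assumes the resource names r0 through r3 from our /etc/drbd.conf. Pipe the output to sh on each node to execute.

```shell
# Print the metadata-initialization commands for review; run on both nodes.
# Assumption: resource names r0..r3 as defined in our /etc/drbd.conf.
for res in r0 r1 r2 r3; do
    echo "drbdadm create-md $res"
done
```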

Initial Connections

As this is the first time that the DRBD resource will be started, neither side will be in a consistent state. The effect is that we will not be able to promote either node to Primary. So we need to tell DRBD that it must consider one side to be valid and, thus, overwrite the other node's data.

Note: This is the only time you should ever use --overwrite-data-of-peer! Never use it to recover from a split brain.

The steps we will now take for each resource are:

  • Attach each node's backing device to the DRBD resource.
  • Establish the network connection between the two nodes.
  • Force one node's backing device to be considered UpToDate and promote it to Primary.
  • Promote the second node to Primary.
  • Bump the synchronization rate to the value specified in /etc/drbd.conf.

Now let's walk through these steps, taking a look at /proc/drbd after each step.

On Both Nodes:

drbdadm attach r0
version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by mockbuild@builder10.centos.org, 2010-06-04 08:04:27
 0: cs:StandAlone ro:Secondary/Unknown ds:Inconsistent/Inconsistent   r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:9775184

On Both Nodes: Note that while one node is connecting, its /proc/drbd will show the resource as being in the connection state of WFConnection.

drbdadm connect r0
version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by mockbuild@builder10.centos.org, 2010-06-04 08:04:27
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:9775184

On One Node Only: As the resource is totally new, it's entirely arbitrary which node we run this on. I run this on an-node04 out of habit.

drbdadm -- --overwrite-data-of-peer primary r0
version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by mockbuild@builder10.centos.org, 2010-06-04 08:04:27
 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
    ns:4288 nr:0 dw:0 dr:4288 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:9770896
        [>....................] sync'ed:  0.1% (9540/9544)M delay_probe: 3951
        finish: 8:28:54 speed: 304 (284) K/sec

On The Other Node:

drbdadm primary r0
version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by mockbuild@builder10.centos.org, 2010-06-04 08:04:27
 0: cs:SyncTarget ro:Primary/Primary ds:Inconsistent/UpToDate C r----
    ns:0 nr:17952 dw:17952 dr:0 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:9757232
        [>....................] sync'ed:  0.3% (9528/9544)M queue_delay: 0.4 ms
        finish: 13:33:06 speed: 168 (260) want: 0 K/sec

Finally, tell DRBD to use the defined synchronization rate.

On One Node: (either one is fine)

drbdadm syncer r0
version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by mockbuild@builder10.centos.org, 2010-06-04 08:04:27
 0: cs:SyncSource ro:Primary/Primary ds:UpToDate/Inconsistent C r----
    ns:1527744 nr:0 dw:0 dr:1527744 al:0 bm:93 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:8247440
        [==>.................] sync'ed: 15.7% (8052/9544)M delay_probe: 174899
        finish: 0:09:32 speed: 14,272 (1,900) K/sec

That's it, your resource is ready for use! You do not need to wait for the sync to complete before proceeding. However, ensure that the sync is complete before bringing up VMs on the Inconsistent side.

Repeat these steps for r1, r2 and r3.
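The five steps above can be previewed for the remaining resources with a short loop. This is a hedged sketch that only prints the commands, nothing is executed; as the comments note, the attach and connect lines are run on both nodes, the --overwrite-data-of-peer line on one node only, and the plain primary on the other. The resource names r1 through r3 are taken from our drbd.conf.

```shell
# Preview the per-resource bring-up sequence; review, then run each
# command by hand on the node(s) the comments indicate.
for res in r1 r2 r3; do
    echo "# both nodes:"
    echo "drbdadm attach $res"
    echo "drbdadm connect $res"
    echo "# first node only:"
    echo "drbdadm -- --overwrite-data-of-peer primary $res"
    echo "# second node only:"
    echo "drbdadm primary $res"
    echo "# either node:"
    echo "drbdadm syncer $res"
done
```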

A Note on Synchronization Speeds

As discussed earlier while configuring /etc/drbd.conf, we do not want to have the sync rate set too high. However, if you know that the disk(s) backing your DRBD resource will not be in use for a while, then you can temporarily drive up the sync rate. This can also be used in reverse. If you expect periods of high disk I/O, you can use this same command to temporarily throttle synchronization.

The command to raise the sync rate is below. Note that drbdsetup /dev/drbdX is used here.

drbdsetup /dev/drbd0 syncer -r 40M

Note that the transfer speed will not instantly reach maximum. It takes some time for synchronization rate changes to ramp up and down.

To restore it back to the rate set in /etc/drbd.conf, run:

drbdadm syncer r0

Confirming All Resources Are Ready

After you step through all of the resources, you should see something like this in /proc/drbd:

version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by mockbuild@builder10.centos.org, 2010-06-04 08:04:27
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:SyncSource ro:Primary/Primary ds:UpToDate/Inconsistent C r----
    ns:8804352 nr:0 dw:0 dr:8804352 al:0 bm:537 lo:0 pe:39 ua:0 ap:0 ep:1 wo:b oos:84306260
        [>...................] sync'ed:  9.5% (82328/90924)M delay_probe: 1209
        finish: 1:22:10 speed: 17,040 (14,500) K/sec
 2: cs:SyncSource ro:Primary/Primary ds:UpToDate/Inconsistent C r----
    ns:8798208 nr:0 dw:0 dr:8798208 al:0 bm:536 lo:1 pe:0 ua:0 ap:0 ep:1 wo:b oos:84851828
        [>...................] sync'ed:  9.4% (82860/91452)M delay_probe: 1208
        finish: 1:22:42 speed: 17,024 (14,492) K/sec
 3: cs:SyncSource ro:Primary/Primary ds:UpToDate/Inconsistent C r----
    ns:8793600 nr:0 dw:0 dr:8793600 al:0 bm:536 lo:48 pe:48 ua:48 ap:0 ep:1 wo:b oos:244971660
        [>....................] sync'ed:  3.5% (239228/247816)M delay_probe: 1208
        finish: 4:28:36 speed: 15,128 (14,484) K/sec

So long as all resources are Primary/Primary, you can proceed.

Note: In the above output, three resources are running at 15 MB/sec. This will be hammering the backing disks hard, so I expect high I/O latency. In this case, I am ok with this as I am still working on setup.

If you wanted to make the synchronization more efficient, you could pause sync on all but one resource at a time. For example, above I see that r0 is fully UpToDate/UpToDate already, so I will ignore it. To speed up sync, I'll pause sync on r2 and r3, then push the sync rate up to 40 MB/sec on r1.

drbdadm pause-sync r2
drbdadm pause-sync r3
drbdsetup /dev/drbd1 syncer -r 40M

Again, note that it takes some time for the synchronization speed to ramp up.

version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by mockbuild@builder10.centos.org, 2010-06-04 08:04:27
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:SyncSource ro:Primary/Primary ds:UpToDate/Inconsistent C r----
    ns:17010688 nr:0 dw:0 dr:17010688 al:0 bm:1038 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:76098676
        [==>.................] sync'ed: 18.3% (74312/90924)M delay_probe: 2065
        finish: 0:34:05 speed: 36,832 (16,756) K/sec
 2: cs:PausedSyncS ro:Primary/Primary ds:UpToDate/Inconsistent C r--u-
    ns:13314048 nr:0 dw:0 dr:13314048 al:0 bm:812 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:80335988
 3: cs:PausedSyncS ro:Primary/Primary ds:UpToDate/Inconsistent C r--u-
    ns:13303296 nr:0 dw:0 dr:13303296 al:0 bm:811 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:240460428

Once the sync of r1 is complete, you will want to reset it back to the rate defined in drbd.conf, resume sync on r2 and then bump it up to 40 MB/sec.

drbdadm syncer r1
drbdadm resume-sync r2
drbdsetup /dev/drbd2 syncer -r 40M
version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by mockbuild@builder10.centos.org, 2010-06-04 08:04:27
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----
    ns:27016308 nr:0 dw:0 dr:27016308 al:0 bm:1649 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
 2: cs:SyncSource ro:Primary/Primary ds:UpToDate/Inconsistent C r----
    ns:259616 nr:0 dw:0 dr:259616 al:0 bm:15 lo:79 pe:10 ua:79 ap:0 ep:1 wo:b oos:80086932
        [>....................] sync'ed:  0.4% (78208/78460)M delay_probe: 32
        finish: 0:20:43 speed: 64,192 (14,404) K/sec
 3: cs:PausedSyncS ro:Primary/Primary ds:UpToDate/Inconsistent C r--u-
    ns:210432 nr:0 dw:0 dr:210432 al:0 bm:12 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:240265868

Same again for r3, after which you will reset r3's sync rate and your resources will be fully UpToDate/UpToDate.
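If you prefer to plan the whole serialized sync up front, the pause/raise/restore sequence can be printed out for review. This is a sketch under the assumption that minor numbers match resource names (r1 is /dev/drbd1, and so on); resuming a resource that was never paused should be harmless, and nothing here is executed.

```shell
# Print the serialized initial-sync plan: pause r2 and r3, then sync
# one resource at a time at 40M, restoring the configured rate after each.
for n in 2 3; do echo "drbdadm pause-sync r$n"; done
for n in 1 2 3; do
    echo "drbdadm resume-sync r$n"
    echo "drbdsetup /dev/drbd$n syncer -r 40M"
    echo "# ...wait for r$n to reach UpToDate/UpToDate, then:"
    echo "drbdadm syncer r$n"
done
```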

Note: For the rest of the tutorial, we will be ignoring r3 as it's just a bank of spare disk space.

Setting Up Clustered LVM

This step will have us create three LVM physical volumes, one for each of the allocated DRBD resources, and then creating three separate volume groups. At this stage, the only logical volume we will create will be for the GFS2 partition. The rest of the LVs will be created later when we provision virtual machines.

Modifying /etc/lvm/lvm.conf

There are four main things we're going to change in the LVM configuration file.

  • Change the filter to only see /dev/drbd* devices. Otherwise, LVM will see signatures on both the DRBD resources and the backing /dev/sd* devices, which will cause confusion.
  • Change the locking type to clustered locking.
  • Disable clustered locking from falling back to local locking.
  • We'll be identifying our clustered VGs and LVs using LVM tags. This tag will be defined using the volume_list variable.

The first step is trivial. Simply alter locking_type = 1 to locking_type = 3. This is about line 279.

vim /etc/lvm/lvm.conf
    # Type of locking to use. Defaults to local file-based locking (1).
    # Turn locking off by setting to 0 (dangerous: risks metadata corruption
    # if LVM2 commands get run concurrently).
    # Type 2 uses the external shared library locking_library.
    # Type 3 uses built-in clustered locking.
    # Type 4 uses read-only locking which forbids any operations that might 
    # change metadata.
    locking_type = 3

Next, restrict the filtering so that it only sees the DRBD resources. This is done by changing the filter variable from filter = [ "a/.*/" ] to filter = [ "a|/dev/drbd*|", "r/.*/" ]. What this does is tell LVM to accept devices matching /dev/drbd* and to reject all other devices. This is about line 53.

    # By default we accept every block device:
    filter = [ "a|/dev/drbd*|", "r/.*/" ]

Now, we'll disable falling back to local locking. The reasoning is that if the cluster lock manager, DLM, is not available, then we don't want to touch the storage at all. This is done by changing fallback_to_local_locking from 1 to 0. This is found around line 294.

    # If an attempt to initialise type 2 or type 3 locking failed, perhaps
    # because cluster components such as clvmd are not running, with this set
    # to 1 an attempt will be made to use local file-based locking (type 1).
    # If this succeeds, only commands against local volume groups will proceed.
    # Volume Groups marked as clustered will be ignored.
    fallback_to_local_locking = 0

Finally, tell LVM to activate only volumes with our tag by setting volume_list to ["@an-cluster"]. This is found around line 356.

    # If volume_list is defined, each LV is only activated if there is a
    # match against the list.
    #   "vgname" and "vgname/lvname" are matched exactly.
    #   "@tag" matches any tag set in the LV or VG.
    #   "@*" matches if any tag defined on the host is also set in the LV or VG
    #
    # volume_list = [ "vg1", "vg2/lvol1", "@tag1", "@*" ]
    volume_list = ["@an-cluster"]

Save the file and copy it to the second node.

rsync -av /etc/lvm/lvm.conf root@an-node05:/etc/lvm/
building file list ... done
lvm.conf

sent 879 bytes  received 210 bytes  726.00 bytes/sec
total size is 19116  speedup is 17.55

You're done. Normally we'd want to tell LVM to rescan for PVs, VGs and LVs but at this stage there are none.

Starting the clvmd Daemon

Before we proceed, we need to start the clustered LVM daemon, clvmd. This requires that the cluster is already running. So if you stopped the cluster, start it on both nodes before starting clvmd.

cman_tool status
Version: 6.2.0
Config Version: 9
Cluster Name: an-cluster
Cluster Id: 31412
Cluster Member: Yes
Cluster Generation: 592
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Quorum: 1  
Active subsystems: 7
Flags: 2node Dirty 
Ports Bound: 0  
Node name: an-node04.alteeve.com
Node ID: 1
Multicast addresses: 239.192.122.47 
Node addresses: 192.168.3.74
Note: The version incremented after the last example when I edited the cluster.conf to have my real passwords.

So now we see that the cluster is up on both nodes (Nodes: 2), so we can start the clustered LVM daemon.

/etc/init.d/clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs:                                            [  OK  ]
Note: At this stage, the cluster does not start at boot, so we can't start clvmd at boot yet, either. We'll do this at the end of the tutorial, so for now, disable clvmd and start it manually after starting cman when you first start your cluster.
chkconfig clvmd off
chkconfig --list clvmd
clvmd          	0:off	1:off	2:off	3:off	4:off	5:off	6:off

Turning Our DRBD Resources Into LVM Physical Volumes

Note: Now that DRBD is in use, commands will only need to be executed on one node and the changes should be immediately seen on the second node.

Creating LVM physical volumes is a trivial task. Simply run pvcreate /dev/drbdX.

On One Node:

pvcreate /dev/drbd0
  Physical volume "/dev/drbd0" successfully created

On The Other Node:

pvdisplay
  "/dev/drbd0" is a new physical volume of "9.32 GB"
  --- NEW Physical volume ---
  PV Name               /dev/drbd0
  VG Name               
  PV Size               9.32 GB
  Allocatable           NO
  PE Size (KByte)       0
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               KLYDPe-rMhP-b92M-NEwS-6Pyl-BTzR-fpKMFY

There you go, a clustered PV visible on both nodes! Now repeat the process for /dev/drbd1 and /dev/drbd2. When you're done, pvdisplay should be similar to the output below.

pvdisplay
  "/dev/drbd0" is a new physical volume of "9.32 GB"
  --- NEW Physical volume ---
  PV Name               /dev/drbd0
  VG Name               
  PV Size               9.32 GB
  Allocatable           NO
  PE Size (KByte)       0
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               KLYDPe-rMhP-b92M-NEwS-6Pyl-BTzR-fpKMFY
   
  "/dev/drbd1" is a new physical volume of "93.14 GB"
  --- NEW Physical volume ---
  PV Name               /dev/drbd1
  VG Name               
  PV Size               93.14 GB
  Allocatable           NO
  PE Size (KByte)       0
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               mcFNMH-1rfm-JWpG-9aME-Mnys-Iowc-71PRMz
   
  "/dev/drbd2" is a new physical volume of "93.14 GB"
  --- NEW Physical volume ---
  PV Name               /dev/drbd2
  VG Name               
  PV Size               93.14 GB
  Allocatable           NO
  PE Size (KByte)       0
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               NxFZKn-cT21-2AVc-2Ife-muVr-GwKS-YNnDKJ

Before proceeding, be sure to have LVM rescan for the new PVs so that its cache is up to date.

pvscan
...
Note: There is nothing showing in VG Name yet, as we've not created any VGs. Re-run pvdisplay after the VGs are created and you will see them show up. Be aware that a PV can only belong to one VG at a time.
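As with the DRBD metadata earlier, the three pvcreate calls can be generated with a loop. A minimal sketch that only prints the commands for review (it assumes the DRBD minor numbers 0 through 2 used in this tutorial):

```shell
# Print the pvcreate commands for the three DRBD-backed devices.
for n in 0 1 2; do
    echo "pvcreate /dev/drbd$n"
done
```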

Creating Volume Groups

LVM allows for a given VG to have multiple PVs assigned to it. In our case though, each PV has a specific purpose so we will be creating three independent VGs.

Creating VGs is somewhat less trivial compared to creating the PVs. There are a few extra bits that need to be specified when the volume groups are created. The extra bits are:

  • We will explicitly tell LVM that these are clustered VGs via -c y (--clustered yes).
  • We will create a tag that we will use to identify all clustered VGs. The tag I use is an-cluster, though you are free to use something else. This is applied via --addtag @an-cluster.
  • Each VG needs a unique name which will become part of the /dev/vg_name/lv_name path. The name you choose should make sense to you. The names used in this tutorial are shown in the table below.
VG name PV used Note
drbd_sh0_vg0 /dev/drbd0 This shared VG will host the lone logical volume on which we will create the GFS2 partition.
drbd_an4_vg0 /dev/drbd1 This VG will host the LVs backing the virtual machines designed to normally operate on an-node04.
drbd_an5_vg0 /dev/drbd2 This VG will host the LVs backing the virtual machines designed to normally operate on an-node05.

So then, the commands to create these VGs will be as follows.

On One Node:

vgcreate -c y --addtag @an-cluster drbd_sh0_vg0 /dev/drbd0
  Clustered volume group "drbd_sh0_vg0" successfully created
vgcreate -c y --addtag @an-cluster drbd_an4_vg0 /dev/drbd1
  Clustered volume group "drbd_an4_vg0" successfully created
vgcreate -c y --addtag @an-cluster drbd_an5_vg0 /dev/drbd2
  Clustered volume group "drbd_an5_vg0" successfully created

On The Other Node: You can verify that the VGs are visible on the second node with vgdisplay

vgdisplay -v
    Finding all volume groups
    Finding volume group "drbd_an5_vg0"
  --- Volume group ---
  VG Name               drbd_an5_vg0
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  Clustered             yes
  Shared                no
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               93.14 GB
  PE Size               4.00 MB
  Total PE              23843
  Alloc PE / Size       0 / 0   
  Free  PE / Size       23843 / 93.14 GB
  VG UUID               GEwIte-f91k-3LL1-e7s5-THjX-3bGB-M2IOKQ
   
  --- Physical volumes ---
  PV Name               /dev/drbd2     
  PV UUID               NxFZKn-cT21-2AVc-2Ife-muVr-GwKS-YNnDKJ
  PV Status             allocatable
  Total PE / Free PE    23843 / 23843
   
    Finding volume group "drbd_an4_vg0"
  --- Volume group ---
  VG Name               drbd_an4_vg0
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  Clustered             yes
  Shared                no
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               93.14 GB
  PE Size               4.00 MB
  Total PE              23843
  Alloc PE / Size       0 / 0   
  Free  PE / Size       23843 / 93.14 GB
  VG UUID               80ckhY-aQLF-nVYL-e8jM-PJ1P-uwXg-m4yrHo
   
  --- Physical volumes ---
  PV Name               /dev/drbd1     
  PV UUID               mcFNMH-1rfm-JWpG-9aME-Mnys-Iowc-71PRMz
  PV Status             allocatable
  Total PE / Free PE    23843 / 23843
   
    Finding volume group "drbd_sh0_vg0"
  --- Volume group ---
  VG Name               drbd_sh0_vg0
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  Clustered             yes
  Shared                no
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               9.32 GB
  PE Size               4.00 MB
  Total PE              2386
  Alloc PE / Size       0 / 0   
  Free  PE / Size       2386 / 9.32 GB
  VG UUID               Vuol1I-8kCI-7mmR-vA7x-vCuI-QE6f-NBwhBJ
   
  --- Physical volumes ---
  PV Name               /dev/drbd0     
  PV UUID               KLYDPe-rMhP-b92M-NEwS-6Pyl-BTzR-fpKMFY
  PV Status             allocatable
  Total PE / Free PE    2386 / 2386

The tag we assigned isn't displayed; this is ok. You can see the tags using a special incantation of vgs:

vgs -o vg_name,vg_tags
  VG           VG Tags   
  drbd_an4_vg0 an-cluster
  drbd_an5_vg0 an-cluster
  drbd_sh0_vg0 an-cluster

It may not be pretty, but at least you can confirm that the tags exist as expected. Where tags are used will be discussed later in the troubleshooting section.

Before proceeding, be sure to have LVM rescan for the new VGs so that its cache is up to date.

vgscan
...

Creating a Logical Volume

At this point, we're only going to create a logical volume on the shared VG. This one LV will use all of the space available in the drbd_sh0_vg0 volume group. As with the VGs, we'll be assigning the same tag to our LV. We will also need to assign a name to the LV which will form the last part of the device path, /dev/vg_name/lv_name.

When creating LVs, you can specify the size of the new LV in a few ways. The two ways I prefer are -L xxG, where xx is the number of GiB to make the LV, and -l 100%FREE, which I use when creating the last partition on the VG (or the only one, as in this case). Which you use is entirely up to you.

On One Node:

lvcreate -l 100%FREE --addtag @an-cluster -n xen_shared drbd_sh0_vg0
  Logical volume "xen_shared" created

On The Other Node:

lvdisplay
  --- Logical volume ---
  LV Name                /dev/drbd_sh0_vg0/xen_shared
  VG Name                drbd_sh0_vg0
  LV UUID                2f7ZcD-BEBn-ESgh-WAcK-4aCS-gUg5-LYnv3Q
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                9.32 GB
  Current LE             2386
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0

As before, be sure to have LVM rescan for the new LVs before proceeding so that its cache is up to date.

lvscan
...

As with VGs, we can confirm that the tag was set using a similar call to lvs:

lvs -o vg_name,lv_name,lv_tags
  VG           LV         LV Tags   
  drbd_sh0_vg0 xen_shared an-cluster

That's it. Our clustered LVM is set up!

Setting Up The Shared GFS2 Partition

Setting up a GFS2 partition requires three steps;

  • Format the block device, a logical volume in our case, using the mkfs.gfs2 tool.
  • Create a mount point on each node.
  • Add an entry to /etc/fstab.

As mentioned earlier, we'll create a small 10 GB GFS2 partition that will hold common files for the cluster, most notably the virtual machine definition files. These need to be centralized so that one node can restore a VM lost on another node during a failure state. It's also a decent place for things like ISOs if you're not using a PXE server, or if you want to make generic VM images available. Though if you plan to do that, you will probably want a larger GFS2 partition than we are using here.

The information you need to have on hand when formatting a GFS2 partition is:

Variable Value Note
Locking protocol lock_dlm This is always lock_dlm
Journals 2 This matches the number of nodes in the cluster.
Cluster Name an-cluster As set in /etc/cluster/cluster.conf
Partition Name xen_shared Arbitrary name
Backing Device /dev/drbd_sh0_vg0/xen_shared The LV we created earlier

Putting it all together, the command becomes:

On One Node:

mkfs.gfs2 -p lock_dlm -j 2 -t an-cluster:xen_shared /dev/drbd_sh0_vg0/xen_shared
This will destroy any data on /dev/drbd_sh0_vg0/xen_shared.
Are you sure you want to proceed? [y/n] y
Note: It can take a bit of time for this to complete, please be patient.
Device:                    /dev/drbd_sh0_vg0/xen_shared
Blocksize:                 4096
Device Size                9.32 GB (2443264 blocks)
Filesystem Size:           9.32 GB (2443261 blocks)
Journals:                  2
Resource Groups:           38
Locking Protocol:          "lock_dlm"
Lock Table:                "an-cluster:xen_shared"
UUID:                      709CD7D5-9C5F-9255-E9BE-0CA1F758471D

Now confirm that the partition is visible from the other node.

On The Other Node:

gfs2_edit -p sb /dev/drbd_sh0_vg0/xen_shared
Block #16    (0x10) of 2443264 (0x254800) (superblock)

Superblock:
  mh_magic              0x01161970(hex)
  mh_type               1                   0x1
  mh_format             100                 0x64
  sb_fs_format          1801                0x709
  sb_multihost_format   1900                0x76c
  sb_bsize              4096                0x1000
  sb_bsize_shift        12                  0xc
  master dir:           2                   0x2
        addr:           22                  0x16
  root dir  :           1                   0x1
        addr:           21                  0x15
  sb_lockproto          lock_dlm
  sb_locktable          an-cluster:xen_shared
  sb_uuid               709CD7D5-9C5F-9255-E9BE-0CA1F758471D

The superblock has 2 directories
     1. (1). 21 (0x15): Dir     root
     2. (2). 22 (0x16): Dir     master
------------------------------------------------------

With that, the GFS2 partition is ready for use.

Now we need to create the mount point. The mount point you use is up to you. This tutorial will create a mount point called /xen_shared. Once that's created, we'll actually mount the GFS2 partition. Finally, we'll use df to verify that it mounted successfully.

On Both Nodes:

mkdir /xen_shared
mount /dev/drbd_sh0_vg0/xen_shared /xen_shared/
df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              20G  3.0G   16G  17% /
/dev/sda1             244M   33M  199M  15% /boot
tmpfs                 1.7G     0  1.7G   0% /dev/shm
none                  1.7G   40K  1.7G   1% /var/lib/xenstored
/dev/mapper/drbd_sh0_vg0-xen_shared
                      9.4G  259M  9.1G   3% /xen_shared

The last step is to add an entry to /etc/fstab for this GFS2 partition. This is required because the /etc/init.d/gfs2 initialization script consults /etc/fstab to see what partitions it is to manage.

If you are familiar with GFS2 on EL6, you might be used to mounting by the partition's UUID in /etc/fstab. That is not supported here on EL5.

echo "/dev/drbd_sh0_vg0/xen_shared /xen_shared gfs2 rw,suid,dev,exec,nouser,async 0 0" >> /etc/fstab
tail -n 1 /etc/fstab
/dev/drbd_sh0_vg0/xen_shared /xen_shared gfs2 rw,suid,dev,exec,nouser,async 0 0
Note: We use rw,suid,dev,exec,nouser,async instead of defaults for a specific reason. The key option we want to avoid is auto, which is implied by defaults. Avoiding it prevents the system from trying to mount the GFS2 partition during boot. The cluster is not running that early in the boot process, so the GFS2 partition will effectively not exist at that point, and any attempt to mount it would fail.
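As a quick illustration (the variable names here are mine, just for the example), you can confirm from the shell that our option list is simply the defaults set with auto removed:

```shell
# Illustration only; "defaults" is shorthand for the options below.
DEFAULTS="rw,suid,dev,exec,auto,nouser,async"
OURS="rw,suid,dev,exec,nouser,async"

# Strip "auto" from the defaults list; the result matches our fstab options.
echo "$DEFAULTS" | tr ',' '\n' | grep -v '^auto$' | paste -s -d, -
# prints: rw,suid,dev,exec,nouser,async
```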

Now, to verify that everything is working, call status against the gfs2 initialization script.

/etc/init.d/gfs2 status
Configured GFS2 mountpoints: 
/xen_shared
Active GFS2 mountpoints: 
/xen_shared

Now try stopping gfs2, checking the mounts with df and status, starting gfs2 back up and doing a final df and status. If all works well, the GFS2 volume should unmount and remount cleanly.

On Both Nodes:

Stop:

/etc/init.d/gfs2 stop
Unmounting GFS2 filesystems:                               [  OK  ]

Check:

/etc/init.d/gfs2 status
Configured GFS2 mountpoints: 
/xen_shared
df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              20G  3.0G   16G  17% /
/dev/sda1             244M   33M  199M  15% /boot
tmpfs                 1.7G     0  1.7G   0% /dev/shm
none                  1.7G   40K  1.7G   1% /var/lib/xenstored

Start:

/etc/init.d/gfs2 start
Mounting GFS2 filesystems:                                 [  OK  ]

Re-check:

/etc/init.d/gfs2 status
Configured GFS2 mountpoints: 
/xen_shared
Active GFS2 mountpoints: 
/xen_shared
df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              20G  3.0G   16G  17% /
/dev/sda1             244M   33M  199M  15% /boot
tmpfs                 1.7G     0  1.7G   0% /dev/shm
none                  1.7G   40K  1.7G   1% /var/lib/xenstored
/dev/mapper/drbd_sh0_vg0-xen_shared
                      9.4G  259M  9.1G   3% /xen_shared

Perfect!

Managing Storage In The Cluster

The storage for the cluster is ready, but it hasn't actually been tied into the cluster yet. To do that, we will use rgmanager, under which we will add the drbd, clvmd and gfs2 initialization scripts as cluster resources. We will create two failover domains, each one containing only one node. Finally, we will take those three resources and create a service tree.

Covering Some New Terms

Now, let's back up and talk a bit about those three new terms.

  • Resources are items that can be used in one or more services.
  • Services consist of one or more resources, either in series, parallel or a combination of both, that are managed by the cluster.
  • Failover Domains are collections of one or more nodes in a logical group. Services can run strictly within a failover domain, or they can be allowed to run outside of the domain when no domain members are available.

An Overview Of How We Will Manage Storage In The Cluster

So what we are going to do here is:

  • Create three script resources
  • Create two failover domains; one containing just an-node04 and the other containing just an-node05. We will restrict services assigned to these domains to only run within their domain, effectively locking each service to its node.
  • Within each failover domain, we will create a service with a serial resource tree. This tree will start drbd, then clvmd and finally gfs2.

The reason for this is so that when rgmanager starts, it will start each failover domain's service which, in turn, will start the clustered storage daemons in the proper order.

Why Not Start The Daemons At Boot Time?

This might seem like overkill, and arguably it is. The reason I still find it worthwhile is that if a storage daemon like DRBD hangs on boot, you can find yourself with a node that you cannot access. Many folks have their nodes in data centers, so gaining direct access can be a pain, to be polite. By moving these daemons over to the cluster, and knowing that rgmanager itself starts late in the boot process, we are much more likely to still have remote access when things go bad.

I used DRBD as an example on purpose. I prefer to have DRBD resources wait forever to connect to the other node when starting up. This way, if one node starts somewhat later than the other, the first node's DRBD resource won't risk split-braining. It will happily wait until its partner node comes up and starts its own DRBD daemon. The downside to this is that DRBD will effectively hang the boot process forever if the other node can't be started. By managing DRBD in the cluster, we leave open the option of logging in and telling DRBD to stop waiting when we know the other node will not be booting.
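The "wait forever" behaviour described above comes from DRBD's startup timeouts. As a sketch, assuming DRBD 8.3's configuration syntax (the 120-second value is just an example), the relevant fragment of /etc/drbd.conf looks like this:

```
startup {
        # 0 means "wait forever for the peer when starting"; this is what
        # would hang an unattended boot if the other node never comes up.
        wfc-timeout       0;
        # A shorter wait applies when the cluster was degraded at shutdown.
        degr-wfc-timeout  120;
}
```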

Adding rgmanager To cluster.conf

Everything related to rgmanager is an element of the <rm /> tag. Within that, the actual resources are themselves elements of the <resources /> tag. We'll start by creating these tags, then we'll look at the actual resources.

<?xml version="1.0"?>
<cluster name="an-cluster" config_version="10">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-node04.alteeve.com" nodeid="1">
			<fence>
				<method name="node_assassin">
					<device name="batou" port="01" action="reboot"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node05.alteeve.com" nodeid="2">
			<fence>
				<method name="node_assassin">
					<device name="batou" port="02" action="reboot"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="batou" agent="fence_na" ipaddr="batou.alteeve.com" login="admin" passwd="secret" quiet="1"/>
	</fencedevices>
        <fence_daemon post_join_delay="60"/>
        <totem rrp_mode="none" secauth="off"/>
        <rm>
                <resources/>
                <failoverdomains />
        </rm>
</cluster>

There are several attributes available for the <rm /> tag, though we don't need to worry about them now as the defaults are sane. Its primary purpose is to act as a container for the <failoverdomains />, <resources /> and <service /> tags. We'll be working with all three of these now.

Adding Resources to cluster.conf

The <resources /> tag has no attributes of its own. It solely acts as a container for various resource tags. There are many types of resources, but we will only be using the <script /> tag in this cluster.

Let's look at the three scripts we're going to add;

<?xml version="1.0"?>
<cluster name="an-cluster" config_version="11">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-node04.alteeve.com" nodeid="1">
			<fence>
				<method name="node_assassin">
					<device name="batou" port="01" action="reboot"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node05.alteeve.com" nodeid="2">
			<fence>
				<method name="node_assassin">
					<device name="batou" port="02" action="reboot"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="batou" agent="fence_na" ipaddr="batou.alteeve.com" login="admin" passwd="secret" quiet="1"/>
	</fencedevices>
        <fence_daemon post_join_delay="60"/>
        <totem rrp_mode="none" secauth="off"/>
        <rm>
                <resources>
                        <script file="/etc/init.d/drbd" name="drbd"/>
                        <script file="/etc/init.d/clvmd" name="clvmd"/>
                        <script file="/etc/init.d/gfs2" name="gfs2"/>
                </resources>
                <failoverdomains />
        </rm>
</cluster>

The two main attributes used by <script /> are file and name. The file attribute is the path to the script, and the name is what we will use to reference this script when we create our <service /> resource tree later.

Note: Scripts must work like initialization scripts. That is, they need to support being called with start, stop and status.
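Here is a minimal sketch of the interface such a script must offer. This is a hypothetical example, not one of the real initialization scripts; the echoed messages stand in for real start/stop/status logic:

```shell
#!/bin/sh
# Hypothetical sketch of the interface a <script /> resource must offer.
# A real script would start, stop or query an actual daemon here.
example_service() {
        case "$1" in
                start)  echo "Starting example service" ;;
                stop)   echo "Stopping example service" ;;
                status) echo "example service is running" ;;
                *)      echo "Usage: {start|stop|status}"; return 1 ;;
        esac
}

example_service start
example_service status
example_service stop
```

Like a proper initialization script, a real resource must also exit with 0 on success and non-zero on failure, as rgmanager judges the health of the resource by these exit codes.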

Adding Failover Domains to cluster.conf

Failover domains are, at their most basic, a collection of one or more nodes in the cluster. Services can then be configured to operate within the context of a given failover domain. There are a few key options to be aware of.

  • A failover domain can be unordered or prioritized.
    • When unordered, the service will relocate to an effectively random node in the domain.
    • When prioritized, services will relocate to the highest-priority node in the domain.
  • A failover domain can be restricted or unrestricted.
    • When restricted, the service is only allowed to relocate to nodes in the domain. When no nodes are available, the service is stopped.
    • When unrestricted, the service will try to relocate to a node in the domain. However, when no domain members are available, the service will attempt to start on another node in the cluster.
  • A failover domain can have a failback policy.
    • When a domain allows for failback and the domain is ordered, a service will migrate to the highest priority node in the domain. This allows for automated restoration of services on a failed node when it rejoins the cluster.
    • When a domain does not allow for failback, but is unrestricted, failback of services that fell out of the domain will happen anyway. However, once the service is within the domain, the service will not relocate to a higher-priority node should one become available later.
    • When a domain does not allow for failback and is restricted, then failback of services will never occur.

What we are going to do now is create two restricted failover domains with no relocation. Each of these will contain just one of the nodes, each node getting one domain. This will effectively lock the service we will soon create to the node in question. This way, services assigned to each domain will be started and maintained by the cluster, but they will not be highly available. The services we will create will have local initialization scripts, so this is perfectly fine.

This is how we will get the cluster to start and maintain our clustered storage daemons.

The format for defining failover domains is to create a <failoverdomains> tag, which has no attributes, and acts as a container for one or more <failoverdomain> tags. Each <failoverdomain> tag has four attributes and acts as a container for one or more <failoverdomainnode /> tags.

The only required attribute in <failoverdomain /> is name="". This is the name that will be used later when we want to bind a service to a given failover domain. By default, a failover domain is unordered, thus making failback meaningless, and is unrestricted. When ordered, the default is to allow for failback.

The individual <failoverdomainnode /> tags have two attributes; name="", which must match the given node's <clusternode name="" /> value, and priority="x", where x is an integer. When only one node is defined, or when a failover domain is unordered, the priority is ignored. When two or more nodes are defined and the domain is ordered, the node with the lowest number has the highest priority for hosting services. That is, a node with priority="1" will be preferred over a node with priority="2".

<?xml version="1.0"?>
<cluster name="an-cluster" config_version="12">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-node04.alteeve.com" nodeid="1">
			<fence>
				<method name="node_assassin">
					<device name="batou" port="01" action="reboot"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node05.alteeve.com" nodeid="2">
			<fence>
				<method name="node_assassin">
					<device name="batou" port="02" action="reboot"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="batou" agent="fence_na" ipaddr="batou.alteeve.com" login="admin" passwd="secret" quiet="1"/>
	</fencedevices>
        <fence_daemon post_join_delay="60"/>
        <totem rrp_mode="none" secauth="off"/>
        <rm>
                <resources>
                        <script file="/etc/init.d/drbd" name="drbd"/>
                        <script file="/etc/init.d/clvmd" name="clvmd"/>
                        <script file="/etc/init.d/gfs2" name="gfs2"/>
                </resources>
		<failoverdomains>
			<failoverdomain name="an4_only" nofailback="0" ordered="0" restricted="1">
				<failoverdomainnode name="an-node04.alteeve.com" priority="1"/>
			</failoverdomain>
			<failoverdomain name="an5_only" nofailback="0" ordered="0" restricted="1">
				<failoverdomainnode name="an-node05.alteeve.com" priority="1"/>
			</failoverdomain>
		</failoverdomains>
        </rm>
</cluster>

So here we've now created two failover domains; an4_only and an5_only. Both of these are restricted="1", so services within these domains will never try to start on other nodes. Both nofailback="0" and ordered="0" are defined, but they have no meaning anyway because the two domains have only one node.

Within each domain, the corresponding <failoverdomainnode /> is defined. Notice that priority="1" is set despite having no use here; the attribute must exist regardless. The name="an-node0x.alteeve.com" value connects the node to its corresponding <clusternode name="an-node0x.alteeve.com" /> entry in <clusternodes />.

Creating the Storage Services in cluster.conf

The last piece of the resource management puzzle is the <service /> tag. This is where the actual resources are tied together, optionally assigned to a failover domain and put under the cluster's control. The resource elements can be defined as parallel tags, they can be nested as elements of one another to form dependency branches, or they can be a combination of both. In our case, we want to make sure that each storage daemon successfully starts before the next one does, so we will be creating a dependency tree of resources.

The <service /> tag has just one required attribute, name="", which is used in tools like Conga for identifying the service. The name can be descriptive, but it must be unique. There are several optional attributes, though we will only be looking at six of them.

  • domain="" is used to assign the given <service /> to failover domain. The name set here must match a <failoverdomain name="" />.
  • autostart="[0|1]" controls whether or not the service is automatically started when rgmanager starts. We'll be disabling this for now, but we will come back and enable it after our initial testing is done.
  • exclusive="[0|1]" controls whether this service must run exclusively on a given node. Warning: If this is enabled, then no other service will be allowed to run on the node hosting this service.
  • recovery="[restart|relocate|disable]" controls what rgmanager will do when this service fails. The services we're going to create now are only designed to run on one node, so restart is the only policy that makes sense.
  • max_restarts="x", where x is the number of times that rgmanager will try to restart a given service. After x failures, rgmanager will instead relocate the service based on the failover domain policy, when set. In our case, the failover domains prevent the service from running outside the domain, and the domain has only one node, so this value is effectively meaningless to us.
  • restart_expire_time="x", where x is a number of seconds. When max_restarts is greater than 0, rgmanager keeps a count of how many times a service has failed. These service failures "expire" after the number of seconds defined here. This is used so that the service failure count can reduce back down to 0 once things have been shown to be stable for a reasonable amount of time. As we're using max_restarts="0" and the failover domain prevents relocation of the service, this value is effectively meaningless to us.
<?xml version="1.0"?>
<cluster name="an-cluster" config_version="13">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-node04.alteeve.com" nodeid="1">
			<fence>
				<method name="node_assassin">
					<device name="batou" port="01" action="reboot"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node05.alteeve.com" nodeid="2">
			<fence>
				<method name="node_assassin">
					<device name="batou" port="02" action="reboot"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="batou" agent="fence_na" ipaddr="batou.alteeve.com" login="admin" passwd="secret" quiet="1"/>
	</fencedevices>
        <fence_daemon post_join_delay="60"/>
        <totem rrp_mode="none" secauth="off"/>
	<rm>
		<resources>
			<script file="/etc/init.d/drbd" name="drbd"/>
			<script file="/etc/init.d/clvmd" name="clvmd"/>
			<script file="/etc/init.d/gfs2" name="gfs2"/>
		</resources>
		<failoverdomains>
			<failoverdomain name="an4_only" nofailback="0" ordered="0" restricted="1">
				<failoverdomainnode name="an-node04.alteeve.com" priority="1"/>
			</failoverdomain>
			<failoverdomain name="an5_only" nofailback="0" ordered="0" restricted="1">
				<failoverdomainnode name="an-node05.alteeve.com" priority="1"/>
			</failoverdomain>
		</failoverdomains>
		<service autostart="0" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart" restart_expire_time="0">
			<script ref="drbd">
				<script ref="clvmd">
					<script ref="gfs2"/>
				</script>
			</script>
		</service>
		<service autostart="0" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart" restart_expire_time="0">
			<script ref="drbd">
				<script ref="clvmd">
					<script ref="gfs2"/>
				</script>
			</script>
		</service>
	</rm>
</cluster>

So what we've done here is create our two <service /> groups; one for the an4_only failover domain and a matching service for an5_only. Both have their recovery policy set to recovery="restart" and neither is configured to start with rgmanager.

Each <service /> tag contains a collection of three <script /> resource references. The scripts are referenced using the <script ref="x" /> attribute, where x must match a <script name="x" /> element in <resources>.

These references are embedded to form a dependency tree. The tree is formatted to start drbd first, then when that starts successfully, it will start clvmd and then, finally, gfs2. When this service is disabled, this dependency tree is stopped in the reverse order.
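To make that ordering concrete, here is an illustrative sketch (not part of the cluster configuration) of the sequence the nested tree produces:

```shell
# Illustration only: the nested <script /> tree starts the daemons in
# this order and stops them in reverse.
DAEMONS="drbd clvmd gfs2"

# Starting: first to last.
for d in $DAEMONS; do
        echo "start $d"
done

# Stopping: build the reversed list, then walk it.
REVERSED=""
for d in $DAEMONS; do
        REVERSED="$d $REVERSED"
done
for d in $REVERSED; do
        echo "stop $d"
done
```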

Validating the Additions to cluster.conf

Seeing as we've made some fairly significant changes to /etc/cluster/cluster.conf, we'll want to re-validate it before pushing it out to the other node.

xmllint --relaxng /usr/share/system-config-cluster/misc/cluster.ng /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="an-cluster" config_version="13">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-node04.alteeve.com" nodeid="1">
			<fence>
				<method name="node_assassin">
					<device name="batou" port="01" action="reboot"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node05.alteeve.com" nodeid="2">
			<fence>
				<method name="node_assassin">
					<device name="batou" port="02" action="reboot"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="batou" agent="fence_na" ipaddr="batou.alteeve.com" login="admin" passwd="secret" quiet="1"/>
	</fencedevices>
        <fence_daemon post_join_delay="60"/>
        <totem rrp_mode="none" secauth="off"/>
	<rm>
		<resources>
			<script file="/etc/init.d/drbd" name="drbd"/>
			<script file="/etc/init.d/clvmd" name="clvmd"/>
			<script file="/etc/init.d/gfs2" name="gfs2"/>
		</resources>
		<failoverdomains>
			<failoverdomain name="an4_only" nofailback="0" ordered="0" restricted="1">
				<failoverdomainnode name="an-node04.alteeve.com" priority="1"/>
			</failoverdomain>
			<failoverdomain name="an5_only" nofailback="0" ordered="0" restricted="1">
				<failoverdomainnode name="an-node05.alteeve.com" priority="1"/>
			</failoverdomain>
		</failoverdomains>
		<service autostart="0" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart" restart_expire_time="0">
			<script ref="drbd">
				<script ref="clvmd">
					<script ref="gfs2"/>
				</script>
			</script>
		</service>
		<service autostart="0" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart" restart_expire_time="0">
			<script ref="drbd">
				<script ref="clvmd">
					<script ref="gfs2"/>
				</script>
			</script>
		</service>
	</rm>
</cluster>
/etc/cluster/cluster.conf validates

If there was a problem, you need to go back and fix it. DO NOT proceed until your configuration validates. Once it does, we're ready to move on!

With it validated, we need to push it to the other node. The cluster should be running now, so instead of rsync, we can use ccs_tool, the "cluster configuration system (tool)", to push the new cluster.conf to the other node and upgrade the cluster's version in one shot.

ccs_tool update /etc/cluster/cluster.conf
Config file updated from version 9 to 13

Update complete.

If you look at /var/log/messages on the other node, you should see something like this:

Apr  7 11:34:16 an-node05 ccsd[13259]: Update of cluster.conf complete (version 9 -> 13).

Starting rgmanager

Now that we have services, we will want to manually start rgmanager. We're not yet going to set it to automatically start as we're not yet automatically starting cman, which it depends on. This will be done later when the testing is complete.

So make sure that the cluster is up and running.

cman_tool status
Version: 6.2.0
Config Version: 13
Cluster Name: an-cluster
Cluster Id: 31412
Cluster Member: Yes
Cluster Generation: 612
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Quorum: 1  
Active subsystems: 7
Flags: 2node Dirty 
Ports Bound: 0  
Node name: an-node04.alteeve.com
Node ID: 1
Multicast addresses: 239.192.122.47 
Node addresses: 192.168.3.74

Now make sure that the gfs2, clvmd and drbd daemons are not running on either node. If you have to stop them, do so in this order.

/etc/init.d/gfs2 status
Configured GFS2 mountpoints: 
/xen_shared
/etc/init.d/clvmd status
clvmd is stopped
active volumes: (none)
/etc/init.d/drbd status
drbd not loaded

Finally, start rgmanager. It will be worth watching syslog (clear; tail -f -n 0 /var/log/messages) in a second terminal on each node.

/etc/init.d/rgmanager start
Starting Cluster Service Manager:                          [  OK  ]

Nothing more should happen at this point, as our services are not set to autostart.

Monitoring Resources

There is a tool called clustat that lets you see what state the cluster's resources are in. You can run it as a once-off check of the services, or you can use the -i x switch, where x is a number of seconds to wait between re-checking the cluster service states.

clustat
Cluster Status for an-cluster @ Thu Mar 31 13:55:13 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online, Local, rgmanager
 an-node05.alteeve.com                       2 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 service:an4_storage            (none)                         disabled      
 service:an5_storage            (none)                         disabled

To watch the cluster status and have it refresh the display every 2 seconds, call it with clustat -i 2. The output will be the same, but it won't return to the shell until you press <ctrl> + <c>.

Managing Cluster Resources

Managing services in the cluster is done with a fairly simple tool called clusvcadm.

The main commands we're going to look at now are:

  • clusvcadm -e <service> -m <node>: Enable the <service> on the specified <node>. When a <node> is not specified, the local node where the command was run is assumed.
  • clusvcadm -d <service> -m <node>: Disable the <service>.
  • clusvcadm -l <service>: Locks the <service> prior to a cluster shutdown. The only action allowed on a locked <service> is disabling it. This lets you stop the <service> so that rgmanager doesn't try to recover it (that is, restart it, for our two services). Once quorum is dissolved and the cluster is shut down, the service is unlocked and returns to normal operation the next time the node regains quorum.
  • clusvcadm -u <service>: Unlocks a <service>, should you change your mind and decide not to stop the cluster.

There are other ways to use clusvcadm which we will look at after the virtual servers are provisioned and under cluster control.

A Note On Resource Management With DRBD

We have something of a unique setup here, using DRBD, that requires a brief discussion.

When the cluster starts for the first time, with neither node's DRBD storage up, the first node to start will wait indefinitely for the second node to start. For this reason, we want to enable the storage resources at more or less the same time, from two different terminals. The reason for two terminals is that the clusvcadm -e ... command won't return until all resources have started, so you need the second terminal window to start the other node's clustered storage service while the first one waits.

Keep an eye on syslog, too. If anything goes wrong in DRBD and a split-brain is declared, you will see messages like:

Mar 29 20:24:37 an-node04 kernel: block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2
Mar 29 20:24:37 an-node04 kernel: block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2 exit code 0 (0x0)
Mar 29 20:24:37 an-node04 kernel: block drbd2: Split-Brain detected but unresolved, dropping connection!
Mar 29 20:24:37 an-node04 kernel: block drbd2: helper command: /sbin/drbdadm split-brain minor-2
Mar 29 20:24:37 an-node04 kernel: block drbd2: helper command: /sbin/drbdadm split-brain minor-2 exit code 0 (0x0)
Mar 29 20:24:37 an-node04 kernel: block drbd2: conn( WFReportParams -> Disconnecting )

This can happen, for example, if you stop the cluster while DRBD is still up and then break the network connection between the two DRBD resources. Recovering from a split-brain is covered in the troubleshooting section below. ToDo
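As a preview of that recovery, this is a sketch based on DRBD 8.3's documented manual split-brain procedure; <resource> is a placeholder for the affected resource's name. You decide which node's changes to discard, then reconnect:

```
# On the node whose changes will be thrown away:
drbdadm secondary <resource>
drbdadm -- --discard-my-data connect <resource>

# On the surviving node, if it also dropped the connection:
drbdadm connect <resource>
```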

Starting the Storage Services

Now, with a terminal window opened for each node, run:

On an-node04:

clusvcadm -e service:an4_storage -m an-node04.alteeve.com
Member an-node04.alteeve.com trying to enable service:an4_storage...Success
service:an4_storage is now running on an-node04.alteeve.com

On an-node05:

clusvcadm -e service:an5_storage -m an-node05.alteeve.com
Member an-node05.alteeve.com trying to enable service:an5_storage...Success
service:an5_storage is now running on an-node05.alteeve.com

Syslog should show something like this (sample from an-node04);

Mar 31 16:30:31 an-node04 clurgmgrd[12681]: <notice> Starting disabled service service:an4_storage
Mar 31 16:30:31 an-node04 kernel: drbd: initialized. Version: 8.3.8 (api:88/proto:86-94)
Mar 31 16:30:31 an-node04 kernel: drbd: GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by mockbuild@builder10.centos.org, 2010-06-04 08:04:27
Mar 31 16:30:31 an-node04 kernel: drbd: registered as block device major 147
Mar 31 16:30:31 an-node04 kernel: drbd: minor_table @ 0xffff8800c35b91c0
Mar 31 16:30:31 an-node04 kernel: block drbd0: Starting worker thread (from cqueue/1 [140])
Mar 31 16:30:31 an-node04 kernel: klogd 1.4.1, ---------- state change ----------
Mar 31 16:30:31 an-node04 kernel: block drbd0: disk( Diskless -> Attaching )
Mar 31 16:30:31 an-node04 kernel: block drbd0: Found 4 transactions (164 active extents) in activity log.
Mar 31 16:30:31 an-node04 kernel: block drbd0: Method to ensure write ordering: barrier
Mar 31 16:30:31 an-node04 kernel: block drbd0: max_segment_size ( = BIO size ) = 32768
Mar 31 16:30:31 an-node04 kernel: block drbd0: drbd_bm_resize called with capacity == 19550368
Mar 31 16:30:31 an-node04 kernel: block drbd0: resync bitmap: bits=2443796 words=38185
Mar 31 16:30:31 an-node04 kernel: block drbd0: size = 9546 MB (9775184 KB)
Mar 31 16:30:31 an-node04 kernel: block drbd0: recounting of set bits took additional 0 jiffies
Mar 31 16:30:31 an-node04 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Mar 31 16:30:31 an-node04 kernel: block drbd0: disk( Attaching -> Outdated )
Mar 31 16:30:31 an-node04 kernel: block drbd1: Starting worker thread (from cqueue/0 [139])
Mar 31 16:30:31 an-node04 kernel: block drbd1: disk( Diskless -> Attaching )
Mar 31 16:30:31 an-node04 kernel: block drbd1: Found 1 transactions (1 active extents) in activity log.
Mar 31 16:30:31 an-node04 kernel: block drbd1: Method to ensure write ordering: barrier
Mar 31 16:30:31 an-node04 kernel: block drbd1: max_segment_size ( = BIO size ) = 32768
Mar 31 16:30:31 an-node04 kernel: block drbd1: drbd_bm_resize called with capacity == 195328232
Mar 31 16:30:31 an-node04 kernel: block drbd1: resync bitmap: bits=24416029 words=381501
Mar 31 16:30:31 an-node04 kernel: block drbd1: size = 93 GB (97664116 KB)
Mar 31 16:30:31 an-node04 kernel: block drbd1: recounting of set bits took additional 2 jiffies
Mar 31 16:30:31 an-node04 kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Mar 31 16:30:31 an-node04 kernel: block drbd1: disk( Attaching -> Outdated )
Mar 31 16:30:31 an-node04 kernel: block drbd2: Starting worker thread (from cqueue/1 [140])
Mar 31 16:30:31 an-node04 kernel: block drbd2: disk( Diskless -> Attaching )
Mar 31 16:30:31 an-node04 kernel: block drbd2: Found 2 transactions (2 active extents) in activity log.
Mar 31 16:30:31 an-node04 kernel: block drbd2: Method to ensure write ordering: barrier
Mar 31 16:30:31 an-node04 kernel: block drbd2: max_segment_size ( = BIO size ) = 32768
Mar 31 16:30:31 an-node04 kernel: block drbd2: drbd_bm_resize called with capacity == 195328232
Mar 31 16:30:31 an-node04 kernel: block drbd2: resync bitmap: bits=24416029 words=381501
Mar 31 16:30:31 an-node04 kernel: block drbd2: size = 93 GB (97664116 KB)
Mar 31 16:30:31 an-node04 kernel: block drbd2: recounting of set bits took additional 0 jiffies
Mar 31 16:30:31 an-node04 kernel: block drbd2: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Mar 31 16:30:31 an-node04 kernel: block drbd2: disk( Attaching -> Outdated )
Mar 31 16:30:31 an-node04 kernel: block drbd3: Starting worker thread (from cqueue/1 [140])
Mar 31 16:30:31 an-node04 kernel: block drbd3: disk( Diskless -> Attaching )
Mar 31 16:30:31 an-node04 kernel: block drbd3: No usable activity log found.
Mar 31 16:30:31 an-node04 kernel: block drbd3: Method to ensure write ordering: barrier
Mar 31 16:30:31 an-node04 kernel: block drbd3: max_segment_size ( = BIO size ) = 32768
Mar 31 16:30:31 an-node04 kernel: block drbd3: drbd_bm_resize called with capacity == 515686680
Mar 31 16:30:31 an-node04 kernel: block drbd3: resync bitmap: bits=64460835 words=1007201
Mar 31 16:30:31 an-node04 kernel: block drbd3: size = 246 GB (257843340 KB)
Mar 31 16:30:32 an-node04 kernel: block drbd3: recounting of set bits took additional 0 jiffies
Mar 31 16:30:32 an-node04 kernel: block drbd3: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Mar 31 16:30:32 an-node04 kernel: block drbd3: disk( Attaching -> Outdated )
Mar 31 16:30:32 an-node04 kernel: block drbd0: conn( StandAlone -> Unconnected )
Mar 31 16:30:32 an-node04 kernel: block drbd0: Starting receiver thread (from drbd0_worker [14761])
Mar 31 16:30:32 an-node04 kernel: block drbd0: receiver (re)started
Mar 31 16:30:32 an-node04 kernel: block drbd0: conn( Unconnected -> WFConnection )
Mar 31 16:30:32 an-node04 kernel: block drbd1: conn( StandAlone -> Unconnected )
Mar 31 16:30:32 an-node04 kernel: block drbd1: Starting receiver thread (from drbd1_worker [14776])
Mar 31 16:30:32 an-node04 kernel: block drbd1: receiver (re)started
Mar 31 16:30:32 an-node04 kernel: block drbd1: conn( Unconnected -> WFConnection )
Mar 31 16:30:32 an-node04 kernel: block drbd2: conn( StandAlone -> Unconnected )
Mar 31 16:30:32 an-node04 kernel: block drbd2: Starting receiver thread (from drbd2_worker [14792])
Mar 31 16:30:32 an-node04 kernel: block drbd2: receiver (re)started
Mar 31 16:30:32 an-node04 kernel: block drbd2: conn( Unconnected -> WFConnection )
Mar 31 16:30:32 an-node04 kernel: block drbd3: conn( StandAlone -> Unconnected )
Mar 31 16:30:32 an-node04 kernel: block drbd3: Starting receiver thread (from drbd3_worker [14809])
Mar 31 16:30:32 an-node04 kernel: block drbd3: receiver (re)started
Mar 31 16:30:32 an-node04 kernel: block drbd3: conn( Unconnected -> WFConnection )
Mar 31 16:30:33 an-node04 kernel: block drbd0: Handshake successful: Agreed network protocol version 94
Mar 31 16:30:33 an-node04 kernel: block drbd0: conn( WFConnection -> WFReportParams )
Mar 31 16:30:33 an-node04 kernel: block drbd2: Handshake successful: Agreed network protocol version 94
Mar 31 16:30:33 an-node04 kernel: block drbd2: conn( WFConnection -> WFReportParams )
Mar 31 16:30:33 an-node04 kernel: block drbd1: Handshake successful: Agreed network protocol version 94
Mar 31 16:30:33 an-node04 kernel: block drbd1: conn( WFConnection -> WFReportParams )
Mar 31 16:30:33 an-node04 kernel: block drbd1: Starting asender thread (from drbd1_receiver [14837])
Mar 31 16:30:33 an-node04 kernel: block drbd0: Starting asender thread (from drbd0_receiver [14834])
Mar 31 16:30:33 an-node04 kernel: block drbd2: Starting asender thread (from drbd2_receiver [14840])
Mar 31 16:30:33 an-node04 kernel: block drbd0: data-integrity-alg: <not-used>
Mar 31 16:30:33 an-node04 kernel: block drbd0: drbd_sync_handshake:
Mar 31 16:30:33 an-node04 kernel: block drbd0: self F474848EEC0951CC:0000000000000000:7C017082A862CD08:7F34B076565CCDB5 bits:0 flags:0
Mar 31 16:30:33 an-node04 kernel: block drbd0: peer 157CD3BF3DB6E8BC:F474848EEC0951CD:7C017082A862CD08:7F34B076565CCDB5 bits:0 flags:0
Mar 31 16:30:33 an-node04 kernel: block drbd0: uuid_compare()=-1 by rule 50
Mar 31 16:30:33 an-node04 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Mar 31 16:30:33 an-node04 kernel: block drbd1: data-integrity-alg: <not-used>
Mar 31 16:30:33 an-node04 kernel: block drbd1: drbd_sync_handshake:
Mar 31 16:30:33 an-node04 kernel: block drbd1: self 30762E54BB9079FE:0000000000000000:88058E571A66C99C:12631AA2DAF46DD1 bits:0 flags:0
Mar 31 16:30:33 an-node04 kernel: block drbd1: peer 0B9D937BE2B0B9A6:30762E54BB9079FF:88058E571A66C99C:12631AA2DAF46DD1 bits:0 flags:0
Mar 31 16:30:33 an-node04 kernel: block drbd1: uuid_compare()=-1 by rule 50
Mar 31 16:30:33 an-node04 kernel: block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Mar 31 16:30:33 an-node04 kernel: block drbd2: data-integrity-alg: <not-used>
Mar 31 16:30:33 an-node04 kernel: block drbd2: drbd_sync_handshake:
Mar 31 16:30:33 an-node04 kernel: block drbd2: self A68C24DF6C94892A:0000000000000000:C0740977511D1CDC:B80EC8D187F4C7BF bits:0 flags:0
Mar 31 16:30:33 an-node04 kernel: block drbd2: peer 952CE197EA804B60:A68C24DF6C94892B:C0740977511D1CDC:B80EC8D187F4C7BF bits:0 flags:0
Mar 31 16:30:33 an-node04 kernel: block drbd2: uuid_compare()=-1 by rule 50
Mar 31 16:30:33 an-node04 kernel: block drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Mar 31 16:30:33 an-node04 kernel: block drbd3: Handshake successful: Agreed network protocol version 94
Mar 31 16:30:33 an-node04 kernel: block drbd3: conn( WFConnection -> WFReportParams )
Mar 31 16:30:33 an-node04 kernel: block drbd3: Starting asender thread (from drbd3_receiver [14843])
Mar 31 16:30:33 an-node04 kernel: block drbd3: data-integrity-alg: <not-used>
Mar 31 16:30:33 an-node04 kernel: block drbd3: drbd_sync_handshake:
Mar 31 16:30:33 an-node04 kernel: block drbd3: self 178C930B89B102E8:0000000000000000:94D8B3350E2CA8E6:CC29CCB2B94BE0EB bits:0 flags:0
Mar 31 16:30:33 an-node04 kernel: block drbd3: peer A1A624112B473DDE:178C930B89B102E9:94D8B3350E2CA8E7:CC29CCB2B94BE0EB bits:0 flags:0
Mar 31 16:30:33 an-node04 kernel: block drbd3: uuid_compare()=-1 by rule 50
Mar 31 16:30:33 an-node04 kernel: block drbd3: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Mar 31 16:30:33 an-node04 kernel: block drbd0: role( Secondary -> Primary )
Mar 31 16:30:33 an-node04 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
Mar 31 16:30:33 an-node04 kernel: block drbd1: conn( WFBitMapT -> WFSyncUUID )
Mar 31 16:30:33 an-node04 kernel: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1
Mar 31 16:30:33 an-node04 kernel: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0)
Mar 31 16:30:33 an-node04 kernel: block drbd1: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
Mar 31 16:30:33 an-node04 kernel: block drbd1: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
Mar 31 16:30:33 an-node04 kernel: block drbd1: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
Mar 31 16:30:33 an-node04 kernel: block drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Mar 31 16:30:33 an-node04 kernel: block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1
Mar 31 16:30:33 an-node04 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Mar 31 16:30:33 an-node04 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Mar 31 16:30:33 an-node04 kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
Mar 31 16:30:33 an-node04 kernel: block drbd0: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
Mar 31 16:30:33 an-node04 kernel: block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1 exit code 0 (0x0)
Mar 31 16:30:33 an-node04 kernel: block drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
Mar 31 16:30:33 an-node04 kernel: block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Mar 31 16:30:33 an-node04 kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
Mar 31 16:30:33 an-node04 kernel: block drbd1: Connected in w_make_resync_request
Mar 31 16:30:33 an-node04 kernel: block drbd1: role( Secondary -> Primary )
Mar 31 16:30:33 an-node04 kernel: block drbd2: conn( WFBitMapT -> WFSyncUUID )
Mar 31 16:30:33 an-node04 kernel: block drbd2: helper command: /sbin/drbdadm before-resync-target minor-2
Mar 31 16:30:33 an-node04 kernel: block drbd2: helper command: /sbin/drbdadm before-resync-target minor-2 exit code 0 (0x0)
Mar 31 16:30:33 an-node04 kernel: block drbd2: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
Mar 31 16:30:33 an-node04 kernel: block drbd2: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
Mar 31 16:30:33 an-node04 kernel: block drbd3: conn( WFBitMapT -> WFSyncUUID )
Mar 31 16:30:33 an-node04 kernel: block drbd2: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
Mar 31 16:30:33 an-node04 kernel: block drbd2: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Mar 31 16:30:33 an-node04 kernel: block drbd2: helper command: /sbin/drbdadm after-resync-target minor-2
Mar 31 16:30:33 an-node04 kernel: block drbd3: helper command: /sbin/drbdadm before-resync-target minor-3
Mar 31 16:30:33 an-node04 clvmd: Cluster LVM daemon started - connected to CMAN
Mar 31 16:30:33 an-node04 kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
Mar 31 16:30:33 an-node04 kernel: block drbd0: Connected in w_make_resync_request
Mar 31 16:30:33 an-node04 kernel: block drbd2: helper command: /sbin/drbdadm after-resync-target minor-2 exit code 0 (0x0)
Mar 31 16:30:33 an-node04 kernel: block drbd3: helper command: /sbin/drbdadm before-resync-target minor-3 exit code 0 (0x0)
Mar 31 16:30:33 an-node04 kernel: block drbd3: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
Mar 31 16:30:33 an-node04 kernel: block drbd3: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
Mar 31 16:30:33 an-node04 kernel: block drbd3: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
Mar 31 16:30:33 an-node04 kernel: block drbd3: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Mar 31 16:30:33 an-node04 kernel: block drbd3: helper command: /sbin/drbdadm after-resync-target minor-3
Mar 31 16:30:33 an-node04 kernel: block drbd3: helper command: /sbin/drbdadm after-resync-target minor-3 exit code 0 (0x0)
Mar 31 16:30:33 an-node04 kernel: block drbd2: Connected in w_make_resync_request
Mar 31 16:30:33 an-node04 kernel: block drbd2: role( Secondary -> Primary )
Mar 31 16:30:33 an-node04 kernel: block drbd3: role( Secondary -> Primary )
Mar 31 16:30:34 an-node04 kernel: block drbd0: peer( Secondary -> Primary )
Mar 31 16:30:34 an-node04 kernel: block drbd1: peer( Secondary -> Primary )
Mar 31 16:30:34 an-node04 kernel: block drbd2: peer( Secondary -> Primary )
Mar 31 16:30:34 an-node04 kernel: block drbd3: peer( Secondary -> Primary )
Mar 31 16:30:34 an-node04 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "an-cluster:xen_shared"
Mar 31 16:30:34 an-node04 kernel: GFS2: fsid=an-cluster:xen_shared.0: Joined cluster. Now mounting FS...
Mar 31 16:30:34 an-node04 kernel: GFS2: fsid=an-cluster:xen_shared.0: jid=0, already locked for use
Mar 31 16:30:34 an-node04 kernel: GFS2: fsid=an-cluster:xen_shared.0: jid=0: Looking at journal...
Mar 31 16:30:35 an-node04 kernel: GFS2: fsid=an-cluster:xen_shared.0: jid=0: Done
Mar 31 16:30:35 an-node04 kernel: GFS2: fsid=an-cluster:xen_shared.0: jid=1: Trying to acquire journal lock...
Mar 31 16:30:35 an-node04 kernel: GFS2: fsid=an-cluster:xen_shared.0: jid=1: Looking at journal...
Mar 31 16:30:35 an-node04 kernel: GFS2: fsid=an-cluster:xen_shared.0: jid=1: Done
Mar 31 16:30:35 an-node04 clurgmgrd[12681]: <notice> Service service:an4_storage started

We see clurgmgrd, the cluster rgmanager daemon, take the request to start the an4_storage service. This is immediately followed by a flood of drbd messages showing the attachment, connection and promotion of the DRBD resources. Once the drbd daemon reported that it was up, clurgmgrd started clvmd. There is still some DRBD work going on in the background, but shortly thereafter we see the gfs2 initialization script start up. Once this last daemon returns successfully, clurgmgrd reports that the service started successfully.
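If you want to trace this start-up sequence on your own node, filtering the system log for the daemons involved works well. A minimal sketch, run here against a small sample in the format shown above rather than reading /var/log/messages directly:

```shell
# Build a tiny sample in the syslog format shown above. On a real node you
# would grep /var/log/messages itself.
log=$(mktemp)
cat > "$log" <<'EOF'
Mar 31 16:30:33 an-node04 clvmd: Cluster LVM daemon started - connected to CMAN
Mar 31 16:30:35 an-node04 clurgmgrd[12681]: <notice> Service service:an4_storage started
EOF

# Pull out only the cluster-storage related lines, in chronological order.
timeline=$(grep -E 'clurgmgrd|clvmd|drbd|GFS2' "$log")
rm -f "$log"
printf '%s\n' "$timeline"
```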

Now you can check drbd, clvmd and gfs2 again and you will see that they are all online.

/etc/init.d/drbd status
drbd driver loaded OK; device status:
version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by mockbuild@builder10.centos.org, 2010-06-04 08:04:27
m:res  cs         ro               ds                 p  mounted  fstype
0:r0   Connected  Primary/Primary  UpToDate/UpToDate  C
1:r1   Connected  Primary/Primary  UpToDate/UpToDate  C
2:r2   Connected  Primary/Primary  UpToDate/UpToDate  C
3:r3   Connected  Primary/Primary  UpToDate/UpToDate  C
/etc/init.d/clvmd status
clvmd (pid 14919) is running...
active volumes: xen_shared
/etc/init.d/gfs2 status
Configured GFS2 mountpoints: 
/xen_shared
Active GFS2 mountpoints: 
/xen_shared

Now, let's check clustat again and we'll see that the services are online.

clustat
Cluster Status for an-cluster @ Thu Mar 31 17:40:15 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online, Local, rgmanager
 an-node05.alteeve.com                       2 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 service:an4_storage            an-node04.alteeve.com          started       
 service:an5_storage            an-node05.alteeve.com          started

Hoozah!

Stopping Clustered Services

With the services we've created, shutting things down is actually pretty simple. Simply stopping rgmanager on each node will stop the services and, as they're not able to fail over, the services will stay offline. This can lead to bad habits though. So, to get into the proper habit, let's lock and then disable the an5_storage service before shutting down rgmanager.

On an-node05:

clusvcadm -l service:an5_storage -m an-node05.alteeve.com
Resource groups locked
clusvcadm -d service:an5_storage -m an-node05.alteeve.com
Member an-node05.alteeve.com disabling service:an5_storage...Success

Now when you try to run clustat, you can see that the service on an-node05 is disabled.

clustat
Cluster Status for an-cluster @ Fri Apr  1 00:01:06 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online, rgmanager
 an-node05.alteeve.com                       2 Online, Local, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 service:an4_storage            an-node04.alteeve.com          started       
 service:an5_storage            (an-node05.alteeve.com)        disabled

Now we can shut down rgmanager itself.

/etc/init.d/rgmanager stop
Shutting down Cluster Service Manager...
Waiting for services to stop:                              [  OK  ]
Cluster Service Manager is stopped.

Now clustat will not show any services at all when run from an-node05.

On an-node05:

clustat
Cluster Status for an-cluster @ Fri Apr  1 00:02:53 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online
 an-node05.alteeve.com                       2 Online, Local

You can still see both services from an-node04 though.

On an-node04:

clustat
Cluster Status for an-cluster @ Fri Apr  1 00:03:21 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online, Local, rgmanager
 an-node05.alteeve.com                       2 Online

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 service:an4_storage            an-node04.alteeve.com          started       
 service:an5_storage            (an-node05.alteeve.com)        disabled

Now we can go back to an-node05 and completely shut down the cluster.

/etc/init.d/cman stop
Stopping cluster: 
   Stopping fencing... done
   Stopping cman... done
   Stopping ccsd... done
   Unmounting configfs... done
                                                           [  OK  ]

We can check on an-node04 and see that the cluster is now down to just itself.

cman_tool status
Version: 6.2.0
Config Version: 13
Cluster Name: an-cluster
Cluster Id: 31412
Cluster Member: Yes
Cluster Generation: 648
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Quorum: 1  
Active subsystems: 9
Flags: 2node Dirty 
Ports Bound: 0 11 177  
Node name: an-node04.alteeve.com
Node ID: 1
Multicast addresses: 239.192.122.47 
Node addresses: 192.168.3.74

At this point, an-node05 is completely out of the cluster and you could safely perform any maintenance on it that you might want to do. More on that later though.
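The fields of interest in that cman_tool output are easy to extract with awk, which is handy in monitoring scripts. This sketch parses a sample of the output above; on a live node you would pipe `cman_tool status` in directly:

```shell
# Sample of the 'cman_tool status' output shown above.
status='Nodes: 1
Expected votes: 1
Total votes: 1
Quorum: 1'

# Extract the live node count and the quorum requirement.
nodes=$(printf '%s\n' "$status" | awk '/^Nodes:/ {print $2}')
quorum=$(printf '%s\n' "$status" | awk '/^Quorum:/ {print $2}')
echo "Nodes: $nodes, quorum needed: $quorum"
```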

Provisioning Our Virtual Servers

Finally, the goal of this cluster is starting to come into sight!

"Provisioning" virtual servers simply means creating them. This tutorial is more about clustering than about Xen and virtual machine administration, so some liberties will be taken with regard to your knowledge of Xen. We'll cover all of the steps needed to provision and manage the VMs, but there will not be an in-depth discussion of the tools and their various uses.

Please, if you are totally unfamiliar with Xen, take a few minutes to review some tutorials:

Note: We are using Xen v3 here, while the latest release is now v4. Please take note of the version when reading these tutorials.

Starting libvirtd On The Nodes

In the following steps, we will be using a program called virsh on the nodes and virt-manager on our workstations to view the VMs. For this, we need to make sure that the libvirtd daemon is running on each node first.

We'll start the daemon now as we're going to use it very shortly.

On Both Nodes:

/etc/init.d/libvirtd start
Starting libvirtd daemon:                                  [  OK  ]

To start libvirtd on boot, run the command below.

On Both Nodes:

chkconfig libvirtd on
chkconfig --list libvirtd
libvirtd       	0:off	1:off	2:on	3:on	4:on	5:on	6:off

Accessing The VMs

The virtual servers we are going to create are, by definition, "headless". There is no monitor or place to plug in a keyboard.

The main way that you will monitor the virtual servers is through VNC. If you are running a relatively recent version of Linux on your workstation, there is a fantastic little program called virt-manager for connecting to and monitoring VMs on multiple nodes running multiple hypervisors. It is available in most Linux distributions' package managers under the same name.

In Fedora, EL 5 and 6 and many other RPM based distributions, you can install virt-manager on your workstation with the following command.

yum install virt-manager

You can then find virt-manager under System Tools -> Virtual Machine Manager.

To establish a connection to the nodes, click on File -> Add Connection.... Change the Hypervisor selection bar to Xen, check to select Connect to remote host, leave the default Method as SSH and Username as root. Then enter the host name or IP address of each node in the Hostname field. I always add cluster nodes to my /etc/hosts file so that I can simply enter an-node04 and an-node05. How you handle this is up to you and your preferences.

Adding a connection to virt-manager on Fedora 14.
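The /etc/hosts entries mentioned above can be added idempotently with a small helper, so repeated runs don't duplicate lines. This is only a sketch: it writes to a temporary file rather than /etc/hosts, and the IP addresses are illustrative examples, not taken from this tutorial's network plan.

```shell
hosts_file=$(mktemp)   # stand-in for /etc/hosts on a real workstation

add_host() {
    # Append "IP name" only if the name is not already present.
    grep -qw "$2" "$hosts_file" || printf '%s\t%s\n' "$1" "$2" >> "$hosts_file"
}

add_host 192.168.1.74 an-node04   # example IPs; use your own
add_host 192.168.1.75 an-node05
add_host 192.168.1.74 an-node04   # second call is a no-op
```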

Once both nodes are added, you should see that there is already a Domain-0 entry. This is because, as we discussed earlier, even the "host" OS is itself a virtual machine.

A view of virt-manager on Fedora 14.

As an aside; you'll note that localhost in that screen shot does not have an entry called Domain-0. This is one of the major differences between KVM and Xen. KVM, which is what I use on my workstation for various other projects, does not convert the host into a VM.

Limiting dom0's RAM Use

Normally, dom0 will claim and use memory not allocated to virtual machines. This can cause trouble if, for example, you've migrated a VM off of a node and then want to move it or another VM back shortly after. For a period of time, dom0 will claim that there is not enough free memory for the migration. By setting a hard limit of dom0's memory usage, this scenario won't happen and you will not need to delay migrations.

To do this, add dom0_mem=1024M to the Xen kernel image's first module line in grub. For example, your grub configuration file should contain a section like this:

vim /boot/grub/grub.conf
default=1
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title CentOS (2.6.18-194.32.1.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-194.32.1.el5 ro root=LABEL=/ dom0_mem=1024M
        initrd /initrd-2.6.18-194.32.1.el5.img
title CentOS (2.6.18-194.32.1.el5xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-194.32.1.el5
        module /vmlinuz-2.6.18-194.32.1.el5xen ro root=LABEL=/ dom0_mem=1024M
        module /initrd-2.6.18-194.32.1.el5xen.img

Replace 1024M with the amount of RAM you want to allocate to dom0.

Warning: If you update your kernel, ensure that this kernel argument was added to the new kernel's argument list.
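One way to honour that warning is a quick post-update check. This sketch scans a sample grub.conf (modelled on the one above) for Xen kernel module lines that are missing the dom0_mem argument; on a real node you would point it at /boot/grub/grub.conf instead.

```shell
conf=$(mktemp)   # stand-in for /boot/grub/grub.conf
cat > "$conf" <<'EOF'
title CentOS (2.6.18-194.32.1.el5xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-194.32.1.el5
        module /vmlinuz-2.6.18-194.32.1.el5xen ro root=LABEL=/ dom0_mem=1024M
        module /initrd-2.6.18-194.32.1.el5xen.img
EOF

# Count Xen kernel 'module' lines that lack the dom0_mem argument.
missing=$(grep 'module /vmlinuz' "$conf" | grep -vc 'dom0_mem=' || true)
rm -f "$conf"
echo "Kernel module lines missing dom0_mem: $missing"
```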

Our planned layout

At this stage, what you will want to run is almost certainly going to be unique to you, so we will not go into detail about what each VM does. We will cover provisioning them, manipulating them and so on. The descriptions of the VMs are purely examples of what they might be.

We will be creating two virtual servers.

  • vm0001_c5_ws1; A CentOS server hosting a website.
  • vm0002_win1; A Microsoft Windows server, showing how to host non-Linux virtual machines.

We'll assign vm0001_c5_ws1 to run normally on an-node04. The vm0002_win1 machine will run on an-node05.

Before we talk about resources, there is something you must be aware of.

  • You can have more virtual machines than CPU cores. However, it's advisable to dedicate one core to just the dom0 machine.
  • The RAM assigned to dom0 and all domU VMs must not exceed the total amount of RAM available in a given node.
Warning: You must consider how your collection of virtual servers will run when only one node is available. As I have 4 GiB of RAM in each node, I will assign 1 GiB to dom0 and then 1 GiB to each VM, leaving 1 GiB for future expansion. How you divvy up your memory and CPU cores is ultimately up to you.
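The memory arithmetic above is worth scripting once the VM count grows. A minimal sketch using this tutorial's numbers (4096 MiB per node, 1024 MiB for dom0, 1024 MiB per VM):

```shell
node_ram_mib=4096       # total RAM per node
dom0_ram_mib=1024       # reserved for dom0
vm_ram_mib="1024 1024"  # one entry per planned VM

# Sum dom0 plus every VM, then see what headroom remains if all VMs
# had to run on a single surviving node.
used=$dom0_ram_mib
for vm in $vm_ram_mib; do
    used=$((used + vm))
done
free=$((node_ram_mib - used))
echo "Headroom if all VMs land on one node: ${free} MiB"
```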

So here are our two planned virtual servers, laid out in a table. Doing this before provisioning can help you visualize how your cluster's resources will be consumed, helping to ensure that you don't use too much, which is of particular note on very large installations. It's also very useful for planning your virtual machine provisioning commands in the next step.

                  vm0001_c5_ws1                              vm0002_win1
Primary Host      an-node04                                  an-node05
RAM               1024 MiB                                   1024 MiB
Storage           /dev/drbd_an4_vg0/vm0001_1, 50 GB          /dev/drbd_an5_vg0/vm0002_1, 100 GB (100%)
Network(s)        IFN xenbr0, 192.168.1.200/255.255.255.0    IFN xenbr0, 192.168.1.201/255.255.255.0
Source Files      http://192.168.1.254/c5/x86_64/img         http://192.168.1.254/win7/x86_64/iso
Kickstart Script  http://192.168.1.254/c5/x86_64/ks/generic_c5.ks
Warning: There are issues with installing VMs from ISO images. For this reason, you are advised to make the installation images available over a web server. A great way to do this is by creating a PXE server on your network; then you can point to its img directory when running the VM installs. This tutorial assumes this is available.

Provisioning vm0001_c5_ws1; A Webserver

So let's start with a basic web server.

Provisioning VMs requires two steps;

  • Creating a logical volume on the clustered LVM.
  • Craft and execute a virt-install command.

Before you proceed, you need to know where the installation image files are found. This tutorial uses a PXE server, so we'll be telling virt-install to pull the installation files and kickstart scripts off of its web server. If you don't have a PXE server, simply mounting the installation image's ISO and making it available through a trivial web server setup will be fine. How you do this, exactly, is outside the scope of this tutorial; however, there is a separate, detailed tutorial for setting up a PXE server which covers a basic apache configuration.

Create the LV for the VM on the /dev/drbd_an4_vg0 VG, as it will primarily run on an-node04.

lvcreate -L 50G -n vm0001_1 --addtag @an-cluster /dev/drbd_an4_vg0
  Logical volume "vm0001_1" created
Note: The example below uses the following kickstart file. Please adapt it for your use.

Now, the long virt-install command to provision the VM. Let's look at it, then we'll discuss it.

virt-install --connect xen \
	--name vm0001_c5_ws1 \
	--ram 1024 \
	--arch x86_64 \
	--vcpus 1 \
	--cpuset 1 \
	--location http://192.168.1.254/c5/x86_64/img \
	--extra-args "ks=http://192.168.1.254/c5/x86_64/ks/generic_c5.ks" \
	--os-type linux \
	--os-variant rhel5.4 \
	--disk path=/dev/drbd_an4_vg0/vm0001_1 \
	--network bridge=xenbr0 \
	--vnc \
	--paravirt
Note: If you wanted to provision a VM to act as a firewall, or for other reasons wanted a VM to access the back-channel, you could connect to xenbr2 by simply adding a second --network bridge=xenbr2 argument.

The man page for virt-install covers all of the options you can pass in good detail. We'll now discuss the options used here, but this is only a subset of the options you may wish to use. Please take the time to read man virt-install.

  • --connect xen; Tells virt-install that we are provisioning a Xen domU VM.
  • --name vm0001_c5_ws1; Tells virt-install to give the VM the name vm0001_c5_ws1. This can be anything you please, but it must be unique in the cluster. Personally, I like the format vm####_desc, where #### is a sequence number to ensure uniqueness and desc is a human-readable, short description of the VM. Please use whatever naming convention you find comfortable.
  • --ram 1024; This is the number of MiB to allocate to the VM. This can be adjusted post-install.
  • --arch x86_64; This tells virt-install to emulate a 64bit CPU/environment.
  • --vcpus 1; This controls how many CPU cores to allocate to this VM. It cannot exceed the real number of cores, and should be n-1 at most, to ensure that dom0 keeps sole access to core 0. This can be adjusted post-install.
  • --cpuset 1; This tells libvirt which cores it is allowed to use for this VM. As this node only has two cores, this is set to 1 (the second core). This can be a comma-separated list of values, and values can use hyphens for ranges. For example, if you have eight cores, you may specify --cpuset 1-7 or --cpuset 1,3,5-7.
  • --location http://192.168.1.254/c5/x86_64/img; This tells the OS' installer to look for installation files under the passed URL. The installation files could be local to the node (i.e. with a loop-back mounted ISO), on an NFS share or over FTP. This option can be replaced with --pxe for PXE server installs, --import for skipping an installation and directly importing a VM image or --livecd for booting a live CD/DVD.
  • --extra-args "ks=http://192.168.1.254/c5/x86_64/ks/generic_c5.ks"; This allows us to pass special arguments to the installer's kernel. In this case, we're telling the installer to use the kickstart file at the given location. Optionally, we could have used --extra-args "ks=http://192.168.1.254/c5/x86_64/ks/generic_c5.ks ksdevice=eth0" to specify which interface to use when looking for the defined kickstart file. I generally avoid this, as it is rather difficult to predict which physical interface will get which ethX name.
  • --os-type linux; This controls some internal optimization within Xen for handling Linux operating systems.
  • --os-variant rhel5.4; This further optimizes Xen for use with EL5.4 (and newer) based operating systems. When this option is used, --os-type is not strictly needed. The various supported --os-type and --os-variant are found in man virt-install.
  • --disk path=/dev/drbd_an4_vg0/vm0001_1; This tells the installer to allocate the LV we just created as this VM's hard drive. There are many options for using storage for VMs, please see man virt-install.
  • --network bridge=xenbr0; This tells virt-install to connect the VM to the xenbr0 bridge. Inside the VM, this will show up as eth0. If you had added a second --network bridge=xenbr2 argument, that interface would show up as eth1.
  • --vnc; This tells the VM to setup and export a VNC server. This is how we will connect to and monitor the installation of the VM.
  • --paravirt; This tells virt-install that we will be creating a paravirtual VM. The other option is --hvm which specifies full virtualization.
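Because you will craft a similar virt-install command for each new VM, it can help to assemble it from variables. This sketch only echoes the assembled command as a dry run (remove the echo to actually provision); the names and URLs are the ones used in this tutorial.

```shell
# Per-VM parameters, taken from the planning table above.
name=vm0001_c5_ws1
ram=1024
lv=/dev/drbd_an4_vg0/vm0001_1
url=http://192.168.1.254/c5/x86_64/img
ks=http://192.168.1.254/c5/x86_64/ks/generic_c5.ks

# Assemble the full provisioning command as one string.
cmd="virt-install --connect xen --name $name --ram $ram --arch x86_64 \
--vcpus 1 --cpuset 1 --location $url --extra-args ks=$ks \
--os-type linux --os-variant rhel5.4 --disk path=$lv \
--network bridge=xenbr0 --vnc --paravirt"

echo "$cmd"   # dry run; drop the echo (or eval "$cmd") to execute
```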

If things went well, you should now see your VM begin to install!

Installation of a kickstart-based text install of CentOS 5.6 as a Xen VM.

Once your VM is installed, we'll want to dump its configuration to an XML file. This way, should the VM ever be accidentally undefined, we can easily redefine it. In fact, we have to define this VM on the second node to enable migration, but we'll go into the details of migration later. For now, run the following virsh command to write the VM's definition to an XML file on the shared GFS2 partition. Putting it there makes it accessible to both nodes.

Warning: Do not bother dumping the configuration to an XML file until after the OS is fully installed and has rebooted. Until then, the configuration will contain arguments specific to the installation that will cause problems if used after the install is complete.

Personally, I like to keep the definition files in a subdirectory on the GFS2 share, then copy them to each node's local storage, just to be safe. Given that this is our first VM, we'll now create a directory for the definition files called definitions.

On an-node04:

mkdir /xen_shared/definitions
virsh dumpxml vm0001_c5_ws1 > /xen_shared/definitions/vm0001_c5_ws1.xml
cat /xen_shared/definitions/vm0001_c5_ws1.xml
<domain type='xen' id='1'>
  <name>vm0001_c5_ws1</name>
  <uuid>99ecbbd2-f277-0ede-756d-255b1436f8de</uuid>
  <memory>1073152</memory>
  <currentMemory>1073152</currentMemory>
  <vcpu cpuset='1'>1</vcpu>
  <bootloader>/usr/bin/pygrub</bootloader>
  <os>
    <type>linux</type>
    <kernel>/var/lib/xen/boot_kernel.w1l3CV</kernel>
    <initrd>/var/lib/xen/boot_ramdisk.nhjjMp</initrd>
    <cmdline>ro root=LABEL=/ crashkernel=auto</cmdline>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <disk type='block' device='disk'>
      <driver name='phy'/>
      <source dev='/dev/drbd_an4_vg0/vm0001_1'/>
      <target dev='xvda' bus='xen'/>
    </disk>
    <interface type='bridge'>
      <mac address='00:16:36:36:39:32'/>
      <source bridge='xenbr0'/>
      <script path='vif-bridge'/>
      <target dev='vif1.0'/>
    </interface>
    <console type='pty' tty='/dev/pts/2'>
      <source path='/dev/pts/2'/>
      <target port='0'/>
    </console>
    <input type='mouse' bus='xen'/>
    <graphics type='vnc' port='5900' autoport='yes' keymap='en-us'/>
  </devices>
</domain>

On Both Nodes:

rsync -av /xen_shared/definitions ~/
building file list ... done
definitions/
definitions/vm0001_c5_ws1.xml

sent 1311 bytes  received 48 bytes  2718.00 bytes/sec
total size is 1176  speedup is 0.87

The benefit of having backups on the local storage is to protect these rarely changing but critical files in case anything ever corrupted the shared storage. We've gone to great lengths to avoid this, but it's always possible and this is a simple precaution.
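To confirm that the local backup copies still match the shared originals, a recursive diff is enough. This sketch uses temporary directories as stand-ins for /xen_shared/definitions and ~/definitions, so it can be tried anywhere.

```shell
shared=$(mktemp -d)     # stand-in for /xen_shared/definitions
local_bak=$(mktemp -d)  # stand-in for ~/definitions

# Create a dummy definition file and back it up, as rsync did above.
echo '<domain type="xen"/>' > "$shared/vm0001_c5_ws1.xml"
cp "$shared/vm0001_c5_ws1.xml" "$local_bak/"

# diff -r exits 0 only when the two trees are identical.
if diff -r "$shared" "$local_bak" >/dev/null; then
    in_sync=yes
else
    in_sync=no
fi
rm -rf "$shared" "$local_bak"
echo "Backups in sync: $in_sync"
```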

Reconnecting to the VM

After the install finishes, or after you close the initial minimal VNC viewer, you will need to manually reconnect to the VM. This is where virt-manager comes in so handy!

Start it back up and double-click on the an-node04 host. You will now see the new vm0001_c5_ws1 VM. Double-click on it and you will be right back in the VM.

A view of virt-manager used to connect to the vm0001_c5_ws1 VM.

Pretty cool, eh!

How to Stop, Define and Start the VM

To stop the VM, you can log into it as you would a normal remote server and shut it down from within.

If you want to initiate a clean shutdown from the host node, you can use virsh to send a shutdown request over ACPI, the same as if you tapped the power button on a physical server.

Make sure the VM is on the node:

virsh list --all
 Id Name                 State
----------------------------------
  0 Domain-0             running
  2 vm0001_c5_ws1        idle

Tell it to shut down:

virsh shutdown vm0001_c5_ws1
Domain vm0001_c5_ws1 is being shutdown

If you had a VNC session running, you will see the VM start to gracefully shutdown.

Gracefully shutting down the firewall VM via virsh shutdown vm0001_c5_ws1.

After a few moments, the VM should shut down. You can confirm this by running virsh list --all again.

virsh list --all
 Id Name                 State
----------------------------------
  0 Domain-0             running
  - vm0001_c5_ws1        shut off
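Because virsh shutdown only sends the request, scripts should poll until the domain actually reaches the "shut off" state. In this sketch the state command is a stand-in function so the loop can run without a hypervisor; on a real node you would pass something like `virsh domstate vm0001_c5_ws1` instead.

```shell
# Poll a state-reporting command until it reports "shut off" or we time out.
wait_for_shutoff() {
    state_cmd=$1
    timeout=$2
    while [ "$timeout" -gt 0 ]; do
        [ "$($state_cmd)" = "shut off" ] && return 0
        sleep 1
        timeout=$((timeout - 1))
    done
    return 1
}

# Stand-in for 'virsh domstate vm0001_c5_ws1'; here the domain is already off.
fake_state() { echo "shut off"; }

if wait_for_shutoff fake_state 5; then result=stopped; else result=timeout; fi
echo "VM state: $result"
```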

Remember how we dumped this VM's configuration to an XML file on the GFS2 partition earlier? We're now going to use that to define the VM on the other node, then we'll start it up over there, too.

On an-node05:

Check that the VM isn't known by an-node05:

virsh list --all
 Id Name                 State
----------------------------------
  0 Domain-0             running

It's not there, as expected. So now we'll use the /xen_shared/definitions/vm0001_c5_ws1.xml file we created.

virsh define /xen_shared/definitions/vm0001_c5_ws1.xml
Domain vm0001_c5_ws1 defined from /xen_shared/definitions/vm0001_c5_ws1.xml

Now confirm that it's there.

virsh list --all
 Id Name                 State
----------------------------------
  0 Domain-0             running
  - vm0001_c5_ws1        shut off

We can now see vm0001_c5_ws1 on both nodes. Of course, never, ever try to start the VM on both nodes at the same time. In the previous step, we shut down vm0001_c5_ws1, but it's safest to make sure that it's still off.
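
Since starting the same VM on two nodes at once would corrupt its disk, a tiny guard can be run before any manual virsh start. This is a hypothetical helper, not part of the tutorial's procedure; it only parses virsh list --all on the local node.

```shell
# vm_is_off VM -- succeed only if VM is defined locally and reported "shut off".
vm_is_off() {
    virsh list --all | grep "$1" | grep -q "shut off"
}

# Example guard (hypothetical):
#   vm_is_off vm0001_c5_ws1 && virsh start vm0001_c5_ws1
```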


On an-node04:

virsh list --all
 Id Name                 State
----------------------------------
  0 Domain-0             running
  - vm0001_c5_ws1        shut off

So we now have vm0001_c5_ws1 shut off and defined on both an-node04 and an-node05, which means we can start it on either node. Let's start it up on the second node, just for fun.

On an-node05:

virsh start vm0001_c5_ws1
Domain vm0001_c5_ws1 started

If you look at virt-manager, you will now see vm0001_c5_ws1 up and running on an-node05 and shut off on an-node04.

View of vm0001_c5_ws1 running on an-node05.

There we go. We've now seen how to stop, define and start the VM using virsh. Nothing too fancy!

Testing VM Migration

One of the biggest benefits of virtual servers in clusters is that they can be migrated between nodes without needing to shut down the VM. This is useful for planned maintenance: you can push all of a node's VMs off to its peer, take the node out of the cluster and do your maintenance, and the VM users will see minimal or no interruption in service.

There are two types of migration;

  • Cold Migration; The VM is frozen, its RAM is copied to the other node and then it is thawed on the new host. This is the fastest method of migrating, but the users will see a period where they cannot interact with the VM.
  • Live Migration; The VM continues to run during the migration. Performance will degrade a bit and the migration process will take longer to complete, but users should not see any interruption in service.
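
On the command line, the only difference between the two is the --live switch. As a sketch (the helper itself is hypothetical, but the command it prints matches the tutorial's usage), the distinction looks like this:

```shell
# migrate_cmd MODE VM TARGET_HOST -- print the virsh migration command to run.
# MODE is "cold" or "live"; a cold migration simply omits the --live switch.
migrate_cmd() {
    opts=""
    [ "$1" = "live" ] && opts="--live "
    echo "virsh migrate ${opts}${2} xen+ssh://root@${3}"
}
```

For example, migrate_cmd live vm0001_c5_ws1 an-node04 prints the live-migration command used in the next step.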

To manually migrate the vm0001_c5_ws1 VM from an-node05 back to an-node04, run the following command.

On an-node05 (there will be no output):

virsh migrate --live vm0001_c5_ws1 xen+ssh://root@an-node04

If you flip over to virt-manager, you will see that the VM shows as Running under an-node04 and Shutoff under an-node05 right away, but there will still be CPU activity on both. This is the live migration process running. In the screen shot below, I opened a standard terminal, ssh'ed into vm0001_c5_ws1 and started a ping flood to Google before starting the live migration. Notice how the migration completed and no packets were dropped?

View of vm0001_c5_ws1 being live migrated to an-node04 from an-node05 with a ping-flood running via an ssh session.

This should tickle your geek glands.

How to "Pull the Power" on a VM

If something happens to the VM and you can't shut it down, virsh provides a command that is the equivalent of pulling the power on a physical server. This command forces the virtual server off without giving the VM a chance to react at all. For obvious reasons, you will want to be somewhat careful in using this, as it has all the same potential for problems as cutting the power to a real server.

So to "pull the plug", you can run this:

virsh destroy vm0001_c5_ws1
Domain vm0001_c5_ws1 destroyed

The VM is still defined, but it's no longer running.

virsh list --all
 Id Name                 State
----------------------------------
  0 Domain-0             running
  - vm0001_c5_ws1        shut off

How to Delete a VM and Start Over

Note: It is very likely that you will run into problems when you first start trying to provision your VM. If you want to delete the VM and start over, the way to do it is with virsh, the virtual shell.

Check that it's there.

virsh list --all
 Id Name                 State
----------------------------------
  0 Domain-0             running
  - vm0001_c5_ws1        shut off

"Undefine" it, which deletes it from Xen.

virsh undefine vm0001_c5_ws1
Domain vm0001_c5_ws1 has been undefined

Confirm that it is gone.

virsh list --all
 Id Name                 State
----------------------------------
  0 Domain-0             running

Now you can try again.

Provisioning vm0002_win1; A Windows Server

We're going to provision a Microsoft Windows 2008 server this time. This will largely be the same process as with vm0001_c5_ws1. The main difference is that we'll be installing from an ISO file, which was copied into /xen_shared/iso/Win_Server_2008_Bis_x86_64.iso.

Microsoft Windows is commercial software. You will need a proper license to use it in production, but you can download a trial version, which will be sufficient to follow along with this tutorial.

We won't go over all the details again, but we will show all the specific commands.

First, provision the logical volume that will back the new VM.

lvcreate -l 100%FREE -n vm0002_1 --addtag @an-cluster /dev/drbd_an5_vg0
  Logical volume "vm0002_1" created

Now we need to craft the provision script. One key difference is that we're going to create a "hardware virtualized machine", known as hvm, which requires support in the CPU. We'll also boot directly off of a DVD ISO, as if we had put a DVD in a drive and booted from it on a real server. We also need to change the --os-type and --os-variant values to windows.

virt-install --connect xen \
        --name vm0002_win1 \
        --ram 1048 \
        --arch x86_64 \
        --vcpus 1 \
        --cpuset 1 \
        --cdrom /xen_shared/iso/Win_Server_2008_Bis_x86_64.iso \
        --os-type windows \
        --os-variant win2k8 \
        --disk path=/dev/drbd_an5_vg0/vm0002_1 \
        --network bridge=xenbr0 \
        --vnc \
        --hvm
Starting the install of Windows 2008 R2 as a virtual machine

I like to close the default VNC session and flip over to virt-manager. This is what you should see if you do the same.

Monitoring the install of Windows 2008 R2 via virt-manager.

As before, let the install finish before proceeding. Once the install is completed and you've booted for the first time, dump the configuration to an XML file, define it on an-node04 and update the backups in each node's /root/ directory.

Dump the XML definition.

virsh dumpxml vm0002_win1 > /xen_shared/definitions/vm0002_win1.xml
ls -lah /xen_shared/definitions/vm0002_win1.xml
-rw-r--r-- 1 root root 1.5K Apr 17 01:46 /xen_shared/definitions/vm0002_win1.xml

Define the VM on an-node04.

virsh define /xen_shared/definitions/vm0002_win1.xml
Domain vm0002_win1 defined from /xen_shared/definitions/vm0002_win1.xml

Backup the new VM definition on each node.

rsync -av /xen_shared/definitions ~/
building file list ... done
definitions/
definitions/vm0002_win1.xml

sent 1651 bytes  received 48 bytes  3398.00 bytes/sec
total size is 2667  speedup is 1.57

Seeing the Windows 2008 R2 VM on both nodes via virt-manager.

Now we see both VMs defined on both nodes!

Making Our VMs Highly Available Cluster Services

We're ready to start the final step; Making our VMs highly available via cluster management! This involves two major steps:

  • Creating two new, ordered failover domains; one with each node as the highest priority.
  • Adding our VMs as services, one in each new failover domain.

Creating the Ordered Failover Domains

The idea here is that each new failover domain will have one node with a higher priority than the other. That is, one will have an-node04 with the highest priority and the other will have an-node05 as the highest. This way, VMs that we want to normally run on a given node will be added to the matching failover domain.

To add the two new failover domains, we'll add the following to /etc/cluster/cluster.conf:

                <failoverdomains>
                        ...
                        <failoverdomain name="an4_primary" nofailback="0" ordered="1" restricted="1">
                                <failoverdomainnode name="an-node04.alteeve.com" priority="1"/>
                                <failoverdomainnode name="an-node05.alteeve.com" priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="an5_primary" nofailback="0" ordered="1" restricted="1">
                                <failoverdomainnode name="an-node04.alteeve.com" priority="2"/>
                                <failoverdomainnode name="an-node05.alteeve.com" priority="1"/>
                        </failoverdomain>
                </failoverdomains>

As always, validate it. We'll see here what the complete file now looks like.

xmllint --relaxng /usr/share/system-config-cluster/misc/cluster.ng /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="an-cluster" config_version="14">
        <cman expected_votes="1" two_node="1"/>
        <clusternodes>
                <clusternode name="an-node04.alteeve.com" nodeid="1">
                        <fence>
                                <method name="node_assassin">
                                        <device name="batou" port="01" action="reboot"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="an-node05.alteeve.com" nodeid="2">
                        <fence>
                                <method name="node_assassin">
                                        <device name="batou" port="02" action="reboot"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice name="batou" agent="fence_na" ipaddr="batou.alteeve.com" login="admin" passwd="secret" quiet="1"/>
        </fencedevices>
        <fence_daemon post_join_delay="60"/>
        <totem rrp_mode="none" secauth="off"/>
        <rm>
                <resources>
                        <script file="/etc/init.d/drbd" name="drbd"/>
                        <script file="/etc/init.d/clvmd" name="clvmd"/>
                        <script file="/etc/init.d/gfs2" name="gfs2"/>
                </resources>
                <failoverdomains>
                        <failoverdomain name="an4_only" nofailback="0" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node04.alteeve.com" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="an5_only" nofailback="0" ordered="0" restricted="1">
                                <failoverdomainnode name="an-node05.alteeve.com" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="an4_primary" nofailback="0" ordered="1" restricted="1">
                                <failoverdomainnode name="an-node04.alteeve.com" priority="1" />
                                <failoverdomainnode name="an-node05.alteeve.com" priority="2" />
                        </failoverdomain>
                        <failoverdomain name="an5_primary" nofailback="0" ordered="1" restricted="1">
                                <failoverdomainnode name="an-node04.alteeve.com" priority="2" />
                                <failoverdomainnode name="an-node05.alteeve.com" priority="1" />
                        </failoverdomain>
                </failoverdomains>
                <service autostart="0" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart" restart_expire_time="0">
                        <script ref="drbd">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
                <service autostart="0" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart" restart_expire_time="0">
                        <script ref="drbd">
                                <script ref="clvmd">
                                        <script ref="gfs2"/>
                                </script>
                        </script>
                </service>
        </rm>
</cluster>
/etc/cluster/cluster.conf validates

With it validating, push it to the other node.

ccs_tool update /etc/cluster/cluster.conf
Config file updated from version 13 to 14

Update complete.

Adding The VMs To rgmanager

This is where we tell rgmanager which VMs we want to run on which nodes when both are online.

Note: There is a bit of a trick when using rgmanager with our cluster. There is no real way to delay the start of virtual machines until after the storage services are online. The side effect of this is that, if the VMs are set to automatically start with rgmanager, the VMs will fail because their underlying storage takes too long to come online. For this reason, we will not configure them to start automatically.

Creating the vm:<domu> Resources

Virtual machine services are a special case in rgmanager and have their own <vm .../> tag. Here are the two we will be adding for the two VMs we created in the previous section.

Warning: Make sure that the VMs are shut down before adding them to the cluster! Otherwise rgmanager will restart them when you first enable the new <vm /> resources.
        <rm>
                ...
                <vm name="vm0001_c5_ws1" domain="an4_primary" path="/xen_shared/definitions/"
                 autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
                <vm name="vm0002_win1" domain="an5_primary" path="/xen_shared/definitions/"
                 autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
        </rm>

The attributes are:

  • name; This is the name of the VM and must match the name of the VM shown by virsh as well as the definition file name, minus the .xml suffix.
  • domain; This is the name of the failover domain that this VM will operate within. You'll notice that the failover domain name and the LVM VGs used correspond. This is, of course, by preference and not a requirement.
  • path; This is the full path to where the VM definition files are kept. It is not the full path to the actual definition file itself!
  • autostart; As mentioned above, we do not want the VMs to start automatically with rgmanager, so we set this to 0.
  • exclusive; When set, this will prevent any other service from running on the node. This would take out the storage services, so this must be set to 0.
  • recovery; This is how the VM should be recovered after it crashes. The options are restart, relocate and disable.
  • max_restarts; This is how many times the VM is allowed to be restarted (after a crash) before the VM is relocated to another node in the failover domain. The idea here is that, normally, we simply want to restart the VM in place when the VM itself crashed and the underlying node is healthy. However, once the VM has restarted this number of times, we assume that there is actually a problem with running the VM on the current node, so we give up and move it to another node. We will allow 2 restarts before switching to a relocation.
  • restart_expire_time; Whenever a VM is restarted, a counter is incremented, which is compared against max_restarts. After this many seconds, that restart is "forgotten" and the restart counter is reduced by one. With our value of 600 seconds (10 minutes) and a max_restarts of 2, the VM will be relocated instead of restarted after the third crash in ten minutes.
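
The accounting described above can be illustrated with a little arithmetic. This sketch only illustrates the rule as described; should_relocate and its inputs are hypothetical and are not rgmanager's actual implementation.

```shell
# should_relocate "CRASH_TIMES" NOW MAX_RESTARTS WINDOW -- decide restart vs relocate.
# CRASH_TIMES is a space-separated list of past crash timestamps, in seconds.
should_relocate() {
    n=0
    for t in $1; do
        # only crashes younger than WINDOW seconds still count against the VM
        [ $(($2 - t)) -lt "$4" ] && n=$((n + 1))
    done
    [ "$n" -gt "$3" ]
}

# With max_restarts=2 and restart_expire_time=600, a third crash inside any
# ten-minute window means relocation (hypothetical usage):
#   should_relocate "0 200 400" 500 2 600 && echo relocate
```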

Again, validate it. We'll see here what the complete file now looks like.

xmllint --relaxng /usr/share/system-config-cluster/misc/cluster.ng /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="15" name="an-cluster">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-node04.alteeve.com" nodeid="1">
			<fence>
				<method name="node_assassin">
					<device action="reboot" name="batou" port="01"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node05.alteeve.com" nodeid="2">
			<fence>
				<method name="node_assassin">
					<device action="reboot" name="batou" port="02"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice agent="fence_na" ipaddr="batou.alteeve.com" login="admin" name="batou" passwd="secret" quiet="1"/>
	</fencedevices>
	<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="60"/>
	<totem rrp_mode="none" secauth="off"/>
	<rm>
		<resources>
			<script file="/etc/init.d/drbd" name="drbd"/>
			<script file="/etc/init.d/clvmd" name="clvmd"/>
			<script file="/etc/init.d/gfs2" name="gfs2"/>
		</resources>
		<failoverdomains>
			<failoverdomain name="an4_only" nofailback="0" ordered="0" restricted="1">
				<failoverdomainnode name="an-node04.alteeve.com" priority="1"/>
			</failoverdomain>
			<failoverdomain name="an5_only" nofailback="0" ordered="0" restricted="1">
				<failoverdomainnode name="an-node05.alteeve.com" priority="1"/>
			</failoverdomain>
			<failoverdomain name="an4_primary" nofailback="0" ordered="1" restricted="1">
				<failoverdomainnode name="an-node04.alteeve.com" priority="1"/>
				<failoverdomainnode name="an-node05.alteeve.com" priority="2"/>
			</failoverdomain>
			<failoverdomain name="an5_primary" nofailback="0" ordered="1" restricted="1">
				<failoverdomainnode name="an-node04.alteeve.com" priority="2"/>
				<failoverdomainnode name="an-node05.alteeve.com" priority="1"/>
			</failoverdomain>
		</failoverdomains>
		<service autostart="0" domain="an4_only" exclusive="0" max_restarts="0" name="an4_storage" recovery="restart" restart_expire_time="0">
			<script ref="drbd">
				<script ref="clvmd">
					<script ref="gfs2"/>
				</script>
			</script>
		</service>
		<service autostart="0" domain="an5_only" exclusive="0" max_restarts="0" name="an5_storage" recovery="restart" restart_expire_time="0">
			<script ref="drbd">
				<script ref="clvmd">
					<script ref="gfs2"/>
				</script>
			</script>
		</service>
		<vm name="vm0001_c5_ws1" domain="an4_primary" path="/xen_shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
		<vm name="vm0002_win1" domain="an5_primary" path="/xen_shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
	</rm>
</cluster>
/etc/cluster/cluster.conf validates

Now push the updated configuration out.

ccs_tool update /etc/cluster/cluster.conf
Config file updated from version 14 to 15

Update complete.

Using the new VM Resources

Note: We'll be running all of the commands in this section on an-node04.

If you now run clustat on either node, you should see the new VM resources.

clustat
Cluster Status for an-cluster @ Mon Apr 18 17:10:37 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online, Local, rgmanager
 an-node05.alteeve.com                       2 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 service:an4_storage            an-node04.alteeve.com          started       
 service:an5_storage            an-node05.alteeve.com          started       
 vm:vm0001_c5_ws1               (none)                         disabled      
 vm:vm0002_win1                 (none)                         disabled

Now we can start the VMs using rgmanager!

Note: As we'll be starting a non-standard vm service, we need to type out the full service name, vm:domu.
clusvcadm -e vm:vm0001_c5_ws1
Local machine trying to enable vm:vm0001_c5_ws1...Success
vm:vm0001_c5_ws1 is now running on an-node04.alteeve.com

If you check with virsh, you'll see it running now.

virsh list --all
 Id Name                 State
----------------------------------
  0 Domain-0             running
  3 vm0001_c5_ws1        idle
  - vm0002_win1          shut off

Likewise, if you check clustat, you will see the new VM service running on an-node04.

clustat
Cluster Status for an-cluster @ Mon Apr 18 22:05:33 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online, Local, rgmanager
 an-node05.alteeve.com                       2 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 service:an4_storage            an-node04.alteeve.com          started       
 service:an5_storage            an-node05.alteeve.com          started       
 vm:vm0001_c5_ws1               an-node04.alteeve.com          started       
 vm:vm0002_win1                 (none)                         disabled

So far, so good. Now let's start the vm0002_win1 VM.

clusvcadm -e vm:vm0002_win1
Local machine trying to enable vm:vm0002_win1...Success
vm:vm0002_win1 is now running on an-node04.alteeve.com

It started, but it didn't start on the node we normally want!

clustat
Cluster Status for an-cluster @ Mon Apr 18 22:33:25 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online, Local, rgmanager
 an-node05.alteeve.com                       2 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 service:an4_storage            an-node04.alteeve.com          started       
 service:an5_storage            an-node05.alteeve.com          started       
 vm:vm0001_c5_ws1               an-node04.alteeve.com          started       
 vm:vm0002_win1                 an-node04.alteeve.com          started

The vm0002_win1 VM started on the node that the command was executed from. We could have added -m an-node05.alteeve.com to the clusvcadm call, which we'll do later. It's already running though, so let's use this "mistake" as a chance to look at migrating the VM using clusvcadm.

So to tell rgmanager to perform a live migration from an-node04 to an-node05, use the special -M switch along with the normal -m member switch naming the target node. For more information on these switches, please take a few minutes to read man clusvcadm.

clusvcadm -M vm:vm0002_win1 -m an-node05.alteeve.com
Trying to migrate vm:vm0002_win1 to an-node05.alteeve.com...Success

Now we can use clustat to see that vm:vm0002_win1 service is now running on the proper an-node05.alteeve.com node.

clustat
Cluster Status for an-cluster @ Tue Apr 19 01:02:55 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online, rgmanager
 an-node05.alteeve.com                       2 Online, Local, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 service:an4_storage            an-node04.alteeve.com          started       
 service:an5_storage            an-node05.alteeve.com          started       
 vm:vm0001_c5_ws1               an-node04.alteeve.com          started       
 vm:vm0002_win1                 an-node05.alteeve.com          started

Before starting the migration, I logged into the vm0002_win1 machine and started a continuous ping against the router. As you can see below, one packet was stalled by roughly 15ms, and no packets were lost.

Continuous ping from within the live-migrated VM against the router.

Congratulations, Your Cluster Is Complete! Now, Break It!

You may have noticed that the two storage resources are still not set to automatically start with rgmanager. This is on purpose, as we now need to work through all of the possible failure modes. Until we've done so, our cluster is not production ready!

It's true, at this point the cluster is technically finished. As we'll soon see, we can kill a node and its lost VMs will recover on the surviving node. However, that is only a part of this exercise.

Remember back at the beginning how we talked about the inherent complexity of clusters? We need to now break our cluster at every point within that complexity that we can. We need to see how things go wrong so that we can learn how to resolve the problems that will arise now, while we have the luxury of time and a cluster with no real data on it.

Once you go in to production, it is too late to learn.

Back Up a Second; Let's See How It's Supposed to Work

Before we grab a hammer, let's go over how a clean stop and start should work.

Gracefully Shutting Down the Cluster

If you've followed through this tutorial in order, you probably already have everything running, so let's start by talking about how to shut down the cluster properly.

The stop order is:

  • Lock rgmanager services that can migrate; The vm services in our case.
  • Disable all rgmanager services.
  • Stop the rgmanager daemon.
  • Stop the cman daemon.
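
The four steps above can be sketched as a script. This is a sketch only, using this tutorial's own service names; the init-script paths are held in variables purely to keep the sequence readable. The clusvcadm commands only need to run once, from either node, while the init scripts must be run on both.

```shell
# Graceful cluster stop, following the order above (a sketch).
RGMANAGER="${RGMANAGER:-/etc/init.d/rgmanager}"
CMAN="${CMAN:-/etc/init.d/cman}"

stop_cluster() {
    for vm in vm:vm0001_c5_ws1 vm:vm0002_win1; do
        clusvcadm -l "$vm"   # lock: from now on, only "disable" is permitted
    done
    for vm in vm:vm0001_c5_ws1 vm:vm0002_win1; do
        clusvcadm -d "$vm"   # disable: blocks until the VM has shut down
    done
    "$RGMANAGER" stop        # also takes down the storage services (run on both nodes)
    "$CMAN" stop             # run on both nodes
}
```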

Stopping the virtual machines is no longer a simple task. If you try to power down a VM from within its OS, the cluster will "recover" it as soon as it shuts off. Likewise if you try to stop it using virsh shutdown domU. You can stop a VM by simply disabling it via rgmanager, but that is not enough when preparing for a complete shutdown of the cluster, as the VM could be restarted on another node in some cases.

To ensure that the VM stays off, we'll "lock" the service. This will prevent all actions except for disabling. Once quorum is lost though, this lock is lost, so you don't need to worry about unlocking it later when you restart the cluster.

So let's take a look at the running resources.

clustat
Cluster Status for an-cluster @ Wed Apr 20 00:31:53 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online, Local, rgmanager
 an-node05.alteeve.com                       2 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 service:an4_storage            an-node04.alteeve.com          started       
 service:an5_storage            an-node05.alteeve.com          started       
 vm:vm0001_c5_ws1               an-node04.alteeve.com          started       
 vm:vm0002_win1                 an-node05.alteeve.com          started

We don't need to worry about the two storage services as they're in failover domains that, well, don't fail over anyway. Thus, we'll lock the two VMs. Note that it doesn't matter where the lock is issued.

clusvcadm -l vm:vm0001_c5_ws1
Resource groups locked
clusvcadm -l vm:vm0002_win1
Resource groups locked

I don't know of a way to see whether a service has been locked, as clustat will show no change. However, you can unlock a service if you decide not to shut down the cluster by replacing the -l switch with -u in the calls above.

Locking the two VM services prior to cluster shutdown.

Now you can disable the two VM services safely. Note that the disable call will not return until the VM has shut down, so be patient.

clusvcadm -d vm:vm0001_c5_ws1
Local machine disabling vm:vm0001_c5_ws1...Success
clusvcadm -d vm:vm0002_win1
Local machine disabling vm:vm0002_win1...Success
Disabling the two VM services prior to cluster shutdown.

You may notice in the screenshot above that both VMs were disabled from an-node04, despite vm0002_win1 running on an-node05, even without the -m <node> option.

Check to confirm that the VMs are off now.

clustat
Cluster Status for an-cluster @ Wed Apr 20 00:49:08 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online, Local, rgmanager
 an-node05.alteeve.com                       2 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 service:an4_storage            an-node04.alteeve.com          started       
 service:an5_storage            an-node05.alteeve.com          started       
 vm:vm0001_c5_ws1               (an-node04.alteeve.com)        disabled      
 vm:vm0002_win1                 (an-node05.alteeve.com)        disabled

Now that the VMs are down, we can stop rgmanager on both nodes. This will stop the storage services on each node in the process, and we don't need to worry about them being restarted, as they can't fail over to another node and rgmanager will be gone before they could restart. It's a bit lazy, but it's safe.

/etc/init.d/rgmanager stop
Shutting down Cluster Service Manager...
Waiting for services to stop:                              [  OK  ]
Cluster Service Manager is stopped.
Stopping the rgmanager daemon on both nodes.

Notice in the screenshot above that we can see the storage service halting after rgmanager is told to stop.

We can confirm that storage is stopped simply by checking the status of drbd. If anything went wrong, one or more of the DRBD resources would have been held open and prevented from stopping. If the drbd module is unloaded, we know that the shutdown was successful.

Check this from both nodes.

/etc/init.d/drbd status
drbd not loaded
Verifying that rgmanager and the storage resources stopped completely.

Now, all that is left is to stop cman!

/etc/init.d/cman stop
Stopping cluster: 
   Stopping fencing... done
   Stopping cman... done
   Stopping ccsd... done
   Unmounting configfs... done
                                                           [  OK  ]
Stopping cman to complete the cluster shut down.

That's it; you can now safely shut down the nodes!

Cold Starting the Cluster

Starting the cluster from scratch is a little different from starting and joining a node to an existing cluster, as we will see later. There are two main reasons:

  • If a node doesn't hear back from the other node when openais starts, it must assume that the other node has crashed and that it needs to be fenced. Remember the post_join_delay? That is the maximum amount of time that a node will wait on start before it fires off a fence. Thus, we must start cman on both nodes within post_join_delay seconds of one another.
  • DRBD will not start until both nodes can talk to each other. If you start the storage service on either node, drbd will hang forever waiting for the other node to show up.

Once both nodes are up, you can shut one node back down and safely run on just the one node. This is because the surviving node will see the other node withdraw, and thus will know with confidence that the peer is not going to access the clustered resources.

With this in mind, the cold-start order is:

  1. Start the cman daemon on both nodes within post_join_delay seconds.
  2. Start the rgmanager daemon on both nodes. At this point, the storage services are not set to start with the system, so there are no timing concerns yet.
  3. Enable the storage services on both nodes. We did not enable the DRBD timeout, so we don't have timing concerns here, but be aware that the enable command on the first node will not complete or return until the storage service has been started on the second node. For this reason, you'll want to have two terminals open; one connected to each node.
  4. Verify that the storage services are all online.
  5. Start the virtual machine resources in the order that best suits you.
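To tie the steps together, the cold-start order can be sketched as a small shell script. This is not part of the cluster software; the service and node names come from this tutorial, and the DRY_RUN switch is a hypothetical safety so that the script prints its plan rather than executing anything until you are ready.

```shell
#!/bin/sh
# cold_start.sh - sketch of the cold-start order described above.
# Run it on BOTH nodes within post_join_delay seconds of one another.
# DRY_RUN defaults to 1, so the script only prints its plan; set
# DRY_RUN=0 on the real nodes to actually execute the commands.
DRY_RUN="${DRY_RUN:-1}"

run() {
	if [ "$DRY_RUN" = "1" ]; then
		echo "would run: $*"
	else
		"$@" || exit 1
	fi
}

run /etc/init.d/cman start       # step 1: on both nodes, quickly
run /etc/init.d/rgmanager start  # step 2: no timing concerns here
run clusvcadm -e an4_storage     # step 3: blocks until the peer's
run clusvcadm -e an5_storage     #         storage service is enabled too
run clustat                      # step 4: verify that storage is online
# step 5: start the VMs in whatever order suits you
run clusvcadm -e vm:vm0001_c5_ws1 -m an-node04.alteeve.com
```

Again, this is just the five steps in script form; on real nodes you would still want to watch the output of each step before moving on.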

So, start cman:

/etc/init.d/cman start
Starting cluster: 
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
                                                           [  OK  ]
Starting cman on both nodes at the same time.

Now we'll start rgmanager on both nodes.

/etc/init.d/rgmanager start
Starting Cluster Service Manager:                          [  OK  ]

I like to make a habit of running clustat right after starting, just to ensure that services are or are not running, as I'd expect.

clustat
Cluster Status for an-cluster @ Tue Apr 19 23:54:02 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online, Local, rgmanager
 an-node05.alteeve.com                       2 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 service:an4_storage            (none)                         disabled      
 service:an5_storage            (none)                         disabled      
 vm:vm0001_c5_ws1               (none)                         disabled      
 vm:vm0002_win1                 (none)                         disabled
Starting rgmanager on both nodes and checking service states with clustat.

The astute observer will notice two things in the screen shot above.

First, I ran clustat on an-node04 after rgmanager had started, but no services were listed. This is not a problem; it just takes a minute for the service states to become known to rgmanager.

Second, the log files are complaining that they could not find the VM definition files in the search path. Remember back in the rgmanager section how we talked about the delay in getting the clustered storage online? This is the problem. The definitions are on the GFS2 partition, which isn't available yet. Even if we started the storage resources with rgmanager, which we will do later, it still wouldn't be fast enough to prevent rgmanager from failing to find the definition files and giving up. This is why we'll always need to start the virtual machines manually.

As an aside, this isn't a problem with pacemaker, as we'll see in the EL6 tutorial later.

So, back to it then; let's start the clustered storage services. As an experiment, start the an4_storage service and then wait some time before starting an5_storage. You'll see that the first service will pause indefinitely, as we discussed.

clusvcadm -e an4_storage
Local machine trying to enable service:an4_storage...
Starting an4_storage on just an-node04.

Once you start the an5_storage service, both will complete and return to the command line. Once done, I like to run a status check of drbd, clvmd and gfs2 to ensure that things are as I expect them.

Starting an5_storage on an-node05 and then performing the status checks.
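Exactly what those status checks look like is a matter of taste. As a minimal sketch, a helper like the following (check_storage is a hypothetical name, and /etc/init.d/drbd and /etc/init.d/clvmd are the standard EL5 init script locations) covers drbd, clvmd and the GFS2 mount:

```shell
# check_storage - sketch: confirm that drbd, clvmd and the GFS2 mount are
# up after the storage services have been enabled. Hypothetical helper.
check_storage() {
	for svc in drbd clvmd; do
		if [ -x "/etc/init.d/$svc" ]; then
			"/etc/init.d/$svc" status
		else
			echo "WARN: $svc init script not found"
		fi
	done
	# the GFS2 partition should appear in the kernel's mount table
	if grep -q ' gfs2 ' /proc/mounts 2>/dev/null; then
		echo "gfs2: mounted"
	else
		echo "gfs2: NOT mounted"
	fi
}
```

On a healthy node you would also check cat /proc/drbd by hand and confirm that both resources show Connected and UpToDate/UpToDate.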

Everything is in place, so now we can start the virtual machines. Given that VMs can run on either node, it's a good habit to explicitly specify which node each VM should start on.

Start the web server:

clusvcadm -e vm:vm0001_c5_ws1 -m an-node04.alteeve.com
Member an-node04.alteeve.com trying to enable vm:vm0001_c5_ws1...Success
vm:vm0001_c5_ws1 is now running on an-node04.alteeve.com

Start the windows server:

clusvcadm -e vm:vm0002_win1 -m an-node05.alteeve.com
Member an-node05.alteeve.com trying to enable vm:vm0002_win1...Success
vm:vm0002_win1 is now running on an-node05.alteeve.com

With the -m <node> switch, we can enable the VM service from any node in the cluster, and it will start on the node we want.

Starting both VMs and then checking their status with clustat.

There we have it! The cluster is up and running from a complete cold start.
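Since clustat gets run so often during these checks, a tiny awk filter can pull out just the state of a single service. This is only a convenience sketch; service_state is a hypothetical name and it assumes the clustat output layout shown above.

```shell
# service_state - print the state column clustat reports for one service.
# Hypothetical helper; reads clustat output on stdin.
service_state() {
	awk -v svc="$1" '$1 == svc { print $NF }'
}

# e.g.: clustat | service_state vm:vm0001_c5_ws1
```

This makes it easy to script checks like "is the web server VM started yet?" without eyeballing the full clustat table.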

Testing Migration

We've already looked at live migration of VMs before they were added to the cluster, but we've not yet looked at live migrations within the cluster.

Our tests will cover:

  • A controlled migration, as will be done before planned maintenance on a node or after restoring a node and migrating VMs back to it.
  • Crashing a VM directly, and making sure that rgmanager detects the crash and restarts the VM.
  • Crashing a VM enough times and within enough time to trigger a relocation to the second node.
  • Crashing the host node and checking that lost VMs restart on the surviving node.

Controlled Live Migration Using clusvcadm

There will be times when you will want to migrate a VM off of a node. The classic example would be upgrading the hardware, installing a new kernel or repairing a RAID array. When you know ahead of time that a node will go down, you can easily migrate the VM services off of it to another node in the cluster.

Let's look at migrating the vm0001_c5_ws1 VM from an-node04 to an-node05. First, confirm that it is on the source node.

clustat
Cluster Status for an-cluster @ Wed Apr 20 12:34:19 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online, Local, rgmanager
 an-node05.alteeve.com                       2 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 service:an4_storage            an-node04.alteeve.com          started       
 service:an5_storage            an-node05.alteeve.com          started       
 vm:vm0001_c5_ws1               an-node04.alteeve.com          started       
 vm:vm0002_win1                 (none)                         disabled

Now perform the actual migration. Note that we will be using the special -M (live migrate) switch, rather than the usual -r (relocate) switch.

clusvcadm -M vm:vm0001_c5_ws1 -m an-node05.alteeve.com
Trying to migrate vm:vm0001_c5_ws1 to an-node05.alteeve.com...Success

If you then run clustat again, you will see the VM now running on the target node.

clustat
Cluster Status for an-cluster @ Wed Apr 20 12:38:11 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online, Local, rgmanager
 an-node05.alteeve.com                       2 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 service:an4_storage            an-node04.alteeve.com          started       
 service:an5_storage            an-node05.alteeve.com          started       
 vm:vm0001_c5_ws1               an-node05.alteeve.com          started       
 vm:vm0002_win1                 (none)                         disabled
Live migrating vm0001_c5_ws1 from an-node04 to an-node05.

That was easy!

Crashing the VM Itself

There are many ways to crash a VM, and you can and should try crashing it in all the ways you can think of. On Linux machines, we can trigger a crash by echo'ing c to the special /proc/sysrq-trigger file. This will instantly crash the server and you will not see the command return.

Let's do this to the vm0001_c5_ws1 VM. Connect to the virtual machine, either directly to its console by running xm console vm0001_c5_ws1 from the host, or by ssh'ing into the machine. Once logged in, run:

echo c > /proc/sysrq-trigger

Within moments, you will see the Xen vifX.Y interfaces go down and new vifZ.Y interfaces get created as the VM is restarted. If you are fast enough, you may see clustat report the VM as disabled, though it starts up very quickly so it may be hard to catch.

Killing a VM internally and watching it restart.
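If catching the restart by eye proves too hit-and-miss, you could poll clustat from a script until the service reports started again. This is purely a sketch; wait_for_started and its timeout argument are hypothetical, and it assumes the clustat output layout shown earlier.

```shell
# wait_for_started - sketch: poll clustat until the named service reports
# 'started', or give up after a timeout. Hypothetical helper.
# $1 = service name (e.g. vm:vm0001_c5_ws1), $2 = timeout in seconds
wait_for_started() {
	t=0
	while [ "$t" -lt "$2" ]; do
		if clustat | awk -v s="$1" \
			'$1 == s && $NF == "started" { found=1 } END { exit(found ? 0 : 1) }'
		then
			return 0
		fi
		sleep 1
		t=$((t + 1))
	done
	return 1
}
```

For example, wait_for_started vm:vm0001_c5_ws1 60 would return as soon as rgmanager reports the VM as started, or fail after a minute.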

Crashing the VM Enough Times to Trigger a Relocation

Note: This doesn't seem to be working at the moment. Filed a Red Hat bugzilla ticket.

VM always restarts on the node it was last running on.

Crashing the Host Node

Note: While writing this section, I found a bug in Node Assassin where the simultaneous connections to the Node Assassin from cman and drbd would cause both to fail. This is a great example of why failure testing prior to going in to production is so important! Even when you think you've got it all sorted out, you will still find issues that, in hindsight, should have been obvious but were missed during your cluster design.

As we did when we crashed the virtual machine, we will crash the operating system on the node currently running one of the VMs. In this case, we have vm0001_c5_ws1 running on an-node05 and vm0002_win1 running on an-node04.

clustat
Cluster Status for an-cluster @ Wed Apr 20 15:03:30 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-node04.alteeve.com                       1 Online, Local, rgmanager
 an-node05.alteeve.com                       2 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 service:an4_storage            an-node04.alteeve.com          started       
 service:an5_storage            an-node05.alteeve.com          started       
 vm:vm0001_c5_ws1               an-node05.alteeve.com          started       
 vm:vm0002_win1                 an-node04.alteeve.com          started

Once we crash an-node04, watch the log file on an-node05. You will see the crashed an-node04 machine get fenced by both DRBD and cman.

echo c > /proc/sysrq-trigger

Testing Done - Going Into Production

Troubleshooting

ToDo

Common Administrative Tasks

This is far from a comprehensive list!

This section will attempt to cover some of the day-to-day tasks you may want to perform on your cluster of VMs.

Enabling MTU Sizes Over 1500 Bytes

Warning: This requires the use of a kernel compiled outside of the main repos. For this reason, do not apply this unless you have a particular need for jumbo frames and are willing to take on the additional risk of installing and running an unsupported kernel.

ToDo: Cover actual examples of setting MTU sizes, sending large-packets pings to test and blocking yum from updating the kernel.

Currently, enabling MTU sizes over 1500 bytes requires compiling a new kernel and replacing/patching two Xen scripts. I've made a pre-compiled kernel and the patched scripts available at https://alteeve.com/xen. Red Hat bugzilla bugs have been filed but, as of the time of writing, the fixes have not been applied. You can track the bug progress below:

Below is a pretty ugly bash chain of commands that will download, install and copy into place everything needed to make jumbo frames work.

cd /etc/xen/ && \
	mv qemu-ifup qemu-ifup.orig && \
	wget https://alteeve.com/xen/qemu-ifup && \
	cd scripts/ && \
	mv xen-network-common.sh xen-network-common.sh.orig && \
	wget https://alteeve.com/xen/xen-network-common.sh && \
	cd ~ && \
	wget https://alteeve.com/xen/RPMS/x86_64/kernel-xen-2.6.18-238.9.3.el5.x86_64.rpm && \
	wget https://alteeve.com/xen/RPMS/x86_64/kernel-headers-2.6.18-238.9.3.el5.x86_64.rpm && \
	wget https://alteeve.com/xen/RPMS/x86_64/kernel-devel-2.6.18-238.9.3.el5.x86_64.rpm && \
	wget https://alteeve.com/xen/RPMS/x86_64/kernel-2.6.18-238.9.3.el5.x86_64.rpm && \
	rpm -ivh ~/kernel-*
Warning: Choosing a jumbo frame size larger than what is supported by your network interfaces and switches will cause networking to fail when the first large packet is sent. Consult your hardware documentation before setting an MTU size and remember to use the lowest size supported by all of your equipment. Note that some manufacturers will claim jumbo frame support when they actually only support ~4000 bytes.

Once this is done, you will need to reboot to use the new kernel. Before you do though, edit your /etc/sysconfig/network-scripts/ifcfg-eth* files and add MTU=xxxx, where xxxx is the frame size you want.
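For example, an ifcfg file with jumbo frames enabled might look like the fragment below. The 9000-byte MTU is an assumption for illustration only; use the largest size that all of your NICs and switches actually support.

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth1 (example fragment only;
# the MTU value of 9000 is an assumption, not a recommendation)
DEVICE=eth1
BOOTPROTO=static
ONBOOT=yes
MTU=9000
```

After rebooting, you can verify the setting with a large, non-fragmenting ping, e.g. ping -M do -s 8972 <peer>; 8972 bytes of payload plus 28 bytes of IP and ICMP headers exactly fills a 9000-byte frame.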

Once set, you can reboot.

Renaming a Virtual Machine

There may be times when you want to rename a VM domain. For example, if you provision a machine and then realize that you gave it a name that didn't describe it properly.

Things to keep in mind before starting;

  • The new name of the VM must match the name of the definition file as well as the name of the VM service in cluster.conf
  • The VM will need to be shut down for the renaming process to succeed.

At this time, the only way to rename a VM is:

  1. Use virsh dumpxml old_name > /xen_shared/definitions/new_name.xml.
  2. Shut down the VM with virsh shutdown old_name.
  3. Edit the /xen_shared/definitions/new_name.xml XML definition file and change <name>old_name</name> to <name>new_name</name>.
  4. Undefine the VM using virsh undefine old_name on all nodes.
  5. Redefine the VM using virsh define /xen_shared/definitions/new_name.xml on all nodes.
  6. Update the cluster service name.
    1. Edit /etc/cluster/cluster.conf and change <vm name="old_name" ... /> to <vm name="new_name" ... />
    2. Increment the <cluster ... config_version="x"> attribute.
    3. Push the new cluster configuration using ccs_tool update /etc/cluster/cluster.conf.
  7. Confirm that the new name is seen by both virsh list --all and clustat.
  8. Start the VM back up.
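Step 3 can be done in any editor, or scripted. The helper below is only a sketch (rename_vm_xml is a hypothetical name); it swaps the <name> element in a dumped definition file.

```shell
# rename_vm_xml - sketch for step 3: swap the <name> element in a dumped
# definition file. Hypothetical helper; arguments are file, old, new.
rename_vm_xml() {
	sed -i "s|<name>$2</name>|<name>$3</name>|" "$1"
}

# e.g.: rename_vm_xml /xen_shared/definitions/new_name.xml old_name new_name
```

Note that this only edits the XML file; you still need to undefine, redefine and update cluster.conf as described in the remaining steps.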

Adding Space to a VM

Here we will see what it takes to add a new 50 GiB LV to a VM as a second virtual hard drive.

This process requires a few steps.

  • Set up the /dev/drbd3 resource as a new LVM PV.
  • Create a new VG called drbd_an4_vg1.
  • Carve out a 50 GiB LV called vm0001_xvdb.
  • Attach it to the vm0001_c5_ws1 VM.
  • Dump the VM's updated configuration to /xen_shared/definitions/vm0001_c5_ws1.xml.
  • Redefine the VM on an-node05 (assuming that it is currently running on an-node04).
  • Log into the vm0001_c5_ws1 VM, format the new space and add the partition to /etc/fstab.
Note: It is assumed that vm0001_c5_ws1 is currently running on an-node04. Unless stated otherwise, all the following commands should, thus, be run from an-node04.

Creating a new PV, VG and LV

Create the new PV:

pvcreate /dev/drbd3
  Physical volume "/dev/drbd3" successfully created

Create the new VG:

vgcreate -c y --addtag @an-cluster drbd_an4_vg1 /dev/drbd3
  Clustered volume group "drbd_an4_vg1" successfully created

Create the new LV:

lvcreate -L 50G --addtag @an-cluster -n vm0001_xvdb /dev/drbd_an4_vg1
  Logical volume "vm0001_xvdb" created

Attaching the new LV to the VM

Attach the new LV to the vm0001_c5_ws1 VM. This is done using the virsh attach-disk command. We'll tell virsh to attach the new LV and present it as /dev/xvdb within the VM.

virsh attach-disk vm0001_c5_ws1 /dev/drbd_an4_vg1/vm0001_xvdb xvdb
Disk attached successfully
Note: Log in to the vm0001_c5_ws1 VM and run the following commands there. Note that, in this tutorial, the VM's hostname has been changed to vm0001_c5_ws1 and it has been statically assigned the IP address 192.168.1.253.
ssh root@192.168.1.253
root@192.168.1.253's password: 
Last login: Sun Apr  3 18:18:13 2011 from 192.168.1.102
[root@vm0001_c5_ws1 ~]#

Confirm that the new /dev/xvdb device now exists.

fdisk -l
Disk /dev/xvda: 10.7 GB, 10737418240 bytes
255 heads, 63 sectors/track, 1305 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

    Device Boot      Start         End      Blocks   Id  System
/dev/xvda1   *           1          33      265041   83  Linux
/dev/xvda2              34         164     1052257+  82  Linux swap / Solaris
/dev/xvda3             165        1305     9165082+  83  Linux

Disk /dev/xvdb: 53.6 GB, 53687091200 bytes
255 heads, 63 sectors/track, 6527 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Configuring the new Virtual Hard Drive in the VM

From here on in, we'll be proceeding exactly the same as if we had added a real hard drive to a bare-iron server.

Create a single partition out of the new space.

fdisk /dev/xvdb
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.


The number of cylinders for this disk is set to 6527.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-6527, default 1): 
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-6527, default 6527): 
Using default value 6527

Command (m for help): p

Disk /dev/xvdb: 53.6 GB, 53687091200 bytes
255 heads, 63 sectors/track, 6527 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

    Device Boot      Start         End      Blocks   Id  System
/dev/xvdb1               1        6527    52428096   83  Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
Note: Unlike when we worked on the nodes, we do not need to reboot the VM to see the changes on the disk. This is not because it's a virtual server, but because the new disk is not yet in use by the OS, so the kernel can re-read its partition table immediately.

Now, format the new /dev/xvdb1 partition.

mkfs.ext3 /dev/xvdb1
mke2fs 1.39 (29-May-2006)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
6553600 inodes, 13107024 blocks
655351 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
400 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424

Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 32 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

Check that /var/www does not yet exist. If it doesn't, create it.

ls -lah /var/www
ls: /var/www: No such file or directory
mkdir /var/www
ls -lah /var/www
total 12K
drwxr-xr-x  2 root root 4.0K Apr  3 23:01 .
drwxr-xr-x 21 root root 4.0K Apr  3 23:01 ..

Mount the newly formatted partition.

mount /dev/xvdb1 /var/www/
df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvda3            8.5G  2.0G  6.1G  25% /
/dev/xvda1            251M   25M  214M  11% /boot
tmpfs                 524M     0  524M   0% /dev/shm
/dev/xvdb1             50G  180M   47G   1% /var/www

Add the new partition to /etc/fstab so that the partition mounts on boot.

echo "/dev/xvdb1              /var/www                ext3    defaults        1 3" >> /etc/fstab
cat /etc/fstab
LABEL=/                 /                       ext3    defaults        1 1
LABEL=/boot             /boot                   ext3    defaults        1 2
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
LABEL=SWAP-xvda2        swap                    swap    defaults        0 0
/dev/xvdb1              /var/www                ext3    defaults        1 3
Note: Now go back to an-node04.

Updating the VM Definition on Both Nodes

Now we want to ensure that the new VM configuration with the second hard drive persists over time. To do this, we will re-export the vm0001_c5_ws1 configuration file to /xen_shared/definitions/vm0001_c5_ws1.xml and redefine the VM on both nodes using this new configuration.

cp /xen_shared/definitions/vm0001_c5_ws1.xml /xen_shared/definitions/vm0001_c5_ws1.xml.old
virsh dumpxml vm0001_c5_ws1 > /xen_shared/definitions/vm0001_c5_ws1.xml
diff /xen_shared/definitions/vm0001_c5_ws1.xml /xen_shared/definitions/vm0001_c5_ws1.xml.old
1c1
< <domain type='xen' id='1'>
---
> <domain type='xen'>
9,12c9
<     <type>linux</type>
<     <kernel>/var/lib/xen/boot_kernel.zTzRQT</kernel>
<     <initrd>/var/lib/xen/boot_ramdisk.EbWDbG</initrd>
<     <cmdline>ro root=LABEL=/ crashkernel=auto</cmdline>
---
>     <type arch='x86_64' machine='xenpv'>linux</type>
24,28d20
<     <disk type='block' device='disk'>
<       <driver name='phy'/>
<       <source dev='/dev/drbd_an4_vg1/vm0001_xvdb'/>
<       <target dev='xvdb' bus='xen'/>
<     </disk>
33d24
<       <target dev='vif1.0'/>
39d29
<       <target dev='vif1.1'/>
41,42c31
<     <console type='pty' tty='/dev/pts/2'>
<       <source path='/dev/pts/2'/>
---
>     <console type='pty'>
46c35
<     <graphics type='vnc' port='5900' autoport='yes' keymap='en-us'/>
---
>     <graphics type='vnc' port='-1' autoport='yes' keymap='en-us'/>

There are various changes, but the only one we are really interested in is the new <source dev='/dev/drbd_an4_vg1/vm0001_xvdb'/> section.

Now, the last step is to update an-node05's definition for vm0001_c5_ws1.

On both nodes:

virsh define /xen_shared/definitions/vm0001_c5_ws1.xml
Domain vm0001_c5_ws1 defined from /xen_shared/definitions/vm0001_c5_ws1.xml

Remember to back up the new XML definition to each node's /root/ directory.

rsync -av /xen_shared/definitions ~/
...

That's it! We just effectively installed a new hard drive into our VM.


 
