Red Hat Cluster Service 2 Tutorial - Archive

From Alteeve Wiki
Revision as of 19:10, 20 March 2011 by Digimer (talk | contribs) (→‎Notes)


Overview

This paper has one goal;

  • Creating a 2-node, high-availability cluster hosting Xen virtual machines using RHCS "stable 2" using DRBD for synchronized storage.

Technologies We Will Use

  • Enterprise Linux 5; specifically we will be using CentOS v5.5.
  • Red Hat Cluster Services "Stable" version 2. This describes the following core components:
    • OpenAIS; Provides cluster communications using the totem protocol.
    • Cluster Manager (cman); Manages the starting and stopping of the cluster.
    • Resource Manager (rgmanager); Manages cluster resources and services. Handles service recovery during failures.
    • Cluster Logical Volume Manager (clvm); Cluster-aware (disk) volume manager. Backs GFS2 filesystems and Xen virtual machines.
    • Global File Systems version 2 (gfs2); Cluster-aware, concurrently mountable file system.
  • Distributed Replicated Block Device (DRBD); Keeps shared data synchronized across cluster nodes.
  • Xen; Hypervisor that controls and supports virtual machines.

A Note on Patience

There is nothing inherently hard about clustering. However, there are many components that you need to understand before you can begin. The result is that clustering has an inherently steep learning curve.

You must have patience. Lots of it.

Many technologies can be learned by creating a very simple base and then building on it. The classic "Hello, World!" script created when first learning a programming language is an example of this. Unfortunately, there is no real analog to this in clustering. Even the most basic cluster requires several pieces be in place and working together. If you try to rush by ignoring pieces you think are not important, you will almost certainly waste time. A good example is setting aside fencing, thinking that your test cluster's data isn't important. The cluster software has no concept of "test". It treats everything as critical all the time and will shut down if anything goes wrong.

Take your time, work through these steps, and you will have the foundation cluster sooner than you realize. Clustering is fun because it is a challenge.

Prerequisites

It is assumed that you are familiar with Linux systems administration, specifically Red Hat Enterprise Linux and its derivatives. You will need to have somewhat advanced networking experience as well. You should be comfortable working in a terminal (directly or over ssh). Familiarity with XML will help, but is not strictly required, as its use here is pretty self-evident.

If you feel a little out of depth at times, don't hesitate to set this tutorial aside. Branch over to the components you feel the need to study more, then return and continue on. Finally, and perhaps most importantly, you must have patience! If you have a manager asking you to "go live" with a cluster in a month, tell him or her that it simply won't happen. If you rush, you will skip important points and you will fail. Patience is vastly more important than any pre-existing skill.

Focus and Goal

There is a different cluster for every problem. Generally speaking though, there are two main problems that clusters try to resolve; Performance and High Availability. Performance clusters are generally tailored to the application requiring the performance increase. There are some general tools for performance clustering, like Red Hat's LVS (Linux Virtual Server) for load-balancing common applications like the Apache web-server.

This tutorial will focus on High Availability clustering, often shortened to simply HA and not to be confused with the Linux-HA "heartbeat" cluster suite, which we will not be using here. The cluster will provide a shared file system and will provide high availability for Xen-based virtual servers. The goal will be to have the virtual servers live-migrate during planned node outages and automatically restart on a surviving node when the original host node fails.

A very brief overview;

High Availability clusters like ours have two main parts; cluster management and resource management.

The cluster itself is responsible for maintaining the cluster nodes in a group. This group is part of a "Cluster Process Group", or CPG. When a node fails, the cluster manager must detect the failure, reliably eject the node from the cluster and reform the CPG. Each time the cluster changes, or "re-forms", the resource manager is called. The resource manager checks to see how the cluster changed, consults its configuration and determines what to do, if anything.

All of this will be discussed in detail a little later on. For now, it's sufficient to have in mind these two major roles and understand that they are somewhat independent entities.

Platform

This tutorial was written using CentOS version 5.5, x86_64. No attempt was made to test on i686 or other EL5 derivatives. That said, there is no reason to believe that this tutorial will not apply to any variant. As much as possible, the language will be distro-agnostic. For reasons of memory constraints, it is advised that you use an x86_64 (64-bit) platform if at all possible.

Do note that as of EL5.4 and above, significant changes were made to how RHCS is supported. It is strongly advised that you use at least version 5.4 or newer while working with this tutorial.

A Word On Complexity

Clustering is not inherently hard, but it is inherently complex. Consider;

  • Any given program has N bugs.
    • RHCS uses; cman, openais, totem, fenced, rgmanager, dlm, qdisk and GFS2,
    • We will be adding DRBD, CLVM and Xen.
    • Right there, we have N^11 possible bugs. We'll call this A.
  • A cluster has Y nodes.
    • In our case, 2 nodes, each with 3 networks.
    • The network infrastructure (Switches, routers, etc). If you use managed switches, add another layer of complexity.
    • This gives us another Y^(2*3), and then ^2 again for managed switches. We'll call this B.
  • Let's add the human factor. Let's say that a person needs roughly 5 years of cluster experience to be considered an expert. For each year of experience Z less than this, add an "oops" factor of (5-Z)^2. We'll call this C.
  • So, finally, add up the complexity, using this tutorial's layout, 0-years of experience and managed switches.
    • (N^11) * (Y^(2*3)^2) * ((5-0)^2) == (A * B * C) == an-unknown-but-big-number.

This isn't meant to scare you away, but it is meant to be a sobering statement. Obviously, those numbers are somewhat artificial, but the point remains.

Any one piece is easy to understand, thus, clustering is inherently easy. However, given the large number of variables, you must really understand all the pieces and how they work together. DO NOT think that you will have this mastered and working in a month. Certainly don't try to sell clusters as a service without a lot of internal testing.

Clustering is kind of like chess. The rules are pretty straightforward, but the complexity can take some time to master.

Overview of Components

When looking at a cluster, there is a tendency to want to dive right into the configuration file. That is not very useful in clustering.

  • When you look at the configuration file, it is quite short.

It isn't like most applications or technologies though. Most of us learn by taking something, like a configuration file, and tweaking it this way and that to see what happens. I tried that with clustering and learned only what it was like to bang my head against the wall.

  • Understanding the parts and how they work together is critical.

You will find that the discussion on the components of clustering, and how those components and concepts interact, will be much longer than the initial configuration. It is true that we could talk very briefly about the actual syntax, but it would be a disservice. Please, don't rush through the next section or, worse, skip it and go right to the configuration. You will waste far more time than you will save.

  • Clustering is easy, but it has a complex web of inter-connectivity. You must grasp this network if you want to be an effective cluster administrator!
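To show just how short the configuration really is, here is a bare skeleton of a cluster.conf file. This is a sketch only; the cluster name is an assumption of mine, and every element shown here will be explained in the sections that follow, so do not copy it yet.

```xml
<?xml version="1.0"?>
<!-- Skeleton only: a two-node RHCS "stable 2" configuration. -->
<cluster name="an-cluster" config_version="1">
        <!-- The two-node quorum exception, discussed below. -->
        <cman two_node="1" expected_votes="1"/>
        <clusternodes>
                <clusternode name="an-node01" nodeid="1">
                        <fence/>
                </clusternode>
                <clusternode name="an-node02" nodeid="2">
                        <fence/>
                </clusternode>
        </clusternodes>
        <!-- Fence devices and resources are filled in later. -->
        <fencedevices/>
        <rm/>
</cluster>
```

The point to take away is not the syntax, but how little of the cluster's behaviour is visible in the file itself; the real understanding lives in the components below.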

Component; cman

This was, traditionally, the cluster manager. In the 3.0 series, it acts mainly as a service manager, handling the starting and stopping of clustered services. In the 3.1 series, cman will be removed entirely.

Component; openais / corosync

OpenAIS is the heart of the cluster. All other components operate through it, and no cluster component can work without it. Further, it is shared between both Pacemaker and RHCS clusters.

In Red Hat clusters, openais is configured via the central cluster.conf file. In Pacemaker clusters, it is configured directly in openais.conf. As we will be building an RHCS cluster, we will only use cluster.conf. That said, (almost?) all openais.conf options are available in cluster.conf. This is important to note as you will see references to both configuration files when searching the Internet.

A Little History

There were significant changes between RHCS version 2, which we are using, and version 3 available on EL6 and recent Fedoras.

In RHCS version 2, there was a component called openais which handled totem. The OpenAIS project was designed to be the heart of the cluster and was based around the Service Availability Forum's Application Interface Specification. AIS is an open API designed to provide inter-operable high availability services.

In 2008, it was decided that the AIS specification was overkill for RHCS clustering. The core cluster engine was split out into the new, easier to maintain corosync project, and OpenAIS became a separate project acting as an optional add-on to corosync for users who want AIS functionality.

You will see a lot of references to OpenAIS while searching the web for information on clustering. Understanding its evolution will hopefully help you avoid confusion.

Concept; quorum

Quorum is defined as a collection of machines and devices in a cluster with a clear majority of votes.

The idea behind quorum is that whichever group of machines has it can safely start clustered services, even when defined members are not accessible.

Take this scenario;

  • You have a cluster of four nodes, each with one vote.
    • The cluster's expected_votes is 4. A clear majority, in this case, is 3 because (4/2)+1 = 3.
    • Now imagine that there is a failure in the network equipment and one of the nodes disconnects from the rest of the cluster.
    • You now have two partitions; One partition contains three machines and the other partition has one.
    • The three machines will have quorum, and the other machine will lose quorum.
    • The partition with quorum will reconfigure and continue to provide cluster services.
    • The partition without quorum will withdraw from the cluster and shut down all cluster services.

This behaviour acts as a guarantee that the two partitions will never try to access the same clustered resources, like a shared filesystem, thus guaranteeing the safety of those shared resources.

This also helps explain why an even 50% is not enough to have quorum, a common question for people new to clustering. Using the above scenario, imagine if the split were 2-nodes and 2-nodes. Because neither side can be sure what the other will do, neither can safely proceed. If we allowed an even 50% to have quorum, both partitions might try to take over the clustered services and disaster would soon follow.
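The majority arithmetic above can be sketched in a couple of lines of shell. The function name is mine, not a cluster tool; it simply mirrors the cluster's rule that quorum needs (expected_votes / 2) + 1 votes, with the division rounding down.

```shell
#!/bin/sh
# Quorum requires a clear majority of votes: (expected_votes / 2) + 1.
# Shell integer division rounds down, just as the cluster does.
quorum_needed() {
        echo $(( $1 / 2 + 1 ))
}

quorum_needed 4    # 4-node cluster: majority is 3
quorum_needed 7    # 7-node cluster: majority is 4
```

Note that for an even vote count like 4, the majority is 3, not 2; this is exactly why an even 50/50 split leaves both sides without quorum.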

There is one, and only one, exception to this rule.

In the case of a two node cluster, as we will be building here, any failure results in a 50/50 split. If we enforced quorum in a two-node cluster, there would never be high availability because any failure would cause both nodes to withdraw. The risk with this exception is that we now place the entire safety of the cluster on fencing, a concept we will cover in a moment. Fencing is a second line of defense and something we are loath to rely on alone.

Even in a two-node cluster though, proper quorum can be maintained by using a quorum disk, called a qdisk. This is another topic we will touch on in a moment. Sadly, as we will see, qdisk does not work well on DRBD, so this tutorial will have to live with the two_node exception.
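For reference, the two_node exception is enabled in cluster.conf with a single attribute pair on the cman element. When two_node is set, cman requires that expected_votes be set to 1 as well:

```xml
<!-- Allow a single node to retain quorum in a two-node cluster. -->
<cman two_node="1" expected_votes="1"/>
```

With this set, either surviving node keeps quorum after a failure, which is precisely why working fencing becomes the cluster's only safety net.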

Concept; Virtual Synchrony

All cluster operations have to occur in the same order across all nodes. This concept is called "virtual synchrony", and it is provided by openais using "closed process groups", CPG.

Let's look at how locks are handled on clustered file systems as an example.

  • As various nodes want to work on files, they send a lock request to the cluster. When they are done, they send a lock release to the cluster.
    • Lock and unlock messages must arrive in the same order to all nodes, regardless of the real chronological order that they were issued.
  • Let's say one node sends out messages "a1 a2 a3 a4". Meanwhile, the other node sends out "b1 b2 b3 b4".
    • All of these messages go to openais, which gathers them up and sorts them.
    • It is totally possible that openais will get the messages as "a2 b1 b2 a1 b3 a3 a4 b4".
    • The openais application will then ensure that all nodes get the messages in the above order, one at a time. All nodes must confirm that they got a given message before the next message is sent to any node.

This will tie into fencing and totem, as we'll see in the next sections.

Concept; Fencing

Fencing is an absolutely critical part of clustering. Without fully working fence devices, your cluster will fail.

Was that strong enough, or should I say that again? Let's be safe:

DO NOT BUILD A CLUSTER WITHOUT PROPER, WORKING AND TESTED FENCING.

Sorry, I promise that this will be the only time that I speak so strongly. Fencing really is critical, and explaining the need for fencing is nearly a weekly event. So then, let's discuss fencing.

When a node stops responding, an internal timeout and counter start ticking away. During this time, no messages are moving through the cluster and the cluster is, essentially, hung. If the node responds in time, the timeout and counter reset and the cluster begins operating properly again.

If, on the other hand, the node does not respond in time, the node will be declared dead. The cluster will take a "head count" to see which nodes it still has contact with, and will then determine if there are enough to have quorum. If so, the cluster will issue a "fence" against the silent node. This is a call to a program called fenced, the fence daemon.

The fence daemon will look at the cluster configuration and get the fence devices configured for the dead node. Then, one at a time and in the order that they appear in the configuration, the fence daemon will call those fence devices, via their fence agents, passing to the fence agent any configured arguments like username, password, port number and so on. If the first fence agent returns a failure, the next fence agent will be called. If the second fails, the third will be called, then the fourth and so on. Once the last (or perhaps only) fence device fails, the fence daemon will retry again, starting back at the start of the list. It will do this indefinitely until one of the fence devices succeeds.

Here's the flow, in point form:

  • The openais program collects messages and sends them off, one at a time, to all nodes.
  • All nodes respond, and the next message is sent. Repeat continuously during normal operation.
  • Suddenly, one node stops responding.
    • Communication freezes while the cluster waits for the silent node.
    • A timeout starts (300ms by default), and each time the timeout is hit, an error counter increments.
    • The silent node responds before the counter reaches the limit.
      • The counter is reset to 0
      • The cluster operates normally again.
  • Again, one node stops responding.
    • Again, the timeout begins and the error count increments each time the timeout is reached.
    • The error count exceeds the limit (10 is the default); three seconds have passed (300ms * 10).
    • The node is declared dead.
    • The cluster checks which members it still has, and if that provides enough votes for quorum.
      • If there are too few votes for quorum, the cluster software freezes and the node(s) withdraw from the cluster.
      • If there are enough votes for quorum, the silent node is declared dead.
        • openais calls fenced, telling it to fence the node.
        • Which fence device(s) to use, that is, what fence_agent to call and what arguments to pass, is gathered.
        • For each configured fence device:
          • The agent is called and fenced waits for the fence_agent to exit.
          • The fence_agent's exit code is examined. If it's a success, recovery starts. If it failed, the next configured fence agent is called.
        • If all (or the only) configured fence devices fail, fenced will start over.
        • fenced will wait and loop forever until a fence agent succeeds. During this time, the cluster is hung.
    • Once a fence_agent succeeds, the cluster is reconfigured.
      • A new closed process group (cpg) is formed.
      • A new fence domain is formed.
      • Lost cluster resources are recovered as per rgmanager's configuration (including file system recovery as needed).
      • Normal cluster operation is restored.

This skipped a few key things, but the general flow of logic should be there.

This is why fencing is so important. Without a properly configured and tested fence device or devices, the cluster will never successfully fence and the cluster will stay hung forever.
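To make this concrete, here is a sketch of how a fence device might be declared in cluster.conf, assuming an IPMI-based device. The device name, address and credentials are placeholders of mine, not values used elsewhere in this tutorial:

```xml
<!-- Inside <clusternodes>: tell fenced how to fence this node. -->
<clusternode name="an-node01" nodeid="1">
        <fence>
                <method name="ipmi">
                        <device name="ipmi_an01"/>
                </method>
        </fence>
</clusternode>

<!-- Define the device itself; fenced calls the fence_ipmilan agent
     with these arguments when a fence is issued. -->
<fencedevices>
        <fencedevice agent="fence_ipmilan" name="ipmi_an01"
                     ipaddr="192.168.3.61" login="admin" passwd="secret"/>
</fencedevices>
```

Once configured, a fence can be tested by hand with fence_node an-node01. Do not consider the cluster production-ready until that call reliably succeeds.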

Component; totem

The totem protocol defines message passing within the cluster and is used by openais. A token is passed around all the nodes in the cluster, and the timeout discussed in fencing above is actually a token timeout. The counter, then, is the number of lost tokens that are allowed before a node is considered dead.

The totem protocol supports something called 'rrp', Redundant Ring Protocol. Through rrp, you can add a second backup ring on a separate network to take over in the event of a failure in the first ring. In RHCS, these rings are known as "ring 0" and "ring 1".

Component; rgmanager

When the cluster configuration changes, openais calls rgmanager, the resource group manager. It will examine what changed and then will start, stop, migrate or recover cluster resources as needed.

Component; qdisk

If you have a cluster of 2 to 16 nodes, you can use a quorum disk. This is a small partition on a shared storage device that the cluster can use to make much better decisions about which nodes should have quorum when a split in the network happens.

Sadly, qdisk does not work well on DRBD, so we will not be using it in this tutorial. It is still worth knowing about though.

The way a qdisk works, at its most basic, is to hold one or more votes in quorum. Generally, but not necessarily always, the qdisk device has one vote less than the total number of nodes (N-1).

  • In a two node cluster, the qdisk would have one vote.
  • In a seven node cluster, the qdisk would have six votes.

Imagine these two scenarios; first without qdisk, then revisited to see how qdisk helps.

  • First Scenario; A two node cluster, which we will implement here.

If the network connection on the totem ring(s) breaks, you will enter into a dangerous state called a "split-brain". Normally, this can't happen because quorum can only be held by one side at a time. In a two_node cluster though, this is allowed.

Without a qdisk, either node could potentially start the cluster resources. This is a disastrous possibility and it is avoided by a fence duel. Both nodes will try to fence the other at the same time, but only the fastest one wins. The idea behind this is that one will always live because the other will die before it can get its fence call out. In theory, this works fine. In practice though, there are cases where fence calls can be "queued", in fact allowing both nodes to die. This defeats the whole "high availability" thing, now doesn't it? Also, this possibility is why the two_node option is the only exception to the quorum rules.

So how does a qdisk help?

Two ways!

First;

The biggest way it helps is by getting away from the two_node exception. With the qdisk partition, you are back up to three votes, so there will never be a 50/50 split. If either node retains access to the quorum disk while the other loses access, then right there things are decided. The one with the disk has 2 votes, wins quorum and will fence the other. Meanwhile, the other has only 1 vote, so it loses quorum, withdraws from the cluster and does not try to fence the other node.

Second;

You can use heuristics with qdisk to have a more intelligent partition recovery mechanism. For example, let's look again at the scenario where the link(s) carrying the totem ring between the two nodes are cut. This time though, let's assume that the storage network link is still up, so both nodes have access to the qdisk partition. How would the qdisk act as a tie breaker?

One way is to have a heuristic test that checks to see if a node has access to a particular router. With this heuristic test, if only one node had access to that router, the qdisk would give its vote to that node and ensure that the "healthiest" node survived. Pretty cool, eh?
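In cluster.conf, a qdisk and its heuristics are declared with a quorumd element. The following is a sketch only; the label and router address are assumptions of mine, and the single vote matches the N-1 rule for a two-node cluster:

```xml
<!-- One vote for the qdisk (N-1 in a two-node cluster). -->
<quorumd interval="1" tko="10" votes="1" label="an-qdisk">
        <!-- Award the heuristic's score only while the router answers. -->
        <heuristic program="ping -c1 192.168.1.254" score="1" interval="2"/>
</quorumd>
```

The node that passes the heuristic keeps the qdisk's vote, which is how the "healthiest" partition wins quorum.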

  • Second Scenario; A seven node cluster with six dead members.

Admittedly, this is an extreme scenario, but it serves to illustrate the point well. Remember how we said that the general rule is that the qdisk has N-1 votes?

With our seven node cluster, on its own, there would be a total of 7 votes, so normally quorum would require 4 nodes be alive (((7/2)+1) = (3.5+1) = 4.5, rounded down is 4). With the death of the fourth node, all cluster services would fail. We understand now why this would be the case, but what if the nodes are, for example, serving up websites? In this case, 3 nodes are still sufficient to do the job. Heck, even 1 node is better than nothing. With the rules of quorum though, it just wouldn't happen.

Let's now look at how the qdisk can help.

By giving the qdisk partition 6 votes, you raise the cluster's total expected votes from 7 to 13. With this new count, the number of votes needed for quorum is 7 (((13/2)+1) = (6.5+1) = 7.5, rounded down is 7).

So looking back at the scenario where we've lost four of our seven nodes; The surviving nodes have 3 votes, but they can talk to the qdisk which provides another 6 votes, for a total of 9. With that, quorum is achieved and the three nodes are allowed to form a cluster and continue to provide services. Even if you lose all but one node, you are still in business because the one surviving node, which is still able to talk to the qdisk and thus win its 6 votes, has a total of 7 and thus has quorum!

There is another benefit. As we mentioned in the first scenario, we can add heuristics to the qdisk. Imagine that, rather than having six nodes die, they instead partition off because of a break in the network. Without qdisk, the six nodes would easily win quorum, fence the one other node and then reform the cluster. What if, though, the one lone node was the only one with access to a critical route to the Internet? The six nodes would be useless in a web-server environment. With the heuristics provided by qdisk, that one useful node would get the qdisk's 6 votes and win quorum over the other six nodes!

A little qdisk goes a long way.

Component; DRBD

DRBD; the Distributed Replicated Block Device, is a technology that takes raw storage from two or more nodes and keeps their data synchronized in real time. It is sometimes described as "RAID 1 over Nodes", and that is conceptually accurate. In this tutorial's cluster, DRBD will be used to provide the back-end storage as a cost-effective alternative to a traditional SAN or iSCSI device.

To help visualize DRBD's use and role, look at the map of our cluster's storage.
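To give a feel for how DRBD learns about the two nodes, here is a sketch of a drbd.conf resource. The resource name, backing partition and port are assumptions of mine; the IP addresses follow the Storage Network subnet used later in this tutorial:

```text
# Sketch of a DRBD 8.3-style resource; adjust device names to your hardware.
resource r0 {
        protocol C;                        # synchronous replication
        on an-node01 {
                device    /dev/drbd0;      # the replicated device nodes see
                disk      /dev/sda5;       # local backing partition (assumed)
                address   192.168.2.71:7789;
                meta-disk internal;
        }
        on an-node02 {
                device    /dev/drbd0;
                disk      /dev/sda5;
                address   192.168.2.72:7789;
                meta-disk internal;
        }
}
```

Protocol C is what makes DRBD safe for a cluster: a write is not reported complete until it has been committed on both nodes.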

Component; CLVM

With DRBD providing the raw storage for the cluster, we must now create partitions. This is where Clustered LVM, known as CLVM, comes into play.

CLVM is ideal in that it is cluster-aware and therefore won't provide access to nodes outside of the formed cluster; that is, to any node that is not a member of openais's closed process group, which, in turn, requires quorum.

It is ideal because it can take one or more raw devices, known as "physical volumes", or simply as PVs, and combine their raw space into one or more "volume groups", known as VGs. These volume groups then act just like a typical hard drive and can be "partitioned" into one or more "logical volumes", known as LVs. These LVs are what will be formatted with a clustered file system.

LVM is particularly attractive because of how incredibly flexible it is. We can easily add new physical volumes later, and then grow an existing volume group to use the new space. This new space can then be given to existing logical volumes, or entirely new logical volumes can be created. This can all be done while the cluster is online offering an upgrade path with no down time.
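The PV -> VG -> LV flow described above boils down to a handful of commands. This is an illustrative sketch only; the volume group and logical volume names are mine, and /dev/drbd0 assumes DRBD is the backing device:

```text
pvcreate /dev/drbd0                      # mark the DRBD device as an LVM physical volume
vgcreate -c y an-vg0 /dev/drbd0          # create a cluster-aware ("-c y") volume group
lvcreate -L 20G -n shared_lv an-vg0      # carve out a 20 GiB logical volume
lvextend -L +10G /dev/an-vg0/shared_lv   # later, grow it while the cluster stays online
```

The "-c y" flag is what distinguishes a clustered volume group; clvmd then coordinates all metadata changes across the nodes.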

Component; GFS2

With DRBD providing the cluster's raw storage space, and Clustered LVM providing the logical partitions, we can now look at the clustered file system. This is the role of the Global File System version 2, known simply as GFS2.

It works much like a standard filesystem, with mkfs.gfs2, fsck.gfs2 and so on. The major difference is that it and clvmd use the cluster's distributed locking mechanism, provided by dlm_controld. Once formatted, the GFS2-formatted partition can be mounted and used by any node in the cluster's closed process group. All nodes can then safely read from and write to the data on the partition simultaneously.
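Formatting and mounting might look like the sketch below. The cluster name, filesystem name and mount point are assumptions of mine; the -t value must be <cluster_name>:<fs_name> and -j creates one journal per node that will mount the filesystem:

```text
mkfs.gfs2 -p lock_dlm -t an-cluster:shared -j 2 /dev/an-vg0/shared_lv
mount /dev/an-vg0/shared_lv /shared
```

Note the lock protocol, lock_dlm; this is the tie back to dlm_controld and is what allows all nodes to mount the partition at once.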

Component; DLM

One of the major roles of a cluster is to provide distributed locking on clustered storage. In fact, storage software can not be clustered without using DLM, as provided by the dlm_controld daemon, using openais's virtual synchrony.

Through DLM, all nodes accessing clustered storage are guaranteed to get POSIX locks, called plocks, in the same order across all nodes. Both CLVM and GFS2 rely on DLM, though other clustered storage, like OCFS2, use it as well.

Component; Xen

There are two major open-source virtualization platforms available in the Linux world today; Xen and KVM. The former is maintained by Citrix and the latter by Red Hat. It would be difficult to say which is "better", as they're both very good. Xen can be argued to be more mature, while KVM is the "official" solution supported by Red Hat directly.

We will be using the Xen hypervisor and a "host" virtual server called dom0. In Xen, every machine is a virtual server, including the system you installed when you built the server. This is possible thanks to a small Xen micro-operating system that initially boots, then starts up your original installed operating system as a virtual server with special access to the underlying hardware and hypervisor management tools.

The rest of the virtual servers in a Xen environment are collectively called "domU" virtual servers. These will be the highly-available resource that will migrate between nodes during failure events.
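For a planned outage, a domU live migration is a single command from dom0. A sketch only; the domain name here is a placeholder, not a machine built in this tutorial:

```text
xm list                               # show the domains running on this node
xm migrate --live vm0001 an-node02    # push vm0001 to the other node while it keeps running
```

This is the manual form of what rgmanager will do for us automatically during planned node outages.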

Base Setup

Before we can look at the cluster, we must first build two cluster nodes and then install the operating system.

Hardware Requirements

The bare minimum requirements are;

  • All hardware must be supported by EL5. It is strongly recommended that you check compatibility before making any purchases.
  • A dual-core CPU with hardware virtualization support.
  • Three network cards; At least one should be gigabit or faster.
  • One hard drive.
  • 2 GiB of RAM
  • A fence device. This can be an IPMI-enabled server, a Node Assassin, a switched PDU or similar.

This tutorial was written using the following hardware:

This is not an endorsement of the above hardware. I put a heavy emphasis on minimizing power consumption and bought what was within my budget. This hardware was never meant to be put into production, but instead was chosen to serve the purpose of my own study and for creating this tutorial. What you ultimately choose to use, provided it meets the minimum requirements, is entirely up to you and your requirements.

Note: I use three physical NICs, but you can get away with two by merging the storage and back-channel networks, which we will discuss shortly. If you are really in a pinch, you could create three aliases on one interface and isolate them using VLANs. If you go this route, please ensure that your VLANs are configured and working before beginning this tutorial. Pay close attention to multicast traffic.

Pre-Assembly

Before you assemble your nodes, take a moment to record the MAC addresses of each network interface and then note where each interface is physically installed. This will help you later when configuring the networks. I generally create a simple text file with the MAC addresses, the interface I intend to assign to it and where it physically is located.

-=] an-node01
48:5B:39:3C:53:15   # eth0 - onboard interface
00:1B:21:72:96:E8   # eth1 - right-most PCIe interface
00:1B:21:72:9B:56   # eth2 - left-most PCI interface

-=] an-node02
48:5B:39:3C:53:14   # eth0 - onboard interface
00:1B:21:72:9B:5A   # eth1 - right-most PCIe interface
00:1B:21:72:96:EA   # eth2 - left-most PCI interface

OS Install

Later steps will include packages to install, so the initial OS install can be minimal. I like to change the default run-level to 3, remove rhgb quiet from the grub menu, disable the firewall and disable SELinux. In a production cluster, you will want to use the firewall and SELinux, but until you finish studying, leave them off to keep things simple.

  • Note: Before EL5.4, you could not use SELinux. It is now possible to use it, and it is recommended that you do so in any production cluster.
  • Note: Ports and protocols to open in a firewall will be discussed later in the networking section.

I like to minimize and automate my installs as much as possible. To that end, I run a little PXE server on my network and use a kickstart script to automate the install. Here is a simple one for use on a single-drive node:
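The original kickstart script is not reproduced in this archived copy. As a stand-in, here is a minimal sketch of what a single-drive EL5 kickstart might look like; the URL, password, timezone and partition sizes are all placeholders, not the author's values:

```text
# Minimal single-drive EL5 kickstart sketch (placeholders throughout).
install
url --url http://192.168.1.254/centos/5/os/x86_64
lang en_US.UTF-8
keyboard us
network --device eth0 --bootproto dhcp
rootpw changeme
firewall --disabled
selinux --disabled
authconfig --enableshadow --enablemd5
timezone America/Toronto
bootloader --location=mbr
clearpart --all --drives=sda --initlabel
part /boot --fstype ext3 --size=100 --ondisk=sda
part swap --size=2048 --ondisk=sda
part /    --fstype ext3 --size=1 --grow --ondisk=sda
reboot

%packages
@core
```

Keep the %packages section as lean as possible; the cluster packages are installed explicitly in later steps.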

If you decide to manually install EL5 on your nodes, please try to keep the installation as small as possible. The fewer packages installed, the fewer sources of problems and vectors for attack.

Post Install OS Changes

This section discusses changes that I recommend, but that are not required.

Network Planning

The most important change that is recommended is to get your nodes into a consistent networking configuration. This will prove very handy when trying to keep track of your networks and where they're physically connected. This becomes exponentially more helpful as your cluster grows.

The first step is to understand the three networks we will be creating. Once you understand their role, you will need to decide which interface on the nodes will be used for each network.

Cluster Networks

The three networks are;

Network Acronym Use
Back-Channel Network BCN Private cluster communications, virtual machine migrations, fence devices
Storage Network SN Used exclusively for storage communications. Possible to use as totem's redundant ring.
Internet-Facing Network IFN Internet-polluted network. No cluster or storage communication or devices.

Things To Consider

When planning which interfaces to connect to each network, consider the following, in order of importance:

  • If your nodes have IPMI and an interface sharing a physical RJ-45 connector, this must be on the Back-Channel Network. The reasoning is that having your fence device accessible on the Internet-Facing Network poses a major security risk. Having the IPMI interface on the Storage Network can cause problems if a fence is fired and the network is saturated with storage traffic.
  • The lowest-latency network interface should be used as the Back-Channel Network. The cluster is maintained by multicast messaging between the nodes using something called the totem protocol. Any delay in the delivery of these messages risks causing the failure and ejection of affected nodes when no actual failure existed. This will be discussed in greater detail later.
  • The network with the most raw bandwidth should be used for the Storage Network. All disk writes must be sent across the network and committed to the remote nodes before the write is declared complete. This makes the network the disk I/O bottleneck. Using a network with jumbo frames and high raw throughput will help minimize this bottleneck.
  • During the live migration of virtual machines, the VM's RAM is copied to the other node using the BCN. For this reason, the second fastest network should be used for back-channel communication. However, these copies can saturate the network, so care must be taken to ensure that cluster communications get higher priority. This can be done using a managed switch. If you can not ensure priority for totem multicast, then be sure to configure Xen later to use the storage network for migrations.
  • The remaining, slowest interface should be used for the IFN.

Planning the Networks

This paper will use the following setup. Feel free to alter the interface-to-network mapping and the IP subnets used to best suit your needs. For reasons completely my own, I like to start my cluster IPs' final octet at 71 for node 1 and then increment up from there. This is entirely arbitrary, so please use whatever makes sense to you. The remainder of this tutorial will follow the convention below:

Network Interface Subnet
IFN eth0 192.168.1.0/24
SN eth1 192.168.2.0/24
BCN eth2 192.168.3.0/24

This translates to the following per-node configuration:

an-node01:
Interface IP Address Host Name(s)
IFN (eth0) 192.168.1.71 an-node01.ifn
SN (eth1) 192.168.2.71 an-node01.sn
BCN (eth2) 192.168.3.71 an-node01, an-node01.alteeve.com, an-node01.bcn

an-node02:
Interface IP Address Host Name(s)
IFN (eth0) 192.168.1.72 an-node02.ifn
SN (eth1) 192.168.2.72 an-node02.sn
BCN (eth2) 192.168.3.72 an-node02, an-node02.alteeve.com, an-node02.bcn

Network Configuration

Now that we've planned the network, it is time to implement it.

Warning About Managed Switches

WARNING: Please pay attention to this warning! The vast majority of cluster problems end up being network related. The hardest ones to diagnose are usually multicast issues.

If you use a managed switch, be careful about enabling and configuring Multicast IGMP Snooping or Spanning Tree Protocol. They have been known to cause problems by not allowing multicast packets to reach all nodes fast enough or at all. This can cause somewhat random break-downs in communication between your nodes, leading to seemingly random fences and DLM lock timeouts. If your switches support PIM Routing, be sure to use it!

If you have problems with your cluster not forming, or seemingly random fencing, try using a cheap unmanaged switch. If the problem goes away, you are most likely dealing with a managed switch configuration problem.

Disable Firewalling

To "keep things simple", we will disable all firewalling on the cluster nodes. This is not recommended in production environments, obviously, so below will be a table of ports and protocols to open when you do get into production. Until then, we will simply use chkconfig to disable iptables and ip6tables.

Note: Cluster 2 does not support IPv6, so you can skip or ignore it if you wish. I like to disable it just to be certain that it can't cause issues though.

chkconfig iptables off
chkconfig ip6tables off

Now confirm that they are off by having iptables and ip6tables list their rules.

iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
ip6tables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

When you do prepare to go into production, these are the protocols and ports you need to open between cluster nodes. Remember to allow multicast communications as well!

Port Protocol Component
5404, 5405 UDP cman
8084 TCP luci
11111 TCP ricci
14567 TCP gnbd
16851 TCP modclusterd
21064 TCP dlm
50006, 50008, 50009 TCP ccsd
50007 UDP ccsd
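When that time comes, rules along the following lines could be used to open them. This is only a sketch: it assumes the 192.168.3.0/24 BCN subnet used in this tutorial, covers only the components this paper uses (add luci and gnbd lines if you run them), and must be adapted if totem's redundant ring runs on the SN.

```shell
# Sketch: allow cluster traffic from the BCN subnet (adjust to match yours).
iptables -A INPUT -s 192.168.3.0/24 -p udp -m multiport --dports 5404,5405 -j ACCEPT  # cman (totem)
iptables -A INPUT -s 192.168.3.0/24 -p tcp --dport 11111 -j ACCEPT                    # ricci
iptables -A INPUT -s 192.168.3.0/24 -p tcp --dport 16851 -j ACCEPT                    # modclusterd
iptables -A INPUT -s 192.168.3.0/24 -p tcp --dport 21064 -j ACCEPT                    # dlm
iptables -A INPUT -s 192.168.3.0/24 -p tcp -m multiport --dports 50006,50008,50009 -j ACCEPT  # ccsd
iptables -A INPUT -s 192.168.3.0/24 -p udp --dport 50007 -j ACCEPT                    # ccsd
# Totem is multicast, so IGMP and the multicast destination range must pass too.
iptables -A INPUT -p igmp -j ACCEPT
iptables -A INPUT -d 224.0.0.0/4 -j ACCEPT
service iptables save
```

Test any rule set thoroughly on both nodes before relying on it; a firewall that silently drops totem traffic produces exactly the "random fencing" symptoms described above.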

Disable NetworkManager, Enable network

The NetworkManager daemon is excellent in environments where a system connects to a variety of networks. It changes the networking configuration whenever it senses a change in network state, such as when a cable is unplugged or a wireless network comes or goes. As useful as this is on laptops and workstations, it can be detrimental in a cluster.

To prevent the networking from changing once we've set it up, we want to replace the NetworkManager daemon with the network initialization script. The network script will start and stop networking, but otherwise it will leave the configuration alone. This is ideal in servers, and doubly so in clusters, given their sensitivity to transient network issues.

Start by removing NetworkManager:

yum remove NetworkManager NetworkManager-glib NetworkManager-gnome NetworkManager-devel NetworkManager-glib-devel

Now you want to ensure that network starts with the system.

chkconfig network on

Setup /etc/hosts

The /etc/hosts file, by default, resolves the hostname to the lo (127.0.0.1) interface. The cluster, however, uses this name to determine which interface to use for the totem protocol (and thus all cluster communication). To this end, we will remove the hostname from 127.0.0.1 and instead assign it to the IP of our BCN-connected interface. At the same time, we will add entries for all networks on each node in the cluster, plus entries for the fence devices. Once done, the edited /etc/hosts file should be suitable for copying to all nodes in the cluster.

vim /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1	localhost.localdomain localhost
::1		localhost6.localdomain6 localhost6

192.168.1.71	an-node01.ifn
192.168.2.71	an-node01.sn
192.168.3.71	an-node01 an-node01.bcn an-node01.alteeve.com

192.168.1.72	an-node02.ifn
192.168.2.72	an-node02.sn
192.168.3.72	an-node02 an-node02.bcn an-node02.alteeve.com

192.168.3.61	batou.alteeve.com	# Node Assassin
192.168.3.62	motoko.alteeve.com	# Switched PDU

Mapping Interfaces to ethX Names

Chances are good that the assignment of ethX interface names to your physical network cards is not ideal. There is no strict technical reason to change the mapping, but it will make your life a lot easier if all nodes use the same ethX names for the same subnets.

The actual process of changing the mapping is a little involved. For this reason, there is a dedicated mini-tutorial which you can find below. Please jump to it and then return once your mapping is as you like it.

Set IP Addresses

The last step in setting up the network interfaces is to manually assign the IP addresses and define the subnets for the interfaces. This involves directly editing the /etc/sysconfig/network-scripts/ifcfg-ethX files. There are a large set of options that can be set in these configuration files, but most are outside the scope of this tutorial. To get a better understanding of the available options, please see:

Here are my three configuration files, which you can use as guides. Please do not copy these over your files! Doing so will cause your interfaces to fail outright, as every interface's MAC address is unique. Adapt these to suit your needs.

vim /etc/sysconfig/network-scripts/ifcfg-eth0
# Internet-Facing Network
HWADDR=48:5B:39:3C:53:15
DEVICE=eth0
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.1.71
NETMASK=255.255.255.0
GATEWAY=192.168.1.254
vim /etc/sysconfig/network-scripts/ifcfg-eth1
# Storage Network
HWADDR=00:1B:21:72:96:E8
DEVICE=eth1
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.2.71
NETMASK=255.255.255.0
vim /etc/sysconfig/network-scripts/ifcfg-eth2
# Back Channel Network
HWADDR=00:1B:21:72:9B:56
DEVICE=eth2
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.3.71
NETMASK=255.255.255.0

You will also need to set up the /etc/resolv.conf file for DNS resolution. You can learn more about this file's purpose by reading its man page; man resolv.conf. The main thing is to set valid DNS server IP addresses in the nameserver sections. Here is mine, for reference:

vim /etc/resolv.conf
search alteeve.com
nameserver 192.139.81.117
nameserver 192.139.81.1

Finally, restart network and your interfaces should be set up properly.

/etc/init.d/network restart
Shutting down interface eth0:                              [  OK  ]
Shutting down interface eth1:                              [  OK  ]
Shutting down interface eth2:                              [  OK  ]
Shutting down loopback interface:                          [  OK  ]
Bringing up loopback interface:                            [  OK  ]
Bringing up interface eth0:                                [  OK  ]
Bringing up interface eth1:                                [  OK  ]
Bringing up interface eth2:                                [  OK  ]

You can verify your configuration using the ifconfig tool.

ifconfig
eth0      Link encap:Ethernet  HWaddr 48:5B:39:3C:53:15  
          inet addr:192.168.1.71  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::92e6:baff:fe71:82ea/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1727 errors:0 dropped:0 overruns:0 frame:0
          TX packets:655 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:208916 (204.0 KiB)  TX bytes:133171 (130.0 KiB)
          Interrupt:252 Base address:0x2000 

eth1      Link encap:Ethernet  HWaddr 00:1B:21:72:96:E8  
          inet addr:192.168.2.71  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::221:91ff:fe19:9653/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:998 errors:0 dropped:0 overruns:0 frame:0
          TX packets:47 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:97702 (95.4 KiB)  TX bytes:6959 (6.7 KiB)
          Interrupt:16 

eth2      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56  
          inet addr:192.168.3.71  Bcast:192.168.3.255  Mask:255.255.255.0
          inet6 addr: fe80::20e:cff:fe59:46e4/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5241 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4439 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1714026 (1.6 MiB)  TX bytes:1624392 (1.5 MiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:42 errors:0 dropped:0 overruns:0 frame:0
          TX packets:42 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:6449 (6.2 KiB)  TX bytes:6449 (6.2 KiB)

Setting up SSH

Setting up SSH shared keys will allow your nodes to pass files between one another and execute commands remotely without needing to enter a password. This will be needed later when we want to enable applications like libvirtd and virt-manager.

SSH is, on its own, a very big topic. If you are not familiar with SSH, please take some time to learn about it before proceeding. A great first step is the Wikipedia entry on SSH, as well as the SSH man page; man ssh.

It can be a bit confusing to keep SSH connections straight in your head. When you connect to a remote machine, you start the connection on your machine as the user you are logged in as; this is the source user. When you call the remote machine, you tell it which user you want to log in as; this is the remote user.

You will need to create an SSH key for each source user on each node, and then you will need to copy the newly generated public key to each remote machine's user directory that you want to connect to. In this example, we want to connect to either node, from either node, as the root user. So we will create a key for each node's root user and then copy the generated public key to the other node's root user's directory.

For each user, on each machine you want to connect from, run:

# The '2047' is just to screw with brute-forcers a bit. :)
ssh-keygen -t rsa -N "" -b 2047 -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
Created directory '/root/.ssh'.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
a1:65:a9:50:bb:15:ae:b1:6e:06:12:4a:29:d1:68:f3 root@an-node01.alteeve.com

This will create two files: the private key, ~/.ssh/id_rsa, and the public key, ~/.ssh/id_rsa.pub. The private key must never be group or world readable! That is, it should be set to mode 0600.
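If the key or its directory ever ends up with looser permissions, ssh will silently ignore it. A quick way to enforce the expected modes (a sketch, assuming the default ~/.ssh paths):

```shell
# 0700 on the directory, 0600 on the private key; ssh refuses anything looser.
chmod 700 ~/.ssh
chmod 600 ~/.ssh/id_rsa
ls -l ~/.ssh/id_rsa
```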

The two files should look like:

Private key:

cat ~/.ssh/id_rsa
-----BEGIN RSA PRIVATE KEY-----
MIIEnwIBAAKCAQBTNg6FZyDKm4GAm7c+F2enpLWy+t8ZZjm4Z3Q7EhX09ukqk/Qm
MqprtI9OsiRVjce+wGx4nZ8+Z0NHduCVuwAxG0XG7FpKkUJC3Qb8KhyeIpKEcfYA
tsDUFnWddVF8Tsz6dDOhb61tAke77d9E01NfyHp88QBxjJ7w+ZgB2eLPBFm6j1t+
K50JHwdcFfxrZFywKnAQIdH0NCs8VaW91fQZBupg4OGOMpSBnVzoaz2ybI9bQtbZ
4GwhCghzKx7Qjz20WiqhfPMfFqAZJwn0WXfjALoioMDWavTbx+J2HM8KJ8/YkSSK
dDEgZCItg0Q2fC35TDX+aJGu3xNfoaAe3lL1AgEjAoIBABVlq/Zq+c2y9Wo2q3Zd
yjJsLrj+rmWd8ZXRdajKIuc4LVQXaqq8kjjz6lYQjQAOg9H291I3KPLKGJ1ZFS3R
AAygnOoCQxp9H6rLHw2kbcJDZ4Eknlf0eroxqTceKuVzWUe3ev2gX8uS3z70BjZE
+C6SoydxK//w9aut5UJN+H5f42p95IsUIs0oy3/3KGPHYrC2Zgc2TIhe25huie/O
psKhHATBzf+M7tHLGia3q682JqxXru8zhtPOpEAmU4XDtNdL+Bjv+/Q2HMRstJXe
2PU3IpVBkirEIE5HlyOV1T802KRsSBelxPV5Y6y5TRq+cEwn0G2le1GiFBjd0xQd
0csCgYEA2BWkxSXhqmeb8dzcZnnuBZbpebuPYeMtWK/MMLxvJ50UCUfVZmA+yUUX
K9fAUvkMLd7V8/MP7GrdmYq2XiLv6IZPUwyS8yboovwWMb+72vb5QSnN6LAfpUEk
NRd5JkWgqRstGaUzxeCRfwfIHuAHikP2KeiLM4TfBkXzhm+VWjECgYBilQEBHvuk
LlY2/1v43zYQMSZNHBSbxc7R5mnOXNFgapzJeFKvaJbVKRsEQTX5uqo83jRXC7LI
t14pC23tpW1dBTi9bNLzQnf/BL9vQx6KFfgrXwy8KqXuajfv1ECH6ytqdttkUGZt
TE/monjAmR5EVElvwMubCPuGDk9zC7iQBQKBgG8hEukMKunsJFCANtWdyt5NnKUB
X66vWSZLyBkQc635Av11Zm8qLusq2Ld2RacDvR7noTuhkykhBEBV92Oc8Gj0ndLw
hhamS8GI9Xirv7JwYu5QA377ff03cbTngCJPsbYN+e/uj6eYEE/1X5rZnXpO1l6y
G7QYcrLE46Q5YsCrAoGAL+H5LG4idFEFTem+9Tk3hDUhO2VpGHYFXqMdctygNiUn
lQ6Oj7Z1JbThPJSz0RGF4wzXl/5eJvn6iPbsQDpoUcC1KM51FxGn/4X2lSCZzgqr
vUtslejUQJn96YRZ254cZulF/YYjHyUQ3byhDRcr9U2CwUBi5OcbFTomlvcQgHcC
gYEAtIpaEWt+Akz9GDJpKM7Ojpk8wTtlz2a+S5fx3WH/IVURoAzZiXzvonVIclrH
5RXFiwfoXlMzIulZcrBJZfTgRO9A2v9rE/ZRm6qaDrGe9RcYfCtxGGyptMKLdbwP
UW1emRl5celU9ZEZRBpIVTES5ZVWqD2RkkkNNJbPf5F/x+w=
-----END RSA PRIVATE KEY-----

Public key:

cat ~/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQBTNg6FZyDKm4GAm7c+F2enpLWy+t8ZZjm4Z3Q7EhX09ukqk/QmMqprtI9OsiRVjce+wGx4nZ8+Z0NHduCVuwAxG0XG7FpKkUJC3Qb8KhyeIpKEcfYAtsDUFnWddVF8Tsz6dDOhb61tAke77d9E01NfyHp88QBxjJ7w+ZgB2eLPBFm6j1t+K50JHwdcFfxrZFywKnAQIdH0NCs8VaW91fQZBupg4OGOMpSBnVzoaz2ybI9bQtbZ4GwhCghzKx7Qjz20WiqhfPMfFqAZJwn0WXfjALoioMDWavTbx+J2HM8KJ8/YkSSKdDEgZCItg0Q2fC35TDX+aJGu3xNfoaAe3lL1 root@an-node01.alteeve.com

Copy the public key and then ssh normally into the remote machine as the root user. Create a file called ~/.ssh/authorized_keys and paste in the key.

From an-node01, type:

ssh root@an-node02
The authenticity of host 'an-node02 (192.168.3.72)' can't be established.
RSA key fingerprint is 55:58:c3:32:e4:e6:5e:32:c1:db:5c:f1:36:e2:da:4b.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'an-node02,192.168.3.72' (RSA) to the list of known hosts.
Last login: Fri Mar 11 20:45:58 2011 from 192.168.1.202

You will now be logged into an-node02 as the root user. Create the ~/.ssh/authorized_keys file and paste into it the public key from an-node01. If the remote machine's user hasn't used ssh yet, their ~/.ssh directory will not exist, so create it first with mode 0700.

cat ~/.ssh/authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQBTNg6FZyDKm4GAm7c+F2enpLWy+t8ZZjm4Z3Q7EhX09ukqk/QmMqprtI9OsiRVjce+wGx4nZ8+Z0NHduCVuwAxG0XG7FpKkUJC3Qb8KhyeIpKEcfYAtsDUFnWddVF8Tsz6dDOhb61tAke77d9E01NfyHp88QBxjJ7w+ZgB2eLPBFm6j1t+K50JHwdcFfxrZFywKnAQIdH0NCs8VaW91fQZBupg4OGOMpSBnVzoaz2ybI9bQtbZ4GwhCghzKx7Qjz20WiqhfPMfFqAZJwn0WXfjALoioMDWavTbx+J2HM8KJ8/YkSSKdDEgZCItg0Q2fC35TDX+aJGu3xNfoaAe3lL1 root@an-node01.alteeve.com

Now log out and then log back into the remote machine. This time, the connection should succeed without having entered a password!
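As an aside, the manual copy-and-paste above can usually be automated with the ssh-copy-id helper, which ships with the openssh-clients package. It appends the public key to the remote user's authorized_keys and sets sane permissions on the remote ~/.ssh directory. A sketch:

```shell
# From an-node01, push root's public key to an-node02 (password prompted once).
ssh-copy-id -i ~/.ssh/id_rsa.pub root@an-node02
```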

Various applications will connect to the other node using different methods and networks. Each connection, when first established, will prompt you to confirm that you trust the authentication, as we saw above. Many programs can't handle this prompt and will simply fail to connect. To get around this, I ssh into both nodes using all of their hostnames. This populates a file called ~/.ssh/known_hosts. Once you have done this on one node, you can simply copy the known_hosts file to the other nodes' and users' ~/.ssh/ directories.

I simply paste this into a terminal, answering yes and then immediately exiting from each ssh session. This is a bit tedious, I admit. Take the time to check the fingerprints as they are displayed to you; it is a bad habit to blindly type yes.

Alter this to suit your host names.

ssh root@an-node01 && \
ssh root@an-node01.alteeve.com && \
ssh root@an-node01.bcn && \
ssh root@an-node01.sn && \
ssh root@an-node01.ifn && \
ssh root@an-node02 && \
ssh root@an-node02.alteeve.com && \
ssh root@an-node02.bcn && \
ssh root@an-node02.sn && \
ssh root@an-node02.ifn
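If you'd rather not maintain that list by hand, the same set of targets can be generated with a small loop. This is a sketch assuming the an-node01/an-node02 names used in this tutorial; it only prints the commands, so you can run them one at a time and check each fingerprint.

```shell
# Print the ssh command for every hostname of both nodes (10 in total).
for node in an-node01 an-node02; do
    for suffix in "" ".alteeve.com" ".bcn" ".sn" ".ifn"; do
        echo "ssh root@${node}${suffix}"
    done
done
```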

Altering Boot Up

Note: These steps are optional.

There are two changes I like to make on my nodes. They are not required, but I find they help keep things as simple as possible, particularly in the earlier learning and testing stages.

Changing the Default Run-Level

If you choose not to implement this change, please change any references to /etc/rc3.d to /etc/rc5.d later in this tutorial.

I prefer to minimize the running daemons and applications on my nodes for two reasons: performance and security. One of the simplest ways to minimize the number of running programs is to change the run-level to 3 by editing /etc/inittab. This tells the node not to start the graphical interface at boot and instead boot straight to a bash shell.

This change is quite simple. Just edit /etc/inittab and change the line id:5:initdefault: to id:3:initdefault:.

vim /etc/inittab
# Default runlevel. The runlevels used by RHS are:
#   0 - halt (Do NOT set initdefault to this)
#   1 - Single user mode
#   2 - Multiuser, without NFS (The same as 3, if you do not have networking)
#   3 - Full multiuser mode
#   4 - unused
#   5 - X11
#   6 - reboot (Do NOT set initdefault to this)
# 
id:3:initdefault:

If you are still in a graphical environment and want to disable the GUI without rebooting, you can run init 3. Conversely, if you want to start the GUI for a certain task, you can do so by running init 5.

Making Boot Messages Visible

Another optional step, in-line with the change above, is to disable the rhgb (Red Hat Graphical Boot) and quiet kernel arguments. These options provide the clean boot screen you normally see with EL5, but they also hide a lot of boot messages that we may find helpful.

To make this change, edit the grub boot-loader menu and remove the rhgb quiet arguments from the kernel /vmlinuz... line. These arguments are usually the last ones on the line. If you leave this until later, you may see two or more kernel entries; delete these arguments wherever they are found.

vim /boot/grub/menu.lst

Change:

title CentOS (2.6.18-194.32.1.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-194.32.1.el5 ro root=LABEL=/ rhgb quiet
        initrd /initrd-2.6.18-194.32.1.el5.img

To:

title CentOS (2.6.18-194.32.1.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-194.32.1.el5 ro root=LABEL=/
        initrd /initrd-2.6.18-194.32.1.el5.img
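If you prefer a non-interactive edit, the same change can be made with sed. This is a sketch; it assumes the arguments appear literally as " rhgb quiet" on each kernel line, so back up the file first and inspect the result.

```shell
# Strip ' rhgb quiet' from every kernel line in the grub menu, then verify.
cp /boot/grub/menu.lst /boot/grub/menu.lst.bak
sed -i 's/ rhgb quiet//' /boot/grub/menu.lst
grep 'kernel /vmlinuz' /boot/grub/menu.lst
```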

There is nothing more to do now. Future reboots will be simple terminal displays.

Setting Up Xen

It may seem premature to discuss Xen before the cluster itself. The reason we need to look at it now is that Xen makes some fairly significant changes to the networking. Given how changes to networking can affect the cluster, we want to get these changes out of the way first.

We're not going to provision any virtual machines until the cluster is built.

A Brief Overview

Xen is a hypervisor that converts the installed operating system into a virtual machine running on a small Xen kernel. This same small kernel also runs all of the virtual machines you will add later. In this way, you will always be working in a virtual machine once you switch to booting a Xen kernel. In Xen terminology, virtual machines are known as domains.

The "host" operating system is known as dom0 (domain 0) and has a special view of the hardware plus contains the configuration and control of Xen itself. All other Xen virtual machines are known as domU (domain U). This is a collective term that represents the transient ID number assigned to all virtual machines. For example, when you boot the first virtual machine, it is known as dom1. The next will be dom2, then dom3 and so on. Do note that if a domU shuts down, its ID is not reused. So when it restarts, it will use the next free ID (ie: dom4 in this list, despite it having been, say, dom1 initially).

This makes Xen somewhat unique in the virtualization world. Most others do not touch or alter the "host" OS, instead running the guest VMs fully within the context of the host operating system.

Install Xen

In EL5, all the software we need to run Xen VMs and the Xen hypervisor are available from the stock repositories. These are admittedly older, but they are supported, so they will be used in this tutorial. If you wish to use newer versions, by all means do so, but that falls outside the scope of this paper. I would suggest that you work with the stock versions until you are comfortable with Xen and the Cluster suite before trying to upgrade to the newer versions.

Some of the packages listed below will be used later. To install the packages, run:

yum install kernel-xen kmod-drbd83-xen xen xen-libs libvirt virt-manager virt-viewer

Understanding Networking in Xen

Xen uses a fairly complex networking system. This is, perhaps, its strongest point. The trade-off, though, is that it can be a little tricky to wrap your head around. To help you become familiar, there is a short tutorial dedicated to this topic. Please read it over before proceeding if you are not familiar with Xen's networking.

Taking the time to read and understand the mini-paper below will save you a lot of heartache in the following stages.

Making Network Interfaces Available To Xen Clients

As discussed above, Xen makes some significant changes to the dom0 network, which happens to be where the cluster will operate. These changes include shutting down and moving around the interfaces. As we will discuss later, this behaviour can trigger cluster failures. This is the main reason for dealing with Xen now. Once the changes are in place, the network is stable and safe for running the cluster on.

A Brief Overview

By default, Xen only makes eth0 available to the virtual machines. We will want to add eth2 as well, as we will use the Back Channel Network for inter-VM communication. We do not want to add the Storage Network to Xen though! Doing so puts the DRBD link at risk. Should xend get shut down, it could trigger a split-brain in DRBD.

What Xen does, in brief, is move the "real" eth0 over to a new device called peth0. Then it creates a virtual "clone" of the network interface called eth0. Next, Xen creates a bridge called xenbr0. Finally, both the real peth0 and the new virtual eth0 are connected to the xenbr0 bridge.

The reasoning behind all this is to separate the traffic coming to and from dom0 from any traffic going to the various domUs. Think of it sort of like the bridge being a network switch, peth0 being an uplink cable to the outside world and the virtual eth0 being dom0's "port" on the switch. We want the same done to the interface on the Back-Channel Network, too. The Storage Network will never be exposed to the domU machines, so, combined with the risk to the underlying storage, there is no reason to add eth1 to Xen's control.

Disable the 'qemu' Bridge

By default, libvirtd creates a bridge called virbr0 designed to connect virtual machines to the first eth0 interface. Our system will not need this, so we will remove it. This bridge is configured in the /etc/libvirt/qemu/networks/default.xml file, so to remove this bridge, simply delete the contents of the file.

cat /dev/null >/etc/libvirt/qemu/networks/default.xml

The next time you reboot, that bridge will be gone.

Create /etc/xen/scripts/an-network-script

We will create a script that Xen will be told to use for bringing up the "xenified" network interfaces.

Please note:

  1. You don't need to use the name 'an-network-script'. I suggest this name mainly to keep in line with the rest of the 'AN!x' naming used on this wiki.
  2. If you install convirt (not discussed further here), it will create its own bridge script called convirt-xen-multibridge. Other tools may do something similar.

First, touch the file and then chmod it to be executable.

touch /etc/xen/scripts/an-network-script
chmod 755 /etc/xen/scripts/an-network-script

Now edit it to contain the following:

vim /etc/xen/scripts/an-network-script
#!/bin/sh
dir=$(dirname "$0")
"$dir/network-bridge" "$@" vifnum=0 netdev=eth0 bridge=xenbr0
"$dir/network-bridge" "$@" vifnum=2 netdev=eth2 bridge=xenbr2

Now tell Xen to reference that script by editing /etc/xen/xend-config.sxp file and changing the network-script argument to point to this new script (this is line 91 in the default xend-config.sxp script):

vim /etc/xen/xend-config.sxp
#(network-script network-bridge)
(network-script an-network-script)

Finally, check that it works by (re)starting xend:

/etc/init.d/xend restart
restart xend:                                              [  OK  ]

Now we'll use ifconfig to see the new network configuration (with a dash of creative grep to save screen space):

ifconfig |grep "Link encap" -A 1
eth0      Link encap:Ethernet  HWaddr 48:5B:39:3C:53:15
          inet addr:192.168.1.71  Bcast:192.168.1.255  Mask:255.255.255.0
--
eth1      Link encap:Ethernet  HWaddr 00:1B:21:72:96:E8
          inet addr:192.168.2.71  Bcast:192.168.2.255  Mask:255.255.255.0
--
eth2      Link encap:Ethernet  HWaddr 00:1B:21:72:9B:56
          inet addr:192.168.3.71  Bcast:192.168.3.255  Mask:255.255.255.0
--
lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
--
peth0     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
--
peth2     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
--
vif0.0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
--
vif0.2    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
--
xenbr0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1
--
xenbr2    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1

If you see this, the Xen networking is set up properly!
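As a second check, if the bridge-utils package is installed (it is normally pulled in alongside Xen), brctl gives a more compact view of the bridges and which interfaces are enslaved to them. A sketch of the check; the exact member list depends on your hardware:

```shell
# List all bridges; xenbr0 and xenbr2 should each show a peth and a vif member.
brctl show
```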

Altering When xend Starts

As was mentioned, xend rather dramatically modifies the networking when it starts. We now need to make sure that xend starts before cman, which is not the case by default. To do this, we will edit the /etc/init.d/xend script and change its default start position from 98 to 12.

Edit /etc/init.d/xend:

vim /etc/init.d/xend

And change the chkconfig: 2345 98 01 header from:

#!/bin/bash
#
# xend          Script to start and stop the Xen control daemon.
#
# Author:       Keir Fraser <keir.fraser@cl.cam.ac.uk>
#
# chkconfig: 2345 98 01
# description: Starts and stops the Xen control daemon.

To chkconfig: 2345 12 01:

#!/bin/bash
#
# xend          Script to start and stop the Xen control daemon.
#
# Author:       Keir Fraser <keir.fraser@cl.cam.ac.uk>
#
# chkconfig: 2345 12 01
# description: Starts and stops the Xen control daemon.

Now delete and re-add the xend initialization script using chkconfig.

chkconfig xend off
chkconfig xend on

If it worked, you should see it now higher up the start list than cman (ignore xendomains):

ls -lah /etc/rc3.d/ | grep -e cman -e xend
lrwxrwxrwx  1 root root   20 Mar  2 11:36 K00xendomains -> ../init.d/xendomains
lrwxrwxrwx  1 root root   14 Mar 14 11:38 S12xend -> ../init.d/xend
lrwxrwxrwx  1 root root   14 Mar 14 11:38 S21cman -> ../init.d/cman

That's it! The initial Xen configuration is done and we can start on the cluster configuration itself!

Cluster Setup

In Red Hat Cluster Services, the heart of the cluster is found in the /etc/cluster/cluster.conf XML configuration file.

There are three main ways of editing this file. Two are already well documented, so I won't discuss them beyond introducing them. The third way is directly hand-crafting the cluster.conf file. This method is not very well documented, yet directly manipulating configuration files is my preferred method. As my boss loves to say: "The more computers do for you, the more they do to you." I've grudgingly come to agree with him.

The first two, well documented, graphical interface methods are:

  • system-config-cluster, older GUI tool run directly from one of the cluster nodes.
  • Conga, comprised of the ricci node-side client and the luci web-based server (can be run on machines outside the cluster).

I do like the tools above, but I often find issues that send me back to the command line, so I'd recommend setting them aside for now. Once you feel comfortable with the cluster.conf syntax, then by all means go back and use them. Just don't come to rely on them, which can happen if you start using them too early in your studies.

The First cluster.conf Foundation Configuration

The very first stage of building the cluster is to create a configuration file that is as minimal as possible. To do that, we need to define a few things:

  • The name of the cluster and the cluster file version.
    • Define cman options
    • The nodes in the cluster
      • The fence method for each node
    • Define fence devices
    • Define fenced options

That's it. Once we've defined this minimal amount, we will be able to start the cluster for the first time! So let's get to it, finally.

Name the Cluster and Set The Configuration Version

The cluster tag is the parent tag for the entire cluster configuration file.

<?xml version="1.0"?>
<cluster name="an-cluster" config_version="1">
</cluster>

There are two attributes that we need to set: name="" and config_version="".

The name="" attribute defines the name of the cluster. It must be unique amongst the clusters on your network. It should be descriptive, but you will not want to make it too long, either. You will see this name in the various cluster tools, and you will enter it, for example, when creating a GFS2 partition later on. This tutorial uses the cluster name an-cluster.

The config_version="" attribute is an integer marking the version of the configuration file. Whenever you make a change to the cluster.conf file, you will need to increment this version number by 1. If you don't increment this number, then the cluster tools will not know that the file needs to be reloaded. As this is the first version of this configuration file, it will start with 1. Note that this tutorial will increment the version after every change, regardless of whether it is explicitly pushed out to the other nodes and reloaded. The reason is to help get into the habit of always increasing this value.
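Since forgetting to bump config_version is a common mistake, the increment can be done mechanically. The helper below is a sketch: it demonstrates the edit on a scratch copy in /tmp (point conf at /etc/cluster/cluster.conf for real use), and it assumes the attribute appears exactly once, on a single line.

```shell
# Demo on a scratch copy of a minimal cluster.conf.
conf=/tmp/cluster.conf.demo
printf '%s\n' '<?xml version="1.0"?>' '<cluster name="an-cluster" config_version="2">' '</cluster>' > "$conf"

# Extract the current version, then rewrite it as version + 1.
ver=$(sed -n 's/.*config_version="\([0-9]*\)".*/\1/p' "$conf")
sed -i "s/config_version=\"$ver\"/config_version=\"$((ver + 1))\"/" "$conf"
grep config_version "$conf"
```

Run against the demo file above, the final grep shows the line with config_version="3".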

Configuring cman Options

We are going to set up a special case for our cluster: a 2-node cluster.

This is a special case because traditional quorum will not be useful. With only two nodes, each having a vote of 1, the total number of votes is 2. Quorum needs 50% + 1, which means that a single node failure would shut down the cluster, as the remaining node's vote is exactly 50%. That rather defeats the purpose of having a cluster at all.

So to account for this special case, there is a special attribute called two_node="1". This tells the cluster manager to continue operating with only one vote. This option requires that the expected_votes="" attribute be set to 1. Normally, expected_votes is set automatically to the total sum of the defined cluster nodes' votes (which itself is a default of 1). This is the other half of the "trick", as a single node's vote of 1 now always provides quorum (that is, 1 meets the 50% + 1 requirement).
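The quorum arithmetic above can be checked directly. This is plain integer math, not output from any cluster tool:

```shell
# Standard quorum: floor(total_votes / 2) + 1.
total_votes=2                        # two nodes, one vote each
quorum=$(( total_votes / 2 + 1 ))    # = 2; one node alone cannot reach it
echo "quorum needed: $quorum"

# With two_node="1" and expected_votes="1", a single vote suffices.
expected_votes=1
two_node_quorum=$(( expected_votes / 2 + 1 ))    # = 1
echo "two_node quorum: $two_node_quorum"
```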

<?xml version="1.0"?>
<cluster name="an-cluster" config_version="2">
	<cman expected_votes="1" two_node="1"/>
</cluster>

Take note of the self-closing <... /> tag. This is XML syntax that tells the parser not to look for any child elements or a separate closing tag.

Defining Cluster Nodes

The example below is a little artificial; please don't load it into your cluster, as we will need to add a few child tags. One thing at a time, though.

This actually introduces two tags.

The first is the parent clusternodes tag, which takes no attributes of its own. Its sole purpose is to contain the clusternode child tags.

<?xml version="1.0"?>
<cluster name="an-cluster" config_version="3">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-node01.alteeve.com" nodeid="1" />
		<clusternode name="an-node02.alteeve.com" nodeid="2" />
	</clusternodes>
</cluster>

The clusternode tag defines each cluster node. There are many attributes available, but we will look at just the two required ones.

The first is the name="" attribute. This must match the name given by uname -n when run on each node. The IP address that the name resolves to also sets the interface and subnet that the totem ring will run on. That is, it carries the main cluster communications, which we are calling the Back-Channel Network. This is why it is so important to set up our /etc/hosts file correctly.

The second attribute is nodeid="". This must be a unique integer amongst the <clusternode ...> tags. It is used by the cluster to identify the node.
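A quick sanity check for the name="" requirement might look like the sketch below. The helper name and the scratch file are assumptions for illustration; on a real node you would point the check at /etc/cluster/cluster.conf:

```shell
# Return success if this host's uname -n appears as a clusternode
# name in the given configuration file.
node_name_matches() {
    local conf="$1"
    grep -q "clusternode name=\"$(uname -n)\"" "$conf"
}

# Demonstrate against a scratch file that deliberately contains
# this host's name.
conf=$(mktemp)
printf '<clusternode name="%s" nodeid="1" />\n' "$(uname -n)" > "$conf"
node_name_matches "$conf" && echo "name matches uname -n"
```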

Defining Fence Devices

Fencing devices are designed to forcibly eject a node from a cluster. Generally, this is done by forcing it to power off or reboot. Some SAN switches can logically disconnect a node from the shared storage device, which has the same effect: guaranteeing that the defective node cannot alter the shared storage. A common, third type of fence device is one that cuts the mains power to the server.

All fence devices are contained within the parent fencedevices tag. This parent tag has no attributes. Within this parent tag are one or more fencedevice child tags.

<?xml version="1.0"?>
<cluster name="an-cluster" config_version="4">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-node01.alteeve.com" nodeid="1" />
		<clusternode name="an-node02.alteeve.com" nodeid="2" />
	</clusternodes>
	<fencedevices>
		<fencedevice agent="fence_na" ipaddr="batou.alteeve.com" login="admin" name="batou" passwd="secret" quiet="1"/>
	</fencedevices>
</cluster>

Every fence device used in your cluster will have its own fencedevice tag. If you are using IPMI, this means you will have a fencedevice entry for each node, as each physical IPMI BMC is a unique fence device.

All fencedevice tags share two basic attributes; name="" and agent="".

The name attribute must be unique among all the fence devices in your cluster. As we will see in the next step, this name will be used within the <clusternode...> tag.

The agent attribute tells the cluster which fence agent to use when the fenced daemon needs to communicate with the physical fence device. A fence agent is simply a script that acts as a glue layer between the fenced daemon and the fence hardware. It takes the arguments from the daemon, like what port to act on and what action to take, and executes the action against the node. The agent is responsible for ensuring that the action succeeded and for returning an appropriate success or failure exit code. For those curious, the full details are described in the FenceAgentAPI. If you have two or more of the same type of fence device, like IPMI, then you will use the same fence agent value a corresponding number of times.

Beyond these two attributes, each fence agent will have its own set of attributes. Covering them all is outside the scope of this tutorial, though we will see examples for IPMI, a switched PDU and a Node Assassin device. Most, if not all, fence agents have a corresponding man page that shows what attributes they accept and how they are used. The three fence agents we will see here have their attributes defined in the following man pages.

  • man fence_na - Node Assassin fence agent
  • man fence_ipmilan - IPMI fence agent
  • man fence_tripplite - Looks like I need to work on this agent...
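To make the "glue layer" idea concrete, here is a toy agent in the style described above, reading key=value pairs on standard input as the fenced daemon supplies them. This is purely illustrative, controls no hardware, and is not any real fence agent:

```shell
# A toy "fence agent": read key=value pairs on stdin, as fenced
# would supply them, and report the requested action and port.
toy_agent() {
    local action="" port=""
    while IFS='=' read -r key value; do
        case "$key" in
            action) action="$value" ;;
            port)   port="$value" ;;
        esac
    done
    # A real agent would now act on the hardware and exit 0 only
    # on confirmed success; we just echo what was requested.
    echo "would perform '$action' on port '$port'"
}

result=$(printf 'action=reboot\nport=1\n' | toy_agent)
echo "$result"
```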

The example above is what this tutorial will use. For reference though, here is an example as if IPMI were also in use. Note that the corresponding elements in the <clusternode...> tag will be shown as an alternative example shortly.

Example <fencedevice...> Tag For Node Assassin

This is the device used throughout this tutorial. It is for the open source, open hardware Node Assassin fence device that you can build yourself.

	<fencedevices>
		<fencedevice agent="fence_na" ipaddr="batou.alteeve.com" login="admin" name="batou" passwd="secret" quiet="1"/>
	</fencedevices>

Being a network-attached fence device, as most fence devices are, the attributes for fence_na include connection information. The attribute variable names are generally the same across fence agents, and they are:

  • ipaddr; This is the resolvable name or IP address of the device. If you use a resolvable name, it is strongly advised that you put the name in /etc/hosts as DNS is another layer of abstraction which could fail.
  • login; This is the login name to use when the fenced daemon connects to the device. This is configured in /etc/cluster/fence_na.conf.
  • passwd; This is the login password to use when the fenced daemon connects to the device. This is also configured in /etc/cluster/fence_na.conf.
  • name; This is the general name of this particular fence device which, as we will see shortly, is referenced from within each node's <clusternode...> configuration.

Example <fencedevice...> Tag For IPMI

Here we will show what IPMI <fencedevice...> tags look like.

	<fencedevices>
		<fencedevice agent="fence_ipmilan" ipaddr="192.168.3.63" login="root" name="fence_ipmi_3" passwd="secret"/>
		<fencedevice agent="fence_ipmilan" ipaddr="192.168.3.64" login="root" name="fence_ipmi_4" passwd="secret"/>
	</fencedevices>


Notes

These are notes to work in later that I don't want to forget.

  • You must have clvmd running when you run vgcreate, otherwise the VG will not automatically be created as a clustered VG. To be safe, it's best to always use vgcreate -c y ... (short for --clustered y). See rhbz 684896.
  • Because we will be live-migrating the VMs across the clustered LVs, we need to configure volume_list in /etc/lvm/lvm.conf. This value restricts which VGs/LVs may be activated on the system. The tag name itself is not so important, only that it be used consistently on all our VGs and be prefixed with @. Tag names may use the characters A-Z a-z 0-9 _ + . -. This tutorial will use @an-cluster, so we will set volume_list = ["@an-cluster"]. With that set, the actual command to tag each VG is vgchange --addtag @an-cluster /dev/VGNAME.
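As a consolidated sketch of the note above (the tag @an-cluster is this tutorial's choice and VGNAME is a placeholder, not a real volume group):

	# In /etc/lvm/lvm.conf, restrict activation to VGs tagged @an-cluster:
	volume_list = [ "@an-cluster" ]
	
	# Create the VG as clustered, then tag it:
	vgcreate -c y VGNAME /dev/drbd0
	vgchange --addtag @an-cluster /dev/VGNAME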

 

Any questions, feedback, advice, complaints or meanderings are welcome.