Anvil! m2 Tutorial

From Alteeve Wiki
Revision as of 19:38, 7 December 2013 by Digimer (talk | contribs)


Warning: This tutorial is far from complete! Please follow AN!Cluster Tutorial 2 for the time being.
Note: This is the second edition of the 2-Node Red Hat KVM Cluster Tutorial. This edition introduces the AN!CDB dashboard, drastically simplifying management of the cluster and its servers.

This paper has one goal;

  • Show you how to build an Anvil! HA platform for hosting servers. It will show you how to build the ideal system architecture, how to install and configure the underlying systems and how to manually create and manage virtual servers.

Grab a coffee, put on some nice music and settle in for some geekly fun.

The Task Ahead

Before we start, let's take a few minutes to discuss clustering and its complexities.

Technologies We Will Use

  • Red Hat Enterprise Linux 6 (EL6); You can use a derivative like CentOS v6.
  • Red Hat Cluster Services "Stable" version 3. This describes the following core components:
    • Corosync; Provides cluster communications using the totem protocol.
    • Cluster Manager (cman); Manages the starting and stopping of the cluster.
    • Resource Manager (rgmanager); Manages cluster resources and services. Handles service recovery during failures.
    • Clustered Logical Volume Manager (clvm); Cluster-aware (disk) volume manager. Backs GFS2 filesystems and KVM virtual machines.
    • Global File Systems version 2 (gfs2); Cluster-aware, concurrently mountable file system.
  • Distributed Replicated Block Device (DRBD); Keeps shared data synchronized across cluster nodes.
  • KVM; Hypervisor that controls and supports virtual machines.

A Note on Hardware

In this tutorial, I will make reference to specific hardware components and devices. In the interest of full disclosure, Alteeve's Niche! is a reseller of Fujitsu. Other vendors with whom we have no reseller agreement, but whose products we have tested and used extensively, will also be recommended. You can, of course, use any hardware vendor you wish, provided it meets the requirements mentioned a little later in this tutorial.

A Note on Patience

When someone wants to become a pilot, they can't jump into a plane and try to take off. It's not that flying is inherently hard, but it requires a foundation of understanding. Clustering is the same in this regard; there are many different pieces that have to work together just to get off the ground.

You must have patience.

Like a pilot on their first flight, seeing a cluster come to life is a fantastic experience. Don't rush it! Do your homework and you'll be on your way before you know it.

Coming back to earth:

Many technologies can be learned by creating a very simple base and then building on it. The classic "Hello, World!" script created when first learning a programming language is an example of this. Unfortunately, there is no real analogue to this in clustering. Even the most basic cluster requires several pieces be in place and working together. If you try to rush by ignoring pieces you think are not important, you will almost certainly waste time. A good example is setting aside fencing, thinking that your test cluster's data isn't important. The cluster software has no concept of "test". It treats everything as critical all the time and will shut down if anything goes wrong.

Take your time, work through these steps, and you will have your cluster's foundation sooner than you realize. Clustering is fun because it is a challenge.

Prerequisites

It is assumed that you are familiar with Linux systems administration, specifically Red Hat Enterprise Linux and its derivatives. You will need somewhat advanced networking experience as well. You should be comfortable working in a terminal (directly or over ssh). Familiarity with XML will help, but is not strictly required, as its use here is fairly self-evident.

If you feel a little out of depth at times, don't hesitate to set this tutorial aside. Browse over to the components you feel the need to study more, then return and continue on. Finally, and perhaps most importantly, you must have patience! If you have a manager asking you to "go live" with a cluster in a month, tell him or her that it simply won't happen. If you rush, you will skip important points and you will fail.

Patience is vastly more important than any pre-existing skill.

Focus and Goal

There is a different cluster for every problem. Generally speaking though, there are two main problems that clusters try to resolve; Performance and High Availability. Performance clusters are generally tailored to the application requiring the performance increase. There are some general tools for performance clustering, like Red Hat's LVS (Linux Virtual Server) for load-balancing common applications like the Apache web-server.

This tutorial will focus on High Availability clustering, often shortened to simply HA and not to be confused with the Linux-HA "heartbeat" cluster suite, which we will not be using here. The cluster will provide a shared file system and high availability for KVM-based virtual servers. The goal will be to have the virtual servers live-migrate during planned node outages and automatically restart on a surviving node when the original host node fails.

Below is a very brief overview:

High Availability clusters like ours have two main parts; Cluster management and resource management.

The cluster itself is responsible for maintaining the cluster nodes in a group. This group is part of a "Closed Process Group", or CPG. When a node fails, the cluster manager must detect the failure, reliably eject the node from the cluster using fencing and then reform the CPG. Each time the cluster changes, or "re-forms", the resource manager is called. The resource manager checks to see how the cluster changed, consults its configuration and determines what to do, if anything.

The details of all this will be discussed a little later on. For now, it's sufficient to keep in mind these two major roles and to understand that they are somewhat independent entities.

Platform

This tutorial was written using RHEL version 6.4, x86_64 architecture. The KVM hypervisor will not run on i686. CentOS 6 has been tested and works perfectly. No testing was done on other EL6 derivatives. That said, there is no reason to believe that this tutorial will not apply to any variant of EL6. As much as possible, the language will be distro-agnostic.

A Word On Complexity

Introducing the Fabimer Principle:

Clustering is not inherently hard, but it is inherently complex. Consider:

  • Any given program has N bugs.
    • RHCS uses; cman, corosync, dlm, fenced, rgmanager, and many more smaller apps.
    • We will be adding DRBD, GFS2, clvmd, libvirtd and KVM.
    • Right there, we have N^10 possible bugs. We'll call this A.
  • A cluster has Y nodes.
    • In our case, 2 nodes, each with 3 networks across 6 interfaces bonded into pairs.
    • The network infrastructure (Switches, routers, etc). We will be using two managed switches, adding another layer of complexity.
    • This gives us another Y^(2*(3*2))+2, the +2 for managed switches. We'll call this B.
  • Let's add the human factor. Let's say that a person needs roughly 5 years of cluster experience to be considered proficient. For each year less than this, add a Z "oops" factor, (5-Z)^2. We'll call this C.
  • So, finally, add up the complexity, using this tutorial's layout, 0-years of experience and managed switches.
    • (N^10) * (Y^(2*(3*2))+2) * ((5-0)^2) == (A * B * C) == an-unknown-but-big-number.

This isn't meant to scare you away, but it is meant to be a sobering statement. Obviously, those numbers are somewhat artificial, but the point remains.

Any one piece is easy to understand, thus, clustering is inherently easy. However, given the large number of variables, you must really understand all the pieces and how they work together. DO NOT think that you will have this mastered and working in a month. Certainly don't try to sell clusters as a service without a lot of internal testing.

Clustering is kind of like chess. The rules are pretty straightforward, but the complexity can take some time to master.

Overview of Components

When looking at a cluster, there is a tendency to want to dive right into the configuration file. That is not very useful in clustering.

  • When you look at the configuration file, it is quite short.

Clustering isn't like most applications or technologies. Most of us learn by taking something such as a configuration file, and tweaking it to see what happens. I tried that with clustering and learned only what it was like to bang my head against the wall.

  • Understanding the parts and how they work together is critical.

You will find that the discussion on the components of clustering, and how those components and concepts interact, will be much longer than the initial configuration. It is true that we could talk very briefly about the actual syntax, but it would be a disservice. Please don't rush through the next section, or worse, skip it and go right to the configuration. You will waste far more time than you will save.

  • Clustering is easy, but it has a complex web of inter-connectivity. You must grasp this network if you want to be an effective cluster administrator!

Component; cman

The cman portion of the cluster is the cluster manager. In the "cluster stable 3.0" series used in EL6, cman acts mainly as a quorum provider. That is, it adds up the votes from the cluster members and decides if there is a simple majority. If there is, the cluster is "quorate" and is allowed to provide cluster services. Newer versions of the Red Hat Cluster Suite found in Fedora use a new quorum provider, and cman will be removed entirely.

Until it is removed, the cman service will be used to start and stop all of the daemons needed to make the cluster operate.

Component; corosync

Corosync is the heart of the cluster. Almost all other cluster components operate through it.

In Red Hat clusters, corosync is configured via the central cluster.conf file. It can be configured directly in corosync.conf, but given that we will be building an RHCS cluster, we will only use cluster.conf. That said, almost all corosync.conf options are available in cluster.conf. This is important to note as you will see references to both configuration files when searching the Internet.

Corosync sends messages using multicast messaging by default. Recently, unicast support has been added, but due to network latency, it is only recommended for use with small clusters of two to four nodes. We will be using multicast in this tutorial.

A Little History

There were significant changes between the old RHCS version 2 and version 3, available on EL6, which we are using.

In the RHCS version 2, there was a component called openais which provided totem. The OpenAIS project was designed to be the heart of the cluster and was based around the Service Availability Forum's Application Interface Specification. AIS is an open API designed to provide inter-operable high availability services.

In 2008, it was decided that the AIS specification was overkill for most clustered applications being developed in the open source community. At that point, OpenAIS was split into two projects: Corosync and OpenAIS. The former, Corosync, provides totem, cluster membership, messaging, and basic APIs for use by clustered applications, while the OpenAIS project became an optional add-on to corosync for users who want the full AIS API.

You will see a lot of references to OpenAIS while searching the web for information on clustering. Understanding its evolution will hopefully help you avoid confusion.

Concept; quorum

Quorum is defined as the minimum set of hosts required in order to provide clustered services and is used to prevent split-brain situations.

The quorum algorithm used by the RHCS cluster is called "simple majority quorum", which means that more than half of the hosts must be online and communicating in order to provide service. While simple majority quorum is a very common quorum algorithm, other quorum algorithms exist (grid quorum, YKD Dynamic Linear Voting, etc.).

The idea behind quorum is that, when a cluster splits into two or more partitions, whichever group of machines has quorum can safely start clustered services, knowing that no other partition will try to do the same.

Take this scenario;

  • You have a cluster of four nodes, each with one vote.
    • The cluster's expected_votes is 4. A clear majority, in this case, is 3, because (4/2)+1 is 3.
    • Now imagine that there is a failure in the network equipment and one of the nodes disconnects from the rest of the cluster.
    • You now have two partitions; One partition contains three machines and the other partition has one.
    • The three machines will have quorum, and the other machine will lose quorum.
    • The partition with quorum will reconfigure and continue to provide cluster services.
    • The partition without quorum will withdraw from the cluster and shut down all cluster services.

When the cluster reconfigures, the partition with quorum will fence the node(s) in the partition without quorum. Once the fencing has been confirmed successful, the partition with quorum will begin accessing clustered resources, like shared filesystems.

This also helps explain why an even 50% is not enough to have quorum, a common question for people new to clustering. Using the above scenario, imagine if the split were 2 and 2 nodes. Because neither partition can be sure what the other is doing, neither can safely proceed. If we allowed an even 50% to have quorum, both partitions might try to take over the clustered services and disaster would soon follow.
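The vote arithmetic above can be sketched in a few lines. This is a toy illustration of simple majority quorum, not code from any cluster component; the function name is purely hypothetical:

```python
def has_quorum(votes_present, expected_votes):
    """Simple majority quorum: strictly more than half of the
    expected votes must be present. Illustrative only."""
    return votes_present > expected_votes // 2

# A four-node cluster, one vote per node (expected_votes = 4).
print(has_quorum(3, 4))  # True: a 3-node partition is quorate (3 > 2)
print(has_quorum(1, 4))  # False: the lone node withdraws
print(has_quorum(2, 4))  # False: in an even 2/2 split, NEITHER side is quorate
```

Note that the strict "more than half" comparison is exactly what makes the 2/2 split fail on both sides.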

There is one, and only one, exception to this rule.

In the case of a two node cluster, as we will be building here, any failure results in a 50/50 split. If we enforced quorum in a two-node cluster, there would never be high availability, because any failure would cause both nodes to withdraw. The risk with this exception is that we now place the entire safety of the cluster on fencing, a concept we will cover shortly. Fencing is a second line of defense and something we are loath to rely on alone.

Even in a two-node cluster though, proper quorum can be maintained by using a quorum disk, called a qdisk. Unfortunately, qdisk on a DRBD resource comes with its own problems, so we will not be able to use it here.

Concept; Virtual Synchrony

Many cluster operations, like distributed locking and so on, have to occur in the same order across all nodes. This concept is called "virtual synchrony".

This is provided by corosync using "closed process groups", CPG. A closed process group is simply a private group of processes in a cluster. Within this closed group, all messages between members are ordered. Delivery, however, is not guaranteed. If a member misses messages, it is up to the member's application to decide what action to take.

Let's look at two scenarios showing how locks are handled using CPG;

  • The cluster starts up cleanly with two members.
  • Both members are able to start service:foo.
  • Both want to start it, but need a lock from DLM to do so.
    • The an-c05n01 member has its totem token, and sends its request for the lock.
    • DLM issues a lock for that service to an-c05n01.
    • The an-c05n02 member requests a lock for the same service.
    • DLM rejects the lock request.
  • The an-c05n01 member successfully starts service:foo and announces this to the CPG members.
  • The an-c05n02 member sees that service:foo is now running on an-c05n01 and no longer tries to start the service.
  • The two members want to write to a common area of the /shared GFS2 partition.
    • The an-c05n02 member sends a request for a DLM lock against the FS, and gets it.
    • The an-c05n01 member sends a request for the same lock, but DLM sees that a lock is pending and rejects the request.
    • The an-c05n02 member finishes altering the file system, announces the change over CPG and releases the lock.
    • The an-c05n01 member updates its view of the filesystem, requests a lock, receives it and proceeds to update the filesystem.
    • It completes the changes, announces them over CPG and releases the lock.

Messages can only be sent to the members of the CPG while the node has a totem token from corosync.
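The locking behaviour in the scenarios above can be modelled with a toy lock manager: the first requester gets the lock, later requesters are rejected until it is released. This is purely illustrative and is not the DLM API:

```python
class ToyLockManager:
    """A toy model of DLM's grant/reject behaviour. Illustrative only."""

    def __init__(self):
        self.holders = {}  # lockspace name -> member holding the lock

    def request(self, lockspace, member):
        if lockspace in self.holders:
            return False           # a lock is already held; reject
        self.holders[lockspace] = member
        return True                # lock granted

    def release(self, lockspace, member):
        if self.holders.get(lockspace) == member:
            del self.holders[lockspace]

dlm = ToyLockManager()
print(dlm.request("service:foo", "an-c05n01"))  # True: lock granted
print(dlm.request("service:foo", "an-c05n02"))  # False: rejected, lock pending
dlm.release("service:foo", "an-c05n01")
print(dlm.request("service:foo", "an-c05n02"))  # True: now granted
```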

Concept; Fencing

Warning: DO NOT BUILD A CLUSTER WITHOUT PROPER, WORKING AND TESTED FENCING.
Laugh, but this is a weekly conversation.

Fencing is an absolutely critical part of clustering. Without fully working fence devices, your cluster will fail.

Sorry, I promise that this will be the only time that I speak so strongly. Fencing really is critical, and explaining the need for fencing is nearly a weekly event.

So then, let's discuss fencing.

When a node stops responding, an internal timeout and counter start ticking away. During this time, no DLM locks are allowed to be issued. Anything using DLM, including rgmanager, clvmd and gfs2, is effectively hung. The hung node is detected using a totem token timeout. That is, if a token is not received from a node within a period of time, it is considered lost and a new token is sent. After a certain number of lost tokens, the cluster declares the node dead. The remaining nodes reconfigure into a new cluster and, if they have quorum (or if quorum is ignored), a fence call against the silent node is made.
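The detection timing works out as a simple multiplication; a minimal sketch using the default values quoted in this tutorial (the variable names are illustrative, not corosync parameters):

```python
# Failure-detection arithmetic, using the defaults quoted in this tutorial.
token_timeout_ms = 238   # time to wait for a token before counting an error
error_limit = 4          # lost tokens tolerated before declaring the node dead

time_to_declare_dead_ms = token_timeout_ms * error_limit
print(time_to_declare_dead_ms)  # 952 ms; roughly one second with overhead
```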

The fence daemon will look at the cluster configuration and get the fence devices configured for the dead node. Then, one at a time and in the order that they appear in the configuration, the fence daemon will call those fence devices, via their fence agents, passing to the fence agent any configured arguments like username, password, port number and so on. If the first fence agent returns a failure, the next fence agent will be called. If the second fails, the third will be called, then the fourth and so on. If the last (or perhaps only) fence device fails, the fence daemon will retry, starting again at the top of the list. It will do this indefinitely until one of the fence devices succeeds.
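That retry behaviour can be sketched as a loop. The device list and the callables in it are hypothetical stand-ins for the real fence agents and their configured arguments; this is not fenced's actual code:

```python
def fence_node(fence_devices):
    """Try each configured fence device in order, forever, until one
    reports success. Mirrors the behaviour described above: if every
    device fails, start over at the top of the list. Note that with no
    working device this loops indefinitely, just as fenced does."""
    while True:
        for try_fence in fence_devices:
            if try_fence():
                return True  # fencing succeeded; recovery can begin

# Example: a (hypothetical) primary device that always fails,
# backed by a second device that succeeds.
primary = lambda: False   # e.g. an unreachable IPMI interface
backup = lambda: True     # e.g. a switched PDU
print(fence_node([primary, backup]))  # True
```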

Here's the flow, in point form:

  • The totem token moves around the cluster members. As each member gets the token, it sends sequenced messages to the CPG members.
  • The token is passed from one node to the next, in order and continuously during normal operation.
  • Suddenly, one node stops responding.
    • A timeout starts (~238ms by default), and each time the timeout is hit, an error counter increments and a replacement token is created.
    • The silent node responds before the failure counter reaches the limit.
      • The failure counter is reset to 0.
      • The cluster operates normally again.
  • Again, one node stops responding.
    • Again, the timeout begins. As each totem token times out, a new packet is sent and the error count increments.
    • The error count exceeds the limit (4 errors is the default); roughly one second has passed (238ms * 4, plus some overhead).
    • The node is declared dead.
    • The cluster checks which members it still has, and whether that provides enough votes for quorum.
      • If there are too few votes for quorum, the cluster software freezes and the node(s) withdraw from the cluster.
      • If there are enough votes for quorum, the silent node is declared dead.
        • corosync calls fenced, telling it to fence the node.
        • The fenced daemon notifies DLM and locks are blocked.
        • Which fence device(s) to use, that is, what fence_agent to call and what arguments to pass, is gathered.
        • For each configured fence device:
          • The agent is called and fenced waits for the fence_agent to exit.
          • The fence_agent's exit code is examined. If it's a success, recovery starts. If it failed, the next configured fence agent is called.
        • If all (or the only) configured fence devices fail, fenced will start over.
        • fenced will wait and loop forever until a fence agent succeeds. During this time, the cluster is effectively hung.
      • Once a fence_agent succeeds, fenced notifies DLM and lost locks are recovered.
        • GFS2 partitions recover using their journal.
        • Lost cluster resources are recovered as per rgmanager's configuration (including file system recovery as needed).
  • Normal cluster operation is restored, minus the lost node.

This skipped a few key things, but the general flow of logic should be there.

This is why fencing is so important. Without a properly configured and tested fence device or devices, the cluster will never successfully fence and the cluster will remain hung until a human can intervene.

Component; totem

The totem protocol defines message passing within the cluster and it is used by corosync. A token is passed around all the nodes in the cluster, and nodes can only send messages while they have the token. A node will keep its messages in memory until it gets the token back with no "not ack" messages. This way, if a node missed a message, it can request it be resent when it gets its token. If a node isn't up, it will simply miss the messages.
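The retransmission rule can be modelled with a toy structure: a sender keeps each message until the token comes back with no "not ack" for it. The dictionary and function here are hypothetical, not the corosync implementation:

```python
# Toy model: messages held in memory, keyed by sequence number,
# until the token confirms they were delivered everywhere.
sent_unacked = {1: "msg-1", 2: "msg-2", 3: "msg-3"}

def token_returned(nacked_sequence_numbers):
    """On getting the token back, free messages with no "not ack"
    and return the ones that must be retransmitted."""
    retransmit = [sent_unacked[n] for n in nacked_sequence_numbers]
    for n in list(sent_unacked):
        if n not in nacked_sequence_numbers:
            del sent_unacked[n]  # safely delivered; no need to keep it
    return retransmit

# A member missed message 2 and "not ack"ed it; 1 and 3 are freed.
print(token_returned([2]))  # ['msg-2']
```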

The totem protocol supports something called 'rrp', Redundant Ring Protocol. Through rrp, you can add a second backup ring on a separate network to take over in the event of a failure in the first ring. In RHCS, these rings are known as "ring 0" and "ring 1". The RRP is being re-introduced in RHCS version 3. Its use is experimental and should only be used with plenty of testing.

Component; rgmanager

When the cluster membership changes, corosync tells rgmanager that it needs to recheck its services. It will examine what changed and then start, stop, migrate or recover cluster resources as needed.

Within rgmanager, one or more resources are brought together as a service. This service is then optionally assigned to a failover domain, a subset of nodes that can have preferential ordering.

The rgmanager daemon runs separately from the cluster manager, cman. This means that, to fully start the cluster, we need to start both cman and then rgmanager.

Component; pacemaker

Pacemaker is an alternate resource manager that can be used instead of rgmanager. We do not use it in this tutorial.

The Pacemaker project is planned to replace cman and rgmanager in RHEL version 7. It is currently available as a "Tech Preview" in RHEL version 6. What that means is that Red Hat does not offer support for clusters using pacemaker on RHEL 6 and that updates are not provided in between y-stream releases.

Component; qdisk

Note: qdisk does not work reliably on a DRBD resource, so we will not be using it in this tutorial.

A quorum disk, known as a qdisk, is a small partition on SAN storage used to enhance quorum. It generally carries enough votes to allow even a single node to take quorum during a cluster partition. It does this by using configured heuristics, that is, custom tests, to decide which node or partition is best suited for providing clustered services during a cluster reconfiguration. These heuristics can be simple, like testing which partition has access to a given router, or they can be as complex as the administrator wishes, using custom scripts.

Though we won't be using it here, it is well worth knowing about when you move to a cluster with SAN storage.

Component; DRBD

DRBD, the Distributed Replicated Block Device, is a technology that takes raw storage from two or more nodes and keeps their data synchronized in real time. It is sometimes described as "RAID 1 over Cluster Nodes", and that is conceptually accurate. In this tutorial's cluster, DRBD will be used to provide the back-end storage as a cost-effective alternative to a traditional SAN device.

To help visualize DRBD's use and role, take a look at how we will implement our cluster's storage.

This shows;

  • Each node having four physical disks tied together in a RAID Level 5 array and presented to the node's OS as a single drive, found at /dev/sda.
  • Each node's OS uses three primary partitions for /boot, <swap> and /.
  • Three extended partitions are created;
    • /dev/sda5 backs a small partition used as a GFS2-formatted shared mount point.
    • /dev/sda6 backs the VMs designed to run primarily on an-c05n01.
    • /dev/sda7 backs the VMs designed to run primarily on an-c05n02.
  • Each extended partition is combined with its counterpart on the other node using DRBD, creating three DRBD resources;
    • /dev/drbd0 is backed by /dev/sda5.
    • /dev/drbd1 is backed by /dev/sda6.
    • /dev/drbd2 is backed by /dev/sda7.
  • All three DRBD resources are managed by clustered LVM.
  • The GFS2-formatted LV is mounted on /shared on both nodes.
  • Each VM gets its own LV.
  • All three DRBD resources sync over the Storage Network, which uses the bonded bond1 interface (backed by eth1 and eth4).

Don't worry if this seems illogical at this stage. The main things to look at are the drbdX devices and how they each tie back to a corresponding sdaY device on either node.

 ________________________________________________________________                 ________________________________________________________________ 
| [ an-c05n01 ]                                                  |               |                                                  [ an-c05n02 ] |
|  ________       __________                                     |               |                                     __________       ________  |
| [_disk_1_]--+--[_/dev/sda_]                                    |               |                                    [_/dev/sda_]--+--[_disk_1_] |
|  ________   |    |   ___________    _______                    |               |                    _______    ___________   |    |   ________  |
| [_disk_2_]--+    +--[_/dev/sda1_]--[_/boot_]                   |               |                   [_/boot_]--[_/dev/sda1_]--+    +--[_disk_2_] |
|  ________   |    |   ___________    ________                   |               |                   ________    ___________   |    |   ________  |
| [_disk_3_]--+    +--[_/dev/sda2_]--[_<swap>_]                  |               |                  [_<swap>_]--[_/dev/sda2_]--+    +--[_disk_3_] |
|  ________   |    |   ___________    ___                        |               |                        ___    ___________   |    |   ________  |
| [_disk_4_]--+    +--[_/dev/sda3_]--[_/_]                       |               |                       [_/_]--[_/dev/sda3_]--+    |--[_disk_4_] |
|  ________   |    |   ___________                               |               |                               ___________   |    |   ________  |
| [_disk_5_]--+    +--[_/dev/sda5_]---------------------\        |               |        /---------------------[_/dev/sda5_]--+    +--[_disk_5_] |
|  ________   |    |   ___________                      |        |               |        |                      ___________   |    |   ________  |
| [_disk_6_]--/    \--[_/dev/sda6_]--------\            |        |               |        |            /--------[_/dev/sda6_]--/    \--[_disk_6_] |
|                                          |            |        |               |        |            |                                          |
|        _______________    ____________   |            |        |               |        |            |   ____________    _______________        |
|    /--[_Clustered_LVM_]--[_/dev/drbd1_]--/            |        |               |        |            \--[_/dev/drbd1_]--[_Clustered_LVM_]--\    |
|   _|__                         |                      |        |               |        |                      |                         __|_   |
|  [_PV_]                        \======================|=====\  |               |  /=====|======================/                        [_PV_]  |
|   _|_____________                                     |     |  |               |  |     |                                     _____________|_   |
|  [_an-c05n02_vg0_]                                    |     |  |               |  |     |                                    [_an-c05n02_vg0_]  |
|    |   ______________________________    ...........  |     |  |               |  |     |   _________     ______________________________   |    |
|    +--[_/dev/an-c05n02_vg0/vm03-db_0_]---|.vm03-db.|  |     |  |               |  |     |  [_vm03-db_]---[_/dev/an-c05n02_vg0/vm03-db_0_]--+    |
|    |   ______________________________    ...........  |     |  |               |  |     |   _________     ______________________________   |    |
|    \--[_/dev/an-c05n02_vg0/vm04-ms_0_]---|.vm04-ms.|  |     |  |               |  |     |  [_vm04-ms_]---[_/dev/an-c05n02_vg0/vm04-ms_0_]--/    |
|          _______________    ____________              |     |  |               |  |     |              ____________    _______________          |
|      /--[_Clustered_LVM_]--[_/dev/drbd0_]-------------/     |  |               |  |     \-------------[_/dev/drbd0_]--[_Clustered_LVM_]--\      |
|     _|__                         |                          |  |               |  |                          |                         __|_     |
|    [_PV_]                        \========================\ |  |               |  | /========================/                        [_PV_]    |
|     _|_____________                                       | |  |               |  | |                                       _____________|_     |
|    [_an-c05n01_vg0_]                                      | |  |               |  | |                                      [_an-c05n01_vg0_]    |
|      |   ___________________________    _________         | |  |               |  | |         _________    ___________________________   |      |
|      +--[_/dev/an-c05n01_vg0/shared_]--[_/shared_]        | |  |               |  | |        [_/shared_]--[_/dev/an-c05n01_vg0/shared_]--+      |
|      |   _______________________________     __________   | |  |               |  | |  ............    _______________________________   |      |
|      +--[_/dev/an-c05n01_vg0/vm01-dev_0_]---[_vm01-dev_]  | |  |               |  | |  |.vm01-dev.|---[_/dev/an-c05n01_vg0/vm01-dev_0_]--+      |
|      |   _______________________________     __________   | |  |               |  | |  ............    _______________________________   |      |
|      \--[_/dev/an-c05n01_vg0/vm02-web_0_]---[_vm02-web_]  | |  |               |  | |  |.vm02-web.|---[_/dev/an-c05n01_vg0/vm02-web_0_]--/      |
|                                                         __|_|__|   _________   |__|_|__                                                         |
|                                                        | bond1 =--| Storage |--= bond1 |                                                        |
|                                                        |______||  | Network |  ||______|                                                        |
|________________________________________________________________|  |_________|  |________________________________________________________________|

Component; Clustered LVM

With DRBD providing the raw storage for the cluster, we must next consider partitions. This is where Clustered LVM, known as CLVM, comes into play.

CLVM is ideal in that it uses DLM, the distributed lock manager, so it won't allow access to cluster members outside of corosync's closed process group, which, in turn, requires quorum.

It is ideal because it can take one or more raw devices, known as "physical volumes", or simply as PVs, and combine their raw space into one or more "volume groups", known as VGs. These volume groups then act just like a typical hard drive and can be "partitioned" into one or more "logical volumes", known as LVs. These LVs are where KVM's virtual machine guests will exist and where we will create our GFS2 clustered file system.

LVM is particularly attractive because of how flexible it is. We can easily add new physical volumes later, then grow an existing volume group to use the new space. This new space can then be given to existing logical volumes, or entirely new logical volumes can be created. This can all be done while the cluster is online, offering an upgrade path with no downtime.
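As a sketch of that upgrade path, growing the stack might look like the following. The new device name and sizes here are hypothetical:

```shell
# Hypothetical example; /dev/drbd1 is an assumed new DRBD resource.
pvcreate /dev/drbd1                              # initialize the new device as a PV
vgextend an-c05n01_vg0 /dev/drbd1                # add its space to the existing volume group
lvextend -L +20G /dev/an-c05n01_vg0/vm01-dev_0   # give 20 GiB of the new space to an existing LV
```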

Component; GFS2

With DRBD providing the cluster's raw storage space, and Clustered LVM providing the logical partitions, we can now look at the clustered file system. This is the role of the Global File System version 2, known simply as GFS2.

It works much like a standard filesystem, with user-land tools like mkfs.gfs2, fsck.gfs2 and so on. The major difference is that it and clvmd use the cluster's distributed locking mechanism, provided by the dlm_controld daemon. Once formatted, the GFS2 partition can be mounted and used by any node in the cluster's closed process group. All nodes can then safely read from and write to the data on the partition simultaneously.
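A minimal sketch of formatting and mounting such a partition follows. The cluster name passed to '-t' must match the one defined in cluster.conf; "an-cluster-05" here is an assumed example, and '-j 2' creates one journal per node for our two-node cluster:

```shell
# Sketch; the cluster name 'an-cluster-05' is an assumed example.
mkfs.gfs2 -p lock_dlm -j 2 -t an-cluster-05:shared /dev/an-c05n01_vg0/shared
mount /dev/an-c05n01_vg0/shared /shared   # any (or every) member node may mount it
```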

Note: GFS2 is only supported when run on top of Clustered LVM LVs. This is because, in certain failure states, gfs2_controld will call dmsetup to disconnect the GFS2 partition from its storage.

Component; DLM

One of the major roles of a cluster is to provide distributed locking for clustered storage and resource management.

Whenever a resource, GFS2 filesystem or clustered LVM LV needs a lock, it sends a request to dlm_controld, which runs in userspace and communicates with DLM in the kernel. If the lockspace does not yet exist, DLM will create it and then give the lock to the requester. Should a subsequent lock request come in for the same lockspace, it will be rejected. Once the application using the lock is finished with it, it will release the lock. After this, another node may request and receive a lock for the lockspace.
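When debugging lock behavior, the lockspaces DLM is managing can be inspected from the command line on a cluster member. A sketch; the lockspace name is an assumed example and output will vary with your configuration:

```shell
# Run on a cluster member; output depends on your configuration.
dlm_tool ls                # list active DLM lockspaces (e.g. clvmd, the GFS2 filesystem)
dlm_tool lockdump shared   # dump locks held in a given lockspace ('shared' is an assumed name)
```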

If a node fails, fenced will alert dlm_controld that a fence is pending and new lock requests will block. After a successful fence, fenced will alert DLM that the node is gone and any locks the victim node held are released. At this time, other nodes may request a lock on the lockspaces the lost node held and can perform recovery, like replaying a GFS2 filesystem journal, prior to resuming normal operation.

Note that DLM locks are not used for actually locking the file system. That job is still handled by plock() calls (POSIX locks).
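To see advisory file locking in action on any Linux machine, the flock(1) utility is handy. One caveat: flock(1) uses the flock() system call rather than the fcntl()-based POSIX locks that plock() calls map to, so this only illustrates the general concept of advisory locking, not GFS2's exact mechanism:

```shell
# Demonstrates advisory locking; uses flock(2), not fcntl(2)/plock(),
# so it illustrates the concept rather than GFS2's exact mechanism.
touch /tmp/demo.lock
flock -n /tmp/demo.lock -c 'echo "lock acquired"'   # -n: fail instead of blocking if already held
```

While the lock is held, a second `flock -n` on the same file exits non-zero instead of blocking.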

Component; KVM

Two of the most popular open-source virtualization platforms available in the Linux world today are Xen and KVM. The former is maintained by Citrix and the latter by Red Hat. It would be difficult to say which is "better", as they're both very good. Xen can be argued to be more mature, while KVM is the "official" solution supported by Red Hat in EL6.

We will be using the KVM hypervisor, within which our highly available virtual machine guests will reside. With KVM, the host operating system runs directly on the bare hardware and the hypervisor is a module within its kernel. Contrast this with Xen, a bare-metal hypervisor under which even the installed management operating system, dom0, is itself just another virtual machine.
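Day-to-day interaction with KVM guests is usually done through libvirt's virsh tool. A hedged sketch, reusing the guest name from the diagram above; the peer hostname in the migration URI is an assumption:

```shell
virsh list --all       # show all defined guests and their states
virsh start vm01-dev   # boot a guest
virsh migrate --live vm01-dev qemu+ssh://an-c05n02/system   # live-migrate to the peer (URI is an assumption)
```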

Node Installation

This section is going to be intentionally vague, as I don't want to influence too heavily what hardware you buy or how you install your operating systems. However, we need a baseline, a minimum system requirement of sorts. Also, I will refer fairly frequently to my setup, so I will share with you the details of what I bought. Please don't take this as an endorsement though... Every cluster will have its own needs, and you should plan and purchase for your particular needs.

In my case, my goal was a low power-consumption setup, and I knew that I would never put my cluster into production, as it's strictly a research and design cluster. As such, I could afford to be quite modest.

Minimum Requirements

This will cover two sections;

  • Node Minimum requirements
  • Infrastructure requirements

The nodes are the two separate servers that will, together, form the base of our cluster. The infrastructure covers the networking and the switched power bars, called PDUs.

Node Requirements

General;

As these nodes will host virtual machines, they will need sufficient RAM and virtualization-enabled CPUs. Most, though not all, modern processors support hardware virtualization extensions. Finally, you need sufficient network bandwidth across two independent links to support the maximum burst storage traffic, plus enough headroom to ensure that cluster traffic is never interrupted.

Network;

This tutorial will use three independent networks, each using two physical interfaces in a bonded configuration. These will route through two separate stacked, managed switches for high-availability networking. Each network will be dedicated to a given traffic type and isolated using a VLAN (configured in the switch). This requires six interfaces and, with a separate IPMI interface, consumes a staggering seven switch ports per node.
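For reference, a bonded interface on EL6 is defined with an ifcfg file like the sketch below. The mode-1 (active-backup) options shown are a common starting point for cluster use, but the device names and address are assumptions for illustration:

```shell
# /etc/sysconfig/network-scripts/ifcfg-bond0 -- sketch; names and address are examples
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 primary=eth0"
IPADDR=10.20.10.1
NETMASK=255.255.255.0
```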

Understanding that this may not be feasible, you can drop this to just two connections in a single bonded interface. If you do, you will need to configure QoS to ensure that totem multicast traffic gets the highest priority, as even a delay of less than one second can cause the cluster to break. You will also need to test sustained, heavy disk traffic to ensure that it doesn't cause problems. In particular, run storage tests from a virtual machine, then live-migrate that machine to create a "worst case" network load. If that succeeds, you are probably safe. All of this is outside this tutorial's scope, though.

Power;

In production, you will want to use servers with redundant power supplies, with each side connected to a separate power source.

Out-of-Band Management;

As we will discuss later, the ideal method of fencing a node is to use IPMI or one of the vendor-specific variants like HP's iLO, Dell's DRAC or IBM's RSA. This allows another node in the cluster to force the host node to power off, regardless of the state of the operating system. Critically, it can confirm to the caller that the node has been shut down, which allows the cluster to safely and confidently recover lost services.
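Fence agents can be exercised by hand, which is a good way to verify your out-of-band management before trusting it. A sketch using the IPMI fence agent; the address and credentials are placeholders:

```shell
# Placeholder address and credentials -- substitute your BMC's values.
fence_ipmilan -a 10.20.51.1 -l admin -p secret -o status   # query the node's power state
fence_ipmilan -a 10.20.51.1 -l admin -p secret -o reboot   # force a power cycle
```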

For reference, this tutorial was written using two Fujitsu RX200 S7 nodes. Each node was configured with;

  • Six 15,000rpm 146 GB SAS drives in a RAID 5 array with 1 GiB of BBWC.
  • Two Intel Xeon E5-2609 CPUs.
  • 32 GiB of ECC RAM.
  • Two dual-port Gbit network cards in addition to the two onboard Gbit network cards.
  • IPMI BMC with dedicated network interface.
  • Redundant 800w power supplies.

Infrastructure Requirements

Network;

You will need two separate switches in order to provide high availability. These do not need to be stacked or even managed, but you do need to consider their actual capabilities rather than their stated capacity. What I mean, in essence, is that not all gigabit equipment is equal. You will need to calculate how much bandwidth you require (in raw data throughput and in packets per second) and confirm that the switch can sustain that load. Most switches rate these two values as their switching fabric capacity, so be sure to look closely at the specifications.

Another thing to consider is whether you wish to run at an MTU higher than 1500 bytes per packet. This is generally referred to in specification sheets as "jumbo frame" support. However, many lesser vendors advertise support for jumbo frames but only support frames up to 4 KiB. Most professional networks looking to implement large MTU sizes aim for 9 KiB frames, so be sure to check the actual size of the largest supported jumbo frame before purchasing network equipment.
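Once you have jumbo-capable gear, a large MTU can be set and verified like so. The interface name and peer address are assumptions, and '-s 8972' is the 9000-byte MTU minus 28 bytes of IP and ICMP headers:

```shell
ip link set dev bond1 mtu 9000      # every device in the path must support this
ping -M do -s 8972 -c 3 10.10.51.2  # '-M do' forbids fragmentation, proving the path carries 9 KiB frames
```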

Power;

As we will discuss later, we need a backup fence device. This will be implemented using a switched power distribution unit, called a PDU, which is effectively a power bar whose outlets can be independently turned on and off over the network. This tutorial uses a pair of APC AP7900 PDUs, but many others are available. Should you choose another make or model, you must first ensure that it has a supported fence agent. Verifying this is an exercise for the reader.
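As with IPMI, the PDU fence agent can be tested by hand. A sketch for an APC unit; the address, credentials and outlet number are placeholders:

```shell
# Placeholder values -- use your PDU's address, login and the outlet feeding the node.
fence_apc -a 10.20.2.1 -l apc -p apc -n 1 -o status   # query the state of outlet 1
fence_apc -a 10.20.2.1 -l apc -p apc -n 1 -o off      # cut power to outlet 1
```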

Not strictly required, but strongly recommended is to use a pair of UPSes behind the PDUs. This way, power events do not impact the cluster in any way. The UPSes also filter and stabilize incoming power to help ensure the long term health and stability of your nodes. The monitoring application we will use can monitor UPSes compatible with the apcupsd project.

The hardware used in this tutorial is;
Any questions, feedback, advice, complaints or meanderings are welcome.