AN!Cluster Tutorial 2
Warning: This updated tutorial is not yet complete. Please do not follow this tutorial until this warning has been removed!
Note: This is the second release of the 2-Node Red Hat KVM Cluster Tutorial.
This paper has one goal;
- Create an easy to use, fully redundant platform for virtual servers.
What's New?
In the last two years, we've learned a lot about how to make an even more solid high-availability platform. We've created tools to make monitoring and management of the virtual servers and nodes trivially easy. This updated release of our tutorial brings these advances to you!
- Many refinements to the cluster stack that protect against corner cases seen over the last two years.
- Configuration naming convention changes to support the new AN!CDB dashboard.
- Addition of the AN!CM monitoring and alert system.
A Note on Terminology
In this tutorial, we will use the following terms;
- Anvil!: This is our name for the HA platform as a whole.
- Nodes: The physical hardware servers used as members in the cluster and which host the virtual servers.
- Servers: The virtual servers themselves.
- Compute Pack: This describes a pair of nodes that work together to power highly-available servers.
- Foundation Pack: This describes the switches, PDUs and UPSes used to power and connect the nodes.
- Monitor Pack: This describes the equipment used for the AN!CDB management dashboard.
Why Should I Follow This (Lengthy) Tutorial?
Following this tutorial is not a light undertaking. It is designed to teach you all the inner details of building an HA platform for VMs. When finished, you will have a detailed and deep understanding of what it takes to build a fully redundant platform for high-availability virtual machines. It is a significant investment of time, but a very worthwhile one if you want to understand high-availability.
If you want to build a VM cluster as quickly and efficiently as possible, AN! provides an installer script which automates almost all of the cluster build.
In either case, when finished, you will have the following benefits;
- Totally open source. Everything. This guide and all software used is open!
- You can host virtual servers running almost any operating system.
- The HA platform requires no access to the servers and no special software needs to be installed. Your users may well never know that they're on a virtual machine.
- Your servers will be transparently made highly-available. Most hardware failures will be totally transparent, and even the most serious failures will cause an outage of roughly 30 to 90 seconds.
- Storage is synchronously replicated, guaranteeing that the total destruction of a node will cause no more data loss than a traditional server losing power.
- Storage is replicated without the need for a SAN, reducing cost and providing total storage redundancy.
- Live-migration of virtual servers enables upgrading and node maintenance without downtime. No more weekend maintenance!
- The AN! cluster monitoring software provides total monitoring of the HA stack, from predictive hardware failure detection to simple live-migration alerts, in a single application.
Ask your local VMware or Microsoft Hyper-V sales person what they'd charge for all this. :)
High-Level Explanation of How HA Clustering Works
Note: This section is an adaptation of a post to the Linux-HA mailing list.
Before digging into the details, it might help to start with a high-level explanation of how HA clustering works.
Corosync uses the totem protocol for "heartbeat"-like monitoring of the other node's health. A token is passed around to each node; the node does some work (like acknowledging old messages and sending new ones), then passes the token on to the next node. This goes around and around all the time. Should a node not pass the token on within a short timeout period, the token is declared lost, an error count goes up and a new token is sent. If too many tokens are lost in a row, the node is declared lost.
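The loss-detection step above can be sketched as a simple counter. The threshold here is illustrative only; corosync's actual behaviour is tuned through totem parameters such as `token` and `token_retransmits_before_loss_const` in the cluster configuration.

```shell
# Sketch of the "too many tokens lost in a row" logic; values are illustrative.
RETRANSMITS_BEFORE_LOSS=4
lost_in_a_row=0

# Pretend we observed five consecutive token timeouts.
for attempt in 1 2 3 4 5; do
    lost_in_a_row=$((lost_in_a_row + 1))
    echo "token $attempt lost (consecutive losses: $lost_in_a_row)"
    if [ "$lost_in_a_row" -gt "$RETRANSMITS_BEFORE_LOSS" ]; then
        echo "node declared lost; reforming cluster"
        break
    fi
done
```

A single lost token is harmless; only a sustained run of losses pushes the counter past the threshold and triggers the reformation described next.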
Once the node is declared lost, the remaining nodes reform a new cluster. If enough nodes are left to form quorum (simple majority), then the new cluster will continue to provide services. In two-node clusters, like the ones we're building here, quorum is disabled so each node can work on its own.
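Quorum, when enabled, is a simple majority: more than half of the total votes. A quick sketch shows why it must be disabled for two nodes; a lone survivor could never reach the two votes required:

```shell
# Votes needed for quorum: a simple majority of the total votes.
quorum() { echo $(( $1 / 2 + 1 )); }

echo "5-node cluster: $(quorum 5) votes needed"   # two nodes can be lost
echo "3-node cluster: $(quorum 3) votes needed"   # one node can be lost
echo "2-node cluster: $(quorum 2) votes needed"   # a lone survivor can never win
```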
Corosync itself only cares about cluster membership and message passing. What happens after the cluster reforms is up to the cluster manager, cman, and the cluster resource manager, rgmanager.
The first thing cman does after being notified that a node was lost is initiate a fence against the lost node. This is a process where the lost node is powered off, called power fencing, or cut off from the network/storage, called fabric fencing. In either case, the idea is to make sure that the lost node is in a known state. If this is skipped, the node could recover later and try to provide cluster services, not having realized that it was removed from the cluster. This could cause problems from confusing switches to corrupting data.
When rgmanager is told that membership has changed because a node died, it looks to see what services might have been lost. Once it knows what was lost, it looks at the rules it's been given and decides what to do. These rules are defined in the cluster.conf's <rm> element. We'll go into detail on this later.
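As a preview of what those rules look like, the `<rm>` element has roughly this shape. The domain, node and service names below are placeholders, not the configuration we will actually build later; treat this as a sketch of the structure only.

```xml
<rm>
  <!-- Placeholder names; the real configuration is built later in this tutorial. -->
  <failoverdomains>
    <failoverdomain name="only_node01" restricted="1" nofailback="1">
      <failoverdomainnode name="node01.example.com" />
    </failoverdomain>
  </failoverdomains>
  <!-- A service locked to one node; recovery="restart" keeps it local. -->
  <service name="storage_node01" domain="only_node01" autostart="1" recovery="restart">
    <script file="/etc/init.d/drbd" name="drbd" />
  </service>
</rm>
```

The failover domain restricts where a service may run, and the `recovery` attribute tells rgmanager what to do when that service's node is lost.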
In two-node clusters, there is also a chance of a "split-brain". Because quorum has to be disabled, it is possible for both nodes to think the other node is dead and both try to provide the same cluster services. By using fencing, after the nodes break from one another (which could happen with a network failure, for example), neither node will offer services until one of them has fenced the other. The faster node will win and the slower node will shut down (or be isolated). The survivor can then run services safely without risking a split-brain.
Once the dead/slower node has been fenced, rgmanager then decides what to do with the services that had been running on the lost node. Generally, this means "restart the service here that had been running on the dead node". The details of this, though, are decided by you when you configure the resources in rgmanager. As we will see with each node's local storage service, the service is not recovered but instead left stopped.
The Task Ahead
Before we start, let's take a few minutes to discuss clustering and its complexities.
A Note on Patience
When someone wants to become a pilot, they can't jump into a plane and try to take off. It's not that flying is inherently hard, but it requires a foundation of understanding. Clustering is the same in this regard; there are many different pieces that have to work together just to get off the ground.
You must have patience.
Like a pilot on their first flight, seeing a cluster come to life is a fantastic experience. Don't rush it! Do your homework and you'll be on your way before you know it.
Coming back to earth:
Many technologies can be learned by creating a very simple base and then building on it. The classic "Hello, World!" script created when first learning a programming language is an example of this. Unfortunately, there is no real analogue to this in clustering. Even the most basic cluster requires several pieces be in place and working together. If you try to rush by ignoring pieces you think are not important, you will almost certainly waste time. A good example is setting aside fencing, thinking that your test cluster's data isn't important. The cluster software has no concept of "test". It treats everything as critical all the time and will shut down if anything goes wrong.
Take your time, work through these steps, and you will have the foundation cluster sooner than you realize. Clustering is fun because it is a challenge.
Technologies We Will Use
- Red Hat Enterprise Linux 6 (EL6); You can use a derivative like CentOS v6. Specifically, we're using 6.4.
- Red Hat Cluster Services "Stable" version 3. This describes the following core components:
- Corosync; Provides cluster communications using the totem protocol.
- Cluster Manager (cman); Manages the starting, stopping and general operation of the cluster.
- Resource Manager (rgmanager); Manages cluster resources and services. Handles service recovery during failures.
- Clustered Logical Volume Manager (clvm); Cluster-aware (disk) volume manager. Backs GFS2 filesystems and KVM virtual machines.
- Global File Systems version 2 (gfs2); Cluster-aware, concurrently mountable file system.
- Distributed Replicated Block Device (DRBD); Keeps shared data synchronized across cluster nodes.
- KVM; Hypervisor that controls and supports virtual machines.
- Alteeve's Niche! Cluster DashBoard and Cluster Monitor
A Note on Hardware
Another new change is that Alteeve's Niche!, after years of experimenting with various hardware vendors, has partnered with Fujitsu. We chose them because of the unparalleled quality of their equipment.
This tutorial can be used on any manufacturer's hardware, provided it meets the minimum requirements listed below. That said, we strongly recommend readers give Fujitsu's RX-line of servers a close look. We do not get a discount for this recommendation; we genuinely love the quality of their gear. The only technical argument for using Fujitsu hardware is that we do all of our cluster stack monitoring software development on Fujitsu RX200 and RX300 servers, so we can say with confidence that the AN! software components will work well on their kit.
If you use any other hardware vendor and run into any trouble, please don't hesitate to contact us. We want to make sure that our HA stack works on as many systems as possible and will be happy to help out. Of course, all Alteeve code is open source, so contributions are always welcome, too!
System Requirements
The goal of this tutorial is to help you build an HA platform with zero single points of failure. In order to do this, certain minimum technical requirements must be met.
Bare minimum requirements;
- Two servers with the following;
- CPUs with hardware-accelerated virtualization.
- Redundant power supplies
- IPMI or vendor-specific out-of-band management, like Fujitsu's iRMC, HP's iLO, Dell's iDRAC, etc.
- Six network interfaces, 1 Gbit or faster (yes, six!)
- 2 GiB of RAM and 44.5 GiB of storage for the host operating system, plus sufficient RAM and storage for your VMs.
- Two switched PDUs; APC-brand recommended but any with a supported fence agent is fine.
- Two network switches
Recommended Hardware; A Little More Detail
The previous section covers the bare-minimum system requirements for following this tutorial. If you are planning a production deployment, there are some important considerations for selecting hardware.
The Most Important Consideration - Storage
There is probably no single consideration more important than choosing the storage you will use.
In our years of building Anvil! HA platforms, we've found no single issue more important than storage latency. This is true for all virtualized environments, in fact.
The problem is this:
Multiple servers on shared storage produce highly random storage access patterns. Traditional hard drives have platters with mechanical read/write heads on the ends of arms that sweep back and forth across the disk surfaces. These platters are broken up into "tracks" and each track is itself cut up into "sectors". So when a server needs to read or write data, the hard drive has to sweep the arm over the track it wants and then wait there for the sector it wants to pass underneath.
This time taken to get the read/write head onto the track and then wait for the sector to pass underneath is called "seek latency". How long this latency actually takes depends on a few things;
- How fast are the platters rotating? The faster the platter speed, the less time it takes for a sector to pass under the read/write head.
- How fast the read/write arms can move and how far they have to travel between tracks. Highly random read/write requests can cause a lot of head travel.
- How many read/write requests are backed up in the queue. A problem that arises when the requests coming in are faster than the drive can service existing requests.
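The first factor is easy to put numbers to: on average, the head waits half a rotation for its sector, so the average rotational latency is (60,000 ms / rpm) / 2.

```shell
# Average rotational latency in milliseconds: half of one full rotation.
for rpm in 7200 10000 15000; do
    awk -v rpm="$rpm" \
        'BEGIN { printf "%5d rpm: %.2f ms average rotational latency\n", rpm, 60000 / rpm / 2 }'
done
```

This is why 15,000rpm SAS drives roughly halve the rotational component of seek latency compared to 7,200rpm SATA drives.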
When many people think about hard drives, they generally worry about maximum write speeds. For environments with many virtual servers, this is actually far less important than it might seem. Reducing latency to ensure that read/write requests don't back up is far more important. If too many requests back up in the queue, storage performance can collapse or stall out entirely.
This is particularly problematic when multiple servers try to boot at the same time. If, for example, a node with multiple servers dies, the surviving node will try to start the lost servers at nearly the same time. This causes a sudden dramatic rise in read requests and can cause all servers to hang entirely, a condition called a "boot storm".
Thankfully, this latency problem can be easily dealt with in one of three ways;
- Use solid-state drives. These have no moving parts, so there is no penalty for highly random read/write requests.
- Use fast platter drives and proper RAID controllers with write-back caching.
- Isolate each server onto dedicated platter drives.
Each of these solutions has benefits and downsides;
| | Pro | Con |
|---|---|---|
| Fast drives + write-back caching | 15,000rpm SAS drives are extremely reliable and the high rotation speeds minimize latency caused by waiting for sectors to pass under the read/write heads. Using multiple drives in RAID level 5 or level 6 breaks up reads and writes into smaller pieces, allowing requests to be serviced quickly and helping keep the read/write buffer empty. Write-caching allows RAM-like write speeds and the ability to re-order disk access to minimize head movement. | The main con is the number of disks needed to get effective performance gains from striping. AN! always uses a minimum of six disks, and many entry-level servers support a maximum of four drives. So you need to account for the number of disks you plan to use when selecting your hardware. |
| SSDs | No moving parts mean that read and write requests do not have to wait for mechanical movements, drastically reducing latency. The minimum number of drives for an SSD-based configuration is two. | Solid-state drives use NAND flash, which can only be written to a finite number of times. All drives in our Anvil! will be written to roughly the same amount, so hitting this write limit could mean that all drives in both nodes fail at nearly the same time. Avoiding this requires careful monitoring of the drives and replacing them before their write limits are hit. |
| Isolated storage | Dedicating hard drives to virtual servers avoids the highly random read/write issues found when multiple servers share the same storage. This allows for the safe use of inexpensive hard drives, and it means that dedicated hardware RAID controllers with battery-backed cache are not needed, saving a good amount of money in the hardware design. | The obvious downside to isolated storage is that you significantly limit the number of servers you can host on your Anvil!. If you only need to support one or two servers though, this should not be an issue. |
The last piece to consider is the interface of the drives used, be they SSDs or traditional HDDs. The two common interface types are SATA and SAS.
- SATA hard drives generally have a platter speed of 7,200rpm. The SATA interface has a limited instruction set and provides minimal health reporting. These are "consumer" grade devices that are far less expensive, and far less reliable, than SAS drives.
- SAS drives are generally aimed at the enterprise environment and are built to much higher quality standards. SAS HDDs have rotational speeds of up to 15,000rpm and can handle far more read/write operations per second. Enterprise SSDs using the SAS interface are also much more reliable than their consumer counterparts. The main downside to SAS drives is their cost.
In all production environments, we strongly, strongly recommend SAS-connected drives. For learning, SATA drives are fine.
RAM - Preparing For Degradation
RAM is a far simpler topic than storage, thankfully. Here, all you need to do is add up how much RAM you plan to assign to servers, add at least 2 GiB for the host, and then install that much memory in your nodes.
In production, there are two technologies you will want to consider;
- ECC (error correction code) provides the ability for RAM to recover from single-bit errors. If you are familiar with how parity works in RAID arrays, ECC in RAM is the same idea. This is often included in server-class hardware by default. It is highly recommended.
- Memory Mirroring is, continuing our storage comparison, RAID level 1 for RAM. All writes to memory go to two different chips. Should one fail, the contents of the RAM can still be read from the surviving module.
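The RAM sizing rule above is simple addition: sum what the servers are assigned, plus a reserve for the host. A sketch with illustrative numbers (three servers assigned 8, 4 and 4 GiB):

```shell
# Sum the RAM assigned to each server, plus a host reserve; numbers are illustrative.
host_reserve_gib=2
total_gib=$host_reserve_gib
for server_ram_gib in 8 4 4; do
    total_gib=$((total_gib + server_ram_gib))
done
echo "install at least ${total_gib} GiB of RAM per node"
```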
Never Over Provision!
"Over-provisioning", also called "thin provisioning", is a concept made popular by many "cloud" technologies. It is a concept that has almost no place in HA environments.
A common example is creating virtual disks of a given apparent size, but which only pull space from real storage as needed. So if you created a "thin" virtual disk that was 80 GiB large, but only 20 GiB worth of data was used, only 20 GiB from the real storage would be used.
In essence; Over-provisioning is where you allocate more resources to servers than the nodes can actually provide, banking on the hopes that most servers will not use all of the resource allocated to them. The danger here, and the reason it has no place in HA, is that if the servers collectively use more resources than the nodes can provide, someone is going to crash.
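To make the thin-disk example above concrete, the classic mechanism behind it is the sparse file. This sketch shows the apparent-versus-actual gap on any Linux filesystem that supports sparse files (the exact on-disk figure varies by filesystem):

```shell
# Create a "thin" 1 GiB disk image; no real blocks are allocated yet.
img=$(mktemp)
truncate -s 1G "$img"

apparent=$(stat -c %s "$img")         # what the guest would see: 1 GiB
actual_kib=$(du -k "$img" | cut -f1)  # what the host actually spent: ~0 KiB

echo "apparent: $apparent bytes, actual: ${actual_kib} KiB on disk"
rm -f "$img"
```

The danger is exactly this gap: the guest believes it owns the full 1 GiB, and nothing stops several guests' "apparent" sizes from summing to more than the host can ever deliver.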
CPU Cores - Possibly Acceptable Over-Provisioning
Over-provisioning of RAM and storage is never acceptable in an HA environment, as mentioned. Over-allocating CPU cores, though, can be acceptable.
When selecting which CPUs to use in your nodes, the number of cores and the speed of the cores will determine how much computational horse-power you have to allocate to your servers. The main considerations are;
- Core speed; Any given "thread" can be processed by a single CPU core at a time. The faster the given core is, the faster it can process any given request. Many applications do not support multithreading, meaning that the only way to improve performance is to use faster cores, not more cores.
- Core count; Some applications support breaking up jobs into many threads, and passing them to multiple CPU cores at the same time for simultaneous processing. This way, the application feels faster to users because each CPU has to do less work to get a job done. Another benefit of multiple cores is that if one application consumes the processing power of a single core, other cores remain available for other applications, preventing processor congestion.
In processing, each CPU "core" can handle one program "thread" at a time. Since the earliest days of multitasking, operating systems have been able to queue threads that are waiting for a CPU core to free up. So the risk of over-provisioning CPUs is restricted to performance issues only.
If you're building an Anvil! to support multiple servers and it's important that, no matter how busy the other servers are, the performance of each server can not degrade, then you need to be sure you have as many real CPU cores as you plan to assign to servers.
So for example, if you plan to have three servers and you plan to allocate each server four virtual CPU cores, you need a minimum of 13 real CPU cores (3 servers x 4 cores each, plus at least one core for the node). In this scenario, you will want to choose servers with dual 8-core CPUs, for a total of 16 available real CPU cores. You may choose to buy two 6-core CPUs, for a total of 12 real cores, but you still risk congestion. If all three servers fully utilize their four cores at the same time, the host OS will be left with no available core for its own software, which manages the HA stack.
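The worked example above, as shell arithmetic:

```shell
# 3 servers, 4 vCPUs each, plus at least 1 core reserved for the host OS.
servers=3
vcpus_per_server=4
host_cores=1

needed=$((servers * vcpus_per_server + host_cores))
echo "minimum real cores needed: $needed"
```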
In many cases, however, risking a performance loss under periods of high CPU load is acceptable. In these cases, allocating more virtual cores than you have real cores is fine. Should the load of the servers climb to a point where all real cores are under 100% utilization, then some applications will slow down as they wait for their turn in the CPU.
In the end, the decision of whether to over-provision CPU cores, and if so by how much, is up to you, the reader. Remember to balance faster cores against the number of cores. If your expected load will be short bursts of computationally intense jobs, few-but-faster cores may be the best solution.
A Note on Hyper-Threading
Intel's hyper-threading technology can make a CPU appear to the OS to have twice as many cores as it actually has. For example, a CPU listed as "4c/8t" (four cores, eight threads) will appear to the node as a simple 8-core CPU. In fact, you only have four real cores; the additional four are logical cores intended to make more efficient use of each real core.
Simply put, the idea behind this technology is to "slip in" a second thread when the CPU core would otherwise be idle. For example, if the core has to wait for memory to be fetched for the currently active thread, instead of sitting idle it will work on a second thread.
How much benefit this gives you in the real world is debatable and highly dependent on your applications. For the purposes of HA, it's recommended not to count the "HT cores" as real cores. That is to say, when calculating load, treat "4c/8t" CPUs as simple 4-core CPUs.
Six Network Interfaces, Seriously?
Yes, seriously.
Obviously, you can put everything on a single network card and your HA software will work, but doing so would be highly ill-advised.
We will go into the network configuration at length later on. For now, here's an overview;
- Each network needs two links in order to be fault-tolerant. One link will go to the first network switch and the second link will go to the second network switch. This way, the failure of a network cable, port or switch will not interrupt traffic.
- There are three main networks in an Anvil!;
- Back-Channel Network; This is used by the cluster stack and is sensitive to latency. Delaying traffic on this network can cause the nodes to "partition", breaking the cluster stack.
- Storage Network; All disk writes will travel over this network. As such, it is easy to saturate this network. Sharing this traffic with other services would mean that it's very possible to significantly impact network performance under high disk write loads. For this reason, it is isolated.
- Internet-Facing Network; This network carries traffic to and from your servers. By isolating this network, users of your servers will never experience performance loss during storage or cluster high loads. Likewise, if your users place a high load on this network, it will not impact the ability of the Anvil! to function properly. It also isolates untrusted network traffic.
So, three networks, each using two links for redundancy, means that we need six network interfaces. Further complicating things, it is strongly recommended that you use three separate dual-port network cards. Using a single network card, as we will discuss in detail later, leaves you vulnerable to losing entire networks should the controller fail.
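We will build each of these pairs later using the bonding driver's active-backup mode. As a taste of what's to come, an EL6 bonded pair is defined with ifcfg files roughly like these; the device names and IP address are placeholders, not the naming convention we'll settle on in the network configuration section.

```
# /etc/sysconfig/network-scripts/ifcfg-bond0 (illustrative)
DEVICE="bond0"
BOOTPROTO="static"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1"
IPADDR="10.20.10.1"
NETMASK="255.255.0.0"

# /etc/sysconfig/network-scripts/ifcfg-eth0 (first slave, cabled to switch 1)
DEVICE="eth0"
MASTER="bond0"
SLAVE="yes"
ONBOOT="yes"
BOOTPROTO="none"

# ifcfg-eth1 (second slave, cabled to switch 2) is identical except DEVICE="eth1".
```

Mode 1 (active-backup) keeps one slave active and fails over to the other when its link drops, which is exactly the cable/port/switch redundancy described above.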
A Note on Dedicated IPMI Interfaces
Some server manufacturers provide access to IPMI using the same physical interface as one of the on-board network cards. Usually these companies provide optional upgrades to break the IPMI connection out to a dedicated network connector.
Whenever possible, it is recommended that you go with a dedicated IPMI connection.
We've found that it is rarely, if ever, possible for a node to talk to its own IPMI interface when it shares a physical port with the operating system. This is not strictly a problem, but testing and diagnostics are certainly easier when the node can ping and query its own IPMI interface over the network.
Network Switches
The ideal switches for HA clustering are a stacked pair of managed switches. At the very least, a pair of switches that support VLANs is recommended. None of this is strictly required, but here are the reasons they're recommended;
- VLANs allow for totally isolating the BCN, SN and IFN traffic. This adds security and reduces broadcast traffic.
- Managed switches provide a unified interface for configuring both switches at the same time. This drastically simplifies complex configurations, like setting up VLANs that span the physical switches.
- Stacking provides a link between the two switches that effectively makes them work like one. Generally, the bandwidth available in the stacking cable is much higher than the bandwidth of individual ports. This provides a high-speed link for all three VLANs in one cable and allows multiple links to fail without risking performance degradation. We'll talk more about this later.
Beyond these suggested features, there are a few other things to consider when choosing switches;
| Feature | Consideration |
|---|---|
| MTU size | The default packet size on a network is 1500 bytes. If you build your VLANs in software, you need to account for the extra size needed for the VLAN header. If your switch supports "jumbo frames", there should be no problem. However, some cheap switches do not support jumbo frames, requiring you to reduce the MTU value for the interfaces on your nodes. |
| Packets per second | This is a measure of how many packets can be routed per second, and is generally a reflection of the switch's processing power and memory. Cheaper switches will not have the ability to route a high number of packets at the same time, potentially causing congestion. |
| Multicast groups | Some switches, including some Cisco hardware, don't maintain multicast groups persistently. The cluster software uses multicast for communication, so if your switch drops a multicast group, it will cause your cluster to partition. If you have a managed switch, ensure that persistent multicast groups are enabled. We'll talk more about this later. |
| Port speed and count versus internal fabric bandwidth | A switch that has, say, 48 1 Gbps ports may not be able to route 48 Gbps. This is a problem similar to the over-provisioning we discussed above. If an inexpensive 48-port switch has an internal switch fabric of only 20 Gbps, then it can handle only up to 20 saturated ports at a time. Be sure to review the internal fabric capacity and make sure it's high enough to handle all connected interfaces running at full speed. Note, of course, that only one link in a given bond will be active at a time. |
| Uplink speed | If you have a gigabit switch and you simply link the ports between the two switches, the link speed will be limited to 1 gigabit. Normally, all traffic will be kept on one switch, so this is fine in principle. If a single link fails over to the backup switch, its traffic will return to the main switch over the uplink at full speed. However, if a second link fails, both will be sharing the single gigabit uplink, so there is a risk of congestion on the link. If you can't get stacked switches, which generally have 10 Gbps stacking speeds or higher, then look for switches with dedicated 10 Gbps uplink ports and use those for uplinks. Keep in mind that using regular ports for uplinks, instead of a stacking cable, means that each uplink port will be restricted to the VLAN it's a member of (or you need to share the uplink across multiple VLANs, breaking the isolation). |
| Port trunking | If your existing network supports it, choosing a switch with port trunking provides a backup link from the foundation pack switches to the main network. This extends the network redundancy out to the rest of your network. |
There are numerous other valid considerations when choosing network switches for your Anvil!. These are the most pressing considerations, though.
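Regarding the MTU row above: an 802.1Q VLAN tag adds 4 bytes to each frame, so on switches that cannot pass oversized frames, the node's MTU must shrink to compensate:

```shell
# Standard Ethernet payload is 1500 bytes; the 802.1Q VLAN tag adds 4 bytes.
standard_mtu=1500
vlan_header=4
node_mtu=$((standard_mtu - vlan_header))
echo "reduce the node MTU to $node_mtu on switches without jumbo frame support"
```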
Why Switched PDUs?
We will discuss this in detail later on, but in short, when a node stops responding, we can not simply assume that it is dead. To do so would be to risk a "split-brain" condition which can lead to data divergence, data loss and data corruption.
To deal with this, we need a mechanism of putting a node that is in an unknown state into a known state. A process called "fencing". Many people who build HA platforms use the IPMI interface for this purpose, as will we. The idea here is that, when a node stops responding, the surviving node connects to the lost node's IPMI interface and forces the machine to power off. The IPMI BMC is, effectively, a little computer inside the main computer, so it will work regardless of what state the node itself is in.
Once the node has been confirmed to be off, the services that had been running on it can be restarted on the remaining good node, safe in knowing that the lost peer is not also hosting these services. In our case, these "services" are the shared storage and the virtual servers.
There is a problem with this though. Actually, two.
- The IPMI BMC draws its power from the same power source as the server itself. If the host node loses power entirely, IPMI goes down with the host.
- The IPMI BMC has a single network interface and it is a single device.
Thus, if we relied on IPMI-based fencing alone, we'd have a single point of failure. If the surviving node can not put a lost node into a known state, it will hang, by design. The logic being that, as bad as a hung cluster is, it's better than risking data corruption. This means that, with IPMI-based fencing alone, the loss of power to a single node would not be automatically recoverable.
That just will not do!
To make fencing redundant, we will use switched PDUs. Think of these as network-connected power bars.
Imagine now that one of the nodes blew itself up. The surviving node would try to connect to its IPMI interface and, of course, get no response. It would then log into both PDUs (one behind each side of the redundant power supplies) and cut the power going to the lost node. By doing this, we again have a way of putting a lost node into a known state.
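The escalation order described above can be sketched in shell. `fence_ipmilan` and `fence_apc` are real fence agent names, but the addresses, ports and the `run_agent` stub below are purely illustrative; the real fence configuration lives in cluster.conf, as we'll see later.

```shell
# Sketch of fence escalation: IPMI first, then both PDUs. Illustrative only.
run_agent() {
    # Stub: pretend IPMI is unreachable (the node lost all power),
    # while both PDUs answer.
    case "$1" in
        fence_ipmilan*) return 1 ;;
        *)              return 0 ;;
    esac
}

fence_node() {
    if run_agent "fence_ipmilan -a $1 -o reboot"; then
        echo "fenced via IPMI"
    elif run_agent "fence_apc -a pdu1 -n $2 -o off" &&
         run_agent "fence_apc -a pdu2 -n $2 -o off"; then
        echo "fenced via PDUs"
    else
        echo "fence failed; cluster blocks by design"
        return 1
    fi
}

result=$(fence_node 10.20.51.1 3)
echo "$result"
```

Note that both PDU calls must succeed before the node is considered fenced; with redundant power supplies, cutting only one feed proves nothing.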
So now, no matter how badly things go wrong, we can always recover!
Network Managed UPSes Are Worth It
We have found that a surprising number of issues that affect service availability are power related. A network-connected smart UPS allows you to monitor the power coming from the building mains. Thanks to this, we've been able to detect far more than simple "lost power" events: failing transformers and regulators, over- and under-voltage events... Things that, if caught ahead of time, not only avoid full power outages, but also protect the rest of your gear that isn't behind a UPS.
So strictly speaking, you don't need network managed UPSes. However, we have found them worth their weight in gold and thus strongly recommend them. We will, of course, be using them in this tutorial.
Dashboard Servers
The Anvil! will be managed by the AN!CDB cluster dashboard, running on a small dedicated server. This can be a virtual machine on a laptop or desktop, or a dedicated little machine. All that matters is that it can run RHEL or CentOS version 6 with a minimal desktop.
Normally, we set up a couple of ASUS EeeBox machines (for redundancy, of course) hanging off the back of a monitor. Users can then connect to the dashboard from a browser on any device and easily control the servers and nodes. The dashboard also provides KVM-like access to the servers on the Anvil!, allowing users to work on the servers when they can't connect over the network. For this reason, you will probably want to pair the dashboard machines with a monitor that offers a decent resolution, making it easy to see the desktops of the hosted servers.
What You Should Know Before Beginning
It is assumed that you are familiar with Linux systems administration, specifically Red Hat Enterprise Linux and its derivatives. You will need to have somewhat advanced networking experience as well. You should be comfortable working in a terminal (directly or over ssh). Familiarity with XML will help, but is not strictly required as its use here is fairly self-evident.
If you feel a little out of your depth at times, don't hesitate to set this tutorial aside. Browse over to the components you feel the need to study more, then return and continue on. Finally, and perhaps most importantly, you must have patience! If you have a manager asking you to "go live" with a cluster in a month, tell him or her that it simply won't happen. If you rush, you will skip important points and you will fail.
Patience is vastly more important than any pre-existing skill.
A Word On Complexity
Introducing the Fabimer Principle:
Clustering is not inherently hard, but it is inherently complex. Consider:
- Any given program has N bugs.
- RHCS uses; cman, corosync, dlm, fenced, rgmanager, and many more smaller apps.
- We will be adding DRBD, GFS2, clvmd, libvirtd and KVM.
- Right there, we have N^10 possible bugs. We'll call this A.
- A cluster has Y nodes.
- In our case, 2 nodes, each with 3 networks across 6 interfaces bonded into pairs.
- The network infrastructure (Switches, routers, etc). We will be using two managed switches, adding another layer of complexity.
- This gives us another Y^(2*(3*2))+2, the +2 for managed switches. We'll call this B.
- Let's add the human factor. Let's say that a person needs roughly 5 years of cluster experience to be considered proficient. For each year less than this, add a Z "oops" factor, (5-Z)^2. We'll call this C.
- So, finally, add up the complexity, using this tutorial's layout, 0-years of experience and managed switches.
- (N^10) * (Y^(2*(3*2))+2) * ((5-0)^2) == (A * B * C) == an-unknown-but-big-number.
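The back-of-the-envelope math above is easy to play with. A quick sketch in shell, using assumed, illustrative values of N=2 bugs per program and Y=2 nodes (the numbers are artificial, as noted below; only the shape of the formula matters):

```shell
# pow <base> <exponent>: integer exponentiation via a loop (portable shell)
pow() { r=1; i=0; while [ "$i" -lt "$2" ]; do r=$((r * $1)); i=$((i + 1)); done; echo "$r"; }

N=2   # assumed bugs per program (illustrative only)
Y=2   # nodes

A=$(pow "$N" 10)              # N^10 possible bugs across ~10 programs
B=$(( $(pow "$Y" 12) + 2 ))   # Y^(2*(3*2)) + 2 for the managed switches
C=$(pow $((5 - 0)) 2)         # (5-Z)^2 with zero years of experience

echo $((A * B * C))           # an-unknown-but-big-number
```

Even with these tiny inputs, the product is already in the hundred-million range, which is the sobering point being made.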
This isn't meant to scare you away, but it is meant to be a sobering statement. Obviously, those numbers are somewhat artificial, but the point remains.
Any one piece is easy to understand, thus, clustering is inherently easy. However, given the large number of variables, you must really understand all the pieces and how they work together. DO NOT think that you will have this mastered and working in a month. Certainly don't try to sell clusters as a service without a lot of internal testing.
Clustering is kind of like chess. The rules are pretty straight forward, but the complexity can take some time to master.
Overview of Components
When looking at a cluster, there is a tendency to want to dive right into the configuration file. That is not very useful in clustering.
- When you look at the configuration file, it is quite short.
Clustering isn't like most applications or technologies. Most of us learn by taking something such as a configuration file, and tweaking it to see what happens. I tried that with clustering and learned only what it was like to bang my head against the wall.
- Understanding the parts and how they work together is critical.
You will find that the discussion on the components of clustering, and how those components and concepts interact, will be much longer than the initial configuration. It is true that we could talk very briefly about the actual syntax, but it would be a disservice. Please don't rush through the next section, or worse, skip it and go right to the configuration. You will waste far more time than you will save.
- Clustering is easy, but it has a complex web of inter-connectivity. You must grasp this network if you want to be an effective cluster administrator!
Component; cman
The cman portion of the cluster is the cluster manager. In the 3.0 series used in EL6, cman acts mainly as a quorum provider; that is, it adds up the votes from the cluster members and decides if there is a simple majority. If there is, the cluster is "quorate" and is allowed to provide cluster services.
The cman service will be used to start and stop all of the components needed to make the cluster operate.
Component; corosync
Corosync is the heart of the cluster. Almost all other cluster components operate through it.
In Red Hat clusters, corosync is configured via the central cluster.conf file. It can be configured directly in corosync.conf, but given that we will be building an RHCS cluster, we will only use cluster.conf. That said, almost all corosync.conf options are available in cluster.conf. This is important to note as you will see references to both configuration files when searching the Internet.
Corosync sends messages using multicast messaging by default. Recently, unicast support has been added, but due to network latency, it is only recommended for use with small clusters of two to four nodes. We will be using multicast in this tutorial.
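To give a sense of what the central configuration file looks like, here is a minimal cluster.conf skeleton. This is only a sketch; the fence device, fence method and resource sections that a real cluster requires are omitted and will be built up later in the tutorial, and the node names simply follow this tutorial's convention:

```xml
<?xml version="1.0"?>
<cluster name="an-c05" config_version="1">
	<!-- two_node="1" requires expected_votes="1"; this is the
	     two-node quorum exception discussed below -->
	<cman expected_votes="1" two_node="1" />
	<clusternodes>
		<clusternode name="an-c05n01.alteeve.ca" nodeid="1" />
		<clusternode name="an-c05n02.alteeve.ca" nodeid="2" />
	</clusternodes>
</cluster>
```

Note how short it is; this brevity is exactly why the concepts, rather than the syntax, are the focus of the next sections.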
A Little History
There were significant changes between the old RHCS version 2 and version 3, available on EL6, which we are using.
In the RHCS version 2, there was a component called openais which provided totem. The OpenAIS project was designed to be the heart of the cluster and was based around the Service Availability Forum's Application Interface Specification. AIS is an open API designed to provide inter-operable high availability services.
In 2008, it was decided that the AIS specification was overkill for most clustered applications being developed in the open source community. At that point, OpenAIS was split into two projects: Corosync and OpenAIS. The former, Corosync, provides totem, cluster membership, messaging, and basic APIs for use by clustered applications, while the OpenAIS project became an optional add-on to corosync for users who want the full AIS API.
You will see a lot of references to OpenAIS while searching the web for information on clustering. Understanding its evolution will hopefully help you avoid confusion.
The Future of Corosync
In EL6, corosync is version 1.4. Upstream, however, it has passed version 2. One of the major changes in version 2+ is that corosync becomes a quorum provider, helping to remove the need for cman. If you experiment with clustering on Fedora, for example, you will find that cman is gone entirely.
Concept; quorum
Quorum is defined as the minimum set of hosts required in order to provide clustered services and is used to prevent split-brain situations.
The quorum algorithm used by the RHCS cluster is called "simple majority quorum", which means that more than half of the hosts must be online and communicating in order to provide service. While simple majority quorum is a very common quorum algorithm, other quorum algorithms exist (grid quorum, YKD Dynamic Linear Voting, etc.).
The idea behind quorum is that, when a cluster splits into two or more partitions, whichever group of machines has quorum can safely start clustered services, knowing that no lost nodes will try to do the same.
Take this scenario;
- You have a cluster of four nodes, each with one vote.
- The cluster's expected_votes is 4. A clear majority, in this case, is 3, because floor(4/2)+1 = 3.
- Now imagine that there is a failure in the network equipment and one of the nodes disconnects from the rest of the cluster.
- You now have two partitions; One partition contains three machines and the other partition has one.
- The three machines will have quorum, and the other machine will lose quorum.
- The partition with quorum will reconfigure and continue to provide cluster services.
- The partition without quorum will withdraw from the cluster and shut down all cluster services.
When the cluster reconfigures, the partition with quorum will fence the node(s) in the partition without quorum. Once the fence has been confirmed successful, the partition with quorum will begin accessing clustered resources, like shared filesystems.
This also helps explain why an even 50% is not enough to have quorum, a common question for people new to clustering. Using the above scenario, imagine if the split were 2 and 2 nodes. Because neither partition can be sure what the other would do, neither can safely proceed. If we allowed an even 50% to have quorum, both partitions might try to take over the clustered services and disaster would soon follow.
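The simple-majority arithmetic above is trivial to express. A small sketch, using the floor(votes/2)+1 rule from the scenario:

```shell
# quorum_needed <expected_votes>: votes required for simple majority quorum
quorum_needed() { echo $(( $1 / 2 + 1 )); }

quorum_needed 4   # four 1-vote nodes: 3 votes needed
quorum_needed 2   # a two-node cluster: 2 needed, so any failure breaks quorum
```

The second line shows, numerically, why two-node clusters need the special exception described next: with only two votes, a simple majority can never survive a failure.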
There is one, and only one, exception to this rule.
In the case of a two node cluster, as we will be building here, any failure results in a 50/50 split. If we enforced quorum in a two-node cluster, there would never be high availability, because any failure would cause both nodes to withdraw. The risk with this exception is that we now place the entire safety of the cluster on fencing, a concept we will cover shortly. Fencing is a second line of defense and something we are loath to rely on alone.
Even in a two-node cluster though, proper quorum can be maintained by using a quorum disk, called a qdisk. Unfortunately, qdisk on a DRBD resource comes with its own problems, so we will not be able to use it here.
Concept; Virtual Synchrony
Many cluster operations, like distributed locking and so on, have to occur in the same order across all nodes. This concept is called "virtual synchrony".
This is provided by corosync using "closed process groups", CPG. A closed process group is simply a private group of processes in a cluster. Within this closed group, all messages between members are ordered. Delivery, however, is not guaranteed. If a member misses messages, it is up to the member's application to decide what action to take.
Let's look at two scenarios showing how locks are handled using CPG;
- The cluster starts up cleanly with two members.
- Both members are able to start service:foo.
- Both want to start it, but need a lock from DLM to do so.
- The an-c05n01 member has its totem token, and sends its request for the lock.
- DLM issues a lock for that service to an-c05n01.
- The an-c05n02 member requests a lock for the same service.
- DLM rejects the lock request.
- The an-c05n01 member successfully starts service:foo and announces this to the CPG members.
- The an-c05n02 sees that service:foo is now running on an-c05n01 and no longer tries to start the service.
- The two members want to write to a common area of the /shared GFS2 partition.
- The an-c05n02 member sends a request for a DLM lock against the filesystem and gets it.
- The an-c05n01 member sends a request for the same lock, but DLM sees that a lock is pending and rejects the request.
- The an-c05n02 member finishes altering the file system, announces the changes over CPG and releases the lock.
- The an-c05n01 member updates its view of the filesystem, requests a lock, receives it and proceeds to update the filesystem.
- It completes the changes, announces them over CPG and releases the lock.
Messages can only be sent to the members of the CPG while the node has a totem token from corosync.
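The "first requester wins, later requesters are rejected until release" behaviour in the scenarios above can be mimicked with an ordinary lock directory. This is only an analogy, not the DLM API; mkdir is atomic, which is the one property we need, and the lockspace path is a made-up name:

```shell
# Toy analogy of a DLM lockspace: mkdir either creates the directory
# (lock granted) or fails because it exists (lock rejected).
LOCKSPACE="${TMPDIR:-/tmp}/lockspace-foo.$$"   # hypothetical lockspace name

request_lock() { mkdir "$LOCKSPACE" 2>/dev/null && echo granted || echo rejected; }
release_lock() { rmdir "$LOCKSPACE" 2>/dev/null; }

request_lock   # an-c05n01 asks first  -> granted
request_lock   # an-c05n02 asks while held -> rejected
release_lock   # an-c05n01 finishes and releases
request_lock   # the next request now succeeds -> granted
release_lock
```

The real DLM adds lockspaces, lock modes and cluster-wide recovery on top, but the grant/reject/release cycle is the same shape.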
Concept; Fencing
Warning: DO NOT BUILD A CLUSTER WITHOUT PROPER, WORKING AND TESTED FENCING. |
Fencing is an absolutely critical part of clustering. Without fully working fence devices, your cluster will fail.
Sorry, I promise that this will be the only time that I speak so strongly. Fencing really is critical, and explaining the need for fencing is nearly a weekly event.
So then, let's discuss fencing.
When a node stops responding, an internal timeout and counter start ticking away. During this time, no DLM locks are allowed to be issued. Anything using DLM, including rgmanager, clvmd and gfs2, is effectively hung. The hung node is detected using a totem token timeout. That is, if a token is not received from a node within a period of time, it is considered lost and a new token is sent. After a certain number of lost tokens, the cluster declares the node dead. The remaining nodes reconfigure into a new cluster and, if they have quorum (or if quorum is ignored), a fence call against the silent node is made.
The fence daemon will look at the cluster configuration and get the fence devices configured for the dead node. Then, one at a time and in the order that they appear in the configuration, the fence daemon will call those fence devices, via their fence agents, passing to the fence agent any configured arguments like username, password, port number and so on. If the first fence agent returns a failure, the next fence agent will be called. If the second fails, the third will be called, then the fourth and so on. Once the last (or perhaps only) fence device fails, the fence daemon will retry, starting back at the top of the list. It will do this indefinitely until one of the fence devices succeeds.
Here's the flow, in point form:
- The totem token moves around the cluster members. As each member gets the token, it sends sequenced messages to the CPG members.
- The token is passed from one node to the next, in order and continuously during normal operation.
- Suddenly, one node stops responding.
- A timeout starts (~238ms by default), and each time the timeout is hit, an error counter increments and a replacement token is created.
- The silent node responds before the failure counter reaches the limit.
- The failure counter is reset to 0.
- The cluster operates normally again.
- Again, one node stops responding.
- Again, the timeout begins. As each totem token times out, a new packet is sent and the error count increments.
- The error count exceeds the limit (4 errors is the default); roughly one second has passed (238ms * 4, plus some overhead).
- The node is declared dead.
- The cluster checks which members it still has, and if that provides enough votes for quorum.
- If there are too few votes for quorum, the cluster software freezes and the node(s) withdraw from the cluster.
- If there are enough votes for quorum, the silent node is declared dead.
- corosync calls fenced, telling it to fence the node.
- The fenced daemon notifies DLM and locks are blocked.
- Which fence device(s) to use, that is, what fence_agent to call and what arguments to pass, is gathered.
- For each configured fence device:
- The agent is called and fenced waits for the fence_agent to exit.
- The fence_agent's exit code is examined. If it's a success, recovery starts. If it failed, the next configured fence agent is called.
- If all (or the only) configured fence devices fail, fenced will start over.
- fenced will wait and loop forever until a fence agent succeeds. During this time, the cluster is effectively hung.
- Once a fence_agent succeeds, fenced notifies DLM and lost locks are recovered.
- GFS2 partitions recover using their journal.
- Lost cluster resources are recovered as per rgmanager's configuration (including file system recovery as needed).
- Normal cluster operation is restored, minus the lost node.
This skipped a few key things, but the general flow of logic should be there.
This is why fencing is so important. Without a properly configured and tested fence device or devices, the cluster will never successfully fence and the cluster will remain hung until a human can intervene.
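The retry-forever loop described above can be sketched in shell using stand-in fence agents. This is a mock of the logic only, not real fence_agent invocations; fence_fail always fails, and fence_ok succeeds on the second pass, standing in for, say, a PDU that recovers after a transient fault:

```shell
# Mock agents: real ones would be fence_ipmilan, fence_apc, etc.
attempt=0
fence_fail() { return 1; }                          # always fails
fence_ok()   { attempt=$((attempt + 1)); [ "$attempt" -ge 2 ]; }  # succeeds on 2nd pass

fence_node() {
	loops=0
	while :; do                                  # loop forever, as fenced does
		loops=$((loops + 1))
		for agent in fence_fail fence_ok; do # ordered as in cluster.conf
			if "$agent"; then
				echo "fenced after $loops pass(es)"
				return 0
			fi
		done
	done
}
fence_node   # prints "fenced after 2 pass(es)"
```

The important property is visible in the structure: there is no exit path until some agent succeeds, which is why the cluster stays hung when no fence device can ever work.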
Is "Fencing" the same as STONITH?
Yes.
In the old days, there were two distinct open-source HA clustering stacks. The Linux-HA's project used the term "STONITH", an acronym for "Shoot The Other Node In The Head", for fencing. Red Hat's cluster stack used the term "fencing" for the same concept.
We prefer the term "fencing" because the fundamental goal is to put the target node into a state where it cannot affect cluster resources or provide clustered services. This can be accomplished by powering it off, called "power fencing", or by disconnecting it from SAN storage and/or the network, a process called "fabric fencing".
The term "STONITH", based on its acronym, implies power fencing. This is not a big deal, but it is the reason this tutorial sticks with the term "fencing".
Component; totem
The totem protocol defines message passing within the cluster and it is used by corosync. A token is passed around all the nodes in the cluster, and nodes can only send messages while they have the token. A node will keep its messages in memory until it gets the token back with no "not ack" messages. This way, if a node missed a message, it can request it be resent when it gets its token. If a node isn't up, it will simply miss the messages.
The totem protocol supports something called 'rrp', Redundant Ring Protocol. Through rrp, you can add a second backup ring on a separate network to take over in the event of a failure in the first ring. In RHCS, these rings are known as "ring 0" and "ring 1". The RRP is being re-introduced in RHCS version 3. Its use is experimental and should only be used with plenty of testing.
Component; rgmanager
When the cluster membership changes, corosync tells rgmanager that it needs to recheck its services. It will examine what changed and then start, stop, migrate or recover cluster resources as needed.
Within rgmanager, one or more resources are brought together as a service. This service is then optionally assigned to a failover domain, a subset of nodes that can have preferential ordering.
The rgmanager daemon runs separately from the cluster manager, cman. This means that, to fully start the cluster, we need to start both cman and then rgmanager.
What about Pacemaker?
Pacemaker is also a resource manager, like rgmanager. You cannot use both in the same cluster.
Back prior to 2008, there were two distinct open-source cluster projects;
- Red Hat's Cluster Service
- Linux-HA's Heartbeat
Pacemaker was born out of the Linux-HA project as an advanced resource manager that could use either heartbeat or openais for cluster membership and communication. Unlike RHCS and heartbeat, its sole focus was resource management.
In 2008, plans were made to begin the slow process of merging the two independent stacks into one. As mentioned in the corosync overview, it replaced openais and became the default cluster membership and communication layer for both RHCS and Pacemaker. Development of heartbeat was ended, though Linbit continues to maintain the heartbeat code to this day.
The fence and resource agents, software that acts as a glue between the cluster and the devices and resource they manage, were merged next. You can now use the same set of agents on both pacemaker and RHCS.
Red Hat introduced pacemaker as "Tech Preview" in RHEL 6.0. It has been available beside RHCS ever since, though full support is not yet offered.
Red Hat has a strict policy of not saying what will happen in the future. That said, the speculation is that Pacemaker will become supported soon and will replace rgmanager entirely in RHEL 7, given that cman and rgmanager no longer exist upstream in Fedora.
So why don't we use pacemaker here?
We believe that, no matter how promising software looks, stability is king. Pacemaker on other distributions has been stable and supported for a long time. However, on RHEL, it's a recent addition and the developers have been doing a tremendous amount of work on pacemaker and associated tools. For this reason, we feel that on RHEL 6, pacemaker is too much of a moving target at this time. That said, we do intend to switch to pacemaker some time in the next year or two, depending on how the Red Hat stack evolves.
Component; qdisk
Note: qdisk does not work reliably on a DRBD resource, so we will not be using it in this tutorial. |
A quorum disk, known as a qdisk, is a small partition on SAN storage used to enhance quorum. It generally carries enough votes to allow even a single node to take quorum during a cluster partition. It does this by using configured heuristics, that is, custom tests, to decide which node or partition is best suited to provide clustered services during a cluster reconfiguration. These heuristics can be simple, like testing which partition has access to a given router, or they can be as complex as the administrator wishes, using custom scripts.
Though we won't be using it here, it is well worth knowing about when you move to a cluster with SAN storage.
Component; DRBD
DRBD, the Distributed Replicated Block Device, is a technology that takes raw storage from two nodes and keeps it synchronized in real time. It is sometimes described as "network RAID level 1", and that is conceptually accurate. In this tutorial's cluster, DRBD will be used to provide the back-end storage as a cost-effective alternative to a traditional SAN device.
DRBD is, fundamentally, a raw block device. If you've ever used mdadm to create a software RAID array, then you will be familiar with this.
Think of it this way;
With traditional software raid, you would take;
- /dev/sda5 + /dev/sdb5 -> /dev/md0
With DRBD, you have this;
- node1:/dev/sda5 + node2:/dev/sda5 -> both:/dev/drbd0
In both cases, as soon as you create the new md0 or drbd0 device, you pretend like the member devices no longer exist. You format a filesystem onto /dev/md0, use /dev/drbd0 as an LVM physical volume, and so on.
The main difference with DRBD is that /dev/drbd0 will always be the same on both nodes. If you write something to it on node 1, it's instantly available on node 2, and vice versa. Of course, this means that whatever you put on top of DRBD has to be "cluster aware". That is to say, the program or file system using the new /dev/drbd0 device has to understand that the contents of the disk might be changed by another node.
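As a rough sketch, a DRBD resource definition tying the two nodes' partitions together might look like the fragment below. The backing disk, device name and Storage Network addresses are assumptions following this tutorial's layout (the 10.10.50.x addresses match the SN in the network map, and port 7788 matches the port table above); the real resource files will be built later:

```text
# /etc/drbd.d/r0.res -- minimal sketch only; values are illustrative
resource r0 {
	device    /dev/drbd0;      # the new shared block device
	disk      /dev/sda5;       # backing partition on each node
	meta-disk internal;
	on an-c05n01.alteeve.ca {
		address 10.10.50.1:7788;
	}
	on an-c05n02.alteeve.ca {
		address 10.10.50.2:7788;
	}
}
```

Once this resource is up and synchronized, both nodes treat /dev/drbd0 as the disk and, as noted above, pretend /dev/sda5 no longer exists.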
Component; Clustered LVM
With DRBD providing the raw storage for the cluster, we must next consider partitions. This is where Clustered LVM, known as CLVM, comes into play.
CLVM is made cluster-safe through DLM, the distributed lock manager. It won't allow access to cluster members outside of corosync's closed process group, which, in turn, requires quorum.
It is ideal because it can take one or more raw devices, known as "physical volumes", or simply PVs, and combine their raw space into one or more "volume groups", known as VGs. These volume groups then act just like a typical hard drive and can be "partitioned" into one or more "logical volumes", known as LVs. These LVs are where KVM's virtual machine guests will exist and where we will create our GFS2 clustered file system.
LVM is particularly attractive because of how flexible it is. We can easily add new physical volumes later, and then grow an existing volume group to use the new space. This new space can then be given to existing logical volumes, or entirely new logical volumes can be created. This can all be done while the cluster is online, offering an upgrade path with no downtime.
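The key change needed to make LVM cluster-aware is its locking type. A sketch of the relevant /etc/lvm/lvm.conf settings (assuming stock EL6 defaults elsewhere; the exact file will be configured later in the tutorial):

```text
# /etc/lvm/lvm.conf fragment -- switch LVM to clustered (DLM-based) locking
locking_type = 3               # 3 = built-in clustered locking via clvmd
fallback_to_local_locking = 0  # never silently fall back to local locks
```

With locking_type = 3, any LVM operation must obtain a DLM lock first, which is how two nodes are prevented from changing the same volume group metadata at the same time.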
Component; GFS2
With DRBD providing the cluster's raw storage space, and Clustered LVM providing the logical partitions, we can now look at the clustered file system. This is the role of the Global File System version 2, known simply as GFS2.
It works much like a standard filesystem, with user-land tools like mkfs.gfs2, fsck.gfs2 and so on. The major difference is that it and clvmd use the cluster's distributed locking mechanism, provided by the dlm_controld daemon. Once formatted, the GFS2 partition can be mounted and used by any node in the cluster's closed process group. All nodes can then safely read from and write to the data on the partition simultaneously.
Note: GFS2 is only supported when run on top of Clustered LVM LVs. This is because, in certain failure states, gfs2_controld will call dmsetup to disconnect the GFS2 partition from its storage. |
Component; DLM
One of the major roles of a cluster is to provide distributed locking for clustered storage and resource management.
Whenever a resource, GFS2 filesystem or clustered LVM LV needs a lock, it sends a request to dlm_controld, which runs in userspace. This communicates with DLM in the kernel. If the lockspace does not yet exist, DLM will create it and then give the lock to the requester. Should a subsequent lock request come in for the same lockspace, it will be rejected. Once the application using the lock is finished with it, it will release the lock. After this, another node may request and receive a lock for the lockspace.
If a node fails, fenced will alert dlm_controld that a fence is pending and new lock requests will block. After a successful fence, fenced will alert DLM that the node is gone and any locks the victim node held are released. At this time, other nodes may request a lock on the lockspaces the lost node held and can perform recovery, like replaying a GFS2 filesystem journal, prior to resuming normal operation.
Note that DLM locks are not used for actually locking the file system. That job is still handled by plock() calls (POSIX locks).
Component; KVM
Two of the most popular open-source virtualization platforms available in the Linux world today are Xen and KVM. The former is maintained by Citrix and the latter by Red Hat. It would be difficult to say which is "better", as they're both very good. Xen can be argued to be more mature, while KVM is the "official" solution supported by Red Hat in EL6.
We will be using the KVM hypervisor, within which our highly-available virtual machine guests will reside. With KVM, the hypervisor is part of the Linux kernel itself, so the host operating system runs directly on the bare hardware. Contrast this with Xen, where the hypervisor runs on the bare metal and even the installed host OS (dom0) is itself just another virtual machine.
Node Installation
This section is going to be intentionally vague, as I don't want to influence too heavily what hardware you buy or how you install your operating systems. However, we need a baseline, a minimum system requirement of sorts. Also, I will refer fairly frequently to my setup, so I will share with you the details of what I bought. Please don't take this as an endorsement though... Every cluster will have its own needs, and you should plan and purchase for your particular needs.
In my case, my goal was to have a low-power consumption setup and I knew that I would never put my cluster into production as it's strictly a research and design cluster. As such, I can afford to be quite modest.
Node Host Names
Before we begin, we need to decide what naming convention and IP ranges to use for our nodes and their networks.
The IP addresses and subnets you decide to use are completely up to you. The host names though need to follow a certain standard, if you wish to use the AN!CDB dashboard, as we will do here. Specifically, the node names on your nodes must end in n01 for node #1 and n02 for node #2. The reason for this will be discussed later.
The node host name convention that we've created is this:
- xx-cYYn0{1,2}
- xx is a two or three letter prefix used to denote the company, group or person who owns the Anvil!
- cYY is a simple zero-padded sequence number.
- n0{1,2} indicates the node in the cluster.
In this tutorial, the Anvil! is owned and operated by "Alteeve's Niche!", so the prefix "an" is used. This is our fifth cluster, so the cluster name is an-cluster-05 and the host name's cluster number is c05. Thus, node #1 is named an-c05n01 and node #2 is named an-c05n02.
As we have three distinct networks, we have three network-specific suffixes we apply to these host names which we will map to subnets in /etc/hosts later.
- <hostname>.bcn; Back-Channel Network host name.
- <hostname>.sn; Storage Network hostname.
- <hostname>.ifn; Internet-Facing Network host name.
Again, what you use is entirely up to you. Just remember that the node's host name must end in n01 and n02 for AN!CDB to work.
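Putting the convention together, the /etc/hosts entries for the two nodes might look like the sketch below. The SN and IFN addresses follow the network map later in this tutorial; the BCN subnet shown here is an assumption for illustration, and the real addressing will be set out when we configure the networks:

```text
# /etc/hosts fragment (sketch; BCN subnet is an assumed example)
10.20.50.1   an-c05n01.bcn     # Back-Channel Network
10.10.50.1   an-c05n01.sn      # Storage Network
10.255.50.1  an-c05n01.ifn     # Internet-Facing Network
10.20.50.2   an-c05n02.bcn
10.10.50.2   an-c05n02.sn
10.255.50.2  an-c05n02.ifn
```

Having one suffix per network makes it unambiguous, in configuration files and at the command line, which interface a given connection will travel over.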
Foundation Pack Host Names
The foundation pack devices, switches, PDUs and UPSes, can support multiple Anvil! platforms. Likewise, the dashboard servers support multiple Anvil!s as well. For this reason, the cYY portion of the host name does not make sense when choosing host names for these devices.
As always, you are free to choose host names that make sense to you. For this tutorial, the following host names are used;
Device | Host name | Examples | Note |
---|---|---|---|
Network Switches | xx-sYY | | The xx prefix is the owner's prefix and YY is a simple sequence number. |
Switched PDUs | xx-pYY | | The xx prefix is the owner's prefix and YY is a simple sequence number. |
Network Managed UPSes | xx-uYY | | The xx prefix is the owner's prefix and YY is a simple sequence number. |
Dashboard Servers | xx-mYY | | The xx prefix is the owner's prefix and YY is a simple sequence number. The m was chosen for historical reasons; the dashboards used to be called "monitoring servers" and, for consistency with existing dashboards, m has remained. Note also that the dashboards will connect to both the BCN and IFN, so like the nodes, host names with the .bcn and .ifn suffixes will be used. |
OS Installation
Warning: EL6.1 shipped with a version of corosync that had a token retransmit bug. On slower systems, there would be a form of race condition which would cause totem tokens to be retransmitted and cause significant performance problems. This has been resolved in EL6.2, so please be sure to upgrade. |
Beyond being based on RHEL 6, there are no requirements for how the operating system is installed. This tutorial is written using "minimal" installs, and as such, installation instructions will be provided that will install all needed packages if they aren't already installed on your nodes.
A few notes about the installation used for this tutorial;
- RHCS stable 3 supports selinux, but it is disabled in this tutorial.
- Both iptables and ip6tables firewalls are disabled.
Obviously, this significantly reduces the security of your nodes. For learning, which is the goal here, this helps keep the focus on clustering and simplifies debugging when things go wrong. In production clusters though, these steps are ill advised. It is strongly suggested that you first enable the firewall and then, when that is working, enable selinux. Leaving selinux for last is intentional, as it generally takes the most work to get right.
Network Security
When building production clusters, you will want to consider two options with regard to network security.
First, the interfaces connected to an untrusted network, like the Internet, should not have an IP address, though the interfaces themselves will need to be up so that virtual machines can route through them to the outside world. Alternatively, anything inbound from the virtual machines or inbound from the untrusted network should be DROPed by the firewall.
Second, if you can not run the cluster communications or storage traffic on dedicated network connections over isolated subnets, you will need to configure the firewall to block everything except the ports needed by storage and cluster traffic. The default ports are below.
Component | Protocol | Port | Note |
---|---|---|---|
dlm | TCP | 21064 | |
drbd | TCP | 7788+ | Each DRBD resource will use an additional port, generally counting up (ie: r0 will use 7788, r1 will use 7789, r2 will use 7790 and so on). |
luci | TCP | 8084 | Optional web-based configuration tool, not used in this tutorial. |
modclusterd | TCP | 16851 | |
ricci | TCP | 11111 | Remote configuration agent, used when distributing an updated cluster.conf to the other nodes. |
totem | UDP/multicast | 5404, 5405 | Uses a multicast group for cluster communications |
Note: As of EL6.2, you can now use unicast for totem communication instead of multicast. This is not advised, and should only be used for clusters of two or three nodes on networks where unresolvable multicast issues exist. If using gfs2, as we do here, using unicast for totem is strongly discouraged. |
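If you do need to firewall this traffic, the rules might look like the fragment below, written in /etc/sysconfig/iptables format. This is a sketch only: it covers the ports from the table above, assumes three DRBD resources (ports 7788-7790), and does not show the IGMP traffic that multicast totem additionally requires:

```text
# /etc/sysconfig/iptables fragment (sketch) -- default cluster ports
-A INPUT -p udp -m state --state NEW -m multiport --dports 5404,5405 -j ACCEPT  # totem
-A INPUT -p tcp -m state --state NEW --dport 21064 -j ACCEPT                    # dlm
-A INPUT -p tcp -m state --state NEW --dport 7788:7790 -j ACCEPT                # drbd r0-r2
-A INPUT -p tcp -m state --state NEW --dport 16851 -j ACCEPT                    # modclusterd
-A INPUT -p tcp -m state --state NEW --dport 11111 -j ACCEPT                    # ricci
```

Restricting these rules to the BCN and SN subnets, rather than allowing them on all interfaces, is the safer approach on a production cluster.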
Network
Before we begin, let's take a look at a block diagram of what we're going to build. This will help when trying to see what we'll be talking about.
A Map!
Nodes \_/
____________________________________________________________________________ _____|____ ____________________________________________________________________________
| an-c05n01.alteeve.ca | /--------{_Internet_}---------\ | an-c05n02.alteeve.ca |
| Network: | | | | Network: |
| _________________ _____________________| | _________________________ | |_____________________ _________________ |
| Servers: | vbr2 |---| bond2 | | | an-s01 Switch 1 | | | bond2 |---| vbr2 | Servers: |
| _______________________ | 10.255.50.1 | | ____________________| | |____ Internet-Facing ____| | |____________________ | | 10.255.50.2 | ......................... |
| | [ vm01-win2008 ] | |_________________| || eth2 =----=_01_] Network [_02_=----= eth2 || |_________________| : [ vm01-win2008 ] : |
| | ____________________| | : | | : : | | || 00:1B:21:81:C3:34 || | |____________________[_24_=-/ || 00:1B:21:81:C2:EA || : : | | : : | : :____________________ : |
| | | NIC 1 =----/ : | | : : | | ||___________________|| | | an-s02 Switch 2 | ||___________________|| : : | | : : | :----= NIC 1 | : |
| | | 10.255.1.1 || : | | : : | | | ____________________| | |____ ____| |____________________ | : : | | : : | :| 10.255.1.1 | : |
| | | ..:..:..:..:..:.. || : | | : : | | || eth5 =----=_01_] VLAN ID 101 [_02_=----= eth5 || : : | | : : | :| ..:..:..:..:..:.. | : |
| | |___________________|| : | | : : | | || A0:36:9F:02:E0:05 || | |____________________[_24_=-\ || A0:36:9F:07:D6:2F || : : | | : : | :|___________________| : |
| | ____ | : | | : : | | ||___________________|| | | ||___________________|| : : | | : : | : ____ : |
| /--=--[_c:_] | : | | : : | | |_____________________| \-----------------------------/ |_____________________| : : | | : : | : [_c:_]--=--\ |
| | |_______________________| : | | : : | | _____________________| |_____________________ : : | | : : | :.......................: | |
| | : | | : : | | | bond1 | _________________________ | bond1 | : : | | : : | | |
| | ......................... : | | : : | | | 10.10.50.1 | | an-s01 Switch 1 | | 10.10.50.2 | : : | | : : | _______________________ | |
| | : [ vm02-win2012 ] : : | | : : | | | ____________________| |____ Storage ____| |____________________ | : : | | : : | | [ vm02-win2012 ] | | |
| | : ____________________: : | | : : | | || eth1 =----=_09_] Network [_10_=----= eth1 || : : | | : : | |____________________ | | |
| | : | NIC 1 =---: | | : : | | || 00:19:99:9C:9B:9F || |_________________________| || 00:19:99:9C:A0:6D || : : | | : : \---= NIC 1 | | | |
| | : | 10.255.1.2 |: | | : : | | ||___________________|| | an-s02 Switch 2 | ||___________________|| : : | | : : || 10.255.1.2 | | | |
| | : | ..:..:..:..:..:.. |: | | : : | | | ____________________| |____ ____| |____________________ | : : | | : : || ..:..:..:..:..:.. | | | |
| | : |___________________|: | | : : | | || eth4 =----=_09_] VLAN ID 100 [_10_=----= eth4 || : : | | : : ||___________________| | | |
| | : ____ : | | : : | | || A0:36:9F:02:E0:04 || |_________________________| || A0:36:9F:07:D6:2E || : : | | : : | ____ | | |
| | /--=--[_c:_] : | | : : | | ||___________________|| ||___________________|| : : | | : : | [_c:_]--=--\ | |
| | | :.......................: | | : : | | /--|_____________________| |_____________________|--\ : : | | : : |_______________________| | | |
| | | | | : : | | | _____________________| |_____________________ | : : | | : : | | |
| | | _______________________ | | : : | | | | bond0 | _________________________ | bond0 | | : : | | : : ......................... | | |
| | | | [ vm03-win7 ] | | | : : | | | | 10.20.50.1 | | an-s01 Switch 1 | | 10.20.50.2 | | : : | | : : : [ vm02-win2012 ] : | | |
| | | | ____________________| | | : : | | | | ____________________| |____ Back-Channel ____| |____________________ | | : : | | : : :____________________ : | | |
| | | | | NIC 1 =-----/ | : : | | | || eth0 =----=_13_] Network [_14_=----= eth0 || | : : | | : :-----= NIC 1 | : | | |
| | | | | 10.255.1.3 || | : : | | | || 00:19:99:9C:9B:9E || |_________________________| || 00:19:99:9C:A0:6C || | : : | | : :| 10.255.1.3 | : | | |
| | | | | ..:..:..:..:..:.. || | : : | | | ||___________________|| | an-s02 Switch 2 | ||___________________|| | : : | | : :| ..:..:..:..:..:.. | : | | |
| | | | |___________________|| | : : | | | || eth3 =----=_13_] VLAN ID 1 [_14_=----= eth3 || | : : | | : :|___________________| : | | |
| | | | ____ | | : : | | | || 00:1B:21:81:C3:35 || |_________________________| || 00:1B:21:81:C2:EB || | : : | | : : ____ : | | |
| +--|-=--[_c:_] | | : : | | | ||___________________|| ||___________________|| | : : | | : : [_c:_]--=--|--+ |
| | | |_______________________| | : : | | | |_____________________| |_____________________| | : : | | : :.......................: | | |
| | | | : : | | | | | | : : | | : | | |
| | | _______________________ | : : | | | | | | : : | | : ......................... | | |
| | | | [ vm04-win8 ] | | : : | | \ | | / : : | | : : [ vm04-win8 ] : | | |
| | | | ____________________| | : : | | \ | | / : : | | : :____________________ : | | |
| | | | | NIC 1 =-------/ : : | | | | | | : : | | :-------= NIC 1 | : | | |
| | | | | 10.255.1.4 || : : | | | | | | : : | | :| 10.255.1.4 | : | | |
| | | | | ..:..:..:..:..:.. || : : | | | | | | : : | | :| ..:..:..:..:..:.. | : | | |
| | | | |___________________|| : : | | | | | | : : | | :|___________________| : | | |
| | | | ____ | : : | | | | | | : : | | : ____ : | | |
| +--|-=--[_c:_] | : : | | | | | | : : | | : [_c:_]--=--|--+ |
| | | |_______________________| : : | | | | | | : : | | :.......................: | | |
| | | : : | | | | | | : : | | | | |
| | | ......................... : : | | | | | | : : | | _______________________ | | |
| | | : [ vm05-freebsd9 ] : : : | | | | | | : : | | | [ vm05-freebsd9 ] | | | |
| | | : ____________________: : : | | | | | | : : | | |____________________ | | | |
| | | : | em0 =---------: : | | | | | | : : | \---------= em0 | | | | |
| | | : | 10.255.1.5 |: : | | | | | | : : | || 10.255.1.5 | | | | |
| | | : | ..:..:..:..:..:.. |: : | | | | | | : : | || ..:..:..:..:..:.. | | | | |
| | | : |___________________|: : | | | | | | : : | ||___________________| | | | |
| | | : ______ : : | | | | | | : : | | ______ | | | |
| | +--=--[_ada0_] : : | | | | | | : : | | [_ada0_]--=--+ | |
| | | :.......................: : | | | | | | : : | |_______________________| | | |
| | | : | | | | | | : : | | | |
| | | ......................... : | | | | | | : : | _______________________ | | |
| | | : [ vm06-solaris11 ] : : | | | | | | : : | | [ vm06-solaris11 ] | | | |
| | | : ____________________: : | | | | | | : : | |____________________ | | | |
| | | : | net0 =-----------: | | | | | | : : \-----------= net0 | | | | |
| | | : | 10.255.1.6 |: | | | | | | : : || 10.255.1.6 | | | | |
| | | : | ..:..:..:..:..:.. |: | | | | | | : : || ..:..:..:..:..:.. | | | | |
| | | : |___________________|: | | | | | | : : ||___________________| | | | |
| | | : ______ : | | | | | | : : | ______ | | | |
| | +--=--[_c3d0_] : | | | | | | : : | [_c3d0_]--=--+ | |
| | | :.......................: | | | | | | : : |_______________________| | | |
| | | | | | | | | : : | | |
| | | _______________________ | | | | | | : : ......................... | | |
| | | | [ vm07-rhel6 ] | | | | | | | : : : [ vm07-rhel6 ] : | | |
| | | | ____________________| | | | | | | : : :____________________ : | | |
| | | | | eth0 =-------------/ | | | | | : :-------------= eth0 | : | | |
| | | | | 10.255.1.7 || | | | | | : :| 10.255.1.7 | : | | |
| | | | | ..:..:..:..:..:.. || | | | | | : :| ..:..:..:..:..:.. | : | | |
| | | | |___________________|| | | | | | : :|___________________| : | | |
| | | | _____ | | | | | | : : _____ : | | |
| +--|--=--[_vda_] | | | | | | : : [_vda_]--=--|--+ |
| | | |_______________________| | | | | | : :.......................: | | |
| | | | | | | | : | | |
| | | _______________________ | | | | | : ......................... | | |
| | | | [ vm08-sles11 ] | | | | | | : : [ vm08-sles11 ] : | | |
| | | | ____________________| | | | | | : :____________________ : | | |
| | | | | eth0 =---------------/ | | | | :---------------= eth0 | : | | |
| | | | | 10.255.1.8 || | | | | :| 10.255.1.8 | : | | |
| | | | | ..:..:..:..:..:.. || | | | | :| ..:..:..:..:..:.. | : | | |
| | | | |___________________|| | | | | :|___________________| : | | |
| | | | _____ | | | | | : _____ : | | |
| +--|--=--[_vda_] | | | | | : [_vda_]--=--|--+ |
| | | |_______________________| | | | | :.......................: | | |
| | | | | | | | | |
| | | | | | | | | |
| | | | | | | | | |
| | | Storage: | | | | Storage: | | |
| | | __________ | | | | __________ | | |
| | | [_/dev/sda_] | | | | [_/dev/sda_] | | |
| | | | ___________ _______ | | | | _______ ___________ | | | |
| | | +--[_/dev/sda1_]--[_/boot_] | | | | [_/boot_]--[_/dev/sda1_]--+ | | |
| | | | ___________ ________ | | | | ________ ___________ | | | |
| | | +--[_/dev/sda2_]--[_<swap>_] | | | | [_<swap>_]--[_/dev/sda2_]--+ | | |
| | | | ___________ ___ | | | | ___ ___________ | | | |
| | | +--[_/dev/sda3_]--[_/_] | | | | [_/_]--[_/dev/sda3_]--+ | | |
| | | | ___________ ____ ____________ | | | | ____________ ____ ___________ | | | |
| | | +--[_/dev/sda5_]--[_r0_]--[_/dev/drbd0_]--+ | | +--[_/dev/drbd0_]--[_r0_]--[_/dev/sda5_]--+ | | |
| | | | | | | | | | | | | |
| | | | \----|--\ | | /--|----/ | | | |
| | | | ___________ ____ ____________ | | | | | | ____________ ____ ___________ | | | |
| | | \--[_/dev/sda6_]--[_r1_]--[_/dev/drbd1_]--/ | | | | \--[_/dev/drbd1_]--[_r1_]--[_/dev/sda6_]--/ | | |
| | | | | | | | | | | |
| | | Clustered LVM: | | | | | | Clustered LVM: | | |
| | | _________________________________ | | | | | | _________________________________ | | |
| | +--[_/dev/an-c05n01_vg0/vm02-win2012_]-----+ | | | | +--[_/dev/an-c05n01_vg0/vm02-win2012_]-----+ | |
| | | __________________________________ | | | | | | __________________________________ | | |
| | +--[_/dev/an-c05n01_vg0/vm05-freebsd9_]----+ | | | | +--[_/dev/an-c05n01_vg0/vm05-freebsd9_]----+ | |
| | | ___________________________________ | | | | | | ___________________________________ | | |
| | \--[_/dev/an-c05n01_vg0/vm06-solaris11_]---/ | | | | \--[_/dev/an-c05n01_vg0/vm06-solaris11_]---/ | |
| | | | | | | |
| | _________________________________ | | | | _________________________________ | |
| +-----[_/dev/an-c05n02_vg0/vm01-win2008_]-------------+ | | +----------[_/dev/an-c05n02_vg0/vm01-win2008_]--------+ |
| | ______________________________ | | | | ______________________________ | |
| +-----[_/dev/an-c05n02_vg0/vm03-win7_]----------------+ | | +----------[_/dev/an-c05n02_vg0/vm03-win7_]-----------+ |
| | ______________________________ | | | | ______________________________ | |
| +-----[_/dev/an-c05n02_vg0/vm04-win8_]----------------+ | | +----------[_/dev/an-c05n02_vg0/vm04-win8_]-----------+ |
| | _______________________________ | | | | _______________________________ | |
| +-----[_/dev/an-c05n02_vg0/vm07-rhel6_]---------------+ | | +----------[_/dev/an-c05n02_vg0/vm07-rhel6_]----------+ |
| | ________________________________ | | | | ________________________________ | |
| \-----[_/dev/an-c05n02_vg0/vm08-sles11_]--------------+ | | +----------[_/dev/an-c05n02_vg0/vm08-sles11_]---------/ |
| ___________________________ | | | | ___________________________ |
| /--[_/dev/an-c05n01_vg0/shared_]-------------------/ | | \----------[_/dev/an-c05n01_vg0/shared_]--\ |
| | _________ | _________________________ | ________ | |
| \--[_/shared_] | | an-s01 Switch 1 | | [_shared_]--/ |
| ____________________| |____ Back-Channel ____| |____________________ |
| | IPMI =----=_03_] Network [_04_=----= IPMI | |
| | 10.20.51.1 || |_________________________| || 10.20.51.2 | |
| _________ _____ | 00:19:99:9C:9B:9E || | an-s02 Switch 2 | || 00:19:99:9A:D8:E8 | _____ _________ |
| {_sensors_}--[_BMC_]--|___________________|| | | ||___________________|--[_BMC_]--{_sensors_} |
| ______ ______ | | VLAN ID 101 | | ______ ______ |
| | PSU1 | PSU2 | | |____ ____ ____ ____| | | PSU1 | PSU2 | |
|____________________________________________________________|______|______|_| |_03_]_[_07_]_[_08_]_[_04_| |_|______|______|____________________________________________________________|
|| || | | | | || ||
/---------------------------||-||-------------|------/ \-------|-------------||-||---------------------------\
| || || | | || || |
_______________|___ || || __________|________ ________|__________ || || ___|_______________
| UPS 1 | || || | PDU 1 | | PDU 2 | || || | UPS 2 |
| an-u01 | || || | an-p01 | | an-p02 | || || | an-u02 |
_______ | 10.20.3.1 | || || | 10.20.2.1 | | 10.20.2.2 | || || | 10.20.3.1 | _______
{_Mains_}==| 00:C0:B7:58:3A:5A |=======================||=||==| 00:C0:B7:56:2D:AC | | 00:C0:B7:59:55:7C |==||=||=======================| 00:C0:B7:C8:1C:B4 |=={_Mains_}
|___________________| || || |___________________| |___________________| || || |___________________|
|| || || || || || || ||
|| \\===[ Port 1 ]====// || || \\====[ Port 2 ]===// ||
\\======[ Port 1 ]=======||=====// ||
\\==============[ Port 2 ]======//
Subnets
The cluster will use three separate /16 (255.255.0.0) networks;
Note: There are situations where it is not possible to add additional network cards, blades being a prime example. In these cases it will be up to the admin to decide how to proceed. If there is sufficient bandwidth, you can merge all networks, but it is advised in such cases to isolate IFN traffic from the SN/BCN traffic using VLANs. |
If you plan to have two or more Anvil! platforms on the same network, then it is recommended that you use the third octet of the IP addresses to identify the cluster. We've found the following works well;
- Third octet is the cluster ID times 10.
- Fourth octet is the node ID.
In our case, we're building our fifth cluster, so node #1 will always have its IP addresses end in x.y.50.1 and node #2 will always have its IP addresses end in x.y.50.2.
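The convention is simple enough to express as arithmetic. This little snippet is purely illustrative (the variable names are not part of any Anvil! tool); it shows the addresses node #1 of cluster #5 would get on each network;

```shell
# Illustrative only: derive node addresses from cluster and node IDs.
cluster_id=5
node_id=1
third_octet=$(( cluster_id * 10 ))
echo "BCN: 10.20.${third_octet}.${node_id}"
echo "SN:  10.10.${third_octet}.${node_id}"
echo "IFN: 10.255.${third_octet}.${node_id}"
```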
Purpose | Subnet | Notes |
---|---|---|
Internet-Facing Network (IFN) | 10.255.50.0/16 | |
Storage Network (SN) | 10.10.50.0/16 | |
Back-Channel Network (BCN) | 10.20.50.0/16 | |
We will be using six interfaces, bonded into three pairs in Active/Passive (mode=1) configuration. Each link of each bond will be on an alternate switch. We will also configure affinity by specifying eth0, eth1 and eth2 as the primary interfaces for bond0, bond1 and bond2, respectively. This way, when everything is working properly, all traffic is routed through the same switch for maximum performance.
Note: Red Hat supports bonding modes 0 and 2 as of RHEL 6.4. We do not recommend these bonding modes as we've found the most reliable and consistent ability to survive switch failure and recovery with mode 1 only. If you wish to use a different bonding mode, please be sure to test various failure modes extensively! |
If you can not install six interfaces in your server, then four interfaces will do with the SN and BCN networks merged.
Warning: If you wish to merge the SN and BCN onto one interface, test to ensure that the storage traffic will not block cluster communication. Test by forming your cluster and then pushing your storage to maximum read and write performance for an extended period of time (minimum of several seconds). If the cluster partitions, you will need to do some advanced quality-of-service or other network configuration to ensure reliable delivery of cluster network traffic. |
In this tutorial, we will use two D-Link DGS-3120-24TC/SI, stacked, using three VLANs to isolate the three networks.
- BCN will have VLAN ID of 1, which is the default VLAN.
- SN will have VLAN ID number 100.
- IFN will have VLAN ID number 101.
Note: Switch configuration details. |
The actual mapping of interfaces to bonds to networks will be:
Subnet | Cable Colour | VLAN ID | Link 1 | Link 2 | Bond | IP |
---|---|---|---|---|---|---|
BCN | White | 1 | eth0 | eth3 | bond0 | 10.20.x.y/16 |
SN | Green | 100 | eth1 | eth4 | bond1 | 10.10.x.y/16 |
IFN | Black | 101 | eth2 | eth5 | bond2 | 10.255.x.y/16 |
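To preview where we're headed, a bond's configuration file will look roughly like the one below. This is a sketch only; the exact options used in this tutorial are set when we configure the network later, and the IP address shown assumes node #1's BCN address.

```shell
# Hypothetical /etc/sysconfig/network-scripts/ifcfg-bond0 for node 1's BCN.
# 'mode=1' is Active/Passive; 'primary=eth0' sets the preferred link.
DEVICE="bond0"
BOOTPROTO="none"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 primary=eth0"
IPADDR="10.20.50.1"
NETMASK="255.255.0.0"
```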
Setting Up the Network
Warning: The following steps can easily get confusing, given how many files we need to edit. Losing access to your server's network is a very real possibility! Do not continue without direct access to your servers! If you have out-of-band access via iKVM, console redirection or similar, be sure to test that it is working before proceeding. |
Planning The Use of Physical Interfaces
In production clusters, I make a point of using three separate dual-port controllers (the two on-board interfaces plus two separate dual-port PCIe cards). I then ensure that no bond uses two interfaces on the same physical board. Thus, should a card or its bus interface fail, no bond will fail completely.
Let's take a look at an example layout;
____________________
| [ an-c05n01 ] |
| ___________| _______
| | ______| | bond0 |
| | O | eth0 =-----------=---.---=------{
| | n |_____|| /--------=--/ |
| | b | | |_______|
| | o ______| | _______
| | a | eth1 =--|--\ | bond1 |
| | r |_____|| | \----=--.----=------{
| | d | | /-----=--/ |
| |___________| | | |_______|
| ___________| | | _______
| | ______| | | | bond2 |
| | P | eth2 =--|--|-----=---.---=------{
| | C |_____|| | | /--=--/ |
| | I | | | | |_______|
| | e ______| | | |
| | | eth3 =--/ | |
| | 1 |_____|| | |
| |___________| | |
| ___________| | |
| | ______| | |
| | P | eth4 =-----/ |
| | C |_____|| |
| | I | |
| | e ______| |
| | | eth5 =--------/
| | 2 |_____||
| |___________|
|____________________|
Consider the possible failure scenarios;
- The on-board controllers fail;
- bond0 falls back onto eth3 on the PCIe 1 controller.
- bond1 falls back onto eth4 on the PCIe 2 controller.
- bond2 is unaffected.
- The PCIe #1 controller fails
- bond0 remains on the eth0 interface but loses its redundancy as eth3 is down.
- bond1 is unaffected.
- bond2 falls back onto eth5 on the PCIe 2 controller.
- The PCIe #2 controller fails
- bond0 is unaffected.
- bond1 remains on the eth1 interface but loses its redundancy as eth4 is down.
- bond2 remains on the eth2 interface but loses its redundancy as eth5 is down.
In all three failure scenarios, no network interruption occurs, making this the most robust configuration possible.
Connecting Fence Devices
As we will see soon, each node can be fenced either by calling its IPMI interface or by calling the PDU and cutting the node's power. Each of these methods is inherently a single point of failure, as each has only one network connection. To work around this concern, we will connect all IPMI interfaces to one switch and the PDUs to the other. This way, should a switch fail, only one of the two fence devices will be lost and fencing will still be possible via the alternate fence device.
By convention, we always connect the IPMI interfaces to the primary switch and the PDUs to the second switch.
Let's Build!
We're going to need to install a bunch of programs, and one of them is needed before we can reconfigure the network. The bridge-utils package has to be installed right away, so now is a good time to just install everything we need.
Red Hat Enterprise Linux Specific Steps
Red Hat's Enterprise Linux is a commercial operating system that includes access to their repositories. This requires purchasing entitlements and then registering machines with their Red Hat Network.
This tutorial uses GFS2, which is provided by their Resilient Storage Add-On. This add-on includes the High-Availability Add-On, which provides the rest of the HA cluster stack.
Once you've finished your install, you can quickly register your node with RHN and add the resilient storage add-on with the following two commands.
Note: You need to replace $user and $pass with your RHN account details. You also need to replace $hostname with your node's fully qualified domain name. |
rhnreg_ks --username "$user" --password "$pass" --force --profilename "$hostname"
rhn-channel --add --user "$user" --password "$pass" --channel=rhel-x86_64-server-rs-6
If you get any errors from the above commands, please contact your support representative. They will be able to help sort out any account or entitlement issues.
Update The OS
Before we begin at all, let's update our OS.
yum update
<lots of yum output>
Installing Required Programs
This will install all the software needed to run the Anvil! and to configure IPMI for use as a fence device. This won't cover DRBD or apcupsd, which will be covered in dedicated sections below.
yum install cman corosync rgmanager ricci gfs2-utils ntp libvirt lvm2-cluster qemu-kvm qemu-kvm-tools virt-install virt-viewer syslinux wget gpm rsync \
            freeipmi freeipmi-bmc-watchdog freeipmi-ipmidetectd OpenIPMI OpenIPMI-libs OpenIPMI-perl OpenIPMI-tools fence-agents vim man ccs \
            bridge-utils openssh-clients perl screen dmidecode acpid
<lots of yum output>
Before we go any further, we'll want to destroy the default libvirtd bridge. We're going to be creating our own bridge that gives our servers direct access to the outside network.
cat /dev/null >/etc/libvirt/qemu/networks/default.xml
If you already see virbr0 when you run ifconfig, then the libvirtd bridge has already started. You can stop and disable it with the following commands;
virsh net-destroy default
virsh net-autostart default --disable
virsh net-undefine default
/etc/init.d/iptables stop
Now virbr0 should be gone, and it won't return.
Installing Programs Needed For Monitoring
The alert system we will be using is written in perl. Some modules need to be installed from source, which requires the development environment group and some development libraries to be installed. If you prefer to monitor your nodes another way, then you can skip this section.
yum groupinstall development
<lots of yum output>
yum install perl-CPAN perl-YAML-Tiny perl-Net-SSLeay perl-CGI openssl-devel
<some more yum output>
The next stage installs the perl modules. The first line tells perl to not prompt for input and to just do the install with defaults, which saves a lot of questions and answers. If you need to do a non-standard CPAN install, skip the first line and CPAN will run interactively.
export PERL_MM_USE_DEFAULT=1
perl -MCPAN -e 'install("YAML")'
perl -MCPAN -e 'install Moose::Role'
perl -MCPAN -e 'install Throwable::Error'
perl -MCPAN -e 'install Email::Sender::Transport::SMTP::TLS'
<a massive amount of CPAN output, test and build messages... go grab a coffee>
Done!
We'll set up the alert system a little later on. For now, though, all of its dependencies have been met.
Switch Network Daemons
The new NetworkManager daemon is much more flexible and is perfect for machines like laptops which move around networks a lot. However, it does this by making a lot of decisions for you and changing the network as it sees fit. As good as this is for laptops and the like, it's not appropriate for servers. We will want to use the traditional network service.
yum remove NetworkManager
Now enable network to start with the system.
chkconfig network on
chkconfig --list network
network 0:off 1:off 2:on 3:on 4:on 5:on 6:off
Disable iptables And selinux
Warning: There are non-trivial implications to disabling security on your Anvil! nodes. In production, you might well want to think about skipping this step! |
As mentioned above, we will disable selinux and iptables. This is to simplify the learning process and both should be enabled pre-production.
To disable the firewall (note that I disable both iptables and ip6tables):
chkconfig iptables off
chkconfig ip6tables off
/etc/init.d/iptables stop
iptables: Flushing firewall rules: [ OK ]
iptables: Setting chains to policy ACCEPT: filter [ OK ]
iptables: Unloading modules: [ OK ]
/etc/init.d/ip6tables stop
ip6tables: Flushing firewall rules: [ OK ]
ip6tables: Setting chains to policy ACCEPT: filter [ OK ]
ip6tables: Unloading modules: [ OK ]
To disable selinux:
sed -i.anvil 's/SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
diff -u /etc/selinux/config.anvil /etc/selinux/config
--- /etc/selinux/config.anvil 2013-10-25 20:03:30.229999983 -0400
+++ /etc/selinux/config 2013-10-27 20:58:21.586766607 -0400
@@ -4,7 +4,7 @@
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
-SELINUX=enforcing
+SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
# targeted - Targeted processes are protected,
# mls - Multi Level Security protection.
Check if selinux is enforcing.
sestatus
SELinux status: enabled
SELinuxfs mount: /selinux
Current mode: enforcing
Mode from config file: disabled
Policy version: 24
Policy from config file: targeted
By default it is, as we see here, so we'll switch it to permissive.
setenforce 0
sestatus
SELinux status: enabled
SELinuxfs mount: /selinux
Current mode: permissive
Mode from config file: disabled
Policy version: 24
Policy from config file: targeted
You must reboot in order to disable selinux entirely. There is no rush though as nothing will fail when selinux is permissive.
Mapping Physical Network Interfaces to ethX Device Names
Consistency is the mother of stability.
When you install RHEL, it somewhat randomly assigns an ethX device name to each physical network interface. Purely technically speaking, this is fine. So long as you know which interface has which device name, you can set up the node's networking.
However!
Consistently assigning the same device names to physical interfaces makes supporting and maintaining nodes a lot easier!
We've got six physical network interfaces, named eth0 through eth5. As you recall from earlier, we want to make sure that each pair of interfaces for each network spans two physical network cards.
Most servers have at least two on-board network cards labelled "1" and "2". These tend to correspond to lights on the front of the server, so we will start by naming these interfaces eth0 and eth1, respectively. After that, you are largely free to assign names to interfaces however you see fit.
What matters most of all is that, whatever order you choose, it's consistent across your Anvil! nodes.
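On EL6, this mapping is controlled by udev. Each interface's MAC address is tied to its ethX name in /etc/udev/rules.d/70-persistent-net.rules, using entries like the one below; the MAC address shown is just an example from this tutorial's hardware, and yours will differ.

```shell
# Example entry from /etc/udev/rules.d/70-persistent-net.rules.
# The MAC address is illustrative only; substitute your interface's MAC.
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1b:21:81:c3:34", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
```

Renaming interfaces is a matter of swapping the NAME values between entries in this file, which is why we take a backup first.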
Before we touch anything, let's make a backup of what we have. This way, we have an easy out in case we "oops" a file.
mkdir -p /root/backups/
rsync -av /etc/sysconfig/network-scripts /root/backups/
sending incremental file list
created directory /root/backups
network-scripts/
network-scripts/ifcfg-eth0
network-scripts/ifcfg-eth1
network-scripts/ifcfg-eth2
network-scripts/ifcfg-eth3
network-scripts/ifcfg-eth4
network-scripts/ifcfg-eth5
network-scripts/ifcfg-lo
network-scripts/ifdown -> ../../../sbin/ifdown
network-scripts/ifdown-bnep
network-scripts/ifdown-eth
network-scripts/ifdown-ippp
network-scripts/ifdown-ipv6
network-scripts/ifdown-isdn -> ifdown-ippp
network-scripts/ifdown-post
network-scripts/ifdown-ppp
network-scripts/ifdown-routes
network-scripts/ifdown-sit
network-scripts/ifdown-tunnel
network-scripts/ifup -> ../../../sbin/ifup
network-scripts/ifup-aliases
network-scripts/ifup-bnep
network-scripts/ifup-eth
network-scripts/ifup-ippp
network-scripts/ifup-ipv6
network-scripts/ifup-isdn -> ifup-ippp
network-scripts/ifup-plip
network-scripts/ifup-plusb
network-scripts/ifup-post
network-scripts/ifup-ppp
network-scripts/ifup-routes
network-scripts/ifup-sit
network-scripts/ifup-tunnel
network-scripts/ifup-wireless
network-scripts/init.ipv6-global
network-scripts/net.hotplug
network-scripts/network-functions
network-scripts/network-functions-ipv6
sent 134870 bytes received 655 bytes 271050.00 bytes/sec
total size is 132706 speedup is 0.98
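Should we "oops" a file later, restoring from this backup is the same command in reverse. Note the trailing slash on the source directory, which tells rsync to copy its contents rather than the directory itself;

```shell
# Restore the backed-up network scripts over the live copies.
rsync -av /root/backups/network-scripts/ /etc/sysconfig/network-scripts/
```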
Making Sure All Network Interfaces are Started
What we're going to do is watch /var/log/messages, unplug each cable in turn and see which interface reports a lost link. This tells us which name is currently assigned to a given physical interface. We'll write the current name down beside the name we want that interface to have. Once we've done this for all interfaces, we'll know how we have to move the names around.
Before we can pull cables though, we have to tell the system to start all of the interfaces. By default, all but one or two interfaces will be disabled on boot.
Run this to see which interfaces are up;
ifconfig
eth4 Link encap:Ethernet HWaddr 00:19:99:9C:9B:9E
inet addr:10.255.0.33 Bcast:10.255.255.255 Mask:255.255.0.0
inet6 addr: fe80::219:99ff:fe9c:9b9e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:303118 errors:0 dropped:0 overruns:0 frame:0
TX packets:152952 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:344900765 (328.9 MiB) TX bytes:14424290 (13.7 MiB)
Memory:ce660000-ce680000
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:3540 errors:0 dropped:0 overruns:0 frame:0
TX packets:3540 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:2652436 (2.5 MiB) TX bytes:2652436 (2.5 MiB)
In this case, only the interface currently named eth4 was started. We'll need to edit the other interface configuration files to tell them to start when the network starts. To do this, we edit the /etc/sysconfig/network-scripts/ifcfg-ethX files and change the ONBOOT variable to ONBOOT="yes".
By default, most interfaces will be set to try to acquire an IP address from a DHCP server. We can see that eth4 already has an IP address, so to save time, we're going to tell the other interfaces to start without an IP address at all. If we didn't do this, restarting the network would take a long time while the DHCP requests timed out.
Note: We skip ifcfg-eth4 in the next step because it's already up. |
Now we can use sed to edit the files. This is a lot faster and easier than editing each file by hand.
# Change eth0 to start on boot with no IP address.
sed -i 's/ONBOOT=.*/ONBOOT="yes"/' /etc/sysconfig/network-scripts/ifcfg-eth0
sed -i 's/BOOTPROTO=.*/BOOTPROTO="none"/' /etc/sysconfig/network-scripts/ifcfg-eth0
# Change eth1 to start on boot with no IP address.
sed -i 's/ONBOOT=.*/ONBOOT="yes"/' /etc/sysconfig/network-scripts/ifcfg-eth1
sed -i 's/BOOTPROTO=.*/BOOTPROTO="none"/' /etc/sysconfig/network-scripts/ifcfg-eth1
# Change eth2 to start on boot with no IP address.
sed -i 's/ONBOOT=.*/ONBOOT="yes"/' /etc/sysconfig/network-scripts/ifcfg-eth2
sed -i 's/BOOTPROTO=.*/BOOTPROTO="none"/' /etc/sysconfig/network-scripts/ifcfg-eth2
# Change eth3 to start on boot with no IP address.
sed -i 's/ONBOOT=.*/ONBOOT="yes"/' /etc/sysconfig/network-scripts/ifcfg-eth3
sed -i 's/BOOTPROTO=.*/BOOTPROTO="none"/' /etc/sysconfig/network-scripts/ifcfg-eth3
# Change eth5 to start on boot with no IP address.
sed -i 's/ONBOOT=.*/ONBOOT="yes"/' /etc/sysconfig/network-scripts/ifcfg-eth5
sed -i 's/BOOTPROTO=.*/BOOTPROTO="none"/' /etc/sysconfig/network-scripts/ifcfg-eth5
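The repeated calls above can also be written as a short loop. This is equivalent, and again skips eth4 because it is already up;

```shell
# Same edits as above, written as a loop; eth4 is skipped.
for nic in eth0 eth1 eth2 eth3 eth5
do
    sed -i 's/ONBOOT=.*/ONBOOT="yes"/'        /etc/sysconfig/network-scripts/ifcfg-${nic}
    sed -i 's/BOOTPROTO=.*/BOOTPROTO="none"/' /etc/sysconfig/network-scripts/ifcfg-${nic}
done
```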
You can see how the file was changed by using diff to compare the backed up version against the edited one. Let's look at ifcfg-eth0 to see this;
diff -U0 /root/backups/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth0
--- /root/backups/network-scripts/ifcfg-eth0 2013-10-28 12:30:07.000000000 -0400
+++ /etc/sysconfig/network-scripts/ifcfg-eth0 2013-10-28 17:20:38.978458128 -0400
@@ -2 +2 @@
-BOOTPROTO="dhcp"
+BOOTPROTO="none"
@@ -5 +5 @@
-ONBOOT="no"
+ONBOOT="yes"
Excellent! You can check the other files to confirm that they were edited as well, if you wish. Once you are happy with the changes, restart the network initialization script.
Note: You may see [FAILED] while stopping some interfaces, this is not a concern. |
/etc/init.d/network restart
Shutting down interface eth0: [ OK ]
Shutting down interface eth1: [ OK ]
Shutting down interface eth2: [ OK ]
Shutting down interface eth3: [ OK ]
Shutting down interface eth4: [ OK ]
Shutting down interface eth5: [ OK ]
Shutting down loopback interface: [ OK ]
Bringing up loopback interface: [ OK ]
Bringing up interface eth0: [ OK ]
Bringing up interface eth1: [ OK ]
Bringing up interface eth2: [ OK ]
Bringing up interface eth3: [ OK ]
Bringing up interface eth4:
Determining IP information for eth4... done.
[ OK ]
Bringing up interface eth5: [ OK ]
Now if we look at ifconfig again, we'll see all six interfaces have been started!
ifconfig
eth0 Link encap:Ethernet HWaddr 00:1B:21:81:C3:34
inet6 addr: fe80::21b:21ff:fe81:c334/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2433 errors:0 dropped:0 overruns:0 frame:0
TX packets:31 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:150042 (146.5 KiB) TX bytes:3066 (2.9 KiB)
Interrupt:24 Memory:ce240000-ce260000
eth1 Link encap:Ethernet HWaddr 00:1B:21:81:C3:35
inet6 addr: fe80::21b:21ff:fe81:c335/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2416 errors:0 dropped:0 overruns:0 frame:0
TX packets:31 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:148176 (144.7 KiB) TX bytes:3066 (2.9 KiB)
Interrupt:34 Memory:ce2a0000-ce2c0000
eth2 Link encap:Ethernet HWaddr A0:36:9F:02:E0:04
inet6 addr: fe80::a236:9fff:fe02:e004/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:3 errors:0 dropped:0 overruns:0 frame:0
TX packets:36 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:1026 (1.0 KiB) TX bytes:5976 (5.8 KiB)
Memory:ce400000-ce500000
eth3 Link encap:Ethernet HWaddr A0:36:9F:02:E0:05
inet6 addr: fe80::a236:9fff:fe02:e005/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1606 errors:0 dropped:0 overruns:0 frame:0
TX packets:21 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:98242 (95.9 KiB) TX bytes:2102 (2.0 KiB)
Memory:ce500000-ce600000
eth4 Link encap:Ethernet HWaddr 00:19:99:9C:9B:9E
inet addr:10.255.0.33 Bcast:10.255.255.255 Mask:255.255.0.0
inet6 addr: fe80::219:99ff:fe9c:9b9e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:308572 errors:0 dropped:0 overruns:0 frame:0
TX packets:153402 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:345254511 (329.2 MiB) TX bytes:14520378 (13.8 MiB)
Memory:ce660000-ce680000
eth5 Link encap:Ethernet HWaddr 00:19:99:9C:9B:9F
inet6 addr: fe80::219:99ff:fe9c:9b9f/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:6 errors:0 dropped:0 overruns:0 frame:0
TX packets:23 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2052 (2.0 KiB) TX bytes:3114 (3.0 KiB)
Memory:ce6c0000-ce6e0000
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:3540 errors:0 dropped:0 overruns:0 frame:0
TX packets:3540 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:2652436 (2.5 MiB) TX bytes:2652436 (2.5 MiB)
Excellent! Now we can start creating the list of what physical interfaces have what current names.
Finding Current Names for Physical Interfaces
Once you know how you want your interfaces named, create a little table like this:
Have | Want |
---|---|
eth0 | |
eth1 | |
eth2 | |
eth3 | |
eth4 | |
eth5 |
Now we want to use a program called tail to watch the system log file /var/log/messages and print to screen messages as they're written to the log. To do this, run;
tail -f -n 0 /var/log/messages
When you run this, the cursor will just sit there and nothing will be printed to screen at first. This is fine, this tells us that tail is waiting for new records. We're now going to methodically unplug each network cable, wait a moment and then plug it back in. Each time we do this, we'll write down the interface name that was reported as going down and then coming back up.
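If /var/log/messages is busy with other services, it can help to filter for just the link events. This is a hedged variation, not part of the original procedure; it scans what has already been logged rather than following the file live, so it is useful for reviewing events after the fact:

```shell
# One-shot scan: show the last ten NIC link up/down events already in
# the log. (Use the live 'tail -f' command above while actually pulling
# cables, so you see each event the moment it happens.)
grep "NIC Link is" /var/log/messages | tail -n 10
```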
The first cable we're going to unplug is the one in the physical interface we want to make eth0.
Oct 28 17:36:06 an-c05n01 kernel: igb: eth4 NIC Link is Down
Oct 28 17:36:19 an-c05n01 kernel: igb: eth4 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Here we see that the physical interface that we want to be eth0 is currently called eth4. So we'll add that to our chart.
Have | Want |
---|---|
eth4 | eth0 |
eth1 | |
eth2 | |
eth3 | |
eth4 | |
eth5 |
Now we'll unplug the cable we want to make eth1:
Oct 28 17:38:01 an-c05n01 kernel: igb: eth5 NIC Link is Down
Oct 28 17:38:04 an-c05n01 kernel: igb: eth5 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
It's currently called eth5, so we'll write that in beside the "Want" column's eth1 entry.
Have | Want |
---|---|
eth4 | eth0 |
eth5 | eth1 |
eth2 | |
eth3 | |
eth4 | |
eth5 |
Keep doing this for the other four cables.
Oct 28 17:39:28 an-c05n01 kernel: e1000e: eth0 NIC Link is Down
Oct 28 17:39:30 an-c05n01 kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Oct 28 17:39:35 an-c05n01 kernel: e1000e: eth1 NIC Link is Down
Oct 28 17:39:37 an-c05n01 kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Oct 28 17:39:40 an-c05n01 kernel: igb: eth2 NIC Link is Down
Oct 28 17:39:43 an-c05n01 kernel: igb: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Oct 28 17:39:47 an-c05n01 kernel: igb: eth3 NIC Link is Down
Oct 28 17:39:51 an-c05n01 kernel: igb: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
The finished table is this;
Have | Want |
---|---|
eth4 | eth0 |
eth5 | eth1 |
eth0 | eth2 |
eth1 | eth3 |
eth2 | eth4 |
eth3 | eth5 |
Now we know how we want to move the names around!
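As an aside: if pulling cables isn't practical (remote hands, or tidy cable arms you'd rather not disturb), many NICs can blink their port identification LED instead. This is a sketch, assuming the ethtool package is installed and your NIC's driver supports the feature; if it doesn't, fall back to the cable-pull method above.

```shell
# Blink the identification LED on eth0 for 5 seconds so the physical
# port can be spotted in the rack. Falls back gracefully if the driver
# (or this machine) does not support it.
if ethtool -p eth0 5 2>/dev/null; then
    echo "LED blink finished"
else
    echo "ethtool -p not supported here; use the cable-pull method"
fi
```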
Building the MAC Address List
Every network interface has a unique MAC address assigned to it when it is built. Think of this sort of like a globally unique serial number. Because it's guaranteed to be unique, it's a convenient way for the operating system to create a persistent map between real interfaces and names. If we didn't use these, the interface names could get juggled each time the node rebooted. Not very good.
RHEL uses two files for creating this map;
- /etc/udev/rules.d/70-persistent-net.rules
- /etc/sysconfig/network-scripts/ifcfg-eth*
The 70-persistent-net.rules file can be rebuilt by running a command, so we're not going to worry about it. We'll just delete it in a little bit and then recreate it.
The files we care about are the six ifcfg-ethX files. Inside each of these is a variable named HWADDR. The value set here will tell the OS what physical network interface the given file is configuring. We know from the list we created how we want to move the files around.
To recap:
- The HWADDR MAC address in eth4 will be moved to eth0.
- The HWADDR MAC address in eth5 will be moved to eth1.
- The HWADDR MAC address in eth0 will be moved to eth2.
- The HWADDR MAC address in eth1 will be moved to eth3.
- The HWADDR MAC address in eth2 will be moved to eth4.
- The HWADDR MAC address in eth3 will be moved to eth5.
So let's create a new table. This one we will use to write down the MAC addresses we want to set for each device.
Device | New MAC address |
---|---|
eth0 | |
eth1 | |
eth2 | |
eth3 | |
eth4 | |
eth5 |
So we know that the MAC address currently assigned to eth4 is the one we want to move to eth0. We can use ifconfig to show the information for the eth4 interface only.
ifconfig eth4
eth4 Link encap:Ethernet HWaddr 00:19:99:9C:9B:9E
inet addr:10.255.0.33 Bcast:10.255.255.255 Mask:255.255.0.0
inet6 addr: fe80::219:99ff:fe9c:9b9e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:315979 errors:0 dropped:0 overruns:0 frame:0
TX packets:153610 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:345711965 (329.6 MiB) TX bytes:14555290 (13.8 MiB)
Memory:ce660000-ce680000
We want the HWaddr value, 00:19:99:9C:9B:9E. This will be moved to eth0, so let's write that down.
Device | New MAC address |
---|---|
eth0 | 00:19:99:9C:9B:9E |
eth1 | |
eth2 | |
eth3 | |
eth4 | |
eth5 |
Next up, we want to move eth5 to be the new eth1. We can use ifconfig again, but this time we'll do a little bash-fu to reduce the output to just the MAC address.
ifconfig eth5 | grep HWaddr | awk '{print $5}'
00:19:99:9C:9B:9F
This simply reduced the output to just the line containing HWaddr, then split that line on spaces and printed the fifth value, which is the MAC address currently assigned to eth5. We'll write this down beside eth1.
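To see exactly what each stage of that pipeline does, here it is run against a captured copy of the ifconfig header line (the eth5 example from above), so nothing in this sketch depends on live interfaces:

```shell
# A saved 'ifconfig eth5' header line, as shown earlier.
line='eth5      Link encap:Ethernet  HWaddr 00:19:99:9C:9B:9F'

# grep keeps only lines mentioning HWaddr; awk splits on whitespace and
# prints the fifth field, which is the MAC address itself.
echo "$line" | grep HWaddr | awk '{print $5}'
# -> 00:19:99:9C:9B:9F
```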
Device | New MAC address |
---|---|
eth0 | 00:19:99:9C:9B:9E |
eth1 | 00:19:99:9C:9B:9F |
eth2 | |
eth3 | |
eth4 | |
eth5 |
Next up, we want to move the current eth0 over to eth2. So let's get the current eth0 MAC address and add it to the list as well.
ifconfig eth0 | grep HWaddr | awk '{print $5}'
00:1B:21:81:C3:34
Now we want to move eth1 to eth3;
ifconfig eth1 | grep HWaddr | awk '{print $5}'
00:1B:21:81:C3:35
Second to last one is eth2, which will move to eth4;
ifconfig eth2 | grep HWaddr | awk '{print $5}'
A0:36:9F:02:E0:04
Finally, eth3 moves to eth5;
ifconfig eth3 | grep HWaddr | awk '{print $5}'
A0:36:9F:02:E0:05
Our complete list of new MAC addresses is;
Device | New MAC address |
---|---|
eth0 | 00:19:99:9C:9B:9E |
eth1 | 00:19:99:9C:9B:9F |
eth2 | 00:1B:21:81:C3:34 |
eth3 | 00:1B:21:81:C3:35 |
eth4 | A0:36:9F:02:E0:04 |
eth5 | A0:36:9F:02:E0:05 |
Excellent! Now we're ready.
Changing The Interface Device Names
Warning: This step is best done when you have direct access to the node. The reason is that the following changes require the network to be totally stopped in order to work without a reboot. If you can't get physical access, then when we get to the start_udev step, reboot the node instead. |
We're about to change which physical interfaces have which device names. If we don't stop the network first, we won't be able to restart it later: the kernel would see a conflict between what it thinks the MAC-to-name mapping should be and what it sees in the configuration. The only way around that conflict is a reboot, which is kind of a waste. So by stopping the network now, we clear the kernel's view of the network and avoid the problem entirely.
So, stop the network.
/etc/init.d/network stop
Shutting down interface eth0: [ OK ]
Shutting down interface eth1: [ OK ]
Shutting down interface eth2: [ OK ]
Shutting down interface eth3: [ OK ]
Shutting down interface eth4: [ OK ]
Shutting down interface eth5: [ OK ]
Shutting down loopback interface: [ OK ]
We can confirm that it's stopped by running ifconfig. It should return nothing at all.
ifconfig
# No output
Good. Next, delete the /etc/udev/rules.d/70-persistent-net.rules file. We'll regenerate it after we're done.
rm /etc/udev/rules.d/70-persistent-net.rules
rm: remove regular file `/etc/udev/rules.d/70-persistent-net.rules'? y
Now we need to edit each of the ifcfg-ethX files and change the HWADDR value to the new addresses we wrote down in our list. Let's start with ifcfg-eth0
vim /etc/sysconfig/network-scripts/ifcfg-eth0
Change the line:
HWADDR="00:1B:21:81:C3:34"
To the new value from our list;
HWADDR="00:19:99:9C:9B:9E"
Save the file and then move on to ifcfg-eth1
vim /etc/sysconfig/network-scripts/ifcfg-eth1
Change the current HWADDR="00:1B:21:81:C3:35" entry to the new MAC address;
HWADDR="00:19:99:9C:9B:9F"
Continue editing the other four ifcfg-eth{2..5} files in the same manner.
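If you'd rather not open an editor six times, the same edit can be scripted with sed, just as we did earlier for the ONBOOT and BOOTPROTO lines. This sketch works on a scratch copy under /tmp so you can see the effect safely; point the same sed at the real /etc/sysconfig/network-scripts/ifcfg-ethX files, with your own MACs from the table, once you're confident:

```shell
# Work on a scratch copy first; the real file lives in
# /etc/sysconfig/network-scripts/. The seed MAC below is the one our
# ifcfg-eth4 held before the remap.
cfg=/tmp/ifcfg-eth4.demo
printf 'DEVICE="eth4"\nHWADDR="00:19:99:9C:9B:9E"\n' > "$cfg"

# Rewrite the HWADDR line with the new MAC from our table for eth4.
# A typo in the MAC leaves the interface unmapped, so double-check it!
sed -i 's/^HWADDR=.*/HWADDR="A0:36:9F:02:E0:04"/' "$cfg"
cat "$cfg"
```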
Once all the files have been edited, we will regenerate the 70-persistent-net.rules.
start_udev
Starting udev: [ OK ]
Test The New Network Name Mapping
It's time to start networking again and see if the remapping worked!
/etc/init.d/network start
Bringing up loopback interface: [ OK ]
Bringing up interface eth0: [ OK ]
Bringing up interface eth1: [ OK ]
Bringing up interface eth2: [ OK ]
Bringing up interface eth3: [ OK ]
Bringing up interface eth4:
Determining IP information for eth4...PING 10.255.255.254 (10.255.255.254) from 10.255.0.33 eth4: 56(84) bytes of data.
--- 10.255.255.254 ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 3000ms
pipe 3
failed.
[FAILED]
Bringing up interface eth5: [ OK ]
What happened!?
If you recall, the old eth4 device was the interface we moved to eth0. The new eth4 is not plugged into a network with access to our DHCP server, so it failed to get an IP address. To fix this, we'll disable DHCP on the new eth4 and enable it on the new eth0 (which used to be eth4).
sed -i 's/BOOTPROTO.*/BOOTPROTO="none"/' /etc/sysconfig/network-scripts/ifcfg-eth4
sed -i 's/BOOTPROTO.*/BOOTPROTO="dhcp"/' /etc/sysconfig/network-scripts/ifcfg-eth0
Now we'll restart the network and this time we should be good.
/etc/init.d/network restart
Shutting down interface eth0: [ OK ]
Shutting down interface eth1: [ OK ]
Shutting down interface eth2: [ OK ]
Shutting down interface eth3: [ OK ]
Shutting down interface eth4: [ OK ]
Shutting down interface eth5: [ OK ]
Shutting down loopback interface: [ OK ]
Bringing up loopback interface: [ OK ]
Bringing up interface eth0:
Determining IP information for eth0... done.
[ OK ]
Bringing up interface eth1: [ OK ]
Bringing up interface eth2: [ OK ]
Bringing up interface eth3: [ OK ]
Bringing up interface eth4: [ OK ]
Bringing up interface eth5: [ OK ]
The last step is to again tail the system log and then unplug and plug in the cables. If everything went well, they should be in the right order now.
tail -f -n 0 /var/log/messages
Oct 28 18:44:24 an-c05n01 kernel: igb: eth0 NIC Link is Down
Oct 28 18:44:27 an-c05n01 kernel: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Oct 28 18:44:31 an-c05n01 kernel: igb: eth1 NIC Link is Down
Oct 28 18:44:34 an-c05n01 kernel: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Oct 28 18:44:35 an-c05n01 kernel: e1000e: eth2 NIC Link is Down
Oct 28 18:44:38 an-c05n01 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Oct 28 18:44:39 an-c05n01 kernel: e1000e: eth3 NIC Link is Down
Oct 28 18:44:42 an-c05n01 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Oct 28 18:44:45 an-c05n01 kernel: igb: eth4 NIC Link is Down
Oct 28 18:44:49 an-c05n01 kernel: igb: eth4 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Oct 28 18:44:50 an-c05n01 kernel: igb: eth5 NIC Link is Down
Oct 28 18:44:54 an-c05n01 kernel: igb: eth5 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Woohoo! Done!
At this point, I like to refresh the backup. We're going to be making more changes later and it would be nice not to have to redo this step, should something go wrong.
rsync -av /etc/sysconfig/network-scripts /root/backups/
sending incremental file list
network-scripts/
network-scripts/ifcfg-eth0
network-scripts/ifcfg-eth1
network-scripts/ifcfg-eth2
network-scripts/ifcfg-eth3
network-scripts/ifcfg-eth4
network-scripts/ifcfg-eth5
sent 1955 bytes received 130 bytes 4170.00 bytes/sec
total size is 132711 speedup is 63.65
Repeat this process for the other node. Once both nodes have the matching physical interface to device names, we'll be ready to move on to the next step!
Configuring Our Bridge, Bonds and Interfaces
To set up our network, we will need to edit the ifcfg-ethX, ifcfg-bondX and ifcfg-vbr2 scripts.
The vbr2 device is a bridge, like a virtual network switch, which will be used to connect the virtual machines to the outside world via the IFN. If you look at the network map, you will see that the vbr2 virtual interface connects to bond2, which links to the outside world, and that it connects to all servers, just like a normal switch does. You will also note that the bridge, not the bonded interface bond2, holds the IP address; bond2 will instead be slaved to the vbr2 bridge.
The bondX virtual devices work a lot like the network version of RAID level 1 arrays. They take two real links and turn them into one redundant link. In our case, each link in the bond will go to a different switch, protecting our links against interface, cable, port or entire switch failures. Should any of these fail, the bond will switch to the backup link so quickly that the applications on the nodes will not notice anything happened.
We're going to be editing a lot of files. It's best to lay out what we'll be doing in a chart. So our setup will be:
Node | BCN IP and Device | SN IP and Device | IFN IP and Device |
---|---|---|---|
an-c05n01 | 10.20.50.1 on bond0 | 10.10.50.1 on bond1 | 10.255.50.1 on vbr2 (bond2 slaved) |
an-c05n02 | 10.20.50.2 on bond0 | 10.10.50.2 on bond1 | 10.255.50.2 on vbr2 (bond2 slaved) |
Always Have Backups
Warning: Bridge configuration files must have a file name which will sort after the interface and bond files. The actual device name can be whatever you want though. If the system tries to start a bridge before its slaved interface is up, it will fail. I personally like to use the name vbrX for "virtual machine bridge". You can use whatever makes sense to you, with the above concern in mind. |
If you skipped the previous step, start here by making a backup of the existing network configuration files.
mkdir /root/backups/
rsync -av /etc/sysconfig/network-scripts /root/backups/
sending incremental file list
network-scripts/
network-scripts/ifcfg-eth0
network-scripts/ifcfg-eth1
network-scripts/ifcfg-eth2
network-scripts/ifcfg-eth3
network-scripts/ifcfg-eth4
network-scripts/ifcfg-eth5
network-scripts/ifcfg-lo
network-scripts/ifdown -> ../../../sbin/ifdown
network-scripts/ifdown-bnep
network-scripts/ifdown-eth
network-scripts/ifdown-ippp
network-scripts/ifdown-ipv6
network-scripts/ifdown-isdn -> ifdown-ippp
network-scripts/ifdown-post
network-scripts/ifdown-ppp
network-scripts/ifdown-routes
network-scripts/ifdown-sit
network-scripts/ifdown-tunnel
network-scripts/ifup -> ../../../sbin/ifup
network-scripts/ifup-aliases
network-scripts/ifup-bnep
network-scripts/ifup-eth
network-scripts/ifup-ippp
network-scripts/ifup-ipv6
network-scripts/ifup-isdn -> ifup-ippp
network-scripts/ifup-plip
network-scripts/ifup-plusb
network-scripts/ifup-post
network-scripts/ifup-ppp
network-scripts/ifup-routes
network-scripts/ifup-sit
network-scripts/ifup-tunnel
network-scripts/ifup-wireless
network-scripts/init.ipv6-global
network-scripts/net.hotplug
network-scripts/network-functions
network-scripts/network-functions-ipv6
sent 135156 bytes received 731 bytes 271774.00 bytes/sec
total size is 132711 speedup is 0.98
Creating New Network Configuration Files
The new bond and bridge devices we want to create do not exist at all yet. So we will start by touching the configuration files we will need.
touch /etc/sysconfig/network-scripts/ifcfg-bond{0,1,2}
touch /etc/sysconfig/network-scripts/ifcfg-vbr2
Configuring The Bridge
We'll start in reverse order, crafting the bridge's script first.
an-c05n01 IFN Bridge: | vim /etc/sysconfig/network-scripts/ifcfg-vbr2
# Internet-Facing Network - Bridge
DEVICE="vbr2"
TYPE="Bridge"
NM_CONTROLLED="no"
BOOTPROTO="none"
IPADDR="10.255.50.1"
NETMASK="255.255.0.0"
GATEWAY="10.255.255.254"
DNS1="8.8.8.8"
DNS2="8.8.4.4"
DEFROUTE="yes"
|
---|---|
an-c05n02 IFN Bridge: | vim /etc/sysconfig/network-scripts/ifcfg-vbr2
# Internet-Facing Network - Bridge
DEVICE="vbr2"
TYPE="Bridge"
NM_CONTROLLED="no"
BOOTPROTO="none"
IPADDR="10.255.50.2"
NETMASK="255.255.0.0"
GATEWAY="10.255.255.254"
DNS1="8.8.8.8"
DNS2="8.8.4.4"
DEFROUTE="yes"
|
If you have a Red Hat account, you can read up on what the options above mean, and the specifics of bridge devices. In case you don't though, here is a summary:
Variable | Description |
---|---|
DEVICE | This is the actual name given to this device. Generally it matches the file name. In this case, the DEVICE is vbr2 and the file name is ifcfg-vbr2. This matching of file name to device name is by convention and not strictly required. |
TYPE | This is either Ethernet, the default, or Bridge, as we use here. Note that these values are case-sensitive! By setting this here, we're telling the OS that we're creating a bridge device. |
NM_CONTROLLED | This can be yes, which is the default, or no, as we set here. This tells Network Manager that it is not allowed to manage this device. We've removed the NetworkManager package, so this is not strictly needed, but we'll add it just in case it gets installed in the future. |
BOOTPROTO | This can be either none, which we're using here, or dhcp or bootp if you want the interface to get an IP from a DHCP or BOOTP server, respectively. We're assigning a static IP, so we want this set to none. |
IPADDR | This is the dotted-decimal IP address we're assigning to this interface. |
NETMASK | This is the dotted-decimal subnet mask for this interface. |
GATEWAY | This is the IP address the node will contact when it needs to send traffic to other networks, like the Internet. |
DNS1 | This is the IP address of the primary domain name server to use when the node needs to translate a host or domain name into an IP address which wasn't found in the /etc/hosts file. |
DNS2 | This is the IP address of the backup domain name server, should the primary DNS server specified above fail. |
DEFROUTE | This can be set to yes, as we've set it here, or no. If two or more interfaces have a gateway defined, the interface with this variable set to yes will be used for the default route. |
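Once the new configuration is loaded (the network restart comes later in this section), you can confirm that the bridge exists and that bond2 is enslaved to it. A hedged check; brctl comes from the bridge-utils package, which may need to be installed first:

```shell
# List all bridges and the interfaces enslaved to each. Once everything
# is up, expect to see vbr2 with bond2 listed under it.
if command -v brctl >/dev/null 2>&1; then
    brctl show
else
    echo "brctl not found; install it with: yum install bridge-utils"
fi
```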
Creating the Bonded Interfaces
Next up, we'll create the three bonding configuration files. This is where two physical network interfaces are tied together to work like a single, highly available network interface. You can think of a bonded interface as being akin to RAID level 1; a new virtual device is created out of two real devices.
We're going to see a long line called "BONDING_OPTS". Let's look at the meaning of these options before we look at the configuration;
Variable | Description |
---|---|
mode | This tells the Linux kernel what kind of bond we're creating here. There are seven modes available, each with a numeric value representing them. We're going to use the "Active/Passive" mode, known as mode 1 (active-backup). As of RHEL 6.4, mode 0 (balance-rr) and mode 2 (balance-xor) are also supported for use with corosync. Given mode 1's proven reliability across numerous failure and recovery tests though, AN! still strongly recommends it. |
miimon | This tells the kernel how often, in milliseconds, to check for unreported link failures. We're using 100 which tells the bonding driver to check if the network cable has been unplugged or plugged in every 100 milliseconds. Most modern drivers will report link state via their driver, so this option is not strictly required, but it is recommended for extra safety. |
use_carrier | Setting this to 1 tells the driver to use the driver to maintain the link state. Some drivers don't support that. If you run into trouble where the link shows as up when it's actually down, get a new network card or try changing this to 0. |
updelay | Setting this to 120000 tells the driver to delay switching back to the primary interface for 120,000 milliseconds (120 seconds / 2 minutes). This is designed to give the switch connected to the primary interface time to finish booting. Setting this too low may cause the bonding driver to switch back before the network switch is ready to actually move data. Some switches will not provide a link until they are fully booted, so please experiment. |
downdelay | Setting this to 0 tells the driver not to wait before changing the state of an interface when the link goes down. That is, when the driver detects a fault, it will switch to the backup interface immediately. This is the default behaviour, but setting it here ensures that the value is restored when the interface is reset, should the delay somehow be set elsewhere. |
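The options in the table above can be observed live once a bond is up: the bonding driver exposes per-bond status (bonding mode, active slave, link state of each slave, and the configured miimon/updelay/downdelay values) under /proc/net/bonding/. A hedged check you can run after the network restart later in this section:

```shell
# Show live status for bond0. The file only exists once the bonding
# module has loaded and the bond has been created.
if [ -f /proc/net/bonding/bond0 ]; then
    cat /proc/net/bonding/bond0
else
    echo "bond0 is not up yet (bonding module not loaded or bond not created)"
fi
```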
The first bond we'll configure is for the Back-Channel Network.
an-c05n01 BCN Bond | vim /etc/sysconfig/network-scripts/ifcfg-bond0
# Back-Channel Network - Bond
DEVICE="bond0"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=eth0"
IPADDR="10.20.50.1"
NETMASK="255.255.0.0"
|
---|---|
an-c05n02 BCN Bond | vim /etc/sysconfig/network-scripts/ifcfg-bond0
# Back-Channel Network - Bond
DEVICE="bond0"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=eth0"
IPADDR="10.20.50.2"
NETMASK="255.255.0.0"
|
Next up is the bond for the Storage Network;
an-c05n01 SN Bond: | vim /etc/sysconfig/network-scripts/ifcfg-bond1
# Storage Network - Bond
DEVICE="bond1"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=eth1"
IPADDR="10.10.50.1"
NETMASK="255.255.0.0"
|
---|---|
an-c05n02 SN Bond: | vim /etc/sysconfig/network-scripts/ifcfg-bond1
# Storage Network - Bond
DEVICE="bond1"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=eth1"
IPADDR="10.10.50.2"
NETMASK="255.255.0.0"
|
Finally, we setup the bond for the Internet-Facing Network.
Here we see a new option:
- BRIDGE="vbr2"; This tells the system that this bond is to be connected to the vbr2 bridge when it is started.
an-c05n01 IFN Bond: | vim /etc/sysconfig/network-scripts/ifcfg-bond2
# Internet-Facing Network - Bond
DEVICE="bond2"
BRIDGE="vbr2"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=eth2"
|
---|---|
an-c05n02 IFN Bond: | vim /etc/sysconfig/network-scripts/ifcfg-bond2
# Internet-Facing Network - Bond
DEVICE="bond2"
BRIDGE="vbr2"
NM_CONTROLLED="no"
BOOTPROTO="none"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=eth2"
|
Done with the bonds!
Alter The Interface Configurations
With the bridge and bonds in place, we can now alter the interface configurations.
Which two interfaces you use in a given bond is entirely up to you. I've found it easiest to keep straight when I match the bondX to the primary interface's ethX number.
an-c05n01's eth0, the BCN bond0, Link 1: vim /etc/sysconfig/network-scripts/ifcfg-eth0
# Back-Channel Network - Link 1
HWADDR="00:19:99:9C:9B:9E"
DEVICE="eth0"
NM_CONTROLLED="no"
ONBOOT="yes"
BOOTPROTO="none"
MASTER="bond0"
SLAVE="yes"
an-c05n01's eth1, the SN bond1, Link 1: vim /etc/sysconfig/network-scripts/ifcfg-eth1
# Storage Network - Link 1
HWADDR="00:19:99:9C:9B:9F"
DEVICE="eth1"
NM_CONTROLLED="no"
ONBOOT="yes"
BOOTPROTO="none"
MASTER="bond1"
SLAVE="yes"
an-c05n01's eth2, the IFN bond2, Link 1: vim /etc/sysconfig/network-scripts/ifcfg-eth2
# Internet-Facing Network - Link 1
HWADDR="00:1B:21:81:C3:34"
DEVICE="eth2"
NM_CONTROLLED="no"
ONBOOT="yes"
BOOTPROTO="none"
MASTER="bond2"
SLAVE="yes"
an-c05n01's eth3, the BCN bond0, Link 2: vim /etc/sysconfig/network-scripts/ifcfg-eth3
# Back-Channel Network - Link 2
HWADDR="00:1B:21:81:C3:35"
DEVICE="eth3"
NM_CONTROLLED="no"
ONBOOT="yes"
BOOTPROTO="none"
MASTER="bond0"
SLAVE="yes"
an-c05n01's eth4, the SN bond1, Link 2: vim /etc/sysconfig/network-scripts/ifcfg-eth4
# Storage Network - Link 2
HWADDR="A0:36:9F:02:E0:04"
DEVICE="eth4"
NM_CONTROLLED="no"
ONBOOT="yes"
BOOTPROTO="none"
MASTER="bond1"
SLAVE="yes"
an-c05n01's eth5, the IFN bond2, Link 2: vim /etc/sysconfig/network-scripts/ifcfg-eth5
# Internet-Facing Network - Link 2
HWADDR="A0:36:9F:02:E0:05"
DEVICE="eth5"
NM_CONTROLLED="no"
ONBOOT="yes"
BOOTPROTO="none"
MASTER="bond2"
SLAVE="yes"
Loading The New Network Configuration
Simply restart the network service.
/etc/init.d/network restart
Updating /etc/hosts
On both nodes, update the /etc/hosts file to reflect your network configuration. Remember to add entries for your IPMI, switched PDUs and other devices.
vim /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
# an-c05n01
10.20.50.1 an-c05n01 an-c05n01.bcn an-c05n01.alteeve.ca
10.20.1.1 an-c05n01.ipmi
10.10.50.1 an-c05n01.sn
10.255.50.1 an-c05n01.ifn
# an-c05n02
10.20.50.2 an-c05n02 an-c05n02.bcn an-c05n02.alteeve.ca
10.20.1.2 an-c05n02.ipmi
10.10.50.2 an-c05n02.sn
10.255.50.2 an-c05n02.ifn
# Fence devices
10.20.2.1 pdu1 pdu1.alteeve.ca
10.20.2.2 pdu2 pdu2.alteeve.ca
# VPN interfaces, if used.
10.30.0.1 an-c05n01.vpn
10.30.0.2 an-c05n02.vpn
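After saving /etc/hosts, it's worth confirming that the new names actually resolve. getent consults /etc/hosts the same way the system resolver does; this is a hedged spot-check (substitute any of the names you added):

```shell
# Each name we added should resolve locally, without touching DNS.
for name in an-c05n01.bcn an-c05n01.sn an-c05n01.ifn; do
    getent hosts "$name" || echo "WARNING: $name does not resolve"
done
```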
Setting up SSH
Setting up SSH shared keys will allow your nodes to pass files between one another and execute commands remotely without needing to enter a password. This will be needed later when we want to enable applications like libvirtd and its tools, like virt-manager.
SSH is, on its own, a very big topic. If you are not familiar with SSH, please take some time to learn about it before proceeding. A great first step is the Wikipedia entry on SSH, as well as the SSH man page; man ssh.
SSH can be a bit confusing when it comes to keeping connections straight in your head. When you connect to a remote machine, you start the connection on your machine as the user you are logged in as. This is the source user. When you call the remote machine, you tell the machine what user you want to log in as. This is the remote user.
You will need to create an SSH key for each source user on each node, and then you will need to copy the newly generated public key to each remote machine's user directory that you want to connect to. In this example, we want to connect to either node, from either node, as the root user. So we will create a key for each node's root user and then copy the generated public key to the other node's root user's directory.
For each user, on each machine you want to connect from, run:
# The '2047' is just to screw with brute-forcers a bit. :)
ssh-keygen -t rsa -N "" -b 2047 -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
Created directory '/root/.ssh'.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
4a:52:a1:c7:60:d5:e8:6d:c4:75:20:dd:62:2b:86:c5 root@an-c05n01.alteeve.ca
The key's randomart image is:
+--[ RSA 2047]----+
| o.o=.ooo. |
| . +..E.+.. |
| ..+= . o |
| oo = . |
| . .oS. |
| o . |
| . |
| |
| |
+-----------------+
This will create two files: the private key called ~/.ssh/id_rsa and the public key called ~/.ssh/id_rsa.pub. The private key must never be group or world readable! That is, it should be set to mode 0600.
If you look closely when you created the ssh key, the node's fingerprint is shown (4a:52:a1:c7:60:d5:e8:6d:c4:75:20:dd:62:2b:86:c5 for an-c05n01 above). Make a note of the fingerprint for each machine, and then compare it to the one presented to you when you ssh to a machine for the first time. If you are presented with a fingerprint that doesn't match, you could be facing a "man in the middle" attack.
To look up a fingerprint in the future, you can run the following;
ssh-keygen -l -f ~/.ssh/id_rsa
2047 4a:52:a1:c7:60:d5:e8:6d:c4:75:20:dd:62:2b:86:c5 /root/.ssh/id_rsa.pub (RSA)
The two newly generated files should look like;
Private key:
cat ~/.ssh/id_rsa
-----BEGIN RSA PRIVATE KEY-----
MIIEnwIBAAKCAQBs+CsWeKegqmtneZcLDvHV4QT1n+ajj98gkmjoLcIFW5g/VFRL
pSMMkwkQBgGDkmKPvYFa5OolL6qBQSAN1NpP8zET+1lZr4OFg/TZTuA8QnhNeh6V
mU2hSoyJfEkKJ6TVYg4s1rsbbTZPLdCDe9CMn/iI824WUu2wA8RwhF2WTqqTrWTW
4h8tYK9Y4eT4IYMXiYZ8+eQfzHyMaNxvUcI1Z8heMn/CEnrA67ja7Czi/ljYnw0I
3MXy9d2ANYjYahBLF2+ok19NS9tkFHDlcZTh0gTQ4vV5fksgdJjsWl5l/aLjnSRf
x2pQrMl3w8U7JBpr0PWJPIuzd4q47+KBI1A9AgEjAoIBADTtkUVtzcMQ8lbUqHMV
4y1eqqMwaLXYKowp2y7xp2GwJWCWrJnFPOjZs/HXCAy00Ml5TXVKnZ0IhgRENCP5
q92wos8w8OJrMUDZsXDdKxX0ZlGEdUFZFxPTwJqM0wTuryXQiorOsqbr5y3Fy62T
6PPYq+q/YVtM2dkmZrpO66DGcTkBA8tq8tTU3TdqZEVfmCzM9DIGz2hprvky+yDU
Pa296CP7+lHFty34K6j/WxD49+aKrdxXxdLbH/3Wfq7a9fu/FuYObPRtXoYRJNGP
ZEzfVoNwVdc3vETuzZPDoidkc4jomA4vM4cTS1EvwEWVHfaSdIE0wF16N1FlDgNA
hKsCgYEA9Xp5vGoPRer3hTSglGrPOTTkGEhXiE/JDMZ7w4fk2lXo+Q7HqxetrS6l
hMxY+x2W0FBfKwJqBuhVv4Y5MPLbC2JazwYDoP85g6RWH72ebsqdYwYvSx808iDs
C8HArWv8RtQ/K1pRVkq0GPhTdc22sYE9aKa5Hc6nd0SEmq+hLoUCgYBxo9c3M28h
jDpxwTkYszMfpIb++tCSrcBw8guqdqjhW6yH9kXva3NjfuzpOisb7cFN6dcSqjaC
HEZjpBWPUGLOPMnL1/mSsTErusgyh2+x8WjRjuqBJrh7CDN8gejMiski5nALQpxt
s6PKI5WHVqPQ395+549LQnoaCROyf4TUWQKBgFQp/doy/ewWC7ikVFAkntHI/b8u
vuzoJ6yb0qlwa7iSe8MbAwaldo8IrcchfZfs40AbjlfjkhD/M1ebu9ZEot9U6+81
QxKgpgE/qH/pPaJUGLQ8ooAn9OVNHbrjWADx0tZ0p/GbTxZFf5OIVyETVJShVuIN
RshkHCjkSrixPpObAoGAPbC2qPAJINcYaaNoI1n3Lm9B+CHBrrYYAsyJ/XOdgabL
X8A0l+nfjciPPMfOQlx+4ScrnGsHpbeT7PKsnkGUuRmvYAeHe4TC69psrbc8om0b
pPXPwnQbAPXSzo+qQybE9bBLc9O0AQm/UHm3kpy/VCHB7R6ePsxQ6Y/mHxIGR2MC
gYEAhW7evwpxUMcW+BV84xIIt7cW2K/mu8nOb2qajFTej+WgvHNT+h4vgs4ZrTkH
rHyUiN/tzTCxBnkoh1w9FmCdnAdr/+br56Zq8oEXzBUUALqeW0xnB0zpTc6Hn0xq
iU0P5cM1sgyCWv83MgeGegcpxt54K5bqUjPKjaUpLNqbtiA=
-----END RSA PRIVATE KEY----- Public key (single line, but wrapped here to make it more readable): cat ~/.ssh/id_rsa.pub ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQBs+CsWeKegqmtneZcLDvHV4QT1n+ajj98gkmjo
LcIFW5g/VFRLpSMMkwkQBgGDkmKPvYFa5OolL6qBQSAN1NpP8zET+1lZr4OFg/TZTuA8QnhN
eh6VmU2hSoyJfEkKJ6TVYg4s1rsbbTZPLdCDe9CMn/iI824WUu2wA8RwhF2WTqqTrWTW4h8t
YK9Y4eT4IYMXiYZ8+eQfzHyMaNxvUcI1Z8heMn/CEnrA67ja7Czi/ljYnw0I3MXy9d2ANYjY
ahBLF2+ok19NS9tkFHDlcZTh0gTQ4vV5fksgdJjsWl5l/aLjnSRfx2pQrMl3w8U7JBpr0PWJ
PIuzd4q47+KBI1A9 root@an-c05n01.alteeve.ca
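Because ssh will refuse to use a private key with unsafe permissions, it's worth verifying the mode now. Below is a minimal sketch of that check. It is demonstrated on a scratch file so it can be run anywhere; on a real node you would point key_file at ~/.ssh/id_rsa instead.

```shell
# Verify that the private key is mode 0600 and tighten it if it is not.
# Demonstrated on a scratch file; on a node, set key_file to ~/.ssh/id_rsa.
key_file="$(mktemp)"
chmod 644 "$key_file"                  # simulate an unsafe, world-readable mode
mode="$(stat -c '%a' "$key_file")"
if [ "$mode" != "600" ]; then
    echo "mode $mode is unsafe; tightening to 0600"
    chmod 600 "$key_file"
fi
mode="$(stat -c '%a' "$key_file")"     # the key is now mode 600
rm -f "$key_file"
```

The same check is worth repeating any time keys are copied between machines, as tools can silently loosen permissions.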
In order to enable password-less login, we need to create a file called ~/.ssh/authorized_keys and put both nodes' public keys in it. To seed the ~/.ssh/authorized_keys file, we'll simply copy the ~/.ssh/id_rsa.pub file. After that, we will append an-c05n02's public key to it over ssh. Once both keys are in it, we'll push it over to an-c05n02. If you want to add your workstation's key as well, this is the best time to do so. From an-c05n01, type: rsync -av ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys sending incremental file list
id_rsa.pub
sent 482 bytes received 31 bytes 1026.00 bytes/sec
total size is 404 speedup is 0.79 Now we'll grab the public key from an-c05n02 over SSH and append it to the new authorized_keys file. I noted when I created an-c05n02's ssh key that its fingerprint was 04:08:37:43:6b:5c:a0:b0:f5:27:a7:46:d4:77:a3:34. This matches the one presented to me in the next step, so I trust that I am talking to the right machine. ssh root@an-c05n02 "cat ~/.ssh/id_rsa.pub" >> ~/.ssh/authorized_keys The authenticity of host 'an-c05n02 (10.20.50.2)' can't be established.
RSA key fingerprint is 04:08:37:43:6b:5c:a0:b0:f5:27:a7:46:d4:77:a3:34.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'an-c05n02,10.20.50.2' (RSA) to the list of known hosts.
root@an-c05n02's password:
Now push the local copy of authorized_keys with both keys over to an-c05n02. rsync -av ~/.ssh/authorized_keys root@an-c05n02:/root/.ssh/ root@an-c05n02's password:
sending incremental file list
authorized_keys
sent 1704 bytes received 31 bytes 694.00 bytes/sec
total size is 1621 speedup is 0.93 Now log into the remote machine. This time, the connection should succeed without having entered a password! ssh root@an-c05n02 Last login: Sat Dec 10 16:06:21 2011 from 10.20.255.254 Perfect! Once you can log into both nodes, from either node, without a password you will be finished. Populating And Pushing ~/.ssh/known_hosts Various applications will connect to the other node using different methods and networks. Each connection, when first established, will prompt you to confirm that you trust the authentication, as we saw above. Many programs can't handle this prompt and will simply fail to connect. So to get around this, let's ssh into both nodes using all host names. This will populate a file called ~/.ssh/known_hosts. Once you do this on one node, you can simply copy the known_hosts file to the other nodes' and users' ~/.ssh/ directories. I simply paste this into a terminal, answering yes and then immediately exiting from each ssh session. This is a bit tedious, I admit, but it only needs to be done one time for all nodes. Take the time to check the fingerprints as they are displayed to you. It is a bad habit to blindly type yes. Alter this to suit your host names. ssh root@an-c05n01 && \
ssh root@an-c05n01.alteeve.ca && \
ssh root@an-c05n01.bcn && \
ssh root@an-c05n01.sn && \
ssh root@an-c05n01.ifn && \
ssh root@an-c05n02 && \
ssh root@an-c05n02.alteeve.ca && \
ssh root@an-c05n02.bcn && \
ssh root@an-c05n02.sn && \
ssh root@an-c05n02.ifn The authenticity of host 'an-c05n01 (10.20.50.1)' can't be established.
RSA key fingerprint is e6:cb:50:41:88:26:c3:a5:aa:85:80:89:02:6f:ae:5e.
Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'an-c05n01,10.20.50.1' (RSA) to the list of known hosts.
Last login: Sun Dec 11 04:45:50 2011 from 10.20.255.254
[root@an-c05n01 ~]# exit logout
Connection to an-c05n01 closed. The authenticity of host 'an-c05n01.alteeve.ca (10.20.50.1)' can't be established.
RSA key fingerprint is e6:cb:50:41:88:26:c3:a5:aa:85:80:89:02:6f:ae:5e.
Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'an-c05n01.alteeve.ca' (RSA) to the list of known hosts.
Last login: Sun Dec 11 04:50:24 2011 from an-c05n01
[root@an-c05n01 ~]# exit logout
Connection to an-c05n01.alteeve.ca closed. The authenticity of host 'an-c05n01.bcn (10.20.50.1)' can't be established.
RSA key fingerprint is e6:cb:50:41:88:26:c3:a5:aa:85:80:89:02:6f:ae:5e.
Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'an-c05n01.bcn' (RSA) to the list of known hosts.
Last login: Sun Dec 11 04:51:14 2011 from an-c05n01
[root@an-c05n01 ~]# exit logout
Connection to an-c05n01.bcn closed. The authenticity of host 'an-c05n01.sn (10.10.50.1)' can't be established.
RSA key fingerprint is e6:cb:50:41:88:26:c3:a5:aa:85:80:89:02:6f:ae:5e.
Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'an-c05n01.sn,10.10.50.1' (RSA) to the list of known hosts.
Last login: Sun Dec 11 04:53:23 2011 from an-c05n01
[root@an-c05n01 ~]# exit logout
Connection to an-c05n01.sn closed. The authenticity of host 'an-c05n01.ifn (10.255.50.1)' can't be established.
RSA key fingerprint is e6:cb:50:41:88:26:c3:a5:aa:85:80:89:02:6f:ae:5e.
Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'an-c05n01.ifn,10.255.50.1' (RSA) to the list of known hosts.
Last login: Sun Dec 11 04:54:30 2011 from an-c05n01.sn
[root@an-c05n01 ~]# exit logout
Connection to an-c05n01.ifn closed. This is the connection to an-c05n02, which we established earlier when we pushed the authorized_keys, so this time we're not asked to verify the key. Last login: Sun Dec 11 05:44:40 2011 from 10.20.255.254
[root@an-c05n02 ~]# exit logout
Connection to an-c05n02 closed. Now we'll be asked to verify keys again, as only the base an-c05n02 hostname had been recorded earlier. The authenticity of host 'an-c05n02.alteeve.ca (10.20.50.2)' can't be established.
RSA key fingerprint is 04:08:37:43:6b:5c:a0:b0:f5:27:a7:46:d4:77:a3:34.
Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'an-c05n02.alteeve.ca' (RSA) to the list of known hosts.
Last login: Sun Dec 11 05:54:44 2011 from an-c05n01
[root@an-c05n02 ~]# exit logout
Connection to an-c05n02.alteeve.ca closed. The authenticity of host 'an-c05n02.bcn (10.20.50.2)' can't be established.
RSA key fingerprint is 04:08:37:43:6b:5c:a0:b0:f5:27:a7:46:d4:77:a3:34.
Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'an-c05n02.bcn' (RSA) to the list of known hosts.
Last login: Sun Dec 11 06:05:58 2011 from an-c05n01
[root@an-c05n02 ~]# exit logout
Connection to an-c05n02.bcn closed. The authenticity of host 'an-c05n02.sn (10.10.50.2)' can't be established.
RSA key fingerprint is 04:08:37:43:6b:5c:a0:b0:f5:27:a7:46:d4:77:a3:34.
Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'an-c05n02.sn,10.10.50.2' (RSA) to the list of known hosts.
Last login: Sun Dec 11 06:07:20 2011 from an-c05n01 exit logout
Connection to an-c05n02.sn closed. The authenticity of host 'an-c05n02.ifn (10.255.50.2)' can't be established.
RSA key fingerprint is 04:08:37:43:6b:5c:a0:b0:f5:27:a7:46:d4:77:a3:34.
Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'an-c05n02.ifn,10.255.50.2' (RSA) to the list of known hosts.
Last login: Sun Dec 11 06:08:11 2011 from an-c05n01.sn
[root@an-c05n02 ~]# exit logout
Connection to an-c05n02.ifn closed. Finally done! Now we can simply copy the ~/.ssh/known_hosts file to the other node. rsync -av root@an-c05n01:/root/.ssh/known_hosts ~/.ssh/ receiving incremental file list
sent 11 bytes received 41 bytes 104.00 bytes/sec
total size is 4413 speedup is 84.87 Now we can connect via SSH to either node, from either node, using any of the networks and we will not be prompted to enter a password or to verify SSH fingerprints any more. Configuring The Cluster Foundation We need to configure the cluster in two stages. This is because we have something of a chicken-and-egg problem.
Conveniently, clustering has two logical parts;
The first part, communication and membership, covers which nodes are members of the cluster, ejecting faulty nodes and other such tasks. The second part, resource management, is provided by a second tool called rgmanager. It's this second part that we will set aside for later. Disable the 'qemu' Bridge By default, libvirtd creates a bridge called virbr0 designed to connect virtual machines to the first eth0 interface. Our system will not need this, so we will remove it now. If libvirtd has started, skip to the next step. If you haven't started libvirtd yet, you can manually disable the bridge by blanking out the config file. cat /dev/null >/etc/libvirt/qemu/networks/default.xml If libvirtd has started, then you will need to first stop the bridge. virsh net-destroy default Network default destroyed To disable and remove it, run the following; virsh net-autostart default --disable Network default unmarked as autostarted virsh net-undefine default Network default has been undefined Keeping Time In Sync It is very important that time on both nodes be kept in sync. The way to do this is to set up NTP, the network time protocol. I like to use the tick.redhat.com time server, though you are free to substitute your preferred time source. First, add the timeserver to the NTP configuration file by appending the following lines to the end of it. echo server tick.redhat.com$'\n'restrict tick.redhat.com mask 255.255.255.255 nomodify notrap noquery >> /etc/ntp.conf
tail -n 4 /etc/ntp.conf # Enable writing of statistics records.
#statistics clockstats cryptostats loopstats peerstats
server tick.redhat.com
restrict tick.redhat.com mask 255.255.255.255 nomodify notrap noquery Now make sure that the ntpd service starts on boot, then start it manually. chkconfig ntpd on
/etc/init.d/ntpd start Starting ntpd: [ OK ] Configuration Methods In Red Hat Cluster Services, the heart of the cluster is found in the /etc/cluster/cluster.conf XML configuration file. There are three main ways of editing this file. Two are already well documented, so I won't bother discussing them, beyond introducing them. The third way is by directly hand-crafting the cluster.conf file. This approach is not as well documented, but directly manipulating the configuration file is my preferred method. As my boss loves to say; "The more computers do for you, the more they do to you". The first two, well documented, graphical tools are:
I do like the tools above, but I often find issues that send me back to the command line. I'd recommend setting them aside for now as well. Once you feel comfortable with cluster.conf syntax, then by all means, go back and use them. I'd recommend not becoming reliant on them though, which can happen if you start using them too early in your studies. The First cluster.conf Foundation Configuration The very first stage of building the cluster is to create a configuration file that is as minimal as possible. We're going to do this on an-c05n01 and, when we're done, copy it over to an-c05n02. Name the Cluster and Set The Configuration Version The cluster tag is the parent tag for the entire cluster configuration file. vim /etc/cluster/cluster.conf <?xml version="1.0"?>
<cluster name="an-cluster-A" config_version="1">
</cluster> The cluster element has two attributes that we need to set;
The name="" attribute defines the name of the cluster. It must be unique amongst the clusters on your network. It should be descriptive, but you will not want to make it too long, either. You will see this name in the various cluster tools and you will enter it, for example, when creating a GFS2 partition later on. This tutorial uses the cluster name an-cluster-A. The config_version="" attribute is an integer indicating the version of the configuration file. Whenever you make a change to the cluster.conf file, you will need to increment this version number by 1. If you don't increment this number, then the cluster tools will not know that the file needs to be reloaded. As this is the first version of this configuration file, it will start with 1. Note that this tutorial will increment the version after every change, regardless of whether it is explicitly pushed out to the other nodes and reloaded. The reason is to help get into the habit of always increasing this value. Configuring cman Options We are setting up a special kind of cluster, called a 2-Node cluster. This is a special case because traditional quorum will not be useful. With only two nodes, each having a vote of 1, the total number of votes is 2. Quorum needs 50% + 1, which means that a single node failure would shut down the cluster, as the remaining node's vote is 50% exactly. That kind of defeats the purpose of having a cluster at all. So to account for this special case, there is a special attribute called two_node="1". This tells the cluster manager to continue operating with only one vote. This option requires that the expected_votes="" attribute be set to 1. Normally, expected_votes is set automatically to the total sum of the defined cluster nodes' votes (each of which defaults to 1). This is the other half of the "trick", as a single node's vote of 1 now always provides quorum (that is, 1 meets the 50% + 1 requirement). In short; this disables quorum. <?xml version="1.0"?>
<cluster name="an-cluster-A" config_version="2">
<cman expected_votes="1" two_node="1" />
</cluster> Take note of the self-closing <... /> tag. This is an XML syntax that tells the parser not to look for any child tags or a closing tag. Defining Cluster Nodes This example is a little artificial; please don't load it into your cluster, as we will need to add a few child tags, but one thing at a time. This introduces two tags, the latter a child tag of the former;
The first is the parent clusternodes tag, which takes no attributes of its own. Its sole purpose is to contain the clusternode child tags, of which there will be one per node. <?xml version="1.0"?>
<cluster name="an-cluster-A" config_version="3">
<cman expected_votes="1" two_node="1" />
<clusternodes>
<clusternode name="an-c05n01.alteeve.ca" nodeid="1" />
<clusternode name="an-c05n02.alteeve.ca" nodeid="2" />
</clusternodes>
</cluster> The clusternode tag defines each cluster node. There are many attributes available, but we will look at just the two required ones. The first is the name="" attribute. The value should match the fully qualified domain name, which you can check by running uname -n on each node. This isn't strictly required, mind you, but for simplicity's sake, this is the name we will use. The cluster decides which network to use for cluster communication by resolving the name="..." value. It will take the returned IP address and try to match it to one of the IPs on the system. Once it finds a match, that becomes the network the cluster will use. In our case, an-c05n01.alteeve.ca resolves to 10.20.50.1, which is used by bond0. If you have syslinux installed, you can check this out yourself using the following command; ifconfig |grep -B 1 $(gethostip -d $(uname -n)) | grep HWaddr | awk '{ print $1 }' bond0 Please see the clusternode's name attribute document for details on how name-to-interface mapping is resolved. The second attribute is nodeid="". This must be a unique integer amongst the <clusternode ...> elements in the cluster. It is what the cluster itself uses to identify the node. Defining Fence Devices Fencing devices are used to forcibly eject a node from a cluster if it stops responding. This is generally done by forcing it to power off or reboot. Some SAN switches can logically disconnect a node from the shared storage device, a process called fabric fencing, which has the same effect of guaranteeing that the defective node can not alter the shared storage. A common, third type of fence device is one that cuts the mains power to the server. These are called PDUs and are effectively power bars where each outlet can be independently switched off over the network. In this tutorial, our nodes support IPMI, which we will use as the primary fence device. We also have an APC brand switched PDU which will act as a backup fence device.
All fence devices are contained within the parent fencedevices tag, which has no attributes of its own. Within this parent tag are one or more fencedevice child tags. <?xml version="1.0"?>
<cluster name="an-cluster-A" config_version="4">
<cman expected_votes="1" two_node="1" />
<clusternodes>
<clusternode name="an-c05n01.alteeve.ca" nodeid="1" />
<clusternode name="an-c05n02.alteeve.ca" nodeid="2" />
</clusternodes>
<fencedevices>
<fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="an-c05n01.ipmi" login="root" passwd="secret" />
<fencedevice name="ipmi_an02" agent="fence_ipmilan" ipaddr="an-c05n02.ipmi" login="root" passwd="secret" />
<fencedevice agent="fence_apc_snmp" ipaddr="pdu2.alteeve.ca" name="pdu2" />
</fencedevices>
</cluster> In our cluster, each fence device used will have its own fencedevice tag. If you are using IPMI, this means you will have a fencedevice entry for each node, as each physical IPMI BMC is a unique fence device. On the other hand, fence devices that support multiple nodes, like switched PDUs, will have just one entry. In our case, we're using both types, so we have three fence devices; The two IPMI BMCs plus the switched PDU. All fencedevice tags share two basic attributes; name="" and agent="".
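A quick way to review which name="" is tied to which agent="" is to scrape the pairs out of cluster.conf. The sketch below is a simple sed text scrape, not a real XML parser, and it assumes the attributes appear in name-then-agent order on one line; entries written agent-first, like the PDU line above, would need a second pattern. It is shown against an inline sample so it can be run anywhere; on a node you would point it at /etc/cluster/cluster.conf.

```shell
# List each fencedevice's name and agent attributes (text scrape, not XML parsing).
# Assumes name="..." comes before agent="..." on each line.
conf="$(mktemp)"
cat > "$conf" <<'EOF'
<fencedevice name="ipmi_an01" agent="fence_ipmilan" />
<fencedevice name="pdu2" agent="fence_apc_snmp" />
EOF
summary="$(sed -n 's/.*<fencedevice name="\([^"]*\)" agent="\([^"]*\)".*/\1 uses \2/p' "$conf")"
echo "$summary"
rm -f "$conf"
```
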
For those curious, the full details are described in the FenceAgentAPI. If you have two or more of the same fence device, like IPMI, then you will use the same fence agent value a corresponding number of times. Beyond these two attributes, each fence agent will have its own subset of attributes, the scope of which is outside this tutorial, though we will see examples for IPMI and a switched PDU. All fence agents have a corresponding man page that will show you what attributes it accepts and how they are used. The two fence agents we will see here have their attributes defined in the following man pages.
The example above is what this tutorial will use. Using the Fence Devices Now that we have nodes and fence devices defined, we will go back and tie them together. This is done by:
Here is how we implement IPMI as the primary fence device with the APC switched PDU as the backup method. <?xml version="1.0"?>
<cluster name="an-cluster-A" config_version="5">
<cman expected_votes="1" two_node="1" />
<clusternodes>
<clusternode name="an-c05n01.alteeve.ca" nodeid="1">
<fence>
<method name="ipmi">
<device name="ipmi_an01" action="reboot" />
</method>
<method name="pdu">
<device name="pdu2" port="1" action="reboot" />
</method>
</fence>
</clusternode>
<clusternode name="an-c05n02.alteeve.ca" nodeid="2">
<fence>
<method name="ipmi">
<device name="ipmi_an02" action="reboot" />
</method>
<method name="pdu">
<device name="pdu2" port="2" action="reboot" />
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="an-c05n01.ipmi" login="root" passwd="secret" />
<fencedevice name="ipmi_an02" agent="fence_ipmilan" ipaddr="an-c05n02.ipmi" login="root" passwd="secret" />
<fencedevice agent="fence_apc_snmp" ipaddr="pdu2.alteeve.ca" name="pdu2" />
</fencedevices>
</cluster> First, notice that the fence tag has no attributes. It's merely a parent for the method(s) child elements. There are two method elements, one for each fence device, named ipmi and pdu. These names are merely descriptive and can be whatever you feel is most appropriate. Within each method element are one or more device tags. For a given method to succeed, all defined device elements must themselves succeed. This is very useful for grouping calls to separate PDUs when dealing with nodes having redundant power supplies, as we will see in the PDU example later on. The actual fence device configuration is the final piece of the puzzle. It is here that you specify per-node configuration options and link these attributes to a given fencedevice. Here, we see the link to the fencedevice via the name, ipmi_an01 in this example. Note that the PDU definition needs a port="" attribute where the IPMI fence devices do not. These are the sorts of differences you will find, varying depending on how the fence device agent works. When a fence call is needed, the fence methods will be tried in the order they are found here. If all methods fail, the cluster will go back to the start and try again, looping indefinitely until one method succeeds.
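The try-in-order, loop-forever behaviour described above can be sketched in shell terms. Note that try_method below is a hypothetical stand-in that fails twice and then succeeds, standing in for the real fence agent calls; the sketch only shows the control flow, not a real fence.

```shell
# Sketch of the fence retry logic: try each method in order, looping forever
# until one succeeds. try_method is a stand-in for the real agent calls.
attempt=0
try_method() {
    attempt=$((attempt + 1))
    [ "$attempt" -ge 3 ]            # fail twice, then succeed, for the demo
}
while true; do
    for method in ipmi pdu; do
        if try_method "$method"; then
            echo "node fenced via method '$method'"
            break 2                 # leave both loops once a method succeeds
        fi
    done
done
```

Here the first pass fails both methods, so the outer loop runs again and the ipmi method succeeds on the second pass, just as the cluster would keep retrying until a fence call works.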
Let's step through an example fence call to help show how the per-node and fence device attributes are combined during a fence call.
When the cluster calls the fence agent, it does so by calling the fence agent script with no command-line arguments. /usr/sbin/fence_ipmilan Then it will pass the following arguments to that agent on its standard input: ipaddr=an-c05n02.ipmi
login=root
passwd=secret
action=reboot As you can see then, the first three arguments are from the fencedevice attributes and the last one is from the device attributes under an-c05n02's clusternode's fence tag. If this method fails, then the PDU will be called in a very similar way, but with an extra argument from the device attributes. /usr/sbin/fence_apc_snmp Then it will pass to that agent the following arguments: ipaddr=pdu2.alteeve.ca
port=2
action=reboot Should this fail, the cluster will go back and try the IPMI interface again. It will loop through the fence device methods forever until one of the methods succeeds. Below are snippets from other clusters using different fence device configurations which might help you build your cluster. Example <fencedevice...> Tag For IPMI
As stated above, it is critical that the acpid daemon be stopped and disabled from starting with the server. chkconfig acpid off
/etc/init.d/acpid stop
Here we will show what IPMI <fencedevice...> tags look like. ...
<clusternode name="an-c05n01.alteeve.ca" nodeid="1">
<fence>
<method name="ipmi">
<device name="ipmi_an01" action="reboot"/>
</method>
</fence>
</clusternode>
<clusternode name="an-c05n02.alteeve.ca" nodeid="2">
<fence>
<method name="ipmi">
<device name="ipmi_an02" action="reboot"/>
</method>
</fence>
</clusternode>
...
<fencedevices>
<fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="an-c05n01.ipmi" login="root" passwd="secret" />
<fencedevice name="ipmi_an02" agent="fence_ipmilan" ipaddr="an-c05n02.ipmi" login="root" passwd="secret" />
</fencedevices>
Example <fencedevice...> Tag For HP iLO Here we will show how to use iLO (Integrated Lights-Out) management devices as <fencedevice...> entries. We won't be using it ourselves, but it is quite popular as a fence device, so I wanted to show an example of its use. ...
<clusternode name="an-c05n01.alteeve.ca" nodeid="1">
<fence>
<method name="ilo">
<device action="reboot" name="ilo_an01"/>
</method>
</fence>
</clusternode>
<clusternode name="an-c05n02.alteeve.ca" nodeid="2">
<fence>
<method name="ilo">
<device action="reboot" name="ilo_an02"/>
</method>
</fence>
</clusternode>
...
<fencedevices>
<fencedevice agent="fence_ilo" ipaddr="an-c05n01.ilo" login="root" name="ilo_an01" passwd="secret"/>
<fencedevice agent="fence_ilo" ipaddr="an-c05n02.ilo" login="root" name="ilo_an02" passwd="secret"/>
</fencedevices>
Example <fencedevice...> Tag For Dell's DRAC
Here we will show how to use DRAC (Dell Remote Access Controller) management devices as <fencedevice...> entries. We won't be using it ourselves, but it is another popular fence device, so I wanted to show an example of its use. ...
<clusternode name="an-c05n01.alteeve.ca" nodeid="1">
<fence>
<method name="drac">
<device action="reboot" name="drac_an01"/>
</method>
</fence>
</clusternode>
<clusternode name="an-c05n02.alteeve.ca" nodeid="2">
<fence>
<method name="drac">
<device action="reboot" name="drac_an02"/>
</method>
</fence>
</clusternode>
...
<fencedevices>
<fencedevice agent="fence_drac5" cmd_prompt="admin1->" ipaddr="an-c05n01.drac" login="root" name="drac_an01" passwd="secret" secure="1"/>
<fencedevice agent="fence_drac5" cmd_prompt="admin1->" ipaddr="an-c05n02.drac" login="root" name="drac_an02" passwd="secret" secure="1"/>
</fencedevices>
Example <fencedevice...> Tag For APC Switched PDUs Here we will show how to configure APC switched PDU <fencedevice...> tags. There are two agents for these devices; One that uses the telnet or ssh login and one that uses SNMP. This tutorial uses the latter, and it is recommended that you do the same. The example below is from a production cluster that uses redundant power supplies and two separate PDUs. This is how you will want to configure any production clusters you build. ...
<clusternode name="an-c05n01.alteeve.ca" nodeid="1">
<fence>
<method name="pdu2">
<device action="reboot" name="pdu1" port="1"/>
<device action="reboot" name="pdu2" port="1"/>
</method>
</fence>
</clusternode>
<clusternode name="an-c05n02.alteeve.ca" nodeid="2">
<fence>
<method name="pdu2">
<device action="reboot" name="pdu1" port="2"/>
<device action="reboot" name="pdu2" port="2"/>
</method>
</fence>
</clusternode>
...
<fencedevices>
<fencedevice agent="fence_apc_snmp" ipaddr="pdu1.alteeve.ca" name="pdu1" />
<fencedevice agent="fence_apc_snmp" ipaddr="pdu2.alteeve.ca" name="pdu2" />
</fencedevices>
Give Nodes More Time To Start Clusters with more than two nodes will have to gain quorum before they can fence other nodes. As we discussed earlier though, this is not the case when using the two_node="1" attribute in the cman element. What this means in practice is that if you start the cluster on one node and then wait too long to start the cluster on the second node, the first will fence the second. The logic behind this is; When the cluster starts, it will try to talk to its fellow node and then fail. With the special two_node="1" attribute set, the cluster knows that it is allowed to start clustered services, but it has no way to say for sure what state the other node is in. It could well be online and hosting services for all it knows. So it has to proceed on the assumption that the other node is alive and using shared resources. Given that, and given that it can not talk to the other node, its only safe option is to fence the other node. Only then can it be confident that it is safe to start providing clustered services. <?xml version="1.0"?>
<cluster name="an-cluster-A" config_version="6">
<cman expected_votes="1" two_node="1" />
<clusternodes>
<clusternode name="an-c05n01.alteeve.ca" nodeid="1">
<fence>
<method name="ipmi">
<device name="ipmi_an01" action="reboot" />
</method>
<method name="pdu">
<device name="pdu2" port="1" action="reboot" />
</method>
</fence>
</clusternode>
<clusternode name="an-c05n02.alteeve.ca" nodeid="2">
<fence>
<method name="ipmi">
<device name="ipmi_an02" action="reboot" />
</method>
<method name="pdu">
<device name="pdu2" port="2" action="reboot" />
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="an-c05n01.ipmi" login="root" passwd="secret" />
<fencedevice name="ipmi_an02" agent="fence_ipmilan" ipaddr="an-c05n02.ipmi" login="root" passwd="secret" />
<fencedevice agent="fence_apc_snmp" ipaddr="pdu2.alteeve.ca" name="pdu2" />
</fencedevices>
<fence_daemon post_join_delay="30" />
</cluster> The new tag is fence_daemon, seen near the bottom of the file above. The change is made using the post_join_delay="30" attribute. By default, the cluster will declare the other node dead after just 6 seconds. The default is kept this low because the larger this value, the slower the start-up of the cluster services will be. During testing and development though, I find this value to be far too short, as it frequently leads to unnecessary fencing. Once your cluster is set up and working, it's not a bad idea to reduce this value to the lowest value with which you are comfortable. Configuring Totem There are many attributes for the totem element. For now though, we're only going to set two of them. We know that cluster communication will be travelling over our private, secured BCN network, so for the sake of simplicity, we're going to disable encryption. We are also offering network redundancy using the bonding drivers, so we're also going to disable totem's redundant ring protocol. <?xml version="1.0"?>
<cluster name="an-cluster-A" config_version="7">
<cman expected_votes="1" two_node="1" />
<clusternodes>
<clusternode name="an-c05n01.alteeve.ca" nodeid="1">
<fence>
<method name="ipmi">
<device name="ipmi_an01" action="reboot" />
</method>
<method name="pdu">
<device name="pdu2" port="1" action="reboot" />
</method>
</fence>
</clusternode>
<clusternode name="an-c05n02.alteeve.ca" nodeid="2">
<fence>
<method name="ipmi">
<device name="ipmi_an02" action="reboot" />
</method>
<method name="pdu">
<device name="pdu2" port="2" action="reboot" />
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="an-c05n01.ipmi" login="root" passwd="secret" />
<fencedevice name="ipmi_an02" agent="fence_ipmilan" ipaddr="an-c05n02.ipmi" login="root" passwd="secret" />
<fencedevice agent="fence_apc_snmp" ipaddr="pdu2.alteeve.ca" name="pdu2" />
</fencedevices>
<fence_daemon post_join_delay="30" />
<totem rrp_mode="none" secauth="off"/>
</cluster>
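Remember that every edit like the one above must bump config_version before it can be pushed out. The increment can be scripted; below is a minimal sed sketch, assuming the attribute appears exactly once and on one line, as it does throughout this tutorial. It is shown against an inline sample rather than the live /etc/cluster/cluster.conf.

```shell
# Read config_version="N" from a cluster.conf-style file and bump it to N+1.
# Assumes the attribute appears once, on one line, as in this tutorial.
conf="$(mktemp)"
cat > "$conf" <<'EOF'
<?xml version="1.0"?>
<cluster name="an-cluster-A" config_version="7">
</cluster>
EOF
cur="$(sed -n 's/.*config_version="\([0-9]*\)".*/\1/p' "$conf")"
sed -i "s/config_version=\"$cur\"/config_version=\"$((cur + 1))\"/" "$conf"
new="$(sed -n 's/.*config_version="\([0-9]*\)".*/\1/p' "$conf")"
echo "config_version: $cur -> $new"
rm -f "$conf"
```

Run against a copy of the real file, this saves you from the easy-to-forget manual edit; always re-validate the file afterwards.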
RRP is an optional second ring that can be used for cluster communication in the case of a breakdown in the first ring. We will not be using it here. If you wish to explore it further, please take a look at the clusternode element tag called <altname...>. When altname is used, the rrp_mode attribute will need to be changed to either active or passive (the details of which are outside the scope of this tutorial). The second option we're looking at here is the secauth="off" attribute. This controls whether the cluster communications are encrypted or not. We can safely disable this because we're working on a known-private network, which yields two benefits; It's simpler to set up and it's a lot faster. If you must encrypt the cluster communications, then you can do so here. The details are also outside the scope of this tutorial though. Validating and Pushing the /etc/cluster/cluster.conf File One of the most noticeable changes in RHCS cluster stable 3 is that we no longer have to make a long, cryptic xmllint call to validate our cluster configuration. Now we can simply call ccs_config_validate. ccs_config_validate Configuration validates If there was a problem, you need to go back and fix it. DO NOT proceed until your configuration validates. Once it does, we're ready to move on! With it validated, we need to push it to the other node. As the cluster is not running yet, we will push it out using rsync. rsync -av /etc/cluster/cluster.conf root@an-c05n02:/etc/cluster/ sending incremental file list
cluster.conf
sent 1198 bytes received 31 bytes 2458.00 bytes/sec
total size is 1118 speedup is 0.91 Setting Up ricci Another change from RHCS stable 2 is how configuration changes are propagated. Before, after a change, we'd push out the updated cluster configuration by calling ccs_tool update /etc/cluster/cluster.conf. Now this is done with cman_tool version -r. More fundamentally though, the cluster needs to authenticate against each node and does this using the local ricci system user. The user has no password initially, so we need to set one. On both nodes: passwd ricci Changing password for user ricci.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.

You will need to enter this password once from each node against the other node. We will see this later.

Now make sure that the ricci daemon is set to start on boot and is running now.

chkconfig ricci on
chkconfig --list ricci
ricci           0:off   1:off   2:on    3:on    4:on    5:on    6:off

Now start it up.

/etc/init.d/ricci start
Starting ricci: [ OK ]
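If you'd like to script this check, the runlevel states can be read straight from chkconfig --list output. The following is a minimal sketch, not part of the cluster stack itself; the sample line is embedded so the logic can be followed on its own, and on a real node you would substitute the live command's output.

```shell
# Minimal sketch: confirm a daemon is enabled for runlevels 3 and 5 by
# checking `chkconfig --list`-style output. The sample line below is
# embedded for illustration; on a node, use: line=$(chkconfig --list ricci)
line="ricci           0:off   1:off   2:on    3:on    4:on    5:on    6:off"
case "$line" in
    *3:on*5:on*) echo "ricci will start on boot" ;;
    *)           echo "ricci is NOT enabled; run: chkconfig ricci on" ;;
esac
```

The same pattern works for modclusterd, or any other init-script daemon you want to verify.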
We also need the modclusterd daemon to start on boot.

chkconfig modclusterd on
chkconfig --list modclusterd
modclusterd     0:off   1:off   2:on    3:on    4:on    5:on    6:off

Now start it up.

/etc/init.d/modclusterd start
Starting Cluster Module - cluster monitor: Setting verbosity level to LogBasic
[ OK ]

Starting the Cluster for the First Time

It's a good idea to open a second terminal on either node and tail the /var/log/messages syslog file. All cluster messages will be recorded here, and it will help you debug problems if you can watch the logs. To do this, run the following in the new terminal window;

clear; tail -f -n 0 /var/log/messages

This will clear the screen and start watching for new lines to be written to syslog. When you are done watching syslog, press the <ctrl> + c key combination.

How you lay out your terminal windows is, obviously, up to your own preferences. Below is a configuration I have found very useful.

With the terminals set up, let's start the cluster!
On both nodes, run: /etc/init.d/cman start Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Starting gfs_controld... [ OK ]
Unfencing self... [ OK ]
Joining fence domain... [ OK ]

Here is what you should see in syslog:

Dec 13 12:08:44 an-c05n01 kernel: DLM (built Nov 9 2011 08:04:11) installed
Dec 13 12:08:45 an-c05n01 corosync[3434]: [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
Dec 13 12:08:45 an-c05n01 corosync[3434]: [MAIN ] Corosync built-in features: nss dbus rdma snmp
Dec 13 12:08:45 an-c05n01 corosync[3434]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf
Dec 13 12:08:45 an-c05n01 corosync[3434]: [MAIN ] Successfully parsed cman config
Dec 13 12:08:45 an-c05n01 corosync[3434]: [TOTEM ] Initializing transport (UDP/IP Multicast).
Dec 13 12:08:45 an-c05n01 corosync[3434]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Dec 13 12:08:46 an-c05n01 corosync[3434]: [TOTEM ] The network interface [10.20.50.1] is now up.
Dec 13 12:08:46 an-c05n01 corosync[3434]: [QUORUM] Using quorum provider quorum_cman
Dec 13 12:08:46 an-c05n01 corosync[3434]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Dec 13 12:08:46 an-c05n01 corosync[3434]: [CMAN ] CMAN 3.0.12.1 (built Sep 30 2011 03:17:43) started
Dec 13 12:08:46 an-c05n01 corosync[3434]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90
Dec 13 12:08:46 an-c05n01 corosync[3434]: [SERV ] Service engine loaded: openais checkpoint service B.01.01
Dec 13 12:08:46 an-c05n01 corosync[3434]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
Dec 13 12:08:46 an-c05n01 corosync[3434]: [SERV ] Service engine loaded: corosync configuration service
Dec 13 12:08:46 an-c05n01 corosync[3434]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
Dec 13 12:08:46 an-c05n01 corosync[3434]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
Dec 13 12:08:46 an-c05n01 corosync[3434]: [SERV ] Service engine loaded: corosync profile loading service
Dec 13 12:08:46 an-c05n01 corosync[3434]: [QUORUM] Using quorum provider quorum_cman
Dec 13 12:08:46 an-c05n01 corosync[3434]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Dec 13 12:08:46 an-c05n01 corosync[3434]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Dec 13 12:08:46 an-c05n01 corosync[3434]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 13 12:08:46 an-c05n01 corosync[3434]: [CMAN ] quorum regained, resuming activity
Dec 13 12:08:46 an-c05n01 corosync[3434]: [QUORUM] This node is within the primary component and will provide service.
Dec 13 12:08:46 an-c05n01 corosync[3434]: [QUORUM] Members[1]: 1
Dec 13 12:08:46 an-c05n01 corosync[3434]: [QUORUM] Members[1]: 1
Dec 13 12:08:46 an-c05n01 corosync[3434]: [CPG ] chosen downlist: sender r(0) ip(10.20.50.1) ; members(old:0 left:0)
Dec 13 12:08:46 an-c05n01 corosync[3434]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 13 12:08:47 an-c05n01 corosync[3434]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 13 12:08:47 an-c05n01 corosync[3434]: [QUORUM] Members[2]: 1 2
Dec 13 12:08:47 an-c05n01 corosync[3434]: [QUORUM] Members[2]: 1 2
Dec 13 12:08:47 an-c05n01 corosync[3434]: [CPG ] chosen downlist: sender r(0) ip(10.20.50.1) ; members(old:1 left:0)
Dec 13 12:08:47 an-c05n01 corosync[3434]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 13 12:08:49 an-c05n01 fenced[3490]: fenced 3.0.12.1 started
Dec 13 12:08:49 an-c05n01 dlm_controld[3515]: dlm_controld 3.0.12.1 started
Dec 13 12:08:51 an-c05n01 gfs_controld[3565]: gfs_controld 3.0.12.1 started
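That is a lot of log output, and the lines that really matter are the quorum ones. As a quick sanity filter, you can grep for them. This is just an illustrative sketch; the here-document holds two sample lines from the log above, and on a live node you would run the same grep against /var/log/messages.

```shell
# Filter a syslog excerpt for the quorum milestones. Sample lines are
# embedded here for illustration; on a live node, run something like:
#   grep -E 'quorum regained|Members\[2\]' /var/log/messages
grep -E 'quorum regained|Members\[2\]' <<'EOF'
Dec 13 12:08:46 an-c05n01 corosync[3434]: [CMAN ] quorum regained, resuming activity
Dec 13 12:08:47 an-c05n01 corosync[3434]: [QUORUM] Members[2]: 1 2
EOF
```

Seeing both lines tells you that quorum was regained and that both members joined the cluster.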
Now, to confirm that the cluster is operating properly, run cman_tool status;

cman_tool status
Version: 6.2.0
Config Version: 7
Cluster Name: an-cluster-A
Cluster Id: 24561
Cluster Member: Yes
Cluster Generation: 8
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Node votes: 1
Quorum: 1
Active subsystems: 7
Flags: 2node
Ports Bound: 0
Node name: an-c05n01.alteeve.ca
Node ID: 1
Multicast addresses: 239.192.95.81
Node addresses: 10.20.50.1

We can see that both nodes are talking because of the Nodes: 2 entry. If you ever want to see the nitty-gritty configuration, you can run corosync-objctl.

corosync-objctl
cluster.name=an-cluster-A
cluster.config_version=7
cluster.cman.expected_votes=1
cluster.cman.two_node=1
cluster.cman.nodename=an-c05n01.alteeve.ca
cluster.cman.cluster_id=24561
cluster.clusternodes.clusternode.name=an-c05n01.alteeve.ca
cluster.clusternodes.clusternode.nodeid=1
cluster.clusternodes.clusternode.fence.method.name=ipmi
cluster.clusternodes.clusternode.fence.method.device.name=ipmi_an01
cluster.clusternodes.clusternode.fence.method.device.action=reboot
cluster.clusternodes.clusternode.fence.method.name=pdu
cluster.clusternodes.clusternode.fence.method.device.name=pdu2
cluster.clusternodes.clusternode.fence.method.device.port=1
cluster.clusternodes.clusternode.fence.method.device.action=reboot
cluster.clusternodes.clusternode.name=an-c05n02.alteeve.ca
cluster.clusternodes.clusternode.nodeid=2
cluster.clusternodes.clusternode.fence.method.name=ipmi
cluster.clusternodes.clusternode.fence.method.device.name=ipmi_an02
cluster.clusternodes.clusternode.fence.method.device.action=reboot
cluster.clusternodes.clusternode.fence.method.name=pdu
cluster.clusternodes.clusternode.fence.method.device.name=pdu2
cluster.clusternodes.clusternode.fence.method.device.port=2
cluster.clusternodes.clusternode.fence.method.device.action=reboot
cluster.fencedevices.fencedevice.name=ipmi_an01
cluster.fencedevices.fencedevice.agent=fence_ipmilan
cluster.fencedevices.fencedevice.ipaddr=an-c05n01.ipmi
cluster.fencedevices.fencedevice.login=root
cluster.fencedevices.fencedevice.passwd=secret
cluster.fencedevices.fencedevice.name=ipmi_an02
cluster.fencedevices.fencedevice.agent=fence_ipmilan
cluster.fencedevices.fencedevice.ipaddr=an-c05n02.ipmi
cluster.fencedevices.fencedevice.login=root
cluster.fencedevices.fencedevice.passwd=secret
cluster.fencedevices.fencedevice.agent=fence_apc_snmp
cluster.fencedevices.fencedevice.ipaddr=pdu2.alteeve.ca
cluster.fencedevices.fencedevice.name=pdu2
cluster.fence_daemon.post_join_delay=30
cluster.totem.rrp_mode=none
cluster.totem.secauth=off
totem.rrp_mode=none
totem.secauth=off
totem.transport=udp
totem.version=2
totem.nodeid=1
totem.vsftype=none
totem.token=10000
totem.join=60
totem.fail_recv_const=2500
totem.consensus=2000
totem.key=an-cluster-A
totem.interface.ringnumber=0
totem.interface.bindnetaddr=10.20.50.1
totem.interface.mcastaddr=239.192.95.81
totem.interface.mcastport=5405
libccs.next_handle=7
libccs.connection.ccs_handle=3
libccs.connection.config_version=7
libccs.connection.fullxpath=0
libccs.connection.ccs_handle=4
libccs.connection.config_version=7
libccs.connection.fullxpath=0
libccs.connection.ccs_handle=5
libccs.connection.config_version=7
libccs.connection.fullxpath=0
logging.timestamp=on
logging.to_logfile=yes
logging.logfile=/var/log/cluster/corosync.log
logging.logfile_priority=info
logging.to_syslog=yes
logging.syslog_facility=local4
logging.syslog_priority=info
aisexec.user=ais
aisexec.group=ais
service.name=corosync_quorum
service.ver=0
service.name=corosync_cman
service.ver=0
quorum.provider=quorum_cman
service.name=openais_ckpt
service.ver=0
runtime.services.quorum.service_id=12
runtime.services.cman.service_id=9
runtime.services.ckpt.service_id=3
runtime.services.ckpt.0.tx=0
runtime.services.ckpt.0.rx=0
runtime.services.ckpt.1.tx=0
runtime.services.ckpt.1.rx=0
runtime.services.ckpt.2.tx=0
runtime.services.ckpt.2.rx=0
runtime.services.ckpt.3.tx=0
runtime.services.ckpt.3.rx=0
runtime.services.ckpt.4.tx=0
runtime.services.ckpt.4.rx=0
runtime.services.ckpt.5.tx=0
runtime.services.ckpt.5.rx=0
runtime.services.ckpt.6.tx=0
runtime.services.ckpt.6.rx=0
runtime.services.ckpt.7.tx=0
runtime.services.ckpt.7.rx=0
runtime.services.ckpt.8.tx=0
runtime.services.ckpt.8.rx=0
runtime.services.ckpt.9.tx=0
runtime.services.ckpt.9.rx=0
runtime.services.ckpt.10.tx=0
runtime.services.ckpt.10.rx=0
runtime.services.ckpt.11.tx=2
runtime.services.ckpt.11.rx=3
runtime.services.ckpt.12.tx=0
runtime.services.ckpt.12.rx=0
runtime.services.ckpt.13.tx=0
runtime.services.ckpt.13.rx=0
runtime.services.evs.service_id=0
runtime.services.evs.0.tx=0
runtime.services.evs.0.rx=0
runtime.services.cfg.service_id=7
runtime.services.cfg.0.tx=0
runtime.services.cfg.0.rx=0
runtime.services.cfg.1.tx=0
runtime.services.cfg.1.rx=0
runtime.services.cfg.2.tx=0
runtime.services.cfg.2.rx=0
runtime.services.cfg.3.tx=0
runtime.services.cfg.3.rx=0
runtime.services.cpg.service_id=8
runtime.services.cpg.0.tx=4
runtime.services.cpg.0.rx=8
runtime.services.cpg.1.tx=0
runtime.services.cpg.1.rx=0
runtime.services.cpg.2.tx=0
runtime.services.cpg.2.rx=0
runtime.services.cpg.3.tx=16
runtime.services.cpg.3.rx=23
runtime.services.cpg.4.tx=0
runtime.services.cpg.4.rx=0
runtime.services.cpg.5.tx=2
runtime.services.cpg.5.rx=3
runtime.services.confdb.service_id=11
runtime.services.pload.service_id=13
runtime.services.pload.0.tx=0
runtime.services.pload.0.rx=0
runtime.services.pload.1.tx=0
runtime.services.pload.1.rx=0
runtime.services.quorum.service_id=12
runtime.connections.active=6
runtime.connections.closed=110
runtime.connections.fenced:CPG:3490:19.service_id=8
runtime.connections.fenced:CPG:3490:19.client_pid=3490
runtime.connections.fenced:CPG:3490:19.responses=5
runtime.connections.fenced:CPG:3490:19.dispatched=9
runtime.connections.fenced:CPG:3490:19.requests=5
runtime.connections.fenced:CPG:3490:19.sem_retry_count=0
runtime.connections.fenced:CPG:3490:19.send_retry_count=0
runtime.connections.fenced:CPG:3490:19.recv_retry_count=0
runtime.connections.fenced:CPG:3490:19.flow_control=0
runtime.connections.fenced:CPG:3490:19.flow_control_count=0
runtime.connections.fenced:CPG:3490:19.queue_size=0
runtime.connections.fenced:CPG:3490:19.invalid_request=0
runtime.connections.fenced:CPG:3490:19.overload=0
runtime.connections.dlm_controld:CPG:3515:22.service_id=8
runtime.connections.dlm_controld:CPG:3515:22.client_pid=3515
runtime.connections.dlm_controld:CPG:3515:22.responses=5
runtime.connections.dlm_controld:CPG:3515:22.dispatched=8
runtime.connections.dlm_controld:CPG:3515:22.requests=5
runtime.connections.dlm_controld:CPG:3515:22.sem_retry_count=0
runtime.connections.dlm_controld:CPG:3515:22.send_retry_count=0
runtime.connections.dlm_controld:CPG:3515:22.recv_retry_count=0
runtime.connections.dlm_controld:CPG:3515:22.flow_control=0
runtime.connections.dlm_controld:CPG:3515:22.flow_control_count=0
runtime.connections.dlm_controld:CPG:3515:22.queue_size=0
runtime.connections.dlm_controld:CPG:3515:22.invalid_request=0
runtime.connections.dlm_controld:CPG:3515:22.overload=0
runtime.connections.dlm_controld:CKPT:3515:23.service_id=3
runtime.connections.dlm_controld:CKPT:3515:23.client_pid=3515
runtime.connections.dlm_controld:CKPT:3515:23.responses=0
runtime.connections.dlm_controld:CKPT:3515:23.dispatched=0
runtime.connections.dlm_controld:CKPT:3515:23.requests=0
runtime.connections.dlm_controld:CKPT:3515:23.sem_retry_count=0
runtime.connections.dlm_controld:CKPT:3515:23.send_retry_count=0
runtime.connections.dlm_controld:CKPT:3515:23.recv_retry_count=0
runtime.connections.dlm_controld:CKPT:3515:23.flow_control=0
runtime.connections.dlm_controld:CKPT:3515:23.flow_control_count=0
runtime.connections.dlm_controld:CKPT:3515:23.queue_size=0
runtime.connections.dlm_controld:CKPT:3515:23.invalid_request=0
runtime.connections.dlm_controld:CKPT:3515:23.overload=0
runtime.connections.gfs_controld:CPG:3565:26.service_id=8
runtime.connections.gfs_controld:CPG:3565:26.client_pid=3565
runtime.connections.gfs_controld:CPG:3565:26.responses=5
runtime.connections.gfs_controld:CPG:3565:26.dispatched=8
runtime.connections.gfs_controld:CPG:3565:26.requests=5
runtime.connections.gfs_controld:CPG:3565:26.sem_retry_count=0
runtime.connections.gfs_controld:CPG:3565:26.send_retry_count=0
runtime.connections.gfs_controld:CPG:3565:26.recv_retry_count=0
runtime.connections.gfs_controld:CPG:3565:26.flow_control=0
runtime.connections.gfs_controld:CPG:3565:26.flow_control_count=0
runtime.connections.gfs_controld:CPG:3565:26.queue_size=0
runtime.connections.gfs_controld:CPG:3565:26.invalid_request=0
runtime.connections.gfs_controld:CPG:3565:26.overload=0
runtime.connections.fenced:CPG:3490:28.service_id=8
runtime.connections.fenced:CPG:3490:28.client_pid=3490
runtime.connections.fenced:CPG:3490:28.responses=5
runtime.connections.fenced:CPG:3490:28.dispatched=8
runtime.connections.fenced:CPG:3490:28.requests=5
runtime.connections.fenced:CPG:3490:28.sem_retry_count=0
runtime.connections.fenced:CPG:3490:28.send_retry_count=0
runtime.connections.fenced:CPG:3490:28.recv_retry_count=0
runtime.connections.fenced:CPG:3490:28.flow_control=0
runtime.connections.fenced:CPG:3490:28.flow_control_count=0
runtime.connections.fenced:CPG:3490:28.queue_size=0
runtime.connections.fenced:CPG:3490:28.invalid_request=0
runtime.connections.fenced:CPG:3490:28.overload=0
runtime.connections.corosync-objctl:CONFDB:3698:27.service_id=11
runtime.connections.corosync-objctl:CONFDB:3698:27.client_pid=3698
runtime.connections.corosync-objctl:CONFDB:3698:27.responses=444
runtime.connections.corosync-objctl:CONFDB:3698:27.dispatched=0
runtime.connections.corosync-objctl:CONFDB:3698:27.requests=447
runtime.connections.corosync-objctl:CONFDB:3698:27.sem_retry_count=0
runtime.connections.corosync-objctl:CONFDB:3698:27.send_retry_count=0
runtime.connections.corosync-objctl:CONFDB:3698:27.recv_retry_count=0
runtime.connections.corosync-objctl:CONFDB:3698:27.flow_control=0
runtime.connections.corosync-objctl:CONFDB:3698:27.flow_control_count=0
runtime.connections.corosync-objctl:CONFDB:3698:27.queue_size=0
runtime.connections.corosync-objctl:CONFDB:3698:27.invalid_request=0
runtime.connections.corosync-objctl:CONFDB:3698:27.overload=0
runtime.totem.pg.msg_reserved=1
runtime.totem.pg.msg_queue_avail=761
runtime.totem.pg.mrp.srp.orf_token_tx=2
runtime.totem.pg.mrp.srp.orf_token_rx=405
runtime.totem.pg.mrp.srp.memb_merge_detect_tx=53
runtime.totem.pg.mrp.srp.memb_merge_detect_rx=53
runtime.totem.pg.mrp.srp.memb_join_tx=3
runtime.totem.pg.mrp.srp.memb_join_rx=5
runtime.totem.pg.mrp.srp.mcast_tx=45
runtime.totem.pg.mrp.srp.mcast_retx=0
runtime.totem.pg.mrp.srp.mcast_rx=56
runtime.totem.pg.mrp.srp.memb_commit_token_tx=4
runtime.totem.pg.mrp.srp.memb_commit_token_rx=4
runtime.totem.pg.mrp.srp.token_hold_cancel_tx=4
runtime.totem.pg.mrp.srp.token_hold_cancel_rx=7
runtime.totem.pg.mrp.srp.operational_entered=2
runtime.totem.pg.mrp.srp.operational_token_lost=0
runtime.totem.pg.mrp.srp.gather_entered=2
runtime.totem.pg.mrp.srp.gather_token_lost=0
runtime.totem.pg.mrp.srp.commit_entered=2
runtime.totem.pg.mrp.srp.commit_token_lost=0
runtime.totem.pg.mrp.srp.recovery_entered=2
runtime.totem.pg.mrp.srp.recovery_token_lost=0
runtime.totem.pg.mrp.srp.consensus_timeouts=0
runtime.totem.pg.mrp.srp.mtt_rx_token=913
runtime.totem.pg.mrp.srp.avg_token_workload=0
runtime.totem.pg.mrp.srp.avg_backlog_calc=0
runtime.totem.pg.mrp.srp.rx_msg_dropped=0
runtime.totem.pg.mrp.srp.continuous_gather=0
runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure=0
runtime.totem.pg.mrp.srp.members.1.ip=r(0) ip(10.20.50.1)
runtime.totem.pg.mrp.srp.members.1.join_count=1
runtime.totem.pg.mrp.srp.members.1.status=joined
runtime.totem.pg.mrp.srp.members.2.ip=r(0) ip(10.20.50.2)
runtime.totem.pg.mrp.srp.members.2.join_count=1
runtime.totem.pg.mrp.srp.members.2.status=joined
runtime.blackbox.dump_flight_data=no
runtime.blackbox.dump_state=no
cman_private.COROSYNC_DEFAULT_CONFIG_IFACE=xmlconfig:cmanpreconfig

If you want to see which DLM lockspaces exist, you can use dlm_tool ls to list them. Given that we're not running any resources or clustered file systems yet, there won't be any at this time. We'll look at this again later.

Testing Fencing

We need to thoroughly test our fence configuration and devices before we proceed. Should the cluster call a fence and the fence call fail, the cluster will hang until the fence finally succeeds. There is no way to abort a fence, so a failed fence call can effectively hang the cluster. If we have problems, we need to find them now. We need to run two tests from each node against the other node, for a total of four tests.
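Before each test, it is worth confirming that both nodes are cluster members again. The check below is a hedged sketch that parses the Nodes line from cman_tool status; the sample output is embedded so the parsing can be read in isolation, and on a node you would pipe the real command through the same awk filter.

```shell
# Hedged sketch: verify both nodes are members before a fence test.
# Sample `cman_tool status` output is embedded for illustration; on a
# node, use: nodes=$(cman_tool status | awk '/^Nodes:/ {print $2}')
status="Nodes: 2
Expected votes: 1
Total votes: 2"
nodes=$(printf '%s\n' "$status" | awk '/^Nodes:/ {print $2}')
if [ "$nodes" = "2" ]; then
    echo "both nodes are cluster members; safe to start a fence test"
else
    echo "only $nodes node(s) present; recover the cluster first"
fi
```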
After the lost node is recovered, remember to restart cman before starting the next test.

Hanging an-c05n01

Be sure to be tailing /var/log/messages on an-c05n02. Go to an-c05n01's first terminal and run the following command.
On an-c05n01 run:

echo c > /proc/sysrq-trigger

On an-c05n02's syslog terminal, you should see the following entries in the log.

Dec 13 12:42:39 an-c05n02 corosync[2758]: [TOTEM ] A processor failed, forming new configuration.
Dec 13 12:42:41 an-c05n02 corosync[2758]: [QUORUM] Members[1]: 2
Dec 13 12:42:41 an-c05n02 corosync[2758]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 13 12:42:41 an-c05n02 corosync[2758]: [CPG ] chosen downlist: sender r(0) ip(10.20.50.2) ; members(old:2 left:1)
Dec 13 12:42:41 an-c05n02 corosync[2758]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 13 12:42:41 an-c05n02 kernel: dlm: closing connection to node 1
Dec 13 12:42:41 an-c05n02 fenced[2817]: fencing node an-c05n01.alteeve.ca
Dec 13 12:42:56 an-c05n02 fenced[2817]: fence an-c05n01.alteeve.ca success

Perfect! If you are watching an-c05n01's display, you should now see it starting to boot back up.
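On a busy system, the fence result can be easy to miss in syslog. The sketch below pulls the fenced node's name out of fenced's success line; the two sample lines are embedded purely for illustration, and on a live node you would run the same sed against /var/log/messages.

```shell
# Hedged sketch: extract the fenced node's name from fenced's syslog
# lines. Sample lines are embedded; on a node, point sed at
# /var/log/messages instead of the here-document.
sed -n 's/^.*fence \([^ ]*\) success$/\1 was fenced successfully/p' <<'EOF'
Dec 13 12:42:41 an-c05n02 fenced[2817]: fencing node an-c05n01.alteeve.ca
Dec 13 12:42:56 an-c05n02 fenced[2817]: fence an-c05n01.alteeve.ca success
EOF
```

Only the "fence ... success" line matches, so a result printed here means the fence actually completed, not just that it was attempted.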
Cutting the Power to an-c05n01

As was discussed earlier, IPMI and other out-of-band management interfaces have a fatal flaw as a fence device: their BMC draws its power from the same power supply as the node itself. Thus, when the power supply itself fails (or the mains connection is pulled or tripped over), fencing via IPMI will fail. This makes the power supply a single point of failure, which is what the PDU protects us against. So, to simulate a failed power supply, we're going to use an-c05n02's fence_apc_snmp fence agent to turn off the power to an-c05n01.

Alternatively, you could just unplug the power and the fence would still succeed. The fence call only needs to confirm that the node is off in order to succeed. Whether the node restarts afterward is not important as far as the cluster is concerned.

From an-c05n02, pull the power on an-c05n01 with the following call;

fence_apc_snmp -a pdu2.alteeve.ca -n 1 -o off
Success: Powered OFF

Back on an-c05n02's syslog, we should see the following entries;

Dec 13 12:45:46 an-c05n02 corosync[2758]: [TOTEM ] A processor failed, forming new configuration.
Dec 13 12:45:48 an-c05n02 corosync[2758]: [QUORUM] Members[1]: 2
Dec 13 12:45:48 an-c05n02 corosync[2758]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 13 12:45:48 an-c05n02 corosync[2758]: [CPG ] chosen downlist: sender r(0) ip(10.20.50.2) ; members(old:2 left:1)
Dec 13 12:45:48 an-c05n02 corosync[2758]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 13 12:45:48 an-c05n02 kernel: dlm: closing connection to node 1
Dec 13 12:45:48 an-c05n02 fenced[2817]: fencing node an-c05n01.alteeve.ca
Dec 13 12:46:08 an-c05n02 fenced[2817]: fence an-c05n01.alteeve.ca dev 0.0 agent fence_ipmilan result: error from agent
Dec 13 12:46:08 an-c05n02 fenced[2817]: fence an-c05n01.alteeve.ca success

Hoozah! Notice that there is an error from fence_ipmilan. This is exactly what we expected, because the IPMI BMC lost power and couldn't respond. So now we know that an-c05n01 can be fenced successfully with both fence devices. Now we need to run the same tests against an-c05n02.

Hanging an-c05n02
Be sure to be tailing /var/log/messages on an-c05n01. Go to an-c05n02's first terminal and run the following command.
On an-c05n02 run:

echo c > /proc/sysrq-trigger

On an-c05n01's syslog terminal, you should see the following entries in the log.

Dec 13 12:52:34 an-c05n01 corosync[3445]: [TOTEM ] A processor failed, forming new configuration.
Dec 13 12:52:36 an-c05n01 corosync[3445]: [QUORUM] Members[1]: 1
Dec 13 12:52:36 an-c05n01 corosync[3445]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 13 12:52:36 an-c05n01 corosync[3445]: [CPG ] chosen downlist: sender r(0) ip(10.20.50.1) ; members(old:2 left:1)
Dec 13 12:52:36 an-c05n01 corosync[3445]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 13 12:52:36 an-c05n01 kernel: dlm: closing connection to node 2
Dec 13 12:52:36 an-c05n01 fenced[3501]: fencing node an-c05n02.alteeve.ca
Dec 13 12:52:51 an-c05n01 fenced[3501]: fence an-c05n02.alteeve.ca success

Again, perfect!

Cutting the Power to an-c05n02

From an-c05n01, pull the power on an-c05n02 with the following call;

fence_apc_snmp -a pdu2.alteeve.ca -n 2 -o off
Success: Powered OFF

Back on an-c05n01's syslog, we should see the following entries;

Dec 13 12:55:58 an-c05n01 corosync[3445]: [TOTEM ] A processor failed, forming new configuration.
Dec 13 12:56:00 an-c05n01 corosync[3445]: [QUORUM] Members[1]: 1
Dec 13 12:56:00 an-c05n01 corosync[3445]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 13 12:56:00 an-c05n01 corosync[3445]: [CPG ] chosen downlist: sender r(0) ip(10.20.50.1) ; members(old:2 left:1)
Dec 13 12:56:00 an-c05n01 kernel: dlm: closing connection to node 2
Dec 13 12:56:00 an-c05n01 corosync[3445]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 13 12:56:00 an-c05n01 fenced[3501]: fencing node an-c05n02.alteeve.ca
Dec 13 12:56:20 an-c05n01 fenced[3501]: fence an-c05n02.alteeve.ca dev 0.0 agent fence_ipmilan result: error from agent
Dec 13 12:56:20 an-c05n01 fenced[3501]: fence an-c05n02.alteeve.ca success

Woot! Only now can we safely say that our fencing is set up and working properly.

Testing Network Redundancy

Next up on the testing block is our network configuration. Seeing as we've built our bonds, we now need to test that they are working properly.
First, we'll test all network cables individually, one node and one bonded interface at a time.
Once all bonds have been tested, we'll do a final test by failing the primary switch.
If all of these steps pass and the cluster doesn't partition, then you can be confident that your network is configured properly for full redundancy.

Network Testing Terminal Layout

If you have a couple of monitors, particularly one with portrait mode, you might be able to open 16 terminals at once. This is how many are needed to run ping floods, watch the bond status files, tail syslog and watch cman_tool all at the same time. This configuration makes it very easy to keep a near real-time, complete view of all network components.

On the left window, the top-left terminal shows watch cman_tool status and the top-right terminal shows tail -f -n 0 /var/log/messages for an-c05n01. The bottom two terminals show the same for an-c05n02. On the right, portrait-mode window, the terminal layout used for monitoring the bonded link status and ping floods is shown. There are two columns; an-c05n01 on the left and an-c05n02 on the right. Each column is stacked into six rows: bond0 on the top followed by ping -f an-c05n02.bcn, bond1 in the middle followed by ping -f an-c05n02.sn, and bond2 at the bottom followed by ping -f an-c05n02.ifn.

How to Know if the Tests Passed

The most obvious answer to this question is that the cluster is still working after a switch is powered off. We can be a little more subtle than that, though. The state of each bond is viewable by looking in the special /proc/net/bonding/bondX files, where X is the bond number. Let's take a look at bond0 on an-c05n01.

cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth0 (primary_reselect always)
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 120000
Down Delay (ms): 0
Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:e0:81:c7:ec:49
Slave queue ID: 0
Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1b:21:9d:59:fc
Slave queue ID: 0

We can see that the currently active interface is eth0. This is the key bit we're going to be watching for in these tests. I know that eth0 on an-c05n01 is connected to the first switch. So when I pull the cable to that switch, or when I fail that switch entirely, I should see eth3 take over. We'll also be watching syslog; if things work right, we should not see any messages from the cluster during failure and recovery.

Failing The First Interface

Let's look at the first test. We'll fail an-c05n01's eth0 interface by pulling its cable. On an-c05n01's syslog, you will see;

Dec 13 14:03:19 an-c05n01 kernel: e1000e: eth0 NIC Link is Down
Dec 13 14:03:19 an-c05n01 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Dec 13 14:03:19 an-c05n01 kernel: bonding: bond0: making interface eth3 the new active one.

Looking again at an-c05n01's bond0 status;

cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth0 (primary_reselect always)
Currently Active Slave: eth3
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 120000
Down Delay (ms): 0
Slave Interface: eth0
MII Status: down
Link Failure Count: 1
Permanent HW addr: 00:e0:81:c7:ec:49
Slave queue ID: 0
Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1b:21:9d:59:fc
Slave queue ID: 0

We can see now that eth0 is down and that eth3 has taken over. If you look at the windows running the ping floods, both an-c05n01 and an-c05n02 should show nearly the same number of lost packets;

PING an-c05n02 (10.20.50.2) 56(84) bytes of data.
........................

The link failed over successfully!

Recovering The First Interface

Surviving failure is only half the test. We also need to test the recovery of the interface. When ready, reconnect an-c05n01's eth0. The first thing you should notice is in an-c05n01's syslog;

Dec 13 14:06:40 an-c05n01 kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 13 14:06:40 an-c05n01 kernel: bonding: bond0: link status up for interface eth0, enabling it in 120000 ms.

The bond will still be using eth3, so let's wait two minutes. After the two minutes, you should see the following additional syslog entries.

Dec 13 14:08:40 an-c05n01 kernel: bond0: link status definitely up for interface eth0, 1000 Mbps full duplex.
Dec 13 14:08:40 an-c05n01 kernel: bonding: bond0: making interface eth0 the new active one.

If we go back to the bond status file, we'll see that the eth0 interface has been restored.

cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth0 (primary_reselect always)
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 120000
Down Delay (ms): 0
Slave Interface: eth0
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:e0:81:c7:ec:49
Slave queue ID: 0
Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1b:21:9d:59:fc
Slave queue ID: 0

Note that the only difference from before is that eth0's Link Failure Count has been incremented to 1. The test has passed!

Now repeat the test for the other two bonds, then for all three bonds on an-c05n02. Remember to also repeat each test, this time pulling the backup interface before the two-minute delay has completed. The primary interface should immediately take over again. This confirms that failover of the backup link is also working properly.

Failing The First Switch
Check that all bonds on both nodes are using their primary interfaces. Confirm your cabling to ensure that the primary links are all routed to the primary switch and that all backup links are cabled into the backup switch. Once done, pull the power to the primary switch. Both nodes should show similar output in their syslog windows;

Dec 13 14:16:17 an-c05n01 kernel: e1000e: eth2 NIC Link is Down
Dec 13 14:16:17 an-c05n01 kernel: e1000e: eth0 NIC Link is Down
Dec 13 14:16:17 an-c05n01 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Dec 13 14:16:17 an-c05n01 kernel: bonding: bond0: making interface eth3 the new active one.
Dec 13 14:16:17 an-c05n01 kernel: bonding: bond2: link status definitely down for interface eth2, disabling it
Dec 13 14:16:17 an-c05n01 kernel: bonding: bond2: making interface eth5 the new active one.
Dec 13 14:16:17 an-c05n01 kernel: device eth2 left promiscuous mode
Dec 13 14:16:17 an-c05n01 kernel: device eth5 entered promiscuous mode
Dec 13 14:16:17 an-c05n01 kernel: e1000e: eth1 NIC Link is Down
Dec 13 14:16:18 an-c05n01 kernel: bonding: bond1: link status definitely down for interface eth1, disabling it
Dec 13 14:16:18 an-c05n01 kernel: bonding: bond1: making interface eth4 the new active one.

I can look at an-c05n01's /proc/net/bonding/bond0 file and see:

cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth0 (primary_reselect always)
Currently Active Slave: eth3
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 120000
Down Delay (ms): 0
Slave Interface: eth0
MII Status: down
Link Failure Count: 3
Permanent HW addr: 00:e0:81:c7:ec:49
Slave queue ID: 0
Slave Interface: eth3
MII Status: up
Link Failure Count: 2
Permanent HW addr: 00:1b:21:9d:59:fc
Slave queue ID: 0

Notice that Currently Active Slave is now eth3? You can also see that eth0's link is down (MII Status: down). It should be the same story for all of the other bonds on both nodes. If we check the status of the cluster, we'll see that all is well.

cman_tool status
Version: 6.2.0
Config Version: 7
Cluster Name: an-cluster-A
Cluster Id: 24561
Cluster Member: Yes
Cluster Generation: 40
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Node votes: 1
Quorum: 1
Active subsystems: 7
Flags: 2node
Ports Bound: 0
Node name: an-c05n01.alteeve.ca
Node ID: 1
Multicast addresses: 239.192.95.81
Node addresses: 10.20.50.1
Success! We just failed the primary switch without any interruption of clustered services. We're not out of the woods yet, though...
Restoring The First Switch
Now that we've confirmed all of the bonds are working on the backup switch, let's restore power to the first switch.
It is very important to wait a while after restoring power to the switch. Some of the common problems that can break your cluster will not show up immediately. A good example is a misconfiguration of STP. In this case, the switch will come up, a short time will pass and then the switch will trigger an STP reconfiguration. Once this happens, both switches will block traffic for many seconds. This will partition your cluster. So then, let's power it back up. Within a few moments, you should see this in your syslog;
Dec 13 14:19:30 an-c05n01 kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 13 14:19:30 an-c05n01 kernel: bonding: bond0: link status up for interface eth0, enabling it in 120000 ms.
Dec 13 14:19:30 an-c05n01 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 13 14:19:30 an-c05n01 kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 13 14:19:30 an-c05n01 kernel: bonding: bond2: link status up for interface eth2, enabling it in 120000 ms.
Dec 13 14:19:30 an-c05n01 kernel: bonding: bond1: link status up for interface eth1, enabling it in 120000 ms.
As with the individual link test, the backup interfaces will remain in use for two minutes. This is critical because miimon has detected the connection to the switches, but the switches are still a long way from being able to route traffic. After the two minutes, we'll see the primary interfaces return to the active state.
Dec 13 14:20:25 an-c05n01 kernel: e1000e: eth0 NIC Link is Down
Dec 13 14:20:25 an-c05n01 kernel: bonding: bond0: link status down again after 55000 ms for interface eth0.
Dec 13 14:20:26 an-c05n01 kernel: e1000e: eth1 NIC Link is Down
Dec 13 14:20:26 an-c05n01 kernel: bonding: bond1: link status down again after 55800 ms for interface eth1.
Dec 13 14:20:27 an-c05n01 kernel: e1000e: eth2 NIC Link is Down
Dec 13 14:20:27 an-c05n01 kernel: bonding: bond2: link status down again after 56800 ms for interface eth2.
Dec 13 14:20:27 an-c05n01 kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 13 14:20:27 an-c05n01 kernel: bonding: bond0: link status up for interface eth0, enabling it in 120000 ms.
Dec 13 14:20:28 an-c05n01 kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 13 14:20:28 an-c05n01 kernel: bonding: bond1: link status up for interface eth1, enabling it in 120000 ms.
Dec 13 14:20:29 an-c05n01 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 13 14:20:29 an-c05n01 kernel: bonding: bond2: link status up for interface eth2, enabling it in 120000 ms.
Dec 13 14:20:31 an-c05n01 kernel: e1000e: eth0 NIC Link is Down
Dec 13 14:20:31 an-c05n01 kernel: bonding: bond0: link status down again after 3500 ms for interface eth0.
Dec 13 14:20:32 an-c05n01 kernel: e1000e: eth1 NIC Link is Down
Dec 13 14:20:32 an-c05n01 kernel: bonding: bond1: link status down again after 4100 ms for interface eth1.
Dec 13 14:20:32 an-c05n01 kernel: e1000e: eth2 NIC Link is Down
Dec 13 14:20:32 an-c05n01 kernel: bonding: bond2: link status down again after 3500 ms for interface eth2.
Dec 13 14:20:33 an-c05n01 kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 13 14:20:33 an-c05n01 kernel: bonding: bond0: link status up for interface eth0, enabling it in 120000 ms.
Dec 13 14:20:34 an-c05n01 kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 13 14:20:34 an-c05n01 kernel: bonding: bond1: link status up for interface eth1, enabling it in 120000 ms.
Dec 13 14:20:35 an-c05n01 kernel: e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 13 14:20:35 an-c05n01 kernel: bonding: bond2: link status up for interface eth2, enabling it in 120000 ms.
See all that bouncing? It is caused by many switches showing a link (that is, the MII status) before they are actually able to push traffic. As part of the switch's boot sequence, the links will go down and come back up a couple of times. The two-minute counter resets with each bounce, so the recovery time is actually quite a bit longer than two minutes. This is fine; there is no need to rush back to the first switch. Note that you will not see this bouncing on switches that hold back the MII status until they have finished booting. After a few minutes, the old interfaces will actually be restored.
Dec 13 14:22:33 an-c05n01 kernel: bond0: link status definitely up for interface eth0, 1000 Mbps full duplex.
Dec 13 14:22:33 an-c05n01 kernel: bonding: bond0: making interface eth0 the new active one.
Dec 13 14:22:34 an-c05n01 kernel: bond1: link status definitely up for interface eth1, 1000 Mbps full duplex.
Dec 13 14:22:34 an-c05n01 kernel: bonding: bond1: making interface eth1 the new active one.
Dec 13 14:22:35 an-c05n01 kernel: bond2: link status definitely up for interface eth2, 1000 Mbps full duplex.
Dec 13 14:22:35 an-c05n01 kernel: bonding: bond2: making interface eth2 the new active one.
Dec 13 14:22:35 an-c05n01 kernel: device eth5 left promiscuous mode
Dec 13 14:22:35 an-c05n01 kernel: device eth2 entered promiscuous mode
Complete success!
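Rather than eyeballing each bond's status file by hand, the same check can be scripted. The following is a small sketch (the check_bonds helper is our own, not part of the cluster stack); it compares each bond's configured primary slave against its currently active slave, using the same /proc/net/bonding files we read above:

```shell
# check_bonds [DIR]: for each bondX status file in DIR (default
# /proc/net/bonding), compare the configured primary slave to the
# currently active slave and flag any mismatch.
check_bonds() {
    dir="${1:-/proc/net/bonding}"
    for f in "$dir"/bond*; do
        [ -e "$f" ] || continue
        bond=$(basename "$f")
        primary=$(awk '/^Primary Slave/ {print $3}' "$f")
        active=$(awk '/^Currently Active Slave/ {print $4}' "$f")
        if [ "$primary" = "$active" ]; then
            echo "$bond: OK (active slave $active)"
        else
            echo "$bond: WARNING (active slave $active, expected $primary)"
        fi
    done
}

check_bonds
```

Run this on a node after restoring a switch; once the two-minute updelay has expired, every bond should report OK.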
Failing The Secondary Switch
Before we can say that everything is perfect, we need to test failing and recovering the secondary switch. The main purpose of this test is to ensure that no problems arise when the secondary switch restarts. To fail the switch, as we did with the primary switch, simply cut its power. We should see the following in both nodes' syslog;
Dec 13 14:30:57 an-c05n01 kernel: e1000e: eth3 NIC Link is Down
Dec 13 14:30:57 an-c05n01 kernel: bonding: bond0: link status definitely down for interface eth3, disabling it
Dec 13 14:30:58 an-c05n01 kernel: e1000e: eth4 NIC Link is Down
Dec 13 14:30:58 an-c05n01 kernel: e1000e: eth5 NIC Link is Down
Dec 13 14:30:58 an-c05n01 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Dec 13 14:30:58 an-c05n01 kernel: bonding: bond2: link status definitely down for interface eth5, disabling it
Let's take a look at an-c05n01's bond0 status file.
cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth0 (primary_reselect always)
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 120000
Down Delay (ms): 0
Slave Interface: eth0
MII Status: up
Link Failure Count: 3
Permanent HW addr: 00:e0:81:c7:ec:49
Slave queue ID: 0
Slave Interface: eth3
MII Status: down
Link Failure Count: 3
Permanent HW addr: 00:1b:21:9d:59:fc
Slave queue ID: 0
Note that the eth3 interface is shown as down. There should have been no dropped packets in the ping-flood window at all.
Restoring The Second Switch
When power is restored to the switch, we'll see the same "bouncing" as the switch goes through its startup process. Notice that the backup link also remains listed as down for two minutes, despite the interface not being used by the bonded interface.
Dec 13 14:33:36 an-c05n01 kernel: e1000e: eth4 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 13 14:33:36 an-c05n01 kernel: e1000e: eth5 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 13 14:33:36 an-c05n01 kernel: bonding: bond1: link status up for interface eth4, enabling it in 120000 ms.
Dec 13 14:33:36 an-c05n01 kernel: bonding: bond2: link status up for interface eth5, enabling it in 120000 ms.
Dec 13 14:33:37 an-c05n01 kernel: e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 13 14:33:37 an-c05n01 kernel: bonding: bond0: link status up for interface eth3, enabling it in 120000 ms.
Dec 13 14:34:34 an-c05n01 kernel: e1000e: eth5 NIC Link is Down
Dec 13 14:34:34 an-c05n01 kernel: bonding: bond2: link status down again after 58000 ms for interface eth5.
Dec 13 14:34:36 an-c05n01 kernel: e1000e: eth5 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Dec 13 14:34:36 an-c05n01 kernel: bonding: bond2: link status up for interface eth5, enabling it in 120000 ms.
Dec 13 14:34:38 an-c05n01 kernel: e1000e: eth5 NIC Link is Down
Dec 13 14:34:38 an-c05n01 kernel: bonding: bond2: link status down again after 2000 ms for interface eth5.
Dec 13 14:34:40 an-c05n01 kernel: e1000e: eth5 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Dec 13 14:34:40 an-c05n01 kernel: bonding: bond2: link status up for interface eth5, enabling it in 120000 ms.
Two minutes after the last bounce, we'll see the backup interfaces return to the up state in the bond's status file.
Dec 13 14:35:36 an-c05n01 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex.
Dec 13 14:35:37 an-c05n01 kernel: bond0: link status definitely up for interface eth3, 1000 Mbps full duplex.
Dec 13 14:36:40 an-c05n01 kernel: bond2: link status definitely up for interface eth5, 1000 Mbps full duplex.
After a full five minutes, the cluster and the network remain stable. We can officially declare our network fully highly available!
Installing DRBD
DRBD is an open-source application for real-time, block-level disk replication, created and maintained by Linbit. We will use it to keep the data on our cluster consistent between the two nodes. To install it, we have three choices;
We will be using the 8.3.x version of DRBD. This tracks the Red Hat and Linbit supported versions, providing the most-tested combination and a painless path to a fully supported version, should you decide to purchase support down the road.
Option 1 - Fully Supported by Red Hat and Linbit
Red Hat decided to no longer directly support DRBD in EL6 in order to narrow down what applications they shipped and focus on improving those components. Given the popularity of DRBD, however, Red Hat struck a deal with Linbit, the authors and maintainers of DRBD. You have the option of purchasing a fully supported version of DRBD that is blessed by Red Hat for use under Red Hat Enterprise Linux 6. If you are building a fully supported cluster, please contact Linbit to purchase DRBD. Once done, you will get an email with your login information and, most importantly here, the URL hash needed to access the official repositories. First you will need to add an entry in /etc/yum.repos.d/ for DRBD, but this needs to be hand-crafted, as you must specify the URL hash given to you in the email as part of the repo configuration.
This will take you to a new page called Instructions for using the DRBD package repository, where the detailed installation instructions are found. Let's use the imaginary URL hash abcdefghijklmnopqrstuvwxyz0123456789ABCD and assume we are in fact using the x86_64 architecture. Given this, we would create the following repository configuration file.
vim /etc/yum.repos.d/linbit.repo
[drbd-8]
name=DRBD 8
baseurl=http://packages.linbit.com/abcdefghijklmnopqrstuvwxyz0123456789ABCD/rhel6/x86_64
gpgcheck=0
Once this is saved, you can install DRBD using yum;
yum install drbd kmod-drbd
Done!
Option 2 - Install From ELRepo
ELRepo is a community-maintained repository of packages for Enterprise Linux; that is, Red Hat Enterprise Linux and its derivatives like CentOS. This is the easiest option for a freely available DRBD package. The main concern with this option is that you are ceding control of DRBD to a community-controlled project. This is a trusted repo, but there are still undeniable security concerns. Check for the latest installation RPM and information;
# Install the ELRepo GPG key, add the repo and install DRBD.
rpm --import http://elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://elrepo.org/elrepo-release-6-4.el6.elrepo.noarch.rpm
Retrieving http://elrepo.org/elrepo-release-6-4.el6.elrepo.noarch.rpm
Preparing... ########################################### [100%]
   1:elrepo-release        ########################################### [100%]
yum install drbd83-utils kmod-drbd83
This is the method used for this tutorial.
Option 3 - Install From Source
If you do not wish to pay for access to the official DRBD repository and do not feel comfortable adding a public repository, your last option is to install from Linbit's source code. The benefit of this is that you can vet the source before installing it, making it a more secure option. The downside is that you will need to manually install updates and security fixes as they are made available. On both nodes, run:
# Download, compile and install DRBD
yum install flex gcc make kernel-devel
wget -c http://oss.linbit.com/drbd/8.3/drbd-8.3.15.tar.gz
tar -xvzf drbd-8.3.15.tar.gz
cd drbd-8.3.15
./configure \
--prefix=/usr \
--localstatedir=/var \
--sysconfdir=/etc \
--with-utils \
--with-km \
--with-udev \
--with-pacemaker \
--with-rgmanager \
--with-bashcompletion
make
make install
chkconfig --add drbd
chkconfig drbd off
Hooking DRBD Into The Cluster's Fencing
We will use a script written by Lon Hohberger of Red Hat. This script captures fence calls from DRBD and in turn calls the cluster's fence_node against the opposing node. In this way, DRBD will avoid split-brain without the need to maintain two separate fence configurations. On both nodes, run:
# Obliterate peer - fence via cman
wget -c https://alteeve.ca/files/an-cluster/sbin/obliterate-peer.sh -O /sbin/obliterate-peer.sh
chmod a+x /sbin/obliterate-peer.sh
ls -lah /sbin/obliterate-peer.sh
-rwxr-xr-x 1 root root 2.1K May  4  2011 /sbin/obliterate-peer.sh
We'll configure DRBD to use this script shortly.
Alternate Fence Handler; rhcs_fence
A new fence handler called rhcs_fence, which ties DRBD into RHCS, is now available with the goal of replacing obliterate-peer.sh. It aims to extend Lon's script, which hasn't been actively developed in some time. This agent has had minimal testing, so please test thoroughly if you use it. It addresses the simultaneous-fencing issue by automatically adding a delay to the fence call based on the host node's ID number, with the node whose ID is 1 having no delay at all. It is also a little more elegant in how it handles the actual fence call, with the goal of being more reliable when a fence action takes longer than usual to complete. To install it, run the following on both nodes.
wget -c https://raw.github.com/digimer/rhcs_fence/master/rhcs_fence
chmod 755 rhcs_fence
mv rhcs_fence /sbin/
ls -lah /sbin/rhcs_fence
-rwxr-xr-x 1 root root 15K Jan 24 22:04 /sbin/rhcs_fence
The "Why" of Our Layout
We will be creating three separate DRBD resources. The reason for this is to minimize the chance of data loss in a split-brain event. We're going to take steps to ensure that a split-brain is exceedingly unlikely, but we always have to plan for the worst-case scenario. The biggest concern with recovering from a split-brain is that, by necessity, one of the nodes will lose data. Further, there is no way to automate the recovery, as there is no clear way for DRBD to tell which node has the more valuable data. Consider this scenario;
At this point, you will need to discard the changes on one of the nodes. So now you have to choose;
Neither assumption holds if the node with the older data and the smaller amount of changed data is the one holding the accounting data, which is significantly more valuable. Now imagine that both VMs have equally valuable data. What then? Which side do you discard? The approach we will use is to create two separate DRBD resources for the VMs. Then we will assign the VMs into two groups; VMs designed to normally run on one node will go on one resource, while the VMs designed to normally run on the other node will share the second resource. With all the VMs on a given node running on the same DRBD resource, we can fairly easily decide which node to discard changes on, at a per-resource level. To summarize, we're going to create the following three resources;
Creating The Partitions For DRBD
It is possible to use LVM on the hosts and simply create LVs to back our DRBD resources. However, this causes confusion, as LVM will see the PV signatures on both the DRBD backing devices and the DRBD device itself. Getting around this requires editing LVM's filter option, which is somewhat complicated. Not overly so, mind you, but enough to be outside the scope of this document. Also, working with the partitioning tool directly gives us a chance to make sure that the DRBD partitions start on an even 64 KiB boundary. This is important for decent performance on Windows VMs, as we will see later. This is true for both traditional platter and modern solid-state drives. On our nodes, we created three primary disk partitions;
We will create a new extended partition. Then within it we will create three new partitions;
As we create each partition, we will do a little math to ensure that the start sector is on a 64 KiB boundary.
Block Alignment
For performance reasons, we want to ensure that the file systems created within a VM match the block alignment of the underlying storage stack, clear down to the base partitions on /dev/sda (or whatever your lowest-level block device is). Imagine this misaligned scenario;
Note: Not to scale
________________________________________________________________
VM File system |~~~~~|_______|_______|_______|_______|_______|_______|_______|__
|~~~~~|==========================================================
DRBD Partition |~~~~~|_______|_______|_______|_______|_______|_______|_______|__
64 KiB block |_______|_______|_______|_______|_______|_______|_______|_______|
512byte sectors |_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|
Now, when the guest wants to write one block worth of data, it actually causes two blocks to be written, causing avoidable disk I/O. Compare this with the properly aligned layout;
Note: Not to scale
________________________________________________________________
VM File system |~~~~~~~|_______|_______|_______|_______|_______|_______|_______|
|~~~~~~~|========================================================
DRBD Partition |~~~~~~~|_______|_______|_______|_______|_______|_______|_______|
64 KiB block |_______|_______|_______|_______|_______|_______|_______|_______|
512byte sectors |_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|
By changing the start of our partitions to always fall on 64 KiB boundaries, we're sure to keep the guest OS's file system in line with the DRBD backing device's blocks. Thus, all reads and writes in the guest OS touch a matching number of real blocks, maximizing disk I/O efficiency. Thankfully, as we'll see in a moment, the parted program has a mode that tells it to always optimally align partitions, so we won't need to do any crazy math.
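If you do want to sanity-check the math yourself, it is simple; with 512-byte sectors, 64 KiB is 128 sectors, so a start sector is aligned when it divides evenly by 128. A quick sketch (the is_aligned helper is ours, purely for illustration):

```shell
# is_aligned SECTOR: report whether a start sector falls on a
# 64 KiB boundary (64 KiB / 512 B per sector = 128 sectors).
is_aligned() {
    if [ $(( $1 % 128 )) -eq 0 ]; then
        echo "sector $1: aligned"
    else
        echo "sector $1: NOT aligned; next boundary is sector $(( ($1 / 128 + 1) * 128 ))"
    fi
}

is_aligned 2048   # a 1 MiB start, as parted's optimal mode would pick
is_aligned 63     # the classic misaligned msdos default
```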
Special thanks to Pasi Kärkkäinen for his patience in explaining to me the importance of disk alignment. He created two images which I used as templates for the ASCII art images above.
Creating the DRBD Partitions
Here I will show you the values I entered to create the three partitions I needed on my nodes. DO NOT DIRECTLY COPY THIS! The values you enter will almost certainly be different. We're going to use a program called parted to configure the disk /dev/sda. Pay close attention to the -a optimal switch. This tells parted to create new partitions with optimal block alignment, which is crucial for virtual machine performance.
parted -a optimal /dev/sda
GNU Parted 2.1
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted)
We're now in the parted console. Before we start, let's take a look at the current disk configuration along with the amount of free space available.
print free
Model: ATA ST9500420ASG (scsi)
Disk /dev/sda: 500GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Number Start End Size Type File system Flags
32.3kB 1049kB 1016kB Free Space
1 1049kB 269MB 268MB primary ext4 boot
2 269MB 43.2GB 42.9GB primary ext4
3 43.2GB 47.5GB 4295MB primary linux-swap(v1)
        47.5GB  500GB   453GB            Free Space
Before we can create the three DRBD partitions, we first need to create an extended partition within which we will create the three logical partitions. From the output above, we can see that the free space starts at 47.5GB and that the drive ends at 500GB. Knowing this, we can now create the extended partition.
mkpart extended 47.5GB 500GB
Warning: WARNING: the kernel failed to re-read the partition table on /dev/sda
(Device or resource busy). As a result, it may not reflect all of your changes
until after reboot.
Don't worry about that message; we will reboot when we finish. So now we can confirm that the new extended partition was created by again printing the partition table and the free space.
print free
Model: ATA ST9500420ASG (scsi)
Disk /dev/sda: 500GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Number Start End Size Type File system Flags
32.3kB 1049kB 1016kB Free Space
1 1049kB 269MB 268MB primary ext4 boot
2 269MB 43.2GB 42.9GB primary ext4
3 43.2GB 47.5GB 4295MB primary linux-swap(v1)
4 47.5GB 500GB 453GB extended lba
47.5GB 500GB 453GB Free Space
        500GB   500GB   24.6kB           Free Space
Perfect. So now we're going to create our three logical partitions. We're going to use the same start position as last time, but the end position will be 20 GB further in.
mkpart logical 47.5GB 67.5GB
Warning: WARNING: the kernel failed to re-read the partition table on /dev/sda
(Device or resource busy). As a result, it may not reflect all of your changes
until after reboot.
We'll check again to see the new partition layout.
print free
Model: ATA ST9500420ASG (scsi)
Disk /dev/sda: 500GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Number Start End Size Type File system Flags
32.3kB 1049kB 1016kB Free Space
1 1049kB 269MB 268MB primary ext4 boot
2 269MB 43.2GB 42.9GB primary ext4
3 43.2GB 47.5GB 4295MB primary linux-swap(v1)
4 47.5GB 500GB 453GB extended lba
5 47.5GB 67.5GB 20.0GB logical
67.5GB 500GB 433GB Free Space
        500GB   500GB   24.6kB           Free Space
Again, perfect. Now I have a total of 433GB left free. How you carve this up for your VMs will depend entirely on what kind of VMs you plan to install and what their needs are. For me, I will divide the space evenly into two logical partitions of 216.5GB each (433 / 2 = 216.5). The first partition will start at 67.5GB and end at 284GB (67.5 + 216.5 = 284).
mkpart logical 67.5GB 284GB
Warning: WARNING: the kernel failed to re-read the partition table on /dev/sda
(Device or resource busy). As a result, it may not reflect all of your changes
until after reboot.
Once again, let's look at the new partition table.
print free
Model: ATA ST9500420ASG (scsi)
Disk /dev/sda: 500GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Number Start End Size Type File system Flags
32.3kB 1049kB 1016kB Free Space
1 1049kB 269MB 268MB primary ext4 boot
2 269MB 43.2GB 42.9GB primary ext4
3 43.2GB 47.5GB 4295MB primary linux-swap(v1)
4 47.5GB 500GB 453GB extended lba
5 47.5GB 67.5GB 20.0GB logical
6 67.5GB 284GB 216GB logical
284GB 500GB 216GB Free Space
        500GB   500GB   24.6kB           Free Space
Finally, our last partition will start at 284GB and use the rest of the free space, ending at 500GB.
mkpart logical 284GB 500GB
Warning: WARNING: the kernel failed to re-read the partition table on /dev/sda
(Device or resource busy). As a result, it may not reflect all of your changes
until after reboot.
One last time, let's look at the partition table.
print free
Model: ATA ST9500420ASG (scsi)
Disk /dev/sda: 500GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Number Start End Size Type File system Flags
32.3kB 1049kB 1016kB Free Space
1 1049kB 269MB 268MB primary ext4 boot
2 269MB 43.2GB 42.9GB primary ext4
3 43.2GB 47.5GB 4295MB primary linux-swap(v1)
4 47.5GB 500GB 453GB extended lba
5 47.5GB 67.5GB 20.0GB logical
6 67.5GB 284GB 216GB logical
7 284GB 500GB 216GB logical
        500GB   500GB   24.6kB           Free Space
Just as we asked for. Before we finish though, let's be extra careful and do a manual check of our three partitions to ensure that they are, in fact, aligned optimally. There will be no output from the following commands if the partitions are aligned.
(parted) align-check opt 5
(parted) align-check opt 6
(parted) align-check opt 7
(parted)
Excellent! We can now exit.
quit
Information: You may need to update /etc/fstab.
Now we need to reboot to make the kernel see the new partition table.
reboot
Done! Do this for both nodes, then proceed.
Configuring DRBD
DRBD is configured in two parts;
We will be creating three separate DRBD resources, so we will create three separate resource configuration files. More on that in a moment.
Configuring DRBD Global and Common Options
The first file to edit is /etc/drbd.d/global_common.conf. In this file, we will set global configuration options and default resource configuration options. These defaults can be overridden in the actual resource files, which we'll create once we're done here. I'll explain the values we're setting, and we'll put the explanation of each option in the file itself, as it will be useful to have them should you need to alter the files sometime in the future. The first addition is in the handlers { } directive. We're going to add the fence-peer option and configure it to use the obliterate-peer.sh script we spoke about earlier in the DRBD section.
vim /etc/drbd.d/global_common.conf
handlers {
# This script is a wrapper for RHCS's 'fence_node' command line
# tool. It will call a fence against the other node and return
# the appropriate exit code to DRBD.
fence-peer "/sbin/obliterate-peer.sh";
}
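For context, the contract a fence-peer handler must honour is small: DRBD invokes it when the peer connection is lost, and an exit code of 7 tells DRBD that the peer was successfully fenced, letting IO resume. A minimal sketch of the idea follows (this is not obliterate-peer.sh itself, and the fence_peer_handler name is ours; the real script also discovers the peer's node name via cman rather than taking an argument):

```shell
# fence_peer_handler PEER: ask the cluster to fence the named peer,
# then translate the result into DRBD's expected exit codes.
fence_peer_handler() {
    peer="$1"
    if fence_node "$peer"; then
        return 7   # 7 = "peer has been fenced"; DRBD may resume IO
    fi
    return 1       # anything else: fencing failed, DRBD keeps blocking
}
```

The important detail is the return code; a handler that fences successfully but exits non-7 leaves DRBD blocked.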
We're going to add three options to the startup { } directive; we will tell DRBD to promote both nodes to "primary" on start, to wait five minutes on start for its peer to connect and, if the peer was degraded when last seen, to wait only two minutes.
startup {
# This tells DRBD to promote both nodes to Primary on start.
become-primary-on both;
# This tells DRBD to wait five minutes for the other node to
# connect. This should be longer than it takes for cman to
# timeout and fence the other node *plus* the amount of time it
# takes the other node to reboot. If you set this too short,
# you could corrupt your data. If you want to be extra safe, do
# not use this at all and DRBD will wait for the other node
# forever.
wfc-timeout 300;
# This tells DRBD to wait for the other node for three minutes
# if the other node was degraded the last time it was seen by
# this node. This is a way to speed up the boot process when
# the other node is out of commission for an extended duration.
degr-wfc-timeout 120;
}
For the disk { } directive, we're going to configure DRBD's behaviour when the connection to its peer is unexpectedly lost. By setting fencing to resource-and-stonith, we're telling DRBD to block all disk access and call a fence against its peer node rather than proceed.
disk {
# This tells DRBD to block IO and fence the remote node (using
# the 'fence-peer' helper) when connection with the other node
# is unexpectedly lost. This is what helps prevent split-brain
	# condition and it is incredibly important in dual-primary
# setups!
fencing resource-and-stonith;
}
In the net { } directive, we're going to tell DRBD that it is allowed to run in dual-primary mode and configure how it behaves if a split-brain occurs despite our best efforts. The recovery (or lack thereof) requires three options; what to do when neither node had been primary (after-sb-0pri), what to do if only one node had been primary (after-sb-1pri) and, finally, what to do if both nodes had been primary (after-sb-2pri), as will most likely be the case for us. This last instance will be configured to tell DRBD to simply drop the connection, which will require human intervention to correct. At this point, you might be wondering why we don't simply run Primary/Secondary. The reason is live migration. When we push a VM across to the backup node, there is a short period of time where both nodes need to be writeable.
net {
# This tells DRBD to allow two nodes to be Primary at the same
# time. It is needed when 'become-primary-on both' is set.
allow-two-primaries;
# The following three commands tell DRBD how to react should
# our best efforts fail and a split brain occurs. You can learn
# more about these options by reading the drbd.conf man page.
# NOTE! It is not possible to safely recover from a split brain
	# where both nodes were primary. This case requires human
# intervention, so 'disconnect' is the only safe policy.
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
}
We'll make our usual backup of the configuration file, add the new sections and then create a diff to see exactly how things have changed.
cp /etc/drbd.d/global_common.conf /etc/drbd.d/global_common.conf.orig
vim /etc/drbd.d/global_common.conf
diff -u /etc/drbd.d/global_common.conf.orig /etc/drbd.d/global_common.conf
--- /etc/drbd.d/global_common.conf.orig	2011-12-13 22:22:30.916128360 -0500
+++ /etc/drbd.d/global_common.conf 2011-12-13 22:26:30.733379609 -0500
@@ -14,22 +14,67 @@
# split-brain "/usr/lib/drbd/notify-split-brain.sh root";
# out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
# before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
+
# after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
+ # This script is a wrapper for RHCS's 'fence_node' command line
+ # tool. It will call a fence against the other node and return
+ # the appropriate exit code to DRBD.
+ fence-peer "/sbin/obliterate-peer.sh";
}
startup {
# wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
+
+ # This tells DRBD to promote both nodes to Primary on start.
+ become-primary-on both;
+
+ # This tells DRBD to wait five minutes for the other node to
+ # connect. This should be longer than it takes for cman to
+ # timeout and fence the other node *plus* the amount of time it
+ # takes the other node to reboot. If you set this too short,
+ # you could corrupt your data. If you want to be extra safe, do
+ # not use this at all and DRBD will wait for the other node
+ # forever.
+ wfc-timeout 300;
+
+ # This tells DRBD to wait for the other node for three minutes
+ # if the other node was degraded the last time it was seen by
+ # this node. This is a way to speed up the boot process when
+ # the other node is out of commission for an extended duration.
+ degr-wfc-timeout 120;
}
disk {
# on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
# no-disk-drain no-md-flushes max-bio-bvecs
+
+ # This tells DRBD to block IO and fence the remote node (using
+ # the 'fence-peer' helper) when connection with the other node
+ # is unexpectedly lost. This is what helps prevent split-brain
+	# condition and it is incredibly important in dual-primary
+ # setups!
+ fencing resource-and-stonith;
}
net {
# sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
# max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
# after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
+
+
+ # This tells DRBD to allow two nodes to be Primary at the same
+ # time. It is needed when 'become-primary-on both' is set.
+ allow-two-primaries;
+
+ # The following three commands tell DRBD how to react should
+ # our best efforts fail and a split brain occurs. You can learn
+ # more about these options by reading the drbd.conf man page.
+ # NOTE! It is not possible to safely recover from a split brain
+	# where both nodes were primary. This case requires human
+ # intervention, so 'disconnect' is the only safe policy.
+ after-sb-0pri discard-zero-changes;
+ after-sb-1pri discard-secondary;
+ after-sb-2pri disconnect;
}
        syncer {
                # rate after al-extents use-rle cpu-mask verify-alg csums-alg
        }
}

Configuring the DRBD Resources

As mentioned earlier, we are going to create three DRBD resources.
Each resource configuration will be in its own file saved as /etc/drbd.d/rX.res. The three of them will be pretty much the same. So let's take a look at the first GFS2 resource r0.res, then we'll just look at the changes for r1.res and r2.res. These files won't exist initially. vim /etc/drbd.d/r0.res # This is the resource used for the shared GFS2 partition.
resource r0 {
# This is the block device path.
device /dev/drbd0;
# We'll use the normal internal metadisk (takes about 32MB/TB)
meta-disk internal;
# This is the `uname -n` of the first node
on an-c05n01.alteeve.ca {
# The 'address' has to be the IP, not a hostname. This is the
                # node's SN (bond1) IP. The port number must be unique among
# resources.
address 10.10.50.1:7788;
# This is the block device backing this resource on this node.
disk /dev/sda5;
}
# Now the same information again for the second node.
on an-c05n02.alteeve.ca {
address 10.10.50.2:7788;
disk /dev/sda5;
}
} Now copy this to r1.res and edit it for the an-c05n01 VM resource. The main differences are the resource name, r1, the block device, /dev/drbd1, the port, 7789, and the backing block device, /dev/sda6. cp /etc/drbd.d/r0.res /etc/drbd.d/r1.res
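If you prefer, the copy-and-edit can be scripted. A minimal sketch, run here against a throw-away copy of r0.res under /tmp (so nothing in /etc/drbd.d is touched); the substitutions are exactly the four differences listed above:

```shell
# Sketch only: generate r1.res from a stand-in copy of the tutorial's r0.res.
cat > /tmp/r0.res <<'EOF'
resource r0 {
	device /dev/drbd0;
	meta-disk internal;
	on an-c05n01.alteeve.ca {
		address 10.10.50.1:7788;
		disk /dev/sda5;
	}
	on an-c05n02.alteeve.ca {
		address 10.10.50.2:7788;
		disk /dev/sda5;
	}
}
EOF

# r0 -> r1, /dev/drbd0 -> /dev/drbd1, port 7788 -> 7789, sda5 -> sda6.
sed -e 's/^resource r0 /resource r1 /' \
    -e 's|/dev/drbd0|/dev/drbd1|'      \
    -e 's/:7788;/:7789;/'              \
    -e 's|/dev/sda5|/dev/sda6|' /tmp/r0.res > /tmp/r1.res

grep -E 'resource|address' /tmp/r1.res
```

The same four substitutions, shifted once more, would produce r2.res. On a real node you would still read the result over before using it.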
vim /etc/drbd.d/r1.res # This is the resource used for VMs that will normally run on an-c05n01.
resource r1 {
# This is the block device path.
device /dev/drbd1;
# We'll use the normal internal metadisk (takes about 32MB/TB)
meta-disk internal;
# This is the `uname -n` of the first node
on an-c05n01.alteeve.ca {
# The 'address' has to be the IP, not a hostname. This is the
                # node's SN (bond1) IP. The port number must be unique among
# resources.
address 10.10.50.1:7789;
# This is the block device backing this resource on this node.
disk /dev/sda6;
}
# Now the same information again for the second node.
on an-c05n02.alteeve.ca {
address 10.10.50.2:7789;
disk /dev/sda6;
}
} The last resource is again the same, with the same set of changes. cp /etc/drbd.d/r1.res /etc/drbd.d/r2.res
vim /etc/drbd.d/r2.res # This is the resource used for VMs that will normally run on an-c05n02.
resource r2 {
# This is the block device path.
device /dev/drbd2;
# We'll use the normal internal metadisk (takes about 32MB/TB)
meta-disk internal;
# This is the `uname -n` of the first node
on an-c05n01.alteeve.ca {
# The 'address' has to be the IP, not a hostname. This is the
                # node's SN (bond1) IP. The port number must be unique among
# resources.
address 10.10.50.1:7790;
# This is the block device backing this resource on this node.
disk /dev/sda7;
}
# Now the same information again for the second node.
on an-c05n02.alteeve.ca {
address 10.10.50.2:7790;
disk /dev/sda7;
}
} The final step is to validate the configuration. This is done by running the following command; drbdadm dump # /etc/drbd.conf
common {
protocol C;
net {
allow-two-primaries;
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
}
disk {
fencing resource-and-stonith;
}
startup {
wfc-timeout 300;
degr-wfc-timeout 120;
become-primary-on both;
}
handlers {
pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
fence-peer /sbin/obliterate-peer.sh;
}
}
# resource r0 on an-c05n01.alteeve.ca: not ignored, not stacked
resource r0 {
on an-c05n01.alteeve.ca {
device /dev/drbd0 minor 0;
disk /dev/sda5;
address ipv4 10.10.50.1:7788;
meta-disk internal;
}
on an-c05n02.alteeve.ca {
device /dev/drbd0 minor 0;
disk /dev/sda5;
address ipv4 10.10.50.2:7788;
meta-disk internal;
}
}
# resource r1 on an-c05n01.alteeve.ca: not ignored, not stacked
resource r1 {
on an-c05n01.alteeve.ca {
device /dev/drbd1 minor 1;
disk /dev/sda6;
address ipv4 10.10.50.1:7789;
meta-disk internal;
}
on an-c05n02.alteeve.ca {
device /dev/drbd1 minor 1;
disk /dev/sda6;
address ipv4 10.10.50.2:7789;
meta-disk internal;
}
}
# resource r2 on an-c05n01.alteeve.ca: not ignored, not stacked
resource r2 {
on an-c05n01.alteeve.ca {
device /dev/drbd2 minor 2;
disk /dev/sda7;
address ipv4 10.10.50.1:7790;
meta-disk internal;
}
on an-c05n02.alteeve.ca {
device /dev/drbd2 minor 2;
disk /dev/sda7;
address ipv4 10.10.50.2:7790;
meta-disk internal;
}
} You'll note that the output is formatted differently from the configuration files we created, but the values themselves are the same. If there had been errors, you would have seen them printed. Fix any problems before proceeding. Once you get a clean dump, copy the configuration over to the other node. rsync -av /etc/drbd.d root@an-c05n02:/etc/ sending incremental file list
drbd.d/
drbd.d/global_common.conf
drbd.d/global_common.conf.orig
drbd.d/r0.res
drbd.d/r1.res
drbd.d/r2.res
sent 7534 bytes received 129 bytes 5108.67 bytes/sec
total size is 7874 speedup is 1.03

Initializing The DRBD Resources

Now that we have DRBD configured, we need to initialize the DRBD backing devices and then bring up the resources for the first time.
On both nodes, create the new metadata on the backing devices. If DRBD sees an actual file system on a backing device, it will error out and insist that you clear the partition. You can do this by running; dd if=/dev/zero of=/dev/sdaX bs=4M, where X is the partition you want to clear. This is called "zeroing out" a partition. The dd program does not print its progress and can take a long time. To check its progress, open a new session to the server and run kill -USR1 $(pgrep -x dd); dd will then print how much it has written so far. If DRBD sees old metadata, it will prompt you to type yes before it will proceed. In my case, I had recently zeroed-out my drive, so DRBD had no concerns and just created the metadata for the three resources. drbdadm create-md r{0..2} Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success Before you go any further, we'll need to load the drbd kernel module. Note that you won't normally need to do this. Later, after we get everything running the first time, we'll be able to start and stop the DRBD resources using the /etc/init.d/drbd script, which loads and unloads the drbd kernel module as needed. modprobe drbd Now go back to the terminal windows we had used to watch the cluster start. We now want to watch the output of cat /proc/drbd so we can keep tabs on the current state of the DRBD resources. We'll do this by using the watch program, which will refresh the output of the cat call every couple of seconds. watch cat /proc/drbd version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by dag@Build64R6, 2011-11-20 10:57:03 Back in the first terminal, we need to attach the backing device, /dev/sda{5..7} to their respective DRBD resources, r{0..2}. After running the following command, you will see no output on the first terminal, but the second terminal's /proc/drbd should update. drbdadm attach r{0..2} version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by dag@Build64R6, 2011-11-20 10:57:03
0: cs:StandAlone ro:Secondary/Unknown ds:Inconsistent/DUnknown r----s
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:19515784
1: cs:StandAlone ro:Secondary/Unknown ds:Inconsistent/DUnknown r----s
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:211418788
2: cs:StandAlone ro:Secondary/Unknown ds:Inconsistent/DUnknown r----s
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:211034800 Take note of the connection state, cs:StandAlone, the current role, ro:Secondary/Unknown, and the disk state, ds:Inconsistent/DUnknown. This tells us that our resources are not talking to one another, are not usable because they are in the Secondary state (you can't even read the /dev/drbdX device) and that the backing device does not have an up-to-date view of the data. This all makes sense, of course, as the resources are brand new. So the next step is to connect the two nodes together. As before, we won't see any output from the first terminal, but the second terminal will change.
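The cs:, ro: and ds: fields are easy to pick apart in a script, should you ever want to check resource health programmatically. A small sketch (not part of the tutorial's tool set) that parses an 8.3-style /proc/drbd resource line, canned here as a string so it runs anywhere; on a node you would read /proc/drbd itself:

```shell
# Parse the connection state (cs:), roles (ro:) and disk states (ds:)
# from a /proc/drbd resource line.
status_line='0: cs:StandAlone ro:Secondary/Unknown ds:Inconsistent/DUnknown r----s'

cs=$(echo "$status_line" | awk '{ sub(/^cs:/, "", $2); print $2 }')
ro=$(echo "$status_line" | awk '{ sub(/^ro:/, "", $3); print $3 }')
ds=$(echo "$status_line" | awk '{ sub(/^ds:/, "", $4); print $4 }')

echo "connection: $cs, roles: $ro, disks: $ds"
```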
drbdadm connect r{0..2} version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by dag@Build64R6, 2011-11-20 10:57:03
0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:19515784
1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:211418788
2: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:211034800 We can now see that the two nodes are talking to one another properly, as the connection state has changed to cs:Connected. They can see that their peer node is in the same state as they are; Secondary/Inconsistent. Seeing as the resources are brand new, there is no data to synchronize between the two nodes. We're going to issue a special command that will only ever be used this one time. It will tell DRBD to immediately consider the DRBD resources to be up to date. On one node only, run; drbdadm -- --clear-bitmap new-current-uuid r{0..2} As before, look to the second terminal to see the new state of affairs. version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by dag@Build64R6, 2011-11-20 10:57:03
0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
2: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0 Voila! We could promote both sides to Primary by running drbdadm primary r{0..2} on both nodes, but there is no purpose in doing that at this stage as we can safely say our DRBD is ready to go. So instead, let's just stop DRBD entirely. We'll also prevent it from starting on boot as drbd will be managed by the cluster in a later step. On both nodes run; /etc/init.d/drbd stop Stopping all DRBD resources: . Now disable it from starting on boot. chkconfig drbd off
chkconfig --list drbd drbd 0:off 1:off 2:off 3:off 4:off 5:off 6:off The second terminal will start complaining that /proc/drbd no longer exists. This is because the drbd init script unloaded the drbd kernel module. This is expected and not a problem.

Configuring Clustered Storage

Before we can provision the first virtual machine, we must first create the storage that will back it. This will take a few steps;
Clustered Logical Volume Management

We will assign all three DRBD resources to be managed by clustered LVM. This isn't strictly needed for the GFS2 partition, as it uses DLM directly. However, the flexibility of LVM is very appealing, and it will make later growth of the GFS2 partition quite trivial, should the need arise. The real reason for clustered LVM in our cluster is to provide DLM-backed locking to the partitions (logical volumes, in LVM terms) that will be used to back our VMs. Of course, the flexibility of LVM-managed storage is enough of a win in itself to justify using LVM for our VMs, and shouldn't be ignored here.

Configuring Clustered LVM Locking

Before we create the clustered LVM, we first need to make three changes to the LVM configuration.
Start by making a backup of lvm.conf and then begin editing it. cp /etc/lvm/lvm.conf /etc/lvm/lvm.conf.orig
vim /etc/lvm/lvm.conf The configuration option to filter out the DRBD backing device is, surprisingly, filter = [ ... ]. By default, it is set to allow everything via the "a/.*/" regular expression. We're only using DRBD in our LVM, so we're going to flip that to reject everything except DRBD by changing the regex to "a|/dev/drbd*|", "r/.*/". If we didn't do this, LVM would see the same signature on the DRBD device and again on the backing devices, at which time it would ignore the DRBD device. This filter allows LVM to only inspect the DRBD devices for LVM signatures. Change; # By default we accept every block device:
filter = [ "a/.*/" ] To; # We're only using LVM on DRBD resource.
filter = [ "a|/dev/drbd*|", "r/.*/" ] For the locking, we're going to change the locking_type from 1 (local locking) to 3 (clustered locking). This is what tells LVM to use DLM. Change; locking_type = 1 To; locking_type = 3 Lastly, we're also going to disallow fall-back to local locking. Normally, LVM would try to access a clustered LVM VG using local locking if DLM is not available. We want to prevent any access to the clustered LVM volumes except when the DLM is itself running. This is done by changing fallback_to_local_locking to 0. Change; fallback_to_local_locking = 1 To; fallback_to_local_locking = 0 Save the changes, then let's run a diff against our backup to see a summary of the changes. diff -u /etc/lvm/lvm.conf.orig /etc/lvm/lvm.conf --- /etc/lvm/lvm.conf.orig 2011-12-14 17:42:16.416094972 -0500
+++ /etc/lvm/lvm.conf 2011-12-14 17:49:15.747097684 -0500
@@ -62,8 +62,8 @@
# If it doesn't do what you expect, check the output of 'vgscan -vvvv'.
- # By default we accept every block device:
- filter = [ "a/.*/" ]
+ # We're only using LVM on DRBD resource.
+ filter = [ "a|/dev/drbd*|", "r/.*/" ]
# Exclude the cdrom drive
# filter = [ "r|/dev/cdrom|" ]
@@ -356,7 +356,7 @@
# Type 3 uses built-in clustered locking.
# Type 4 uses read-only locking which forbids any operations that might
# change metadata.
- locking_type = 1
+ locking_type = 3
# Set to 0 to fail when a lock request cannot be satisfied immediately.
wait_for_locks = 1
@@ -372,7 +372,7 @@
# to 1 an attempt will be made to use local file-based locking (type 1).
# If this succeeds, only commands against local volume groups will proceed.
# Volume Groups marked as clustered will be ignored.
- fallback_to_local_locking = 1
+ fallback_to_local_locking = 0
# Local non-LV directory that holds file-based locks while commands are
# in progress. A directory like /tmp that may get wiped on reboot is OK. Perfect! Now copy the modified lvm.conf file to the other node. rsync -av /etc/lvm/lvm.conf root@an-c05n02:/etc/lvm/ sending incremental file list
lvm.conf
sent 2351 bytes received 283 bytes 5268.00 bytes/sec
total size is 28718 speedup is 10.90

Testing the clvmd Daemon

A little later on, we're going to put clustered LVM under the control of rgmanager. Before we can do that though, we need to start it manually so that we can use it to create the LV that will back the GFS2 /shared partition, which we will also be adding to rgmanager when we build our storage services. Before we start the clvmd daemon, we'll want to ensure that the cluster is running. cman_tool status Version: 6.2.0
Config Version: 7
Cluster Name: an-cluster-A
Cluster Id: 24561
Cluster Member: Yes
Cluster Generation: 68
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Node votes: 1
Quorum: 1
Active subsystems: 7
Flags: 2node
Ports Bound: 0
Node name: an-c05n01.alteeve.ca
Node ID: 1
Multicast addresses: 239.192.95.81
Node addresses: 10.20.50.1 It is, and both nodes are members. We can start the clvmd daemon now. /etc/init.d/clvmd start Starting clvmd:
Activating VG(s): No volume groups found
[ OK ] We've not created any clustered volume groups yet, so that complaint about not finding volume groups is expected. We don't want clvmd to start at boot, as we will be putting it under the cluster's control. So we need to make sure that clvmd is disabled at boot, and then we'll stop clvmd for now. chkconfig clvmd off
chkconfig --list clvmd clvmd 0:off 1:off 2:off 3:off 4:off 5:off 6:off Now stop it entirely. /etc/init.d/clvmd stop Signaling clvmd to exit [ OK ]
clvmd terminated [ OK ]

Initialize our DRBD Resources for use as LVM PVs

This is the first time we're actually going to use DRBD and clustered LVM, so we need to make sure that both are started. Earlier we stopped them, so if they're not running now, we need to restart them. First, check (and start if needed) drbd. /etc/init.d/drbd status drbd not loaded It's stopped, so we'll start it on both nodes now. /etc/init.d/drbd start Starting DRBD resources: [ d(r0) d(r1) d(r2) n(r0) n(r1) n(r2) ]. It looks like it started, but let's confirm that the resources are all Connected, Primary and UpToDate. /etc/init.d/drbd status drbd driver loaded OK; device status:
version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by dag@Build64R6, 2011-11-20 10:57:03
m:res cs ro ds p mounted fstype
0:r0 Connected Primary/Primary UpToDate/UpToDate C
1:r1 Connected Primary/Primary UpToDate/UpToDate C
2:r2 Connected Primary/Primary UpToDate/UpToDate C Excellent, now to check on clvmd. /etc/init.d/clvmd status clvmd is stopped It's also stopped, so lets start it now. /etc/init.d/clvmd start Starting clvmd:
Activating VG(s): No volume groups found
[ OK ] Now we're ready to start! Before we can use LVM, clustered or otherwise, we need to initialize one or more raw storage devices. This is done using the pvcreate command. We're going to do this on an-c05n01, then run pvscan on an-c05n02. We should see the newly initialized DRBD resources appear. Running pvscan first, we'll see that no PVs have been created. pvscan No matching physical volumes found On an-c05n01, initialize the PVs; pvcreate /dev/drbd{0..2} Writing physical volume data to disk "/dev/drbd0"
Physical volume "/dev/drbd0" successfully created
Writing physical volume data to disk "/dev/drbd1"
Physical volume "/dev/drbd1" successfully created
Writing physical volume data to disk "/dev/drbd2"
Physical volume "/dev/drbd2" successfully created On both nodes, re-run pvscan and the new PVs should show. This works because DRBD is keeping the data in sync, including the new LVM signatures. pvscan PV /dev/drbd0 lvm2 [18.61 GiB]
PV /dev/drbd1 lvm2 [201.62 GiB]
PV /dev/drbd2 lvm2 [201.26 GiB]
Total: 3 [421.49 GiB] / in use: 0 [0 ] / in no VG: 3 [421.49 GiB] Done.

Creating Cluster Volume Groups

As with initializing the DRBD resources above, we will create our volume groups, VGs, on an-c05n01 only, but we will then see them on both nodes. Check to confirm that no VGs exist; vgdisplay No volume groups found Now to create the VGs, we'll use the vgcreate command with the -c y switch, which tells LVM to make the VG a clustered VG. Note that when the clvmd daemon is running, -c y is implied. However, I like to get into the habit of using it because it will trigger an error if, for some reason, clvmd wasn't actually running. On an-c05n01, create the three VGs.
vgcreate -c y shared-vg0 /dev/drbd0 Clustered volume group "shared-vg0" successfully created
vgcreate -c y an01-vg0 /dev/drbd1 Clustered volume group "an01-vg0" successfully created
vgcreate -c y an02-vg0 /dev/drbd2 Clustered volume group "an02-vg0" successfully created Now on both nodes, we should see the three new volume groups. vgscan Reading all physical volumes. This may take a while...
Found volume group "an02-vg0" using metadata type lvm2
Found volume group "an01-vg0" using metadata type lvm2
Found volume group "shared-vg0" using metadata type lvm2

Creating a Logical Volume

At this stage, we're going to create only one LV for the GFS2 partition. We'll create the rest later, when we're ready to provision the VMs. This will be the /shared partition, which we will discuss further in the next section. As before, we'll create the LV on an-c05n01 and then verify it exists on both nodes. Before we create our first LV, check lvscan. lvscan Nothing is returned. On an-c05n01, create the LV on the shared-vg0 VG, using all of the available space. lvcreate -l 100%FREE -n shared shared-vg0 Logical volume "shared" created Now on both nodes, check that the new LV exists. lvscan ACTIVE '/dev/shared-vg0/shared' [18.61 GiB] inherit Perfect. We can now create our GFS2 partition. The GFS2-formatted /shared partition will be used for four main purposes;
Make sure that both drbd and clvmd are running. The mkfs.gfs2 call uses a few switches that are worth explaining;
Then, on an-c05n01, run; mkfs.gfs2 -p lock_dlm -j 2 -t an-cluster-A:shared /dev/shared-vg0/shared This will destroy any data on /dev/shared-vg0/shared.
It appears to contain: symbolic link to `../dm-0' Are you sure you want to proceed? [y/n] y Device: /dev/shared-vg0/shared
Blocksize: 4096
Device Size 18.61 GB (4878336 blocks)
Filesystem Size: 18.61 GB (4878333 blocks)
Journals: 2
Resource Groups: 75
Locking Protocol: "lock_dlm"
Lock Table: "an-cluster-A:shared"
UUID: 162a80eb-59b3-08bd-5d69-740cbb60aa45 On both nodes, run all of the following commands. mkdir /shared
mount /dev/shared-vg0/shared /shared/ Confirm that /shared is now mounted. df -hP /shared Filesystem Size Used Avail Use% Mounted on
/dev/mapper/shared--vg0-shared 19G 259M 19G 2% /shared Note that the path under Filesystem is different from what we used when creating the GFS2 partition. This is an effect of Device Mapper, which is used by LVM to create symlinks to actual block device paths. If we look at our /dev/shared-vg0/shared device and the device from df, /dev/mapper/shared--vg0-shared, we'll see that they both point to the same actual block device. ls -lah /dev/shared-vg0/shared /dev/mapper/shared--vg0-shared lrwxrwxrwx 1 root root 7 Oct 23 16:35 /dev/mapper/shared--vg0-shared -> ../dm-0
lrwxrwxrwx 1 root root 7 Oct 23 16:35 /dev/shared-vg0/shared -> ../dm-0 ls -lah /dev/dm-0 brw-rw---- 1 root disk 253, 0 Oct 23 16:35 /dev/dm-0 This next step uses some command-line voodoo. It takes the output from gfs2_tool sb /dev/shared-vg0/shared uuid, parses out the UUID, converts it to lower-case and spits out a string that can be used in /etc/fstab. We'll run it twice; The first time to confirm that the output is what we expect and the second time to append it to /etc/fstab. The gfs2 daemon can only work on GFS2 partitions that have been defined in /etc/fstab, so this is a required step on both nodes. We use defaults,noatime,nodiratime instead of just defaults for performance reasons. Normally, every time a file or directory is accessed, its atime (or diratime) is updated, which requires a disk write, which requires an exclusive DLM lock, which is expensive. If you need to know when a file or directory was accessed, remove ,noatime,nodiratime. echo `gfs2_tool sb /dev/shared-vg0/shared uuid | awk '/uuid =/ { print $4; }' | sed -e "s/\(.*\)/UUID=\L\1\E \/shared\t\tgfs2\tdefaults,noatime,nodiratime\t0 0/"` UUID=162a80eb-59b3-08bd-5d69-740cbb60aa45 /shared gfs2 defaults,noatime,nodiratime 0 0 This looks good, so now re-run it but redirect the output to append to /etc/fstab. We'll confirm it worked by checking the status of the gfs2 daemon. echo `gfs2_tool sb /dev/shared-vg0/shared uuid | awk '/uuid =/ { print $4; }' | sed -e "s/\(.*\)/UUID=\L\1\E \/shared\t\tgfs2\tdefaults,noatime,nodiratime\t0 0/"` >> /etc/fstab
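If the awk and sed stages of that one-liner are opaque, here is the same pipeline pulled apart against a canned gfs2_tool-style line. The UUID is the one from the mkfs.gfs2 output above, upper-cased as gfs2_tool prints it, and the original's literal tabs are replaced with single spaces for clarity:

```shell
# Stand-in for one line of `gfs2_tool sb /dev/shared-vg0/shared uuid` output.
sample='  sb uuid = 162A80EB-59B3-08BD-5D69-740CBB60AA45'

# Stage 1: awk matches the 'uuid =' line and prints its fourth field, the UUID.
uuid=$(echo "$sample" | awk '/uuid =/ { print $4; }')

# Stage 2: sed wraps the UUID into an fstab line; GNU sed's \L...\E
# lower-cases the captured text.
fstab=$(echo "$uuid" | sed -e "s/\(.*\)/UUID=\L\1\E \/shared gfs2 defaults,noatime,nodiratime 0 0/")

echo "$fstab"
```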
/etc/init.d/gfs2 status Configured GFS2 mountpoints:
/shared
Active GFS2 mountpoints:
/shared Perfect, gfs2 can see the partition now! We're ready to set up our directories. On an-c05n01, run; mkdir /shared/{definitions,provision,archive,files} On both nodes, confirm that all of the new directories exist and are visible. ls -lah /shared/ total 24K
drwxr-xr-x 6 root root 3.8K Dec 14 19:05 .
dr-xr-xr-x. 24 root root 4.0K Dec 14 18:44 ..
drwxr-xr-x 2 root root 0 Dec 14 19:05 archive
drwxr-xr-x 2 root root 0 Dec 14 19:05 definitions
drwxr-xr-x 2 root root 0 Dec 14 19:05 files
drwxr-xr-x 2 root root 0 Dec 14 19:05 provision Wonderful! As with drbd and clvmd, we don't want to have gfs2 start at boot as we're going to put it under the control of the cluster. chkconfig gfs2 off
chkconfig --list gfs2 gfs2 0:off 1:off 2:off 3:off 4:off 5:off 6:off Renaming a GFS2 Partition
If you ever need to rename your cluster, you will need to update your GFS2 partition before you can remount it. Unmount the partition from all nodes and run: gfs2_tool sb /dev/shared-vg0/shared table "new_cluster_name:shared" You shouldn't change any of these values if the filesystem is mounted.
Are you sure? [y/n] y
current lock table name = "an-cluster-A:shared"
new lock table name = "new_cluster_name:shared"
Done Then you can change the cluster's name in cluster.conf and then remount the GFS2 partition. You can use the same command, changing the GFS2 partition name, if you want to change the name of the filesystem instead of (or at the same time as) the cluster's name.

Stopping All Clustered Storage Components

Before we can put storage under the cluster's control, we need to make sure that the gfs2, clvmd and drbd daemons are stopped. On both nodes, run; /etc/init.d/gfs2 stop && /etc/init.d/clvmd stop && /etc/init.d/drbd stop Unmounting GFS2 filesystem (/shared): [ OK ]
Deactivating clustered VG(s): 0 logical volume(s) in volume group "an02-vg0" now active
0 logical volume(s) in volume group "an01-vg0" now active
0 logical volume(s) in volume group "shared-vg0" now active
[ OK ]
Signaling clvmd to exit [ OK ]
clvmd terminated [ OK ]
Stopping all DRBD resources: .

Managing Storage In The Cluster

A little while back, we spoke about how the cluster is split into two components; cluster communication managed by cman, and resource management provided by rgmanager. It's the latter which we will now begin to configure. In cluster.conf, the rgmanager component is contained within the <rm /> element tags. Within this element are three types of child elements. They are:
We'll look at each of these components in more detail shortly.

A Note On Daemon Starting

There are four daemons we will be putting under cluster control;
The reason we do not want to start these daemons with the system is so that we can let the cluster do it. This way, should any fail, the cluster will detect the failure and fail the entire service tree. For example, let's say that drbd failed to start; rgmanager would fail the storage service and give up, rather than continue trying to start clvmd and the rest. With libvirtd being the last daemon, it will not be possible to start a VM unless the storage started successfully. If we had left these daemons to start at boot, the failure of drbd would not affect the start-up of clvmd, which would then not find its PVs given that DRBD is down. Next, the system would try to start the gfs2 daemon, which would also fail as the LV backing the partition would not be available. Finally, the system would start libvirtd, which would allow the start of virtual machines, which would in turn be missing their "hard drives" as their backing LVs would also not be available. Pretty messy situation to clean up from.

Defining The Resources

Let's start by first defining our clustered resources. As stated before, the addition of these resources does not, in itself, put the defined resources under the cluster's management. Instead, it defines services, like init.d scripts. These can then be used by one or more <service /> elements, as we will see shortly. For now, it is enough to know that, until a resource is defined, it can not be used in the cluster. Given that this is the first component of rgmanager being added to cluster.conf, we will be creating the parent <rm /> element here as well. Let's take a look at the new section, then discuss the parts. <?xml version="1.0"?>
<cluster name="an-cluster-A" config_version="8">
<cman expected_votes="1" two_node="1" />
<clusternodes>
<clusternode name="an-c05n01.alteeve.ca" nodeid="1">
<fence>
<method name="ipmi">
<device name="ipmi_an01" action="reboot" />
</method>
<method name="pdu">
<device name="pdu2" port="1" action="reboot" />
</method>
</fence>
</clusternode>
<clusternode name="an-c05n02.alteeve.ca" nodeid="2">
<fence>
<method name="ipmi">
<device name="ipmi_an02" action="reboot" />
</method>
<method name="pdu">
<device name="pdu2" port="2" action="reboot" />
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="an-c05n01.ipmi" login="root" passwd="secret" />
<fencedevice name="ipmi_an02" agent="fence_ipmilan" ipaddr="an-c05n02.ipmi" login="root" passwd="secret" />
<fencedevice agent="fence_apc_snmp" ipaddr="pdu2.alteeve.ca" name="pdu2" />
</fencedevices>
<fence_daemon post_join_delay="30" />
<totem rrp_mode="none" secauth="off"/>
<rm>
<resources>
<script file="/etc/init.d/drbd" name="drbd"/>
<script file="/etc/init.d/clvmd" name="clvmd"/>
<script file="/etc/init.d/gfs2" name="gfs2"/>
<script file="/etc/init.d/libvirtd" name="libvirtd"/>
</resources>
</rm>
</cluster> First and foremost; Note that we've incremented the version to 8. As always, increment and then edit. Let's focus on the new section; <rm>
<resources>
<script file="/etc/init.d/drbd" name="drbd"/>
<script file="/etc/init.d/clvmd" name="clvmd"/>
<script file="/etc/init.d/gfs2" name="gfs2"/>
<script file="/etc/init.d/libvirtd" name="libvirtd"/>
</resources>
</rm> The <resources>...</resources> element contains our four <script .../> resources. This is a particular type of resource which specifically handles the starting and stopping of init.d-style scripts. That is, the script must exit with LSB-compliant codes. It must also properly react to being called with the sole argument of start, stop or status. There are many other types of resources which, with the exception of <vm .../>, we will not be looking at in this tutorial. Should you be interested in them, please look in /usr/share/cluster for the various scripts (executable files that end with .sh). Each of our four <script ... /> resources has two attributes;
Other resources are more involved, but the <script .../> resources are quite simple.

Creating Failover Domains

Fail-over domains are, at their most basic, a collection of one or more nodes in the cluster with a particular set of rules associated with them. Services can then be configured to operate within the context of a given fail-over domain. There are a few key options to be aware of. Generally speaking, fail-over domains are optional and can be left out of the cluster. However, in our cluster, we will need them for our storage services, as we will later see, so please do not skip this step.
What we need to do at this stage is to create something of a hack. Let me explain; As discussed earlier, we need to start a set of local daemons on all nodes. These aren't really clustered resources though, as they can only ever run on their host node. They will never be relocated or restarted elsewhere in the cluster and, as such, are not highly available. So to work around this desire to "cluster the unclusterable", we're going to create a fail-over domain for each node in the cluster. Each of these domains will have only one of the cluster nodes as a member, and the domain will be restricted, unordered and have no fail-back. With this configuration, any service group using it will only ever run on the one node in the domain. In the next step, we will create a service group, then replicate it once for each node in the cluster. The only difference will be the failoverdomain each is set to use. With our configuration of two nodes then, we will have two fail-over domains, one for each node, and we will define the clustered storage service twice, each one using one of the two fail-over domains. Let's look at the complete updated cluster.conf, then we will focus closer on the new section. <?xml version="1.0"?>
<cluster name="an-cluster-A" config_version="9">
<cman expected_votes="1" two_node="1" />
<clusternodes>
<clusternode name="an-c05n01.alteeve.ca" nodeid="1">
<fence>
<method name="ipmi">
<device name="ipmi_an01" action="reboot" />
</method>
<method name="pdu">
<device name="pdu2" port="1" action="reboot" />
</method>
</fence>
</clusternode>
<clusternode name="an-c05n02.alteeve.ca" nodeid="2">
<fence>
<method name="ipmi">
<device name="ipmi_an02" action="reboot" />
</method>
<method name="pdu">
<device name="pdu2" port="2" action="reboot" />
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="an-c05n01.ipmi" login="root" passwd="secret" />
<fencedevice name="ipmi_an02" agent="fence_ipmilan" ipaddr="an-c05n02.ipmi" login="root" passwd="secret" />
<fencedevice agent="fence_apc_snmp" ipaddr="pdu2.alteeve.ca" name="pdu2" />
</fencedevices>
<fence_daemon post_join_delay="30" />
<totem rrp_mode="none" secauth="off"/>
<rm>
<resources>
<script file="/etc/init.d/drbd" name="drbd"/>
<script file="/etc/init.d/clvmd" name="clvmd"/>
<script file="/etc/init.d/gfs2" name="gfs2"/>
<script file="/etc/init.d/libvirtd" name="libvirtd"/>
</resources>
<failoverdomains>
<failoverdomain name="only_an01" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-c05n01.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="only_an02" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-c05n02.alteeve.ca"/>
</failoverdomain>
</failoverdomains>
</rm>
</cluster> As always, the version was incremented, this time to 9. We've also added the new <failoverdomains>...</failoverdomains> element. Let's take a closer look at this new element. <failoverdomains>
<failoverdomain name="only_an01" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-c05n01.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="only_an02" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-c05n02.alteeve.ca"/>
</failoverdomain>
</failoverdomains> The first thing to note is that there are two <failoverdomain...>...</failoverdomain> child elements.
The <failoverdomain ...> element has four attributes;
Each of the <failoverdomain...> elements has a single <failoverdomainnode .../> child element. This is a very simple element which has, at this time, only one attribute;
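For contrast with our restricted, unordered domains, an ordered, unrestricted fail-over domain with per-node priorities would look something like this. This is a hypothetical example only; it is not used in this cluster:

```xml
<failoverdomain name="prefer_an01" nofailback="0" ordered="1" restricted="0">
	<failoverdomainnode name="an-c05n01.alteeve.ca" priority="1"/>
	<failoverdomainnode name="an-c05n02.alteeve.ca" priority="2"/>
</failoverdomain>
```

With ordered="1", rgmanager prefers the member with the lowest priority number, and with restricted="0", a service using the domain may still run on a node outside the domain if no member is available.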
At this point, we're ready to finally create our clustered storage services.

Creating Clustered Storage Services

With the resources defined and the fail-over domains created, we can set about creating our services. Generally speaking, services can have one or more resources within them. When two or more resources exist, they can be put into a dependency tree, used in parallel, or arranged in a combination of parallel and dependent resources. When you create a service dependency tree, you put each dependent resource as a child element of its parent. The resources are then started in order, beginning at the top of the tree and working down to the deepest child resource. If at any time one of the resources should fail, the entire service will be declared failed and no attempt will be made to start any further child resources. Conversely, stopping the service will cause the deepest child resource to be stopped first, then the second deepest, and so on upwards towards the top resource. This is exactly the behaviour we want, as we will see shortly. When resources are defined in parallel, all defined resources will be started at the same time. Should any one of the resources fail to start, the entire service will be declared failed. Stopping the service will likewise cause a simultaneous call to stop all resources. As before, let's take a look at the entire updated cluster.conf file, then we'll focus in on the new service section. <?xml version="1.0"?>
<cluster name="an-cluster-A" config_version="10">
<cman expected_votes="1" two_node="1" />
<clusternodes>
<clusternode name="an-c05n01.alteeve.ca" nodeid="1">
<fence>
<method name="ipmi">
<device name="ipmi_an01" action="reboot" />
</method>
<method name="pdu">
<device name="pdu2" port="1" action="reboot" />
</method>
</fence>
</clusternode>
<clusternode name="an-c05n02.alteeve.ca" nodeid="2">
<fence>
<method name="ipmi">
<device name="ipmi_an02" action="reboot" />
</method>
<method name="pdu">
<device name="pdu2" port="2" action="reboot" />
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="an-c05n01.ipmi" login="root" passwd="secret" />
<fencedevice name="ipmi_an02" agent="fence_ipmilan" ipaddr="an-c05n02.ipmi" login="root" passwd="secret" />
<fencedevice agent="fence_apc_snmp" ipaddr="pdu2.alteeve.ca" name="pdu2" />
</fencedevices>
<fence_daemon post_join_delay="30" />
<totem rrp_mode="none" secauth="off"/>
<rm>
<resources>
<script file="/etc/init.d/drbd" name="drbd"/>
<script file="/etc/init.d/clvmd" name="clvmd"/>
<script file="/etc/init.d/gfs2" name="gfs2"/>
<script file="/etc/init.d/libvirtd" name="libvirtd"/>
</resources>
<failoverdomains>
<failoverdomain name="only_an01" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-c05n01.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="only_an02" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-c05n02.alteeve.ca"/>
</failoverdomain>
</failoverdomains>
<service name="storage_an01" autostart="1" domain="only_an01" exclusive="0" recovery="restart">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd"/>
</script>
</script>
</script>
</service>
<service name="storage_an02" autostart="1" domain="only_an02" exclusive="0" recovery="restart">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd"/>
</script>
</script>
</script>
</service>
</rm>
</cluster> With the version now at 10, we have added two <service...>...</service> elements, each containing four <script ...> type resources in a service tree configuration. Let's take a closer look. <service name="storage_an01" autostart="1" domain="only_an01" exclusive="0" recovery="restart">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd"/>
</script>
</script>
</script>
</service>
<service name="storage_an02" autostart="1" domain="only_an02" exclusive="0" recovery="restart">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd"/>
</script>
</script>
</script>
</service> The <service ...>...</service> elements have five attributes each;
Within each of the two <service ...>...</service> elements are four <script...> type resources. These are configured as a service tree in the order;
Each of these <script ...> elements has just one attribute; ref="..." which points to a corresponding script resource. The logic for this particular resource tree is;
From the other direction, the stop order needs to be the reverse.
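The ordering can be sketched in plain shell. This is purely illustrative; rgmanager walks the service tree itself, the script below just prints the order it uses:

```shell
#!/bin/sh
# Illustrative only: print the order rgmanager uses for our storage service tree.
tree="drbd clvmd gfs2 libvirtd"

# Starting works from the top of the tree down to the deepest child.
for s in $tree; do
    echo "start $s"
done

# Stopping works from the deepest child back up to the top.
reversed=""
for s in $tree; do
    reversed="$s $reversed"
done
for s in $reversed; do
    echo "stop $s"
done
```

Running it prints "start drbd" first and "stop drbd" last, mirroring the dependency tree above.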
All in all, it's a surprisingly simple and effective configuration.

Validating And Pushing The Changes

We've made a big change, so it's all the more important that we validate the config before proceeding. ccs_config_validate Configuration validates We now need to tell the cluster to use the new configuration file. Unlike last time, we won't use rsync. Now that the cluster is up and running, we can use it to push out the updated configuration file using cman_tool. This is the first time we've used the cluster to push out an updated cluster.conf file, so we will have to enter the password we set earlier for the ricci user on both nodes. cman_tool version -r You have not authenticated to the ricci daemon on an-c05n01.alteeve.ca Password: You have not authenticated to the ricci daemon on an-c05n02.alteeve.ca Password: If you were watching syslog, you will have seen entries like the ones below. Dec 14 20:39:08 an-c05n01 modcluster: Updating cluster.conf
Dec 14 20:39:12 an-c05n01 corosync[2360]: [QUORUM] Members[2]: 1 2 Now we can confirm that both nodes are using the new configuration by re-running the cman_tool version command, but without the -r switch. On both; cman_tool version 6.2.0 config 10

Checking The Cluster's Status

Now let's look at a new tool; clustat, cluster status. We'll be using clustat extensively from here on out to monitor the status of the cluster members and managed services. It does not manage the cluster in any way, it is simply a status tool. Here is what it should look like when run from an-c05n01. clustat Cluster Status for an-cluster-A @ Wed Dec 14 20:45:04 2011
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local
an-c05n02.alteeve.ca 2 Online At this point, we're only running the foundation of the cluster, so we can only see which nodes are in the cluster. We've added resources to the cluster configuration though, so it's time to start the resource layer as well, which is managed by the rgmanager daemon. At this time, we're still starting the cluster manually after each node boots, so we're going to make sure that rgmanager is disabled at boot. chkconfig rgmanager off
chkconfig --list rgmanager rgmanager 0:off 1:off 2:off 3:off 4:off 5:off 6:off Now let's start it.
/etc/init.d/rgmanager start Starting Cluster Service Manager: [ OK ] Now let's run clustat again, and see what's new. clustat Cluster Status for an-cluster-A @ Wed Dec 14 20:52:11 2011
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started What we see are two sections; the top section shows the cluster members and the lower part covers the managed resources. We can see that both members, an-c05n01.alteeve.ca and an-c05n02.alteeve.ca, are Online, meaning that cman is running and that they've joined the cluster. It also shows us that both members are running rgmanager. You will always see Local beside the name of the node you ran the actual clustat command from. Under the services, you can see the two new services we created with the service: prefix. We can see that each service is started, meaning that all four of its resources are up and running properly, and we can see which node each service is running on. Note that the two storage services are running, despite our not having started them. That is because the rgmanager service was started earlier. When we pushed out the updated configuration, rgmanager saw that the two new storage services had autostart="1" and started them. If you check your storage services now, you will see that they are all online. DRBD; /etc/init.d/drbd status version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by dag@Build64R6, 2011-11-20 10:57:03
m:res cs ro ds p mounted fstype
0:r0 Connected Primary/Primary UpToDate/UpToDate C
1:r1 Connected Primary/Primary UpToDate/UpToDate C
2:r2 Connected Primary/Primary UpToDate/UpToDate C Clustered LVM; pvscan; vgscan; lvscan PV /dev/drbd2 VG an02-vg0 lvm2 [201.25 GiB / 201.25 GiB free]
PV /dev/drbd1 VG an01-vg0 lvm2 [201.62 GiB / 201.62 GiB free]
PV /dev/drbd0 VG shared-vg0 lvm2 [18.61 GiB / 0 free]
Total: 3 [421.48 GiB] / in use: 3 [421.48 GiB] / in no VG: 0 [0 ]
Reading all physical volumes. This may take a while...
Found volume group "an02-vg0" using metadata type lvm2
Found volume group "an01-vg0" using metadata type lvm2
Found volume group "shared-vg0" using metadata type lvm2
ACTIVE '/dev/shared-vg0/shared' [18.61 GiB] inherit GFS2; /etc/init.d/gfs2 status Configured GFS2 mountpoints:
Configured GFS2 mountpoints:
/shared
Active GFS2 mountpoints:
/shared Nice, eh?

Managing Cluster Resources

Managing services in the cluster is done with a fairly simple tool called clusvcadm. The main commands we're going to look at shortly are:
There are other ways to use clusvcadm which we will look at after the virtual servers are provisioned and under cluster control.

Stopping Clustered Storage - A Preview To Cold-Stopping The Cluster

To stop the storage services, we'll use the rgmanager command line tool clusvcadm, the cluster service administrator. Specifically, we'll use its -d switch, which tells rgmanager to disable the service.
As always, confirm the current state of affairs before starting. On both nodes, run clustat to confirm that the storage services are up. clustat Cluster Status for an-cluster-A @ Tue Dec 20 20:37:42 2011
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started They are, so now let's gracefully shut them down. On an-c05n01, run: clusvcadm -d storage_an01 Local machine disabling service:storage_an01...Success If we now run clustat from either node, we should see this; clustat Cluster Status for an-cluster-A @ Tue Dec 20 20:38:28 2011
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 (an-c05n01.alteeve.ca) disabled
service:storage_an02 an-c05n02.alteeve.ca started Notice how service:storage_an01 is now in the disabled state? If you check the status of drbd now on an-c05n02 you will see that an-c05n01 is indeed down. /etc/init.d/drbd status drbd driver loaded OK; device status:
version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by dag@Build64R6, 2011-11-20 10:57:03
m:res cs ro ds p mounted fstype
0:r0 WFConnection Primary/Unknown UpToDate/Outdated C
1:r1 WFConnection Primary/Unknown UpToDate/Outdated C
2:r2 WFConnection Primary/Unknown UpToDate/Outdated C If you want to shut down the entire cluster, you will need to stop the storage_an02 service as well. For fun, let's do this, but let's stop the service from an-c05n01; clusvcadm -d storage_an02 Local machine disabling service:storage_an02...Success Now on both nodes, we should see this from clustat; clustat Cluster Status for an-cluster-A @ Tue Dec 20 20:39:55 2011
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 (an-c05n01.alteeve.ca) disabled
service:storage_an02 (an-c05n02.alteeve.ca) disabled
We can now, if we wanted to, stop the rgmanager and cman daemons. This is, in fact, how we will cold-stop the cluster from now on. We'll cover cold-stopping the cluster after we finish provisioning VMs.

Starting Clustered Storage

Normally from now on, the clustered storage will start automatically. However, it's a good exercise to look at how to start the services manually, just in case. The main difference from stopping a service is that we swap the -d switch for the -e, enable, switch. We will also add the target cluster member name using the -m switch. We didn't need to use the member switch while stopping because the cluster could tell where the service was running and, thus, which member to contact to stop the service. Should you omit the member name, the cluster will try to use the local node as the target member. Note that the service will start on the node the command was issued on, regardless of the fail-over domain's ordered policy. That is to say, a service will not start on another node in the cluster when the member option is not specified, even if the fail-over configuration is set to prefer another node.
On an-c05n01, run; clusvcadm -e storage_an01 -m an-c05n01.alteeve.ca Member an-c05n01.alteeve.ca trying to enable service:storage_an01...Success
service:storage_an01 is now running on an-c05n01.alteeve.ca On an-c05n02, run; clusvcadm -e storage_an02 -m an-c05n02.alteeve.ca Member an-c05n02.alteeve.ca trying to enable service:storage_an02...Success
service:storage_an02 is now running on an-c05n02.alteeve.ca Now clustat on either node should again show the storage services running again. clustat Cluster Status for an-cluster-A @ Tue Dec 20 21:09:19 2011
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started

A Note On Resource Management With DRBD

When the cluster starts for the first time, with neither node's DRBD storage up, the first node to start will wait for /etc/drbd.d/global_common.conf's wfc-timeout seconds (300 in our case) for the second node to start. For this reason, we want to ensure that we enable the storage resources more or less at the same time and from two different terminals. The reason for two terminals is that the clusvcadm -e ... command won't return until all resources have started, so you need the second terminal window to start the other node's clustered storage service while the first one waits. If the clustered storage service ever fails, look in syslog's /var/log/messages for a split-brain error. Look for a message like: Mar 29 20:24:37 an-c05n01 kernel: block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2
Mar 29 20:24:37 an-c05n01 kernel: block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2 exit code 0 (0x0)
Mar 29 20:24:37 an-c05n01 kernel: block drbd2: Split-Brain detected but unresolved, dropping connection!
Mar 29 20:24:37 an-c05n01 kernel: block drbd2: helper command: /sbin/drbdadm split-brain minor-2
Mar 29 20:24:37 an-c05n01 kernel: block drbd2: helper command: /sbin/drbdadm split-brain minor-2 exit code 0 (0x0)
Mar 29 20:24:37 an-c05n01 kernel: block drbd2: conn( WFReportParams -> Disconnecting ) With the fencing hook into the cluster, this should be a very hard problem to run into. If you do hit it though, Linbit has the authoritative guide to recovering from this situation.

Provisioning Virtual Machines

Now we're getting to the purpose of our cluster; provisioning virtual machines! We have two steps left;
"Provisioning" a virtual machine simple means to create it; Assign a collection of emulated hardware, connected to physical devices, to a given virtual machine and begin the process of installing the operating system on it. This tutorial is more about clustering than it is about virtual machine administration, so some experience with managing virtual machines has to be assumed. If you need to brush up, here are some resources;
When you feel comfortable, proceed.

Before We Begin - Setting Up Our Workstation

The virtual machines are, for obvious reasons, headless. That is, they have no real video card into which we can plug a monitor and watch the progress of the install. Left unresolved, this would make it pretty hard to install the operating systems, as there is simply no network in the early stages of most operating system installations. Part of the libvirtd package is a program called virt-manager, which is available on almost all modern Linux distributions. This application makes it very easy to connect to our virtual machines, regardless of their network state. How you install this will depend on your workstation. On RPM-based systems, try: yum install virt-manager On deb-based systems, try: apt-get install virt-manager On SUSE-based systems, try; zypper install virt-manager Once it is installed, you need to determine whether your workstation is on the IFN or BCN. I've got my laptop on the BCN, so I will connect to the nodes using just their short host names. If you're on the same IFN as the nodes, you will need to append .ifn to the host names. To connect to the cluster nodes;
Once your two nodes have been added to virt-manager, you should see both nodes as connected, but no VMs will be shown as we've not provisioned any yet. We'll come back to virt-manager shortly.

Provision Planning

Before we can start creating virtual machines, we need to take stock of what resources we have available and how we want to divvy them out to the VMs. In my cluster, I've got 200 GiB available on each of my two nodes. vgdisplay |grep -i -e free -e "vg name" VG Name an02-vg0
Free PE / Size 51521 / 201.25 GiB
VG Name an01-vg0
Free PE / Size 51615 / 201.62 GiB
VG Name shared-vg0
Free PE / Size 0 / 0 I know I have 8 GiB of memory, but I have to slice off a certain amount of that for the host OS. I've got my nodes sitting about where they will be normally, so I can check how much memory is in use fairly easily. cat /proc/meminfo |grep -e MemTotal -e MemFree MemTotal: 8050312 kB
MemFree: 7432288 kB I'm sitting at about 604 MiB used (8,050,312 KiB - 7,432,288 KiB == 618,024 KiB / 1,024 == 603.54 MiB). I think I can safely operate within 1 GiB, leaving me 7 GiB of RAM to allocate to VMs. Next up, I need to confirm how many CPU cores I have available. cat /proc/cpuinfo |grep processor processor : 0
processor : 1
processor : 2
processor : 3 I've got four, and I like to dedicate the first one to the host OS, so I've got three to allocate to my VMs. On the network front, I know I've got two bridges, one to the IFN and one to the BCN. So let's summarize:
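The memory and CPU checks above can be condensed into a couple of one-liners. This is just a convenience sketch; the awk program repeats the MemTotal - MemFree arithmetic described above:

```shell
#!/bin/sh
# Memory currently in use, in MiB (MemTotal - MemFree from /proc/meminfo).
awk '/^MemTotal/ {t=$2} /^MemFree/ {f=$2} END {printf "Used: %.2f MiB\n", (t-f)/1024}' /proc/meminfo

# Number of CPU cores (same as counting the 'processor' lines above).
grep -c '^processor' /proc/cpuinfo
```

On my nodes this reports roughly 604 MiB used and 4 cores; your numbers will differ.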
With this list in mind, we can now start planning out the VMs. The network can share the same subnet as the IFN if you wish, but I prefer to isolate my VMs from the IFN using a different subnet, 10.254.0.0/16. This is, admittedly, "security by obscurity" and in no way is it a replacement for proper isolation. In production, you will want to set up firewalls on your nodes to prevent access from virtual machines. With that said, here is what we will install now. Obviously, you will have other needs and goals. Mine is an admittedly artificial network.
Now to divvy up the resources;
Notice that we've over-allocated the CPU cores? This is ok. We're going to restrict the VMs to CPU cores number 1 through 3, leaving core number 0 for the host OS. When all of the VMs are running on one node, the hypervisor's scheduler will handle shuffling jobs from the VMs' cores to the real cores that are least loaded at a given time. As for the RAM though, we cannot use more than we have. We're going to leave 1 GiB for the host, so we'll divvy the remaining 7 GiB between the VMs. Remember, we have to plan for when all four VMs will run on just one node.

A Note on VM Configuration

It would be a diversion of questionable value to cover the setup of each VM. It will be up to you, reader, to set up each VM however you like.

Provisioning vm01-dev
Before we can provision, we need to gather whatever install source we'll need for the VM. This can be a simple ISO file, as we'll see in the Windows install later, or it can be files on a web server, which we'll use here. We'll also need to create the "hard drive" for the VM, which will be a new LV. Finally, we'll craft the virt-install command which will begin the actual OS install. This being a Linux machine, we can provision it over the network. Conveniently, I've got a PXE server set up with the CentOS install files available on my local network at http://10.255.255.254/c6/x86_64/img/. You don't need to have a full PXE server set up; mounting the install ISO and pointing a web server at the mounted directory would work just fine. I'm also going to further customize my install by using a kickstart file which, effectively, pre-answers the installation questions so that the install is fully automated. So, let's create the new LV. I know that this machine will primarily run on an-c05n01 and that it will be 150 GiB. I personally always name the LVs as vmXXXX-Y, where XXXX is the VM's number and Y is a simple integer. You are obviously free to use whatever makes the most sense to you.

Creating vm01-dev's Storage

With that, the lvcreate call is; On an-c05n01, run; lvcreate -L 150G -n vm0001-1 /dev/an01-vg0 Logical volume "vm0001-1" created

Creating vm01-dev's virt-install Call

Now with the storage created, we can craft the virt-install command. I like to put this into a file under the /shared/provision/ directory for future reference. Let's take a look at the command, then we'll discuss what the switches are for.
chmod 755 /shared/provision/vm01-dev.sh
vim /shared/provision/vm01-dev.sh virt-install --connect qemu:///system \
--name vm01-dev \
--ram 1024 \
--arch x86_64 \
--vcpus 1 \
--location http://10.255.255.254/c6/x86_64/img/ \
--extra-args "ks=http://10.255.255.254/c6/x86_64/ks/c6_minimal.ks" \
--os-type linux \
--os-variant rhel6 \
--disk path=/dev/an01-vg0/vm0001-1 \
--network bridge=vbr2 \
--vnc
Let's break it down;
This tells virt-install to use the QEMU hardware emulator (as opposed to Xen) and to install the VM on to local system.
This sets the name of the VM. It is the name we will use in the cluster configuration and whenever we use the libvirtd tools, like virsh.
This sets the amount of RAM, in MiB, to allocate to this VM. Here, we're allocating 1 GiB (1,024 MiB).
This sets the emulated CPU's architecture to 64-bit. This can be used even when you plan to install a 32-bit OS, but not the other way around, of course.
This sets the number of CPU cores to allocate to this VM. Here, we're setting just one.
This tells virt-install to pull the installation files from the URL specified.
This is an optional command used to pass the install kernel arguments. Here, I'm using it to tell the kernel to grab the specified kickstart file for use during the installation.
This broadly sets hardware emulation for optimal use with Linux-based virtual machines.
This further refines tweaks to the hardware emulation to maximize performance for RHEL6 (and derivative) installs.
This tells the installer to use the LV we created earlier as the backing storage device for the new virtual machine.
This tells the installer to create a network card in the VM and to then connect it to the vbr2 bridge, thus connecting the VM to the IFN. Optionally, you could add ,model=e1000 option to tells the emulator to mimic an Intel e1000 hardware NIC. The default is to use the virtio virtualized network card. If you have two or more bridges, you can repeat the --network switch as many times as you need.
This tells virt-install to set up a VNC server on the VM and, if possible, immediately connect to the just-provisioned VM. With a minimal install on the nodes, the automatically spawned client will fail. This is fine; just use virt-manager from your workstation.
Initializing vm01-dev's Install

Well, time to start the install! On an-c05n01, run; /shared/provision/vm01-dev.sh Starting install...
Retrieving file .treeinfo... | 676 B 00:00 ...
Retrieving file vmlinuz... | 7.5 MB 00:00 ...
Retrieving file initrd.img... | 59 MB 00:02 ...
Creating domain... | 0 B 00:00
WARNING Unable to connect to graphical console: virt-viewer not installed. Please install the 'virt-viewer' package.
Domain installation still in progress. You can reconnect to
the console to complete the installation process. And it's off! Once the install completes, note that, depending on your kickstart file, it may have automatically rebooted or you may need to reboot manually.
Defining vm01-dev On an-c05n02

We can use virsh to see that the new virtual machine exists and what state it is in. Note that I've gotten into the habit of using --all to get around virsh's default behaviour of hiding VMs that are off. On an-c05n01; virsh list --all Id Name State
----------------------------------
2 vm01-dev running On an-c05n02; virsh list --all Id Name State
---------------------------------- As we see, the new vm01-dev is only known to an-c05n01. This is, in and of itself, just fine. We're going to need to put the virtual machine's XML definition file in a common place accessible on both nodes. This could be matching but separate directories on either node, or it can be a common shared location. As we've got the cluster's /shared GFS2 partition, we're going to use the /shared/definitions directory we created earlier. This avoids the need to remember to keep two copies of the file in sync across both nodes. To back up the VM's configuration, we'll again use virsh, but this time with the dumpxml command. On an-c05n01;
cat /shared/definitions/vm01-dev.xml <domain type='kvm' id='2'>
<name>vm01-dev</name>
<uuid>2512b2dd-a1a8-f990-2a0d-6c41968ab3f8</uuid>
<memory>1048576</memory>
<currentMemory>1048576</currentMemory>
<vcpu>1</vcpu>
<os>
<type arch='x86_64' machine='rhel6.2.0'>hvm</type>
<boot dev='network'/>
<boot dev='cdrom'/>
<boot dev='hd'/>
<bootmenu enable='yes'/>
</os>
<features>
<acpi/>
<apic/>
<pae/>
</features>
<clock offset='utc'/>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>restart</on_crash>
<devices>
<emulator>/usr/libexec/qemu-kvm</emulator>
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='none' io='native'/>
<source dev='/dev/an01-vg0/vm0001-1'/>
<target dev='vda' bus='virtio'/>
<alias name='virtio-disk0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>
<interface type='bridge'>
<mac address='52:54:00:9b:3c:f7'/>
<source bridge='vbr2'/>
<target dev='vnet0'/>
<model type='virtio'/>
<alias name='net0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</interface>
<serial type='pty'>
<source path='/dev/pts/2'/>
<target port='0'/>
<alias name='serial0'/>
</serial>
<console type='pty' tty='/dev/pts/2'>
<source path='/dev/pts/2'/>
<target type='serial' port='0'/>
<alias name='serial0'/>
</console>
<input type='tablet' bus='usb'>
<alias name='input0'/>
</input>
<input type='mouse' bus='ps2'/>
<graphics type='vnc' port='5900' autoport='yes'/>
<video>
<model type='cirrus' vram='9216' heads='1'/>
<alias name='video0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</video>
<memballoon model='virtio'>
<alias name='balloon0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</memballoon>
</devices>
</domain> There we go; That is the emulated hardware on which your virtual machine exists. Pretty neat, eh? I like to keep all of my VMs defined on all of my nodes. This is entirely optional, as the cluster will define the VM on a target node when needed. It is, though, a good chance to examine how this is done manually. On an-c05n02; virsh define /shared/definitions/vm01-dev.xml Domain vm01-dev defined from /shared/definitions/vm01-dev.xml We can confirm that it now exists by re-running virsh list --all. virsh list --all Id Name State
----------------------------------
- vm01-dev shut off You should also now be able to see vm01-dev under an-c05n02 in your virt-manager window. It will be listed as shutoff, which is expected. Do not try to turn it on while it's running on the other node!

Provisioning vm02-web

This installation will be pretty much the same as it was for vm01-dev, so we'll look mainly at the differences.

Creating vm02-web's Storage

We'll use lvcreate again, but this time we won't specify an explicit size; instead, we'll allocate a percentage of the remaining free space. Note that the -L switch changes to -l; On an-c05n01, run; lvcreate -l 100%FREE -n vm0002-1 /dev/an01-vg0 Logical volume "vm0002-1" created

Creating vm02-web's virt-install Call

The virt-install command will be quite similar to the previous one. touch /shared/provision/vm02-web.sh
chmod 755 /shared/provision/vm02-web.sh
vim /shared/provision/vm02-web.sh virt-install --connect qemu:///system \
--name vm02-web \
--ram 2048 \
--arch x86_64 \
--vcpus 2 \
--location http://10.255.255.254/c6/x86_64/img/ \
--extra-args "ks=http://10.255.255.254/c6/x86_64/ks/c6_minimal.ks" \
--os-type linux \
--os-variant rhel6 \
--disk path=/dev/an01-vg0/vm0002-1 \
--network bridge=vbr2 \
--vnc

Let's look at the differences;
Note that the same kickstart file from before is used. This is fine, as it doesn't specify a specific IP address and it is smart enough to adapt to the new virtual disk size.

Initializing vm02-web's Install

Well, time to start the install! On an-c05n01, run;

/shared/provision/vm02-web.sh

Starting install...
Retrieving file .treeinfo... | 676 B 00:00 ...
Retrieving file vmlinuz... | 7.5 MB 00:00 ...
Retrieving file initrd.img... | 59 MB 00:02 ...
Creating domain... | 0 B 00:00
WARNING Unable to connect to graphical console: virt-viewer not installed. Please install the 'virt-viewer' package.
Domain installation still in progress. You can reconnect to
the console to complete the installation process.

The install should proceed more or less the same as it did for vm01-dev.

Defining vm02-web On an-c05n02

We can use virsh to see that the new virtual machine exists and what state it is in. Note that I've gotten into the habit of using --all to get around virsh's default behaviour of hiding VMs that are off. On an-c05n01;

virsh list --all

Id Name State
----------------------------------
2 vm01-dev running
4 vm02-web running On an-c05n02; virsh list --all Id Name State
----------------------------------
- vm01-dev shut off As before, the new vm02-web is only known to an-c05n01. On an-c05n01; virsh dumpxml vm02-web > /shared/definitions/vm02-web.xml
cat /shared/definitions/vm02-web.xml <domain type='kvm' id='4'>
<name>vm02-web</name>
<uuid>02f967ab-103f-c276-c40f-9eaa47339df4</uuid>
<memory>2097152</memory>
<currentMemory>2097152</currentMemory>
<vcpu>2</vcpu>
<os>
<type arch='x86_64' machine='rhel6.2.0'>hvm</type>
<boot dev='hd'/>
</os>
<features>
<acpi/>
<apic/>
<pae/>
</features>
<clock offset='utc'/>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>restart</on_crash>
<devices>
<emulator>/usr/libexec/qemu-kvm</emulator>
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='none' io='native'/>
<source dev='/dev/an01-vg0/vm0002-1'/>
<target dev='vda' bus='virtio'/>
<alias name='virtio-disk0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>
<interface type='bridge'>
<mac address='52:54:00:65:39:60'/>
<source bridge='vbr2'/>
<target dev='vnet1'/>
<model type='virtio'/>
<alias name='net0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</interface>
<serial type='pty'>
<source path='/dev/pts/3'/>
<target port='0'/>
<alias name='serial0'/>
</serial>
<console type='pty' tty='/dev/pts/3'>
<source path='/dev/pts/3'/>
<target type='serial' port='0'/>
<alias name='serial0'/>
</console>
<input type='tablet' bus='usb'>
<alias name='input0'/>
</input>
<input type='mouse' bus='ps2'/>
<graphics type='vnc' port='5901' autoport='yes'/>
<video>
<model type='cirrus' vram='9216' heads='1'/>
<alias name='video0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</video>
<memballoon model='virtio'>
<alias name='balloon0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</memballoon>
</devices>
</domain> There we go; That is the emulated hardware on which your virtual machine exists. Pretty neat, eh? I like to keep all of my VMs defined on all of my nodes. This is entirely optional, as the cluster will define the VM on a target node when needed. It is, though, a good chance to examine how this is done manually. On an-c05n02; virsh define /shared/definitions/vm02-web.xml Domain vm02-web defined from /shared/definitions/vm02-web.xml We can confirm that it now exists by re-running virsh list --all. virsh list --all Id Name State
----------------------------------
- vm01-dev shut off
- vm02-web shut off Provisioning vm03-dbThis installation will, again, be pretty much the same as it was for vm01-dev and vm02-web, so we'll again look mainly at the differences. Creating vm03-db's StorageWe'll use lvcreate again, but being the first LV on the an02-vg0, we'll specify the specific size again. On an-c05n01, run; lvcreate -L 100G -n vm0003-1 /dev/an02-vg0 Logical volume "vm0003-1" created Creating vm03-db's virt-install CallThe virt-install command will be quite similar to the previous one. touch /shared/provision/vm03-db.sh
chmod 755 /shared/provision/vm03-db.sh
vim /shared/provision/vm03-db.sh virt-install --connect qemu:///system \
--name vm03-db \
--ram 2048 \
--arch x86_64 \
--vcpus 2 \
--location http://10.255.255.254/c6/x86_64/img/ \
--extra-args "ks=http://10.255.255.254/c6/x86_64/ks/c6_minimal.ks" \
--os-type linux \
--os-variant rhel6 \
--disk path=/dev/an02-vg0/vm0003-1 \
--network bridge=vbr2 \
--vnc

Let's look at the differences;
Initializing vm03-db's Install

This time we're going to provision the new VM on an-c05n02, as that is where it will live normally. On an-c05n02, run;

/shared/provision/vm03-db.sh

Starting install...
Retrieving file .treeinfo... | 676 B 00:00 ...
Retrieving file vmlinuz... | 7.5 MB 00:00 ...
Retrieving file initrd.img... | 59 MB 00:02 ...
Creating domain... | 0 B 00:00
WARNING Unable to connect to graphical console: virt-viewer not installed. Please install the 'virt-viewer' package.
Domain installation still in progress. You can reconnect to
the console to complete the installation process.

The install should proceed more or less the same as it did for vm01-dev and vm02-web.

Defining vm03-db On an-c05n01

We can use virsh to see that the new virtual machine exists and what state it is in. Note that I've gotten into the habit of using --all to get around virsh's default behaviour of hiding VMs that are off. On an-c05n02;

virsh list --all

Id Name State
----------------------------------
2 vm03-db running
- vm01-dev shut off
- vm02-web shut off On an-c05n01; virsh list --all Id Name State
----------------------------------
2 vm01-dev running
4 vm02-web running

To back up the VM's configuration, we'll again use virsh, but this time with the dumpxml command. On an-c05n02;

virsh dumpxml vm03-db > /shared/definitions/vm03-db.xml
cat /shared/definitions/vm03-db.xml <domain type='kvm' id='2'>
<name>vm03-db</name>
<uuid>a7018001-b433-b739-bbd9-d4d3285f0a72</uuid>
<memory>2097152</memory>
<currentMemory>2097152</currentMemory>
<vcpu>2</vcpu>
<os>
<type arch='x86_64' machine='rhel6.2.0'>hvm</type>
<boot dev='hd'/>
</os>
<features>
<acpi/>
<apic/>
<pae/>
</features>
<clock offset='utc'/>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>restart</on_crash>
<devices>
<emulator>/usr/libexec/qemu-kvm</emulator>
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='none' io='native'/>
<source dev='/dev/an02-vg0/vm0003-1'/>
<target dev='vda' bus='virtio'/>
<alias name='virtio-disk0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>
<interface type='bridge'>
<mac address='52:54:00:44:83:ec'/>
<source bridge='vbr2'/>
<target dev='vnet0'/>
<model type='virtio'/>
<alias name='net0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</interface>
<serial type='pty'>
<source path='/dev/pts/2'/>
<target port='0'/>
<alias name='serial0'/>
</serial>
<console type='pty' tty='/dev/pts/2'>
<source path='/dev/pts/2'/>
<target type='serial' port='0'/>
<alias name='serial0'/>
</console>
<input type='tablet' bus='usb'>
<alias name='input0'/>
</input>
<input type='mouse' bus='ps2'/>
<graphics type='vnc' port='5900' autoport='yes'/>
<video>
<model type='cirrus' vram='9216' heads='1'/>
<alias name='video0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</video>
<memballoon model='virtio'>
<alias name='balloon0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</memballoon>
</devices>
</domain> On an-c05n01; virsh define /shared/definitions/vm03-db.xml Domain vm03-db defined from /shared/definitions/vm03-db.xml We can confirm that it now exists by re-running virsh list --all. virsh list --all Id Name State
----------------------------------
2 vm01-dev running
4 vm02-web running
- vm03-db shut off

Provisioning vm04-ms

Now for something a little different! This will be the Windows 2008 R2 virtual machine. The biggest difference this time is that we're going to install from an ISO file rather than from a web-accessible store. Another difference is that we're going to specify what kind of storage bus to use with this VM. We'll be using a special, virtualized bus called virtio, which requires that its drivers be available to the OS at install time. These drivers will, in turn, be made available to the installer as a virtual floppy disk. It will make for quite the interesting virt-install call, as we'll see.

Preparing vm04-ms's Storage

As before, we need to create the backing storage LV before we can provision the machine. As we planned, this will be a 100 GiB partition on the an02-vg0 VG. Seeing as this LV will use up the rest of the free space in the VG, we'll again use lvcreate -l 100%FREE instead of -L 100G, as sometimes the numbers don't work out to be exactly the size we intend. On an-c05n02, run;

lvcreate -l 100%FREE -n vm0004-1 /dev/an02-vg0

Logical volume "vm0004-1" created

Before we proceed, we need to put a copy of the install media, the OS's ISO and the virtual floppy disk, somewhere the installer can access. I like to put files like this into the /shared/files/ directory we created earlier. How you put them there is left as an exercise for the reader. If you do not have a copy of Microsoft's server operating system, you can download a 30-day free trial here; The driver for the virtio bus can be found from Red Hat here. Note that there is an ISO and a vfd (virtual floppy disk) file. You can use the ISO and mount it as a second CD-ROM if you wish. This tutorial will use the virtual floppy disk to show how floppy images can be used in VMs:
For those wishing to use the floppy image:
Creating vm04-ms's virt-install Call

Let's look at the virt-install command, then we'll discuss the main differences from the previous calls. As before, we'll put this command into a small shell script for later reference.

touch /shared/provision/vm04-ms.sh
chmod 755 /shared/provision/vm04-ms.sh
vim /shared/provision/vm04-ms.sh virt-install --connect qemu:///system \
--name vm04-ms \
--ram 2048 \
--arch x86_64 \
--vcpus 2 \
--cdrom /shared/files/Windows_Server_2008_R2_64Bit_SP1.iso \
--disk path=/dev/an02-vg0/vm0004-1,device=disk,bus=virtio \
--disk path=/shared/files/virtio-win-1.1.16.vfd,device=floppy \
--os-type windows \
--os-variant win2k8 \
--network bridge=vbr2 \
--vnc

Let's look at the main differences;
Here we've swapped out the --location and --extra-args arguments for the --cdrom switch. This will create an emulated DVD-ROM drive and boot from it. The path given is the ISO image of the installation media we want to use.
This is the same line we used before, pointing to the new LV of course, but we've added options to it. Specifically, we've told the hardware emulator, QEMU, to put the disk on the virtio bus rather than on the standard (ide or scsi) bus. virtio is a special, paravirtualized bus that improves storage I/O on Windows (and other) guests. Windows does not support this bus natively, which brings us to the next option.
This mounts the emulated floppy disk with the virtio drivers that we'll need to allow Windows to see the hard drive during the install. The rest is more or less the same as before.

Initializing vm04-ms's Install

As before, we'll run the script with the virt-install command in it. On an-c05n02, run;

/shared/provision/vm04-ms.sh

Starting install...
Creating domain... | 0 B 00:00
WARNING Unable to connect to graphical console: virt-viewer not installed. Please install the 'virt-viewer' package.
Domain installation still in progress. Waiting for installation to complete.

This install isn't automated like the previous installs were, so we'll need to hand-hold the VM through the install. After you click to select the Custom (advanced) installation method, click on the Load Driver option on the bottom left. You will be presented with a window telling you your options for loading the drivers. Click on the OK button and the installer will automatically find the virtual floppy disk and present you with the available drivers. Click to highlight Red Hat VirtIO SCSI Controller (A:\amd64\Win2008\viostor.inf) and click the Next button. At this point, the Windows installer will see the virtual hard drive and you can proceed with the install as you would normally install Windows 2008 R2 server. Once the install is complete, reboot.

Post-Install Housekeeping

We have to be careful to "eject" the virtual floppy and DVD disks from the VM. If you neglect to do so and later delete the files, virsh will fail to boot the VMs and will undefine them entirely. (Yes, that is dumb, in this author's opinion.) How to recover from this issue can be found below.
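Before deleting anything from /shared/files/, it can help to confirm that no dumped definition still references the media. The following is an optional sketch, not part of the original procedure; the XML here is a hypothetical, trimmed stand-in for the output of virsh dumpxml vm04-ms before the media are disconnected, and it assumes the file-backed <source file='...'/> form that libvirt uses for attached ISO and floppy images.

```shell
# Sketch: list any media files still referenced by a dumped definition.
# '/tmp/vm04-ms.check.xml' is a hypothetical trimmed dump, standing in
# for the output of 'virsh dumpxml vm04-ms'.
cat > /tmp/vm04-ms.check.xml <<'EOF'
<domain type='kvm'>
  <devices>
    <disk type='file' device='floppy'>
      <source file='/shared/files/virtio-win-1.1.16.vfd'/>
    </disk>
    <disk type='file' device='cdrom'>
      <source file='/shared/files/Windows_Server_2008_R2_64Bit_SP1.iso'/>
    </disk>
  </devices>
</domain>
EOF
# Any paths printed here are still attached; disconnect them before
# deleting the files themselves.
grep -o "source file='[^']*'" /tmp/vm04-ms.check.xml | cut -d"'" -f2
```

If this prints nothing, the media have been disconnected and the corresponding files under /shared/files/ should be safe to remove.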
To "eject" the DVD-ROM and floppy drive, we will use the virt-manager graphical program. You will need to either run virt-manager on one of the nodes, or run it from your workstation and connect to the host node over SSH. This latter method is what I like to do. Using virt-manager, connect to the vm04-ms VM. Click on View then Details and you will see the virtual machine's emulated hardware. First, let's eject the virtual floppy disk. In the left panel, click to select the Floppy 1 device. Click on the Disconnect button and the disk will be unmounted. Now to eject the emulated DVD-ROM; again on the left panel, click to select the IDE CDROM 1 device. Click on Disconnect again to unmount the ISO image. Now both the floppy disk and DVD image have been unmounted from the VM. We can return to the console view (View -> Console) and we will see that both the floppy disk and DVD drive no longer show any media as mounted within them. Done!

Defining vm04-ms On an-c05n01

Now, with the installation media unmounted, and as we did before, we will use virsh dumpxml to write out the XML definition file for the new VM and then virsh define it on an-c05n01. On an-c05n02;

virsh list --all

Id Name State
----------------------------------
2 vm03-db running
4 vm04-ms running
- vm01-dev shut off
- vm02-web shut off On an-c05n01; virsh list --all Id Name State
----------------------------------
2 vm01-dev running
4 vm02-web running
- vm03-db shut off As before, our new VM is only defined on the node we installed it on. We'll fix this now. On an-c05n02; virsh dumpxml vm04-ms > /shared/definitions/vm04-ms.xml
cat /shared/definitions/vm04-ms.xml <domain type='kvm' id='4'>
<name>vm04-ms</name>
<uuid>4c537551-96f4-3b5e-209a-0e41cab41d44</uuid>
<memory>2097152</memory>
<currentMemory>2097152</currentMemory>
<vcpu>2</vcpu>
<os>
<type arch='x86_64' machine='rhel6.2.0'>hvm</type>
<boot dev='hd'/>
</os>
<features>
<acpi/>
<apic/>
<pae/>
</features>
<clock offset='localtime'>
<timer name='rtc' tickpolicy='catchup'/>
</clock>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>restart</on_crash>
<devices>
<emulator>/usr/libexec/qemu-kvm</emulator>
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='none' io='native'/>
<source dev='/dev/an02-vg0/vm0004-1'/>
<target dev='vda' bus='virtio'/>
<alias name='virtio-disk0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>
<disk type='file' device='floppy'>
<driver name='qemu' type='raw' cache='none'/>
<target dev='fda' bus='fdc'/>
<alias name='fdc0-0-0'/>
<address type='drive' controller='0' bus='0' unit='0'/>
</disk>
<disk type='file' device='cdrom'>
<driver name='qemu' type='raw'/>
<target dev='hdc' bus='ide'/>
<readonly/>
<alias name='ide0-1-0'/>
<address type='drive' controller='0' bus='1' unit='0'/>
</disk>
<controller type='fdc' index='0'>
<alias name='fdc0'/>
</controller>
<controller type='ide' index='0'>
<alias name='ide0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
</controller>
<interface type='bridge'>
<mac address='52:54:00:5e:b1:47'/>
<source bridge='vbr2'/>
<target dev='vnet1'/>
<alias name='net0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</interface>
<serial type='pty'>
<source path='/dev/pts/3'/>
<target port='0'/>
<alias name='serial0'/>
</serial>
<console type='pty' tty='/dev/pts/3'>
<source path='/dev/pts/3'/>
<target type='serial' port='0'/>
<alias name='serial0'/>
</console>
<input type='tablet' bus='usb'>
<alias name='input0'/>
</input>
<input type='mouse' bus='ps2'/>
<graphics type='vnc' port='5901' autoport='yes'/>
<video>
<model type='vga' vram='9216' heads='1'/>
<alias name='video0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</video>
<memballoon model='virtio'>
<alias name='balloon0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</memballoon>
</devices>
</domain>

As before, defining the VM on both nodes is optional, but a habit I like to keep. On an-c05n01;

virsh define /shared/definitions/vm04-ms.xml

Domain vm04-ms defined from /shared/definitions/vm04-ms.xml

We can confirm that it now exists by re-running virsh list --all.

virsh list --all

Id Name State
----------------------------------
2 vm01-dev running
4 vm02-web running
- vm03-db shut off
- vm04-ms shut off

With that, all our VMs exist and we're ready to make them highly available!

Making Our VMs Highly Available Cluster Services

We're ready to start the final step; making our VMs highly available cluster services! This involves two main steps: creating the ordered fail-over domains, then making the VMs clustered services.
Creating the Ordered Fail-Over Domains

We have planned for two VMs, vm01-dev and vm02-web, to normally run on an-c05n01, while vm03-db and vm04-ms will normally run on an-c05n02. Of course, should one of the nodes fail, the lost VMs will be restarted on the surviving node. For this, we will use ordered fail-over domains. The idea here is that each new fail-over domain will have one node with a higher priority than the other. That is, one will have an-c05n01 with the highest priority and the other will have an-c05n02 as the highest. This way, VMs that we want to normally run on a given node will be added to the matching fail-over domain.
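To make the "ordered" behaviour concrete before we write the configuration, here is a toy shell sketch, not cluster code, of the selection rule an ordered domain follows: among the currently-online member nodes, the one with the lowest priority number hosts the service. The pick_host helper and its "node:priority" arguments are illustrative inventions; the names and priorities match the primary_an01 domain we are about to create.

```shell
# Toy illustration of ordered fail-over domain selection; this is NOT
# how rgmanager is implemented, just the rule it follows.
pick_host() {
    # Arguments are "node:priority" pairs for currently-online nodes;
    # the node with the lowest priority number wins.
    printf '%s\n' "$@" | sort -t: -k2 -n | head -n 1 | cut -d: -f1
}

# Both nodes online; priorities as in primary_an01.
pick_host "an-c05n01.alteeve.ca:1" "an-c05n02.alteeve.ca:2"
# an-c05n01 has failed; only an-c05n02 remains online.
pick_host "an-c05n02.alteeve.ca:2"
```

The first call prints an-c05n01.alteeve.ca and the second prints an-c05n02.alteeve.ca, mirroring what the cluster will do when a node fails and when it recovers.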
Here are the two new domains we will create in /etc/cluster/cluster.conf; <failoverdomains>
...
<failoverdomain name="primary_an01" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-c05n01.alteeve.ca" priority="1"/>
<failoverdomainnode name="an-c05n02.alteeve.ca" priority="2"/>
</failoverdomain>
<failoverdomain name="primary_an02" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-c05n01.alteeve.ca" priority="2"/>
<failoverdomainnode name="an-c05n02.alteeve.ca" priority="1"/>
</failoverdomain>
</failoverdomains>

The two major pieces of the puzzle here are the <failoverdomain ...>'s ordered="1" attribute and the <failoverdomainnode ...>'s priority="x" attributes. The former tells the cluster that there is a preference for which node should be used when both are available. The latter, which is the difference between the two new domains, tells the cluster which specific node is preferred. The first of the new fail-over domains is primary_an01. Any service placed in this domain will prefer to run on an-c05n01, as its priority of 1 outranks an-c05n02's priority of 2 (the lower the number, the higher the preference). The second of the new domains is primary_an02, which reverses the preference, making an-c05n02 preferred over an-c05n01. Let's look at the complete cluster.conf with the new domains, and with the version updated to 11, of course.

<?xml version="1.0"?>
<cluster config_version="11" name="an-cluster-A">
<cman expected_votes="1" two_node="1"/>
<clusternodes>
<clusternode name="an-c05n01.alteeve.ca" nodeid="1">
<fence>
<method name="ipmi">
<device action="reboot" name="ipmi_an01"/>
</method>
<method name="pdu">
<device action="reboot" name="pdu2" port="1"/>
</method>
</fence>
</clusternode>
<clusternode name="an-c05n02.alteeve.ca" nodeid="2">
<fence>
<method name="ipmi">
<device action="reboot" name="ipmi_an02"/>
</method>
<method name="pdu">
<device action="reboot" name="pdu2" port="2"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice agent="fence_ipmilan" ipaddr="an-c05n01.ipmi" login="root" name="ipmi_an01" passwd="secret"/>
<fencedevice agent="fence_ipmilan" ipaddr="an-c05n02.ipmi" login="root" name="ipmi_an02" passwd="secret"/>
<fencedevice agent="fence_apc_snmp" ipaddr="pdu2.alteeve.ca" name="pdu2"/>
</fencedevices>
<fence_daemon post_join_delay="30"/>
<totem rrp_mode="none" secauth="off"/>
<rm>
<resources>
<script file="/etc/init.d/drbd" name="drbd"/>
<script file="/etc/init.d/clvmd" name="clvmd"/>
<script file="/etc/init.d/gfs2" name="gfs2"/>
<script file="/etc/init.d/libvirtd" name="libvirtd"/>
</resources>
<failoverdomains>
<failoverdomain name="only_an01" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-c05n01.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="only_an02" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-c05n02.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="primary_an01" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-c05n01.alteeve.ca" priority="1"/>
<failoverdomainnode name="an-c05n02.alteeve.ca" priority="2"/>
</failoverdomain>
<failoverdomain name="primary_an02" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-c05n01.alteeve.ca" priority="2"/>
<failoverdomainnode name="an-c05n02.alteeve.ca" priority="1"/>
</failoverdomain>
</failoverdomains>
<service autostart="1" domain="only_an01" exclusive="0" name="storage_an01" recovery="restart">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd"/>
</script>
</script>
</script>
</service>
<service autostart="1" domain="only_an02" exclusive="0" name="storage_an02" recovery="restart">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd"/>
</script>
</script>
</script>
</service>
</rm>
</cluster>

Let's validate it now, but we won't bother to push it out just yet.

ccs_config_validate

Configuration validates

Good, now to create the new VM services!

Making Our VMs Clustered Services

The final piece of the puzzle, and the whole purpose of this exercise, is in sight! There is a special service type in rgmanager for virtual machines which uses the vm: prefix. We will need to create four of these services; one for each of the virtual machines.
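Because each vm: service locates its VM by looking for "<name>.xml" under the service's path attribute, a quick pre-flight check that every definition file exists can save a failed service start later. This is an optional sketch, not part of the original procedure; it uses a temporary directory standing in for /shared/definitions/ so it can be dry-run anywhere, and the missing vm03-db.xml is deliberate to show the failure case.

```shell
# Sketch: verify each planned vm: service has a matching definition file.
# '/tmp/defs.check' stands in for /shared/definitions/; vm03-db.xml is
# deliberately left out to demonstrate the failure case.
defs=/tmp/defs.check
mkdir -p "$defs"
touch "$defs/vm01-dev.xml" "$defs/vm02-web.xml"

for vm in vm01-dev vm02-web vm03-db; do
    if [ -e "$defs/$vm.xml" ]; then
        echo "$vm: definition found"
    else
        echo "$vm: definition MISSING"
    fi
done
```

In practice, you would point defs at /shared/definitions/ and list the four real VM names; any "MISSING" line means the service would fail to start on that node.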
Creating The vm: Services

We'll create four new services, one for each VM. These are simple, single-element entries. Let's increment the version to 12 and take a look at the new entries.

<rm>
...
<vm name="vm01-dev" domain="primary_an01" path="/shared/definitions/" autostart="0"
exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm02-web" domain="primary_an01" path="/shared/definitions/" autostart="0"
exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm03-db" domain="primary_an02" path="/shared/definitions/" autostart="0"
exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm04-ms" domain="primary_an02" path="/shared/definitions/" autostart="0"
exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
</rm>

Let's look at each of the attributes now;
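As a brief recap, based on how the rgmanager vm resource agent behaves (treat this as a hedged summary, not authoritative documentation), here is one of the new entries with each attribute annotated:

```xml
<!-- name:       the VM's name; its definition is read from '<name>.xml'
                 under the directory given by 'path'.                      -->
<!-- domain:     the fail-over domain, setting the preferred host node.    -->
<!-- path:       colon-delimited list of directories searched for the
                 definition file; here, our GFS2-backed /shared store.     -->
<!-- autostart:  "0" means rgmanager will not start the VM on its own
                 when the cluster first forms; we will enable it manually. -->
<!-- exclusive:  "0" allows other services to run on the same node.        -->
<!-- recovery:   "restart" restarts the VM on the same node on failure.    -->
<!-- max_restarts and restart_expire_time: after 2 restarts within 600
                 seconds, relocate the VM to the other node instead.       -->
<vm name="vm01-dev" domain="primary_an01" path="/shared/definitions/" autostart="0"
    exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
```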
So let's take a look at the final, complete cluster.conf; <?xml version="1.0"?>
<cluster config_version="12" name="an-cluster-A">
<cman expected_votes="1" two_node="1"/>
<clusternodes>
<clusternode name="an-c05n01.alteeve.ca" nodeid="1">
<fence>
<method name="ipmi">
<device action="reboot" name="ipmi_an01"/>
</method>
<method name="pdu">
<device action="reboot" name="pdu2" port="1"/>
</method>
</fence>
</clusternode>
<clusternode name="an-c05n02.alteeve.ca" nodeid="2">
<fence>
<method name="ipmi">
<device action="reboot" name="ipmi_an02"/>
</method>
<method name="pdu">
<device action="reboot" name="pdu2" port="2"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice agent="fence_ipmilan" ipaddr="an-c05n01.ipmi" login="root" name="ipmi_an01" passwd="secret"/>
<fencedevice agent="fence_ipmilan" ipaddr="an-c05n02.ipmi" login="root" name="ipmi_an02" passwd="secret"/>
<fencedevice agent="fence_apc_snmp" ipaddr="pdu2.alteeve.ca" name="pdu2"/>
</fencedevices>
<fence_daemon post_join_delay="30"/>
<totem rrp_mode="none" secauth="off"/>
<rm>
<resources>
<script file="/etc/init.d/drbd" name="drbd"/>
<script file="/etc/init.d/clvmd" name="clvmd"/>
<script file="/etc/init.d/gfs2" name="gfs2"/>
<script file="/etc/init.d/libvirtd" name="libvirtd"/>
</resources>
<failoverdomains>
<failoverdomain name="only_an01" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-c05n01.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="only_an02" nofailback="1" ordered="0" restricted="1">
<failoverdomainnode name="an-c05n02.alteeve.ca"/>
</failoverdomain>
<failoverdomain name="primary_an01" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-c05n01.alteeve.ca" priority="1"/>
<failoverdomainnode name="an-c05n02.alteeve.ca" priority="2"/>
</failoverdomain>
<failoverdomain name="primary_an02" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="an-c05n01.alteeve.ca" priority="2"/>
<failoverdomainnode name="an-c05n02.alteeve.ca" priority="1"/>
</failoverdomain>
</failoverdomains>
<service autostart="1" domain="only_an01" exclusive="0" name="storage_an01" recovery="restart">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd"/>
</script>
</script>
</script>
</service>
<service autostart="1" domain="only_an02" exclusive="0" name="storage_an02" recovery="restart">
<script ref="drbd">
<script ref="clvmd">
<script ref="gfs2">
<script ref="libvirtd"/>
</script>
</script>
</script>
</service>
<vm name="vm01-dev" domain="primary_an01" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm02-web" domain="primary_an01" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm03-db" domain="primary_an02" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
<vm name="vm04-ms" domain="primary_an02" path="/shared/definitions/" autostart="0" exclusive="0" recovery="restart" max_restarts="2" restart_expire_time="600"/>
</rm>
</cluster>

Let's validate one more time.

ccs_config_validate

Configuration validates

She's a beaut', eh?

Making The VM Services Active

Before we push the last cluster.conf out, let's take a look at the current state of affairs. On an-c05n01;

clustat

Cluster Status for an-cluster-A @ Tue Dec 27 14:06:38 2011
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started virsh list --all Id Name State
----------------------------------
2 vm01-dev running
4 vm02-web running
- vm03-db shut off
- vm04-ms shut off On an-c05n02; clustat Cluster Status for an-cluster-A @ Tue Dec 27 14:07:32 2011
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, rgmanager
an-c05n02.alteeve.ca 2 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started virsh list --all Id Name State
----------------------------------
2 vm03-db running
4 vm04-ms running
- vm01-dev shut off
- vm02-web shut off

So we can see that the cluster doesn't know about the VMs yet, as we've not yet pushed out the changes. We can also see that vm01-dev and vm02-web are currently running on an-c05n01, and vm03-db and vm04-ms are running on an-c05n02. So let's push out the new configuration and see what happens!

cman_tool version -r
cman_tool version 6.2.0 config 12

Let's take a look at what showed up in syslog;

Dec 27 14:18:20 an-c05n01 modcluster: Updating cluster.conf
Dec 27 14:18:20 an-c05n01 corosync[2362]: [QUORUM] Members[2]: 1 2
Dec 27 14:18:20 an-c05n01 rgmanager[2579]: Reconfiguring
Dec 27 14:18:22 an-c05n01 rgmanager[2579]: Initializing vm:vm01-dev
Dec 27 14:18:22 an-c05n01 rgmanager[2579]: vm:vm01-dev was added to the config, but I am not initializing it.
Dec 27 14:18:22 an-c05n01 rgmanager[2579]: Initializing vm:vm02-web
Dec 27 14:18:22 an-c05n01 rgmanager[2579]: vm:vm02-web was added to the config, but I am not initializing it.
Dec 27 14:18:22 an-c05n01 rgmanager[2579]: Initializing vm:vm03-db
Dec 27 14:18:22 an-c05n01 rgmanager[2579]: vm:vm03-db was added to the config, but I am not initializing it.
Dec 27 14:18:23 an-c05n01 rgmanager[2579]: Initializing vm:vm04-ms
Dec 27 14:18:23 an-c05n01 rgmanager[2579]: vm:vm04-ms was added to the config, but I am not initializing it.

Indeed, if we check again with clustat, we'll see the new VM services, but all four will show as disabled, despite the VMs themselves being up and running.

clustat

Cluster Status for an-cluster-A @ Tue Dec 27 14:20:10 2011
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev (none) disabled
vm:vm02-web (none) disabled
vm:vm03-db (none) disabled
vm:vm04-ms (none) disabled

This highlights how the state of the VMs is not intrinsically tied to the cluster's status. The VMs were started outside of the cluster, so the cluster thinks they are off-line. We know they're running, though, so we can tell the cluster to enable them now. Note that the VMs will not be rebooted or in any way affected, provided you tell the cluster to enable each VM on the node it's currently running on. Let's start by enabling vm01-dev, which we know is running on an-c05n01. Be aware that the vm: prefix is required when using clusvcadm!

clusvcadm -e vm:vm01-dev -m an-c05n01.alteeve.ca

vm:vm01-dev is now running on an-c05n01.alteeve.ca

Now we can see that the VM is under the cluster's control!

clustat

Cluster Status for an-cluster-A @ Tue Dec 27 14:25:08 2011
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web (none) disabled
vm:vm03-db (none) disabled
vm:vm04-ms (none) disabled

Perfect! Now to add the other three VMs. Note that all of these commands can be run from whichever node you wish, because we're specifying the target node with the "member" switch.

clusvcadm -e vm:vm02-web -m an-c05n01.alteeve.ca

vm:vm02-web is now running on an-c05n01.alteeve.ca

clusvcadm -e vm:vm03-db -m an-c05n02.alteeve.ca

vm:vm03-db is now running on an-c05n02.alteeve.ca

clusvcadm -e vm:vm04-ms -m an-c05n02.alteeve.ca

vm:vm04-ms is now running on an-c05n02.alteeve.ca

Let's do a final check of the cluster's status;

clustat

Cluster Status for an-cluster-A @ Tue Dec 27 14:28:19 2011
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started
The Last Step - Automatic Cluster Start
The last step is to enable automatic starting of the cman and rgmanager services when the host node boots. This is quite simple. On both nodes, run; chkconfig cman on && chkconfig rgmanager on
chkconfig --list | grep -e cman -e rgmanager cman 0:off 1:off 2:on 3:on 4:on 5:on 6:off
rgmanager 0:off 1:off 2:on 3:on 4:on 5:on 6:off The next time you restart the nodes, you will be able to run clustat and you should find your cluster up and running!
We're Done! Or, Are We?
That's it, ladies and gentlemen. Our cluster is completed! In theory now, any failure in the cluster will result in no lost data and, at worst, no more than a minute or two of downtime. "In theory" just isn't good enough in clustering though. Time to take "theory" and make it a tested, known fact.
Testing; Taking Theory And Putting It Into Practice
You may have thought that we were done. Indeed, the cluster has been built, but we don't know if things actually work. Enter testing. In practice, when preparing production clusters for deployment, you should plan to spend at least twice as long in testing as you did in building the cluster. You need to imagine all failure scenarios, trigger those failures and see what happens.
A Note On The Importance Of Fencing
It may be tempting to think that you were careful and don't really need to test your cluster thoroughly. You are wrong. Barring you being absolutely obsessive with testing every step of the way, you will almost certainly make mistakes. Now I make no claims to genius, but I do like to think I am pretty comfortable building 2-node clusters. Despite that, while writing this testing portion of the tutorial, I found the following problems with my cluster;
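If you manage more than one pair of nodes, the chkconfig output above can be checked from a script rather than by eye. A minimal sketch, assuming the standard chkconfig --list line format; the runlevels_ok helper is ours, not part of any cluster tool:

```shell
# Hypothetical helper: confirm a 'chkconfig --list' line shows the service
# enabled in the normal multi-user runlevels (3 and 5).
runlevels_ok() {
    # $1 = one line of 'chkconfig --list' output
    echo "$1" | grep -q '3:on' && echo "$1" | grep -q '5:on'
}

line="cman            0:off 1:off 2:on 3:on 4:on 5:on 6:off"
if runlevels_ok "$line"; then
    echo "cman will start at boot"
else
    echo "cman is NOT set to start at boot"
fi
```

The same check can be run over ssh against both nodes before rebooting either one.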
You simply can't make assumptions. Test your cluster in every failure mode you can imagine. Until you do, you won't know what you might have missed!
Controlled VM Migration And Node Withdrawal
This testing will ensure that live migration works in both directions, and that each node can be cleanly removed from and then rejoin the cluster. The test will consist of the following steps;
- Live migrate vm01-dev and vm02-web from an-c05n01 to an-c05n02.
- Withdraw an-c05n01 from the cluster, then rejoin it.
- Live migrate vm01-dev and vm02-web back to an-c05n01.
- Live migrate vm03-db and vm04-ms from an-c05n02 to an-c05n01.
- Withdraw an-c05n02 from the cluster, then rejoin it.
- Live migrate vm03-db and vm04-ms back to an-c05n02.
With all of these tests completed, we will be able to ensure that orderly, controlled migration of VM services works as expected.
Live Migration - vm01-dev And vm02-web To an-c05n02
First up, we will use the special clusvcadm switch -M, which tells the cluster to use "live migration". That is, the VM will move to the target member without shutting down. Users of the VM should notice, at worst, a brief network interruption when the cut-over occurs, without any adverse effect on their services or dropped connections. Let's take a quick look at the state of affairs; On an-c05n02, run; clustat Cluster Status for an-cluster-A @ Sat Dec 31 13:49:41 2011
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, rgmanager
an-c05n02.alteeve.ca 2 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started Let's start by live migrating vm01-dev. Before we do though, let's ssh into it and start a ping against a target on the internet. We'll leave this running throughout the live migration. On vm01-dev; Now back on an-c05n01, let's migrate vm01-dev over to an-c05n02. This will take a little while as the VM's RAM gets copied across the BCN. clusvcadm -M vm:vm01-dev -m an-c05n02.alteeve.ca Trying to migrate vm:vm01-dev to an-c05n02.alteeve.ca...Success Once complete, check the new status of clustat; clustat Cluster Status for an-cluster-A @ Sat Dec 31 14:11:43 2011
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, rgmanager
an-c05n02.alteeve.ca 2 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n02.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started If we look again at vm01-dev's ping, we'll see that a few packets were dropped but our ssh session remained intact. Any other active TCP session should have survived this just fine as well. Wonderful! Now let's live migrate vm02-web to an-c05n02. clusvcadm -M vm:vm02-web -m an-c05n02.alteeve.ca Trying to migrate vm:vm02-web to an-c05n02.alteeve.ca...Success Again, check the new status of clustat; clustat Cluster Status for an-cluster-A @ Sat Dec 31 14:17:35 2011
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, rgmanager
an-c05n02.alteeve.ca 2 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n02.alteeve.ca started
vm:vm02-web an-c05n02.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started We can see now that all four VMs are running on an-c05n02! This is possible because of our careful planning of the VM resources earlier. This will mean more load on the host node's CPU, so things might not be as fast as we would like, but all services are on-line!
Withdraw an-c05n01 From The Cluster
So imagine now that we need to do some work on an-c05n01, like replace a bad network card or add some RAM. We've moved the VMs off, so now the only remaining service is service:storage_an01. We don't want to manually disable this service, because if we did, the service would not automatically start when the node rejoined the cluster. So we're going to just stop rgmanager and let it disable the storage_an01 service. Check the state of the cluster; clustat Cluster Status for an-cluster-A @ Sun Jan 1 16:11:56 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n02.alteeve.ca started
vm:vm02-web an-c05n02.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started Just as we expect, so now we will stop rgmanager, then stop cman. On an-c05n01; /etc/init.d/rgmanager stop Stopping Cluster Service Manager: [ OK ] /etc/init.d/cman stop Stopping cluster:
Leaving fence domain... [ OK ]
Stopping gfs_controld... [ OK ]
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... [ OK ]
Waiting for corosync to shutdown: [ OK ]
Unloading kernel modules... [ OK ]
Unmounting configfs... [ OK ] Checking on an-c05n02, we can see that all four VMs are running fine and that an-c05n01 is gone. clustat Cluster Status for an-cluster-A @ Sun Jan 1 16:13:23 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Offline
an-c05n02.alteeve.ca 2 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 (an-c05n01.alteeve.ca) stopped
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n02.alteeve.ca started
vm:vm02-web an-c05n02.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started Test passed! You can now power off and restart an-c05n01.
Rejoining an-c05n01 To The Cluster
If you haven't already, reboot an-c05n01. As we configured earlier, cman and rgmanager will start automatically. The easiest thing to do for this test is to watch clustat on an-c05n02. If all goes well, you should see an-c05n01 rejoin the cluster automatically. Connected to cluster; Storage coming on-line; Back in business! You should be able to log back into an-c05n01 and see that everything is back on-line. DRBD should be UpToDate, or be in the process of synchronizing.
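Watching clustat by hand works fine, but the rejoin can also be detected from a script by parsing clustat's member table. A minimal sketch, assuming the member's line contains the word "Online" once it has rejoined; member_online is a hypothetical helper, and the sample text simply mimics the output shown above:

```shell
# Hypothetical helper: report whether a given member shows as Online
# in 'clustat' output.
member_online() {
    # $1 = clustat output, $2 = member name
    echo "$1" | grep "^ *$2 " | grep -q 'Online'
}

# Sample text mimicking the clustat member table above.
clustat_out="$(cat <<'EOF'
 Member Name                ID   Status
 ------ ----                ---- ------
 an-c05n01.alteeve.ca       1    Online, rgmanager
 an-c05n02.alteeve.ca       2    Online, Local, rgmanager
EOF
)"

if member_online "$clustat_out" "an-c05n01.alteeve.ca"; then
    echo "an-c05n01 has rejoined"
fi
```

In real use you would feed it live output, for example: until member_online "$(clustat)" "an-c05n01.alteeve.ca"; do sleep 2; done
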
Migrating vm01-dev And vm02-web Back To an-c05n01
If we were putting the cluster back into its normal state, all that would be left to do is to migrate an-c05n01's VMs back. So let's do that. As always, start with a check of the current cluster status. clustat Cluster Status for an-cluster-A @ Sun Jan 1 16:31:06 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n02.alteeve.ca started
vm:vm02-web an-c05n02.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started Now confirm that the underlying storage is ready. Remember that DRBD resource r1 backs the an01-vg0 volume group used by these VMs. cat /proc/drbd version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by dag@Build64R6, 2011-11-20 10:57:03
0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:0 dw:0 dr:12552 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:2428 dw:2428 dr:9776 al:0 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
2: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:510 dw:510 dr:9744 al:0 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0 All systems ready; Let's migrate vm01-dev and vm02-web now. clusvcadm -M vm:vm01-dev -m an-c05n01.alteeve.ca Trying to migrate vm:vm01-dev to an-c05n01.alteeve.ca...Success clusvcadm -M vm:vm02-web -m an-c05n01.alteeve.ca Trying to migrate vm:vm02-web to an-c05n01.alteeve.ca...Success Check the new status; clustat Cluster Status for an-cluster-A @ Sun Jan 1 16:32:11 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started With that, the cluster is back in business!
Live Migration - vm03-db And vm04-ms To an-c05n01
Let's start the process of taking an-c05n02 out of the cluster. The first step is to move vm03-db and vm04-ms over to an-c05n01. clustat Cluster Status for an-cluster-A @ Sun Jan 1 16:42:10 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started Ready to migrate. clusvcadm -M vm:vm03-db -m an-c05n01.alteeve.ca Trying to migrate vm:vm03-db to an-c05n01.alteeve.ca...Success clusvcadm -M vm:vm04-ms -m an-c05n01.alteeve.ca Trying to migrate vm:vm04-ms to an-c05n01.alteeve.ca...Success Confirm; clustat Cluster Status for an-cluster-A @ Sun Jan 1 16:42:42 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n01.alteeve.ca started
vm:vm04-ms an-c05n01.alteeve.ca started Done!
Withdraw an-c05n02 From The Cluster
Double-check that all the VMs are off of an-c05n02 prior to withdrawal. clustat Cluster Status for an-cluster-A @ Sun Jan 1 16:45:30 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n01.alteeve.ca started
vm:vm04-ms an-c05n01.alteeve.ca started As before, we will not disable the storage_an02 service. If we did, the service would not automatically restart when the node rejoined the cluster. Now that an-c05n01 is hosting all of the VMs and running independently, we can stop rgmanager and cman. On an-c05n02; /etc/init.d/rgmanager stop Stopping Cluster Service Manager: [ OK ] /etc/init.d/cman stop Stopping cluster:
Leaving fence domain... [ OK ]
Stopping gfs_controld... [ OK ]
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... [ OK ]
Waiting for corosync to shutdown: [ OK ]
Unloading kernel modules... [ OK ]
Unmounting configfs... [ OK ] Confirm; clustat Cluster Status for an-cluster-A @ Sun Jan 1 16:49:14 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Offline
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 (an-c05n02.alteeve.ca) stopped
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n01.alteeve.ca started
vm:vm04-ms an-c05n01.alteeve.ca started Done! We can now shut down and reboot an-c05n02 entirely.
Rejoining an-c05n02 To The Cluster
Exactly as we did with an-c05n01, we will reboot an-c05n02. The cman and rgmanager services should start automatically, so once again, we will just watch clustat on an-c05n01. If all goes well, you should see an-c05n02 rejoin the cluster automatically. Connected to cluster; Storage coming on-line; Back in business! You should be able to log back into an-c05n02 and see that everything is back on-line. DRBD should be UpToDate, or be in the process of synchronizing.
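The "DRBD should be UpToDate" check before migrating VMs back can likewise be scripted by parsing /proc/drbd. A minimal sketch; drbd_ready is our own helper and assumes the resource lines look like the ones shown in this tutorial:

```shell
# Hypothetical helper: succeed only if every DRBD resource line shows
# Connected, Primary/Primary and UpToDate/UpToDate.
drbd_ready() {
    # $1 = contents of /proc/drbd
    ! echo "$1" | grep ' cs:' | grep -v -q 'cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate'
}

# Sample text with one resource still synchronizing.
drbd_sample="$(cat <<'EOF'
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
 1: cs:SyncTarget ro:Primary/Primary ds:Inconsistent/UpToDate C r-----
EOF
)"

if drbd_ready "$drbd_sample"; then
    echo "ready to migrate"
else
    echo "still synchronizing"
fi
```

In real use, you could wait for synchronization with: until drbd_ready "$(cat /proc/drbd)"; do sleep 5; done
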
Migrating vm03-db And vm04-ms Back To an-c05n02
The last step to restore the cluster to its ideal state is to migrate vm03-db and vm04-ms back to an-c05n02. As always, start with a check of the current cluster status. clustat Cluster Status for an-cluster-A @ Sun Jan 1 16:57:19 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, rgmanager
an-c05n02.alteeve.ca 2 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n01.alteeve.ca started
vm:vm04-ms an-c05n01.alteeve.ca started Now confirm that the underlying storage is ready. Remember that DRBD resource r2 backs the an02-vg0 volume group used by these VMs. cat /proc/drbd version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by dag@Build64R6, 2011-11-20 10:57:03
0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:0 dw:0 dr:8788 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:376 dw:376 dr:5876 al:0 bm:7 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
2: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:671 dw:671 dr:5844 al:0 bm:16 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0 All systems ready; Let's migrate vm03-db and vm04-ms now. clusvcadm -M vm:vm03-db -m an-c05n02.alteeve.ca Trying to migrate vm:vm03-db to an-c05n02.alteeve.ca...Success clusvcadm -M vm:vm04-ms -m an-c05n02.alteeve.ca Trying to migrate vm:vm04-ms to an-c05n02.alteeve.ca...Success Check the new status; clustat Cluster Status for an-cluster-A @ Sun Jan 1 16:59:22 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, rgmanager
an-c05n02.alteeve.ca 2 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started All controlled migration, withdrawal and re-joining tests completed!
Uncontrolled VM Migration and Node Failure
This test will be more violent than the previous tests. Here we will test failing the VMs and ensure that the cluster recovers them by restarting them on their host. We will fail each VM three times within ten minutes to ensure that the relocate policy kicks in, as we expect it to. Once we complete the VM failure testing, we will fail and recover both nodes, one at a time of course, and rejoin them to the cluster. This will confirm that the VMs recover on the surviving node. The tests will be;
- Crash each VM three times; the first two failures should restart locally, and the third should relocate the VM to the peer node.
- Crash each node in turn and confirm that its VMs recover on the surviving node.
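During the failure tests below, a running ping is used as a crude timer for how long a VM was down. That measurement can be automated by scanning a saved ping log for the largest jump in icmp_seq; at roughly one ping per second, the missed pings approximate the seconds of downtime. A minimal sketch (ping_gap is a hypothetical helper, not a cluster tool):

```shell
# Hypothetical helper: print the largest run of missing icmp_seq numbers
# in ping output read from stdin.
ping_gap() {
    awk -F'icmp_seq=' '/icmp_seq=/ {
        split($2, a, " "); seq = a[1]
        if (prev && seq - prev - 1 > gap) gap = seq - prev - 1
        prev = seq
    } END { print gap + 0 }'
}

# Sample lines mimicking the ping output in this section: seq jumps 16 -> 52.
printf '%s\n' \
  "64 bytes from 10.254.0.1: icmp_seq=15 ttl=64 time=0.441 ms" \
  "64 bytes from 10.254.0.1: icmp_seq=16 ttl=64 time=0.552 ms" \
  "64 bytes from 10.254.0.1: icmp_seq=52 ttl=64 time=0.816 ms" | ping_gap
```

In practice you would capture the ping with something like ping 10.254.0.1 | tee ping.log and then run ping_gap < ping.log after the VM recovers.
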
Failure Testing vm01-dev
Confirm that vm01-dev is running on an-c05n01. clustat Cluster Status for an-cluster-A @ Sun Jan 1 18:29:10 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started It is; perfect. Now, before I kill a VM, I like to start a ping against it. It acts both as an indication of when the VM is back up and as a crude method of timing how long it took the VM to fully recover.
ping 10.254.0.1 PING 10.254.0.1 (10.254.0.1) 56(84) bytes of data.
64 bytes from 10.254.0.1: icmp_seq=1 ttl=64 time=0.737 ms
64 bytes from 10.254.0.1: icmp_seq=2 ttl=64 time=0.530 ms
64 bytes from 10.254.0.1: icmp_seq=3 ttl=64 time=0.589 ms Now, on an-c05n01, forcefully shut down vm01-dev; virsh destroy vm01-dev Domain vm01-dev destroyed Within a few seconds (10, maximum), the cluster will detect that the VM has failed and will restart it. We can see in an-c05n01's syslog that the failure was detected and automatically recovered. Jan 1 18:38:25 an-c05n01 kernel: vbr2: port 2(vnet0) entering disabled state
Jan 1 18:38:25 an-c05n01 kernel: device vnet0 left promiscuous mode
Jan 1 18:38:25 an-c05n01 kernel: vbr2: port 2(vnet0) entering disabled state
Jan 1 18:38:27 an-c05n01 ntpd[2190]: Deleting interface #19 vnet0, fe80::fc54:ff:fe9b:3cf7#123, interface stats: received=0, sent=0, dropped=0, active_time=3058 secs
Jan 1 18:38:35 an-c05n01 rgmanager[2430]: status on vm "vm01-dev" returned 7 (unspecified)
Jan 1 18:38:35 an-c05n01 rgmanager[2430]: Stopping service vm:vm01-dev
Jan 1 18:38:36 an-c05n01 rgmanager[2430]: Service vm:vm01-dev is recovering
Jan 1 18:38:36 an-c05n01 rgmanager[2430]: Recovering failed service vm:vm01-dev
Jan 1 18:38:37 an-c05n01 kernel: device vnet0 entered promiscuous mode
Jan 1 18:38:37 an-c05n01 kernel: vbr2: port 2(vnet0) entering learning state
Jan 1 18:38:37 an-c05n01 rgmanager[2430]: Service vm:vm01-dev started
Jan 1 18:38:39 an-c05n01 ntpd[2190]: Listening on interface #20 vnet0, fe80::fc54:ff:fe9b:3cf7#123 Enabled
Jan 1 18:38:49 an-c05n01 kernel: kvm: 12390: cpu0 unimplemented perfctr wrmsr: 0xc1 data 0xabcd
Jan 1 18:38:52 an-c05n01 kernel: vbr2: port 2(vnet0) entering forwarding state The first four entries are related to the VM's network being torn down after it was killed. The fifth through eighth lines show the detection and recovery of the VM! Going back to the ping, we can see that the VM was down for roughly 36 seconds (the time between network loss and recovery; add a bit more time for all services to start). PING 10.254.0.1 (10.254.0.1) 56(84) bytes of data.
64 bytes from 10.254.0.1: icmp_seq=1 ttl=64 time=0.737 ms
64 bytes from 10.254.0.1: icmp_seq=2 ttl=64 time=0.530 ms
64 bytes from 10.254.0.1: icmp_seq=3 ttl=64 time=0.589 ms
64 bytes from 10.254.0.1: icmp_seq=4 ttl=64 time=0.589 ms
64 bytes from 10.254.0.1: icmp_seq=5 ttl=64 time=0.477 ms
64 bytes from 10.254.0.1: icmp_seq=6 ttl=64 time=0.482 ms
64 bytes from 10.254.0.1: icmp_seq=7 ttl=64 time=0.489 ms
64 bytes from 10.254.0.1: icmp_seq=8 ttl=64 time=0.495 ms
64 bytes from 10.254.0.1: icmp_seq=9 ttl=64 time=0.503 ms
64 bytes from 10.254.0.1: icmp_seq=10 ttl=64 time=0.513 ms
64 bytes from 10.254.0.1: icmp_seq=11 ttl=64 time=0.516 ms
64 bytes from 10.254.0.1: icmp_seq=12 ttl=64 time=0.524 ms
64 bytes from 10.254.0.1: icmp_seq=13 ttl=64 time=0.405 ms
64 bytes from 10.254.0.1: icmp_seq=14 ttl=64 time=0.536 ms
64 bytes from 10.254.0.1: icmp_seq=15 ttl=64 time=0.441 ms
64 bytes from 10.254.0.1: icmp_seq=16 ttl=64 time=0.552 ms
# VM died here, 36 pings lost at ~1 ping/sec.
64 bytes from 10.254.0.1: icmp_seq=52 ttl=64 time=0.816 ms
64 bytes from 10.254.0.1: icmp_seq=53 ttl=64 time=0.440 ms
64 bytes from 10.254.0.1: icmp_seq=54 ttl=64 time=0.354 ms
64 bytes from 10.254.0.1: icmp_seq=55 ttl=64 time=0.342 ms
64 bytes from 10.254.0.1: icmp_seq=56 ttl=64 time=0.446 ms
64 bytes from 10.254.0.1: icmp_seq=57 ttl=64 time=0.418 ms
64 bytes from 10.254.0.1: icmp_seq=58 ttl=64 time=0.441 ms
^C
--- 10.254.0.1 ping statistics ---
58 packets transmitted, 23 received, 60% packet loss, time 57949ms
rtt min/avg/max/mdev = 0.342/0.505/0.816/0.109 ms Not bad at all! Now let's kill it two more times and confirm that the third recovery happens on an-c05n02. We'll use the ping as an indicator of when the VM is back on-line before killing it the third time. Second failure; virsh destroy vm01-dev Domain vm01-dev destroyed Checking syslog again; Jan 1 18:45:07 an-c05n01 kernel: vbr2: port 2(vnet0) entering disabled state
Jan 1 18:45:07 an-c05n01 kernel: device vnet0 left promiscuous mode
Jan 1 18:45:07 an-c05n01 kernel: vbr2: port 2(vnet0) entering disabled state
Jan 1 18:45:09 an-c05n01 ntpd[2190]: Deleting interface #20 vnet0, fe80::fc54:ff:fe9b:3cf7#123, interface stats: received=0, sent=0, dropped=0, active_time=390 secs
Jan 1 18:45:46 an-c05n01 rgmanager[2430]: status on vm "vm01-dev" returned 7 (unspecified)
Jan 1 18:45:46 an-c05n01 rgmanager[2430]: Stopping service vm:vm01-dev
Jan 1 18:45:46 an-c05n01 rgmanager[2430]: Service vm:vm01-dev is recovering
Jan 1 18:45:47 an-c05n01 rgmanager[2430]: Recovering failed service vm:vm01-dev
Jan 1 18:45:47 an-c05n01 kernel: device vnet0 entered promiscuous mode
Jan 1 18:45:47 an-c05n01 kernel: vbr2: port 2(vnet0) entering learning state
Jan 1 18:45:47 an-c05n01 rgmanager[2430]: Service vm:vm01-dev started
Jan 1 18:45:50 an-c05n01 ntpd[2190]: Listening on interface #21 vnet0, fe80::fc54:ff:fe9b:3cf7#123 Enabled
Jan 1 18:45:59 an-c05n01 kernel: kvm: 17874: cpu0 unimplemented perfctr wrmsr: 0xc1 data 0xabcd
Jan 1 18:46:02 an-c05n01 kernel: vbr2: port 2(vnet0) entering forwarding state We can see that the vm01-dev VM is still on an-c05n01; clustat Cluster Status for an-cluster-A @ Sun Jan 1 18:47:01 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, rgmanager
an-c05n02.alteeve.ca 2 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started Now the third crash. This time it should come up on an-c05n02. virsh destroy vm01-dev Domain vm01-dev destroyed Checking an-c05n01's syslog again, we'll see something different. Jan 1 18:47:26 an-c05n01 kernel: vbr2: port 2(vnet0) entering disabled state
Jan 1 18:47:26 an-c05n01 kernel: device vnet0 left promiscuous mode
Jan 1 18:47:26 an-c05n01 kernel: vbr2: port 2(vnet0) entering disabled state
Jan 1 18:47:27 an-c05n01 ntpd[2190]: Deleting interface #21 vnet0, fe80::fc54:ff:fe9b:3cf7#123, interface stats: received=0, sent=0, dropped=0, active_time=97 secs
Jan 1 18:47:46 an-c05n01 rgmanager[2430]: status on vm "vm01-dev" returned 7 (unspecified)
Jan 1 18:47:46 an-c05n01 rgmanager[2430]: Stopping service vm:vm01-dev
Jan 1 18:47:46 an-c05n01 rgmanager[2430]: Service vm:vm01-dev is recovering
Jan 1 18:47:46 an-c05n01 rgmanager[2430]: Restart threshold for vm:vm01-dev exceeded; attempting to relocate
Jan 1 18:47:47 an-c05n01 rgmanager[2430]: Service vm:vm01-dev is now running on member 2 The difference is the "Restart threshold for vm:vm01-dev exceeded; attempting to relocate" line. Indeed, if we check clustat, we will in fact see it running on an-c05n02! clustat Cluster Status for an-cluster-A @ Sun Jan 1 18:49:38 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, rgmanager
an-c05n02.alteeve.ca 2 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n02.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started Success! This test is complete, so we'll finish by migrating the VM back to an-c05n01. clusvcadm -M vm:vm01-dev -m an-c05n01.alteeve.ca Trying to migrate vm:vm01-dev to an-c05n01.alteeve.ca...Success As always, confirm. clustat Cluster Status for an-cluster-A @ Sun Jan 1 18:51:05 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, rgmanager
an-c05n02.alteeve.ca 2 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started Excellent.
Failure Testing vm02-web
We'll go through the same process here as we did with vm01-dev, but in less detail. After each crash of the VM, we'll check clustat and look at the syslog on an-c05n01. Not shown here is a background ping running to indicate when the VM is back up enough to crash again. Confirm that vm02-web is on an-c05n01. clustat Cluster Status for an-cluster-A @ Sun Jan 1 19:06:21 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, rgmanager
an-c05n02.alteeve.ca 2 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started Good, we're ready. On an-c05n01, kill the VM. virsh destroy vm02-web Domain vm02-web destroyed As we expect, an-c05n01 restarts the VM within a few seconds. Jan 1 19:07:16 an-c05n01 kernel: vbr2: port 3(vnet1) entering disabled state
Jan 1 19:07:16 an-c05n01 kernel: device vnet1 left promiscuous mode
Jan 1 19:07:16 an-c05n01 kernel: vbr2: port 3(vnet1) entering disabled state
Jan 1 19:07:18 an-c05n01 ntpd[2190]: Deleting interface #11 vnet1, fe80::fc54:ff:fe65:3960#123, interface stats: received=0, sent=0, dropped=0, active_time=9315 secs
Jan 1 19:07:27 an-c05n01 rgmanager[2430]: status on vm "vm02-web" returned 7 (unspecified)
Jan 1 19:07:27 an-c05n01 rgmanager[2430]: Stopping service vm:vm02-web
Jan 1 19:07:27 an-c05n01 rgmanager[2430]: Service vm:vm02-web is recovering
Jan 1 19:07:28 an-c05n01 rgmanager[2430]: Recovering failed service vm:vm02-web
Jan 1 19:07:28 an-c05n01 kernel: device vnet1 entered promiscuous mode
Jan 1 19:07:28 an-c05n01 kernel: vbr2: port 3(vnet1) entering learning state
Jan 1 19:07:29 an-c05n01 rgmanager[2430]: Service vm:vm02-web started
Jan 1 19:07:31 an-c05n01 ntpd[2190]: Listening on interface #23 vnet1, fe80::fc54:ff:fe65:3960#123 Enabled
Jan 1 19:07:38 an-c05n01 kernel: kvm: 1994: cpu0 unimplemented perfctr wrmsr: 0xc1 data 0xabcd
Jan 1 19:07:43 an-c05n01 kernel: vbr2: port 3(vnet1) entering forwarding state Checking clustat, I can see the VM is back on-line. clustat Cluster Status for an-cluster-A @ Sun Jan 1 19:09:03 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, rgmanager
an-c05n02.alteeve.ca 2 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started Let's kill it for the second time. virsh destroy vm02-web Domain vm02-web destroyed We can again see that an-c05n01 recovered it locally. Jan 1 19:12:08 an-c05n01 kernel: vbr2: port 3(vnet1) entering disabled state
Jan 1 19:12:08 an-c05n01 kernel: device vnet1 left promiscuous mode
Jan 1 19:12:08 an-c05n01 kernel: vbr2: port 3(vnet1) entering disabled state
Jan 1 19:12:10 an-c05n01 ntpd[2190]: Deleting interface #23 vnet1, fe80::fc54:ff:fe65:3960#123, interface stats: received=0, sent=0, dropped=0, active_time=279 secs
Jan 1 19:12:17 an-c05n01 rgmanager[2430]: status on vm "vm02-web" returned 7 (unspecified)
Jan 1 19:12:17 an-c05n01 rgmanager[2430]: Stopping service vm:vm02-web
Jan 1 19:12:18 an-c05n01 rgmanager[2430]: Service vm:vm02-web is recovering
Jan 1 19:12:18 an-c05n01 rgmanager[2430]: Recovering failed service vm:vm02-web
Jan 1 19:12:19 an-c05n01 kernel: device vnet1 entered promiscuous mode
Jan 1 19:12:19 an-c05n01 kernel: vbr2: port 3(vnet1) entering learning state
Jan 1 19:12:19 an-c05n01 rgmanager[2430]: Service vm:vm02-web started
Jan 1 19:12:22 an-c05n01 ntpd[2190]: Listening on interface #24 vnet1, fe80::fc54:ff:fe65:3960#123 Enabled
Jan 1 19:12:28 an-c05n01 kernel: kvm: 6113: cpu0 unimplemented perfctr wrmsr: 0xc1 data 0xabcd
Jan 1 19:12:34 an-c05n01 kernel: vbr2: port 3(vnet1) entering forwarding state Confirm with clustat; clustat Cluster Status for an-cluster-A @ Sun Jan 1 19:13:45 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, rgmanager
an-c05n02.alteeve.ca 2 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started This time, it should recover on an-c05n02; virsh destroy vm02-web Domain vm02-web destroyed Looking in syslog, we can see the counter was tripped. Jan 1 19:14:26 an-c05n01 kernel: vbr2: port 3(vnet1) entering disabled state
Jan 1 19:14:26 an-c05n01 kernel: device vnet1 left promiscuous mode
Jan 1 19:14:26 an-c05n01 kernel: vbr2: port 3(vnet1) entering disabled state
Jan 1 19:14:27 an-c05n01 rgmanager[2430]: status on vm "vm02-web" returned 7 (unspecified)
Jan 1 19:14:27 an-c05n01 rgmanager[2430]: Stopping service vm:vm02-web
Jan 1 19:14:28 an-c05n01 rgmanager[2430]: Service vm:vm02-web is recovering
Jan 1 19:14:28 an-c05n01 rgmanager[2430]: Restart threshold for vm:vm02-web exceeded; attempting to relocate
Jan 1 19:14:28 an-c05n01 ntpd[2190]: Deleting interface #24 vnet1, fe80::fc54:ff:fe65:3960#123, interface stats: received=0, sent=0, dropped=0, active_time=126 secs
Jan 1 19:14:29 an-c05n01 rgmanager[2430]: Service vm:vm02-web is now running on member 2

Indeed, this is confirmed with clustat.

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 19:15:57 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, rgmanager
an-c05n02.alteeve.ca 2 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n02.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started

Excellent, this test has passed as well! Now migrate the VM back and we'll be ready to test the third VM.

clusvcadm -M vm:vm02-web -m an-c05n01.alteeve.ca

Trying to migrate vm:vm02-web to an-c05n01.alteeve.ca...Success

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 19:17:41 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, rgmanager
an-c05n02.alteeve.ca 2 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started

Done.

Failure Testing vm03-db

This should be getting familiar now. The main difference is that the VM is now running on an-c05n02, so that is where we will kill the VM and where we will watch syslog. Confirm that vm03-db is on an-c05n02.

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 19:25:55 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started

Good, we're ready. On an-c05n02, kill the VM.

virsh destroy vm03-db

Domain vm03-db destroyed

As we expect, an-c05n02 restarts the VM within a few seconds.

Jan 1 19:26:21 an-c05n02 kernel: vbr2: port 2(vnet0) entering disabled state
Jan 1 19:26:21 an-c05n02 kernel: device vnet0 left promiscuous mode
Jan 1 19:26:21 an-c05n02 kernel: vbr2: port 2(vnet0) entering disabled state
Jan 1 19:26:22 an-c05n02 ntpd[2200]: Deleting interface #10 vnet0, fe80::fc54:ff:fe44:83ec#123, interface stats: received=0, sent=0, dropped=0, active_time=8863 secs
Jan 1 19:26:35 an-c05n02 rgmanager[2439]: status on vm "vm03-db" returned 7 (unspecified)
Jan 1 19:26:36 an-c05n02 rgmanager[2439]: Stopping service vm:vm03-db
Jan 1 19:26:36 an-c05n02 rgmanager[2439]: Service vm:vm03-db is recovering
Jan 1 19:26:36 an-c05n02 rgmanager[2439]: Recovering failed service vm:vm03-db
Jan 1 19:26:37 an-c05n02 kernel: device vnet0 entered promiscuous mode
Jan 1 19:26:37 an-c05n02 kernel: vbr2: port 2(vnet0) entering learning state
Jan 1 19:26:37 an-c05n02 rgmanager[2439]: Service vm:vm03-db started
Jan 1 19:26:40 an-c05n02 ntpd[2200]: Listening on interface #15 vnet0, fe80::fc54:ff:fe44:83ec#123 Enabled

Checking clustat, we can see the VM is back on-line.

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 19:27:06 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started

Let's kill it for the second time.

virsh destroy vm03-db

Domain vm03-db destroyed

We can again see that an-c05n02 recovered it locally.

Jan 1 19:27:40 an-c05n02 kernel: vbr2: port 2(vnet0) entering disabled state
Jan 1 19:27:40 an-c05n02 kernel: device vnet0 left promiscuous mode
Jan 1 19:27:40 an-c05n02 kernel: vbr2: port 2(vnet0) entering disabled state
Jan 1 19:27:41 an-c05n02 ntpd[2200]: Deleting interface #15 vnet0, fe80::fc54:ff:fe44:83ec#123, interface stats: received=0, sent=0, dropped=0, active_time=61 secs
Jan 1 19:27:45 an-c05n02 rgmanager[2439]: status on vm "vm03-db" returned 7 (unspecified)
Jan 1 19:27:46 an-c05n02 rgmanager[2439]: Stopping service vm:vm03-db
Jan 1 19:27:46 an-c05n02 rgmanager[2439]: Service vm:vm03-db is recovering
Jan 1 19:27:46 an-c05n02 rgmanager[2439]: Recovering failed service vm:vm03-db
Jan 1 19:27:47 an-c05n02 kernel: device vnet0 entered promiscuous mode
Jan 1 19:27:47 an-c05n02 kernel: vbr2: port 2(vnet0) entering learning state
Jan 1 19:27:47 an-c05n02 rgmanager[2439]: Service vm:vm03-db started
Jan 1 19:27:50 an-c05n02 ntpd[2200]: Listening on interface #16 vnet0, fe80::fc54:ff:fe44:83ec#123 Enabled

Confirm with clustat;

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 19:28:21 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started

This time, it should recover on an-c05n01;

virsh destroy vm03-db

Domain vm03-db destroyed

Looking in syslog, we can see the counter was tripped.

Jan 1 19:28:36 an-c05n02 kernel: vbr2: port 2(vnet0) entering disabled state
Jan 1 19:28:36 an-c05n02 kernel: device vnet0 left promiscuous mode
Jan 1 19:28:36 an-c05n02 kernel: vbr2: port 2(vnet0) entering disabled state
Jan 1 19:28:37 an-c05n02 ntpd[2200]: Deleting interface #16 vnet0, fe80::fc54:ff:fe44:83ec#123, interface stats: received=0, sent=0, dropped=0, active_time=47 secs
Jan 1 19:28:55 an-c05n02 rgmanager[2439]: status on vm "vm03-db" returned 7 (unspecified)
Jan 1 19:28:56 an-c05n02 rgmanager[2439]: Stopping service vm:vm03-db
Jan 1 19:28:56 an-c05n02 rgmanager[2439]: Service vm:vm03-db is recovering
Jan 1 19:28:56 an-c05n02 rgmanager[2439]: Restart threshold for vm:vm03-db exceeded; attempting to relocate
Jan 1 19:28:57 an-c05n02 rgmanager[2439]: Service vm:vm03-db is now running on member 1

Again, this is confirmed with clustat.

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 19:29:42 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n01.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started

This test has passed as well! As before, migrate the VM back and we'll be ready to test the last VM.

clusvcadm -M vm:vm03-db -m an-c05n02.alteeve.ca

Trying to migrate vm:vm03-db to an-c05n02.alteeve.ca...Success

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 19:30:32 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started

Done.

Failure Testing vm04-ms

This is the last VM to test. This testing is repetitive and boring, but it is also critical. Good on you for sticking it out. Right then, let's get to it. Confirm that vm04-ms is on an-c05n02.

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 19:43:41 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started

Good, we're ready. On an-c05n02, kill the VM.

virsh destroy vm04-ms

Domain vm04-ms destroyed

As we expect, an-c05n02 restarts the VM within a few seconds.

Jan 1 19:43:52 an-c05n02 kernel: vbr2: port 3(vnet1) entering disabled state
Jan 1 19:43:52 an-c05n02 kernel: device vnet1 left promiscuous mode
Jan 1 19:43:52 an-c05n02 kernel: vbr2: port 3(vnet1) entering disabled state
Jan 1 19:43:53 an-c05n02 ntpd[2200]: Deleting interface #11 vnet1, fe80::fc54:ff:fe5e:b147#123, interface stats: received=0, sent=0, dropped=0, active_time=9895 secs
Jan 1 19:44:06 an-c05n02 rgmanager[2439]: status on vm "vm04-ms" returned 7 (unspecified)
Jan 1 19:44:07 an-c05n02 rgmanager[2439]: Stopping service vm:vm04-ms
Jan 1 19:44:07 an-c05n02 rgmanager[2439]: Service vm:vm04-ms is recovering
Jan 1 19:44:07 an-c05n02 rgmanager[2439]: Recovering failed service vm:vm04-ms
Jan 1 19:44:08 an-c05n02 kernel: device vnet1 entered promiscuous mode
Jan 1 19:44:08 an-c05n02 kernel: vbr2: port 3(vnet1) entering learning state
Jan 1 19:44:08 an-c05n02 rgmanager[2439]: Service vm:vm04-ms started
Jan 1 19:44:11 an-c05n02 ntpd[2200]: Listening on interface #18 vnet1, fe80::fc54:ff:fe5e:b147#123 Enabled
Jan 1 19:44:23 an-c05n02 kernel: vbr2: port 3(vnet1) entering forwarding state

Checking clustat, we can see the VM is back on-line.

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 19:44:38 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started

Let's kill it for the second time.

virsh destroy vm04-ms

Domain vm04-ms destroyed

We can again see that an-c05n02 recovered it locally.

Jan 1 19:44:54 an-c05n02 kernel: vbr2: port 3(vnet1) entering disabled state
Jan 1 19:44:54 an-c05n02 kernel: device vnet1 left promiscuous mode
Jan 1 19:44:54 an-c05n02 kernel: vbr2: port 3(vnet1) entering disabled state
Jan 1 19:44:55 an-c05n02 ntpd[2200]: Deleting interface #18 vnet1, fe80::fc54:ff:fe5e:b147#123, interface stats: received=0, sent=0, dropped=0, active_time=44 secs
Jan 1 19:45:16 an-c05n02 rgmanager[2439]: status on vm "vm04-ms" returned 7 (unspecified)
Jan 1 19:45:17 an-c05n02 rgmanager[2439]: Stopping service vm:vm04-ms
Jan 1 19:45:17 an-c05n02 rgmanager[2439]: Service vm:vm04-ms is recovering
Jan 1 19:45:17 an-c05n02 rgmanager[2439]: Recovering failed service vm:vm04-ms
Jan 1 19:45:18 an-c05n02 kernel: device vnet1 entered promiscuous mode
Jan 1 19:45:18 an-c05n02 kernel: vbr2: port 3(vnet1) entering learning state
Jan 1 19:45:18 an-c05n02 rgmanager[2439]: Service vm:vm04-ms started
Jan 1 19:45:21 an-c05n02 ntpd[2200]: Listening on interface #19 vnet1, fe80::fc54:ff:fe5e:b147#123 Enabled
Jan 1 19:45:33 an-c05n02 kernel: vbr2: port 3(vnet1) entering forwarding state

Confirm with clustat;

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 19:46:17 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started

This time, it should recover on an-c05n01;

virsh destroy vm04-ms

Domain vm04-ms destroyed

Looking in syslog, we can see the counter was tripped.

Jan 1 19:45:33 an-c05n02 kernel: vbr2: port 3(vnet1) entering forwarding state
Jan 1 19:46:30 an-c05n02 kernel: vbr2: port 3(vnet1) entering disabled state
Jan 1 19:46:30 an-c05n02 kernel: device vnet1 left promiscuous mode
Jan 1 19:46:30 an-c05n02 kernel: vbr2: port 3(vnet1) entering disabled state
Jan 1 19:46:32 an-c05n02 ntpd[2200]: Deleting interface #19 vnet1, fe80::fc54:ff:fe5e:b147#123, interface stats: received=0, sent=0, dropped=0, active_time=71 secs
Jan 1 19:46:36 an-c05n02 rgmanager[2439]: status on vm "vm04-ms" returned 7 (unspecified)
Jan 1 19:46:37 an-c05n02 rgmanager[2439]: Stopping service vm:vm04-ms
Jan 1 19:46:37 an-c05n02 rgmanager[2439]: Service vm:vm04-ms is recovering
Jan 1 19:46:37 an-c05n02 rgmanager[2439]: Restart threshold for vm:vm04-ms exceeded; attempting to relocate
Jan 1 19:46:38 an-c05n02 rgmanager[2439]: Service vm:vm04-ms is now running on member 1

Indeed, this is confirmed with clustat.

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 19:48:23 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n01.alteeve.ca started

Wonderful! All four VMs fail and recover as we expected them to. Move the VM back and we're ready to crash the nodes!

clusvcadm -M vm:vm04-ms -m an-c05n02.alteeve.ca

Trying to migrate vm:vm04-ms to an-c05n02.alteeve.ca...Success

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 19:49:32 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started

Done and done!

Failing and Recovery of an-c05n01

The final stage of testing is also the most brutal. We're going to hang an-c05n01 in such a way that it stops responding to messages from an-c05n02. Within a few seconds, an-c05n01 should be fenced, and shortly after that, the two lost VMs should boot up on an-c05n02. This is a particularly important test for a somewhat non-obvious reason.

We could simply shut off an-c05n01, but we tested that earlier when we set up fencing. What we have not yet tested is how the cluster recovers from a hung node. To hang the host, we're going to trigger a special event in the kernel using the magic SysRq triggers. We'll do this by sending the letter c to the /proc/sysrq-trigger file, which crashes the kernel and, where kdump is configured, captures a crash dump. The node should be fenced before a memory dump can complete, so don't expect to see anything in /var/crash unless your system is extremely fast.
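Once the test has run, syslog on the surviving node holds enough detail to time the recovery. The sketch below is a hypothetical helper, not part of the cluster's tooling; the name fence_timeline is our own. It pulls the first failure-detection timestamp (DRBD's PingAck timeout) and the first fence-success timestamp out of a syslog excerpt copied from the surviving node.

```shell
#!/bin/sh
# fence_timeline: hypothetical helper, not part of the stock cluster tools.
# Reads syslog lines on stdin and prints when the peer's failure was first
# detected and when the fence against it succeeded.
fence_timeline() {
    awk '
        # The first "PingAck did not arrive" line is DRBD noticing the hang.
        /PingAck did not arrive in time/ && !d { d = $3; print "detected: " d }
        # The first "fence ... success" line from fenced is the kill.
        /fenced\[[0-9]*\]: fence .* success/ && !f { f = $3; print "fenced:   " f }
    '
}

# Feed it two of the lines captured during the test.
fence_timeline <<'EOF'
Jan 1 21:26:00 an-c05n02 kernel: block drbd1: PingAck did not arrive in time.
Jan 1 21:26:15 an-c05n02 fenced[2022]: fence an-c05n01.alteeve.ca success
EOF
```

Run against the excerpt from this test, it reports detected: 21:26:00 and fenced: 21:26:15, roughly fifteen seconds from hang to kill.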
So, on an-c05n01, issue the following command to crash the node.

echo c > /proc/sysrq-trigger

This command will not return. Watching syslog on an-c05n02, we'll see output like this;

Jan 1 21:26:00 an-c05n02 kernel: block drbd1: PingAck did not arrive in time.
Jan 1 21:26:00 an-c05n02 kernel: block drbd1: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Jan 1 21:26:00 an-c05n02 kernel: block drbd1: asender terminated
Jan 1 21:26:00 an-c05n02 kernel: block drbd1: Terminating asender thread
Jan 1 21:26:00 an-c05n02 kernel: block drbd1: Connection closed
Jan 1 21:26:00 an-c05n02 kernel: block drbd1: conn( NetworkFailure -> Unconnected )
Jan 1 21:26:00 an-c05n02 kernel: block drbd1: helper command: /sbin/drbdadm fence-peer minor-1
Jan 1 21:26:00 an-c05n02 kernel: block drbd1: receiver terminated
Jan 1 21:26:00 an-c05n02 kernel: block drbd1: Restarting receiver thread
Jan 1 21:26:00 an-c05n02 kernel: block drbd1: receiver (re)started
Jan 1 21:26:00 an-c05n02 kernel: block drbd1: conn( Unconnected -> WFConnection )
Jan 1 21:26:00 an-c05n02 /sbin/obliterate-peer.sh: Local node ID: 2 / Remote node: an-c05n01.alteeve.ca
Jan 1 21:26:01 an-c05n02 kernel: block drbd2: PingAck did not arrive in time.
Jan 1 21:26:01 an-c05n02 kernel: block drbd2: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Jan 1 21:26:01 an-c05n02 kernel: block drbd2: asender terminated
Jan 1 21:26:01 an-c05n02 kernel: block drbd2: Terminating asender thread
Jan 1 21:26:01 an-c05n02 kernel: block drbd2: Connection closed
Jan 1 21:26:01 an-c05n02 kernel: block drbd2: conn( NetworkFailure -> Unconnected )
Jan 1 21:26:01 an-c05n02 kernel: block drbd2: helper command: /sbin/drbdadm fence-peer minor-2
Jan 1 21:26:01 an-c05n02 kernel: block drbd2: receiver terminated
Jan 1 21:26:01 an-c05n02 kernel: block drbd2: Restarting receiver thread
Jan 1 21:26:01 an-c05n02 kernel: block drbd2: receiver (re)started
Jan 1 21:26:01 an-c05n02 kernel: block drbd2: conn( Unconnected -> WFConnection )
Jan 1 21:26:01 an-c05n02 /sbin/obliterate-peer.sh: Local node ID: 2 / Remote node: an-c05n01.alteeve.ca
Jan 1 21:26:01 an-c05n02 /sbin/obliterate-peer.sh: kill node failed: Invalid argument
Jan 1 21:26:03 an-c05n02 kernel: block drbd0: PingAck did not arrive in time.
Jan 1 21:26:03 an-c05n02 kernel: block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Jan 1 21:26:03 an-c05n02 kernel: block drbd0: asender terminated
Jan 1 21:26:03 an-c05n02 kernel: block drbd0: Terminating asender thread
Jan 1 21:26:03 an-c05n02 kernel: block drbd0: Connection closed
Jan 1 21:26:03 an-c05n02 kernel: block drbd0: conn( NetworkFailure -> Unconnected )
Jan 1 21:26:03 an-c05n02 kernel: block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
Jan 1 21:26:03 an-c05n02 kernel: block drbd0: receiver terminated
Jan 1 21:26:03 an-c05n02 kernel: block drbd0: Restarting receiver thread
Jan 1 21:26:03 an-c05n02 kernel: block drbd0: receiver (re)started
Jan 1 21:26:03 an-c05n02 kernel: block drbd0: conn( Unconnected -> WFConnection )
Jan 1 21:26:03 an-c05n02 /sbin/obliterate-peer.sh: Local node ID: 2 / Remote node: an-c05n01.alteeve.ca
Jan 1 21:26:03 an-c05n02 /sbin/obliterate-peer.sh: kill node failed: Invalid argument
Jan 1 21:26:09 an-c05n02 corosync[1963]: [TOTEM ] A processor failed, forming new configuration.
Jan 1 21:26:11 an-c05n02 corosync[1963]: [QUORUM] Members[1]: 2
Jan 1 21:26:11 an-c05n02 corosync[1963]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 1 21:26:11 an-c05n02 kernel: dlm: closing connection to node 1
Jan 1 21:26:11 an-c05n02 corosync[1963]: [CPG ] chosen downlist: sender r(0) ip(10.20.50.2) ; members(old:2 left:1)
Jan 1 21:26:11 an-c05n02 corosync[1963]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 1 21:26:11 an-c05n02 fenced[2022]: fencing node an-c05n01.alteeve.ca
Jan 1 21:26:11 an-c05n02 kernel: GFS2: fsid=an-cluster-A:shared.0: jid=1: Trying to acquire journal lock...
Jan 1 21:26:14 an-c05n02 fence_node[15572]: fence an-c05n01.alteeve.ca success
Jan 1 21:26:14 an-c05n02 kernel: block drbd1: helper command: /sbin/drbdadm fence-peer minor-1 exit code 7 (0x700)
Jan 1 21:26:14 an-c05n02 kernel: block drbd1: fence-peer helper returned 7 (peer was stonithed)
Jan 1 21:26:14 an-c05n02 kernel: block drbd1: pdsk( DUnknown -> Outdated )
Jan 1 21:26:14 an-c05n02 kernel: block drbd1: new current UUID 6355AAB258658E8F:4642D156D54731A1:5F8A6B05E2FCCE19:165E9B466805EC81
Jan 1 21:26:14 an-c05n02 kernel: block drbd1: susp( 1 -> 0 )
Jan 1 21:26:15 an-c05n02 fenced[2022]: fence an-c05n01.alteeve.ca success
Jan 1 21:26:15 an-c05n02 fence_node[15672]: fence an-c05n01.alteeve.ca success
Jan 1 21:26:15 an-c05n02 kernel: block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 7 (0x700)
Jan 1 21:26:15 an-c05n02 kernel: block drbd0: fence-peer helper returned 7 (peer was stonithed)
Jan 1 21:26:15 an-c05n02 kernel: block drbd0: pdsk( DUnknown -> Outdated )
Jan 1 21:26:15 an-c05n02 kernel: block drbd0: new current UUID C1F5EF16EE80E6C1:1B503B46E6650575:234E9A10EE04FDE7:7DBC4288E230DC9B
Jan 1 21:26:15 an-c05n02 kernel: block drbd0: susp( 1 -> 0 )
Jan 1 21:26:15 an-c05n02 fence_node[15627]: fence an-c05n01.alteeve.ca success
Jan 1 21:26:15 an-c05n02 kernel: block drbd2: helper command: /sbin/drbdadm fence-peer minor-2 exit code 7 (0x700)
Jan 1 21:26:15 an-c05n02 kernel: block drbd2: fence-peer helper returned 7 (peer was stonithed)
Jan 1 21:26:15 an-c05n02 kernel: block drbd2: pdsk( DUnknown -> Outdated )
Jan 1 21:26:15 an-c05n02 kernel: block drbd2: new current UUID 1F79DE480F1E33C1:A674C3CB12017193:76118DDAE165C5FB:871F8081B7D527A9
Jan 1 21:26:15 an-c05n02 kernel: block drbd2: susp( 1 -> 0 )
Jan 1 21:26:16 an-c05n02 kernel: GFS2: fsid=an-cluster-A:shared.0: jid=1: Looking at journal...
Jan 1 21:26:16 an-c05n02 kernel: GFS2: fsid=an-cluster-A:shared.0: jid=1: Done
Jan 1 21:26:16 an-c05n02 rgmanager[2514]: Marking service:storage_an01 as stopped: Restricted domain unavailable
Jan 1 21:26:16 an-c05n02 rgmanager[2514]: Taking over service vm:vm01-dev from down member an-c05n01.alteeve.ca
Jan 1 21:26:16 an-c05n02 rgmanager[2514]: Taking over service vm:vm02-web from down member an-c05n01.alteeve.ca
Jan 1 21:26:17 an-c05n02 kernel: device vnet2 entered promiscuous mode
Jan 1 21:26:17 an-c05n02 kernel: vbr2: port 4(vnet2) entering learning state
Jan 1 21:26:17 an-c05n02 rgmanager[2514]: Service vm:vm01-dev started
Jan 1 21:26:17 an-c05n02 kernel: device vnet3 entered promiscuous mode
Jan 1 21:26:17 an-c05n02 kernel: vbr2: port 5(vnet3) entering learning state
Jan 1 21:26:18 an-c05n02 rgmanager[2514]: Service vm:vm02-web started
Jan 1 21:26:20 an-c05n02 ntpd[2275]: Listening on interface #12 vnet2, fe80::fc54:ff:fe9b:3cf7#123 Enabled
Jan 1 21:26:20 an-c05n02 ntpd[2275]: Listening on interface #13 vnet3, fe80::fc54:ff:fe65:3960#123 Enabled
Jan 1 21:26:27 an-c05n02 kernel: kvm: 16177: cpu0 unimplemented perfctr wrmsr: 0xc1 data 0xabcd
Jan 1 21:26:29 an-c05n02 kernel: kvm: 16118: cpu0 unimplemented perfctr wrmsr: 0xc1 data 0xabcd
Jan 1 21:26:32 an-c05n02 kernel: vbr2: port 4(vnet2) entering forwarding state
Jan 1 21:26:32 an-c05n02 kernel: vbr2: port 5(vnet3) entering forwarding state

Checking with clustat, we can confirm that all four VMs are now running on an-c05n02.

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 21:28:00 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, rgmanager
an-c05n02.alteeve.ca 2 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n02.alteeve.ca started
vm:vm02-web an-c05n02.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started

Perfect! This is exactly why we built the cluster! If we wait a few minutes, we'll see that the hung node has recovered.

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 22:30:04 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, rgmanager
an-c05n02.alteeve.ca 2 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n02.alteeve.ca started
vm:vm02-web an-c05n02.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started

Before we can push the VMs back, though, we must make sure that the underlying DRBD resources have finished synchronizing.
cat /proc/drbd

version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by dag@Build64R6, 2011-11-20 10:57:03
0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:1182704 nr:1053880 dw:1052676 dr:1245848 al:0 bm:266 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:2087568 nr:362698 dw:366444 dr:2263316 al:9 bm:411 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
2: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:2098343 nr:1114307 dw:1065375 dr:2340421 al:10 bm:551 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

We're ready, so let's migrate back vm01-dev and vm02-web.

clusvcadm -M vm:vm01-dev -m an-c05n01.alteeve.ca

Trying to migrate vm:vm01-dev to an-c05n01.alteeve.ca...Success

clusvcadm -M vm:vm02-web -m an-c05n01.alteeve.ca

Trying to migrate vm:vm02-web to an-c05n01.alteeve.ca...Success

Confirm;

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 22:37:10 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started

There we have it; a successful crash and recovery of an-c05n01.

Discussing the syslog Messages

Let's step back and look at the syslog output; there are a few things to discuss. The first thing we see is that, almost immediately after an-c05n01 hangs, the first messages come from DRBD, not the cluster. This is because DRBD is extremely sensitive to interruptions, even more so than the cluster itself; you will notice that DRBD reacted a full nine seconds before the cluster did. The first thing DRBD does, upon realizing it has lost communication with its peer, is call a fence against the lost node. It does this through its fence-handler script, obliterate-peer.sh, which is itself a very simple wrapper around cman_tool and fence_node shell calls.

Jan 1 21:26:00 an-c05n02 kernel: block drbd1: helper command: /sbin/drbdadm fence-peer minor-1
Jan 1 21:26:00 an-c05n02 kernel: block drbd1: receiver terminated
Jan 1 21:26:00 an-c05n02 kernel: block drbd1: Restarting receiver thread
Jan 1 21:26:00 an-c05n02 kernel: block drbd1: receiver (re)started
Jan 1 21:26:00 an-c05n02 kernel: block drbd1: conn( Unconnected -> WFConnection )
Jan 1 21:26:00 an-c05n02 /sbin/obliterate-peer.sh: Local node ID: 2 / Remote node: an-c05n01.alteeve.ca

Here we see DRBD calling the handler (first message) and, shortly after, a log entry from obliterate-peer.sh (last entry). What you don't see is that right after that last message, obliterate-peer.sh goes into a 10-iteration loop in which it calls fence_node against its peer.

Jan 1 21:26:01 an-c05n02 /sbin/obliterate-peer.sh: Local node ID: 2 / Remote node: an-c05n01.alteeve.ca
Jan 1 21:26:01 an-c05n02 /sbin/obliterate-peer.sh: kill node failed: Invalid argument

The fence_node call runs in the background, so the obliterate-peer.sh script goes into a short sleep before trying again (and again...). These subsequent calls generate the kill node failed: Invalid argument error because the first call is already in the process of fencing the node, so they are safe to ignore. The important part is that this error message didn't follow the first entry.

Jan 1 21:26:15 an-c05n02 fenced[2022]: fence an-c05n01.alteeve.ca success

This is what matters. Here we see that the fence succeeded and the hung node was indeed fenced.

Failing and Recovery of an-c05n02

With everything back in place, we'll hang an-c05n02 and ensure that its VMs recover on an-c05n01. As always, check the current state.

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 22:53:43 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started

Now hang an-c05n02.

echo c > /proc/sysrq-trigger

As before, that command will not return. If we check an-c05n01's syslog though, we should see that the node is fenced and the lost VMs are recovered.

Jan 1 22:56:14 an-c05n01 kernel: block drbd1: PingAck did not arrive in time.
Jan 1 22:56:14 an-c05n01 kernel: block drbd1: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Jan 1 22:56:15 an-c05n01 kernel: block drbd1: asender terminated
Jan 1 22:56:15 an-c05n01 kernel: block drbd1: Terminating asender thread
Jan 1 22:56:15 an-c05n01 kernel: block drbd1: Connection closed
Jan 1 22:56:15 an-c05n01 kernel: block drbd1: conn( NetworkFailure -> Unconnected )
Jan 1 22:56:15 an-c05n01 kernel: block drbd1: helper command: /sbin/drbdadm fence-peer minor-1
Jan 1 22:56:15 an-c05n01 kernel: block drbd1: receiver terminated
Jan 1 22:56:15 an-c05n01 kernel: block drbd1: Restarting receiver thread
Jan 1 22:56:15 an-c05n01 kernel: block drbd1: receiver (re)started
Jan 1 22:56:15 an-c05n01 kernel: block drbd1: conn( Unconnected -> WFConnection )
Jan 1 22:56:15 an-c05n01 /sbin/obliterate-peer.sh: Local node ID: 1 / Remote node: an-c05n02.alteeve.ca
Jan 1 22:56:19 an-c05n01 kernel: block drbd0: PingAck did not arrive in time.
Jan 1 22:56:19 an-c05n01 kernel: block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Jan 1 22:56:19 an-c05n01 kernel: block drbd0: asender terminated
Jan 1 22:56:19 an-c05n01 kernel: block drbd0: Terminating asender thread
Jan 1 22:56:19 an-c05n01 kernel: block drbd0: Connection closed
Jan 1 22:56:19 an-c05n01 kernel: block drbd0: conn( NetworkFailure -> Unconnected )
Jan 1 22:56:19 an-c05n01 kernel: block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
Jan 1 22:56:19 an-c05n01 kernel: block drbd0: receiver terminated
Jan 1 22:56:19 an-c05n01 kernel: block drbd0: Restarting receiver thread
Jan 1 22:56:19 an-c05n01 kernel: block drbd0: receiver (re)started
Jan 1 22:56:19 an-c05n01 kernel: block drbd0: conn( Unconnected -> WFConnection )
Jan 1 22:56:19 an-c05n01 /sbin/obliterate-peer.sh: Local node ID: 1 / Remote node: an-c05n02.alteeve.ca
Jan 1 22:56:19 an-c05n01 /sbin/obliterate-peer.sh: kill node failed: Invalid argument
Jan 1 22:56:21 an-c05n01 kernel: block drbd2: PingAck did not arrive in time.
Jan 1 22:56:21 an-c05n01 kernel: block drbd2: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Jan 1 22:56:21 an-c05n01 kernel: block drbd2: asender terminated
Jan 1 22:56:21 an-c05n01 kernel: block drbd2: Terminating asender thread
Jan 1 22:56:21 an-c05n01 kernel: block drbd2: Connection closed
Jan 1 22:56:21 an-c05n01 kernel: block drbd2: conn( NetworkFailure -> Unconnected )
Jan 1 22:56:21 an-c05n01 kernel: block drbd2: receiver terminated
Jan 1 22:56:21 an-c05n01 kernel: block drbd2: Restarting receiver thread
Jan 1 22:56:21 an-c05n01 kernel: block drbd2: receiver (re)started
Jan 1 22:56:21 an-c05n01 kernel: block drbd2: conn( Unconnected -> WFConnection )
Jan 1 22:56:21 an-c05n01 kernel: block drbd2: helper command: /sbin/drbdadm fence-peer minor-2
Jan 1 22:56:21 an-c05n01 /sbin/obliterate-peer.sh: Local node ID: 1 / Remote node: an-c05n02.alteeve.ca
Jan 1 22:56:21 an-c05n01 /sbin/obliterate-peer.sh: kill node failed: Invalid argument
Jan 1 22:56:22 an-c05n01 corosync[1958]: [TOTEM ] A processor failed, forming new configuration.
Jan 1 22:56:24 an-c05n01 corosync[1958]: [QUORUM] Members[1]: 1
Jan 1 22:56:24 an-c05n01 corosync[1958]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 1 22:56:24 an-c05n01 kernel: dlm: closing connection to node 2
Jan 1 22:56:24 an-c05n01 corosync[1958]: [CPG ] chosen downlist: sender r(0) ip(10.20.50.1) ; members(old:2 left:1)
Jan 1 22:56:24 an-c05n01 corosync[1958]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 1 22:56:24 an-c05n01 fenced[2014]: fencing node an-c05n02.alteeve.ca
Jan 1 22:56:24 an-c05n01 kernel: GFS2: fsid=an-cluster-A:shared.1: jid=0: Trying to acquire journal lock...
Jan 1 22:56:28 an-c05n01 fenced[2014]: fence an-c05n02.alteeve.ca success
Jan 1 22:56:29 an-c05n01 fence_node[638]: fence an-c05n02.alteeve.ca success
Jan 1 22:56:29 an-c05n01 kernel: block drbd2: helper command: /sbin/drbdadm fence-peer minor-2 exit code 7 (0x700)
Jan 1 22:56:29 an-c05n01 kernel: block drbd2: fence-peer helper returned 7 (peer was stonithed)
Jan 1 22:56:29 an-c05n01 kernel: block drbd2: pdsk( DUnknown -> Outdated )
Jan 1 22:56:29 an-c05n01 kernel: block drbd2: new current UUID 207F7C9279067EC1:3EEB0F756A6A289F:FD92DAC355F53A93:FD91DAC355F53A93
Jan 1 22:56:29 an-c05n01 kernel: block drbd2: susp( 1 -> 0 )
Jan 1 22:56:29 an-c05n01 fence_node[518]: fence an-c05n02.alteeve.ca success
Jan 1 22:56:29 an-c05n01 kernel: block drbd1: helper command: /sbin/drbdadm fence-peer minor-1 exit code 7 (0x700)
Jan 1 22:56:29 an-c05n01 kernel: block drbd1: fence-peer helper returned 7 (peer was stonithed)
Jan 1 22:56:29 an-c05n01 kernel: block drbd1: pdsk( DUnknown -> Outdated )
Jan 1 22:56:29 an-c05n01 kernel: block drbd1: new current UUID C65C044AE682D8C5:67D512BD61B70265:C1947DF86E910F8B:C1937DF86E910F8B
Jan 1 22:56:29 an-c05n01 kernel: block drbd1: susp( 1 -> 0 )
Jan 1 22:56:29 an-c05n01 rgmanager[2507]: Marking service:storage_an02 as stopped: Restricted domain unavailable
Jan 1 22:56:29 an-c05n01 fence_node[583]: fence an-c05n02.alteeve.ca success
Jan 1 22:56:29 an-c05n01 kernel: block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 7 (0x700)
Jan 1 22:56:29 an-c05n01 kernel: block drbd0: fence-peer helper returned 7 (peer was stonithed)
Jan 1 22:56:29 an-c05n01 kernel: block drbd0: pdsk( DUnknown -> Outdated )
Jan 1 22:56:29 an-c05n01 kernel: block drbd0: new current UUID 295A00166167B5C3:A3F3889ECF7247F5:30313B4AFFF6F82B:30303B4AFFF6F82B
Jan 1 22:56:29 an-c05n01 kernel: block drbd0: susp( 1 -> 0 )
Jan 1 22:56:29 an-c05n01 kernel: GFS2: fsid=an-cluster-A:shared.1: jid=0: Looking at journal...
Jan 1 22:56:30 an-c05n01 kernel: GFS2: fsid=an-cluster-A:shared.1: jid=0: Done
Jan 1 22:56:30 an-c05n01 rgmanager[2507]: Taking over service vm:vm03-db from down member an-c05n02.alteeve.ca
Jan 1 22:56:30 an-c05n01 rgmanager[2507]: Taking over service vm:vm04-ms from down member an-c05n02.alteeve.ca
Jan 1 22:56:30 an-c05n01 kernel: device vnet2 entered promiscuous mode
Jan 1 22:56:30 an-c05n01 kernel: vbr2: port 4(vnet2) entering learning state
Jan 1 22:56:30 an-c05n01 rgmanager[2507]: Service vm:vm03-db started
Jan 1 22:56:31 an-c05n01 kernel: device vnet3 entered promiscuous mode
Jan 1 22:56:31 an-c05n01 kernel: vbr2: port 5(vnet3) entering learning state
Jan 1 22:56:31 an-c05n01 rgmanager[2507]: Service vm:vm04-ms started
Jan 1 22:56:34 an-c05n01 ntpd[2267]: Listening on interface #12 vnet3, fe80::fc54:ff:fe5e:b147#123 Enabled
Jan 1 22:56:34 an-c05n01 ntpd[2267]: Listening on interface #13 vnet2, fe80::fc54:ff:fe44:83ec#123 Enabled
Jan 1 22:56:40 an-c05n01 kernel: kvm: 1074: cpu0 unimplemented perfctr wrmsr: 0xc1 data 0xabcd
Jan 1 22:56:45 an-c05n01 kernel: vbr2: port 4(vnet2) entering forwarding state
Jan 1 22:56:46 an-c05n01 kernel: vbr2: port 5(vnet3) entering forwarding state

Checking clustat;

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 22:57:36 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Offline
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 (an-c05n02.alteeve.ca) stopped
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n01.alteeve.ca started
vm:vm04-ms an-c05n01.alteeve.ca started All four VMs are back up and running on an-c05n01! Within a few moments, we should see see that an-c05n02 has rejoined the cluster. clustat Cluster Status for an-cluster-A @ Sun Jan 1 23:00:43 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n01.alteeve.ca started
vm:vm04-ms an-c05n01.alteeve.ca started

Now we'll wait for the backing DRBD resources to be in sync.

cat /proc/drbd

version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by dag@Build64R6, 2011-11-20 10:57:03
0: cs:SyncTarget ro:Primary/Primary ds:Inconsistent/UpToDate C r-----
ns:0 nr:272884 dw:271744 dr:5700 al:0 bm:25 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:780928
[====>...............] sync'ed: 26.4% (780928/1052672)K
finish: 0:10:02 speed: 1,284 (1,280) want: 250 K/sec
1: cs:SyncTarget ro:Primary/Primary ds:Inconsistent/UpToDate C r-----
ns:0 nr:272196 dw:271048 dr:3688 al:0 bm:45 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:122292
[=============>......] sync'ed: 70.2% (122292/393216)K
finish: 0:01:31 speed: 1,328 (1,276) want: 250 K/sec
2: cs:SyncTarget ro:Primary/Primary ds:Inconsistent/UpToDate C r-----
ns:0 nr:273426 dw:272258 dr:3636 al:0 bm:47 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:781500
[====>...............] sync'ed: 26.4% (781500/1052760)K
finish: 0:09:49 speed: 1,308 (1,284) want: 250 K/sec

(time passes)

cat /proc/drbd

version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by dag@Build64R6, 2011-11-20 10:57:03
0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:1053812 dw:1052672 dr:6964 al:0 bm:74 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:394560 dw:393412 dr:4988 al:0 bm:70 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
2: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:1055190 dw:1054022 dr:4936 al:0 bm:167 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

Now we're ready to migrate vm03-db and vm04-ms back to an-c05n02.

clusvcadm -M vm:vm03-db -m an-c05n02.alteeve.ca

Trying to migrate vm:vm03-db to an-c05n02.alteeve.ca...Success

clusvcadm -M vm:vm04-ms -m an-c05n02.alteeve.ca

Trying to migrate vm:vm04-ms to an-c05n02.alteeve.ca...Success

A final check;

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 23:08:06 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started

Good!

Complete Cold Shut Down And Cold Starting The Cluster

Testing is now complete, but there is one final task to cover; a "Cold Shut Down" and "Cold Start" of the cluster. This involves shutting down all VMs, stopping rgmanager and cman on both nodes, then powering off both nodes. The cold-start process simply involves powering both nodes back on within the configured post_join_delay, then manually enabling the four VMs.

Stopping All VMs

Check the status as always;

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 23:13:24 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started

All four VMs are up, so we'll stop all of them.
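The four disable calls that follow can be wrapped in a small helper, so a failed stop doesn't go unnoticed. A minimal sketch; the service names are this tutorial's examples and the `stop_all_vms` function name is our own:

```shell
#!/bin/sh
# stop_all_vms: disable each clustered VM service in turn.
# Aborts on the first failure so you can investigate before
# proceeding with the cluster shutdown.
stop_all_vms() {
	for vm in vm01-dev vm02-web vm03-db vm04-ms; do
		if ! clusvcadm -d "vm:${vm}"; then
			echo "Failed to disable vm:${vm}, aborting." >&2
			return 1
		fi
	done
	echo "All VM services disabled."
}
```

Run it from either node, then confirm the result with clustat as usual.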
clusvcadm -d vm:vm01-dev

Local machine disabling vm:vm01-dev...Success

clusvcadm -d vm:vm02-web

Local machine disabling vm:vm02-web...Success

clusvcadm -d vm:vm03-db

Local machine disabling vm:vm03-db...Success

clusvcadm -d vm:vm04-ms

Local machine disabling vm:vm04-ms...Success

Confirm;

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 23:17:29 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev (an-c05n01.alteeve.ca) disabled
vm:vm02-web (an-c05n01.alteeve.ca) disabled
vm:vm03-db (an-c05n02.alteeve.ca) disabled
vm:vm04-ms (an-c05n02.alteeve.ca) disabled

Good, we can now stop rgmanager on both nodes.

Shutting Down The Cluster Entirely
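The per-node stop sequence below can also be sketched as a tiny script. This uses the `service` wrapper rather than calling the `/etc/init.d/` scripts directly, and the `stop_cluster_stack` name is our own; the important part is the ordering: rgmanager must be stopped before cman.

```shell
#!/bin/sh
# stop_cluster_stack: stop the cluster software on the local node.
# rgmanager must stop first; cman cannot leave the cluster cleanly
# while rgmanager still holds services.
stop_cluster_stack() {
	service rgmanager stop || return 1
	service cman stop || return 1
	echo "Cluster stack stopped on $(hostname -s)."
}
```

Run it on one node at a time, as in the manual steps that follow.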
On an-c05n01;

/etc/init.d/rgmanager stop

Stopping Cluster Service Manager: [ OK ]

On an-c05n02;

/etc/init.d/rgmanager stop

Stopping Cluster Service Manager: [ OK ]

Now stop cman on both nodes.

On an-c05n01;

/etc/init.d/cman stop

Stopping cluster:
Leaving fence domain... [ OK ]
Stopping gfs_controld... [ OK ]
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... [ OK ]
Waiting for corosync to shutdown: [ OK ]
Unloading kernel modules... [ OK ]
Unmounting configfs... [ OK ]

On an-c05n02;

/etc/init.d/cman stop

Stopping cluster:
Leaving fence domain... [ OK ]
Stopping gfs_controld... [ OK ]
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... [ OK ]
Waiting for corosync to shutdown: [ OK ]
Unloading kernel modules... [ OK ]
Unmounting configfs... [ OK ]

We're down; we can safely power off the nodes now.

poweroff

Broadcast message from root@an-c05n01.alteeve.ca
(/dev/pts/0) at 23:22 ...
The system is going down for power off NOW!

Cold-Stop achieved!

Cold-Starting The Cluster
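Before walking through it manually, the power-on step can be sketched as a small script, run from a workstation on the BCN with fence-agents installed. The `cold_start_nodes` name is our own; the IPMI hostnames and credentials are this tutorial's examples:

```shell
#!/bin/sh
# cold_start_nodes: power on both nodes over IPMI from the BCN.
# Both nodes must come up within cman's post_join_delay, so they
# are started back to back rather than one at a time.
cold_start_nodes() {
	for node in an-c05n01 an-c05n02; do
		if ! fence_ipmilan -a "${node}.ipmi" -l root -p secret -o on; then
			echo "Failed to power on ${node}!" >&2
			return 1
		fi
	done
	echo "Both nodes powering on."
}
```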
Power on both nodes. You can simply press the power buttons or, if you have a workstation on the BCN with fence-agents installed, call fence_ipmilan (or whichever fence agent your cluster uses).

fence_ipmilan -a an-c05n01.ipmi -l root -p secret -o on

Powering on machine @ IPMI:an-c05n01.ipmi...Done

fence_ipmilan -a an-c05n02.ipmi -l root -p secret -o on

Powering on machine @ IPMI:an-c05n02.ipmi...Done

Once they're up, log into them again and check their status. You will see that the VMs are offline.

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 23:40:16 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, Local, rgmanager
an-c05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev (none) disabled
vm:vm02-web (none) disabled
vm:vm03-db (none) disabled
vm:vm04-ms (none) disabled

Check that DRBD is ready;

cat /proc/drbd

version: 8.3.12 (api:88/proto:86-96)
GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by dag@Build64R6, 2011-11-20 10:57:03
0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:4 nr:0 dw:0 dr:8712 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:0 dw:0 dr:4632 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
2: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:0 dw:0 dr:4648 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Golden, let's start the VMs.

clusvcadm -e vm:vm01-dev -m an-c05n01.alteeve.ca

vm:vm01-dev is now running on an-c05n01.alteeve.ca

clusvcadm -e vm:vm02-web -m an-c05n01.alteeve.ca

vm:vm02-web is now running on an-c05n01.alteeve.ca

clusvcadm -e vm:vm03-db -m an-c05n02.alteeve.ca

vm:vm03-db is now running on an-c05n02.alteeve.ca

clusvcadm -e vm:vm04-ms -m an-c05n02.alteeve.ca

vm:vm04-ms is now running on an-c05n02.alteeve.ca

Check the new status;

clustat

Cluster Status for an-cluster-A @ Sun Jan 1 23:45:35 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-c05n01.alteeve.ca 1 Online, rgmanager
an-c05n02.alteeve.ca 2 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:storage_an01 an-c05n01.alteeve.ca started
service:storage_an02 an-c05n02.alteeve.ca started
vm:vm01-dev an-c05n01.alteeve.ca started
vm:vm02-web an-c05n01.alteeve.ca started
vm:vm03-db an-c05n02.alteeve.ca started
vm:vm04-ms an-c05n02.alteeve.ca started

We're back up and running!

Done and Done!

That, ladies and gentlemen, is all she wrote! You should now be ready to safely take your cluster into production. Happy hacking!

Troubleshooting

The troubleshooting section has pushed MediaWiki beyond its single-article length limit. For this reason, it has been moved to its own page.

Disabling rsyslog Rate Limiting

Please see;