High-Availability Clustering in the Open Source Ecosystem

Warning: This is not considered complete yet, and may contain mistakes and factual errors.

Two Histories, One Future

Prologue

Open-source high-availability clustering has a complex history that can cause confusion for new users. Many traces of this history are still found on the Internet, leading new users to make false starts down deprecated paths in their early efforts to learn HA clustering.

The principal goal of this document is to provide clarity on the future of open source clustering. If it succeeds in this goal, future new users will be less likely to start down the wrong path.

In order to understand the future, we must start in the past. Documenting the past is always a tricky prospect and prone to error. Should you have insight that would help improve the completeness or accuracy of this document, please contact us at "docs@alteeve.ca". We will reply as quickly as possible and will improve this document with your help.

Two Types of Clusters; HA and HPC

There are two distinct types of "clusters"; High-Availability (HA) clusters and High-Performance Computing (HPC) clusters. The former, HA clusters, are designed to ensure the availability of services in spite of otherwise-catastrophic failures. The latter, HPC clusters, are designed to speed up the processing of problems beyond the abilities of any single server.

This document speaks specifically about the history and future of high-availability clusters.

A Brief History of Two Pasts and One Future

Two entirely independent attempts to create an open-source high-availability platform were started in the late 1990s. The earlier of the two stacks was SUSE's "Linux-HA" project. The other stack would become Red Hat's "Cluster Services", though its genesis was spread across two projects; Mission Critical Linux and Sistina's "Global File System".

These two projects remained entirely separate until 2007 when, out of the Linux-HA project, Pacemaker was born as a cluster resource manager that could take membership from and communicate via Red Hat's OpenAIS or SUSE's Heartbeat.

In 2008, an informal meeting was held where both SUSE and Red Hat developers discussed reusing some code. SUSE's main interest was CRM/pacemaker and Red Hat's main interest was OpenAIS. By this time, both companies had large installed bases. Both understood that this merger effort would take many years to complete, in order to ensure minimal impact on existing users.

In 2013, Red Hat announced full support for Pacemaker with the release of Red Hat Enterprise Linux 6.5. In RHEL 6, Pacemaker still draws its quorum and membership from cman. Elsewhere, distributions that support corosync version 2 can draw quorum and membership from it instead.

All new users to high availability clustering should now use "Pacemaker" for resource management and "corosync" for cluster membership, quorum and communications. Any existing users of other, older cluster software should now be actively developing a migration path to this new stack.

Necessary Vocabulary

There is a fair amount of jargon in high-availability clustering, and understanding it will make the topics discussed below much easier to follow. Readers who are new to HA clustering would benefit from jumping down to the glossary section below before proceeding.

SUSE's Story; Linux-HA, Heartbeat and Pacemaker

In 1998, SUSE hired the developers of the Linux-HA project, who were creating a new protocol called "Heartbeat". In 1999, a resource manager was built on top of this new protocol, creating a monolithic, two-node cluster stack. This provided membership, message passing, fencing and resource management in a single program.

Heartbeat "version 1" was limited to two nodes (the membership layer supported more nodes, but the resource management did not) and provided a basic resource management model, but it had the benefit of being very easy to use.

In 2003, the limited capabilities of Heartbeat v1's resource manager spurred a new project called the "cluster resource manager", or "crm". This was a new program dedicated to resource management and designed to support up to 16 nodes. The CRM sat on top of Heartbeat and used it for membership, fencing and cluster communication.

A year and a half later, on July 29, 2005, "Heartbeat version 2" was released, introducing the CRM. Version 2 introduced support for larger clusters, more complex resource management and support for openais as the membership and communication layer. Heartbeat also introduced the CCM, the cluster consensus manager, an upgraded membership algorithm that gathered the list of agreed-upon members and handled quorum.

Enter Pacemaker

Until Heartbeat version 2.1.3, the CRM was coupled tightly to Heartbeat, matching its versioning and release cycle. In 2007, it was decided to break the CRM out of the main Heartbeat package into a new project called "Pacemaker". It was extended to support both Heartbeat and the new "OpenAIS" program for cluster membership and communication. Quorum became a plug-in.

In 2008, Pacemaker version 0.6.0 was released. This remained the main version until the release of Pacemaker version 1.0 in 2010. Until then, working with Pacemaker required directly editing its "Cluster Information Base" (CIB), an XML configuration and state file that was as powerful as it was user-unfriendly. With the release of version 1.0, the CIB was abstracted away behind the "crm" shell. Support for Red Hat's "cman" quorum provider was also added.

In 2010, pacemaker 1.1 was released.

Exit Heartbeat

The Heartbeat project reached version 3, but its user base was declining. It had (and still has) many fans who love its simplicity. The flexibility that comes with either Pacemaker or RHCS certainly raises the barrier to entry for many users.

Despite the loyal fans, Heartbeat's limitations restricted its use cases to an ever-smaller number. As Pacemaker and RHCS matured and their respective tools became easier to use, the barrier to entry dropped. Eventually, in <year?>, development of Heartbeat was stopped altogether.

In <year?>, LinBit, the company behind the very popular DRBD project, took over maintenance of the Heartbeat code base. To this day, they provide commercial support for Heartbeat clusters, along with bug and security fixes.

No plans to restart development exist though, so heartbeat as a project should be considered “deprecated”. Anyone still running a heartbeat cluster should be working on migration plans. Certainly, no new projects should use it.

Heartbeat version 3 is basically heartbeat version 2 with the resource management reverted to that used in version 1.

Red Hat's Cluster Services; A More Complicated Story

Where SUSE's Linux-HA project was fairly linear, Red Hat's history is somewhat more involved. It was born out of two projects started by two different companies.

Mission Critical Linux

In 1999, a group of engineers, mostly former DEC employees, founded a company called “Mission Critical Linux”. The goal of this company was to create an enterprise-grade Linux distribution based around a high-availability cluster stack.

In 2000, Mission Critical Linux looked at the Linux-HA project and decided that it did not suit their needs. They did, however, take STONITH from the Linux-HA project and adapt it for their own use. They built an HA platform called "Kimberlite", which supported two-node clusters.

Sistina Software's Global File System

Separately, in 199?, a group of people from the University of Minnesota began work on a clustered file system they called "Global File System". In 2000, this group started a company called Sistina Software.

Sistina's other notable contribution to Linux was the Logical Volume Manager (LVM). Initially, this was not a cluster project, but it is one of the more important contributions to Linux and deserves to be noted.

In 2003, Red Hat purchased Sistina Software.

The Birth of Red Hat Cluster Services

In 2002, when MCL closed, Red Hat took their "Kimberlite" HA stack and used it to create "Red Hat Cluster Manager" version 1, called "clumanager" for short. This was introduced in Red Hat Advanced Server 2.1, which itself became "Red Hat Enterprise Linux".

Like the Linux-HA project's Heartbeat program, cman was a monolithic program providing cluster communications, membership, quorum and resource management.

In 2003, with the acquisition of Sistina Software, GFS was introduced. Unlike traditional file systems, GFS was designed to coordinate access to a common file system from multiple nodes. It did this by way of a "distributed lock manager", called DLM. In 20??, LVM would be extended to support DLM, allowing multiple nodes to manage LVM resources on shared storage in a coordinated manner.

Around the same time, Kimberlite was largely rewritten, extending commercially supported cluster sizes to up to 8 nodes, though technically, clusters could have more nodes.

Another core component of Red Hat's cluster stack was the Quorum Disk, called "qdisk" for short. It used SAN storage as a key component in intelligently determining which nodes had quorum, particularly in a failure state. It did this by holding one or more quorum votes of its own.

Should the cluster partition, qdisk could use heuristics to decide which partition would get its votes and thus have quorum. Through these heuristics, qdisk could, for example, give its vote to a smaller partition if that partition still had access to critical resources, like a router.
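
The heuristic idea can be sketched conceptually in a few lines of Python. This is purely illustrative and rests on invented assumptions (a single ping check against a hypothetical router address and a made-up scoring threshold); it is not qdiskd's actual code or cluster.conf syntax:

  # Conceptual sketch of the qdisk heuristic idea. The check below (pinging a
  # hypothetical router at 192.168.1.1) and the scoring are invented for
  # illustration; this is not Red Hat's qdiskd implementation.
  import subprocess

  HEURISTICS = [
      # (description, command, score awarded if the command succeeds)
      ("can reach the router", ["ping", "-c", "1", "-w", "1", "192.168.1.1"], 1),
  ]
  REQUIRED_SCORE = 1  # minimum score needed to claim the quorum disk's vote(s)

  def partition_score() -> int:
      """Sum the scores of all heuristics that pass on this node."""
      score = 0
      for description, command, value in HEURISTICS:
          ok = subprocess.run(command, stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL).returncode == 0
          print(f"heuristic '{description}': {'pass' if ok else 'fail'}")
          if ok:
              score += value
      return score

  if partition_score() >= REQUIRED_SCORE:
      print("this partition may claim the qdisk vote(s) and keep quorum")
  else:
      print("this partition must yield; it cannot claim the qdisk vote(s)")

The point of the sketch is only the decision rule: a partition that can still reach the critical resources earns the qdisk votes, even if it is the smaller partition.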

Two other core components of the stack also deserve mention:

  • Fenced; the fence daemon, which used power or fabric fencing to forcibly power off a node that stopped responding, or to logically disconnect it from the (storage) network.
  • Clustered LVM; this extended normal LVM to support shared storage inside the cluster.

Q. What is GuLM?

Grand Unified Lock Manager; it was replaced by DLM.

Q. When was DLM created? Was it part of Sistina's GFS?

No, it was introduced with GFS2 to replace GuLM.

Q. When and why was fencing split out?

It was originally part of lock_gulmd, but the architecture of DLM/cman required a stand-alone fence daemon.

Q. When did qdisk get added? Was it from MCL or was it part of the re-write?

It was created around 2004, using some work done by MCL and extended by Red Hat. Check git history.

Q. When was clvmd created?

Outlier; OpenAIS

The OpenAIS project was created as an open source implementation of the SA Forum's AIS API. This API is designed to provide a common mechanism for highly available applications and hardware.

OpenAIS was developed by Steven Dake, then working at MontaVista Software, as a personal project. Dave Teigland and Christine Caulfield drove the Red Hat adoption of OpenAIS.

In the open source high availability world, OpenAIS provided a very robust platform for multi-node cluster membership and communication. Principally, it managed which nodes were allowed to be part of the cluster via "closed process groups" (CPGs), and it ensured that all nodes in a CPG received messages in the same order and confirmed receipt of a given message before moving on to the next, a process called "virtual synchrony". This was implemented using multicast groups, allowing for the efficient distribution of messages to many nodes in the cluster.

OpenAIS itself has no mechanism for managing services, which is why we do not consider it an HA project in its own right. It cannot start, stop, restart, migrate or relocate services on nodes. As we will see, though, it played a significant role in early open source clustering, and thus deserves mention here.
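
To make the "same order, confirmed receipt" idea concrete, here is a deliberately simplified, single-process Python sketch of totally ordered group messaging. It is a conceptual illustration only; it is not the OpenAIS/corosync CPG API and does no real networking or multicast:

  # Conceptual sketch of totally ordered, confirmed message delivery inside a
  # closed process group ("virtual synchrony" in spirit only).
  class ClosedProcessGroup:
      def __init__(self, members):
          self.members = list(members)                # only these nodes may participate
          self.next_seq = 1                           # global sequence number (total order)
          self.logs = {m: [] for m in self.members}   # what each member has delivered

      def send(self, sender, payload):
          if sender not in self.members:
              raise PermissionError(f"{sender} is not a member of this group")
          seq, self.next_seq = self.next_seq, self.next_seq + 1
          # Deliver to every member in sequence order; the group does not move
          # on to the next message until every member has confirmed receipt.
          confirmations = set()
          for member in self.members:
              self.logs[member].append((seq, sender, payload))
              confirmations.add(member)               # "receipt confirmed"
          assert confirmations == set(self.members)

  cpg = ClosedProcessGroup(["node1", "node2", "node3"])
  cpg.send("node1", "start service httpd")
  cpg.send("node2", "node3 has joined")

  # Every member ends up with the exact same log, in the exact same order:
  for member, log in cpg.logs.items():
      print(member, log)

The value of this guarantee is that every node makes decisions from an identical, identically ordered view of cluster events.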

First Contact

In 2004, a cluster summit was held in Prague, attended by both SUSE and Red Hat developers.

SUSE's Lars Marowsky-Brée presented on the CRM version 2.

Red Hat presented its cluster managers, cman and GuLM. It also presented DLM, GFS and a new independent fencing mechanism.

Red Hat decided to support most of the Linux-HA project's OCF resource agent draft version 1.0 API.

With the release of Red Hat Enterprise Linux 5, the kernel-based cman was replaced with a userspace cman built on openais.

Red Hat; 2004 to 2008

In 2005, the cluster manager's resource management was split off and the Resource Group Manager, called "rgmanager", was created. It sat on top of "cman", which continued to provide quorum, cluster communication and messaging.

Initially, Red Hat opted not to support OpenAIS, choosing instead to stick with GuLM. In 2006, however, support for OpenAIS was added. GuLM was deprecated and cman was reduced to providing membership and quorum. OpenAIS became the cluster communication layer.

In 200?, Red Hat released RHEL 5, introducing Red Hat Cluster Services version 2. At a functional level, RHCS stable 2 changed minimally from the initial release of RHCS. OpenAIS was used as the cluster communication layer and qdisk became optional.

SUSE; 2004 to 2008

In 2007, the CRM became the Pacemaker project.

Q. What notable events happened here?

Q. What was different in Heartbeat 3 and when was it released?

2008; Merge Begins

In 2008, an informal meeting was held in Prague between SUSE and Red Hat HA cluster developers.

At this meeting, it was decided to work towards a common HA cluster stack. Some details were decided quickly, others would take many years.

One of the early changes was to strip out core functions from OpenAIS and create a new project called “Corosync”, first released in 2009. OpenAIS became an optional plug-in for corosync, should a future project wish to implement the entire AIS API.

Corosync itself was simplified to just the core functions needed by open source clustering. Both Pacemaker and Red Hat's cluster stack adopted it. Red Hat Enterprise Linux 6.0 would introduce Red Hat Cluster Services version 3. RHCS became an optional RHEL Add-On called the "High-Availability Add-On". GFS2 was separated out as an optional Add-On called "Resilient Storage". Users who purchased Resilient Storage got the High-Availability Add-On as well.

In 2010, Pacemaker added support for cman, allowing it to take quorum and membership from it. Pacemaker quickly became the most popular resource manager, thanks to its good support across multiple Linux distributions and its powerful, flexible resource management. Red Hat's "rgmanager", despite being very stable, was less flexible and its popularity waned.

With the release of Red Hat Enterprise Linux 6.0, pacemaker was introduced under Red Hat's "Technology Preview" program. Between RHEL 6.0 and 6.4, Pacemaker underwent rapid development. Support was added for Red Hat's fence agents. Benefits of Heartbeat's resource agents were back-ported to Red Hat's resource agents. Pacemaker version 1.1.10, released with RHEL 6.5, added support for Red Hat-style fence methods and fence ordering.

2014; Here and Forward

Now we have a decent understanding of where we are. It's time to discuss where things are going.

Wither the Old

Red Hat builds its Red Hat Enterprise Linux releases on the "upstream" Fedora community's releases. Red Hat is not obliged to follow Fedora, but examining it is often a good indication of what the next RHEL release will look like.

The Red Hat "cluster" package, which provided "cman" and "rgmanager", came to an end in Fedora 16 at version 3.2. Corosync reached version 2 and, with that, became a quorum provider in its own right. As such, development of both "cman" and "rgmanager" ceased. DLM, still used by GFS2 and clustered LVM, was split out into a stand-alone project.

Red Hat has committed to supporting and maintaining cman and rgmanager through the life of RHEL 6, which is scheduled to continue until 2020. As such, users of Red Hat's traditional cluster stack will be supported for years to come. The writing does appear to be on the wall though, and cman and rgmanager are not expected to be included with RHEL 7.

In a similar manner, LinBit still supports the Heartbeat package. They have not announced an end to support, and they have diligently supported it for some years now. Like Red Hat's "cluster" package though, it has not been actively developed in some time and there are no plans to restart development.

A Word of Warning

All users of Red Hat's cluster suite and SUSE/LinBit's heartbeat packages should be making migration plans. Readers starting new projects should not use either of these projects.

The Future! Corosync and Pacemaker

If the predictions about RHEL 7 are correct, it will mark the end of the merger of the two stacks. From then into the foreseeable future, all open source, high-availability clusters will be based on corosync version 2 or higher and pacemaker version 1.1.10 or higher.

Much simpler!

All other components (gfs2, clvmd, resource agents, fence agents and so on) will be simple additions, chosen based on the needs of the user.

A Short Discussion on Management Tools

Thus far, this paper has focused on the core components of the two, now one, HA stacks.

There are a myriad of tools out there designed to simplify the use and management of these clusters. Trying to create an exhaustive list would require its own paper, but a few key ones are worth discussing.

Pacemaker Tools

With the release of Pacemaker 1.0, the "crm shell", called "crmsh", was introduced. It was built specifically to configure and manage pacemaker, abstracting away the complex XML found in the "cib.xml" file. It is a very mature tool that most existing pacemaker users are well familiar with.

As part of the adoption of pacemaker by Red Hat, a new tool, the "Pacemaker Configuration System" ("pcs"), was introduced with the release of RHEL 6.3. Its goal is to provide a single tool to configure pacemaker, corosync and other components of the cluster stack.

Red Hat Tools

Red Hat's "cman" and "rgmanager" based clusters can be configured directly by editing the "cluster.conf" XML configuration file. Alternatively, Red Hat offers a web-based management tool called "Luci", which can run on any machine inside or outside the cluster. It makes use of the programs "ricci" and "modclusterd" to manipulate the cluster's configuration and push changes to the other nodes in the cluster.

The 'luci' program will be replaced by pcsd, which uses a REST-like API, in RHEL version 7 and beyond.

Thanks

This document would not exist without the help of:

References

Glossary

This is not an attempt at a complete glossary of terms one will come across while learning about HA clustering. It is a short list of core terms and concepts needed to understand topics covered in this paper.

Fencing

When a node fails in an HA cluster, it enters an "unknown state". The surviving node(s) cannot simply assume that it has failed; to do so would risk a split-brain condition, even with the use of quorum. Clustered services that had been running on the lost node are likewise in an unknown state, so they cannot be safely recovered until the host node is put into a known state.

The mechanism of putting a node into a known state is called "fencing". Its fundamental role is to ensure that the lost node can no longer provide its clustered services or access clustered resources. There are two effective fencing methods; Fabric Fencing and Power Fencing.

In Fabric fencing, the lost node is isolated. This is generally done by forcibly disconnecting it from the network and/or shared storage.

In Power fencing, the lost node is forced to power off. It is crude, but effective.
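
The essential rule is ordering: a lost node's services must never be recovered until a fence action has confirmed that the node is in a known state. A minimal conceptual sketch of that logic in Python, using hypothetical function and service names rather than any real fence agent's API, might look like this:

  # Conceptual sketch of the "fence before recovery" rule. The fence_node()
  # function is a hypothetical stand-in for a real fence agent (IPMI, switched
  # PDU, fabric switch, etc.); only its ordering relative to recovery matters.
  def fence_node(node: str) -> bool:
      """Hypothetical: returns True only if the fence device confirms success."""
      print(f"fencing {node} (forced power off or fabric isolation)...")
      return True  # a real fence agent reports confirmed success or failure

  def recover_services(node: str, services: list) -> None:
      for service in services:
          print(f"restarting '{service}' (previously on {node}) on a survivor")

  def handle_node_failure(node: str, services: list) -> None:
      # The lost node is in an unknown state. Recovering its services before a
      # confirmed fence risks a split-brain, so fencing must succeed first.
      if not fence_node(node):
          print(f"fence of {node} FAILED; refusing to recover its services")
          return
      recover_services(node, services)

  handle_node_failure("node2", ["virtual-ip", "httpd"])

If the fence fails, the only safe action is to block and retry; guessing at the lost node's state is exactly what fencing exists to avoid.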

Quorum

Quorum is, at its most basic, "simple majority". It provides a mechanism that a node can use to determine if it is allowed to provide clustered resources. In its basic form, "quorum" is determined by dividing the number of nodes in a cluster by two, adding one and then rounding down. For example; a five-node cluster, divided by two, is "2.5". Adding one gives "3.5", which rounded down is "3". So in a five-node cluster, three nodes must be able to reach each other in order to be "quorate".
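
The arithmetic above can be expressed in a few lines of Python. This is a minimal sketch of the formula only, not any particular quorum provider's implementation (real providers such as cman or corosync's votequorum also handle per-node vote counts, qdisk votes and special cases):

  # Minimal sketch of the quorum arithmetic described above.
  def quorum(total_votes: int) -> int:
      """Votes needed to be quorate: (n / 2) + 1, rounded down."""
      return total_votes // 2 + 1

  for nodes in (2, 3, 5, 7):
      print(f"{nodes}-node cluster: {quorum(nodes)} votes needed to be quorate")

  # Output:
  #   2-node cluster: 2 votes needed to be quorate  (both nodes; see the two-node note below)
  #   3-node cluster: 2 votes needed to be quorate
  #   5-node cluster: 3 votes needed to be quorate
  #   7-node cluster: 4 votes needed to be quorate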

A node must be "quorate" in order to start and run clustered services. If a node is "inquorate", it will refuse to start any cluster resources and, if it was running any already, will shut them down as soon as quorum is lost.

Quorum cannot be used on basic two-node clusters; with only two nodes, the survivor of a failure holds just half of the votes, which is not a majority.

Split-Brain

In HA clustering, the most dangerous scenario is a “split-brain” condition. This is where a cluster splits into two or more “partitions” and each partition independently provides the same HA resources at the same time. Quorum helps prevent split-brain conditions, though it is not a perfect solution.

Split-brain conditions are ultimately only avoidable with proper fencing use. Should a split-brain occur, data divergence, data loss and file system corruption become likely.

STONITH

In the Linux-HA project, the term “stonith” is used to describe fencing. It is an acronym for “shoot the other node in the head”, a term that particularly references power fencing. Please see “Fencing” for more information.

 
