Split-brain
Warning: A "split-brain" condition is a potentially catastrophic event in clustering.
It is a particular risk in two-node clusters, but can occur in larger clusters if quorum is not honoured. This section uses language specific to two-node split-brain conditions, but "node" can be replaced with "partition" to describe a subsection of a larger cluster and the description remains accurate.
A split-brain is a state in which two nodes lose contact with one another and then both try to take control of shared resources or simultaneously provide clustered services.
Risk
The biggest risk in a split-brain condition is the corruption of shared storage. If both nodes try to alter storage on a shared block device, like a SAN partition or a DRBD resource, the two nodes will quickly corrupt the file system, each making changes without the other's knowledge.
Protection; Quorum
The most effective protection against a split-brain condition in clusters with three or more nodes is quorum. With quorum, a partition of nodes must hold a majority of the votes in order to win quorum and take over clustered resources. For example, in a five-node cluster where each node has one vote, quorum is achieved when a group of three or more nodes forms. If that cluster later splits into two groups, one of three nodes and the other of two, the former holds a simple majority (3 votes, which is >50% of 5) and the latter does not (only 2 votes).
In this scenario, even though the two partitions can not talk to one another, the group of three reliably knows that the other group can not win quorum and, thus, will not try to use clustered resources or provide clustered services. Knowing this, the group of three will use the clustered resources and provide the clustered services, safe in the knowledge that it is the only group doing so.
Note that an even split will not allow either side to win quorum, and the cluster will shut down. For example, a four-node cluster that splits into two groups of two leaves each group with exactly 50% of the votes, and quorum requires greater than 50%.
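The rule is simple arithmetic: a partition wins quorum only when it holds strictly more than half of the total votes. As a rough illustration (a sketch, not code from any cluster software), the example below shows the 3/2 and 2/2 splits described above:

```python
# Minimal sketch: deciding whether a partition holds quorum, given one
# vote per node. Not taken from any cluster software.

def has_quorum(partition_votes: int, total_votes: int) -> bool:
    """A partition wins quorum only with a strict majority (>50%) of all votes."""
    return partition_votes > total_votes / 2

# Five-node cluster split 3/2: only the three-node partition wins quorum.
print(has_quorum(3, 5))  # True
print(has_quorum(2, 5))  # False

# Four-node cluster split 2/2: neither side exceeds 50%, so neither wins
# quorum and the cluster shuts down rather than risk a split-brain.
print(has_quorum(2, 4))  # False
```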
Protection; Fencing
In a high-availability two-node cluster, it is not possible to use quorum. With only two votes in the cluster, a single node can never hold more than 50% of them. To get around this, clusters allow quorum to be effectively disabled. Without quorum, the last line of defense against a split-brain is fence devices.
A fence device forcibly ejects a node from the cluster, generally by powering it off. Once a fence call has succeeded, as determined by the device and its agent, the remaining node can safely take over the clustered resources and services.
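The ordering is what matters: a node must confirm that the fence succeeded before it touches shared resources. The sketch below illustrates this flow conceptually; the names fence_node and take_over_services are hypothetical stand-ins, not calls from any real fence agent:

```python
# Conceptual sketch only; fence_node() and take_over_services() are
# hypothetical stand-ins, not part of a real fence agent's API.

def handle_lost_peer(peer: str) -> None:
    """On losing contact with the peer, fence first, recover second."""
    fenced = fence_node(peer)  # e.g. power the peer off via its IPMI or PDU
    if not fenced:
        # If the fence call fails, the peer's state is unknown, so it is
        # NOT safe to take over shared storage or services.
        raise RuntimeError(f"Fencing {peer} failed; refusing to recover")
    # Only after a confirmed fence is it safe to claim clustered resources.
    take_over_services()
```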
Enhancing Quorum With qdisk
In RHCS, there is an optional technology called a "quorum disk", or qdisk. This is a special partition on shared SAN storage that allows quorum to be awarded to a partition that passes certain tests, known as heuristics. Note, however, that qdisk can not be used on DRBD resources.
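As a rough illustration of the idea behind a heuristic (a conceptual sketch only, not the qdisk implementation or its configuration syntax), a partition might only claim the quorum disk's vote if it can still reach a known-good point on the network, such as the default gateway. The gateway address used here is an assumption for the example:

```python
# Conceptual sketch of a qdisk-style heuristic; the gateway address and the
# use of 'ping' are illustrative assumptions, not qdisk's actual behaviour.
import subprocess

def heuristic_passes(gateway: str = "192.168.1.1") -> bool:
    """Return True if this node can still reach the default gateway."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", gateway],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

# A partition that passes its heuristics is awarded the quorum disk's extra
# vote(s), which can break a tie in its favour.
```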