AN!Wiki :: Fence Loop
A fence loop is a condition in two-node HA clusters where each node fences the other as soon as it boots. It occurs when three things are true at once: quorum has been disabled, the cluster stack starts automatically with the OS, and the network connection between the nodes has failed.
- A two-node cluster has to have quorum disabled, because a single surviving node can never hold a majority of the votes.
- The network connection used by corosync or DRBD fails, triggering a fence action.
- Node 1 wins the initial fence and Node 2 reboots.
- Node 2, on boot, starts corosync or DRBD automatically, can't connect to Node 1 and calls a fence.
- Node 1 reboots. On boot, it starts corosync or DRBD, fails to connect to Node 2 and calls a fence.
- Node 2 reboots. On boot, it starts corosync or DRBD, fails to connect to Node 1 and calls a fence.
This loop continues until the network connection is repaired.
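The sequence above can be sketched as a small simulation. This is a toy model, not cluster code: the function `simulate_fence_loop` and its parameter `link_repaired_after` are hypothetical names made up for illustration. It assumes, as in the steps above, that Node 1 wins the first fence and that every rebooted node starts its cluster stack on boot.

```python
def simulate_fence_loop(link_repaired_after):
    """Toy model of a two-node fence loop.

    link_repaired_after: how many fence actions happen before the
    network link is repaired (a hypothetical knob for this sketch).
    Returns the (fencer, victim) pairs in the order they occur.
    """
    events = []
    fencer, victim = "node1", "node2"  # Node 1 wins the initial fence
    while len(events) < link_repaired_after:
        events.append((fencer, victim))
        # The victim reboots, starts its cluster stack, cannot reach
        # its peer, and so becomes the next node to call a fence.
        fencer, victim = victim, fencer
    return events


if __name__ == "__main__":
    for fencer, victim in simulate_fence_loop(4):
        print(f"{fencer} fences {victim}")
```

The roles simply swap after every fence, which is why nothing inside the cluster ends the loop. In this model only the repair of the link, represented by the loop bound, stops it.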
There are three ways to mitigate this:
- Use 3 or more nodes so that quorum can be enabled.
- Set fence actions to off instead of reboot, thus preventing a fenced node from booting.
- Disable the cluster stack from starting on boot.
Of these, the third option is usually the best. If a node was fenced in a production cluster, something likely went wrong. Having the node boot without joining the cluster lets you log into it and examine what happened. Once you've determined that the node is healthy, rejoin it to the cluster. If a node has a recurring problem, allowing it to rejoin the cluster automatically could mean that it gets fenced repeatedly. This is, on the surface, safe. However, fence actions can be disruptive and are never totally risk-free.
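To see why each of the three mitigations breaks the loop, here is a minimal sketch in Python. It is a simplified model and not how any real cluster stack decides to fence. The function names and parameters are hypothetical, and it assumes that quorum is enabled whenever there are three or more nodes and that a node without quorum is not allowed to fence.

```python
def quorum_votes(total_nodes):
    """Votes needed for a strict majority of total_nodes."""
    return total_nodes // 2 + 1


def fences_after_boot(total_nodes, fence_action, stack_on_boot):
    """Return True if a freshly fenced node can go on to fence a peer.

    Assumes the link is still down, so the node comes back isolated
    and sees only its own single vote.
    """
    if fence_action == "off":
        return False  # the node stays powered off and never boots
    if not stack_on_boot:
        return False  # the node boots, but the cluster stack stays down
    quorum_enabled = total_nodes >= 3  # assumption of this model
    if quorum_enabled and 1 < quorum_votes(total_nodes):
        return False  # isolated node is inquorate and may not fence
    return True


if __name__ == "__main__":
    cases = {
        "two nodes, reboot, stack on boot": (2, "reboot", True),
        "three nodes, quorum enabled": (3, "reboot", True),
        "fence action set to off": (2, "off", True),
        "stack disabled on boot": (2, "reboot", False),
    }
    for label, args in cases.items():
        print(label, "->", fences_after_boot(*args))
```

Only the first case returns True, which is the fence loop itself. Note that `quorum_votes(2)` is 2, so a two-node cluster that loses either node can never be quorate. That is why quorum has to be disabled there in the first place.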
|Us: Alteeve's Niche!||Support: Mailing List||IRC: #clusterlabs on Freenode||© Alteeve's Niche! Inc. 1997-2019|