<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-GB">
	<id>https://alteeve.com/w/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Facade</id>
	<title>Alteeve Wiki - User contributions [en-gb]</title>
	<link rel="self" type="application/atom+xml" href="https://alteeve.com/w/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Facade"/>
	<link rel="alternate" type="text/html" href="https://alteeve.com/w/Special:Contributions/Facade"/>
	<updated>2026-05-10T06:46:04Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.42.3</generator>
	<entry>
		<id>https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2119</id>
		<title>Abandoned - Two Node Fedora 13 Cluster</title>
		<link rel="alternate" type="text/html" href="https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2119"/>
		<updated>2010-09-09T22:52:24Z</updated>

		<summary type="html">&lt;p&gt;Facade: /* Thanks */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{howto_header}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Notice&#039;&#039;&#039;&#039;&#039;, &#039;&#039;&#039;Sep. 08, 2010&#039;&#039;&#039; - &#039;&#039;Draft 1&#039;&#039;: This is the first draft of this HowTo. It has not been peer-reviewed yet, so please proceed with caution and on a test cluster. Any and all feedback is much appreciated.&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
&lt;br /&gt;
This paper has one goal:&lt;br /&gt;
&lt;br /&gt;
# How to assemble the simplest cluster possible, a &#039;&#039;&#039;2-Node Cluster&#039;&#039;&#039;, which you can then expand on for your own needs.&lt;br /&gt;
&lt;br /&gt;
With this completed, you can then jump into &amp;quot;step 2&amp;quot; papers that will show various uses of a two node cluster:&lt;br /&gt;
&lt;br /&gt;
# How to create a &amp;quot;floating&amp;quot; virtual machine that can move between the two nodes in the event of a node failure, maximizing up time.&lt;br /&gt;
# How to create simple resources that can move between nodes. Examples will be a simple PostgreSQL database, DHCP, DNS and web servers.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
&lt;br /&gt;
It is expected that you are already comfortable with the Linux command line, specifically &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[bash]]&amp;lt;/span&amp;gt;, and that you are familiar with general administrative tasks in Red Hat-based distributions, specifically [[Fedora]]. You will also need to be comfortable using editors like [[vim]], [[nano]] or similar. This paper uses &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;vim&amp;lt;/span&amp;gt; in examples. Simply substitute your favourite editor in its place.&lt;br /&gt;
&lt;br /&gt;
You are also expected to be comfortable with networking concepts. You will be expected to understand TCP/IP, [[multicast]], broadcast, subnets and netmasks, routing and other relatively basic networking concepts. Please take the time to become familiar with these concepts before proceeding.&lt;br /&gt;
&lt;br /&gt;
This said, where feasible, as much detail as is possible will be provided. For example, all configuration file locations will be shown and functioning sample files will be provided.&lt;br /&gt;
&lt;br /&gt;
== Platform ==&lt;br /&gt;
&lt;br /&gt;
This paper will implement the [[Red Hat]] Cluster Suite using the Fedora v13 distribution. This paper uses the [[x86_64]] repositories, however, if you are on an [[i386]] (32 bit) system, you should be able to follow along fine. Simply replace &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;x86_64&amp;lt;/span&amp;gt; with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i386&amp;lt;/span&amp;gt; or &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i686&amp;lt;/span&amp;gt; in package names. &lt;br /&gt;
&lt;br /&gt;
You can either download the stock Fedora 13 DVD ISO (currently at version &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5.4&amp;lt;/span&amp;gt;, which is used in this document), or you can try out the alpha [[#AN!Cluster Install|AN!Cluster Install]] DVD (4.3 GB ISO). If you use the latter, please test it out on a development or test cluster first. If you have any problems with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;AN!Cluster&amp;lt;/span&amp;gt; variant Fedora distro, please [[Digimer|contact me]] and let me know what your trouble was.&lt;br /&gt;
&lt;br /&gt;
== Why Fedora 13? ==&lt;br /&gt;
&lt;br /&gt;
Generally speaking, I much prefer to use a server-oriented Linux distribution like [[CentOS]], [[Debian]] or similar. However, there have been many recent changes in the Linux-Clustering world that have made all of the currently available server-class distributions obsolete. With luck, once [[Red Hat]] Enterprise Linux and [[CentOS]] version 6 are released, this will change.&lt;br /&gt;
&lt;br /&gt;
Until then, [[Fedora]] version 13 provides the most up-to-date binary releases of the new clustering stack. For this reason, Fedora 13 is the best choice if you want a current cluster. To mitigate some of the issues introduced by using a workstation distribution, many packages will be stripped out of the default install.&lt;br /&gt;
&lt;br /&gt;
== Focus ==&lt;br /&gt;
&lt;br /&gt;
Clusters can serve to solve three problems: &#039;&#039;&#039;Reliability&#039;&#039;&#039;, &#039;&#039;&#039;Performance&#039;&#039;&#039; and &#039;&#039;&#039;Scalability&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
The focus of the cluster described in this paper is primarily &#039;&#039;&#039;reliability&#039;&#039;&#039;. Second to this, &#039;&#039;&#039;scalability&#039;&#039;&#039; is the priority, leaving &#039;&#039;&#039;performance&#039;&#039;&#039; to be addressed only when it does not impact the first two criteria. This is not to say that performance is not a valid priority; it simply isn&#039;t the priority of this paper.&lt;br /&gt;
&lt;br /&gt;
== Goal ==&lt;br /&gt;
&lt;br /&gt;
At the end of this paper, you should have a fully functioning two-node cluster capable of hosting &amp;quot;floating&amp;quot; resources. That is, resources that exist on one node and can be easily moved to the other node with minimal effort and downtime. This should leave you with a solid foundation for adding more virtual servers, up to the limit of your cluster&#039;s resources.&lt;br /&gt;
&lt;br /&gt;
This paper also introduces the main issues that come with clustering, and in doing so should serve as a foundation for building cluster configurations beyond its scope.&lt;br /&gt;
&lt;br /&gt;
= Begin = &lt;br /&gt;
&lt;br /&gt;
Let&#039;s begin!&lt;br /&gt;
&lt;br /&gt;
== Hardware ==&lt;br /&gt;
&lt;br /&gt;
We will need two physical servers each with the following hardware:&lt;br /&gt;
* One or more multi-core [[CPU]]s with Virtualization support.&lt;br /&gt;
* Three network cards; at least one should be gigabit or faster.&lt;br /&gt;
* One or more hard drives.&lt;br /&gt;
* You will need some form of a [[fence|fence device]]. This can be an [[IPMI]]-enabled server, a [http://nodeassassin.org Node Assassin], a fenceable [http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900&amp;amp;tab=features PDU] or similar.&lt;br /&gt;
&lt;br /&gt;
This paper uses the following hardware:&lt;br /&gt;
* ASUS [http://support.asus.com/search/search.aspx?keyword=m4a78l-m&amp;amp;SLanguage=en-us M4A78L-M]&lt;br /&gt;
* AMD Athlon II x2 250 &lt;br /&gt;
* 2GB Kingston DDR2 KVR800D2N6K2/4G (4GB kit split between the two nodes)&lt;br /&gt;
* 1x Intel 82540 PCI NIC&lt;br /&gt;
* 1x D-Link DGE-560T&lt;br /&gt;
* Node Assassin&lt;br /&gt;
&lt;br /&gt;
This is not an endorsement of the above hardware. I bought what was within my budget and would serve the purposes of creating this document. What you purchase shouldn&#039;t matter, so long as the minimum requirements are met.&lt;br /&gt;
&lt;br /&gt;
== Pre-Assembly ==&lt;br /&gt;
&lt;br /&gt;
With multiple NICs, the mapping of physical devices to logical &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; names may not be what you expect. This is a particular issue if you decide to [[Setting Up a PXE Server in Fedora|network boot]] your install media. In that case, if the wrong NIC is chosen for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;, you will be presented with a list of MAC addresses to choose from before setup can proceed.&lt;br /&gt;
&lt;br /&gt;
Before you assemble your servers, record their network cards&#039; [[MAC]] addresses. I like to keep simple text files like these:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node01.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:EA	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:53	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:46:E4	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node02.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:D8	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:5A	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:45:78	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Feel free to record the information in any way that suits you the best.&lt;br /&gt;
&lt;br /&gt;
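If you are unsure of a card&#039;s MAC address, one way to gather them, assuming the OS can already see the cards, is to query the interfaces directly. This is only a sketch; interface names and output format will vary with your hardware:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Show every interface, keeping only the lines with hardware (MAC) addresses.&lt;br /&gt;
ifconfig -a | grep HWaddr&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;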
== OS Install ==&lt;br /&gt;
&lt;br /&gt;
Start with a stock Fedora 13 install. This How-To uses Fedora 13 x86_64; however, it should be fairly easy to adapt to other recent Fedora versions. This document also attempts to be easily ported to [[CentOS]]/[[RHEL]] version 6 once it is released. It will not adapt well to CentOS/RHEL version 5, though, as much of the cluster stack has changed dramatically since that release.&lt;br /&gt;
&lt;br /&gt;
These are the sample kickstart scripts used by this paper. Be sure to set your own password string and network settings.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Warning&#039;&#039;&#039;&#039;&#039;! These kickstart scripts &#039;&#039;&#039;&#039;&#039;will erase your hard drive&#039;&#039;&#039;&#039;&#039;! Adapt them, don&#039;t blindly use them.&lt;br /&gt;
&lt;br /&gt;
Generic cluster node kickstart scripts.&lt;br /&gt;
* [[Fedora13 KS an-node01.ks|an-node01.ks]]&lt;br /&gt;
* [[Fedora13 KS an-node02.ks|an-node02.ks]]&lt;br /&gt;
&lt;br /&gt;
== AN!Cluster Install ==&lt;br /&gt;
&lt;br /&gt;
If you are feeling brave, below is a link to a custom install DVD that contains the kickstart scripts to setup nodes and an &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-cluster&amp;lt;/span&amp;gt; directory with all the configuration files.&lt;br /&gt;
&lt;br /&gt;
* Download the custom &#039;&#039;&#039;AN!Cluster v0.2.001&#039;&#039;&#039; Install DVD (4.5 [[GiB]] ISO). (Currently disabled - Reworking for F13)&lt;br /&gt;
&lt;br /&gt;
== Post OS Install ==&lt;br /&gt;
&lt;br /&gt;
Once the OS is installed, we need to do a few things:&lt;br /&gt;
&lt;br /&gt;
# Disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;.&lt;br /&gt;
# Set up networking.&lt;br /&gt;
# Change the default run-level.&lt;br /&gt;
&lt;br /&gt;
=== Disable &#039;selinux&#039; ===&lt;br /&gt;
&lt;br /&gt;
To allow this paper to focus on clustering, we will disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;. Obviously, this introduces security issues that you may not be comfortable with. If you wish to leave &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt; enabled, it will be up to you to sort out issues that crop up.&lt;br /&gt;
&lt;br /&gt;
To disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/selinux/config&amp;lt;/span&amp;gt; and change &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=enforcing&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=permissive&amp;lt;/span&amp;gt;. You will need to reboot in order for the changes to take effect.&lt;br /&gt;
&lt;br /&gt;
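The edit described above can also be made non-interactively. A sketch using &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;sed&amp;lt;/span&amp;gt;, which keeps a backup copy of the original file:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Switch enforcing to permissive in place; the original is saved as config.bak.&lt;br /&gt;
sed -i.bak &#039;s/^SELINUX=enforcing$/SELINUX=permissive/&#039; /etc/selinux/config&lt;br /&gt;
# Confirm the change before rebooting.&lt;br /&gt;
grep &#039;^SELINUX=&#039; /etc/selinux/config&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;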
=== Change the Default Run-Level ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. It only improves performance; it is not a required step.&lt;br /&gt;
&lt;br /&gt;
If you don&#039;t plan to work on your nodes directly, it makes sense to switch the default run level from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;3&amp;lt;/span&amp;gt;. This prevents the desktop environment, like Gnome or KDE, from starting at boot, which frees up a fair bit of memory and system resources and reduces the possible attack vectors.&lt;br /&gt;
&lt;br /&gt;
To do this, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/inittab&amp;lt;/span&amp;gt;, change the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:5:initdefault:&amp;lt;/span&amp;gt; line to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:3:initdefault:&amp;lt;/span&amp;gt; and then switch to run level 3:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/inittab&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
id:3:initdefault:&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
init 3&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Setup Networking ===&lt;br /&gt;
&lt;br /&gt;
We need to remove &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt;, enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt;, configure the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ifcfg-eth*&amp;lt;/span&amp;gt; files and then start the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; daemon.&lt;br /&gt;
&lt;br /&gt;
==== Network Layout ====&lt;br /&gt;
&lt;br /&gt;
This setup expects you to have three physical network cards connected to three independent networks. To have a common vernacular, let&#039;s use this table to describe them:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!class=&amp;quot;cell_all&amp;quot;|Network Description&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Short Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Device Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Suggested Subnet&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|NIC Properties&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Internet-Facing Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Remaining NIC should be used here.&amp;lt;br /&amp;gt;If using a [[Setting Up a PXE Server in Fedora|PXE server]], this should be a bootable NIC.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Storage Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth1&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Fastest NIC should be used here.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Back-Channel Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth2&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|NICs with [[IPMI]] piggy-back &#039;&#039;&#039;must&#039;&#039;&#039; be used here.&amp;lt;br /&amp;gt;Second-fastest NIC should be used here.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Take note of these concerns when planning which NIC to use on each subnet. These issues are presented in the order that they must be addressed in:&lt;br /&gt;
# If your nodes have [[IPMI]] piggy-backing on a normal NIC, that NIC &#039;&#039;&#039;&#039;&#039;must&#039;&#039;&#039;&#039;&#039; be used on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. Having your fence device accessible on a subnet that can be remotely accessed can pose a &#039;&#039;major&#039;&#039; security risk.&lt;br /&gt;
# The fastest NIC should be used for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt; subnet. Be sure to know which NICs support the largest jumbo frames when considering this.&lt;br /&gt;
# If you still have two NICs to choose from, use the fastest remaining NIC for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. This will minimize the time it takes to perform tasks like hot-migration of live virtual machines.&lt;br /&gt;
# The final NIC should be used for the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt; subnet.&lt;br /&gt;
&lt;br /&gt;
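When deciding this ordering, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethtool&amp;lt;/span&amp;gt; can report each card&#039;s negotiated link speed. A sketch; substitute each of your interfaces in turn:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Report the current link speed of eth0.&lt;br /&gt;
ethtool eth0 | grep Speed&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;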
==== Node IP Addresses ====&lt;br /&gt;
&lt;br /&gt;
Obviously, the IP addresses you give to your nodes should be ones that suit you best. In this example, the following IP addresses are used:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;amp;nbsp;&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Internet-Facing Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Storage Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Back-Channel Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node01&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node02&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==== Remove NetworkManager ====&lt;br /&gt;
&lt;br /&gt;
Some cluster software &#039;&#039;&#039;will not&#039;&#039;&#039; start with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; installed. This is because &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is designed to be a highly-adaptive network system that can accommodate frequent changes in the network. To simplify these network transitions for end-users, a lot of reconfiguration of the network is done behind the scenes.&lt;br /&gt;
&lt;br /&gt;
For workstations, this is wonderful. For clustering, this can be disastrous. Transient network outages are already a risk to a cluster&#039;s stability!&lt;br /&gt;
&lt;br /&gt;
So first up, make sure that &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is completely removed from your system. If you used the [[#OS Install|kickstart]] scripts, then it was not installed. Otherwise, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum remove NetworkManager NetworkManager-gnome NetworkManager-openconnect NetworkManager-openvpn NetworkManager-pptp NetworkManager-vpnc cnetworkmanager knetworkmanager knetworkmanager-openvpn knetworkmanager-pptp knetworkmanager-vpnc libproxy-networkmanager yum-NetworkManager-dispatcher&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup &#039;network&#039; ====&lt;br /&gt;
&lt;br /&gt;
Before proceeding with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; configuration, check that your network cards are mapped to the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; device names you planned for. If they need to be adjusted, please follow this How-To before proceeding:&lt;br /&gt;
&lt;br /&gt;
* [[Changing the ethX to Ethernet Device Mapping in Fedora]]&lt;br /&gt;
&lt;br /&gt;
There are a few ways to configure &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; in Fedora:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network&amp;lt;/span&amp;gt; (graphical)&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network-tui&amp;lt;/span&amp;gt; (ncurses)&lt;br /&gt;
* Directly editing the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/sysconfig/network-scripts/ifcfg-eth*&amp;lt;/span&amp;gt; files. (See: [http://docs.fedoraproject.org/en-US/Fedora/12/html/Deployment_Guide/s1-networkscripts-interfaces.html here] for a full list of options)&lt;br /&gt;
&lt;br /&gt;
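For reference, a minimal static configuration for an-node01&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt; interface might look like the sketch below. The gateway address is an assumption for this example; adapt all values to your own network:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/sysconfig/network-scripts/ifcfg-eth0&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
DEVICE=eth0&lt;br /&gt;
BOOTPROTO=static&lt;br /&gt;
ONBOOT=yes&lt;br /&gt;
IPADDR=192.168.1.71&lt;br /&gt;
NETMASK=255.255.255.0&lt;br /&gt;
GATEWAY=192.168.1.1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;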
Do not proceed until your node&#039;s networking is fully configured.&lt;br /&gt;
&lt;br /&gt;
==== Update the Hosts File ====&lt;br /&gt;
&lt;br /&gt;
Some applications expect to be able to call nodes by their name. To accommodate this, and to ensure that inter-node communication takes place on the back-channel subnet, we remove any existing hostname entries and then add the following to the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Any pre-existing entries matching the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; must be removed from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt;. There is a good chance there will be an entry that resolves to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;127.0.0.1&amp;lt;/span&amp;gt; which would cause problems later.&lt;br /&gt;
&lt;br /&gt;
Obviously, adapt the names and IPs to match your nodes and subnets. The only critical thing is to make sure that the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; is resolvable to the back-channel subnet. I like to add a short-form name for convenience.&lt;br /&gt;
&lt;br /&gt;
The updated &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file should look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/hosts&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4&lt;br /&gt;
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6&lt;br /&gt;
&lt;br /&gt;
# Back-channel IPs to name mapping.&lt;br /&gt;
10.0.1.71	an-node01 an-node01.alteeve.com&lt;br /&gt;
10.0.1.72	an-node02 an-node02.alteeve.com&lt;br /&gt;
&lt;br /&gt;
# Storage network&lt;br /&gt;
10.0.0.71	an-node01.sn&lt;br /&gt;
10.0.0.72	an-node02.sn&lt;br /&gt;
&lt;br /&gt;
# Internet facing&lt;br /&gt;
192.168.1.71	an-node01.inet&lt;br /&gt;
192.168.1.72	an-node02.inet&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To test this, ping both nodes by name and make sure the ping packets are sent on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt; subnet:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node01&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node01 (10.0.1.71) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=1 ttl=64 time=0.049 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=2 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=3 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=4 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=5 ttl=64 time=0.055 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node02 (10.0.1.72) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=1 ttl=64 time=0.221 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=2 ttl=64 time=0.188 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=3 ttl=64 time=0.217 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=4 ttl=64 time=0.192 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=5 ttl=64 time=0.163 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Disable Firewalls ====&lt;br /&gt;
&lt;br /&gt;
Be sure to flush netfilter tables and disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ip6tables&amp;lt;/span&amp;gt; from starting on our nodes.&lt;br /&gt;
&lt;br /&gt;
There will be enough potential sources of problems as it is. Disabling firewalls at this stage will minimize the chance of an errant &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; rule messing up our configuration. If, before launch, you wish to implement a firewall, feel free to do so, but be sure to thoroughly test your cluster to ensure no problems were introduced.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig --level 2345 iptables off&lt;br /&gt;
/etc/init.d/iptables stop&lt;br /&gt;
chkconfig --level 2345 ip6tables off&lt;br /&gt;
/etc/init.d/ip6tables stop&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup SSH Shared Keys ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. Setting up shared keys will allow your nodes to pass files between one another and execute commands remotely without needing to enter a password. This is obviously somewhat risky from a security point of view. As such, it is up to you whether you do this or not. This is not meant to be a security-focused How-To, so please independently study the risks.&lt;br /&gt;
&lt;br /&gt;
If you&#039;re a little new to [[SSH]], it can be a bit confusing keeping connections straight in your head. When you connect to a remote machine, you start the connection on your machine as the user you are logged in as. This is the source user. When you call the remote machine, you tell it what user you want to log in as. This is the remote user.&lt;br /&gt;
&lt;br /&gt;
You will need to create an SSH key for each source user, and then you will need to copy the newly generated public key to each remote machine&#039;s user directory that you want to connect to. In this example, we want to connect to either node, from either node, as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. So we will create a key for each node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user and then copy the generated public key to the &#039;&#039;other&#039;&#039; node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user&#039;s directory.&lt;br /&gt;
&lt;br /&gt;
For each user, on each machine you want to connect &#039;&#039;&#039;from&#039;&#039;&#039;, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh-keygen -t dsa -N &amp;quot;&amp;quot; -f ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
Generating public/private dsa key pair.&lt;br /&gt;
Your identification has been saved in /root/.ssh/id_dsa.&lt;br /&gt;
Your public key has been saved in /root/.ssh/id_dsa.pub.&lt;br /&gt;
The key fingerprint is:&lt;br /&gt;
c4:cc:10:0a:6b:60:24:e7:57:94:69:e5:10:b2:cb:1f root@an-node01.alteeve.com&lt;br /&gt;
The key&#039;s randomart image is:&lt;br /&gt;
+--[ DSA 1024]----+&lt;br /&gt;
|+oo ..B*.        |&lt;br /&gt;
|o+ o =+B         |&lt;br /&gt;
|  + +.  *        |&lt;br /&gt;
| . o . .         |&lt;br /&gt;
|    o E S        |&lt;br /&gt;
|     . .         |&lt;br /&gt;
|      .          |&lt;br /&gt;
|                 |&lt;br /&gt;
|                 |&lt;br /&gt;
+-----------------+&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will create two files: the private key called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa&amp;lt;/span&amp;gt; and the public key called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa.pub&amp;lt;/span&amp;gt;. The private key &#039;&#039;&#039;&#039;&#039;must never&#039;&#039;&#039;&#039;&#039; be group or world readable! That is, it should be set to mode &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;0600&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The two files should look like:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Private key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
-----BEGIN DSA PRIVATE KEY-----&lt;br /&gt;
MIIBugIBAAKBgQCVo5nzeC/W8HWaFkD4pAmWO7ovf7S01f8HgocKKOYX/nMNxRui&lt;br /&gt;
wExBUUUt1b2TxYYQklTd9dn8UAzZ3UohN0CJNDrp//t2wfyACKKc2q+ewao5/svQ&lt;br /&gt;
xXTBu2ZPebKPcr1w9p0hq0xSr77Rg3v6Auc6AnC79DM9XkhYkVgNfgrvMwIVAKdY&lt;br /&gt;
SgTycvqNEWsJrCTTn5485yWvAoGANQ3rzIxFy2zHWyMlrpA9r5jllgxfaJVx3/Iw&lt;br /&gt;
DrEF82lr2dOmG8oYiKAhfvdgNyV7VeM2yxdjcdLPJAHHCx5BZZ7V9KjMzY26wS2r&lt;br /&gt;
d58TZJoWmO0nscmcwIbCv2at3fiMpxgr+nzD4tKxFOdWxYBwdmUpIjtJoW8wzY2H&lt;br /&gt;
uNghjL0CgYA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDN&lt;br /&gt;
bbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EE&lt;br /&gt;
D3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5AIUFX97yIig&lt;br /&gt;
n2SUltN+10NWUy4oYsA=&lt;br /&gt;
-----END DSA PRIVATE KEY-----&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Public key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa.pub&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Copy the public key and then &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; normally into the remote machine as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create a file called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; and paste in the key.&lt;br /&gt;
&lt;br /&gt;
From &#039;&#039;&#039;an-node01&#039;&#039;&#039;, type:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh root@an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
The authenticity of host &#039;an-node02 (10.0.1.72)&#039; can&#039;t be established.&lt;br /&gt;
RSA key fingerprint is d4:1b:68:5f:fa:ef:0f:0c:16:e7:f9:0a:8d:69:3e:c5.&lt;br /&gt;
Are you sure you want to continue connecting (yes/no)? yes&lt;br /&gt;
Warning: Permanently added &#039;an-node02,10.0.1.72&#039; (RSA) to the list of known hosts.&lt;br /&gt;
root@an-node02&#039;s password: &lt;br /&gt;
Last login: Tue Jul  6 22:28:19 2010 from 192.168.1.102&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will now be logged into &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02&amp;lt;/span&amp;gt; as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; file and paste into it the public key from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01&amp;lt;/span&amp;gt;. If the remote machine&#039;s user hasn&#039;t used &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; yet, their &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh&amp;lt;/span&amp;gt; directory will not exist.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/authorized_keys&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now log out and then log back into the remote machine. This time, the connection should succeed without prompting for a password!&lt;br /&gt;
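The copy-and-paste steps above amount to a few filesystem operations on the remote machine, sketched below in a scratch directory (on a real node the directory would be the remote root user's home, and the key string would be the real contents of &lt;span class=&quot;code&quot;&gt;id_dsa.pub&lt;/span&gt;; the key shown here is a placeholder). Note that &lt;span class=&quot;code&quot;&gt;sshd&lt;/span&gt; normally refuses keys when these modes are looser:

```shell
# Simulate the remote home directory in a scratch location; on a real node
# this would be /root. The key string below is a PLACEHOLDER, not a real key.
home=$(mktemp -d)
pubkey="ssh-dss AAAA...placeholder... root@an-node01.alteeve.com"

mkdir -p "$home/.ssh"
chmod 700 "$home/.ssh"                          # directory: owner only
echo "$pubkey" >> "$home/.ssh/authorized_keys"  # append, don't clobber other keys
chmod 600 "$home/.ssh/authorized_keys"          # file: owner read/write only

grep -c 'root@an-node01' "$home/.ssh/authorized_keys"
```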
&lt;br /&gt;
= Initial Cluster Setup =&lt;br /&gt;
&lt;br /&gt;
Before we get into specifics, let&#039;s take a minute to talk about the major components used in our cluster.&lt;br /&gt;
&lt;br /&gt;
== Core Program Overviews ==&lt;br /&gt;
&lt;br /&gt;
These are the core programs that may be new to you that we will use to build our cluster. Before we configure them, let&#039;s take a minute to understand their roles.&lt;br /&gt;
&lt;br /&gt;
=== Cluster Manager ===&lt;br /&gt;
&lt;br /&gt;
The cluster manager, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, takes the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; configuration file and uses it to configure and start the various cluster elements. This is where nodes are defined, fence devices are configured and various cluster tolerances are set.&lt;br /&gt;
&lt;br /&gt;
=== Corosync ===&lt;br /&gt;
&lt;br /&gt;
[http://www.corosync.org Corosync] is, essentially, the clustering kernel which provides the essential services for cluster-aware applications and other cluster services. At its core is the [[totem]] protocol, which provides cluster membership and ordered messaging. On top of this, it implements a number of core clustering services like [[#cpg; Closed Process Group|closed process group messaging]], [[quorum]] and [[confdb]] (an object database).&lt;br /&gt;
&lt;br /&gt;
Its goal is to provide a substantially simpler and more flexible set of APIs to facilitate clustering in Linux. It manages which nodes are in the cluster, triggers error messages when something fails, manages cluster locking and so on. Most other clustered applications rely on corosync to know when something has happened or to announce when a cluster-related action has taken place.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;corosync_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
==== cpg; Closed Process Group ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install corosynclib-devel&lt;br /&gt;
man cpg_overview&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The closed process group is a small process layer on top of the totem protocol provided by corosync. It handles the sending and delivery of messages amongst nodes in a consistent order. It adds [[PID]]s and group names to the membership layer. Only members of the group get the messages, thus it is a &amp;quot;closed&amp;quot; group.&lt;br /&gt;
&lt;br /&gt;
=== OpenAIS ===&lt;br /&gt;
&lt;br /&gt;
OpenAIS is now an extension to Corosync that adds an open-source implementation of the [http://www.saforum.org/ Service Availability (SA) Forum&#039;s] &#039;Application Interface Specification&#039; ([[AIS]]). It is an [[API]] and policy designed to be used by applications concerned with maintaining services during faults. AIS implements the &#039;Availability Management Framework&#039; ([[AMF]]) which, in turn, provides for application fail over, cluster management ([[CLM]]), Checkpointing ([[CKPT]]), Eventing ([[EVT]]), Messaging ([[MSG]]), and Distributed Locking ([[DLOCK]]).&lt;br /&gt;
&lt;br /&gt;
It does this by implementing pluggable shared libraries that get loaded into corosync at runtime. These libraries then can make use of corosync&#039;s internal API. Both corosync and openais services provide [[IPC]] interfaces so that application programmers can make use of their behaviour.&lt;br /&gt;
&lt;br /&gt;
In short: applications can use OpenAIS to be cluster-aware. Its libraries are used by some applications, including Pacemaker. In our application, we will only be using its libraries indirectly.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
[http://www.clusterlabs.org Pacemaker] is the cluster resource manager. It can be used to trigger scripts to control non cluster-aware applications. In this way, these non cluster-aware applications can be made more highly available. It is considered a replacement for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[rgmanager]]&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Setup Core Programs ==&lt;br /&gt;
&lt;br /&gt;
We now need to edit and create the configuration files for our cluster.&lt;br /&gt;
&lt;br /&gt;
=== Setup cman ===&lt;br /&gt;
&lt;br /&gt;
Of the core programs above, this may be the only one you need to configure.&lt;br /&gt;
&lt;br /&gt;
First, install the cluster manager:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install cman&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once installed, we need to create and fill in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; file. This is the core configuration file of our cluster and is an [[XML]] formatted file.&lt;br /&gt;
&lt;br /&gt;
Please see: [[Two-Node Fedora 13 cluster.conf]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The example in that article details many of the common options you will want to use. For a more complete discussion of this file, please run &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man 5 cluster.conf&amp;lt;/span&amp;gt;. Also, please be aware that some arguments can overlap between [[#Setup Corosync|corosync]] and this file.&lt;br /&gt;
&lt;br /&gt;
Once it&#039;s set up, we need to verify it. After you&#039;ve finished editing, be sure to run this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
xmllint --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are errors, fix them. Once it is formatted properly, the contents of your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file will be returned, followed by &amp;quot;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf validates&amp;lt;/span&amp;gt;&amp;quot;. Once this is the case, you can move on.&lt;br /&gt;
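To illustrate the overall shape the validator expects, here is a bare-bones two-node skeleton written to a scratch file. This is only a sketch, not a complete configuration (fencing, for one, is missing; see the &lt;span class=&quot;code&quot;&gt;cluster.conf&lt;/span&gt; article for the full version), and the node names are this HowTo's examples:

```shell
# Write a bare-bones two-node skeleton to a scratch file. This is NOT a
# complete cluster.conf; it only illustrates the overall shape.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<?xml version="1.0"?>
<cluster name="an-cluster" config_version="1">
    <cman two_node="1" expected_votes="1"/>
    <clusternodes>
        <clusternode name="an-node01.alteeve.com" nodeid="1"/>
        <clusternode name="an-node02.alteeve.com" nodeid="2"/>
    </clusternodes>
    <fencedevices/>
</cluster>
EOF

# Quick structural sanity check; the real validation is the xmllint
# --relaxng command shown earlier.
grep -c '<clusternode ' "$conf"
```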
&lt;br /&gt;
A note: normally at this stage, you&#039;d use &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;chkconfig&amp;lt;/span&amp;gt; to enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; to start at boot. However, depending on how you use your cluster, you may need to modify the boot order. For now, let&#039;s leave it off until we&#039;ve got all the components enabled. This is particularly important if you will be using [[DRBD]], [[CLVM]] or [[Xen]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig cman off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To see if it&#039;s working though, have a terminal open to both nodes and manually start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;. I suggest having two terminal windows open on each node. In one, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;tail&amp;lt;/span&amp;gt;ing &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/var/log/messages&amp;lt;/span&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
clear; tail -f -n 0 /var/log/messages&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now, with both log files being watched, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; on both nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: You MUST execute the following on both nodes within the time limit set by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;post_join_delay&amp;lt;/span&amp;gt; (default is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;6&amp;lt;/span&amp;gt; seconds). If you wait longer than this, the first node started will [[fence]] the other node, assuming you have fencing configured properly. If fencing is &#039;&#039;&#039;not&#039;&#039;&#039; configured properly, your first node will hang.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
/etc/init.d/cman start&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
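Since both starts must land inside the &lt;span class=&quot;code&quot;&gt;post_join_delay&lt;/span&gt; window, it can help to launch them together from a third machine. A sketch, assuming the passwordless &lt;span class=&quot;code&quot;&gt;ssh&lt;/span&gt; set up earlier and this HowTo's example host names; the small wrapper function and its &lt;span class=&quot;code&quot;&gt;RUN&lt;/span&gt; override are illustrative conveniences, not part of the cluster software:

```shell
# Start cman on every listed node at (nearly) the same time, so neither
# node waits past post_join_delay for its peer. RUN defaults to ssh but
# can be overridden, which is handy for a dry run.
RUN=${RUN:-ssh}

start_cman_on_all() {
    for node in "$@"; do
        "$RUN" "root@$node" '/etc/init.d/cman start' &
    done
    wait    # return only once every start command has finished
}

# Dry run: print what would be executed instead of connecting anywhere.
RUN=echo start_cman_on_all an-node01 an-node02
```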
&lt;br /&gt;
Examine the log files to verify there were no errors. If there were, fix them. If there were none, fantastic!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Warning&#039;&#039;&#039;: This last step will forcefully reboot your nodes. Do not do this on a production cluster! If it must be done on a production server, stop all non-essential services and be sure you have a good backup, &#039;&#039;just in case&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Lastly, test that your fencing is configured and working properly. We do this by using a program called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_node&amp;lt;/span&amp;gt;. With both nodes up and running &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, call a fence against the other node. Be sure to be watching the log files.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;From&#039;&#039;&#039; &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
fence_node an-node02.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
fence an-node02.alteeve.com success&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If it worked, when the node comes back up, repeat the process from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; against &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Setup Corosync ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: In our cluster, we will not need to configure Corosync.&lt;br /&gt;
&lt;br /&gt;
When using &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, as we are here, Corosync&#039;s configuration file is ignored.&lt;br /&gt;
&lt;br /&gt;
Most of the settings available in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 corosync.conf|corosync.conf]]&amp;lt;/span&amp;gt; can be found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|cluster.conf]]&amp;lt;/span&amp;gt; file. Please see the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; article for a comprehensive list of what is and is not supported.&lt;br /&gt;
&lt;br /&gt;
If you want to configure corosync and not use cman, please see the following article:&lt;br /&gt;
&lt;br /&gt;
* [[Two-Node Fedora 13 corosync.conf]]&lt;br /&gt;
&lt;br /&gt;
=== Setup Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;To do.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
= Fencing =&lt;br /&gt;
&lt;br /&gt;
Before proceeding with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file, you must understand what fencing is, how it is used in Red Hat/CentOS clusters and why it is so important.&lt;br /&gt;
&lt;br /&gt;
* The Cluster Admin&#039;s Mantra:&lt;br /&gt;
** &#039;&#039;&#039;The only thing you don&#039;t know is what you don&#039;t know&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Just because one node loses communication with another node, it &#039;&#039;&#039;cannot&#039;&#039;&#039; assume that the silent node is dead!&lt;br /&gt;
&lt;br /&gt;
== What is it? ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Fencing&amp;quot; is the act of isolating a malfunctioning node. The goal is to prevent a &#039;&#039;&#039;split-brain&#039;&#039;&#039; condition, where two nodes each think the other is dead and continue to use a shared resource. When this happens, file system corruption is almost guaranteed. Another dangerous scenario: one node pauses while writing to a disk, the other node decides it is dead and starts to replay the journal, then the first node recovers and completes the write. The results would be equally disastrous. If you are lucky enough to not lose the shared file system, you will be faced with the task of determining what data got written to which node, merging that data and/or overwriting the node you trust the least. This &#039;best case&#039; is still pretty lousy.&lt;br /&gt;
&lt;br /&gt;
Fencing, isolating a node from altering shared disks, can be accomplished in a couple of ways:&lt;br /&gt;
&lt;br /&gt;
* Power&lt;br /&gt;
** Power fencing is where a device is used to cut the power to a malfunctioning node. This is probably the most common type.&lt;br /&gt;
* Blocking&lt;br /&gt;
** Blocking is often implemented at the network level. This type of fencing leaves the node alone, but disconnects it from the storage network. Often this is done by a switch which prevents traffic coming from the fenced node.&lt;br /&gt;
&lt;br /&gt;
With power fencing, the term used is &amp;quot;STONITH&amp;quot;, literally, &#039;&#039;&#039;S&#039;&#039;&#039;hoot &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;O&#039;&#039;&#039;ther &#039;&#039;&#039;N&#039;&#039;&#039;ode &#039;&#039;&#039;I&#039;&#039;&#039;n &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;H&#039;&#039;&#039;ead. Picture it like an old west duel. If one node is dead, the other node is going to win the duel by default and the dead node will just be shot again. When both nodes are alive, however, the faster node will win and will &amp;quot;kill&amp;quot; (power off or reset) the slower node before it has a chance to fire. Once this duel is over, the surviving node can then access the shared resource confident that it is the only one working on it.&lt;br /&gt;
&lt;br /&gt;
== Misconception ==&lt;br /&gt;
&lt;br /&gt;
It is a &#039;&#039;&#039;very&#039;&#039;&#039; common mistake to ignore fencing when first starting to learn about clustering. Often people think &#039;&#039;&amp;quot;It&#039;s just for production systems, I don&#039;t need to worry about it yet because I don&#039;t care what happens to my test cluster.&amp;quot;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Wrong!&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For the most practical reason: the cluster software will block all I/O transactions when it can&#039;t guarantee that a fence operation succeeded. The result is that your cluster will essentially &amp;quot;lock up&amp;quot;. Likewise, [[cman]] and related daemons will fail if they can&#039;t find a fence agent to use.&lt;br /&gt;
&lt;br /&gt;
Secondly: testing our cluster will involve inducing errors. Without proper fencing, there is a high probability that our shared file system will be corrupted. That would force the need to start over, making your learning take a lot longer than it needs to.&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
How you implement fencing depends largely on what kind of fence device(s) you have. This is where I need help from you, dear reader. I only have access to [[Node Assassin]] fence devices on my lab&#039;s cluster. If you have other fence devices, like [[IPMI]], [[DRAC]] or so on, please let me know so that I can expand this section.&lt;br /&gt;
&lt;br /&gt;
=== Implementing Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
In Red Hat&#039;s cluster software, the fence device(s) are configured in the main &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf&amp;lt;/span&amp;gt; cluster configuration file. This configuration is then acted on by the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon. We&#039;ll cover the details of the [[cluster.conf]] file in a moment.&lt;br /&gt;
&lt;br /&gt;
When the cluster determines that a node needs to be fenced, the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon will consult the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file for information on how to access the fence device. Given this &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; snippet:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;xml&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;cluster name=&amp;quot;an-cluster&amp;quot; config_version=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
	&amp;lt;clusternodes&amp;gt;&lt;br /&gt;
		&amp;lt;clusternode name=&amp;quot;an-node02.alteeve.com&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
			&amp;lt;fence&amp;gt;&lt;br /&gt;
				&amp;lt;method name=&amp;quot;node_assassin&amp;quot;&amp;gt;&lt;br /&gt;
					&amp;lt;device name=&amp;quot;motoko&amp;quot; port=&amp;quot;02&amp;quot; action=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
				&amp;lt;/method&amp;gt;&lt;br /&gt;
			&amp;lt;/fence&amp;gt;&lt;br /&gt;
		&amp;lt;/clusternode&amp;gt;&lt;br /&gt;
	&amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
	&amp;lt;fencedevices&amp;gt;&lt;br /&gt;
		&amp;lt;fencedevice name=&amp;quot;motoko&amp;quot; agent=&amp;quot;fence_na&amp;quot; quiet=&amp;quot;true&amp;quot;&lt;br /&gt;
		ipaddr=&amp;quot;motoko.alteeve.com&amp;quot; login=&amp;quot;motoko&amp;quot; passwd=&amp;quot;secret&amp;quot;&amp;gt;&lt;br /&gt;
		&amp;lt;/fencedevice&amp;gt;&lt;br /&gt;
	&amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If the cluster manager determines that the node &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; needs to be fenced, it looks at the device &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt; in that node&#039;s first (and, in this case, only) &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fence&amp;gt;&amp;lt;/span&amp;gt; method, which here is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt;. It then looks in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fencedevices&amp;gt;&amp;lt;/span&amp;gt; section for the device with the matching &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;. From there, it gets the information needed to find and access the fence device. Once it connects to the fence device, it then passes along the options set in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;device&amp;gt;&amp;lt;/span&amp;gt; entry.&lt;br /&gt;
&lt;br /&gt;
So in this example, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; looks up the details on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; Node Assassin fence device. It calls the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt; program, called a fence agent, and passes the following arguments:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr=motoko.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login=motoko&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd=secret&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;quiet=true&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port=02&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action=off&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
How the fence agent acts on these arguments varies depending on the fence device itself. In general terms, the &#039;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt;&#039; fence agent will create a connection to the device at the IP address (or resolvable name, as in this case) specified in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr&amp;lt;/span&amp;gt; argument. Once connected, it will authenticate using the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd&amp;lt;/span&amp;gt; arguments. Once authenticated, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port&amp;lt;/span&amp;gt; to act on, which could be a power jack, a power or reset button, a network switch port and so on. Finally, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action&amp;lt;/span&amp;gt; to take.&lt;br /&gt;
&lt;br /&gt;
Once the device completes, it returns a success or failure message. If the first attempt fails, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; will try the next &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fence&amp;gt;&amp;lt;/span&amp;gt; method, if a second exists. It will keep trying fence devices in the order they are found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file until it runs out of devices. If it fails to fence the node, most daemons will &amp;quot;block&amp;quot;, that is, lock up and stop responding until the issue is resolved. The logic for this is that a locked-up cluster is better than a corrupted one.&lt;br /&gt;
&lt;br /&gt;
If any of the fence devices succeed though, the cluster will know that it is safe to proceed and will reconfigure the cluster without the defective node.&lt;br /&gt;
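The try-in-order behaviour described above can be sketched with stub functions standing in for real fence agents such as &lt;span class=&quot;code&quot;&gt;fence_na&lt;/span&gt;. Everything here is illustrative, not &lt;span class=&quot;code&quot;&gt;fenced&lt;/span&gt;'s actual code:

```shell
# Stub fence agents: the first "device" is unreachable, the second works.
# Real agents would be programs like fence_na.
fence_via_device_1() { return 1; }
fence_via_device_2() { return 0; }

# Try each method in cluster.conf order and stop at the first success.
# If every method fails, report it -- at that point real daemons block.
try_fence_methods() {
    for agent in fence_via_device_1 fence_via_device_2; do
        if "$agent"; then
            echo "fence succeeded via $agent"
            return 0
        fi
    done
    echo "all fence methods failed; blocking"
    return 1
}

try_fence_methods
```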
&lt;br /&gt;
== Fence Devices ==&lt;br /&gt;
&lt;br /&gt;
Many major [[OEM]]s have their own remote management devices that can serve as fence devices. Examples are [http://dell.ca Dell]&#039;s &#039;DRAC&#039; (Dell Remote Access Controller), [http://hp.ca HP]&#039;s &#039;iLO&#039; (Integrated Lights-Out), [http://ibm.ca IBM]&#039;s &#039;RSA&#039; (Remote Supervisor Adapter), [http://sun.ca Sun]&#039;s &#039;SSP&#039; (System Service Processor) and so on. Smaller manufacturers implement remote management via [[IPMI]], the Intelligent Platform Management Interface.&lt;br /&gt;
&lt;br /&gt;
In the above devices, fencing is implemented via a built-in or integrated device inside the server. These devices are usually accessible even when the host server is powered off or hard locked. Via these devices, the host server can be powered off, reset and powered on remotely, regardless of the state of the host server.&lt;br /&gt;
&lt;br /&gt;
Block fencing is possible when the device connecting a node to shared resources, like a fiber-channel SAN switch, provides a method of logically &amp;quot;unplugging&amp;quot; a defective node from the shared resource, leaving the node itself alone.&lt;br /&gt;
&lt;br /&gt;
=== IPMI ===&lt;br /&gt;
&lt;br /&gt;
[[IPMI]] is perhaps the most common method used for fencing nodes in a cluster. It requires having a system board with an IPMI baseboard management controller (BMC). IPMI is often the underlying technology used by many OEM-branded remote access tools. If you don&#039;t see a specific fence agent for your server&#039;s remote access application, experiment with generic IPMI tools.&lt;br /&gt;
&lt;br /&gt;
To start, we need to install the IPMI user-space software. On Fedora, this is typically provided by the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;OpenIPMI&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipmitool&amp;lt;/span&amp;gt; packages:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install OpenIPMI ipmitool&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
A cheap alternative is the [[Node Assassin]], an open-hardware, open-source fence device. It was built to allow the use of commodity system boards that lack the remote management support found on more expensive, server-class hardware.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Full Disclosure&#039;&#039;&#039;: Node Assassin was created by me, with much help from others, for this paper.&lt;br /&gt;
&lt;br /&gt;
= Clean Up =&lt;br /&gt;
&lt;br /&gt;
Some daemons may have been installed or dragged in during the setup of the cluster. Most notable is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;heartbeat&amp;lt;/span&amp;gt;, which is not compatible with [[RHCS]]. The following command will disable various daemons from starting at boot.&lt;br /&gt;
&lt;br /&gt;
It is safe to run even when some of the daemons aren&#039;t installed or have already been removed. Of course, skip the ones you actually want to use. This assumes that the nodes have been built according to this HowTo.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig heartbeat off; chkconfig iscsid off; chkconfig iptables off; chkconfig ip6tables off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= So You Have a Cluster - Now What? =&lt;br /&gt;
&lt;br /&gt;
Building the cluster infrastructure was pretty easy, wasn&#039;t it?&lt;br /&gt;
&lt;br /&gt;
But what will you &#039;&#039;&#039;&#039;&#039;do&#039;&#039;&#039;&#039;&#039; with it?&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;width: 75%; text-align: center; padding: 10px; border-spacing: 15px;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!style=&amp;quot;width: 100%; border: 1px solid #dfdfdf;&amp;quot; colspan=&amp;quot;3&amp;quot;|Choose Your Own &amp;lt;span style=&amp;quot;text-decoration: line-through; color: #7f7f7f;&amp;quot;&amp;gt;Adventure&amp;lt;/span&amp;gt; Cluster&lt;br /&gt;
|-&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|style=&amp;quot;width: 34%; border: 1px solid #dfdfdf;&amp;quot;| [[Two Node Fedora 13 Cluster - Xen-Based Virtual Machine Host on DRBD+CLVM|Xen-Based Virtual Machine Host on DRBD+CLVM]]&amp;lt;br /&amp;gt;High Availability VM Host&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Thanks =&lt;br /&gt;
&lt;br /&gt;
* A &#039;&#039;&#039;huge&#039;&#039;&#039; thanks to [http://iplink.net Interlink Connectivity]! They hire me as a contractor and have allowed me to extend these docs while working on their clusters. Development of these How-Tos would be much slower if not for them. If you need hosting or colo services, drop them a line. Their website is a bit out of date though, so disregard the prices. :)&lt;br /&gt;
* To &#039;&#039;&#039;sdake&#039;&#039;&#039; of [http://corosync.org corosync] for helping me sort out the &#039;&#039;&#039;plock&#039;&#039;&#039; component and corosync in general.&lt;br /&gt;
* To &#039;&#039;&#039;Angus Salkeld&#039;&#039;&#039; for helping me nail down the Corosync and OpenAIS differences.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013922.html HJ Lee] from the OpenAIS list for helping me understand the mechanisms controlling the Redundant Ring Protocol&#039;s failure detection types.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013925.html Steven Dake] for clarifying the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;to_x&amp;lt;/span&amp;gt; vs. &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;logoutput: x&amp;lt;/span&amp;gt; arguments in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais.conf&amp;lt;/span&amp;gt;.&lt;br /&gt;
* To [http://dk.linkedin.com/in/fabbione Fabio Massimo Di Nitto] for helping me get caught up with clustering and VMs.&lt;br /&gt;
&lt;br /&gt;
{{footer}}&lt;/div&gt;</summary>
		<author><name>Facade</name></author>
	</entry>
	<entry>
		<id>https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2118</id>
		<title>Abandoned - Two Node Fedora 13 Cluster</title>
		<link rel="alternate" type="text/html" href="https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2118"/>
		<updated>2010-09-09T22:50:35Z</updated>

		<summary type="html">&lt;p&gt;Facade: /* Fencing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{howto_header}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Notice&#039;&#039;&#039;&#039;&#039;, &#039;&#039;&#039;Sep. 08, 2010&#039;&#039;&#039; - &#039;&#039;Draft 1&#039;&#039;: This is the first draft of this HowTo. It has not been peer-reviewed yet, so please proceed with caution and on a test cluster. Any and all feedback is much appreciated.&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
&lt;br /&gt;
This paper has one goal:&lt;br /&gt;
&lt;br /&gt;
# How to assemble the simplest cluster possible, a &#039;&#039;&#039;2-Node Cluster&#039;&#039;&#039;, which you can then expand on for your own needs.&lt;br /&gt;
&lt;br /&gt;
With this completed, you can then jump into &amp;quot;step 2&amp;quot; papers that will show various uses of a two node cluster:&lt;br /&gt;
&lt;br /&gt;
# How to create a &amp;quot;floating&amp;quot; virtual machine that can move between the two nodes in the event of a node failure, maximizing up time.&lt;br /&gt;
# How to create simple resources that can move between nodes. Examples will be a simple PostgreSQL database, DHCP, DNS and web servers.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
&lt;br /&gt;
It is expected that you are already comfortable with the Linux command line, specifically &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[bash]]&amp;lt;/span&amp;gt;, and that you are familiar with general administrative tasks in Red Hat-based distributions, specifically [[Fedora]]. You will also need to be comfortable using editors like [[vim]], [[nano]] or similar. This paper uses &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;vim&amp;lt;/span&amp;gt; in examples. Simply substitute your favourite editor in its place.&lt;br /&gt;
&lt;br /&gt;
You are also expected to be comfortable with networking concepts: TCP/IP, [[multicast]], broadcast, subnets and netmasks, routing and other relatively basic topics. Please take the time to become familiar with these before proceeding.&lt;br /&gt;
&lt;br /&gt;
That said, where feasible, as much detail as possible will be provided. For example, all configuration file locations are shown and functioning sample files are provided.&lt;br /&gt;
&lt;br /&gt;
== Platform ==&lt;br /&gt;
&lt;br /&gt;
This paper will implement the [[Red Hat]] Cluster Suite using the Fedora v13 distribution. This paper uses the [[x86_64]] repositories; however, if you are on an [[i386]] (32-bit) system, you should be able to follow along fine. Simply replace &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;x86_64&amp;lt;/span&amp;gt; with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i386&amp;lt;/span&amp;gt; or &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i686&amp;lt;/span&amp;gt; in package names.&lt;br /&gt;
&lt;br /&gt;
You can either download the stock Fedora 13 DVD ISO (currently at version &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5.4&amp;lt;/span&amp;gt;, which is used in this document), or you can try out the alpha [[#AN!Cluster Install|AN!Cluster Install]] DVD (4.3GB ISO). If you use the latter, please test it out on a development or test cluster. If you have any problems with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;AN!Cluster&amp;lt;/span&amp;gt; variant Fedora distro, please [[Digimer|contact me]] and let me know what your trouble was.&lt;br /&gt;
&lt;br /&gt;
== Why Fedora 13? ==&lt;br /&gt;
&lt;br /&gt;
Generally speaking, I much prefer to use a server-oriented Linux distribution like [[CentOS]], [[Debian]] or similar. However, there have been many recent changes in the Linux-clustering world that have made all of the currently available server-class distributions obsolete. With luck, once [[Red Hat]] Enterprise Linux and [[CentOS]] version 6 are released, this will change.&lt;br /&gt;
&lt;br /&gt;
Until then, [[Fedora]] version 13 provides the most up-to-date binary releases of the new clustering stack. For this reason, Fedora 13 is the best choice if you want to be current. To mitigate some of the issues introduced by using a workstation distribution, many packages will be stripped out of the default install.&lt;br /&gt;
&lt;br /&gt;
== Focus ==&lt;br /&gt;
&lt;br /&gt;
Clusters can serve to solve three problems: &#039;&#039;&#039;Reliability&#039;&#039;&#039;, &#039;&#039;&#039;Performance&#039;&#039;&#039; and &#039;&#039;&#039;Scalability&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
The focus of the cluster described in this paper is primarily &#039;&#039;&#039;reliability&#039;&#039;&#039;. Second to this, &#039;&#039;&#039;scalability&#039;&#039;&#039; is the priority, leaving &#039;&#039;&#039;performance&#039;&#039;&#039; to be addressed only when it does not impact the first two criteria. This is not to say that performance is not a valid priority; it simply isn&#039;t the priority of this paper.&lt;br /&gt;
&lt;br /&gt;
== Goal ==&lt;br /&gt;
&lt;br /&gt;
At the end of this paper, you should have a fully functioning two-node cluster capable of hosting &amp;quot;floating&amp;quot; resources; that is, resources that exist on one node and can be easily moved to the other node with minimal effort and downtime. This should conclude with a solid foundation for adding more virtual servers, up to the limit of your cluster&#039;s resources.&lt;br /&gt;
&lt;br /&gt;
This paper should also serve as a foundation for any other cluster configuration, as its core focus is introducing the main issues that come with clustering.&lt;br /&gt;
&lt;br /&gt;
= Begin = &lt;br /&gt;
&lt;br /&gt;
Let&#039;s begin!&lt;br /&gt;
&lt;br /&gt;
== Hardware ==&lt;br /&gt;
&lt;br /&gt;
We will need two physical servers, each with the following hardware:&lt;br /&gt;
* One or more multi-core [[CPU]]s with virtualization support.&lt;br /&gt;
* Three network cards; at least one should be gigabit or faster.&lt;br /&gt;
* One or more hard drives.&lt;br /&gt;
* Some form of [[fence|fence device]]. This can be an [[IPMI]]-enabled server, a [http://nodeassassin.org Node Assassin], a fenceable [http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900&amp;amp;tab=features PDU] or similar.&lt;br /&gt;
&lt;br /&gt;
This paper uses the following hardware:&lt;br /&gt;
* ASUS [http://support.asus.com/search/search.aspx?keyword=m4a78l-m&amp;amp;SLanguage=en-us M4A78L-M]&lt;br /&gt;
* AMD Athlon II x2 250 &lt;br /&gt;
* 2GB Kingston DDR2 KVR800D2N6K2/4G (4GB kit split between the two nodes)&lt;br /&gt;
* 1x Intel 82540 PCI NIC&lt;br /&gt;
* 1x D-Link DGE-560T&lt;br /&gt;
* Node Assassin&lt;br /&gt;
&lt;br /&gt;
This is not an endorsement of the above hardware. I bought what was within my budget that would serve the purposes of creating this document. What you purchase shouldn&#039;t matter, so long as the minimum requirements are met.&lt;br /&gt;
&lt;br /&gt;
== Pre-Assembly ==&lt;br /&gt;
&lt;br /&gt;
With multiple NICs, it is quite likely that the mapping of physical devices to logical &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; devices may not be ideal. This is a particular issue if you decide to [[Setting Up a PXE Server in Fedora|network boot]] your install media. In that case, if the wrong NIC is chosen for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;, you will be presented with a list of MAC addresses to attempt setup with.&lt;br /&gt;
&lt;br /&gt;
Before you assemble your servers, record their network cards&#039; [[MAC]] addresses. I like to keep simple text files like these:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node01.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:EA	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:53	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:46:E4	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node02.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:D8	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:5A	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:45:78	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Feel free to record the information in any way that suits you the best.&lt;br /&gt;
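The tables above can be generated on a running system with a short loop over sysfs. This is a sketch that assumes a Linux kernel with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/sys&amp;lt;/span&amp;gt; mounted; where you store the output (for example, a file named an-node01.mac) is up to you.

```shell
# List each network device and its MAC address, tab-separated, in the same
# layout as the .mac files above. Device names and addresses will differ
# per machine; redirect the output into a file such as an-node01.mac.
for dev in /sys/class/net/*/; do
    name=$(basename "$dev")
    mac=$(cat "$dev/address")
    printf '%s\t%s\n' "$mac" "$name"
done
```

The loopback device &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;lo&amp;lt;/span&amp;gt; will appear in the list as well; simply leave it out of your notes.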
&lt;br /&gt;
== OS Install ==&lt;br /&gt;
&lt;br /&gt;
Start with a stock Fedora 13 install. This How-To uses Fedora 13 x86_64; however, it should be fairly easy to adapt to other recent Fedora versions. This document is also written so that it can be easily ported to [[CentOS]]/[[RHEL]] version 6 once it is released. It will not adapt well to CentOS/RHEL version 5, though; much of the cluster stack has changed dramatically since that version was initially released.&lt;br /&gt;
&lt;br /&gt;
These are the sample kickstart scripts used by this paper. Be sure to set your own password string and network settings.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Warning&#039;&#039;&#039;&#039;&#039;! These kickstart scripts &#039;&#039;&#039;&#039;&#039;will erase your hard drive&#039;&#039;&#039;&#039;&#039;! Adapt them, don&#039;t blindly use them.&lt;br /&gt;
&lt;br /&gt;
Generic cluster node kickstart scripts.&lt;br /&gt;
* [[Fedora13 KS an-node01.ks|an-node01.ks]]&lt;br /&gt;
* [[Fedora13 KS an-node02.ks|an-node02.ks]]&lt;br /&gt;
&lt;br /&gt;
== AN!Cluster Install ==&lt;br /&gt;
&lt;br /&gt;
If you are feeling brave, below is a link to a custom install DVD that contains the kickstart scripts to setup nodes and an &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-cluster&amp;lt;/span&amp;gt; directory with all the configuration files.&lt;br /&gt;
&lt;br /&gt;
* Download the custom &#039;&#039;&#039;AN!Cluster v0.2.001&#039;&#039;&#039; Install DVD. (4.5[[GiB]] iso). (Currently disabled - Reworking for F13)&lt;br /&gt;
&lt;br /&gt;
== Post OS Install ==&lt;br /&gt;
&lt;br /&gt;
Once the OS is installed, we need to do a few things:&lt;br /&gt;
&lt;br /&gt;
# Disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;.&lt;br /&gt;
# Setup networking.&lt;br /&gt;
# Change the default run-level.&lt;br /&gt;
&lt;br /&gt;
=== Disable &#039;selinux&#039; ===&lt;br /&gt;
&lt;br /&gt;
To allow this paper to focus on clustering, we will disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;. Obviously, this introduces security issues that you may not be comfortable with. If you wish to leave &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt; enabled, it will be up to you to sort out issues that crop up.&lt;br /&gt;
&lt;br /&gt;
To disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/selinux/config&amp;lt;/span&amp;gt; and change &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=enforcing&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=permissive&amp;lt;/span&amp;gt;. You will need to reboot in order for the changes to take effect.&lt;br /&gt;
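If you prefer a one-liner over opening an editor, the same change can be made with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;sed&amp;lt;/span&amp;gt;. The sketch below rehearses the edit against a scratch copy of the file so it is safe to run anywhere; on a real node, point &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;CONF&amp;lt;/span&amp;gt; at &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/selinux/config&amp;lt;/span&amp;gt; instead and reboot afterwards.

```shell
# Rehearse the SELinux config edit on a throw-away copy of the file.
CONF=$(mktemp)
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' | tee "$CONF"
# Flip enforcing to permissive in place, exactly as the manual edit would.
sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' "$CONF"
grep '^SELINUX=' "$CONF"
```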
&lt;br /&gt;
=== Change the Default Run-Level ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. It only improves performance; it is not required.&lt;br /&gt;
&lt;br /&gt;
If you don&#039;t plan to work on your nodes directly, it makes sense to switch the default run level from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;3&amp;lt;/span&amp;gt;. This prevents the window manager, like Gnome or KDE, from starting at boot. This frees up a fair bit of memory and system resources and reduces the possible attack vectors.&lt;br /&gt;
&lt;br /&gt;
To do this, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/inittab&amp;lt;/span&amp;gt;, change the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:5:initdefault:&amp;lt;/span&amp;gt; line to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:3:initdefault:&amp;lt;/span&amp;gt; and then switch to run level 3:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/inittab&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
id:3:initdefault:&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
init 3&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Setup Networking ===&lt;br /&gt;
&lt;br /&gt;
We need to remove &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt;, enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt;, configure the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ifcfg-eth*&amp;lt;/span&amp;gt; files and then start the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; daemon.&lt;br /&gt;
&lt;br /&gt;
==== Network Layout ====&lt;br /&gt;
&lt;br /&gt;
This setup expects you to have three physical network cards connected to three independent networks. To have a common vernacular, let&#039;s use this table to describe them:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!class=&amp;quot;cell_all&amp;quot;|Network Description&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Short Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Device Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Suggested Subnet&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|NIC Properties&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Internet-Facing Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Remaining NIC should be used here.&amp;lt;br /&amp;gt;If using a [[Setting Up a PXE Server in Fedora|PXE server]], this should be a bootable NIC.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Storage Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth1&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Fastest NIC should be used here.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Back-Channel Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth2&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|NICs with [[IPMI]] piggy-back &#039;&#039;&#039;must&#039;&#039;&#039; be used here.&amp;lt;br /&amp;gt;Second-fastest NIC should be used here.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Take note of these concerns when planning which NIC to use on each subnet. These issues are presented in the order in which they must be addressed:&lt;br /&gt;
# If your nodes have [[IPMI]] piggy-backing on a normal NIC, that NIC &#039;&#039;&#039;&#039;&#039;must&#039;&#039;&#039;&#039;&#039; be used on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. Having your fence device accessible on a subnet that can be remotely accessed poses a &#039;&#039;major&#039;&#039; security risk.&lt;br /&gt;
# The fastest NIC should be used for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt; subnet. Be sure to know which NICs support the largest jumbo frames when considering this.&lt;br /&gt;
# If you still have two NICs to choose from, use the fastest remaining NIC for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. This will minimize the time it takes to perform tasks like hot-migration of live virtual machines.&lt;br /&gt;
# The final NIC should be used for the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt; subnet.&lt;br /&gt;
&lt;br /&gt;
==== Node IP Addresses ====&lt;br /&gt;
&lt;br /&gt;
Obviously, the IP addresses you give to your nodes should be ones that suit you best. In this example, the following IP addresses are used:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;amp;nbsp;&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Internet-Facing Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Storage Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Back-Channel Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node01&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node02&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==== Remove NetworkManager ====&lt;br /&gt;
&lt;br /&gt;
Some cluster software &#039;&#039;&#039;will not&#039;&#039;&#039; start with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; installed. This is because &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is designed to be a highly-adaptive network system that can accommodate frequent changes in the network. To simplify these network transitions for end users, it does a lot of network reconfiguration behind the scenes.&lt;br /&gt;
&lt;br /&gt;
For workstations, this is wonderful. For clustering, this can be disastrous. Transient network outages are already a risk to a cluster&#039;s stability!&lt;br /&gt;
&lt;br /&gt;
So first up, make sure that &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is completely removed from your system. If you used the [[#OS Install|kickstart]] scripts, then it was not installed. Otherwise, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum remove NetworkManager NetworkManager-gnome NetworkManager-openconnect NetworkManager-openvpn NetworkManager-pptp NetworkManager-vpnc cnetworkmanager knetworkmanager knetworkmanager-openvpn knetworkmanager-pptp knetworkmanager-vpnc libproxy-networkmanager yum-NetworkManager-dispatcher&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup &#039;network&#039; ====&lt;br /&gt;
&lt;br /&gt;
Before proceeding with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; configuration, check to see if your network cards are aligned to the proper &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; network names. If they need to be adjusted, please follow this How-To before proceeding:&lt;br /&gt;
&lt;br /&gt;
* [[Changing the ethX to Ethernet Device Mapping in Fedora]]&lt;br /&gt;
&lt;br /&gt;
There are a few ways to configure &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; in Fedora:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network&amp;lt;/span&amp;gt; (graphical)&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network-tui&amp;lt;/span&amp;gt; (ncurses)&lt;br /&gt;
* Directly editing the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/sysconfig/network-scripts/ifcfg-eth*&amp;lt;/span&amp;gt; files. (See: [http://docs.fedoraproject.org/en-US/Fedora/12/html/Deployment_Guide/s1-networkscripts-interfaces.html here] for a full list of options)&lt;br /&gt;
&lt;br /&gt;
Do not proceed until your node&#039;s networking is fully configured.&lt;br /&gt;
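For reference, a minimal static configuration for an-node01's &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt; (the IFN device from the tables above) might look like the sketch below. The HWADDR comes from the .mac file recorded earlier; GATEWAY is a placeholder you must replace with your own router's address.

```text
# /etc/sysconfig/network-scripts/ifcfg-eth0 on an-node01 (IFN)
DEVICE=eth0
HWADDR=90:E6:BA:71:82:EA
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.1.71
NETMASK=255.255.255.0
GATEWAY=192.168.1.254
```

The SN and BCN devices follow the same pattern with their 10.0.0.71 and 10.0.1.71 addresses, minus the GATEWAY line.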
&lt;br /&gt;
==== Update the Hosts File ====&lt;br /&gt;
&lt;br /&gt;
Some applications expect to be able to call nodes by their name. To accommodate this, and to ensure that inter-node communication takes place on the back-channel subnet, we remove any existing hostname entries and then add the following to the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Any pre-existing entries matching the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; must be removed from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt;. There is a good chance there will be an entry that resolves to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;127.0.0.1&amp;lt;/span&amp;gt; which would cause problems later.&lt;br /&gt;
&lt;br /&gt;
Obviously, adapt the names and IPs to match your nodes and subnets. The only critical thing is to make sure that the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; is resolvable to the back-channel subnet. I like to add a short-form name for convenience.&lt;br /&gt;
&lt;br /&gt;
The updated &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file should look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/hosts&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4&lt;br /&gt;
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6&lt;br /&gt;
&lt;br /&gt;
# Back-channel IPs to name mapping.&lt;br /&gt;
10.0.1.71	an-node01 an-node01.alteeve.com&lt;br /&gt;
10.0.1.72	an-node02 an-node02.alteeve.com&lt;br /&gt;
&lt;br /&gt;
# Storage network&lt;br /&gt;
10.0.0.71	an-node01.sn&lt;br /&gt;
10.0.0.72	an-node02.sn&lt;br /&gt;
&lt;br /&gt;
# Internet facing&lt;br /&gt;
192.168.1.71	an-node01.inet&lt;br /&gt;
192.168.1.72	an-node02.inet&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To test this, ping both nodes by name and make sure the ping packets are sent on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt; subnet:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node01&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node01 (10.0.1.71) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=1 ttl=64 time=0.049 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=2 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=3 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=4 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=5 ttl=64 time=0.055 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node02 (10.0.1.72) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=1 ttl=64 time=0.221 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=2 ttl=64 time=0.188 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=3 ttl=64 time=0.217 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=4 ttl=64 time=0.192 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=5 ttl=64 time=0.163 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Disable Firewalls ====&lt;br /&gt;
&lt;br /&gt;
Be sure to flush the netfilter tables and prevent &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ip6tables&amp;lt;/span&amp;gt; from starting on our nodes.&lt;br /&gt;
&lt;br /&gt;
There will be enough potential sources of problems as it is. Disabling the firewalls at this stage minimizes the chance of an errant &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; rule breaking our configuration. If, before launch, you wish to implement a firewall, feel free to do so, but be sure to thoroughly test your cluster to ensure no problems were introduced.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig --level 2345 iptables off&lt;br /&gt;
/etc/init.d/iptables stop&lt;br /&gt;
chkconfig --level 2345 ip6tables off&lt;br /&gt;
/etc/init.d/ip6tables stop&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup SSH Shared Keys ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. Setting up shared keys will allow your nodes to pass files between one another and execute commands remotely without needing to enter a password. This is obviously somewhat risky from a security point of view. As such, it is up to you whether you do this or not. This is not meant to be a security-focused How-To, so please independently study the risks.&lt;br /&gt;
&lt;br /&gt;
If you&#039;re a little new to [[SSH]], it can be a bit confusing keeping connections straight in your head. When you connect to a remote machine, you start the connection on your machine as the user you are logged in as; this is the source user. When you call the remote machine, you tell it what user you want to log in as; this is the remote user.&lt;br /&gt;
&lt;br /&gt;
You will need to create an SSH key for each source user, and then you will need to copy the newly generated public key to each remote machine&#039;s user directory that you want to connect to. In this example, we want to connect to either node, from either node, as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. So we will create a key for each node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user and then copy the generated public key to the &#039;&#039;other&#039;&#039; node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user&#039;s directory.&lt;br /&gt;
&lt;br /&gt;
For each user, on each machine you want to connect &#039;&#039;&#039;from&#039;&#039;&#039;, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh-keygen -t dsa -N &amp;quot;&amp;quot; -f ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
Generating public/private dsa key pair.&lt;br /&gt;
Your identification has been saved in /root/.ssh/id_dsa.&lt;br /&gt;
Your public key has been saved in /root/.ssh/id_dsa.pub.&lt;br /&gt;
The key fingerprint is:&lt;br /&gt;
c4:cc:10:0a:6b:60:24:e7:57:94:69:e5:10:b2:cb:1f root@an-node01.alteeve.com&lt;br /&gt;
The key&#039;s randomart image is:&lt;br /&gt;
+--[ DSA 1024]----+&lt;br /&gt;
|+oo ..B*.        |&lt;br /&gt;
|o+ o =+B         |&lt;br /&gt;
|  + +.  *        |&lt;br /&gt;
| . o . .         |&lt;br /&gt;
|    o E S        |&lt;br /&gt;
|     . .         |&lt;br /&gt;
|      .          |&lt;br /&gt;
|                 |&lt;br /&gt;
|                 |&lt;br /&gt;
+-----------------+&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will create two files: the private key called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa&amp;lt;/span&amp;gt; and the public key called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa.pub&amp;lt;/span&amp;gt;. The private key &#039;&#039;&#039;&#039;&#039;must never&#039;&#039;&#039;&#039;&#039; be group or world readable! That is, it should be set to mode &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;0600&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The two files should look like:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Private key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
-----BEGIN DSA PRIVATE KEY-----&lt;br /&gt;
MIIBugIBAAKBgQCVo5nzeC/W8HWaFkD4pAmWO7ovf7S01f8HgocKKOYX/nMNxRui&lt;br /&gt;
wExBUUUt1b2TxYYQklTd9dn8UAzZ3UohN0CJNDrp//t2wfyACKKc2q+ewao5/svQ&lt;br /&gt;
xXTBu2ZPebKPcr1w9p0hq0xSr77Rg3v6Auc6AnC79DM9XkhYkVgNfgrvMwIVAKdY&lt;br /&gt;
SgTycvqNEWsJrCTTn5485yWvAoGANQ3rzIxFy2zHWyMlrpA9r5jllgxfaJVx3/Iw&lt;br /&gt;
DrEF82lr2dOmG8oYiKAhfvdgNyV7VeM2yxdjcdLPJAHHCx5BZZ7V9KjMzY26wS2r&lt;br /&gt;
d58TZJoWmO0nscmcwIbCv2at3fiMpxgr+nzD4tKxFOdWxYBwdmUpIjtJoW8wzY2H&lt;br /&gt;
uNghjL0CgYA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDN&lt;br /&gt;
bbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EE&lt;br /&gt;
D3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5AIUFX97yIig&lt;br /&gt;
n2SUltN+10NWUy4oYsA=&lt;br /&gt;
-----END DSA PRIVATE KEY-----&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Public key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa.pub&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Copy the public key and then &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; normally into the remote machine as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create a file called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; and paste in the key.&lt;br /&gt;
&lt;br /&gt;
From &#039;&#039;&#039;an-node01&#039;&#039;&#039;, type:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh root@an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
The authenticity of host &#039;an-node02 (10.0.1.72)&#039; can&#039;t be established.&lt;br /&gt;
RSA key fingerprint is d4:1b:68:5f:fa:ef:0f:0c:16:e7:f9:0a:8d:69:3e:c5.&lt;br /&gt;
Are you sure you want to continue connecting (yes/no)? yes&lt;br /&gt;
Warning: Permanently added &#039;an-node02,10.0.1.72&#039; (RSA) to the list of known hosts.&lt;br /&gt;
root@an-node02&#039;s password: &lt;br /&gt;
Last login: Tue Jul  6 22:28:19 2010 from 192.168.1.102&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will now be logged into &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02&amp;lt;/span&amp;gt; as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; file and paste into it the public key from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01&amp;lt;/span&amp;gt;. If the remote machine&#039;s user hasn&#039;t used &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; yet, their &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh&amp;lt;/span&amp;gt; directory will not exist.&lt;br /&gt;
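In that case, the directory and file can be created by hand. A minimal sketch, run as &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; on &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02&amp;lt;/span&amp;gt;; note that &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;sshd&amp;lt;/span&amp;gt; will ignore the key if these permissions are too loose, and the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/tmp/id_dsa.pub&amp;lt;/span&amp;gt; path is just an example:&lt;br /&gt;
&lt;br /&gt;
```shell
# Create ~/.ssh with the permissions sshd expects.
mkdir -p ~/.ssh
chmod 700 ~/.ssh
touch ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# Then append the public key copied over from an-node01, for example:
#   cat /tmp/id_dsa.pub >> ~/.ssh/authorized_keys
```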
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/authorized_keys&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now log out and then log back into the remote machine. This time, the connection should succeed without prompting for a password!&lt;br /&gt;
&lt;br /&gt;
= Initial Cluster Setup =&lt;br /&gt;
&lt;br /&gt;
Before we get into specifics, let&#039;s take a minute to talk about the major components used in our cluster.&lt;br /&gt;
&lt;br /&gt;
== Core Program Overviews ==&lt;br /&gt;
&lt;br /&gt;
These are the core programs that may be new to you that we will use to build our cluster. Before we configure them, let&#039;s take a minute to understand their roles.&lt;br /&gt;
&lt;br /&gt;
=== Cluster Manager ===&lt;br /&gt;
&lt;br /&gt;
The cluster manager, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, reads the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; configuration file and uses it to configure and start the various cluster elements. This is where nodes are defined, fence devices are configured and various cluster tolerances are set.&lt;br /&gt;
&lt;br /&gt;
=== Corosync ===&lt;br /&gt;
&lt;br /&gt;
[http://www.corosync.org Corosync] is, essentially, the clustering kernel which provides the essential services for cluster-aware applications and other cluster services. At its core is the [[totem]] protocol which provides cluster membership and ordered messaging. On top of this, it implements a number of core clustering services like [[#cpg; Closed Process Group|closed process group messaging]], [[quorum]] and [[confdb]] (an object database).&lt;br /&gt;
&lt;br /&gt;
Its goal is to provide a substantially simpler and more flexible set of APIs to facilitate clustering in Linux. It manages which nodes are in the cluster, triggers error messages when something fails, manages cluster locking and so on. Most other clustered applications rely on corosync to know when something has happened or to announce when a cluster-related action has taken place.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;corosync_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
==== cpg; Closed Process Group ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install corosynclib-devel&lt;br /&gt;
man cpg_overview&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The closed process group is a small process layer on top of the totem protocol provided by corosync. It handles the sending and delivery of messages amongst nodes in a consistent order. It adds [[PID]]s and group names to the membership layer. Only members of the group get the messages, thus it is a &amp;quot;closed&amp;quot; group.&lt;br /&gt;
&lt;br /&gt;
=== OpenAIS ===&lt;br /&gt;
&lt;br /&gt;
OpenAIS is now an extension to Corosync that adds an open-source implementation of the [http://www.saforum.org/ Service Availability (SA) Forum&#039;s] &#039;Application Interface Specification&#039; ([[AIS]]). It is an [[API]] and policy designed to be used by applications concerned with maintaining services during faults. AIS implements the &#039;Availability Management Framework&#039; ([[AMF]]) which, in turn, provides for application fail over, cluster management ([[CLM]]), Checkpointing ([[CKPT]]), Eventing ([[EVT]]), Messaging ([[MSG]]), and Distributed Locking ([[DLOCK]]).&lt;br /&gt;
&lt;br /&gt;
It does this by implementing pluggable shared libraries that get loaded into corosync at runtime. These libraries then can make use of corosync&#039;s internal API. Both corosync and openais services provide [[IPC]] interfaces so that application programmers can make use of their behaviour.&lt;br /&gt;
&lt;br /&gt;
In short: applications can use OpenAIS to be cluster-aware. Its libraries are used by some applications, including Pacemaker. In our case, we will only be using its libraries indirectly.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
[http://www.clusterlabs.org Pacemaker] is the cluster resource manager. It can be used to trigger scripts to control non cluster-aware applications. In this way, these non cluster-aware applications can be made more highly available. It is considered a replacement for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[rgmanager]]&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Setup Core Programs ==&lt;br /&gt;
&lt;br /&gt;
We now need to edit and create the configuration files for our cluster.&lt;br /&gt;
&lt;br /&gt;
=== Setup cman ===&lt;br /&gt;
&lt;br /&gt;
Depending on your needs, this may be the only section you need to configure.&lt;br /&gt;
&lt;br /&gt;
First, install the cluster manager:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install cman&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once installed, we need to create and fill in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; file. This is the core configuration file of our cluster and is an [[XML]] formatted file.&lt;br /&gt;
&lt;br /&gt;
Please see: [[Two-Node Fedora 13 cluster.conf]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The example in that article details many of the common options you will want to use. For a more complete discussion of this file, please run &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man 5 cluster.conf&amp;lt;/span&amp;gt;. Also, please be aware that some arguments can overlap between [[#Setup Corosync|corosync]] and this file.&lt;br /&gt;
&lt;br /&gt;
Once the file is written, it needs to be validated. When you&#039;ve finished editing, be sure to run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
xmllint --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are errors, fix them. Once it is formatted properly, the contents of your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file will be returned, followed by &amp;quot;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf validates&amp;lt;/span&amp;gt;&amp;quot;. Once this is the case, you can move on.&lt;br /&gt;
&lt;br /&gt;
A note: normally at this stage, you&#039;d use &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;chkconfig&amp;lt;/span&amp;gt; to enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; to start at boot. However, depending on how you use your cluster, you may need to modify the boot order. For now, let&#039;s leave it off until we&#039;ve got all the components configured. This is particularly important if you will be using [[DRBD]], [[CLVM]] or [[Xen]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig cman off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To check that it&#039;s working, have a terminal open to both nodes and manually start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;. I suggest having two terminal windows open on each node. In one, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;tail&amp;lt;/span&amp;gt;ing &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/var/log/messages&amp;lt;/span&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
clear; tail -f -n 0 /var/log/messages&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now, with both log files being watched, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; on both nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: You MUST execute the following on both nodes within the time limit set by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;post_join_delay&amp;lt;/span&amp;gt; (default is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;6&amp;lt;/span&amp;gt; seconds). If you wait longer than this, the first node started will [[fence]] the other node, assuming you have fencing configured properly. If fencing is &#039;&#039;&#039;not&#039;&#039;&#039; configured properly, your first node will hang.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
/etc/init.d/cman start&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
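One way to stay inside that window is to launch the start command on both nodes in parallel from one terminal and wait for both to return. The sketch below stubs the remote call with a shell function; in real use, replace its body with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh root@$1 /etc/init.d/cman start&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
```shell
# Stub standing in for the real remote start; in real use the body would be:
#   ssh "root@$1" /etc/init.d/cman start
start_cman() { echo "cman starting on $1"; }

for node in an-node01 an-node02; do
    start_cman "$node" &   # launch both starts at (nearly) the same moment
done
wait                       # block until both have returned
```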
&lt;br /&gt;
Examine the log files to verify there were no errors. If there were, fix them. If there was not, fantastic!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Warning&#039;&#039;&#039;: This last step will forcefully reboot your nodes. Do not do this on a production cluster! If it must be done on a production server, stop all non-essential services and be sure you have a good backup, &#039;&#039;just in case&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Lastly, test that your fencing is configured and working properly. We do this by using a program called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_node&amp;lt;/span&amp;gt;. With both nodes up and running &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, call a fence against the other node. Be sure to be watching the log files.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;From&#039;&#039;&#039; &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
fence_node an-node02.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
fence an-node02.alteeve.com success&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If it worked, when the node comes back up, repeat the process from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; against &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Setup Corosync ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: In our cluster, we will not need to configure Corosync.&lt;br /&gt;
&lt;br /&gt;
When using &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, as we are here, Corosync&#039;s configuration file is ignored.&lt;br /&gt;
&lt;br /&gt;
Most of the settings available in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 corosync.conf|corosync.conf]]&amp;lt;/span&amp;gt; can be found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|cluster.conf]]&amp;lt;/span&amp;gt; file. Please see the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; article for a comprehensive list of what is and is not supported.&lt;br /&gt;
&lt;br /&gt;
If you want to configure corosync and not use cman, please see the following article:&lt;br /&gt;
&lt;br /&gt;
* [[Two-Node Fedora 13 corosync.conf]]&lt;br /&gt;
&lt;br /&gt;
=== Setup Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
To do.&lt;br /&gt;
&lt;br /&gt;
= Fencing =&lt;br /&gt;
&lt;br /&gt;
Before proceeding with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file, you must understand what fencing is, how it is used in Red Hat/CentOS clusters and why it is so important.&lt;br /&gt;
&lt;br /&gt;
* The Cluster Admin&#039;s Mantra:&lt;br /&gt;
** &#039;&#039;&#039;The only thing you don&#039;t know is what you don&#039;t know&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Just because one node loses communication with another node, it &#039;&#039;&#039;cannot&#039;&#039;&#039; assume that the silent node is dead!&lt;br /&gt;
&lt;br /&gt;
== What is it? ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Fencing&amp;quot; is the act of isolating a malfunctioning node. The goal is to prevent a &#039;&#039;&#039;split-brain&#039;&#039;&#039; condition, where two nodes each think the other is dead and continue to use a shared resource. When this happens, file system corruption is almost guaranteed. Another dangerous scenario: one node pauses while writing to a disk, the other node decides it is dead and starts to replay the journal, then the first node recovers and completes the write. The results would be equally disastrous. If you are lucky enough not to lose the shared file system, you will be faced with the task of determining what data got written to which node, merging that data and/or overwriting the node you trust the least. This &#039;best case&#039; is still pretty lousy.&lt;br /&gt;
&lt;br /&gt;
Fencing, isolating a node from altering shared disks, can be accomplished in a couple of ways:&lt;br /&gt;
&lt;br /&gt;
* Power&lt;br /&gt;
** Power fencing is where a device is used to cut the power to a malfunctioning node. This is probably the most common type.&lt;br /&gt;
* Blocking&lt;br /&gt;
** Blocking is often implemented at the network level. This type of fencing leaves the node alone, but disconnects it from the storage network. Often this is done by a switch which prevents traffic coming from the fenced node.&lt;br /&gt;
&lt;br /&gt;
With power fencing, the term used is &amp;quot;STONITH&amp;quot;, literally, &#039;&#039;&#039;S&#039;&#039;&#039;hoot &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;O&#039;&#039;&#039;ther &#039;&#039;&#039;N&#039;&#039;&#039;ode &#039;&#039;&#039;I&#039;&#039;&#039;n &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;H&#039;&#039;&#039;ead. Picture it like an old west duel. If one node is dead, the other node is going to win the duel by default and the dead node will just be shot again. When both nodes are alive, however, the faster node will win and will &amp;quot;kill&amp;quot; (power off or reset) the slower node before it has a chance to fire. Once this duel is over, the surviving node can then access the shared resource confident that it is the only one working on it.&lt;br /&gt;
&lt;br /&gt;
== Misconception ==&lt;br /&gt;
&lt;br /&gt;
It is a &#039;&#039;&#039;very&#039;&#039;&#039; common mistake to ignore fencing when first starting to learn about clustering. Often people think &#039;&#039;&amp;quot;It&#039;s just for production systems, I don&#039;t need to worry about it yet because I don&#039;t care what happens to my test cluster.&amp;quot;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Wrong!&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For the most practical of reasons: the cluster software will block all I/O transactions when it can&#039;t guarantee that a fence operation succeeded. The result is that your cluster will essentially &amp;quot;lock up&amp;quot;. Likewise, [[cman]] and related daemons will fail if they can&#039;t find a fence agent to use.&lt;br /&gt;
&lt;br /&gt;
Secondly: testing our cluster will involve inducing errors. Without proper fencing, there is a high probability that our shared file system will be corrupted. That would force the need to start over, making your learning take a lot longer than it needs to.&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
How you implement fencing depends largely on what kind of fence device(s) you have. This is where I need help from you, dear reader. I only have access to [[Node Assassin]] fence devices on my lab&#039;s cluster. If you have other fence devices, like [[IPMI]], [[DRAC]] or so on, please let me know so that I can expand this section.&lt;br /&gt;
&lt;br /&gt;
=== Implementing Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In Red Hat&#039;s cluster software, the fence device(s) are configured in the main cluster configuration file, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf&amp;lt;/span&amp;gt;. This configuration is then acted on by the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon. We&#039;ll cover the details of the [[cluster.conf]] file in a moment.&lt;br /&gt;
&lt;br /&gt;
When the cluster determines that a node needs to be fenced, the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon will consult the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file for information on how to access the fence device. Given this &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; snippet:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;xml&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;cluster name=&amp;quot;an-cluster&amp;quot; config_version=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
	&amp;lt;clusternodes&amp;gt;&lt;br /&gt;
		&amp;lt;clusternode name=&amp;quot;an-node02.alteeve.com&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
			&amp;lt;fence&amp;gt;&lt;br /&gt;
				&amp;lt;method name=&amp;quot;node_assassin&amp;quot;&amp;gt;&lt;br /&gt;
					&amp;lt;device name=&amp;quot;motoko&amp;quot; port=&amp;quot;02&amp;quot; action=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
				&amp;lt;/method&amp;gt;&lt;br /&gt;
			&amp;lt;/fence&amp;gt;&lt;br /&gt;
		&amp;lt;/clusternode&amp;gt;&lt;br /&gt;
	&amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
	&amp;lt;fencedevices&amp;gt;&lt;br /&gt;
		&amp;lt;fencedevice name=&amp;quot;motoko&amp;quot; agent=&amp;quot;fence_na&amp;quot; quiet=&amp;quot;true&amp;quot;&lt;br /&gt;
		ipaddr=&amp;quot;motoko.alteeve.com&amp;quot; login=&amp;quot;motoko&amp;quot; passwd=&amp;quot;secret&amp;quot;&amp;gt;&lt;br /&gt;
		&amp;lt;/fencedevice&amp;gt;&lt;br /&gt;
	&amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If the cluster manager determines that the node &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; needs to be fenced, it looks at that node&#039;s first (and, in this case, only) &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;method&amp;gt;&amp;lt;/span&amp;gt; entry and reads the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt; of its &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;device&amp;gt;&amp;lt;/span&amp;gt;, which here is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt;. It then looks in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fencedevices&amp;gt;&amp;lt;/span&amp;gt; section for the device with the matching &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;. From there, it gets the information needed to find and access the fence device. Once connected to the fence device, it passes along the options set in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;device&amp;gt;&amp;lt;/span&amp;gt; entry.&lt;br /&gt;
&lt;br /&gt;
So in this example, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; looks up the details on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; Node Assassin fence device. It calls the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt; program, called a fence agent, and passes the following arguments:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr=motoko.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login=motoko&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd=secret&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;quiet=true&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port=2&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action=off&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
How the fence agent acts on these arguments varies depending on the fence device itself. In general terms, the &#039;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt;&#039; fence agent will create a connection to the device at the IP address (or resolvable name, as in this case) specified in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr&amp;lt;/span&amp;gt; argument. Once connected, it will authenticate using the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd&amp;lt;/span&amp;gt; arguments. Once authenticated, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port&amp;lt;/span&amp;gt; to act on, which could be a power jack, a power or reset button, a network switch port and so on. Finally, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action&amp;lt;/span&amp;gt; to take.&lt;br /&gt;
&lt;br /&gt;
Once the device completes, it returns a success or failed message. If the first attempt fails, the fence agent will try the next &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fence&amp;gt;&amp;lt;/span&amp;gt; method, if a second exists. It will keep trying fence devices in the order they are found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file until it runs out of devices. If it fails to fence the node, most daemons will &amp;quot;block&amp;quot;, that is, lock up and stop responding until the issue is resolved. The logic for this is that a locked up cluster is better than a corrupted one.&lt;br /&gt;
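Fence agents in the Red Hat cluster stack conventionally read their arguments as &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;key=value&amp;lt;/span&amp;gt; pairs on standard input, one per line, which is how &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; invokes them. The sketch below only builds that stream, using the values from the example above; to test an agent by hand, pipe the output into it (e.g. append &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;| fence_na&amp;lt;/span&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
```shell
# Build the key=value argument stream that fenced would feed the agent on stdin.
printf '%s\n' \
    "ipaddr=motoko.alteeve.com" \
    "login=motoko" \
    "passwd=secret" \
    "quiet=true" \
    "port=2" \
    "action=off"
```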
&lt;br /&gt;
If any of the fence devices succeed though, the cluster will know that it is safe to proceed and will reconfigure the cluster without the defective node.&lt;br /&gt;
&lt;br /&gt;
== Fence Devices ==&lt;br /&gt;
&lt;br /&gt;
Many major [[OEM]]s have their own remote management devices that can serve as fence devices. Examples are [http://dell.ca Dell]&#039;s &#039;DRAC&#039; (Dell Remote Access Controller), [http://hp.ca HP]&#039;s &#039;iLO&#039; (Integrated Lights-Out), [http://ibm.ca IBM]&#039;s &#039;RSA&#039; (Remote Supervisor Adapter), [http://sun.ca Sun]&#039;s &#039;SSP&#039; (System Service Processor) and so on. Smaller manufacturers implement remote management via [[IPMI]], the Intelligent Platform Management Interface.&lt;br /&gt;
&lt;br /&gt;
In the above devices, fencing is implemented via a built-in or integrated device inside the server. These devices are usually accessible even when the host server is powered off or hard locked. Via these devices, the host server can be powered off, reset and powered on remotely, regardless of its state.&lt;br /&gt;
&lt;br /&gt;
Block fencing is possible when the device connecting a node to shared resources, like a fiber-channel SAN switch, provides a method of logically &amp;quot;unplugging&amp;quot; a defective node from the shared resource, leaving the node itself alone.&lt;br /&gt;
&lt;br /&gt;
=== IPMI ===&lt;br /&gt;
&lt;br /&gt;
[[IPMI]] is perhaps the most common method used for fencing nodes in a cluster. It requires having a system board with an IPMI baseboard management controller (BMC). IPMI is often the underlying technology used by many OEM-branded remote access tools. If you don&#039;t see a specific fence agent for your server&#039;s remote access application, experiment with generic IPMI tools.&lt;br /&gt;
&lt;br /&gt;
To start, we need to install the IPMI user software (on Fedora, the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;OpenIPMI&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipmitool&amp;lt;/span&amp;gt; packages):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install OpenIPMI ipmitool&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
A cheap alternative is the [[Node Assassin]], an open-hardware, open source fence device. It was built to allow the use of commodity system boards that lacked remote management support found on more expensive, server class hardware.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Full Disclosure&#039;&#039;&#039;: Node Assassin was created by me, with much help from others, for this paper.&lt;br /&gt;
&lt;br /&gt;
= Clean Up =&lt;br /&gt;
&lt;br /&gt;
Some daemons may have been installed or dragged in during the setup of the cluster. Most notable is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;heartbeat&amp;lt;/span&amp;gt; which is not compatible with [[RHCS]]. The following command will disable various daemons from starting at boot. &lt;br /&gt;
&lt;br /&gt;
This command is safe to run even when some of the daemons aren&#039;t installed or have already been removed. Of course, skip the ones you actually want to use. This assumes that the nodes have been built according to this HowTo.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig heartbeat off; chkconfig iscsid off; chkconfig iptables off; chkconfig ip6tables off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= So You Have a Cluster - Now What? =&lt;br /&gt;
&lt;br /&gt;
Building the cluster infrastructure was pretty easy, wasn&#039;t it?&lt;br /&gt;
&lt;br /&gt;
But what will you &#039;&#039;&#039;&#039;&#039;do&#039;&#039;&#039;&#039;&#039; with it?&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;width: 75%; text-align: center; padding: 10px; border-spacing: 15px;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!style=&amp;quot;width: 100%; border: 1px solid #dfdfdf;&amp;quot; colspan=&amp;quot;3&amp;quot;|Choose Your Own &amp;lt;span style=&amp;quot;text-decoration: line-through; color: #7f7f7f;&amp;quot;&amp;gt;Adventure&amp;lt;/span&amp;gt; Cluster&lt;br /&gt;
|-&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|style=&amp;quot;width: 34%; border: 1px solid #dfdfdf;&amp;quot;| [[Two Node Fedora 13 Cluster - Xen-Based Virtual Machine Host on DRBD+CLVM|Xen-Based Virtual Machine Host on DRBD+CLVM]]&amp;lt;br /&amp;gt;High Availability VM Host&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Thanks =&lt;br /&gt;
&lt;br /&gt;
* A &#039;&#039;&#039;huge&#039;&#039;&#039; thanks to [http://iplink.net Interlink Connectivity]! They hire me as a contractor and have allowed me to extend these docs while working on their clusters. Development of these How-Tos would be much slower if not for them. If you need hosting or colo services, drop them a line. Their website is a bit out of date though, so disregard the prices. :)&lt;br /&gt;
* To &#039;&#039;&#039;sdake&#039;&#039;&#039; of [http://corosync.org corosync] for helping me sort out the &#039;&#039;&#039;plock&#039;&#039;&#039; component and corosync in general.&lt;br /&gt;
* To &#039;&#039;&#039;Angus Salkeld&#039;&#039;&#039; for helping me nail down the Corosync and OpenAIS differences.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013922.html HJ Lee] from the OpenAIS list for helping me understand the mechanisms controlling the Redundant Ring Protocol&#039;s failure detection types.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013925.html Steven Dake] for clarifying the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;to_x&amp;lt;/span&amp;gt; vs. &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;logoutput: x&amp;lt;/span&amp;gt; arguments in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais.conf&amp;lt;/span&amp;gt;.&lt;br /&gt;
* To [http://dk.linkedin.com/in/fabbione Fabio Massimo Di Nitto] for helping me get caught up with clustering and VMs.&lt;br /&gt;
&lt;br /&gt;
{{footer}}&lt;/div&gt;</summary>
		<author><name>Facade</name></author>
	</entry>
	<entry>
		<id>https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2117</id>
		<title>Abandoned - Two Node Fedora 13 Cluster</title>
		<link rel="alternate" type="text/html" href="https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2117"/>
		<updated>2010-09-09T22:42:36Z</updated>

		<summary type="html">&lt;p&gt;Facade: /* Setup cman */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{howto_header}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Notice&#039;&#039;&#039;&#039;&#039;, &#039;&#039;&#039;Sep. 08, 2010&#039;&#039;&#039; - &#039;&#039;Draft 1&#039;&#039;: This is the first draft of this HowTo. It has not been peer-reviewed yet, so please proceed with caution and on a test cluster. Any and all feedback is much appreciated.&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
&lt;br /&gt;
This paper has one goal:&lt;br /&gt;
&lt;br /&gt;
# How to assemble the simplest cluster possible, a &#039;&#039;&#039;2-Node Cluster&#039;&#039;&#039;, which you can then expand on for your own needs.&lt;br /&gt;
&lt;br /&gt;
With this completed, you can then jump into &amp;quot;step 2&amp;quot; papers that will show various uses of a two node cluster:&lt;br /&gt;
&lt;br /&gt;
# How to create a &amp;quot;floating&amp;quot; virtual machine that can move between the two nodes in the event of a node failure, maximizing up time.&lt;br /&gt;
# How to create simple resources that can move between nodes. Examples will be a simple PostgreSQL database, DHCP, DNS and web servers.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
&lt;br /&gt;
It is expected that you are already comfortable with the Linux command line, specifically &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[bash]]&amp;lt;/span&amp;gt;, and that you are familiar with general administrative tasks in Red Hat based distributions, specifically [[Fedora]]. You will also need to be comfortable using editors like [[vim]], [[nano]] or similar. This paper uses &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;vim&amp;lt;/span&amp;gt; in examples. Simply substitute your favourite editor in its place.&lt;br /&gt;
&lt;br /&gt;
You are also expected to be comfortable with networking concepts. You will be expected to understand TCP/IP, [[multicast]], broadcast, subnets and netmasks, routing and other relatively basic networking concepts. Please take the time to become familiar with these concepts before proceeding.&lt;br /&gt;
&lt;br /&gt;
This said, where feasible, as much detail as is possible will be provided. For example, all configuration file locations will be shown and functioning sample files will be provided.&lt;br /&gt;
&lt;br /&gt;
== Platform ==&lt;br /&gt;
&lt;br /&gt;
This paper will implement the [[Red Hat]] Cluster Suite using the Fedora v13 distribution. This paper uses the [[x86_64]] repositories; however, if you are on an [[i386]] (32 bit) system, you should be able to follow along fine. Simply replace &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;x86_64&amp;lt;/span&amp;gt; with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i386&amp;lt;/span&amp;gt; or &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i686&amp;lt;/span&amp;gt; in package names.&lt;br /&gt;
&lt;br /&gt;
You can either download the stock Fedora 13 DVD ISO (currently at version &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5.4&amp;lt;/span&amp;gt; which is used in this document), or you can try out the alpha [[#AN!Cluster Install|AN!Cluster Install]] DVD. (4.3GB iso). If you use the latter, please test it out on a development or test cluster. If you have any problems with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;AN!Cluster&amp;lt;/span&amp;gt; variant Fedora distro, please [[Digimer|contact me]] and let me know what your trouble was.&lt;br /&gt;
&lt;br /&gt;
== Why Fedora 13? ==&lt;br /&gt;
&lt;br /&gt;
Generally speaking, I much prefer to use a server-oriented Linux distribution like [[CentOS]], [[Debian]] or similar. However, there have been many recent changes in the Linux-Clustering world that have made all of the currently available server-class distributions obsolete. With luck, once [[Red Hat]] Enterprise Linux and [[CentOS]] version 6 are released, this will change.&lt;br /&gt;
&lt;br /&gt;
Until then, [[Fedora]] version 13 provides the most up-to-date binary releases of the new clustering stack. For this reason, FC13 is the best choice for clustering if you want to be current. To mitigate some of the issues introduced by using a workstation distribution, many packages will be stripped out of the default install.&lt;br /&gt;
&lt;br /&gt;
== Focus ==&lt;br /&gt;
&lt;br /&gt;
Clusters can serve to solve three problems: &#039;&#039;&#039;Reliability&#039;&#039;&#039;, &#039;&#039;&#039;Performance&#039;&#039;&#039; and &#039;&#039;&#039;Scalability&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
The focus of the cluster described in this paper is primarily &#039;&#039;&#039;reliability&#039;&#039;&#039;. Second to this, &#039;&#039;&#039;scalability&#039;&#039;&#039; is the priority, leaving &#039;&#039;&#039;performance&#039;&#039;&#039; to be addressed only when it does not impact the first two criteria. This is not to say that performance is not a valid priority; it simply isn&#039;t the priority of this paper.&lt;br /&gt;
&lt;br /&gt;
== Goal ==&lt;br /&gt;
&lt;br /&gt;
At the end of this paper, you should have a fully functioning two-node cluster capable of hosting &amp;quot;floating&amp;quot; resources. That is, resources that exist on one node and can be easily moved to the other node with minimal effort and down time. This gives you a solid foundation for adding more virtual servers, up to the limit of your cluster&#039;s resources.&lt;br /&gt;
&lt;br /&gt;
This paper also aims to show how to build the foundation of any other cluster configuration. Its core focus is introducing the main issues that come with clustering, so it should serve as a starting point for cluster configurations outside its scope.&lt;br /&gt;
&lt;br /&gt;
= Begin = &lt;br /&gt;
&lt;br /&gt;
Let&#039;s begin!&lt;br /&gt;
&lt;br /&gt;
== Hardware ==&lt;br /&gt;
&lt;br /&gt;
We will need two physical servers each with the following hardware:&lt;br /&gt;
* One or more multi-core [[CPU]]s with Virtualization support.&lt;br /&gt;
* Three network cards; at least one should be gigabit or faster.&lt;br /&gt;
* One or more hard drives.&lt;br /&gt;
* You will need some form of a [[fence|fence device]]. This can be an [[IPMI]]-enabled server, a [http://nodeassassin.org Node Assassin], a fenceable [http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900&amp;amp;tab=features PDU] or similar.&lt;br /&gt;
&lt;br /&gt;
This paper uses the following hardware:&lt;br /&gt;
* ASUS [http://support.asus.com/search/search.aspx?keyword=m4a78l-m&amp;amp;SLanguage=en-us M4A78L-M]&lt;br /&gt;
* AMD Athlon II x2 250 &lt;br /&gt;
* 2GB Kingston DDR2 KVR800D2N6K2/4G (4GB kit split between the two nodes)&lt;br /&gt;
* 1x Intel 82540 PCI NIC&lt;br /&gt;
* 1x D-Link DGE-560T&lt;br /&gt;
* Node Assassin&lt;br /&gt;
&lt;br /&gt;
This is not an endorsement of the above hardware. I bought what was within my budget that would serve the purposes of creating this document. What you purchase shouldn&#039;t matter, so long as the minimum requirements are met.&lt;br /&gt;
&lt;br /&gt;
== Pre-Assembly ==&lt;br /&gt;
&lt;br /&gt;
With multiple NICs, it is quite likely that the mapping of physical devices to logical &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; devices may not be ideal. This is a particular issue if you decide to [[Setting Up a PXE Server in Fedora|network boot]] your install media. In that case, if the wrong NIC is chosen for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;, you will be presented with a list of MAC addresses to attempt setup with.&lt;br /&gt;
&lt;br /&gt;
Before you assemble your servers, record their network cards&#039; [[MAC]] addresses. I like to keep simple text files like these:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node01.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:EA	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:53	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:46:E4	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node02.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:D8	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:5A	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:45:78	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Feel free to record the information in any way that suits you the best.&lt;br /&gt;
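&lt;br /&gt;
If your nodes are already running a live environment, you can also collect these addresses from the running system instead of reading them off the cards. This is a minimal sketch, assuming the classic &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;net-tools&amp;lt;/span&amp;gt; package is installed:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Print one line per interface, including its MAC (HWaddr) address.&lt;br /&gt;
ifconfig -a | grep HWaddr&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Redirecting that output to a file gives you a starting point for the per-node &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;.mac&amp;lt;/span&amp;gt; files shown above.&lt;br /&gt;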
&lt;br /&gt;
== OS Install ==&lt;br /&gt;
&lt;br /&gt;
Start with a stock Fedora 13 install. This How-To uses Fedora 13 x86_64, however it should be fairly easy to adapt to other recent Fedora versions. This document is also written so that it can be easily ported to [[CentOS]]/[[RHEL]] version 6 once it is released. It will not adapt well to CentOS/RHEL version 5, though; much of the cluster stack has changed dramatically since that version was released.&lt;br /&gt;
&lt;br /&gt;
These are the sample kickstart scripts used by this paper. Be sure to set your own password string and network settings.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Warning&#039;&#039;&#039;&#039;&#039;! These kickstart scripts &#039;&#039;&#039;&#039;&#039;will erase your hard drive&#039;&#039;&#039;&#039;&#039;! Adapt them, don&#039;t blindly use them.&lt;br /&gt;
&lt;br /&gt;
Generic cluster node kickstart scripts.&lt;br /&gt;
* [[Fedora13 KS an-node01.ks|an-node01.ks]]&lt;br /&gt;
* [[Fedora13 KS an-node02.ks|an-node02.ks]]&lt;br /&gt;
&lt;br /&gt;
== AN!Cluster Install ==&lt;br /&gt;
&lt;br /&gt;
If you are feeling brave, below is a link to a custom install DVD that contains the kickstart scripts to setup nodes and an &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-cluster&amp;lt;/span&amp;gt; directory with all the configuration files.&lt;br /&gt;
&lt;br /&gt;
* Download the custom &#039;&#039;&#039;AN!Cluster v0.2.001&#039;&#039;&#039; Install DVD. (4.5[[GiB]] iso). (Currently disabled - Reworking for F13)&lt;br /&gt;
&lt;br /&gt;
== Post OS Install ==&lt;br /&gt;
&lt;br /&gt;
Once the OS is installed, we need to do a few things:&lt;br /&gt;
&lt;br /&gt;
# Disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;&lt;br /&gt;
# Setup networking.&lt;br /&gt;
# Change the default run-level.&lt;br /&gt;
&lt;br /&gt;
=== Disable &#039;selinux&#039; ===&lt;br /&gt;
&lt;br /&gt;
To allow this paper to focus on clustering, we will disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;. Obviously, this introduces security issues that you may not be comfortable with. If you wish to leave &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt; enabled, it will be up to you to sort out issues that crop up.&lt;br /&gt;
&lt;br /&gt;
To disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/selinux/config&amp;lt;/span&amp;gt; and change &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=enforcing&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=permissive&amp;lt;/span&amp;gt;. You will need to reboot in order for the changes to take effect.&lt;br /&gt;
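&lt;br /&gt;
If you would rather make the change from the command line, here is a one-line sketch using &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;sed&amp;lt;/span&amp;gt;. The &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;setenforce&amp;lt;/span&amp;gt; call switches to permissive mode immediately, without waiting for the reboot:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Edit the config file in place so the change survives reboots.&lt;br /&gt;
sed -i &#039;s/^SELINUX=enforcing/SELINUX=permissive/&#039; /etc/selinux/config&lt;br /&gt;
# Stop enforcing for the current boot as well.&lt;br /&gt;
setenforce 0&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;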
&lt;br /&gt;
=== Change the Default Run-Level ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. It only improves performance; it is not required.&lt;br /&gt;
&lt;br /&gt;
If you don&#039;t plan to work on your nodes directly, it makes sense to switch the default run level from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;3&amp;lt;/span&amp;gt;. This prevents the window manager, like Gnome or KDE, from starting at boot. This frees up a fair bit of memory and system resources and reduces the possible attack vectors.&lt;br /&gt;
&lt;br /&gt;
To do this, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/inittab&amp;lt;/span&amp;gt;, change the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:5:initdefault:&amp;lt;/span&amp;gt; line to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:3:initdefault:&amp;lt;/span&amp;gt; and then switch to run level 3:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/inittab&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
id:3:initdefault:&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
init 3&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Setup Networking ===&lt;br /&gt;
&lt;br /&gt;
We need to remove &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt;, enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt;, configure the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ifcfg-eth*&amp;lt;/span&amp;gt; files and then start the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; daemon.&lt;br /&gt;
&lt;br /&gt;
==== Network Layout ====&lt;br /&gt;
&lt;br /&gt;
This setup expects you to have three physical network cards connected to three independent networks. To have a common vernacular, let&#039;s use this table to describe them:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!class=&amp;quot;cell_all&amp;quot;|Network Description&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Short Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Device Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Suggested Subnet&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|NIC Properties&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Internet-Facing Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Remaining NIC should be used here.&amp;lt;br /&amp;gt;If using a [[Setting Up a PXE Server in Fedora|PXE server]], this should be a bootable NIC.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Storage Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth1&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Fastest NIC should be used here.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Back-Channel Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth2&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|NICs with [[IPMI]] piggy-back &#039;&#039;&#039;must&#039;&#039;&#039; be used here.&amp;lt;br /&amp;gt;Second-fastest NIC should be used here.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Take note of these concerns when planning which NIC to use on each subnet. These issues are presented in the order in which they must be addressed:&lt;br /&gt;
# If your nodes have [[IPMI]] piggy-backing on a normal NIC, that NIC &#039;&#039;&#039;&#039;&#039;must&#039;&#039;&#039;&#039;&#039; be used on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. Having your fence device accessible on a subnet that can be remotely accessed poses a &#039;&#039;major&#039;&#039; security risk.&lt;br /&gt;
# The fastest NIC should be used for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt; subnet. Be sure to know which NICs support the largest jumbo frames when considering this.&lt;br /&gt;
# If you still have two NICs to choose from, use the fastest remaining NIC for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. This will minimize the time it takes to perform tasks like hot-migration of live virtual machines.&lt;br /&gt;
# The final NIC should be used for the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt; subnet.&lt;br /&gt;
&lt;br /&gt;
==== Node IP Addresses ====&lt;br /&gt;
&lt;br /&gt;
Obviously, the IP addresses you give to your nodes should be ones that suit you best. In this example, the following IP addresses are used:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;amp;nbsp;&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Internet-Facing Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Storage Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Back-Channel Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node01&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node02&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==== Remove NetworkManager ====&lt;br /&gt;
&lt;br /&gt;
Some cluster software &#039;&#039;&#039;will not&#039;&#039;&#039; start with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; installed. This is because it is designed to be a highly-adaptive network system that can accommodate frequent changes in the network. To simplify these network transitions for end-users, a lot of reconfiguration of the network is done behind the scenes. &lt;br /&gt;
&lt;br /&gt;
For workstations, this is wonderful. For clustering, this can be disastrous. Transient network outages are already a risk to a cluster&#039;s stability!&lt;br /&gt;
&lt;br /&gt;
So first up, make sure that &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is completely removed from your system. If you used the [[#OS Install|kickstart]] scripts, then it was not installed. Otherwise, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum remove NetworkManager NetworkManager-gnome NetworkManager-openconnect NetworkManager-openvpn NetworkManager-pptp NetworkManager-vpnc cnetworkmanager knetworkmanager knetworkmanager-openvpn knetworkmanager-pptp knetworkmanager-vpnc libproxy-networkmanager yum-NetworkManager-dispatcher&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup &#039;network&#039; ====&lt;br /&gt;
&lt;br /&gt;
Before proceeding with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; configuration, check to see if your network cards are aligned to the proper &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; network names. If they need to be adjusted, please follow this How-To before proceeding:&lt;br /&gt;
&lt;br /&gt;
* [[Changing the ethX to Ethernet Device Mapping in Fedora]]&lt;br /&gt;
&lt;br /&gt;
There are a few ways to configure &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; in Fedora:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network&amp;lt;/span&amp;gt; (graphical)&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network-tui&amp;lt;/span&amp;gt; (ncurses)&lt;br /&gt;
* Directly editing the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/sysconfig/network-scripts/ifcfg-eth*&amp;lt;/span&amp;gt; files. (See: [http://docs.fedoraproject.org/en-US/Fedora/12/html/Deployment_Guide/s1-networkscripts-interfaces.html here] for a full list of options)&lt;br /&gt;
&lt;br /&gt;
Do not proceed until your node&#039;s networking is fully configured.&lt;br /&gt;
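&lt;br /&gt;
For reference, a minimal static configuration for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt; on &#039;&#039;&#039;an-node01&#039;&#039;&#039; might look like the sketch below. The &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;HWADDR&amp;lt;/span&amp;gt; value is taken from the MAC list above, and the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;GATEWAY&amp;lt;/span&amp;gt; value is an assumption; substitute your own router&#039;s address:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/sysconfig/network-scripts/ifcfg-eth0&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
DEVICE=eth0&lt;br /&gt;
HWADDR=90:E6:BA:71:82:EA&lt;br /&gt;
BOOTPROTO=static&lt;br /&gt;
ONBOOT=yes&lt;br /&gt;
IPADDR=192.168.1.71&lt;br /&gt;
NETMASK=255.255.255.0&lt;br /&gt;
GATEWAY=192.168.1.1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Repeat for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth1&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth2&amp;lt;/span&amp;gt; using the storage and back-channel addresses from the tables above.&lt;br /&gt;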
&lt;br /&gt;
==== Update the Hosts File ====&lt;br /&gt;
&lt;br /&gt;
Some applications expect to be able to call nodes by their name. To accommodate this, and to ensure that inter-node communication takes place on the back-channel subnet, we remove any existing hostname entries and then add the following to the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Any pre-existing entries matching the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; must be removed from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt;. There is a good chance there will be an entry that resolves to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;127.0.0.1&amp;lt;/span&amp;gt; which would cause problems later.&lt;br /&gt;
&lt;br /&gt;
Obviously, adapt the names and IPs to match your nodes and subnets. The only critical thing is to make sure that the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; is resolvable to the back-channel subnet. I like to add a short-form name for convenience.&lt;br /&gt;
&lt;br /&gt;
The updated &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file should look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/hosts&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4&lt;br /&gt;
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6&lt;br /&gt;
&lt;br /&gt;
# Back-channel IPs to name mapping.&lt;br /&gt;
10.0.1.71	an-node01 an-node01.alteeve.com&lt;br /&gt;
10.0.1.72	an-node02 an-node02.alteeve.com&lt;br /&gt;
&lt;br /&gt;
# Storage network&lt;br /&gt;
10.0.0.71	an-node01.sn&lt;br /&gt;
10.0.0.72	an-node02.sn&lt;br /&gt;
&lt;br /&gt;
# Internet facing&lt;br /&gt;
192.168.1.71	an-node01.inet&lt;br /&gt;
192.168.1.72	an-node02.inet&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To test this, ping both nodes by name and make sure the ping packets are sent on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt; subnet:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node01&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node01 (10.0.1.71) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=1 ttl=64 time=0.049 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=2 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=3 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=4 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=5 ttl=64 time=0.055 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node02 (10.0.1.72) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=1 ttl=64 time=0.221 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=2 ttl=64 time=0.188 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=3 ttl=64 time=0.217 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=4 ttl=64 time=0.192 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=5 ttl=64 time=0.163 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Disable Firewalls ====&lt;br /&gt;
&lt;br /&gt;
Be sure to flush netfilter tables and disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ip6tables&amp;lt;/span&amp;gt; from starting on our nodes.&lt;br /&gt;
&lt;br /&gt;
There will be enough potential sources of problems as it is. Disabling firewalls at this stage will minimize the chance of an errant &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; rule messing up our configuration. If, before launch, you wish to implement a firewall, feel free to do so, but be sure to thoroughly test your cluster to ensure no problems were introduced.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig --level 2345 iptables off&lt;br /&gt;
/etc/init.d/iptables stop&lt;br /&gt;
chkconfig --level 2345 ip6tables off&lt;br /&gt;
/etc/init.d/ip6tables stop&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
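&lt;br /&gt;
To confirm that neither service will start on boot, you can check their run-level entries; both should report all run levels as &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;off&amp;lt;/span&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig --list iptables&lt;br /&gt;
chkconfig --list ip6tables&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;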
&lt;br /&gt;
==== Setup SSH Shared Keys ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. Setting up shared keys will allow your nodes to pass files between one another and execute commands remotely without needing to enter a password. This is obviously somewhat risky from a security point of view. As such, it is up to you whether you do this or not. This is not meant to be a security-focused How-To, so please independently study the risks.&lt;br /&gt;
&lt;br /&gt;
If you&#039;re a little new to [[SSH]], it can be a bit confusing keeping connections straight in your head. When you connect to a remote machine, you start the connection on your machine as the user you are logged in as. This is the source user. When you call the remote machine, you tell it what user you want to log in as. This is the remote user.&lt;br /&gt;
&lt;br /&gt;
You will need to create an SSH key for each source user, and then you will need to copy the newly generated public key to each remote machine&#039;s user directory that you want to connect to. In this example, we want to connect to either node, from either node, as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. So we will create a key for each node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user and then copy the generated public key to the &#039;&#039;other&#039;&#039; node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user&#039;s directory.&lt;br /&gt;
&lt;br /&gt;
For each user, on each machine you want to connect &#039;&#039;&#039;from&#039;&#039;&#039;, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh-keygen -t dsa -N &amp;quot;&amp;quot; -f ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
Generating public/private dsa key pair.&lt;br /&gt;
Your identification has been saved in /root/.ssh/id_dsa.&lt;br /&gt;
Your public key has been saved in /root/.ssh/id_dsa.pub.&lt;br /&gt;
The key fingerprint is:&lt;br /&gt;
c4:cc:10:0a:6b:60:24:e7:57:94:69:e5:10:b2:cb:1f root@an-node01.alteeve.com&lt;br /&gt;
The key&#039;s randomart image is:&lt;br /&gt;
+--[ DSA 1024]----+&lt;br /&gt;
|+oo ..B*.        |&lt;br /&gt;
|o+ o =+B         |&lt;br /&gt;
|  + +.  *        |&lt;br /&gt;
| . o . .         |&lt;br /&gt;
|    o E S        |&lt;br /&gt;
|     . .         |&lt;br /&gt;
|      .          |&lt;br /&gt;
|                 |&lt;br /&gt;
|                 |&lt;br /&gt;
+-----------------+&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will create two files: the private key called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa&amp;lt;/span&amp;gt; and the public key called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa.pub&amp;lt;/span&amp;gt;. The private key &#039;&#039;&#039;&#039;&#039;must never&#039;&#039;&#039;&#039;&#039; be group or world readable! That is, it should be set to mode &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;0600&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The two files should look like:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Private key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
-----BEGIN DSA PRIVATE KEY-----&lt;br /&gt;
MIIBugIBAAKBgQCVo5nzeC/W8HWaFkD4pAmWO7ovf7S01f8HgocKKOYX/nMNxRui&lt;br /&gt;
wExBUUUt1b2TxYYQklTd9dn8UAzZ3UohN0CJNDrp//t2wfyACKKc2q+ewao5/svQ&lt;br /&gt;
xXTBu2ZPebKPcr1w9p0hq0xSr77Rg3v6Auc6AnC79DM9XkhYkVgNfgrvMwIVAKdY&lt;br /&gt;
SgTycvqNEWsJrCTTn5485yWvAoGANQ3rzIxFy2zHWyMlrpA9r5jllgxfaJVx3/Iw&lt;br /&gt;
DrEF82lr2dOmG8oYiKAhfvdgNyV7VeM2yxdjcdLPJAHHCx5BZZ7V9KjMzY26wS2r&lt;br /&gt;
d58TZJoWmO0nscmcwIbCv2at3fiMpxgr+nzD4tKxFOdWxYBwdmUpIjtJoW8wzY2H&lt;br /&gt;
uNghjL0CgYA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDN&lt;br /&gt;
bbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EE&lt;br /&gt;
D3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5AIUFX97yIig&lt;br /&gt;
n2SUltN+10NWUy4oYsA=&lt;br /&gt;
-----END DSA PRIVATE KEY-----&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Public key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa.pub&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Copy the public key and then &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; normally into the remote machine as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create a file called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; and paste in the key.&lt;br /&gt;
&lt;br /&gt;
From &#039;&#039;&#039;an-node01&#039;&#039;&#039;, type:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh root@an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
The authenticity of host &#039;an-node02 (10.0.1.72)&#039; can&#039;t be established.&lt;br /&gt;
RSA key fingerprint is d4:1b:68:5f:fa:ef:0f:0c:16:e7:f9:0a:8d:69:3e:c5.&lt;br /&gt;
Are you sure you want to continue connecting (yes/no)? yes&lt;br /&gt;
Warning: Permanently added &#039;an-node02,10.0.1.72&#039; (RSA) to the list of known hosts.&lt;br /&gt;
root@an-node02&#039;s password: &lt;br /&gt;
Last login: Tue Jul  6 22:28:19 2010 from 192.168.1.102&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will now be logged into &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02&amp;lt;/span&amp;gt; as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; file and paste into it the public key from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01&amp;lt;/span&amp;gt;. If the remote machine&#039;s user hasn&#039;t used &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; yet, their &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh&amp;lt;/span&amp;gt; directory will not exist.&lt;br /&gt;
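If you prefer to script these manual steps, they can be sketched as a small shell function, shown here against arbitrary paths so its effect is easy to inspect. The function name and arguments are illustrative, not part of the HowTo; on the real nodes you would point it at root&#039;s &lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh&lt;/span&amp;gt; (or simply use &lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh-copy-id&lt;/span&amp;gt;, where available).

```shell
#!/bin/sh
# Sketch of the manual authorized_keys setup: create the .ssh directory
# if missing, append the public key, and set the strict permissions that
# sshd requires before it will honour the key.
install_key() {
    pubkey="$1"    # path to the id_dsa.pub copied from an-node01
    sshdir="$2"    # the remote user's ~/.ssh directory
    mkdir -p "$sshdir"
    chmod 700 "$sshdir"                         # sshd rejects group/world access
    cat "$pubkey" >> "$sshdir/authorized_keys"  # append, don't overwrite
    chmod 600 "$sshdir/authorized_keys"
}
```

For example, after copying the key to the remote node: `install_key /tmp/id_dsa.pub /root/.ssh`.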
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/authorized_keys&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now log out and then log back into the remote machine. This time, the connection should succeed without prompting for a password!&lt;br /&gt;
&lt;br /&gt;
= Initial Cluster Setup =&lt;br /&gt;
&lt;br /&gt;
Before we get into specifics, let&#039;s take a minute to talk about the major components used in our cluster.&lt;br /&gt;
&lt;br /&gt;
== Core Program Overviews ==&lt;br /&gt;
&lt;br /&gt;
These are the core programs that may be new to you that we will use to build our cluster. Before we configure them, let&#039;s take a minute to understand their roles.&lt;br /&gt;
&lt;br /&gt;
=== Cluster Manager ===&lt;br /&gt;
&lt;br /&gt;
The cluster manager, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, reads the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; configuration file and uses it to configure and start the various cluster elements. This is where nodes are defined, fence devices are configured and various cluster tolerances are set.&lt;br /&gt;
&lt;br /&gt;
=== Corosync ===&lt;br /&gt;
&lt;br /&gt;
[http://www.corosync.org Corosync] is, essentially, the clustering kernel which provides the essential services for cluster-aware applications and other cluster services. At its core is the [[totem]] protocol which provides cluster membership and ordered messaging. On top of this, it implements a number of core clustering services like [[#cpg; Closed Process Group|closed process group messaging]], [[quorum]] and [[confdb]] (an object database).&lt;br /&gt;
&lt;br /&gt;
Its goal is to provide a substantially simpler and more flexible set of APIs to facilitate clustering in Linux. It manages which nodes are in the cluster, triggers error messages when something fails, manages cluster locking and so on. Most other clustered applications rely on corosync to know when something has happened or to announce when a cluster-related action has taken place.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;corosync_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
==== cpg; Closed Process Group ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install corosynclib-devel&lt;br /&gt;
man cpg_overview&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The closed process group is a small process layer on top of the totem protocol provided by corosync. It handles the sending and delivery of messages amongst nodes in a consistent order. It adds [[PID]]s and group names to the membership layer. Only members of the group get the messages, thus it is a &amp;quot;closed&amp;quot; group.&lt;br /&gt;
&lt;br /&gt;
=== OpenAIS ===&lt;br /&gt;
&lt;br /&gt;
OpenAIS is now an extension to Corosync that adds an open-source implementation of the [http://www.saforum.org/ Service Availability (SA) Forum&#039;s] &#039;Application Interface Specification&#039; ([[AIS]]). It is an [[API]] and policy designed to be used by applications concerned with maintaining services during faults. AIS implements the &#039;Availability Management Framework&#039; ([[AMF]]) which, in turn, provides for application fail over, cluster management ([[CLM]]), Checkpointing ([[CKPT]]), Eventing ([[EVT]]), Messaging ([[MSG]]), and Distributed Locking ([[DLOCK]]).&lt;br /&gt;
&lt;br /&gt;
It does this by implementing pluggable shared libraries that get loaded into corosync at runtime. These libraries then can make use of corosync&#039;s internal API. Both corosync and openais services provide [[IPC]] interfaces so that application programmers can make use of their behaviour.&lt;br /&gt;
&lt;br /&gt;
In short: applications can use OpenAIS to be cluster-aware. Its libraries are used by some applications, including Pacemaker. In our case, we will only be using its libraries indirectly.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
[http://www.clusterlabs.org Pacemaker] is the cluster resource manager. It can be used to trigger scripts to control non cluster-aware applications. In this way, these non cluster-aware applications can be made more highly available. It is considered a replacement for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[rgmanager]]&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Setup Core Programs ==&lt;br /&gt;
&lt;br /&gt;
We now need to edit and create the configuration files for our cluster.&lt;br /&gt;
&lt;br /&gt;
=== Setup cman ===&lt;br /&gt;
&lt;br /&gt;
For many clusters, this may be the only section you need to configure.&lt;br /&gt;
&lt;br /&gt;
First, install the cluster manager:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install cman&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once installed, we need to create and fill in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; file. This is the core configuration file of our cluster and is an [[XML]] formatted file.&lt;br /&gt;
&lt;br /&gt;
Please see: [[Two-Node Fedora 13 cluster.conf]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The example in that article details many of the common options you will want to use. For a more complete discussion of this file, please run &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man 5 cluster.conf&amp;lt;/span&amp;gt;. Also, please be aware that some arguments can overlap between [[#Setup Corosync|corosync]] and this file.&lt;br /&gt;
&lt;br /&gt;
Once it&#039;s set up, we need to verify it. When you&#039;ve finished editing, be sure to run this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
xmllint --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are errors, fix them. Once it is formatted properly, the contents of your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file will be returned, followed by &amp;quot;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf validates&amp;lt;/span&amp;gt;&amp;quot;. Once this is the case, you can move on.&lt;br /&gt;
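Since &lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&lt;/span&amp;gt; must be identical on both nodes, it can help to gate any copy step on that validation check. Below is a minimal sketch of that idea; the function name &lt;span class=&amp;quot;code&amp;quot;&amp;gt;push_if_valid&lt;/span&amp;gt; and the example &lt;span class=&amp;quot;code&amp;quot;&amp;gt;scp&lt;/span&amp;gt; push are illustrative only, not part of the cluster tools.

```shell
#!/bin/sh
# Hypothetical guard (not from the cluster suite): run a validator
# command and only run the copy command if validation passed.
push_if_valid() {
    validate_cmd="$1"   # e.g. the xmllint check shown above
    copy_cmd="$2"       # e.g. an scp to the other node
    if $validate_cmd > /dev/null 2>&1; then
        $copy_cmd
    else
        echo "cluster.conf did not validate; fix it before copying" >&2
        return 1
    fi
}

# Example usage (both commands are illustrative):
# push_if_valid "xmllint --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf" \
#               "scp /etc/cluster/cluster.conf root@an-node02:/etc/cluster/"
```

This way a typo in the XML never silently propagates to the second node.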
&lt;br /&gt;
A note: normally at this stage, you&#039;d use &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;chkconfig&amp;lt;/span&amp;gt; to enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; to start at boot. However, depending on how you use your cluster, you may need to modify the boot order. For now, let&#039;s leave it off until we&#039;ve got all the components set up. This is particularly important if you will be using [[DRBD]], [[CLVM]] or [[Xen]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig cman off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To see whether it&#039;s working, though, we&#039;ll manually start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; on both nodes. I suggest having two terminal windows open on each node. In one, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;tail&amp;lt;/span&amp;gt;ing &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/var/log/messages&amp;lt;/span&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
clear; tail -f -n 0 /var/log/messages&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now, with both log files being watched, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; on both nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: You MUST execute the following on both nodes within the time limit set by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;post_join_delay&amp;lt;/span&amp;gt; (default is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;6&amp;lt;/span&amp;gt; seconds). If you wait longer than this, the first node started will [[fence]] the other node, assuming you have fencing configured properly. If fencing is &#039;&#039;&#039;not&#039;&#039;&#039; configured properly, your first node will hang.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
/etc/init.d/cman start&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
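One way to stay inside &lt;span class=&amp;quot;code&amp;quot;&amp;gt;post_join_delay&lt;/span&amp;gt; is to kick off both starts from a single terminal, running the peer&#039;s start over the passwordless &lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&lt;/span&amp;gt; set up earlier. This is a sketch only; the function name and the &lt;span class=&amp;quot;code&amp;quot;&amp;gt;PEER_CMD&lt;/span&amp;gt;/&lt;span class=&amp;quot;code&amp;quot;&amp;gt;LOCAL_CMD&lt;/span&amp;gt; override hooks are illustrative, not part of the cluster tools.

```shell
#!/bin/sh
# Start cman locally and on the peer at (nearly) the same time, so both
# nodes join within post_join_delay. The override variables exist only
# so the logic can be exercised without a real cluster.
start_both() {
    peer_cmd=${PEER_CMD:-"ssh root@an-node02 /etc/init.d/cman start"}
    local_cmd=${LOCAL_CMD:-"/etc/init.d/cman start"}
    $peer_cmd &    # peer start runs in the background
    $local_cmd     # local start runs concurrently
    wait           # reap the background job before returning
}
```

Run from &lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01&lt;/span&amp;gt;, this removes the race against the fence timer entirely.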
&lt;br /&gt;
Examine the log files to verify there were no errors. If there were, fix them. If there were none, fantastic!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Warning&#039;&#039;&#039;: This last step will forcefully reboot your nodes. Do not do this on a production cluster! If it must be done on a production server, stop all non-essential services and be sure you have a good backup, &#039;&#039;just in case&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Lastly, test that your fencing is configured and working properly. We do this by using a program called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_node&amp;lt;/span&amp;gt;. With both nodes up and running &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, call a fence against the other node. Be sure to be watching the log files.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;From&#039;&#039;&#039; &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
fence_node an-node02.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
fence an-node02.alteeve.com success&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If it worked, when the node comes back up, repeat the process from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; against &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Setup Corosync ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: In our cluster, we will not need to configure Corosync.&lt;br /&gt;
&lt;br /&gt;
When using &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, as we are here, Corosync&#039;s configuration file is ignored.&lt;br /&gt;
&lt;br /&gt;
Most of the settings available in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 corosync.conf|corosync.conf]]&amp;lt;/span&amp;gt; can be found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|cluster.conf]]&amp;lt;/span&amp;gt; file. Please see the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; article for a comprehensive list of what is and is not supported.&lt;br /&gt;
&lt;br /&gt;
If you want to configure corosync and not use cman, please see the following article:&lt;br /&gt;
&lt;br /&gt;
* [[Two-Node Fedora 13 corosync.conf]]&lt;br /&gt;
&lt;br /&gt;
=== Setup Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
To do.&lt;br /&gt;
&lt;br /&gt;
= Fencing =&lt;br /&gt;
&lt;br /&gt;
Before proceeding with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file, you must understand what fencing is, how it is used in Red Hat/CentOS clusters and why it is so important.&lt;br /&gt;
&lt;br /&gt;
* The Cluster Admin&#039;s Mantra:&lt;br /&gt;
** &#039;&#039;&#039;The only thing you don&#039;t know is what you don&#039;t know&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Just because one node loses communication with another node, it &#039;&#039;&#039;cannot&#039;&#039;&#039; be assumed that the silent node is dead!&lt;br /&gt;
&lt;br /&gt;
== What is it? ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Fencing&amp;quot; is the act of isolating a malfunctioning node. The goal is to prevent a &#039;&#039;&#039;split-brain&#039;&#039;&#039; condition, where two nodes each think the other is dead and continue to use a shared resource. When this happens, file system corruption is almost guaranteed. Another dangerous scenario: one node pauses while writing to a disk, the other node decides it is dead and starts to replay the journal, then the first node recovers and completes its write. The results would be equally disastrous. If you are lucky enough to not lose the shared file system, you will be faced with the task of determining what data got written to which node, merging that data and/or overwriting the node you trust the least. This &#039;best case&#039; is still pretty lousy.&lt;br /&gt;
&lt;br /&gt;
Fencing, isolating a node from altering shared disks, can be accomplished in a couple of ways:&lt;br /&gt;
&lt;br /&gt;
* Power&lt;br /&gt;
** Power fencing is where a device is used to cut the power to a malfunctioning node. This is probably the most common type.&lt;br /&gt;
* Blocking&lt;br /&gt;
** Blocking is often implemented at the network level. This type of fencing leaves the node alone, but disconnects it from the storage network. Often this is done by a switch which prevents traffic coming from the fenced node.&lt;br /&gt;
&lt;br /&gt;
With power fencing, the term used is &amp;quot;STONITH&amp;quot;, literally, &#039;&#039;&#039;S&#039;&#039;&#039;hoot &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;O&#039;&#039;&#039;ther &#039;&#039;&#039;N&#039;&#039;&#039;ode &#039;&#039;&#039;I&#039;&#039;&#039;n &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;H&#039;&#039;&#039;ead. Picture it like an old west duel. If one node is dead, the other node is going to win the duel by default and the dead node will just be shot again. When both nodes are alive, however, the faster node will win and will &amp;quot;kill&amp;quot; (power off or reset) the slower node before it has a chance to fire. Once this duel is over, the surviving node can then access the shared resource confident that it is the only one working on it.&lt;br /&gt;
&lt;br /&gt;
== Misconception ==&lt;br /&gt;
&lt;br /&gt;
It is a &#039;&#039;&#039;very&#039;&#039;&#039; common mistake to ignore fencing when first starting to learn about clustering. Often people think &#039;&#039;&amp;quot;It&#039;s just for production systems, I don&#039;t need to worry about it yet because I don&#039;t care what happens to my test cluster.&amp;quot;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Wrong!&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The most practical reason: the cluster software will block all I/O transactions when it can&#039;t guarantee a fence operation succeeded. The result is that your cluster will essentially &amp;quot;lock up&amp;quot;. Likewise, [[cman]] and related daemons will fail if they can&#039;t find a fence agent to use.&lt;br /&gt;
&lt;br /&gt;
Secondly, testing our cluster will involve inducing errors. Without proper fencing, there is a high probability that our shared file system will be corrupted. That would force you to start over, making your learning take a lot longer than it needs to.&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
How you implement fencing depends largely on what kind of fence device(s) you have. This is where I need help from you, dear reader. I only have access to [[Node Assassin]] fence devices on my lab&#039;s cluster. If you have other fence devices, like [[IPMI]], [[DRAC]] or so on, please let me know so that I can expand this section.&lt;br /&gt;
&lt;br /&gt;
=== Implementing Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In Red Hat&#039;s cluster software, the fence device(s) are configured in the main &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf&amp;lt;/span&amp;gt; cluster configuration file. This configuration is then acted on via the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon. We&#039;ll cover the details of the [[cluster.conf]] file in a moment.&lt;br /&gt;
&lt;br /&gt;
When the cluster determines that a node needs to be fenced, the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon will consult the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file for information on how to access the fence device. Given this &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; snippet:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;xml&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;cluster name=&amp;quot;an-cluster&amp;quot; config_version=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
	&amp;lt;clusternodes&amp;gt;&lt;br /&gt;
		&amp;lt;clusternode name=&amp;quot;an-node02.alteeve.com&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
			&amp;lt;fence&amp;gt;&lt;br /&gt;
				&amp;lt;method name=&amp;quot;node_assassin&amp;quot;&amp;gt;&lt;br /&gt;
					&amp;lt;device name=&amp;quot;motoko&amp;quot; port=&amp;quot;02&amp;quot; action=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
				&amp;lt;/method&amp;gt;&lt;br /&gt;
			&amp;lt;/fence&amp;gt;&lt;br /&gt;
		&amp;lt;/clusternode&amp;gt;&lt;br /&gt;
	&amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
	&amp;lt;fencedevices&amp;gt;&lt;br /&gt;
		&amp;lt;fencedevice name=&amp;quot;motoko&amp;quot; agent=&amp;quot;fence_na&amp;quot; quiet=&amp;quot;true&amp;quot;&lt;br /&gt;
		ipaddr=&amp;quot;motoko.alteeve.com&amp;quot; login=&amp;quot;motoko&amp;quot; passwd=&amp;quot;secret&amp;quot;&amp;gt;&lt;br /&gt;
		&amp;lt;/fencedevice&amp;gt;&lt;br /&gt;
	&amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If the cluster manager determines that the node &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; needs to be fenced, it looks at the first (and only, in this case) &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;method&amp;gt;&amp;lt;/span&amp;gt; entry and reads its &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;device&amp;gt;&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;, which is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; in this case. It then looks in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fencedevices&amp;gt;&amp;lt;/span&amp;gt; section for the device with the matching &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;. From there, it gets the information needed to find and access the fence device. Once it connects to the fence device, it then passes along the options set in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;device&amp;gt;&amp;lt;/span&amp;gt; entry.&lt;br /&gt;
&lt;br /&gt;
So in this example, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; looks up the details on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; Node Assassin fence device. It calls the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt; program, called a fence agent, and passes the following arguments:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr=motoko.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login=motoko&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd=secret&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;quiet=true&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port=2&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action=off&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
How the fence agent acts on these arguments varies depending on the fence device itself. In general terms, the &#039;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt;&#039; fence agent will create a connection to the device at the IP address (or resolvable name, as in this case) specified in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr&amp;lt;/span&amp;gt; argument. Once connected, it will authenticate using the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd&amp;lt;/span&amp;gt; arguments. Once authenticated, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port&amp;lt;/span&amp;gt; to act on, which could be a power jack, a power or reset button, a network switch port and so on. Finally, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action&amp;lt;/span&amp;gt; to take.&lt;br /&gt;
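Concretely, &lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&lt;/span&amp;gt; hands these arguments to the agent as &lt;span class=&amp;quot;code&amp;quot;&amp;gt;key=value&lt;/span&amp;gt; pairs on the agent&#039;s standard input, one per line. The tiny helper below (&lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_args&lt;/span&amp;gt; is an illustrative name, not a real tool) just prints such a payload, which is also a handy way to drive an agent by hand when debugging:

```shell
#!/bin/sh
# Print a fence-agent stdin payload: one key=value pair per line,
# mirroring what fenced would pipe into the agent for the snippet above.
fence_args() {
    for kv in "$@"; do
        echo "$kv"
    done
}

# Example (piping into fence_na is how you would test the device by hand):
# fence_args ipaddr=motoko.alteeve.com login=motoko passwd=secret \
#            quiet=true port=2 action=off | fence_na
```
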
&lt;br /&gt;
Once the device completes, it returns a success or failure message. If the first attempt fails, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; will try the next &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;method&amp;gt;&amp;lt;/span&amp;gt;, if a second exists. It will keep trying fence methods in the order they are found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file until it runs out of devices. If it fails to fence the node, most daemons will &amp;quot;block&amp;quot;, that is, lock up and stop responding until the issue is resolved. The logic for this is that a locked up cluster is better than a corrupted one.&lt;br /&gt;
&lt;br /&gt;
If any of the fence devices succeed though, the cluster will know that it is safe to proceed and will reconfigure the cluster without the defective node.&lt;br /&gt;
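The fallback behaviour described above amounts to a simple ordered loop. Here is that logic as a sketch; &lt;span class=&amp;quot;code&amp;quot;&amp;gt;try_fence&lt;/span&amp;gt; is a hypothetical stand-in for &lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&lt;/span&amp;gt;, and each argument stands in for one configured method&#039;s agent call:

```shell
#!/bin/sh
# Walk the fence methods in cluster.conf order; stop at the first one
# that succeeds, and report failure (fenced would block) if none do.
try_fence() {
    for method in "$@"; do
        if $method; then
            echo "fenced via: $method"
            return 0
        fi
    done
    echo "all fence methods failed; the cluster would now block" >&2
    return 1
}
```

Seen this way, it is clear why a second fence method (say, a switched PDU behind IPMI) buys real safety: the loop only needs one success.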
&lt;br /&gt;
== Fence Devices ==&lt;br /&gt;
&lt;br /&gt;
Many major [[OEM]]s have their own remote management devices that can serve as fence devices. Examples are [http://dell.ca Dell]&#039;s &#039;DRAC&#039; (Dell Remote Access Controller), [http://hp.ca HP]&#039;s &#039;iLO&#039; (Integrated Lights-Out), [http://ibm.ca IBM]&#039;s &#039;RSA&#039; (Remote Supervisor Adapter), [http://sun.ca Sun]&#039;s &#039;SSP&#039; (System Service Processor) and so on. Smaller manufacturers implement remote management via [[IPMI]], the Intelligent Platform Management Interface.&lt;br /&gt;
&lt;br /&gt;
In the above devices, fencing is implemented via a built-in or integrated device inside the server. These devices are usually accessible even when the host server is powered off or hard locked. Via these devices, the host server can be powered off, reset and powered on remotely, regardless of the state of the host server.&lt;br /&gt;
&lt;br /&gt;
Block fencing is possible when the device connecting a node to shared resources, like a fiber-channel SAN switch, provides a method of logically &amp;quot;unplugging&amp;quot; a defective node from the shared resource, leaving the node itself alone.&lt;br /&gt;
&lt;br /&gt;
=== IPMI ===&lt;br /&gt;
&lt;br /&gt;
[[IPMI]] is perhaps the most common method used for fencing nodes in a cluster. It requires having a system board with an IPMI baseboard management controller (BMC). IPMI is often the underlying technology used by many OEM-branded remote access tools. If you don&#039;t see a specific fence agent for your server&#039;s remote access application, experiment with generic IPMI tools.&lt;br /&gt;
&lt;br /&gt;
To start, we need to install the IPMI user software:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install ipmitool&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
A cheap alternative is the [[Node Assassin]], an open-hardware, open source fence device. It was built to allow the use of commodity system boards that lacked remote management support found on more expensive, server class hardware.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Full Disclosure&#039;&#039;&#039;: Node Assassin was created by me, with much help from others, for this paper.&lt;br /&gt;
&lt;br /&gt;
= Clean Up =&lt;br /&gt;
&lt;br /&gt;
Some daemons may have been installed or dragged in during the setup of the cluster. Most notable is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;heartbeat&amp;lt;/span&amp;gt;, which is not compatible with [[RHCS]]. The following command will disable various daemons from starting at boot.&lt;br /&gt;
&lt;br /&gt;
It is safe to run even when some of the daemons aren&#039;t installed or have already been removed. Of course, skip the ones you actually want to use. This assumes that the nodes have been built according to this HowTo.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig heartbeat off; chkconfig iscsid off; chkconfig iptables off; chkconfig ip6tables off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= So You Have a Cluster - Now What? =&lt;br /&gt;
&lt;br /&gt;
Building the cluster infrastructure was pretty easy, wasn&#039;t it?&lt;br /&gt;
&lt;br /&gt;
But what will you &#039;&#039;&#039;&#039;&#039;do&#039;&#039;&#039;&#039;&#039; with it?&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;width: 75%; text-align: center; padding: 10px; border-spacing: 15px;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!style=&amp;quot;width: 100%; border: 1px solid #dfdfdf;&amp;quot; colspan=&amp;quot;3&amp;quot;|Choose Your Own &amp;lt;span style=&amp;quot;text-decoration: line-through; color: #7f7f7f;&amp;quot;&amp;gt;Adventure&amp;lt;/span&amp;gt; Cluster&lt;br /&gt;
|-&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|style=&amp;quot;width: 34%; border: 1px solid #dfdfdf;&amp;quot;| [[Two Node Fedora 13 Cluster - Xen-Based Virtual Machine Host on DRBD+CLVM|Xen-Based Virtual Machine Host on DRBD+CLVM]]&amp;lt;br /&amp;gt;High Availability VM Host&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Thanks =&lt;br /&gt;
&lt;br /&gt;
* A &#039;&#039;&#039;huge&#039;&#039;&#039; thanks to [http://iplink.net Interlink Connectivity]! They hire me as a contractor and have allowed me to extend these docs while working on their clusters. Development of these How-Tos would be much slower if not for them. If you need hosting or colo services, drop them a line. Their website is a bit out of date though, so disregard the prices. :)&lt;br /&gt;
* To &#039;&#039;&#039;sdake&#039;&#039;&#039; of [http://corosync.org corosync] for helping me sort out the &#039;&#039;&#039;plock&#039;&#039;&#039; component and corosync in general.&lt;br /&gt;
* To &#039;&#039;&#039;Angus Salkeld&#039;&#039;&#039; for helping me nail down the Corosync and OpenAIS differences.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013922.html HJ Lee] from the OpenAIS list for helping me understand the mechanisms controlling the Redundant Ring Protocol&#039;s failure detection types.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013925.html Steven Dake] for clarifying the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;to_x&amp;lt;/span&amp;gt; vs. &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;logoutput: x&amp;lt;/span&amp;gt; arguments in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais.conf&amp;lt;/span&amp;gt;.&lt;br /&gt;
* To [http://dk.linkedin.com/in/fabbione Fabio Massimo Di Nitto] for helping me get caught up with clustering and VMs.&lt;br /&gt;
&lt;br /&gt;
{{footer}}&lt;/div&gt;</summary>
		<author><name>Facade</name></author>
	</entry>
	<entry>
		<id>https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2116</id>
		<title>Abandoned - Two Node Fedora 13 Cluster</title>
		<link rel="alternate" type="text/html" href="https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2116"/>
		<updated>2010-09-09T22:38:32Z</updated>

		<summary type="html">&lt;p&gt;Facade: /* Core Program Overviews */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{howto_header}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Notice&#039;&#039;&#039;&#039;&#039;, &#039;&#039;&#039;Sep. 08, 2010&#039;&#039;&#039; - &#039;&#039;Draft 1&#039;&#039;: This is the first draft of this HowTo. It has not been peer-reviewed yet, so please proceed with caution and on a test cluster. Any and all feedback is much appreciated.&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
&lt;br /&gt;
This paper has one goal:&lt;br /&gt;
&lt;br /&gt;
# How to assemble the simplest cluster possible, a &#039;&#039;&#039;2-Node Cluster&#039;&#039;&#039;, which you can then expand on for your own needs.&lt;br /&gt;
&lt;br /&gt;
With this completed, you can then jump into &amp;quot;step 2&amp;quot; papers that will show various uses of a two node cluster:&lt;br /&gt;
&lt;br /&gt;
# How to create a &amp;quot;floating&amp;quot; virtual machine that can move between the two nodes in the event of a node failure, maximizing up time.&lt;br /&gt;
# How to create simple resources that can move between nodes. Examples will be a simple PostgreSQL database, DHCP, DNS and web servers.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
&lt;br /&gt;
It is expected that you are already comfortable with the Linux command line, specifically &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[bash]]&amp;lt;/span&amp;gt;, and that you are familiar with general administrative tasks in Red Hat based distributions, specifically [[Fedora]]. You will also need to be comfortable using editors like [[vim]], [[nano]] or similar. This paper uses &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;vim&amp;lt;/span&amp;gt; in examples. Simply substitute your favourite editor in its place.&lt;br /&gt;
&lt;br /&gt;
You are also expected to be comfortable with networking concepts: TCP/IP, [[multicast]], broadcast, subnets and netmasks, routing and other relatively basic topics. Please take the time to become familiar with these concepts before proceeding.&lt;br /&gt;
&lt;br /&gt;
That said, where feasible, as much detail as possible will be provided. For example, all configuration file locations will be shown and functioning sample files will be provided.&lt;br /&gt;
&lt;br /&gt;
== Platform ==&lt;br /&gt;
&lt;br /&gt;
This paper will implement the [[Red Hat]] Cluster Suite using the Fedora v13 distribution. This paper uses the [[x86_64]] repositories, however, if you are on an [[i386]] (32 bit) system, you should be able to follow along fine. Simply replace &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;x86_64&amp;lt;/span&amp;gt; with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i386&amp;lt;/span&amp;gt; or &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i686&amp;lt;/span&amp;gt; in package names. &lt;br /&gt;
&lt;br /&gt;
You can either download the stock Fedora 13 DVD ISO (currently at version &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5.4&amp;lt;/span&amp;gt; which is used in this document), or you can try out the alpha [[#AN!Cluster Install|AN!Cluster Install]] DVD. (4.3GB iso). If you use the latter, please test it out on a development or test cluster. If you have any problems with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;AN!Cluster&amp;lt;/span&amp;gt; variant Fedora distro, please [[Digimer|contact me]] and let me know what your trouble was.&lt;br /&gt;
&lt;br /&gt;
== Why Fedora 13? ==&lt;br /&gt;
&lt;br /&gt;
Generally speaking, I much prefer to use a server-oriented Linux distribution like [[CentOS]], [[Debian]] or similar. However, there have been many recent changes in the Linux-clustering world that have made all of the currently available server-class distributions obsolete. With luck, once [[Red Hat]] Enterprise Linux and [[CentOS]] version 6 are released, this will change.&lt;br /&gt;
&lt;br /&gt;
Until then, [[Fedora]] version 13 provides the most up-to-date binary releases of the new clustering stack. For this reason, Fedora 13 is the best choice if you want to be current. To mitigate some of the issues introduced by using a workstation distribution, many packages will be stripped out of the default install.&lt;br /&gt;
&lt;br /&gt;
== Focus ==&lt;br /&gt;
&lt;br /&gt;
Clusters can serve to solve three problems: &#039;&#039;&#039;Reliability&#039;&#039;&#039;, &#039;&#039;&#039;Performance&#039;&#039;&#039; and &#039;&#039;&#039;Scalability&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
The focus of the cluster described in this paper is primarily &#039;&#039;&#039;reliability&#039;&#039;&#039;. Second to this, &#039;&#039;&#039;scalability&#039;&#039;&#039; will be the priority, leaving &#039;&#039;&#039;performance&#039;&#039;&#039; to be addressed only when it does not impact the first two criteria. This is not to say that performance is not a valid priority; it simply isn&#039;t the priority of this paper.&lt;br /&gt;
&lt;br /&gt;
== Goal ==&lt;br /&gt;
&lt;br /&gt;
At the end of this paper, you should have a fully functioning two-node cluster capable of hosting &amp;quot;floating&amp;quot; resources. That is, resources that exist on one node and can be easily moved to the other node with minimal effort and down time. This should leave you with a solid foundation for adding more virtual servers, up to the limit of your cluster&#039;s resources.&lt;br /&gt;
&lt;br /&gt;
This paper should also show how to build the foundation of any other cluster configuration. It introduces the main issues that come with clustering and aims to serve as a foundation for cluster configurations outside its scope.&lt;br /&gt;
&lt;br /&gt;
= Begin = &lt;br /&gt;
&lt;br /&gt;
Let&#039;s begin!&lt;br /&gt;
&lt;br /&gt;
== Hardware ==&lt;br /&gt;
&lt;br /&gt;
We will need two physical servers each with the following hardware:&lt;br /&gt;
* One or more multi-core [[CPU]]s with Virtualization support.&lt;br /&gt;
* Three network cards; at least one should be gigabit or faster.&lt;br /&gt;
* One or more hard drives.&lt;br /&gt;
* You will need some form of a [[fence|fence device]]. This can be an [[IPMI]]-enabled server, a [http://nodeassassin.org Node Assassin], a fenceable [http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900&amp;amp;tab=features PDU] or similar.&lt;br /&gt;
&lt;br /&gt;
This paper uses the following hardware:&lt;br /&gt;
* ASUS [http://support.asus.com/search/search.aspx?keyword=m4a78l-m&amp;amp;SLanguage=en-us M4A78L-M]&lt;br /&gt;
* AMD Athlon II x2 250 &lt;br /&gt;
* 2GB Kingston DDR2 KVR800D2N6K2/4G (4GB kit split between the two nodes)&lt;br /&gt;
* 1x Intel 82540 PCI NIC&lt;br /&gt;
* 1x D-Link DGE-560T&lt;br /&gt;
* Node Assassin&lt;br /&gt;
&lt;br /&gt;
This is not an endorsement of the above hardware. I bought what was within my budget and would serve the purposes of creating this document. What you purchase shouldn&#039;t matter, so long as the minimum requirements are met.&lt;br /&gt;
&lt;br /&gt;
== Pre-Assembly ==&lt;br /&gt;
&lt;br /&gt;
With multiple NICs, it is quite likely that the mapping of physical devices to logical &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; devices may not be ideal. This is a particular issue if you decide to [[Setting Up a PXE Server in Fedora|network boot]] your install media. In that case, if the wrong NIC is chosen for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;, you will be presented with a list of MAC addresses to attempt setup with.&lt;br /&gt;
&lt;br /&gt;
Before you assemble your servers, record their network cards&#039; [[MAC]] addresses. I like to keep simple text files like these:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node01.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:EA	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:53	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:46:E4	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node02.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:D8	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:5A	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:45:78	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Feel free to record the information in any way that suits you the best.&lt;br /&gt;
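If you would rather gather the addresses from a running system than from hardware labels, the kernel exposes them under &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/sys&amp;lt;/span&amp;gt;. This is a sketch, assuming a Linux system with sysfs mounted; the interface names shown earlier are this paper&#039;s examples.&lt;br /&gt;
&lt;br /&gt;

```shell
# Print each network interface with its MAC address, in a format
# similar to the .mac files above (the sysfs path is standard on Linux).
for nic in /sys/class/net/*; do
    printf '%s\t%s\n' "$(cat "$nic/address")" "$(basename "$nic")"
done
```

&lt;br /&gt;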
&lt;br /&gt;
== OS Install ==&lt;br /&gt;
&lt;br /&gt;
Start with a stock Fedora 13 install. This How-To uses Fedora 13 x86_64, however it should be fairly easy to adapt to other recent Fedora versions. This document also attempts to be easily ported to [[CentOS]]/[[RHEL]] version 6 once it is released. It will not adapt well to CentOS/RHEL version 5, though; much of the cluster stack has changed dramatically since that release.&lt;br /&gt;
&lt;br /&gt;
These are the sample kickstart scripts used by this paper. Be sure to set your own password string and network settings.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Warning&#039;&#039;&#039;&#039;&#039;! These kickstart scripts &#039;&#039;&#039;&#039;&#039;will erase your hard drive&#039;&#039;&#039;&#039;&#039;! Adapt them, don&#039;t blindly use them.&lt;br /&gt;
&lt;br /&gt;
Generic cluster node kickstart scripts.&lt;br /&gt;
* [[Fedora13 KS an-node01.ks|an-node01.ks]]&lt;br /&gt;
* [[Fedora13 KS an-node02.ks|an-node02.ks]]&lt;br /&gt;
&lt;br /&gt;
== AN!Cluster Install ==&lt;br /&gt;
&lt;br /&gt;
If you are feeling brave, below is a link to a custom install DVD that contains the kickstart scripts to setup nodes and an &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-cluster&amp;lt;/span&amp;gt; directory with all the configuration files.&lt;br /&gt;
&lt;br /&gt;
* Download the custom &#039;&#039;&#039;AN!Cluster v0.2.001&#039;&#039;&#039; Install DVD. (4.5[[GiB]] iso). (Currently disabled - Reworking for F13)&lt;br /&gt;
&lt;br /&gt;
== Post OS Install ==&lt;br /&gt;
&lt;br /&gt;
Once the OS is installed, we need to do a few things:&lt;br /&gt;
&lt;br /&gt;
# Disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;&lt;br /&gt;
# Setup networking.&lt;br /&gt;
# Change the default run-level.&lt;br /&gt;
&lt;br /&gt;
=== Disable &#039;selinux&#039; ===&lt;br /&gt;
&lt;br /&gt;
To allow this paper to focus on clustering, we will disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;. Obviously, this introduces security issues that you may not be comfortable with. If you wish to leave &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt; enabled, it will be up to you to sort out issues that crop up.&lt;br /&gt;
&lt;br /&gt;
To disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/selinux/config&amp;lt;/span&amp;gt; and change &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=enforcing&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=permissive&amp;lt;/span&amp;gt;. You will need to reboot in order for the changes to take effect.&lt;br /&gt;
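The same edit can be made non-interactively with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;sed&amp;lt;/span&amp;gt;. This is a sketch, not the only way; the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX_CONF&amp;lt;/span&amp;gt; variable is only a convenience so the command can be pointed at a copy of the file for a dry run.&lt;br /&gt;
&lt;br /&gt;

```shell
# Switch SELinux from enforcing to permissive in the config file.
# SELINUX_CONF defaults to the real file, so run this as root;
# the change still requires a reboot to take effect.
SELINUX_CONF="${SELINUX_CONF:-/etc/selinux/config}"
if [ -w "$SELINUX_CONF" ]; then
    sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' "$SELINUX_CONF"
fi
```

&lt;br /&gt;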
&lt;br /&gt;
=== Change the Default Run-Level ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. It improves performance only; it is not required.&lt;br /&gt;
&lt;br /&gt;
If you don&#039;t plan to work on your nodes directly, it makes sense to switch the default run level from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;3&amp;lt;/span&amp;gt;. This prevents the desktop environment, like Gnome or KDE, from starting at boot. This frees up a fair bit of memory and system resources and reduces the possible attack vectors.&lt;br /&gt;
&lt;br /&gt;
To do this, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/inittab&amp;lt;/span&amp;gt;, change the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:5:initdefault:&amp;lt;/span&amp;gt; line to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:3:initdefault:&amp;lt;/span&amp;gt; and then switch to run level 3:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/inittab&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
id:3:initdefault:&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
init 3&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Setup Networking ===&lt;br /&gt;
&lt;br /&gt;
We need to remove &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt;, enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt;, configure the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ifcfg-eth*&amp;lt;/span&amp;gt; files and then start the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; daemon.&lt;br /&gt;
&lt;br /&gt;
==== Network Layout ====&lt;br /&gt;
&lt;br /&gt;
This setup expects you to have three physical network cards connected to three independent networks. To have a common vernacular, let&#039;s use this table to describe them:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!class=&amp;quot;cell_all&amp;quot;|Network Description&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Short Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Device Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Suggested Subnet&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|NIC Properties&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Internet-Facing Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Remaining NIC should be used here.&amp;lt;br /&amp;gt;If using a [[Setting Up a PXE Server in Fedora|PXE server]], this should be a bootable NIC.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Storage Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth1&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Fastest NIC should be used here.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Back-Channel Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth2&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|NICs with [[IPMI]] piggy-back &#039;&#039;&#039;must&#039;&#039;&#039; be used here.&amp;lt;br /&amp;gt;Second-fastest NIC should be used here.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Take note of these concerns when planning which NIC to use on each subnet. These issues are presented in the order in which they must be addressed:&lt;br /&gt;
# If your nodes have [[IPMI]] piggy-backing on a normal NIC, that NIC &#039;&#039;&#039;&#039;&#039;must&#039;&#039;&#039;&#039;&#039; be used on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. Having your fence device accessible on a subnet that can be remotely accessed can pose a &#039;&#039;major&#039;&#039; security risk.&lt;br /&gt;
# The fastest NIC should be used for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt; subnet. Be sure to know which NICs support the largest jumbo frames when considering this.&lt;br /&gt;
# If you still have two NICs to choose from, use the fastest remaining NIC for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. This will minimize the time it takes to perform tasks like hot-migration of live virtual machines.&lt;br /&gt;
# The final NIC should be used for the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt; subnet.&lt;br /&gt;
&lt;br /&gt;
==== Node IP Addresses ====&lt;br /&gt;
&lt;br /&gt;
Obviously, the IP addresses you give to your nodes should be ones that suit you best. In this example, the following IP addresses are used:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;amp;nbsp;&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Internet-Facing Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Storage Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Back-Channel Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node01&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node02&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==== Remove NetworkManager ====&lt;br /&gt;
&lt;br /&gt;
Some cluster software &#039;&#039;&#039;will not&#039;&#039;&#039; start with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; installed. This is because &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is designed to be a highly-adaptive network system that can accommodate frequent changes in the network. To simplify these network transitions for end-users, a lot of reconfiguration of the network is done behind the scenes.&lt;br /&gt;
&lt;br /&gt;
For workstations, this is wonderful. For clustering, this can be disastrous. Transient network outages are already a risk to a cluster&#039;s stability!&lt;br /&gt;
&lt;br /&gt;
So first up, make sure that &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is completely removed from your system. If you used the [[#OS Install|kickstart]] scripts, then it was not installed. Otherwise, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum remove NetworkManager NetworkManager-gnome NetworkManager-openconnect NetworkManager-openvpn NetworkManager-pptp NetworkManager-vpnc cnetworkmanager knetworkmanager knetworkmanager-openvpn knetworkmanager-pptp knetworkmanager-vpnc libproxy-networkmanager yum-NetworkManager-dispatcher&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup &#039;network&#039; ====&lt;br /&gt;
&lt;br /&gt;
Before proceeding with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; configuration, check to see if your network cards are aligned to the proper &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; network names. If they need to be adjusted, please follow this How-To before proceeding:&lt;br /&gt;
&lt;br /&gt;
* [[Changing the ethX to Ethernet Device Mapping in Fedora]]&lt;br /&gt;
&lt;br /&gt;
There are a few ways to configure &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; in Fedora:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network&amp;lt;/span&amp;gt; (graphical)&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network-tui&amp;lt;/span&amp;gt; (ncurses)&lt;br /&gt;
* Directly editing the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/sysconfig/network-scripts/ifcfg-eth*&amp;lt;/span&amp;gt; files. (See: [http://docs.fedoraproject.org/en-US/Fedora/12/html/Deployment_Guide/s1-networkscripts-interfaces.html here] for a full list of options)&lt;br /&gt;
&lt;br /&gt;
Do not proceed until your node&#039;s networking is fully configured.&lt;br /&gt;
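If you edit the files directly, a static configuration is the sensible choice for cluster nodes. Below is a hedged sample &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ifcfg&amp;lt;/span&amp;gt; file; the device name and addresses are this paper&#039;s example values for &#039;&#039;&#039;an-node01&#039;&#039;&#039;&#039;s back-channel NIC, so adapt them to your own layout.&lt;br /&gt;
&lt;br /&gt;

```shell
# Sample /etc/sysconfig/network-scripts/ifcfg-eth2 (illustrative
# values; ifcfg files are plain shell variable assignments).
DEVICE=eth2
BOOTPROTO=static
ONBOOT=yes
IPADDR=10.0.1.71
NETMASK=255.255.255.0
```

&lt;br /&gt;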
&lt;br /&gt;
==== Update the Hosts File ====&lt;br /&gt;
&lt;br /&gt;
Some applications expect to be able to call nodes by their name. To accommodate this, and to ensure that inter-node communication takes place on the back-channel subnet, we remove any existing hostname entries and then add the following to the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Any pre-existing entries matching the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; must be removed from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt;. There is a good chance there will be an entry that resolves to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;127.0.0.1&amp;lt;/span&amp;gt; which would cause problems later.&lt;br /&gt;
&lt;br /&gt;
Obviously, adapt the names and IPs to match your nodes and subnets. The only critical thing is to make sure that the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; is resolvable to the back-channel subnet. I like to add a short-form name for convenience.&lt;br /&gt;
&lt;br /&gt;
The updated &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file should look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/hosts&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4&lt;br /&gt;
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6&lt;br /&gt;
&lt;br /&gt;
# Back-channel IPs to name mapping.&lt;br /&gt;
10.0.1.71	an-node01 an-node01.alteeve.com&lt;br /&gt;
10.0.1.72	an-node02 an-node02.alteeve.com&lt;br /&gt;
&lt;br /&gt;
# Storage network&lt;br /&gt;
10.0.0.71	an-node01.sn&lt;br /&gt;
10.0.0.72	an-node02.sn&lt;br /&gt;
&lt;br /&gt;
# Internet facing&lt;br /&gt;
192.168.1.71	an-node01.inet&lt;br /&gt;
192.168.1.72	an-node02.inet&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To test this, ping both nodes by name and make sure the ping packets are sent on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt; subnet:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node01&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node01 (10.0.1.71) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=1 ttl=64 time=0.049 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=2 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=3 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=4 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=5 ttl=64 time=0.055 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node02 (10.0.1.72) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=1 ttl=64 time=0.221 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=2 ttl=64 time=0.188 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=3 ttl=64 time=0.217 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=4 ttl=64 time=0.192 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=5 ttl=64 time=0.163 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Disable Firewalls ====&lt;br /&gt;
&lt;br /&gt;
Be sure to flush netfilter tables and disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ip6tables&amp;lt;/span&amp;gt; from starting on our nodes.&lt;br /&gt;
&lt;br /&gt;
There will be enough potential sources of problems as it is. Disabling firewalls at this stage will minimize the chance of an errant &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; rule messing up our configuration. If, before launch, you wish to implement a firewall, feel free to do so, but be sure to thoroughly test your cluster to ensure no problems were introduced.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig --level 2345 iptables off&lt;br /&gt;
/etc/init.d/iptables stop&lt;br /&gt;
chkconfig --level 2345 ip6tables off&lt;br /&gt;
/etc/init.d/ip6tables stop&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup SSH Shared Keys ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. Setting up shared keys will allow your nodes to pass files between one another and execute commands remotely without needing to enter a password. This is obviously somewhat risky from a security point of view. As such, it is up to you whether you do this or not. This is not meant to be a security-focused How-To, so please independently study the risks.&lt;br /&gt;
&lt;br /&gt;
If you&#039;re a little new to [[SSH]], it can be a bit confusing keeping connections straight in your head. When you connect to a remote machine, you start the connection on your machine as the user you are logged in as. This is the source user. When you call the remote machine, you tell it what user you want to log in as. This is the remote user.&lt;br /&gt;
&lt;br /&gt;
You will need to create an SSH key for each source user, and then you will need to copy the newly generated public key to each remote machine&#039;s user directory that you want to connect to. In this example, we want to connect to either node, from either node, as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. So we will create a key for each node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user and then copy the generated public key to the &#039;&#039;other&#039;&#039; node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user&#039;s directory.&lt;br /&gt;
&lt;br /&gt;
For each user, on each machine you want to connect &#039;&#039;&#039;from&#039;&#039;&#039;, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh-keygen -t dsa -N &amp;quot;&amp;quot; -f ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
Generating public/private dsa key pair.&lt;br /&gt;
Your identification has been saved in /root/.ssh/id_dsa.&lt;br /&gt;
Your public key has been saved in /root/.ssh/id_dsa.pub.&lt;br /&gt;
The key fingerprint is:&lt;br /&gt;
c4:cc:10:0a:6b:60:24:e7:57:94:69:e5:10:b2:cb:1f root@an-node01.alteeve.com&lt;br /&gt;
The key&#039;s randomart image is:&lt;br /&gt;
+--[ DSA 1024]----+&lt;br /&gt;
|+oo ..B*.        |&lt;br /&gt;
|o+ o =+B         |&lt;br /&gt;
|  + +.  *        |&lt;br /&gt;
| . o . .         |&lt;br /&gt;
|    o E S        |&lt;br /&gt;
|     . .         |&lt;br /&gt;
|      .          |&lt;br /&gt;
|                 |&lt;br /&gt;
|                 |&lt;br /&gt;
+-----------------+&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will create two files: the private key called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa&amp;lt;/span&amp;gt; and the public key called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa.pub&amp;lt;/span&amp;gt;. The private key &#039;&#039;&#039;&#039;&#039;must never&#039;&#039;&#039;&#039;&#039; be group or world readable! That is, it should be set to mode &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;0600&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The two files should look like:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Private key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
-----BEGIN DSA PRIVATE KEY-----&lt;br /&gt;
MIIBugIBAAKBgQCVo5nzeC/W8HWaFkD4pAmWO7ovf7S01f8HgocKKOYX/nMNxRui&lt;br /&gt;
wExBUUUt1b2TxYYQklTd9dn8UAzZ3UohN0CJNDrp//t2wfyACKKc2q+ewao5/svQ&lt;br /&gt;
xXTBu2ZPebKPcr1w9p0hq0xSr77Rg3v6Auc6AnC79DM9XkhYkVgNfgrvMwIVAKdY&lt;br /&gt;
SgTycvqNEWsJrCTTn5485yWvAoGANQ3rzIxFy2zHWyMlrpA9r5jllgxfaJVx3/Iw&lt;br /&gt;
DrEF82lr2dOmG8oYiKAhfvdgNyV7VeM2yxdjcdLPJAHHCx5BZZ7V9KjMzY26wS2r&lt;br /&gt;
d58TZJoWmO0nscmcwIbCv2at3fiMpxgr+nzD4tKxFOdWxYBwdmUpIjtJoW8wzY2H&lt;br /&gt;
uNghjL0CgYA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDN&lt;br /&gt;
bbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EE&lt;br /&gt;
D3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5AIUFX97yIig&lt;br /&gt;
n2SUltN+10NWUy4oYsA=&lt;br /&gt;
-----END DSA PRIVATE KEY-----&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Public key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa.pub&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Copy the public key and then &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; normally into the remote machine as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create a file called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; and paste in the key.&lt;br /&gt;
&lt;br /&gt;
From &#039;&#039;&#039;an-node01&#039;&#039;&#039;, type:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh root@an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
The authenticity of host &#039;an-node02 (10.0.1.72)&#039; can&#039;t be established.&lt;br /&gt;
RSA key fingerprint is d4:1b:68:5f:fa:ef:0f:0c:16:e7:f9:0a:8d:69:3e:c5.&lt;br /&gt;
Are you sure you want to continue connecting (yes/no)? yes&lt;br /&gt;
Warning: Permanently added &#039;an-node02,10.0.1.72&#039; (RSA) to the list of known hosts.&lt;br /&gt;
root@an-node02&#039;s password: &lt;br /&gt;
Last login: Tue Jul  6 22:28:19 2010 from 192.168.1.102&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will now be logged into &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02&amp;lt;/span&amp;gt; as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; file and paste into it the public key from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01&amp;lt;/span&amp;gt;. If the remote machine&#039;s user hasn&#039;t used &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; yet, their &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh&amp;lt;/span&amp;gt; directory will not exist.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/authorized_keys&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
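If the remote user has never used &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt;, the steps above can be sketched as follows. Note that &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;sshd&amp;lt;/span&amp;gt; will silently ignore keys when the file permissions are too loose, so set them explicitly. This sketch uses a scratch &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;HOME&amp;lt;/span&amp;gt; and a placeholder key so it is safe to run anywhere; on the real node, work in root&#039;s actual home directory and paste the real public key.&lt;br /&gt;

```shell
# Create ~/.ssh with the strict permissions sshd expects, then append
# the public key. A scratch HOME and a placeholder key are used here
# purely for illustration.
HOME=$(mktemp -d)
mkdir -p "$HOME/.ssh"
chmod 700 "$HOME/.ssh"
# On the real node, paste the public key copied from an-node01 instead.
echo "ssh-dss AAAA...placeholder... root@an-node01.alteeve.com" >> "$HOME/.ssh/authorized_keys"
chmod 600 "$HOME/.ssh/authorized_keys"
```
&lt;br /&gt;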
Now log out and then log back into the remote machine. This time, the connection should succeed without prompting for a password!&lt;br /&gt;
&lt;br /&gt;
= Initial Cluster Setup =&lt;br /&gt;
&lt;br /&gt;
Before we get into specifics, let&#039;s take a minute to talk about the major components used in our cluster.&lt;br /&gt;
&lt;br /&gt;
== Core Program Overviews ==&lt;br /&gt;
&lt;br /&gt;
These are the core programs that may be new to you that we will use to build our cluster. Before we configure them, let&#039;s take a minute to understand their roles.&lt;br /&gt;
&lt;br /&gt;
=== Cluster Manager ===&lt;br /&gt;
&lt;br /&gt;
The cluster manager, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, reads the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; configuration file and uses it to configure and start the various cluster elements. This is where nodes are defined, fence devices are configured and various cluster tolerances are set.&lt;br /&gt;
&lt;br /&gt;
=== Corosync ===&lt;br /&gt;
&lt;br /&gt;
[http://www.corosync.org Corosync] is, essentially, the clustering kernel which provides the essential services for cluster-aware applications and other cluster services. At its core is the [[totem]] protocol, which provides cluster membership and ordered messaging. On top of this, it implements a number of core clustering services like [[#cpg; Closed Process Group|closed process group messaging]], [[quorum]] and [[confdb]] (an object database).&lt;br /&gt;
&lt;br /&gt;
Its goal is to provide a substantially simpler and more flexible set of APIs to facilitate clustering in Linux. It manages which nodes are in the cluster, it triggers error messages when something fails, manages cluster locking and so on. Most other clustered applications rely on corosync to know when something has happened or to announce when a cluster-related action has taken place.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;corosync_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
==== cpg; Closed Process Group ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install corosynclib-devel&lt;br /&gt;
man cpg_overview&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The closed process group is a small process layer on top of the totem protocol provided by corosync. It handles the sending and delivery of messages amongst nodes in a consistent order. It adds [[PID]]s and group names to the membership layer. Only members of the group get the messages, thus it is a &amp;quot;closed&amp;quot; group.&lt;br /&gt;
&lt;br /&gt;
=== OpenAIS ===&lt;br /&gt;
&lt;br /&gt;
OpenAIS is now an extension to Corosync that adds an open-source implementation of the [http://www.saforum.org/ Service Availability (SA) Forum&#039;s] &#039;Application Interface Specification&#039; ([[AIS]]). It is an [[API]] and policy designed to be used by applications concerned with maintaining services during faults. AIS implements the &#039;Availability Management Framework&#039; ([[AMF]]) which, in turn, provides for application fail over, cluster management ([[CLM]]), Checkpointing ([[CKPT]]), Eventing ([[EVT]]), Messaging ([[MSG]]), and Distributed Locking ([[DLOCK]]).&lt;br /&gt;
&lt;br /&gt;
It does this by implementing pluggable shared libraries that get loaded into corosync at runtime. These libraries then can make use of corosync&#039;s internal API. Both corosync and openais services provide [[IPC]] interfaces so that application programmers can make use of their behaviour.&lt;br /&gt;
&lt;br /&gt;
In short: applications can use OpenAIS to be cluster-aware. Its libraries are used by some applications, including Pacemaker. In our case, we will only be using its libraries indirectly.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
[http://www.clusterlabs.org Pacemaker] is the cluster resource manager. It can be used to trigger scripts to control non cluster-aware applications. In this way, these non cluster-aware applications can be made more highly available. It is considered a replacement for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[rgmanager]]&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Setup Core Programs ==&lt;br /&gt;
&lt;br /&gt;
We now need to edit and create the configuration files for our cluster.&lt;br /&gt;
&lt;br /&gt;
=== Setup cman ===&lt;br /&gt;
&lt;br /&gt;
Depending on your needs, this may be the only section you need to configure.&lt;br /&gt;
&lt;br /&gt;
First, install the cluster manager:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install cman&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once installed, we need to create and fill in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; file. This is the core configuration file of our cluster and is an [[XML]] formatted file.&lt;br /&gt;
&lt;br /&gt;
Please see: [[Two-Node Fedora 13 cluster.conf]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The example in that article details many of the common options you will want to use. For a more complete discussion of this file, please run &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man 5 cluster.conf&amp;lt;/span&amp;gt;. Also, please be aware that some arguments can overlap between [[#Setup Corosync|corosync]] and this file.&lt;br /&gt;
&lt;br /&gt;
Once it&#039;s set up, we need to verify it. When you&#039;ve finished editing, be sure to run this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
xmllint --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are errors, fix them. Once it is formatted properly, the contents of your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file will be returned, followed by &amp;quot;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf validates&amp;lt;/span&amp;gt;&amp;quot;. Once this is the case, you can move on.&lt;br /&gt;
&lt;br /&gt;
A note: normally at this stage, you&#039;d use &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;chkconfig&amp;lt;/span&amp;gt; to enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; to start at boot. However, depending on how you use your cluster, you may need to modify the boot order. For now, let&#039;s leave it off until we&#039;ve got all the components enabled. This is particularly important if you will be using [[DRBD]], [[CLVM]] or [[Xen]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig cman off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To see if it&#039;s working, have a terminal open to both nodes and manually start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;. I suggest having two terminal windows open on each node. In one, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;tail&amp;lt;/span&amp;gt;ing &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/var/log/messages&amp;lt;/span&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
clear; tail -f -n 0 /var/log/messages&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now, with both log files being watched, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; on both nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: You MUST execute the following on both nodes within the time limit set by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;post_join_delay&amp;lt;/span&amp;gt; (the default is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;6&amp;lt;/span&amp;gt; seconds). If you wait longer than this, the first node started will [[fence]] the other node, assuming you have fencing configured properly. If fencing is &#039;&#039;&#039;not&#039;&#039;&#039; configured properly, your first node will hang.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
/etc/init.d/cman start&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Examine the log files to verify there were no errors. If there were, fix them. If there were none, fantastic!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Warning&#039;&#039;&#039;: This last step will forcefully reboot your nodes. Do not do this on a production cluster! If it must be done on a production server, stop all non-essential services and be sure you have a good backup, &#039;&#039;just in case&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Lastly, test that your fencing is configured and working properly. We do this by using a program called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_node&amp;lt;/span&amp;gt;. With both nodes up and running &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, call a fence against the other node. Be sure to be watching the log files.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;From&#039;&#039;&#039; &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
fence_node an-node02.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
fence an-node02.alteeve.com success&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If it worked, when the node comes back up, repeat the process from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; against &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Setup Corosync ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: In our cluster, we will not need to configure Corosync.&lt;br /&gt;
&lt;br /&gt;
When using &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, as we are here, Corosync&#039;s configuration file is ignored.&lt;br /&gt;
&lt;br /&gt;
Most of the settings available in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 corosync.conf|corosync.conf]]&amp;lt;/span&amp;gt; can be found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|cluster.conf]]&amp;lt;/span&amp;gt; file. Please see the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; article for a comprehensive list of what is and is not supported.&lt;br /&gt;
&lt;br /&gt;
If you want to configure corosync and not use cman, please see the following article:&lt;br /&gt;
&lt;br /&gt;
* [[Two-Node Fedora 13 corosync.conf]]&lt;br /&gt;
&lt;br /&gt;
=== Setup Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
To do.&lt;br /&gt;
&lt;br /&gt;
= Fencing =&lt;br /&gt;
&lt;br /&gt;
Before proceeding with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file, you must understand what fencing is, how it is used in Red Hat/CentOS clusters and why it is so important.&lt;br /&gt;
&lt;br /&gt;
* The Cluster Admin&#039;s Mantra:&lt;br /&gt;
** &#039;&#039;&#039;The only thing you don&#039;t know is what you don&#039;t know&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Just because one node loses communication with another node, it &#039;&#039;&#039;cannot&#039;&#039;&#039; be assumed that the silent node is dead!&lt;br /&gt;
&lt;br /&gt;
== What is it? ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Fencing&amp;quot; is the act of isolating a malfunctioning node. The goal is to prevent a &#039;&#039;&#039;split-brain&#039;&#039;&#039; condition, where two nodes each think the other is dead and continue to use a shared resource. When this happens, file system corruption is almost guaranteed. Another dangerous scenario: one node pauses while writing to a disk, the other node decides it is dead and begins replaying the journal, then the first node recovers and completes its write. The results would be equally disastrous. If you are lucky enough to not lose the shared file system, you will be faced with the task of determining what data got written to which node, merging that data and/or overwriting the node you trust the least. This &#039;best case&#039; is still pretty lousy.&lt;br /&gt;
&lt;br /&gt;
Fencing, isolating a node from altering shared disks, can be accomplished in a couple of ways:&lt;br /&gt;
&lt;br /&gt;
* Power&lt;br /&gt;
** Power fencing is where a device is used to cut the power to a malfunctioning node. This is probably the most common type.&lt;br /&gt;
* Blocking&lt;br /&gt;
** Blocking is often implemented at the network level. This type of fencing leaves the node alone, but disconnects it from the storage network. Often this is done by a switch which prevents traffic coming from the fenced node.&lt;br /&gt;
&lt;br /&gt;
With power fencing, the term used is &amp;quot;STONITH&amp;quot;, literally, &#039;&#039;&#039;S&#039;&#039;&#039;hoot &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;O&#039;&#039;&#039;ther &#039;&#039;&#039;N&#039;&#039;&#039;ode &#039;&#039;&#039;I&#039;&#039;&#039;n &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;H&#039;&#039;&#039;ead. Picture it like an old west duel. If one node is dead, the other node will win the duel by default and the dead node will just be shot again. When both nodes are alive, however, the faster node will win and will &amp;quot;kill&amp;quot; (power off or reset) the slower node before it has a chance to fire. Once this duel is over, the surviving node can then access the shared resource, confident that it is the only one working on it.&lt;br /&gt;
&lt;br /&gt;
== Misconception ==&lt;br /&gt;
&lt;br /&gt;
It is a &#039;&#039;&#039;very&#039;&#039;&#039; common mistake to ignore fencing when first starting to learn about clustering. Often people think &#039;&#039;&amp;quot;It&#039;s just for production systems, I don&#039;t need to worry about it yet because I don&#039;t care what happens to my test cluster.&amp;quot;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Wrong!&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The most practical reason: the cluster software will block all I/O transactions when it can&#039;t guarantee that a fence operation succeeded. The result is that your cluster will essentially &amp;quot;lock up&amp;quot;. Likewise, [[cman]] and related daemons will fail if they can&#039;t find a fence agent to use.&lt;br /&gt;
&lt;br /&gt;
Second: testing our cluster will involve inducing errors. Without proper fencing, there is a high probability that our shared file system will be corrupted. That would force you to start over, making your learning take a lot longer than it needs to.&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
How you implement fencing depends largely on what kind of fence device(s) you have. This is where I need help from you, dear reader. I only have access to [[Node Assassin]] fence devices on my lab&#039;s cluster. If you have other fence devices, like [[IPMI]], [[DRAC]] or so on, please let me know so that I can expand this section.&lt;br /&gt;
&lt;br /&gt;
=== Implementing Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
In Red Hat&#039;s cluster software, the fence device(s) are configured in the main &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf&amp;lt;/span&amp;gt; cluster configuration file. This configuration is then acted on via the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon. We&#039;ll cover the details of the [[cluster.conf]] file in a moment.&lt;br /&gt;
&lt;br /&gt;
When the cluster determines that a node needs to be fenced, the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon will consult the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file for information on how to access the fence device. Given this &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; snippet:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;xml&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;cluster name=&amp;quot;an-cluster&amp;quot; config_version=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
	&amp;lt;clusternodes&amp;gt;&lt;br /&gt;
		&amp;lt;clusternode name=&amp;quot;an-node02.alteeve.com&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
			&amp;lt;fence&amp;gt;&lt;br /&gt;
				&amp;lt;method name=&amp;quot;node_assassin&amp;quot;&amp;gt;&lt;br /&gt;
					&amp;lt;device name=&amp;quot;motoko&amp;quot; port=&amp;quot;02&amp;quot; action=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
				&amp;lt;/method&amp;gt;&lt;br /&gt;
			&amp;lt;/fence&amp;gt;&lt;br /&gt;
		&amp;lt;/clusternode&amp;gt;&lt;br /&gt;
	&amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
	&amp;lt;fencedevices&amp;gt;&lt;br /&gt;
		&amp;lt;fencedevice name=&amp;quot;motoko&amp;quot; agent=&amp;quot;fence_na&amp;quot; quiet=&amp;quot;true&amp;quot;&lt;br /&gt;
		ipaddr=&amp;quot;motoko.alteeve.com&amp;quot; login=&amp;quot;motoko&amp;quot; passwd=&amp;quot;secret&amp;quot;&amp;gt;&lt;br /&gt;
		&amp;lt;/fencedevice&amp;gt;&lt;br /&gt;
	&amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If the cluster manager determines that the node &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; needs to be fenced, it looks at the first (and only, in this case) &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;method&amp;gt;&amp;lt;/span&amp;gt; entry under &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fence&amp;gt;&amp;lt;/span&amp;gt; and reads its device &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;, which is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; in this case. It then looks in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fencedevices&amp;gt;&amp;lt;/span&amp;gt; section for the device with the matching &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;. From there, it gets the information needed to find and access the fence device. Once it connects to the fence device, it passes along the options set in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;device&amp;gt;&amp;lt;/span&amp;gt; entry.&lt;br /&gt;
&lt;br /&gt;
So in this example, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; looks up the details on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; Node Assassin fence device. It calls the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt; program, called a fence agent, and passes the following arguments:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr=motoko.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login=motoko&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd=secret&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;quiet=true&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port=2&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action=off&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
How the fence agent acts on these arguments varies depending on the fence device itself. In general terms, the &#039;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt;&#039; fence agent will create a connection to the device at the IP address (or resolvable name, as in this case) specified in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr&amp;lt;/span&amp;gt; argument. Once connected, it will authenticate using the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd&amp;lt;/span&amp;gt; arguments. Once authenticated, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port&amp;lt;/span&amp;gt; to act on, which could be a power jack, a power or reset button, a network switch port and so on. Finally, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action&amp;lt;/span&amp;gt; to take.&lt;br /&gt;
&lt;br /&gt;
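As an aside, Red Hat&#039;s fence agents conventionally receive these arguments as &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;key=value&amp;lt;/span&amp;gt; lines on standard input when called by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt;. The stub below is &#039;&#039;not&#039;&#039; a real agent; it only illustrates that interface by parsing the same options shown above.&lt;br /&gt;

```shell
# A stand-in "fence agent" that parses key=value pairs from stdin,
# the same way fenced hands options to a real agent such as fence_na.
agent=$(mktemp)
printf '%s\n' \
    '#!/bin/bash' \
    'while IFS="=" read -r key val; do' \
    '    case "$key" in' \
    '        port)   port="$val" ;;' \
    '        action) action="$val" ;;' \
    '    esac' \
    'done' \
    'echo "would run: $action against port $port"' > "$agent"
# Feed it the same options fenced would pass for an-node02:
printf 'ipaddr=motoko.alteeve.com\nlogin=motoko\npasswd=secret\nport=2\naction=off\n' | bash "$agent"
```

This prints &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;would run: off against port 2&amp;lt;/span&amp;gt;; a real agent would instead contact the device and act on the port.&lt;br /&gt;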
Once the device completes, it returns a success or failed message. If the first attempt fails, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; will try the next &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;method&amp;gt;&amp;lt;/span&amp;gt;, if a second exists. It will keep trying fence devices in the order they are found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file until it runs out of devices. If it fails to fence the node, most daemons will &amp;quot;block&amp;quot;, that is, lock up and stop responding until the issue is resolved. The logic for this is that a locked up cluster is better than a corrupted one.&lt;br /&gt;
&lt;br /&gt;
If any of the fence devices succeed though, the cluster will know that it is safe to proceed and will reconfigure the cluster without the defective node.&lt;br /&gt;
&lt;br /&gt;
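The retry logic described above can be pictured as a simple loop. The two stub &amp;quot;methods&amp;quot; here are placeholders for real fence agent calls.&lt;br /&gt;

```shell
# Sketch of fenced's retry behaviour: try each fence method in
# cluster.conf order and stop at the first one that succeeds.
method_1() { return 1; }   # e.g. the power fence device is unreachable
method_2() { return 0; }   # e.g. the blocking fence succeeds
fenced_result=failed
for method in method_1 method_2; do
    if "$method"; then
        fenced_result=fenced
        echo "$method succeeded; node is fenced"
        break
    fi
    echo "$method failed; trying the next method"
done
echo "result: $fenced_result"
```

If every method fails, a real cluster blocks at this point rather than risk corruption.&lt;br /&gt;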
== Fence Devices ==&lt;br /&gt;
&lt;br /&gt;
Many major [[OEM]]s have their own remote management devices that can serve as fence devices. Examples are [http://dell.ca Dell]&#039;s &#039;DRAC&#039; (Dell Remote Access Controller), [http://hp.ca HP]&#039;s &#039;iLO&#039; (Integrated Lights-Out), [http://ibm.ca IBM]&#039;s &#039;RSA&#039; (Remote Supervisor Adapter), [http://sun.ca Sun]&#039;s &#039;SSP&#039; (System Service Processor) and so on. Smaller manufacturers implement remote management via [[IPMI]], the Intelligent Platform Management Interface.&lt;br /&gt;
&lt;br /&gt;
With the above devices, fencing is implemented via a built-in or integrated device inside the server. These devices are usually accessible even when the host server is powered off or hard locked. Via these devices, the host server can be powered off, reset and powered on remotely, regardless of the state of the host server.&lt;br /&gt;
&lt;br /&gt;
Block fencing is possible when the device connecting a node to shared resources, like a Fibre Channel SAN switch, provides a method of logically &amp;quot;unplugging&amp;quot; a defective node from the shared resource, leaving the node itself alone.&lt;br /&gt;
&lt;br /&gt;
=== IPMI ===&lt;br /&gt;
&lt;br /&gt;
[[IPMI]] is perhaps the most common method used for fencing nodes in a cluster. It requires having a system board with an IPMI baseboard management controller (BMC). IPMI is often the underlying technology used by many OEM-branded remote access tools. If you don&#039;t see a specific fence agent for your server&#039;s remote access application, experiment with generic IPMI tools.&lt;br /&gt;
&lt;br /&gt;
To start, we need to install the IPMI user software:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Package names here assume Fedora; &#039;ipmitool&#039; provides the command line tools.&lt;br /&gt;
yum install ipmitool OpenIPMI&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
A cheap alternative is the [[Node Assassin]], an open-hardware, open source fence device. It was built to allow the use of commodity system boards that lacked remote management support found on more expensive, server class hardware.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Full Disclosure&#039;&#039;&#039;: Node Assassin was created by me, with much help from others, for this paper.&lt;br /&gt;
&lt;br /&gt;
= Clean Up =&lt;br /&gt;
&lt;br /&gt;
Some daemons may have been installed or dragged in as dependencies during the setup of the cluster. Most notable is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;heartbeat&amp;lt;/span&amp;gt;, which is not compatible with [[RHCS]]. The following command will disable these daemons from starting at boot.&lt;br /&gt;
&lt;br /&gt;
It is safe to run even when a given daemon isn&#039;t installed or has already been removed. Of course, skip any daemons you actually want to use. This assumes that the nodes have been built according to this HowTo.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig heartbeat off; chkconfig iscsid off; chkconfig iptables off; chkconfig ip6tables off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= So You Have a Cluster - Now What? =&lt;br /&gt;
&lt;br /&gt;
Building the cluster infrastructure was pretty easy, wasn&#039;t it?&lt;br /&gt;
&lt;br /&gt;
But what will you &#039;&#039;&#039;&#039;&#039;do&#039;&#039;&#039;&#039;&#039; with it?&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;width: 75%; text-align: center; padding: 10px; border-spacing: 15px;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!style=&amp;quot;width: 100%; border: 1px solid #dfdfdf;&amp;quot; colspan=&amp;quot;3&amp;quot;|Choose Your Own &amp;lt;span style=&amp;quot;text-decoration: line-through; color: #7f7f7f;&amp;quot;&amp;gt;Adventure&amp;lt;/span&amp;gt; Cluster&lt;br /&gt;
|-&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|style=&amp;quot;width: 34%; border: 1px solid #dfdfdf;&amp;quot;| [[Two Node Fedora 13 Cluster - Xen-Based Virtual Machine Host on DRBD+CLVM|Xen-Based Virtual Machine Host on DRBD+CLVM]]&amp;lt;br /&amp;gt;High Availability VM Host&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Thanks =&lt;br /&gt;
&lt;br /&gt;
* A &#039;&#039;&#039;huge&#039;&#039;&#039; thanks to [http://iplink.net Interlink Connectivity]! They hire me as a contractor and have allowed me to extend these docs while working on their clusters. Development of these How-Tos would be much slower if not for them. If you need hosting or colo services, drop them a line. Their website is a bit out of date though, so disregard the prices. :)&lt;br /&gt;
* To &#039;&#039;&#039;sdake&#039;&#039;&#039; of [http://corosync.org corosync] for helping me sort out the &#039;&#039;&#039;plock&#039;&#039;&#039; component and corosync in general.&lt;br /&gt;
* To &#039;&#039;&#039;Angus Salkeld&#039;&#039;&#039; for helping me nail down the Corosync and OpenAIS differences.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013922.html HJ Lee] from the OpenAIS list for helping me understand the mechanisms controlling the Redundant Ring Protocol&#039;s failure detection types.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013925.html Steven Dake] for clarifying the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;to_x&amp;lt;/span&amp;gt; vs. &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;logoutput: x&amp;lt;/span&amp;gt; arguments in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais.conf&amp;lt;/span&amp;gt;.&lt;br /&gt;
* To [http://dk.linkedin.com/in/fabbione Fabio Massimo Di Nitto] for helping me get caught up with clustering and VMs.&lt;br /&gt;
&lt;br /&gt;
{{footer}}&lt;/div&gt;</summary>
		<author><name>Facade</name></author>
	</entry>
	<entry>
		<id>https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2115</id>
		<title>Abandoned - Two Node Fedora 13 Cluster</title>
		<link rel="alternate" type="text/html" href="https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2115"/>
		<updated>2010-09-09T22:34:12Z</updated>

		<summary type="html">&lt;p&gt;Facade: /* Setup SSH Shared Keys */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{howto_header}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Notice&#039;&#039;&#039;&#039;&#039;, &#039;&#039;&#039;Sep. 08, 2010&#039;&#039;&#039; - &#039;&#039;Draft 1&#039;&#039;: This is the first draft of this HowTo. It has not been peer-reviewed yet, so please proceed with caution and on a test cluster. Any and all feedback is much appreciated.&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
&lt;br /&gt;
This paper has one goal:&lt;br /&gt;
&lt;br /&gt;
# How to assemble the simplest cluster possible, a &#039;&#039;&#039;2-Node Cluster&#039;&#039;&#039;, which you can then expand on for your own needs.&lt;br /&gt;
&lt;br /&gt;
With this completed, you can then jump into &amp;quot;step 2&amp;quot; papers that will show various uses of a two node cluster:&lt;br /&gt;
&lt;br /&gt;
# How to create a &amp;quot;floating&amp;quot; virtual machine that can move between the two nodes in the event of a node failure, maximizing up time.&lt;br /&gt;
# How to create simple resources that can move between nodes. Examples will be a simple PostgreSQL database, DHCP, DNS and web servers.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
&lt;br /&gt;
It is expected that you are already comfortable with the Linux command line, specifically &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[bash]]&amp;lt;/span&amp;gt;, and that you are familiar with general administrative tasks in Red Hat based distributions, specifically [[Fedora]]. You will also need to be comfortable using editors like [[vim]], [[nano]] or similar. This paper uses &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;vim&amp;lt;/span&amp;gt; in examples; simply substitute your favourite editor in its place.&lt;br /&gt;
&lt;br /&gt;
You are also expected to be comfortable with networking concepts: TCP/IP, [[multicast]], broadcast, subnets and netmasks, routing and other relatively basic topics. Please take the time to become familiar with these concepts before proceeding.&lt;br /&gt;
&lt;br /&gt;
This said, where feasible, as much detail as possible will be provided. For example, all configuration file locations will be shown and functioning sample files included.&lt;br /&gt;
&lt;br /&gt;
== Platform ==&lt;br /&gt;
&lt;br /&gt;
This paper implements the [[Red Hat]] Cluster Suite on the Fedora 13 distribution, using the [[x86_64]] repositories. However, if you are on an [[i386]] (32-bit) system, you should be able to follow along fine; simply replace &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;x86_64&amp;lt;/span&amp;gt; with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i386&amp;lt;/span&amp;gt; or &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i686&amp;lt;/span&amp;gt; in package names.&lt;br /&gt;
&lt;br /&gt;
You can either download the stock Fedora 13 DVD ISO (currently at version &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5.4&amp;lt;/span&amp;gt; which is used in this document), or you can try out the alpha [[#AN!Cluster Install|AN!Cluster Install]] DVD. (4.3GB iso). If you use the latter, please test it out on a development or test cluster. If you have any problems with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;AN!Cluster&amp;lt;/span&amp;gt; variant Fedora distro, please [[Digimer|contact me]] and let me know what your trouble was.&lt;br /&gt;
&lt;br /&gt;
== Why Fedora 13? ==&lt;br /&gt;
&lt;br /&gt;
Generally speaking, I much prefer to use a server-oriented Linux distribution like [[CentOS]], [[Debian]] or similar. However, many recent changes in the Linux clustering world have made all of the currently available server-class distributions obsolete. With luck, this will change once [[Red Hat]] Enterprise Linux and [[CentOS]] version 6 are released.&lt;br /&gt;
&lt;br /&gt;
Until then, [[Fedora]] version 13 provides the most up-to-date binary releases of the new clustering stack. For this reason, Fedora 13 is the best choice for clustering if you want to be current. To mitigate some of the issues introduced by using a workstation distribution, many packages will be stripped out of the default install.&lt;br /&gt;
&lt;br /&gt;
== Focus ==&lt;br /&gt;
&lt;br /&gt;
Clusters can serve to solve three problems: &#039;&#039;&#039;Reliability&#039;&#039;&#039;, &#039;&#039;&#039;Performance&#039;&#039;&#039; and &#039;&#039;&#039;Scalability&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
The focus of the cluster described in this paper is primarily &#039;&#039;&#039;reliability&#039;&#039;&#039;. Second to this, &#039;&#039;&#039;scalability&#039;&#039;&#039; is the priority, leaving &#039;&#039;&#039;performance&#039;&#039;&#039; to be addressed only when it does not impact the first two criteria. This is not to say that performance is not a valid priority; it simply isn&#039;t the priority of this paper.&lt;br /&gt;
&lt;br /&gt;
== Goal ==&lt;br /&gt;
&lt;br /&gt;
At the end of this paper, you should have a fully functioning two-node cluster capable of hosting &amp;quot;floating&amp;quot; resources. That is, resources that exist on one node and can easily be moved to the other node with minimal effort and down time. This leaves you with a solid foundation for adding more virtual servers, up to the limit of your cluster&#039;s resources.&lt;br /&gt;
&lt;br /&gt;
This paper should also show how to build the foundation of any other cluster configuration. Its core focus is introducing the main issues that come with clustering, and it aims to serve as a basis for cluster configurations outside its scope.&lt;br /&gt;
&lt;br /&gt;
= Begin = &lt;br /&gt;
&lt;br /&gt;
Let&#039;s begin!&lt;br /&gt;
&lt;br /&gt;
== Hardware ==&lt;br /&gt;
&lt;br /&gt;
We will need two physical servers each with the following hardware:&lt;br /&gt;
* One or more multi-core [[CPU]]s with Virtualization support.&lt;br /&gt;
* Three network cards; at least one should be gigabit or faster.&lt;br /&gt;
* One or more hard drives.&lt;br /&gt;
* Some form of [[fence|fence device]]. This can be an [[IPMI]]-enabled server, a [http://nodeassassin.org Node Assassin], a fenceable [http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900&amp;amp;tab=features PDU] or similar.&lt;br /&gt;
&lt;br /&gt;
This paper uses the following hardware:&lt;br /&gt;
* ASUS [http://support.asus.com/search/search.aspx?keyword=m4a78l-m&amp;amp;SLanguage=en-us M4A78L-M]&lt;br /&gt;
* AMD Athlon II x2 250 &lt;br /&gt;
* 2GB Kingston DDR2 KVR800D2N6K2/4G (4GB kit split between the two nodes)&lt;br /&gt;
* 1x Intel 82540 PCI NIC&lt;br /&gt;
* 1x D-Link DGE-560T&lt;br /&gt;
* Node Assassin&lt;br /&gt;
&lt;br /&gt;
This is not an endorsement of the above hardware. I bought what was within my budget and would serve the purposes of creating this document. What you purchase shouldn&#039;t matter, so long as the minimum requirements are met.&lt;br /&gt;
&lt;br /&gt;
== Pre-Assembly ==&lt;br /&gt;
&lt;br /&gt;
With multiple NICs, it is quite likely that the mapping of physical devices to logical &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; devices will not be ideal. This is a particular issue if you decide to [[Setting Up a PXE Server in Fedora|network boot]] your install media; if the wrong NIC is chosen for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;, you will be presented with a list of MAC addresses to attempt setup with.&lt;br /&gt;
&lt;br /&gt;
Before you assemble your servers, record their network cards&#039; [[MAC]] addresses. I like to keep simple text files like these:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node01.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:EA	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:53	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:46:E4	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node02.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:D8	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:5A	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:45:78	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Feel free to record the information in any way that suits you the best.&lt;br /&gt;
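&lt;br /&gt;
If you have not recorded the MAC addresses yet, they can also be read from a running system. A small sketch (interface names and output will differ on your hardware):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ip -o link show | grep -v &amp;quot;lo:&amp;quot;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;link/ether&amp;lt;/span&amp;gt; field on each remaining line is that interface&#039;s MAC address.&lt;br /&gt;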
&lt;br /&gt;
== OS Install ==&lt;br /&gt;
&lt;br /&gt;
Start with a stock Fedora 13 install. This How-To uses Fedora 13 x86_64, however it should be fairly easy to adapt to other recent Fedora versions. This document also aims to be easily ported to [[CentOS]]/[[RHEL]] version 6 once it is released. It will not adapt well to CentOS/RHEL version 5, though; much of the cluster stack has changed dramatically since that release.&lt;br /&gt;
&lt;br /&gt;
These are the sample kickstart scripts used by this paper. Be sure to set your own password string and network settings.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Warning&#039;&#039;&#039;&#039;&#039;! These kickstart scripts &#039;&#039;&#039;&#039;&#039;will erase your hard drive&#039;&#039;&#039;&#039;&#039;! Adapt them, don&#039;t blindly use them.&lt;br /&gt;
&lt;br /&gt;
Generic cluster node kickstart scripts:&lt;br /&gt;
* [[Fedora13 KS an-node01.ks|an-node01.ks]]&lt;br /&gt;
* [[Fedora13 KS an-node02.ks|an-node02.ks]]&lt;br /&gt;
&lt;br /&gt;
== AN!Cluster Install ==&lt;br /&gt;
&lt;br /&gt;
If you are feeling brave, below is a link to a custom install DVD that contains the kickstart scripts to set up nodes and an &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-cluster&amp;lt;/span&amp;gt; directory with all the configuration files.&lt;br /&gt;
&lt;br /&gt;
* Download the custom &#039;&#039;&#039;AN!Cluster v0.2.001&#039;&#039;&#039; Install DVD. (4.5[[GiB]] iso). (Currently disabled - Reworking for F13)&lt;br /&gt;
&lt;br /&gt;
== Post OS Install ==&lt;br /&gt;
&lt;br /&gt;
Once the OS is installed, we need to do a few things:&lt;br /&gt;
&lt;br /&gt;
# Disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;.&lt;br /&gt;
# Change the default run-level.&lt;br /&gt;
# Set up networking.&lt;br /&gt;
&lt;br /&gt;
=== Disable &#039;selinux&#039; ===&lt;br /&gt;
&lt;br /&gt;
To allow this paper to focus on clustering, we will disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;. Obviously, this introduces security issues that you may not be comfortable with. If you wish to leave &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt; enabled, it will be up to you to sort out issues that crop up.&lt;br /&gt;
&lt;br /&gt;
To disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/selinux/config&amp;lt;/span&amp;gt; and change &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=enforcing&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=permissive&amp;lt;/span&amp;gt; (which logs violations without blocking them; use &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=disabled&amp;lt;/span&amp;gt; to turn it off entirely). You will need to reboot in order for the change to take effect.&lt;br /&gt;
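&lt;br /&gt;
If you prefer to make the change non-interactively, a &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;sed&amp;lt;/span&amp;gt; one-liner will do the same thing. This is only a convenience sketch; review the file afterwards:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
sed -i &#039;s/^SELINUX=enforcing/SELINUX=permissive/&#039; /etc/selinux/config&lt;br /&gt;
grep ^SELINUX= /etc/selinux/config&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;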
&lt;br /&gt;
=== Change the Default Run-Level ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. It only improves performance; it is not a required step.&lt;br /&gt;
&lt;br /&gt;
If you don&#039;t plan to work on your nodes directly, it makes sense to switch the default run level from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;3&amp;lt;/span&amp;gt;. This prevents the window manager, like Gnome or KDE, from starting at boot, which frees up a fair bit of memory and system resources and reduces the possible attack vectors.&lt;br /&gt;
&lt;br /&gt;
To do this, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/inittab&amp;lt;/span&amp;gt;, change the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:5:initdefault:&amp;lt;/span&amp;gt; line to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:3:initdefault:&amp;lt;/span&amp;gt; and then switch to run level 3:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/inittab&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
id:3:initdefault:&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
init 3&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Setup Networking ===&lt;br /&gt;
&lt;br /&gt;
We need to remove &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt;, enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt;, configure the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ifcfg-eth*&amp;lt;/span&amp;gt; files and then start the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; daemon.&lt;br /&gt;
&lt;br /&gt;
==== Network Layout ====&lt;br /&gt;
&lt;br /&gt;
This setup expects you to have three physical network cards connected to three independent networks. To establish a common vernacular, let&#039;s use this table to describe them:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!class=&amp;quot;cell_all&amp;quot;|Network Description&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Short Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Device Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Suggested Subnet&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|NIC Properties&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Internet-Facing Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Remaining NIC should be used here.&amp;lt;br /&amp;gt;If using a [[Setting Up a PXE Server in Fedora|PXE server]], this should be a bootable NIC.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Storage Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth1&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Fastest NIC should be used here.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Back-Channel Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth2&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|NICs with [[IPMI]] piggy-back &#039;&#039;&#039;must&#039;&#039;&#039; be used here.&amp;lt;br /&amp;gt;Second-fastest NIC should be used here.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Take note of these concerns when planning which NIC to use on each subnet. These issues are presented in the order in which they must be addressed:&lt;br /&gt;
# If your nodes have [[IPMI]] piggy-backing on a normal NIC, that NIC &#039;&#039;&#039;&#039;&#039;must&#039;&#039;&#039;&#039;&#039; be used on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. Having your fence device accessible on a remotely reachable subnet can pose a &#039;&#039;major&#039;&#039; security risk.&lt;br /&gt;
# The fastest NIC should be used for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt; subnet. Be sure to know which NICs support the largest jumbo frames when considering this.&lt;br /&gt;
# If you still have two NICs to choose from, use the fastest remaining NIC for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. This will minimize the time it takes to perform tasks like hot-migration of live virtual machines.&lt;br /&gt;
# The final NIC should be used for the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt; subnet.&lt;br /&gt;
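&lt;br /&gt;
If you are unsure which NIC is fastest, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethtool&amp;lt;/span&amp;gt; can report the current link speed. Assuming it is installed, run it against each interface in turn:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ethtool eth0 | grep -i speed&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Repeat for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth1&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth2&amp;lt;/span&amp;gt;.&lt;br /&gt;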
&lt;br /&gt;
==== Node IP Addresses ====&lt;br /&gt;
&lt;br /&gt;
Obviously, the IP addresses you give to your nodes should be ones that suit you best. In this example, the following IP addresses are used:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;amp;nbsp;&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Internet-Facing Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Storage Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Back-Channel Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node01&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node02&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==== Remove NetworkManager ====&lt;br /&gt;
&lt;br /&gt;
Some cluster software &#039;&#039;&#039;will not&#039;&#039;&#039; start with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; installed. This is because NetworkManager is designed to be a highly-adaptive network system that can accommodate frequent changes in the network. To simplify these network transitions for end-users, it does a lot of network reconfiguration behind the scenes.&lt;br /&gt;
&lt;br /&gt;
For workstations, this is wonderful. For clustering, this can be disastrous. Transient network outages are already a risk to a cluster&#039;s stability!&lt;br /&gt;
&lt;br /&gt;
So first up, make sure that &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is completely removed from your system. If you used the [[#OS Install|kickstart]] scripts, then it was not installed. Otherwise, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum remove NetworkManager NetworkManager-gnome NetworkManager-openconnect NetworkManager-openvpn NetworkManager-pptp NetworkManager-vpnc cnetworkmanager knetworkmanager knetworkmanager-openvpn knetworkmanager-pptp knetworkmanager-vpnc libproxy-networkmanager yum-NetworkManager-dispatcher&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
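&lt;br /&gt;
To confirm that &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is gone, you can query the RPM database; it should report that the package is not installed:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
rpm -q NetworkManager&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;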
&lt;br /&gt;
==== Setup &#039;network&#039; ====&lt;br /&gt;
&lt;br /&gt;
Before proceeding with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; configuration, check to see if your network cards are aligned to the proper &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; network names. If they need to be adjusted, please follow this How-To before proceeding:&lt;br /&gt;
&lt;br /&gt;
* [[Changing the ethX to Ethernet Device Mapping in Fedora]]&lt;br /&gt;
&lt;br /&gt;
There are a few ways to configure &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; in Fedora:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network&amp;lt;/span&amp;gt; (graphical)&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network-tui&amp;lt;/span&amp;gt; (ncurses)&lt;br /&gt;
* Directly editing the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/sysconfig/network-scripts/ifcfg-eth*&amp;lt;/span&amp;gt; files. (See: [http://docs.fedoraproject.org/en-US/Fedora/12/html/Deployment_Guide/s1-networkscripts-interfaces.html here] for a full list of options)&lt;br /&gt;
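&lt;br /&gt;
For illustration only, a minimal static configuration for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt; on &#039;&#039;&#039;an-node01&#039;&#039;&#039; might look like the sketch below. The &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;GATEWAY&amp;lt;/span&amp;gt; value is an assumed example; substitute your own router&#039;s address:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/sysconfig/network-scripts/ifcfg-eth0&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
DEVICE=eth0&lt;br /&gt;
HWADDR=90:E6:BA:71:82:EA&lt;br /&gt;
BOOTPROTO=static&lt;br /&gt;
IPADDR=192.168.1.71&lt;br /&gt;
NETMASK=255.255.255.0&lt;br /&gt;
GATEWAY=192.168.1.254&lt;br /&gt;
ONBOOT=yes&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;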
&lt;br /&gt;
Do not proceed until your node&#039;s networking is fully configured.&lt;br /&gt;
&lt;br /&gt;
==== Update the Hosts File ====&lt;br /&gt;
&lt;br /&gt;
Some applications expect to be able to call nodes by their name. To accommodate this, and to ensure that inter-node communication takes place on the back-channel subnet, we remove any existing hostname entries and then add the following to the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Any pre-existing entries matching the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; must be removed from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt;. There is a good chance there will be an entry that resolves to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;127.0.0.1&amp;lt;/span&amp;gt; which would cause problems later.&lt;br /&gt;
&lt;br /&gt;
Obviously, adapt the names and IPs to match your nodes and subnets. The only critical thing is to make sure that the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; is resolvable to the back-channel subnet. I like to add a short-form name for convenience.&lt;br /&gt;
&lt;br /&gt;
The updated &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file should look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/hosts&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4&lt;br /&gt;
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6&lt;br /&gt;
&lt;br /&gt;
# Back-channel IPs to name mapping.&lt;br /&gt;
10.0.1.71	an-node01 an-node01.alteeve.com&lt;br /&gt;
10.0.1.72	an-node02 an-node02.alteeve.com&lt;br /&gt;
&lt;br /&gt;
# Storage network&lt;br /&gt;
10.0.0.71	an-node01.sn&lt;br /&gt;
10.0.0.72	an-node02.sn&lt;br /&gt;
&lt;br /&gt;
# Internet facing&lt;br /&gt;
192.168.1.71	an-node01.inet&lt;br /&gt;
192.168.1.72	an-node02.inet&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To test this, ping both nodes by name and make sure the ping packets are sent on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt; subnet:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node01&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node01 (10.0.1.71) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=1 ttl=64 time=0.049 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=2 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=3 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=4 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=5 ttl=64 time=0.055 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node02 (10.0.1.72) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=1 ttl=64 time=0.221 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=2 ttl=64 time=0.188 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=3 ttl=64 time=0.217 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=4 ttl=64 time=0.192 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=5 ttl=64 time=0.163 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Disable Firewalls ====&lt;br /&gt;
&lt;br /&gt;
Be sure to flush the netfilter tables, stop &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ip6tables&amp;lt;/span&amp;gt;, and keep them from starting at boot on our nodes.&lt;br /&gt;
&lt;br /&gt;
There will be enough potential sources of problems as it is. Disabling firewalls at this stage minimizes the chance of an errant &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; rule messing up our configuration. If, before launch, you wish to implement a firewall, feel free to do so, but be sure to thoroughly test your cluster to ensure no problems were introduced.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig --level 2345 iptables off&lt;br /&gt;
/etc/init.d/iptables stop&lt;br /&gt;
chkconfig --level 2345 ip6tables off&lt;br /&gt;
/etc/init.d/ip6tables stop&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup SSH Shared Keys ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. Setting up shared keys will allow your nodes to pass files between one another and execute commands remotely without needing to enter a password. This is obviously somewhat risky from a security point of view. As such, it is up to you whether you do this or not. This is not meant to be a security-focused How-To, so please independently study the risks.&lt;br /&gt;
&lt;br /&gt;
If you&#039;re a little new to [[SSH]], it can be a bit confusing keeping connections straight in your head. When you connect to a remote machine, you start the connection on your machine as the user you are logged in as; this is the source user. When you call the remote machine, you tell it what user you want to log in as; this is the remote user.&lt;br /&gt;
&lt;br /&gt;
You will need to create an SSH key for each source user, and then you will need to copy the newly generated public key to each remote machine&#039;s user directory that you want to connect to. In this example, we want to connect to either node, from either node, as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. So we will create a key for each node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user and then copy the generated public key to the &#039;&#039;other&#039;&#039; node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user&#039;s directory.&lt;br /&gt;
&lt;br /&gt;
For each user, on each machine you want to connect &#039;&#039;&#039;from&#039;&#039;&#039;, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh-keygen -t dsa -N &amp;quot;&amp;quot; -f ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
Generating public/private dsa key pair.&lt;br /&gt;
Your identification has been saved in /root/.ssh/id_dsa.&lt;br /&gt;
Your public key has been saved in /root/.ssh/id_dsa.pub.&lt;br /&gt;
The key fingerprint is:&lt;br /&gt;
c4:cc:10:0a:6b:60:24:e7:57:94:69:e5:10:b2:cb:1f root@an-node01.alteeve.com&lt;br /&gt;
The key&#039;s randomart image is:&lt;br /&gt;
+--[ DSA 1024]----+&lt;br /&gt;
|+oo ..B*.        |&lt;br /&gt;
|o+ o =+B         |&lt;br /&gt;
|  + +.  *        |&lt;br /&gt;
| . o . .         |&lt;br /&gt;
|    o E S        |&lt;br /&gt;
|     . .         |&lt;br /&gt;
|      .          |&lt;br /&gt;
|                 |&lt;br /&gt;
|                 |&lt;br /&gt;
+-----------------+&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will create two files: the private key called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa&amp;lt;/span&amp;gt; and the public key called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa.pub&amp;lt;/span&amp;gt;. The private key &#039;&#039;&#039;&#039;&#039;must never&#039;&#039;&#039;&#039;&#039; be group or world readable! That is, it should be set to mode &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;0600&amp;lt;/span&amp;gt;.&lt;br /&gt;
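&lt;br /&gt;
If the mode needs correcting, do so before proceeding:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chmod 600 ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;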
&lt;br /&gt;
The two files should look like:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Private key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
-----BEGIN DSA PRIVATE KEY-----&lt;br /&gt;
MIIBugIBAAKBgQCVo5nzeC/W8HWaFkD4pAmWO7ovf7S01f8HgocKKOYX/nMNxRui&lt;br /&gt;
wExBUUUt1b2TxYYQklTd9dn8UAzZ3UohN0CJNDrp//t2wfyACKKc2q+ewao5/svQ&lt;br /&gt;
xXTBu2ZPebKPcr1w9p0hq0xSr77Rg3v6Auc6AnC79DM9XkhYkVgNfgrvMwIVAKdY&lt;br /&gt;
SgTycvqNEWsJrCTTn5485yWvAoGANQ3rzIxFy2zHWyMlrpA9r5jllgxfaJVx3/Iw&lt;br /&gt;
DrEF82lr2dOmG8oYiKAhfvdgNyV7VeM2yxdjcdLPJAHHCx5BZZ7V9KjMzY26wS2r&lt;br /&gt;
d58TZJoWmO0nscmcwIbCv2at3fiMpxgr+nzD4tKxFOdWxYBwdmUpIjtJoW8wzY2H&lt;br /&gt;
uNghjL0CgYA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDN&lt;br /&gt;
bbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EE&lt;br /&gt;
D3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5AIUFX97yIig&lt;br /&gt;
n2SUltN+10NWUy4oYsA=&lt;br /&gt;
-----END DSA PRIVATE KEY-----&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Public key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa.pub&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Copy the public key and then &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; normally into the remote machine as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create a file called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; and paste in the key.&lt;br /&gt;
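&lt;br /&gt;
As an aside, on systems where OpenSSH&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh-copy-id&amp;lt;/span&amp;gt; helper is available, it can perform the copy and append steps in one command. This is purely a convenience; the manual steps that follow achieve the same result:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh-copy-id -i ~/.ssh/id_dsa.pub root@an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;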
&lt;br /&gt;
From &#039;&#039;&#039;an-node01&#039;&#039;&#039;, type:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh root@an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
The authenticity of host &#039;an-node02 (10.0.1.72)&#039; can&#039;t be established.&lt;br /&gt;
RSA key fingerprint is d4:1b:68:5f:fa:ef:0f:0c:16:e7:f9:0a:8d:69:3e:c5.&lt;br /&gt;
Are you sure you want to continue connecting (yes/no)? yes&lt;br /&gt;
Warning: Permanently added &#039;an-node02,10.0.1.72&#039; (RSA) to the list of known hosts.&lt;br /&gt;
root@an-node02&#039;s password: &lt;br /&gt;
Last login: Tue Jul  6 22:28:19 2010 from 192.168.1.102&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will now be logged into &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02&amp;lt;/span&amp;gt; as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; file and paste into it the public key from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01&amp;lt;/span&amp;gt;. If the remote machine&#039;s user hasn&#039;t used &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; yet, their &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh&amp;lt;/span&amp;gt; directory will not exist.&lt;br /&gt;
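If you are creating the directory and file by hand, note that sshd silently ignores keys when the permissions on ~/.ssh or authorized_keys are too loose. A minimal sketch of creating them with the strict modes sshd expects (this is general OpenSSH behaviour, not specific to this cluster):

```bash
# Create the remote user's SSH directory and authorized_keys file with
# the strict permissions sshd requires before accepting key authentication.
mkdir -p ~/.ssh
chmod 700 ~/.ssh
touch ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```

On systems that ship it, `ssh-copy-id root@an-node02` automates copying the key and setting these permissions in one step.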
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/authorized_keys&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now log out and then log back into the remote machine. This time, the connection should succeed without prompting for a password!&lt;br /&gt;
&lt;br /&gt;
= Initial Cluster Setup =&lt;br /&gt;
&lt;br /&gt;
Before we get into specifics, let&#039;s take a minute to talk about the major components used in our cluster.&lt;br /&gt;
&lt;br /&gt;
== Core Program Overviews ==&lt;br /&gt;
&lt;br /&gt;
These are the core programs used to build our cluster that may be new to you. Before we configure them, let&#039;s take a minute to understand their roles.&lt;br /&gt;
&lt;br /&gt;
=== Cluster Manager ===&lt;br /&gt;
&lt;br /&gt;
The cluster manager, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, reads the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; configuration file and uses it to configure and start the various cluster elements. This is where nodes are defined, fence devices are configured and various cluster tolerances are set.&lt;br /&gt;
&lt;br /&gt;
=== Corosync ===&lt;br /&gt;
&lt;br /&gt;
[http://www.corosync.org Corosync] is, essentially, the clustering kernel which provides the essential services for cluster-aware applications and other cluster services. At its core is the [[totem]] protocol which provides cluster membership and ordered messaging. On top of this, it implements a number of core clustering services like [[#cpg; Closed Process Group|closed process group messaging]], [[quorum]] and [[confdb]] (an object database).&lt;br /&gt;
&lt;br /&gt;
Its goal is to provide a substantially simpler and more flexible set of APIs to facilitate clustering in Linux. It manages which nodes are in the cluster, triggers error messages when something fails, manages cluster locking and so on. Most other clustered applications rely on corosync to know when something has happened or to announce when a cluster-related action has taken place.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;corosync_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
==== cpg; Closed Process Group ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install corosynclib-devel&lt;br /&gt;
man cpg_overview&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The closed process group is a small process layer on top of the totem protocol provided by corosync. It handles the sending and delivery of messages amongst nodes in a consistent order. It adds [[PID]]s and group names to the membership layer. Only members of the group get the messages, thus it is a &amp;quot;closed&amp;quot; group.&lt;br /&gt;
&lt;br /&gt;
=== OpenAIS ===&lt;br /&gt;
&lt;br /&gt;
OpenAIS is now an extension to Corosync that adds an open-source implementation of the [http://www.saforum.org/ Service Availability (SA) Forum&#039;s] &#039;Application Interface Specification&#039; ([[AIS]]). It is an [[API]] and policy designed to be used by applications concerned with maintaining services during faults. AIS implements the &#039;Availability Management Framework&#039; ([[AMF]]) which, in turn, provides for application fail over, cluster management ([[CLM]]), Checkpointing ([[CKPT]]), Eventing ([[EVT]]), Messaging ([[MSG]]), and Distributed Locking ([[DLOCK]]).&lt;br /&gt;
&lt;br /&gt;
It does this by implementing pluggable shared libraries that get loaded into corosync at runtime. These libraries then can make use of corosync&#039;s internal API. Both corosync and openais services provide [[IPC]] interfaces so that application programmers can make use of their behaviour.&lt;br /&gt;
&lt;br /&gt;
In short: applications can use OpenAIS to be cluster-aware. Its libraries are used by some applications, including Pacemaker. In our application, we will only be using its libraries indirectly.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
[http://www.clusterlabs.org Pacemaker] is the cluster resource manager. It can trigger scripts to control non-cluster-aware applications, allowing them to be made highly available. It is considered a replacement for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[rgmanager]]&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Setup Core Programs ==&lt;br /&gt;
&lt;br /&gt;
We now need to edit and create the configuration files for our cluster.&lt;br /&gt;
&lt;br /&gt;
=== Setup cman ===&lt;br /&gt;
&lt;br /&gt;
For many simple clusters, this may be the only section you need to configure.&lt;br /&gt;
&lt;br /&gt;
First, install the cluster manager:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install cman&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once installed, we need to create and fill in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; file. This is the core configuration file of our cluster and is an [[XML]] formatted file.&lt;br /&gt;
&lt;br /&gt;
Please see: [[Two-Node Fedora 13 cluster.conf]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The example in that article details many of the common options you will want to use. For a more complete discussion of this file, please run &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man 5 cluster.conf&amp;lt;/span&amp;gt;. Also, please be aware that some arguments can overlap between [[#Setup Corosync|corosync]] and this file.&lt;br /&gt;
&lt;br /&gt;
Once it is set up, we need to verify it. After you have finished editing, be sure to run this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
xmllint --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are errors, fix them. Once it is formatted properly, the contents of your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file will be returned, followed by &amp;quot;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf validates&amp;lt;/span&amp;gt;&amp;quot;. Once this is the case, you can move on.&lt;br /&gt;
&lt;br /&gt;
A note: normally at this stage, you&#039;d use &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;chkconfig&amp;lt;/span&amp;gt; to enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; to start at boot. However, depending on how you use your cluster, you may need to modify the boot order. For now, let&#039;s leave it off until we&#039;ve got all the components enabled. This is particularly important if you will be using [[DRBD]], [[CLVM]] or [[Xen]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig cman off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To see if it&#039;s working though, manually start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; on both nodes. I suggest having two terminal windows open on each node. In one, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;tail&amp;lt;/span&amp;gt;ing &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/var/log/messages&amp;lt;/span&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
clear; tail -f -n 0 /var/log/messages&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now, with both log files being watched, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; on both nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: You MUST execute the following on both nodes within the time limit set by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;post_join_delay&amp;lt;/span&amp;gt; (the default is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;6&amp;lt;/span&amp;gt; seconds). If you wait longer than this, the first node started will [[fence]] the other node, assuming you have fencing configured properly. If fencing is &#039;&#039;&#039;not&#039;&#039;&#039; configured properly, your first node will hang.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
/etc/init.d/cman start&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Examine the log files to verify there were no errors. If there were, fix them. If there were none, fantastic!&lt;br /&gt;
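If your nodes routinely take longer than the post_join_delay window to start cman, you can widen the window in cluster.conf rather than racing the clock. A hedged fragment: the fence_daemon element and its post_join_delay attribute are described in `man 8 fenced`, but the 60-second value here is only an illustration, not a recommendation:

```xml
<!-- Child of <cluster>: wait 60 seconds for peers to join before fencing. -->
<fence_daemon post_join_delay="60"/>
```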
&lt;br /&gt;
&#039;&#039;&#039;Warning&#039;&#039;&#039;: This last step will forcefully reboot your nodes. Do not do this on a production cluster! If it must be done on a production server, stop all non-essential services and be sure you have a good backup, &#039;&#039;just in case&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Lastly, test that your fencing is configured and working properly. We do this by using a program called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_node&amp;lt;/span&amp;gt;. With both nodes up and running &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, call a fence against the other node. Be sure to be watching the log files.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;From&#039;&#039;&#039; &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
fence_node an-node02.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
fence an-node02.alteeve.com success&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If it worked, when the node comes back up, repeat the process from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; against &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Setup Corosync ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: In our cluster, we will not need to configure Corosync.&lt;br /&gt;
&lt;br /&gt;
When using &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, as we are here, Corosync&#039;s configuration file is ignored.&lt;br /&gt;
&lt;br /&gt;
Most of the settings available in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 corosync.conf|corosync.conf]]&amp;lt;/span&amp;gt; can be found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|cluster.conf]]&amp;lt;/span&amp;gt; file. Please see the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; article for a comprehensive list of what is and is not supported.&lt;br /&gt;
&lt;br /&gt;
If you want to configure corosync and not use cman, please see the following article:&lt;br /&gt;
&lt;br /&gt;
* [[Two-Node Fedora 13 corosync.conf]]&lt;br /&gt;
&lt;br /&gt;
=== Setup Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
ToDo.&lt;br /&gt;
&lt;br /&gt;
= Fencing =&lt;br /&gt;
&lt;br /&gt;
Before proceeding with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file, you must understand what fencing is, how it is used in Red Hat/CentOS clusters and why it is so important.&lt;br /&gt;
&lt;br /&gt;
* The Cluster Admin&#039;s Mantra:&lt;br /&gt;
** &#039;&#039;&#039;The only thing you don&#039;t know is what you don&#039;t know&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Just because one node loses communication with another node, it &#039;&#039;&#039;cannot&#039;&#039;&#039; be assumed that the silent node is dead!&lt;br /&gt;
&lt;br /&gt;
== What is it? ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Fencing&amp;quot; is the act of isolating a malfunctioning node. The goal is to prevent a &#039;&#039;&#039;split-brain&#039;&#039;&#039; condition where two nodes think the other member is dead and continue to use a shared resource. When this happens, file system corruption is almost guaranteed. Another dangerous scenario: one node pauses while writing to a disk, the other node decides it is dead and starts to replay the journal, then the first node recovers and completes the write. The results would be equally disastrous. If you are lucky enough to not lose the shared file system, you will be faced with the task of determining what data got written to which node, merging that data and/or overwriting the node you trust the least. This &#039;best case&#039; is still pretty lousy.&lt;br /&gt;
&lt;br /&gt;
Fencing, isolating a node from altering shared disks, can be accomplished in a couple of ways:&lt;br /&gt;
&lt;br /&gt;
* Power&lt;br /&gt;
** Power fencing is where a device is used to cut the power to a malfunctioning node. This is probably the most common type.&lt;br /&gt;
* Blocking&lt;br /&gt;
** Blocking is often implemented at the network level. This type of fencing leaves the node alone, but disconnects it from the storage network. Often this is done by a switch which prevents traffic coming from the fenced node.&lt;br /&gt;
&lt;br /&gt;
With power fencing, the term used is &amp;quot;STONITH&amp;quot;, literally, &#039;&#039;&#039;S&#039;&#039;&#039;hoot &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;O&#039;&#039;&#039;ther &#039;&#039;&#039;N&#039;&#039;&#039;ode &#039;&#039;&#039;I&#039;&#039;&#039;n &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;H&#039;&#039;&#039;ead. Picture it like an old west duel. If one node is dead, the other node is going to win the duel by default and the dead node will just be shot again. When both nodes are alive, however, the faster node will win and will &amp;quot;kill&amp;quot; (power off or reset) the slower node before it has a chance to fire. Once this duel is over, the surviving node can then access the shared resource confident that it is the only one working on it.&lt;br /&gt;
&lt;br /&gt;
== Misconception ==&lt;br /&gt;
&lt;br /&gt;
It is a &#039;&#039;&#039;very&#039;&#039;&#039; common mistake to ignore fencing when first starting to learn about clustering. Often people think &#039;&#039;&amp;quot;It&#039;s just for production systems, I don&#039;t need to worry about it yet because I don&#039;t care what happens to my test cluster.&amp;quot;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Wrong!&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For the most practical of reasons: the cluster software will block all I/O transactions when it can&#039;t guarantee that a fence operation succeeded. The result is that your cluster will essentially &amp;quot;lock up&amp;quot;. Likewise, [[cman]] and related daemons will fail if they can&#039;t find a fence agent to use.&lt;br /&gt;
&lt;br /&gt;
Secondly, testing our cluster will involve inducing errors. Without proper fencing, there is a high probability that our shared file system will be corrupted. That would force the need to start over, making your learning take a lot longer than it needs to.&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
How you implement fencing depends largely on what kind of fence device(s) you have. This is where I need help from you, dear reader. I only have access to [[Node Assassin]] fence devices on my lab&#039;s cluster. If you have other fence devices, like [[IPMI]], [[DRAC]] or so on, please let me know so that I can expand this section.&lt;br /&gt;
&lt;br /&gt;
=== Implementing Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
In Red Hat&#039;s cluster software, the fence device(s) are configured in the main &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf&amp;lt;/span&amp;gt; cluster configuration file. This configuration is then acted on via the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon. We&#039;ll cover the details of the [[cluster.conf]] file in a moment.&lt;br /&gt;
&lt;br /&gt;
When the cluster determines that a node needs to be fenced, the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon will consult the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file for information on how to access the fence device. Given this &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; snippet:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;xml&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;cluster name=&amp;quot;an-cluster&amp;quot; config_version=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
	&amp;lt;clusternodes&amp;gt;&lt;br /&gt;
		&amp;lt;clusternode name=&amp;quot;an-node02.alteeve.com&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
			&amp;lt;fence&amp;gt;&lt;br /&gt;
				&amp;lt;method name=&amp;quot;node_assassin&amp;quot;&amp;gt;&lt;br /&gt;
					&amp;lt;device name=&amp;quot;motoko&amp;quot; port=&amp;quot;02&amp;quot; action=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
				&amp;lt;/method&amp;gt;&lt;br /&gt;
			&amp;lt;/fence&amp;gt;&lt;br /&gt;
		&amp;lt;/clusternode&amp;gt;&lt;br /&gt;
	&amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
	&amp;lt;fencedevices&amp;gt;&lt;br /&gt;
		&amp;lt;fencedevice name=&amp;quot;motoko&amp;quot; agent=&amp;quot;fence_na&amp;quot; quiet=&amp;quot;true&amp;quot;&lt;br /&gt;
		ipaddr=&amp;quot;motoko.alteeve.com&amp;quot; login=&amp;quot;motoko&amp;quot; passwd=&amp;quot;secret&amp;quot;&amp;gt;&lt;br /&gt;
		&amp;lt;/fencedevice&amp;gt;&lt;br /&gt;
	&amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If the cluster manager determines that the node &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; needs to be fenced, it looks at the first (and only, in this case) &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;method&amp;gt;&amp;lt;/span&amp;gt; entry in that node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fence&amp;gt;&amp;lt;/span&amp;gt; section and reads its &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;device&amp;lt;/span&amp;gt; &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;, which is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; in this case. It then looks in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fencedevices&amp;gt;&amp;lt;/span&amp;gt; section for the device with the matching &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;. From there, it gets the information needed to find and access the fence device. Once it connects to the fence device, it then passes along the options set in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;device&amp;gt;&amp;lt;/span&amp;gt; entry.&lt;br /&gt;
&lt;br /&gt;
So in this example, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; looks up the details on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; Node Assassin fence device. It calls the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt; program, called a fence agent, and passes the following arguments:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr=motoko.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login=motoko&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd=secret&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;quiet=true&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port=02&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action=off&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
How the fence agent acts on these arguments varies depending on the fence device itself. In general terms, the &#039;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt;&#039; fence agent will create a connection to the device at the IP address (or resolvable name, as in this case) specified in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr&amp;lt;/span&amp;gt; argument. Once connected, it will authenticate using the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd&amp;lt;/span&amp;gt; arguments. Once authenticated, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port&amp;lt;/span&amp;gt; to act on, which could be a power jack, a power or reset button, a network switch port and so on. Finally, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action&amp;lt;/span&amp;gt; to take.&lt;br /&gt;
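The hand-off to the agent follows the Red Hat fence-agent convention of passing the options as key=value lines on the agent's standard input. Below is a toy sketch of the parsing side only, with the input simulated inline; it is illustrative, not a real fence agent:

```bash
#!/usr/bin/env bash
# Toy sketch of fence-agent argument parsing: fenced feeds the options
# defined in cluster.conf to the agent as key=value lines on stdin.
parse_args() {
    local key value
    while IFS='=' read -r key value; do
        [ -n "$key" ] && opt[$key]=$value
    done
}

declare -A opt
# Simulate what fenced would send for the cluster.conf snippet above.
parse_args <<'EOF'
ipaddr=motoko.alteeve.com
login=motoko
passwd=secret
port=02
action=off
EOF

# A real agent would now connect to ${opt[ipaddr]}, authenticate with
# ${opt[login]}/${opt[passwd]}, then apply ${opt[action]} to ${opt[port]}.
echo "${opt[action]} port ${opt[port]} on ${opt[ipaddr]}"
```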
&lt;br /&gt;
Once the device completes, it returns a success or failure message. If the first attempt fails, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; will try the next &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fence&amp;gt;&amp;lt;/span&amp;gt; method, if a second exists. It will keep trying fence devices in the order they are found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file until it runs out of devices. If it fails to fence the node, most daemons will &amp;quot;block&amp;quot;, that is, lock up and stop responding until the issue is resolved. The logic for this is that a locked up cluster is better than a corrupted one.&lt;br /&gt;
&lt;br /&gt;
If any of the fence devices succeed though, the cluster will know that it is safe to proceed and will reconfigure the cluster without the defective node.&lt;br /&gt;
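The escalation logic just described can be sketched in shell terms. The method names below are stand-ins for configured fence methods, not real agents:

```bash
#!/usr/bin/env bash
# Sketch of the fallback behaviour: try each configured fence method in
# cluster.conf order; stop at the first success, give up if all fail.
fence_node_sketch() {
    local method
    for method in "$@"; do
        if "$method"; then
            echo "fence succeeded via $method"
            return 0
        fi
    done
    echo "all fence methods failed: blocking"
    return 1
}

# Stand-in agents: the first device is unreachable, the second works.
method_one() { return 1; }
method_two() { return 0; }

fence_node_sketch method_one method_two
```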
&lt;br /&gt;
== Fence Devices ==&lt;br /&gt;
&lt;br /&gt;
Many major [[OEM]]s have their own remote management devices that can serve as fence devices. Examples are [http://dell.ca Dell]&#039;s &#039;DRAC&#039; (Dell Remote Access Controller), [http://hp.ca HP]&#039;s &#039;iLO&#039; (Integrated Lights-Out), [http://ibm.ca IBM]&#039;s &#039;RSA&#039; (Remote Supervisor Adapter), [http://sun.ca Sun]&#039;s &#039;SSP&#039; (System Service Processor) and so on. Smaller manufacturers implement remote management via [[IPMI]], the Intelligent Platform Management Interface.&lt;br /&gt;
&lt;br /&gt;
In the above devices, fencing is implemented via a built-in or integrated device inside the server. These devices are usually accessible even when the host server is powered off or hard locked. Via these devices, the host server can be powered off, reset and powered on remotely, regardless of the state of the host server.&lt;br /&gt;
&lt;br /&gt;
Block fencing is possible when the device connecting a node to shared resources, like a Fibre Channel SAN switch, provides a method of logically &amp;quot;unplugging&amp;quot; a defective node from the shared resource, leaving the node itself alone.&lt;br /&gt;
&lt;br /&gt;
=== IPMI ===&lt;br /&gt;
&lt;br /&gt;
[[IPMI]] is perhaps the most common method used for fencing nodes in a cluster. It requires having a system board with an IPMI baseboard management controller (BMC). IPMI is often the underlying technology used by many OEM-branded remote access tools. If you don&#039;t see a specific fence agent for your server&#039;s remote access application, experiment with generic IPMI tools.&lt;br /&gt;
&lt;br /&gt;
To start, we need to install the IPMI user software:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install OpenIPMI ipmitool&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
A cheap alternative is the [[Node Assassin]], an open-hardware, open-source fence device. It was built to allow the use of commodity system boards that lack the remote management support found on more expensive, server-class hardware.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Full Disclosure&#039;&#039;&#039;: Node Assassin was created by me, with much help from others, for this paper.&lt;br /&gt;
&lt;br /&gt;
= Clean Up =&lt;br /&gt;
&lt;br /&gt;
Some daemons may have been installed or dragged in during the setup of the cluster. Most notable is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;heartbeat&amp;lt;/span&amp;gt;, which is not compatible with [[RHCS]]. The following command will disable various daemons from starting at boot.&lt;br /&gt;
&lt;br /&gt;
The command is safe to run even when some of the daemons aren&#039;t installed or have already been removed. Of course, skip the ones you actually want to use. This assumes that the nodes have been built according to this HowTo.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig heartbeat off; chkconfig iscsid off; chkconfig iptables off; chkconfig ip6tables off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= So You Have a Cluster - Now What? =&lt;br /&gt;
&lt;br /&gt;
Building the cluster infrastructure was pretty easy, wasn&#039;t it?&lt;br /&gt;
&lt;br /&gt;
But what will you &#039;&#039;&#039;&#039;&#039;do&#039;&#039;&#039;&#039;&#039; with it?&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;width: 75%; text-align: center; padding: 10px; border-spacing: 15px;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!style=&amp;quot;width: 100%; border: 1px solid #dfdfdf;&amp;quot; colspan=&amp;quot;3&amp;quot;|Choose Your Own &amp;lt;span style=&amp;quot;text-decoration: line-through; color: #7f7f7f;&amp;quot;&amp;gt;Adventure&amp;lt;/span&amp;gt; Cluster&lt;br /&gt;
|-&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|style=&amp;quot;width: 34%; border: 1px solid #dfdfdf;&amp;quot;| [[Two Node Fedora 13 Cluster - Xen-Based Virtual Machine Host on DRBD+CLVM|Xen-Based Virtual Machine Host on DRBD+CLVM]]&amp;lt;br /&amp;gt;High Availability VM Host&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Thanks =&lt;br /&gt;
&lt;br /&gt;
* A &#039;&#039;&#039;huge&#039;&#039;&#039; thanks to [http://iplink.net Interlink Connectivity]! They hire me as a contractor and have allowed me to extend these docs while working on their clusters. Development of these How-Tos would be much slower if not for them. If you need hosting or colo services, drop them a line. Their website is a bit out of date though, so disregard the prices. :)&lt;br /&gt;
* To &#039;&#039;&#039;sdake&#039;&#039;&#039; of [http://corosync.org corosync] for helping me sort out the &#039;&#039;&#039;plock&#039;&#039;&#039; component and corosync in general.&lt;br /&gt;
* To &#039;&#039;&#039;Angus Salkeld&#039;&#039;&#039; for helping me nail down the Corosync and OpenAIS differences.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013922.html HJ Lee] from the OpenAIS list for helping me understand the mechanisms controlling the Redundant Ring Protocol&#039;s failure detection types.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013925.html Steven Dake] for clarifying the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;to_x&amp;lt;/span&amp;gt; vs. &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;logoutput: x&amp;lt;/span&amp;gt; arguments in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais.conf&amp;lt;/span&amp;gt;.&lt;br /&gt;
* To [http://dk.linkedin.com/in/fabbione Fabio Massimo Di Nitto] for helping me get caught up with clustering and VMs.&lt;br /&gt;
&lt;br /&gt;
{{footer}}&lt;/div&gt;</summary>
		<author><name>Facade</name></author>
	</entry>
	<entry>
		<id>https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2114</id>
		<title>Abandoned - Two Node Fedora 13 Cluster</title>
		<link rel="alternate" type="text/html" href="https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2114"/>
		<updated>2010-09-09T22:28:29Z</updated>

		<summary type="html">&lt;p&gt;Facade: /* Setup &amp;#039;network&amp;#039; */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{howto_header}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Notice&#039;&#039;&#039;&#039;&#039;, &#039;&#039;&#039;Sep. 08, 2010&#039;&#039;&#039; - &#039;&#039;Draft 1&#039;&#039;: This is the first draft of this HowTo. It has not been peer-reviewed yet, so please proceed with caution and on a test cluster. Any and all feedback is much appreciated.&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
&lt;br /&gt;
This paper has one goal:&lt;br /&gt;
&lt;br /&gt;
# How to assemble the simplest cluster possible, a &#039;&#039;&#039;2-Node Cluster&#039;&#039;&#039;, which you can then expand on for your own needs.&lt;br /&gt;
&lt;br /&gt;
With this completed, you can then jump into &amp;quot;step 2&amp;quot; papers that will show various uses of a two node cluster:&lt;br /&gt;
&lt;br /&gt;
# How to create a &amp;quot;floating&amp;quot; virtual machine that can move between the two nodes in the event of a node failure, maximizing up time.&lt;br /&gt;
# How to create simple resources that can move between nodes. Examples will be a simple PostgreSQL database, DHCP, DNS and web servers.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
&lt;br /&gt;
It is expected that you are already comfortable with the Linux command line, specifically &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[bash]]&amp;lt;/span&amp;gt;, and that you are familiar with general administrative tasks in Red Hat based distributions, specifically [[Fedora]]. You will also need to be comfortable using editors like [[vim]], [[nano]] or similar. This paper uses &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;vim&amp;lt;/span&amp;gt; in examples. Simply substitute your favourite editor in its place.&lt;br /&gt;
&lt;br /&gt;
You are also expected to be comfortable with networking concepts: TCP/IP, [[multicast]], broadcast, subnets and netmasks, routing and other relatively basic topics. Please take the time to become familiar with these concepts before proceeding.&lt;br /&gt;
&lt;br /&gt;
That said, as much detail as is feasible will be provided. For example, all configuration file locations will be shown and functioning sample files will be provided.&lt;br /&gt;
&lt;br /&gt;
== Platform ==&lt;br /&gt;
&lt;br /&gt;
This paper will implement the [[Red Hat]] Cluster Suite using the Fedora v13 distribution. This paper uses the [[x86_64]] repositories, however, if you are on an [[i386]] (32 bit) system, you should be able to follow along fine. Simply replace &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;x86_64&amp;lt;/span&amp;gt; with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i386&amp;lt;/span&amp;gt; or &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i686&amp;lt;/span&amp;gt; in package names. &lt;br /&gt;
&lt;br /&gt;
You can either download the stock Fedora 13 DVD ISO (currently at version &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5.4&amp;lt;/span&amp;gt; which is used in this document), or you can try out the alpha [[#AN!Cluster Install|AN!Cluster Install]] DVD. (4.3GB iso). If you use the latter, please test it out on a development or test cluster. If you have any problems with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;AN!Cluster&amp;lt;/span&amp;gt; variant Fedora distro, please [[Digimer|contact me]] and let me know what your trouble was.&lt;br /&gt;
&lt;br /&gt;
== Why Fedora 13? ==&lt;br /&gt;
&lt;br /&gt;
Generally speaking, I much prefer to use a server-oriented Linux distribution like [[CentOS]], [[Debian]] or similar. However, there have been many recent changes in the Linux-Clustering world that have made all of the currently available server-class distributions obsolete. With luck, once [[Red Hat]] Enterprise Linux and [[CentOS]] version 6 are released, this will change.&lt;br /&gt;
&lt;br /&gt;
Until then, [[Fedora]] version 13 provides the most up-to-date binary releases of the new clustering stack. For this reason, Fedora 13 is the best choice if you want to be current. To mitigate some of the issues introduced by using a workstation distribution, many packages will be stripped out of the default install.&lt;br /&gt;
&lt;br /&gt;
== Focus ==&lt;br /&gt;
&lt;br /&gt;
Clusters can serve to solve three problems: &#039;&#039;&#039;Reliability&#039;&#039;&#039;, &#039;&#039;&#039;Performance&#039;&#039;&#039; and &#039;&#039;&#039;Scalability&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
The focus of the cluster described in this paper is primarily &#039;&#039;&#039;reliability&#039;&#039;&#039;. Second to this, &#039;&#039;&#039;scalability&#039;&#039;&#039; will be the priority, leaving &#039;&#039;&#039;performance&#039;&#039;&#039; to be addressed only when it does not impact the first two criteria. This is not to say that performance is not a valid priority; it simply isn&#039;t the priority of this paper.&lt;br /&gt;
&lt;br /&gt;
== Goal ==&lt;br /&gt;
&lt;br /&gt;
At the end of this paper, you should have a fully functioning two-node cluster capable of hosting &amp;quot;floating&amp;quot; resources; that is, resources that exist on one node and can be easily moved to the other node with minimal effort and down time. This should leave you with a solid foundation for adding more virtual servers, up to the limit of your cluster&#039;s resources.&lt;br /&gt;
&lt;br /&gt;
This paper should also serve to show how to build the foundation of any other cluster configuration. Its core focus is introducing the main issues that come with clustering, in the hope of serving as a starting point for configurations outside its scope.&lt;br /&gt;
&lt;br /&gt;
= Begin = &lt;br /&gt;
&lt;br /&gt;
Let&#039;s begin!&lt;br /&gt;
&lt;br /&gt;
== Hardware ==&lt;br /&gt;
&lt;br /&gt;
We will need two physical servers each with the following hardware:&lt;br /&gt;
* One or more multi-core [[CPU]]s with Virtualization support.&lt;br /&gt;
* Three network cards: at least one should be gigabit or faster.&lt;br /&gt;
* One or more hard drives.&lt;br /&gt;
* Some form of a [[fence|fence device]]. This can be an [[IPMI]]-enabled server, a [http://nodeassassin.org Node Assassin], a fenceable [http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900&amp;amp;tab=features PDU] or similar.&lt;br /&gt;
&lt;br /&gt;
This paper uses the following hardware:&lt;br /&gt;
* ASUS [http://support.asus.com/search/search.aspx?keyword=m4a78l-m&amp;amp;SLanguage=en-us M4A78L-M]&lt;br /&gt;
* AMD Athlon II x2 250 &lt;br /&gt;
* 2GB Kingston DDR2 KVR800D2N6K2/4G (4GB kit split between the two nodes)&lt;br /&gt;
* 1x Intel 82540 PCI NIC&lt;br /&gt;
* 1x D-Link DGE-560T&lt;br /&gt;
* Node Assassin&lt;br /&gt;
&lt;br /&gt;
This is not an endorsement of the above hardware. I bought what was within my budget that would serve the purposes of creating this document. What you purchase shouldn&#039;t matter, so long as the minimum requirements are met.&lt;br /&gt;
&lt;br /&gt;
== Pre-Assembly ==&lt;br /&gt;
&lt;br /&gt;
With multiple NICs, it is quite likely that the mapping of physical devices to logical &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; devices may not be ideal. This is a particular issue if you decide to [[Setting Up a PXE Server in Fedora|network boot]] your install media. In that case, if the wrong NIC is chosen for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;, you will be presented with a list of MAC addresses to attempt setup with.&lt;br /&gt;
&lt;br /&gt;
Before you assemble your servers, record their network cards&#039; [[MAC]] addresses. I like to keep simple text files like these:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node01.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:EA	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:53	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:46:E4	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node02.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:D8	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:5A	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:45:78	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Feel free to record the information in any way that suits you the best.&lt;br /&gt;
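If you need to collect the MAC addresses from a running system first, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;sysfs&amp;lt;/span&amp;gt; exposes them; a quick sketch:&lt;br /&gt;

```shell
# Print each network interface's name and MAC address from sysfs.
# Works on any Linux system; interface names will vary per machine.
for nic in /sys/class/net/*; do
    printf '%s\t%s\n' "$(basename "$nic")" "$(cat "$nic/address")"
done
```

The output can be pasted directly into the per-node &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;.mac&amp;lt;/span&amp;gt; files shown above.&lt;br /&gt;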
&lt;br /&gt;
== OS Install ==&lt;br /&gt;
&lt;br /&gt;
Start with a stock Fedora 13 install. This How-To uses Fedora 13 x86_64, however it should be fairly easy to adapt to other recent Fedora versions. This document also attempts to be easily ported to [[CentOS]]/[[RHEL]] version 6 once it is released. It will not adapt well to CentOS/RHEL version 5, though; much of the cluster stack has changed dramatically since version 5 was released.&lt;br /&gt;
&lt;br /&gt;
Below are the sample kickstart scripts used by this paper. Be sure to set your own password string and network settings.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Warning&#039;&#039;&#039;&#039;&#039;! These kickstart scripts &#039;&#039;&#039;&#039;&#039;will erase your hard drive&#039;&#039;&#039;&#039;&#039;! Adapt them, don&#039;t blindly use them.&lt;br /&gt;
&lt;br /&gt;
Generic cluster node kickstart scripts.&lt;br /&gt;
* [[Fedora13 KS an-node01.ks|an-node01.ks]]&lt;br /&gt;
* [[Fedora13 KS an-node02.ks|an-node02.ks]]&lt;br /&gt;
&lt;br /&gt;
== AN!Cluster Install ==&lt;br /&gt;
&lt;br /&gt;
If you are feeling brave, below is a link to a custom install DVD that contains the kickstart scripts to set up nodes and an &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-cluster&amp;lt;/span&amp;gt; directory with all the configuration files.&lt;br /&gt;
&lt;br /&gt;
* Download the custom &#039;&#039;&#039;AN!Cluster v0.2.001&#039;&#039;&#039; Install DVD. (4.5[[GiB]] iso). (Currently disabled - Reworking for F13)&lt;br /&gt;
&lt;br /&gt;
== Post OS Install ==&lt;br /&gt;
&lt;br /&gt;
Once the OS is installed, we need to do a few things:&lt;br /&gt;
&lt;br /&gt;
# Disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;.&lt;br /&gt;
# Change the default run-level.&lt;br /&gt;
# Set up networking.&lt;br /&gt;
&lt;br /&gt;
=== Disable &#039;selinux&#039; ===&lt;br /&gt;
&lt;br /&gt;
To allow this paper to focus on clustering, we will disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;. Obviously, this introduces security issues that you may not be comfortable with. If you wish to leave &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt; enabled, it will be up to you to sort out issues that crop up.&lt;br /&gt;
&lt;br /&gt;
To disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/selinux/config&amp;lt;/span&amp;gt; and change &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=enforcing&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=permissive&amp;lt;/span&amp;gt;. Permissive mode logs policy violations without enforcing them. You will need to reboot in order for the change to take effect.&lt;br /&gt;
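This edit can also be scripted. The sketch below runs against a scratch copy in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/tmp&amp;lt;/span&amp;gt; so it is safe to try; on a real node, point the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;sed&amp;lt;/span&amp;gt; at &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/selinux/config&amp;lt;/span&amp;gt; and reboot afterwards.&lt;br /&gt;

```shell
# Flip SELINUX=enforcing to SELINUX=permissive non-interactively.
# Demonstrated on a scratch copy; run the same sed against
# /etc/selinux/config on the node itself, then reboot.
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > /tmp/selinux-config
sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /tmp/selinux-config
grep '^SELINUX=' /tmp/selinux-config
```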
&lt;br /&gt;
=== Change the Default Run-Level ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. It improves performance only; it is not required.&lt;br /&gt;
&lt;br /&gt;
If you don&#039;t plan to work on your nodes directly, it makes sense to switch the default run level from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;3&amp;lt;/span&amp;gt;. This prevents the desktop environment, like GNOME or KDE, from starting at boot, which frees up a fair bit of memory and system resources and reduces the possible attack vectors.&lt;br /&gt;
&lt;br /&gt;
To do this, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/inittab&amp;lt;/span&amp;gt;, change the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:5:initdefault:&amp;lt;/span&amp;gt; line to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:3:initdefault:&amp;lt;/span&amp;gt; and then switch to run level 3:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/inittab&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
id:3:initdefault:&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
init 3&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Setup Networking ===&lt;br /&gt;
&lt;br /&gt;
We need to remove &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt;, enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt;, configure the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ifcfg-eth*&amp;lt;/span&amp;gt; files and then start the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; daemon.&lt;br /&gt;
&lt;br /&gt;
==== Network Layout ====&lt;br /&gt;
&lt;br /&gt;
This setup expects you to have three physical network cards connected to three independent networks. To have a common vernacular, let&#039;s use this table to describe them:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!class=&amp;quot;cell_all&amp;quot;|Network Description&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Short Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Device Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Suggested Subnet&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|NIC Properties&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Internet-Facing Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Remaining NIC should be used here.&amp;lt;br /&amp;gt;If using a [[Setting Up a PXE Server in Fedora|PXE server]], this should be a bootable NIC.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Storage Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth1&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Fastest NIC should be used here.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Back-Channel Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth2&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|NICs with [[IPMI]] piggy-back &#039;&#039;&#039;must&#039;&#039;&#039; be used here.&amp;lt;br /&amp;gt;Second-fastest NIC should be used here.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Take note of these concerns when planning which NIC to use on each subnet. These issues are presented in the order in which they must be addressed:&lt;br /&gt;
# If your nodes have [[IPMI]] piggy-backing on a normal NIC, that NIC &#039;&#039;&#039;&#039;&#039;must&#039;&#039;&#039;&#039;&#039; be used on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. Having your fence device accessible on a remotely-reachable subnet can pose a &#039;&#039;major&#039;&#039; security risk.&lt;br /&gt;
# The fastest NIC should be used for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt; subnet. Be sure to know which NICs support the largest jumbo frames when considering this.&lt;br /&gt;
# If you still have two NICs to choose from, use the fastest remaining NIC for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. This will minimize the time it takes to perform tasks like hot-migration of live virtual machines.&lt;br /&gt;
# The final NIC should be used for the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt; subnet.&lt;br /&gt;
&lt;br /&gt;
==== Node IP Addresses ====&lt;br /&gt;
&lt;br /&gt;
Obviously, the IP addresses you give to your nodes should be ones that suit you best. In this example, the following IP addresses are used:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;amp;nbsp;&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Internet-Facing Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Storage Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Back-Channel Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node01&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node02&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==== Remove NetworkManager ====&lt;br /&gt;
&lt;br /&gt;
Some cluster software &#039;&#039;&#039;will not&#039;&#039;&#039; start with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; installed, because &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is designed to be a highly-adaptive network system that can accommodate frequent changes in the network. To simplify these network transitions for end-users, a lot of reconfiguration of the network is done behind the scenes.&lt;br /&gt;
&lt;br /&gt;
For workstations, this is wonderful. For clustering, this can be disastrous. Transient network outages are already a risk to a cluster&#039;s stability!&lt;br /&gt;
&lt;br /&gt;
So first up, make sure that &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is completely removed from your system. If you used the [[#OS Install|kickstart]] scripts, then it was not installed. Otherwise, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum remove NetworkManager NetworkManager-gnome NetworkManager-openconnect NetworkManager-openvpn NetworkManager-pptp NetworkManager-vpnc cnetworkmanager knetworkmanager knetworkmanager-openvpn knetworkmanager-pptp knetworkmanager-vpnc libproxy-networkmanager yum-NetworkManager-dispatcher&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup &#039;network&#039; ====&lt;br /&gt;
&lt;br /&gt;
Before proceeding with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; configuration, check to see if your network cards are aligned to the proper &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; network names. If they need to be adjusted, please follow this How-To before proceeding:&lt;br /&gt;
&lt;br /&gt;
* [[Changing the ethX to Ethernet Device Mapping in Fedora]]&lt;br /&gt;
&lt;br /&gt;
There are a few ways to configure &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; in Fedora:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network&amp;lt;/span&amp;gt; (graphical)&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network-tui&amp;lt;/span&amp;gt; (ncurses)&lt;br /&gt;
* Directly editing the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/sysconfig/network-scripts/ifcfg-eth*&amp;lt;/span&amp;gt; files. (See: [http://docs.fedoraproject.org/en-US/Fedora/12/html/Deployment_Guide/s1-networkscripts-interfaces.html here] for a full list of options)&lt;br /&gt;
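If you edit the files directly, a minimal static configuration for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01&amp;lt;/span&amp;gt;&#039;s back-channel NIC might look like the sketch below. The MAC and IP are this paper&#039;s example values; substitute your own.&lt;br /&gt;

```shell
# Example /etc/sysconfig/network-scripts/ifcfg-eth2 for an-node01's
# back-channel NIC, using this paper's example MAC and IP. HWADDR
# pins the ethX name to the physical card recorded earlier.
DEVICE=eth2
HWADDR=00:0E:0C:59:46:E4
BOOTPROTO=static
ONBOOT=yes
IPADDR=10.0.1.71
NETMASK=255.255.255.0
```

Repeat for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth1&amp;lt;/span&amp;gt; with the appropriate subnet&#039;s values.&lt;br /&gt;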
&lt;br /&gt;
Remember to enable the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; init script with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;chkconfig network on&amp;lt;/span&amp;gt; so that it starts at boot. Do not proceed until your node&#039;s networking is fully configured.&lt;br /&gt;
&lt;br /&gt;
==== Update the Hosts File ====&lt;br /&gt;
&lt;br /&gt;
Some applications expect to be able to call nodes by their name. To accommodate this, and to ensure that inter-node communication takes place on the back-channel subnet, we remove any existing hostname entries and then add the following to the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Any pre-existing entries matching the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; must be removed from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt;. There is a good chance there will be an entry that resolves to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;127.0.0.1&amp;lt;/span&amp;gt; which would cause problems later.&lt;br /&gt;
&lt;br /&gt;
Obviously, adapt the names and IPs to match your nodes and subnets. The only critical thing is to make sure that the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; is resolvable to the back-channel subnet. I like to add a short-form name for convenience.&lt;br /&gt;
&lt;br /&gt;
The updated &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file should look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/hosts&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4&lt;br /&gt;
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6&lt;br /&gt;
&lt;br /&gt;
# Back-channel IPs to name mapping.&lt;br /&gt;
10.0.1.71	an-node01 an-node01.alteeve.com&lt;br /&gt;
10.0.1.72	an-node02 an-node02.alteeve.com&lt;br /&gt;
&lt;br /&gt;
# Storage network&lt;br /&gt;
10.0.0.71	an-node01.sn&lt;br /&gt;
10.0.0.72	an-node02.sn&lt;br /&gt;
&lt;br /&gt;
# Internet facing&lt;br /&gt;
192.168.1.71	an-node01.inet&lt;br /&gt;
192.168.1.72	an-node02.inet&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To test this, ping both nodes by name and make sure the ping packets are sent on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt; subnet:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node01&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node01 (10.0.1.71) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=1 ttl=64 time=0.049 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=2 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=3 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=4 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=5 ttl=64 time=0.055 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node02 (10.0.1.72) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=1 ttl=64 time=0.221 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=2 ttl=64 time=0.188 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=3 ttl=64 time=0.217 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=4 ttl=64 time=0.192 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=5 ttl=64 time=0.163 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Disable Firewalls ====&lt;br /&gt;
&lt;br /&gt;
Be sure to flush the netfilter tables and prevent &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ip6tables&amp;lt;/span&amp;gt; from starting on our nodes.&lt;br /&gt;
&lt;br /&gt;
There will be enough potential sources of problems as it is. Disabling firewalls at this stage will minimize the chance of an errant &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; rule messing up our configuration. If, before launch, you wish to implement a firewall, feel free to do so, but be sure to thoroughly test your cluster to ensure no problems were introduced.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig --level 2345 iptables off&lt;br /&gt;
/etc/init.d/iptables stop&lt;br /&gt;
chkconfig --level 2345 ip6tables off&lt;br /&gt;
/etc/init.d/ip6tables stop&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup SSH Shared Keys ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. Setting up shared keys will allow your nodes to pass files between one another and execute commands remotely without needing to enter a password. This is obviously somewhat risky from a security point of view. As such, it is up to you whether you do this or not. This is not meant to be a security-focused How-To, so please independently study the risks.&lt;br /&gt;
&lt;br /&gt;
If you&#039;re a little new to [[SSH]], it can be a bit confusing keeping connections straight in your head. When you connect to a remote machine, you start the connection on your machine as the user you are logged in as; this is the source user. When you call the remote machine, you tell it what user you want to log in as; this is the remote user.&lt;br /&gt;
&lt;br /&gt;
You will need to create an SSH key for each source user, and then you will need to copy the newly generated public key to each remote machine&#039;s user directory that you want to connect to. In this example, we want to connect to either node, from either node, as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. So we will create a key for each node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user and then copy the generated public key to the &#039;&#039;other&#039;&#039; node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user&#039;s directory.&lt;br /&gt;
&lt;br /&gt;
For each user, on each machine you want to connect &#039;&#039;&#039;from&#039;&#039;&#039;, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh-keygen -t dsa -N &amp;quot;&amp;quot; -f ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
Generating public/private dsa key pair.&lt;br /&gt;
Your identification has been saved in /root/.ssh/id_dsa.&lt;br /&gt;
Your public key has been saved in /root/.ssh/id_dsa.pub.&lt;br /&gt;
The key fingerprint is:&lt;br /&gt;
c4:cc:10:0a:6b:60:24:e7:57:94:69:e5:10:b2:cb:1f root@an-node01.alteeve.com&lt;br /&gt;
The key&#039;s randomart image is:&lt;br /&gt;
+--[ DSA 1024]----+&lt;br /&gt;
|+oo ..B*.        |&lt;br /&gt;
|o+ o =+B         |&lt;br /&gt;
|  + +.  *        |&lt;br /&gt;
| . o . .         |&lt;br /&gt;
|    o E S        |&lt;br /&gt;
|     . .         |&lt;br /&gt;
|      .          |&lt;br /&gt;
|                 |&lt;br /&gt;
|                 |&lt;br /&gt;
+-----------------+&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will create two files: the private key, called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa&amp;lt;/span&amp;gt;, and the public key, called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa.pub&amp;lt;/span&amp;gt;. The private key &#039;&#039;&#039;&#039;&#039;must never&#039;&#039;&#039;&#039;&#039; be group or world readable! That is, it should be set to mode &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;0600&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The two files should look like:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Private key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
-----BEGIN DSA PRIVATE KEY-----&lt;br /&gt;
MIIBugIBAAKBgQCVo5nzeC/W8HWaFkD4pAmWO7ovf7S01f8HgocKKOYX/nMNxRui&lt;br /&gt;
wExBUUUt1b2TxYYQklTd9dn8UAzZ3UohN0CJNDrp//t2wfyACKKc2q+ewao5/svQ&lt;br /&gt;
xXTBu2ZPebKPcr1w9p0hq0xSr77Rg3v6Auc6AnC79DM9XkhYkVgNfgrvMwIVAKdY&lt;br /&gt;
SgTycvqNEWsJrCTTn5485yWvAoGANQ3rzIxFy2zHWyMlrpA9r5jllgxfaJVx3/Iw&lt;br /&gt;
DrEF82lr2dOmG8oYiKAhfvdgNyV7VeM2yxdjcdLPJAHHCx5BZZ7V9KjMzY26wS2r&lt;br /&gt;
d58TZJoWmO0nscmcwIbCv2at3fiMpxgr+nzD4tKxFOdWxYBwdmUpIjtJoW8wzY2H&lt;br /&gt;
uNghjL0CgYA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDN&lt;br /&gt;
bbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EE&lt;br /&gt;
D3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5AIUFX97yIig&lt;br /&gt;
n2SUltN+10NWUy4oYsA=&lt;br /&gt;
-----END DSA PRIVATE KEY-----&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Public key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa.pub&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Copy the public key and then &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; normally into the remote machine as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create a file called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; and paste in the key.&lt;br /&gt;
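The append itself, with the permissions &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;sshd&amp;lt;/span&amp;gt; expects, can be sketched as follows. It is shown against a scratch directory so it is safe to try; on the remote node the directory is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh&amp;lt;/span&amp;gt; and the key line is the real contents of &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id_dsa.pub&amp;lt;/span&amp;gt;.&lt;br /&gt;

```shell
# Append a public key to authorized_keys with safe permissions.
# /tmp/demo-ssh stands in for the remote user's ~/.ssh directory,
# and the key line below is a placeholder for the real public key.
SSHDIR=/tmp/demo-ssh
mkdir -p "$SSHDIR"
chmod 700 "$SSHDIR"
echo "ssh-dss AAAA...example-key... root@an-node01.alteeve.com" >> "$SSHDIR/authorized_keys"
chmod 600 "$SSHDIR/authorized_keys"
```

Where OpenSSH&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh-copy-id&amp;lt;/span&amp;gt; helper is available, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh-copy-id -i ~/.ssh/id_dsa.pub root@an-node02&amp;lt;/span&amp;gt; automates these steps.&lt;br /&gt;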
&lt;br /&gt;
From &#039;&#039;&#039;an-node01&#039;&#039;&#039;, type:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh root@an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
The authenticity of host &#039;an-node02 (10.0.1.72)&#039; can&#039;t be established.&lt;br /&gt;
RSA key fingerprint is d4:1b:68:5f:fa:ef:0f:0c:16:e7:f9:0a:8d:69:3e:c5.&lt;br /&gt;
Are you sure you want to continue connecting (yes/no)? yes&lt;br /&gt;
Warning: Permanently added &#039;an-node02,10.0.1.72&#039; (RSA) to the list of known hosts.&lt;br /&gt;
root@an-node02&#039;s password: &lt;br /&gt;
Last login: Tue Jul  6 22:28:19 2010 from 192.168.1.102&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will now be logged into &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02&amp;lt;/span&amp;gt; as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; file and paste into it the public key from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01&amp;lt;/span&amp;gt;. If the remote machine&#039;s user hasn&#039;t used &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; yet, their &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh&amp;lt;/span&amp;gt; directory will not exist.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/authorized_keys&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now log out and then log back into the remote machine. This time, the connection should succeed without prompting for a password!&lt;br /&gt;
&lt;br /&gt;
= Initial Cluster Setup =&lt;br /&gt;
&lt;br /&gt;
Before we get into specifics, let&#039;s take a minute to talk about the major components used in our cluster.&lt;br /&gt;
&lt;br /&gt;
== Core Program Overviews ==&lt;br /&gt;
&lt;br /&gt;
These are the core programs, possibly new to you, that we will use to build our cluster. Before we configure them, let&#039;s take a minute to understand their roles.&lt;br /&gt;
&lt;br /&gt;
=== Cluster Manager ===&lt;br /&gt;
&lt;br /&gt;
The cluster manager, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, reads the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; configuration file and uses it to configure and start the various cluster elements. This is where nodes are defined, fence devices are configured and various cluster tolerances are set.&lt;br /&gt;
&lt;br /&gt;
=== Corosync ===&lt;br /&gt;
&lt;br /&gt;
[http://www.corosync.org Corosync] is, essentially, the clustering kernel which provides the essential services for cluster-aware applications and other cluster services. At its core is the [[totem]] protocol which provides cluster membership and ordered messaging. On top of this, it implements a number of core clustering services like [[#cpg; Closed Process Group|closed process group messaging]], [[quorum]] and [[confdb]] (an object database).&lt;br /&gt;
&lt;br /&gt;
Its goal is to provide a substantially simpler and more flexible set of APIs to facilitate clustering in Linux. It manages which nodes are in the cluster, triggers error messages when something fails, manages cluster locking and so on. Most other clustered applications rely on corosync to know when something has happened or to announce when a cluster-related action has taken place.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;corosync_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
==== cpg; Closed Process Group ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install corosynclib-devel&lt;br /&gt;
man cpg_overview&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The closed process group is a small process layer on top of the totem protocol provided by corosync. It handles the sending and delivery of messages amongst nodes in a consistent order. It adds [[PID]]s and group names to the membership layer. Only members of the group get the messages, thus it is a &amp;quot;closed&amp;quot; group.&lt;br /&gt;
&lt;br /&gt;
=== OpenAIS ===&lt;br /&gt;
&lt;br /&gt;
OpenAIS is now an extension to Corosync that adds an open-source implementation of the [http://www.saforum.org/ Service Availability (SA) Forum&#039;s] &#039;Application Interface Specification&#039; ([[AIS]]). It is an [[API]] and policy designed to be used by applications concerned with maintaining services during faults. AIS implements the &#039;Availability Management Framework&#039; ([[AMF]]) which, in turn, provides for application fail over, cluster management ([[CLM]]), Checkpointing ([[CKPT]]), Eventing ([[EVT]]), Messaging ([[MSG]]), and Distributed Locking ([[DLOCK]]).&lt;br /&gt;
&lt;br /&gt;
It does this by implementing pluggable shared libraries that get loaded into corosync at runtime. These libraries then can make use of corosync&#039;s internal API. Both corosync and openais services provide [[IPC]] interfaces so that application programmers can make use of their behaviour.&lt;br /&gt;
&lt;br /&gt;
In short, applications can use OpenAIS to be cluster-aware. Its libraries are used by some applications, including Pacemaker. In our application, we will only be using its libraries indirectly.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
[http://www.clusterlabs.org Pacemaker] is the cluster resource manager. It can be used to trigger scripts to control non cluster-aware applications. In this way, these non cluster-aware applications can be made more highly available. It is considered a replacement for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[rgmanager]]&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Setup Core Programs ==&lt;br /&gt;
&lt;br /&gt;
We now need to edit and create the configuration files for our cluster.&lt;br /&gt;
&lt;br /&gt;
=== Setup cman ===&lt;br /&gt;
&lt;br /&gt;
Depending on your needs, this may be the only section you need to configure.&lt;br /&gt;
&lt;br /&gt;
First, install the cluster manager:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install cman&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once installed, we need to create and fill in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; file. This is the core configuration file of our cluster and is an [[XML]] formatted file.&lt;br /&gt;
&lt;br /&gt;
Please see: [[Two-Node Fedora 13 cluster.conf]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The example in that article details many of the common options you will want to use. For a more complete discussion of this file, please run &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man 5 cluster.conf&amp;lt;/span&amp;gt;. Also, please be aware that some arguments can overlap between [[#Setup Corosync|corosync]] and this file.&lt;br /&gt;
&lt;br /&gt;
Once it&#039;s set up, we need to verify it. When you&#039;ve finished editing, be sure to run this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
xmllint --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are errors, fix them. Once the file is formatted properly, the contents of your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file will be returned, followed by &amp;quot;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf validates&amp;lt;/span&amp;gt;&amp;quot;. Once this is the case, you can move on.&lt;br /&gt;
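&lt;br /&gt;
Because &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;xmllint&amp;lt;/span&amp;gt; exits non-zero when validation fails, the check is easy to script, for example in a pre-start sanity check. A minimal sketch, using the same paths as above:&lt;br /&gt;
&lt;br /&gt;
```shell
# Re-run the validation above, testing the exit status explicitly.
# xmllint returns 0 only when cluster.conf validates against its schema.
if xmllint --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf > /dev/null 2> /dev/null
then
    result="valid"
else
    result="invalid"
fi
echo "cluster.conf is $result"
```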
&lt;br /&gt;
A note: normally at this stage, you&#039;d use &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;chkconfig&amp;lt;/span&amp;gt; to enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; to start at boot. However, depending on how you use your cluster, you may need to modify the boot order. For now, let&#039;s leave it off until we&#039;ve got all the components enabled. This is particularly important if you will be using [[DRBD]], [[CLVM]] or [[Xen]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig cman off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To see if it&#039;s working, have a terminal open to both nodes and manually start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;. I suggest having two terminal windows open on each node. In one, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;tail&amp;lt;/span&amp;gt;ing &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/var/log/messages&amp;lt;/span&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
clear; tail -f -n 0 /var/log/messages&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now, with both log files being watched, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; on both nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: You MUST execute the following on both nodes within the time limit set by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;post_join_delay&amp;lt;/span&amp;gt; (the default is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;6&amp;lt;/span&amp;gt; seconds). If you wait longer than this, the first node started will [[fence]] the other node, assuming you have fencing configured properly. If fencing is &#039;&#039;&#039;not&#039;&#039;&#039; configured properly, your first node will hang.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
/etc/init.d/cman start&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Examine the log files to verify there were no errors. If there were, fix them. If not, fantastic!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Warning&#039;&#039;&#039;: This last step will forcefully reboot your nodes. Do not do this on a production cluster! If it must be done on a production server, stop all non-essential services and be sure you have a good backup, &#039;&#039;just in case&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Lastly, test that your fencing is configured and working properly. We do this by using a program called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_node&amp;lt;/span&amp;gt;. With both nodes up and running &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, call a fence against the other node. Be sure to be watching the log files.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;From&#039;&#039;&#039; &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
fence_node an-node02.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
fence an-node02.alteeve.com success&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If it worked, when the node comes back up, repeat the process from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; against &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Setup Corosync ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: In our cluster, we will not need to configure Corosync.&lt;br /&gt;
&lt;br /&gt;
When using &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, as we are here, Corosync&#039;s configuration file is ignored.&lt;br /&gt;
&lt;br /&gt;
Most of the settings available in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 corosync.conf|corosync.conf]]&amp;lt;/span&amp;gt; can be found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|cluster.conf]]&amp;lt;/span&amp;gt; file. Please see the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; article for a comprehensive list of what is and is not supported.&lt;br /&gt;
&lt;br /&gt;
If you want to configure corosync and not use cman, please see the following article:&lt;br /&gt;
&lt;br /&gt;
* [[Two-Node Fedora 13 corosync.conf]]&lt;br /&gt;
&lt;br /&gt;
=== Setup Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
ToDo.&lt;br /&gt;
&lt;br /&gt;
= Fencing =&lt;br /&gt;
&lt;br /&gt;
Before proceeding with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file, you must understand what fencing is, how it is used in Red Hat/CentOS clusters and why it is so important.&lt;br /&gt;
&lt;br /&gt;
* The Cluster Admin&#039;s Mantra:&lt;br /&gt;
** &#039;&#039;&#039;The only thing you don&#039;t know is what you don&#039;t know&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Just because one node loses communication with another, it cannot be assumed that the silent node is dead!&lt;br /&gt;
&lt;br /&gt;
== What is it? ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Fencing&amp;quot; is the act of isolating a malfunctioning node. The goal is to prevent a &#039;&#039;&#039;split-brain&#039;&#039;&#039; condition where two nodes think the other member is dead and continue to use a shared resource. When this happens, file system corruption is almost guaranteed. Another dangerous scenario: one node pauses while writing to a disk, the other node decides it is dead and starts to replay the journal, then the first node recovers and completes the write. The results would be equally disastrous. If you are lucky enough to not lose the shared file system, you will be faced with the task of determining what data got written to which node, merging that data and/or overwriting the node you trust the least. This &#039;best case&#039; is still pretty lousy.&lt;br /&gt;
&lt;br /&gt;
Fencing, isolating a node from altering shared disks, can be accomplished in a couple of ways:&lt;br /&gt;
&lt;br /&gt;
* Power&lt;br /&gt;
** Power fencing is where a device is used to cut the power to a malfunctioning node. This is probably the most common type.&lt;br /&gt;
* Blocking&lt;br /&gt;
** Blocking is often implemented at the network level. This type of fencing leaves the node alone, but disconnects it from the storage network. Often this is done by a switch which prevents traffic coming from the fenced node.&lt;br /&gt;
&lt;br /&gt;
With power fencing, the term used is &amp;quot;STONITH&amp;quot;, literally, &#039;&#039;&#039;S&#039;&#039;&#039;hoot &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;O&#039;&#039;&#039;ther &#039;&#039;&#039;N&#039;&#039;&#039;ode &#039;&#039;&#039;I&#039;&#039;&#039;n &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;H&#039;&#039;&#039;ead. Picture it like an old west duel. If one node is dead, the other node is going to win the duel by default and the dead node will just be shot again. When both nodes are alive, however, the faster node will win and will &amp;quot;kill&amp;quot; (power off or reset) the slower node before it has a chance to fire. Once this duel is over, the surviving node can then access the shared resource confident that it is the only one working on it.&lt;br /&gt;
&lt;br /&gt;
== Misconception ==&lt;br /&gt;
&lt;br /&gt;
It is a &#039;&#039;&#039;very&#039;&#039;&#039; common mistake to ignore fencing when first starting to learn about clustering. Often people think &#039;&#039;&amp;quot;It&#039;s just for production systems, I don&#039;t need to worry about it yet because I don&#039;t care what happens to my test cluster.&amp;quot;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Wrong!&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For the most practical reason: the cluster software will block all I/O transactions when it can&#039;t guarantee a fence operation succeeded. The result is that your cluster will essentially &amp;quot;lock up&amp;quot;. Likewise, [[cman]] and related daemons will fail if they can&#039;t find a fence agent to use.&lt;br /&gt;
&lt;br /&gt;
Secondly, testing our cluster will involve inducing errors. Without proper fencing, there is a high probability that our shared file system will be corrupted. That would force the need to start over, making your learning take a lot longer than it needs to.&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
How you implement fencing depends largely on what kind of fence device(s) you have. This is where I need help from you, dear reader. I only have access to [[Node Assassin]] fence devices on my lab&#039;s cluster. If you have other fence devices, like [[IPMI]], [[DRAC]] or so on, please let me know so that I can expand this section.&lt;br /&gt;
&lt;br /&gt;
=== Implementing Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
In Red Hat&#039;s cluster software, the fence device(s) are configured in the main &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf&amp;lt;/span&amp;gt; cluster configuration file. This configuration is then acted on via the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon. We&#039;ll cover the details of the [[cluster.conf]] file in a moment.&lt;br /&gt;
&lt;br /&gt;
When the cluster determines that a node needs to be fenced, the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon will consult the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file for information on how to access the fence device. Given this &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; snippet:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;xml&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;cluster name=&amp;quot;an-cluster&amp;quot; config_version=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
	&amp;lt;clusternodes&amp;gt;&lt;br /&gt;
		&amp;lt;clusternode name=&amp;quot;an-node02.alteeve.com&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
			&amp;lt;fence&amp;gt;&lt;br /&gt;
				&amp;lt;method name=&amp;quot;node_assassin&amp;quot;&amp;gt;&lt;br /&gt;
					&amp;lt;device name=&amp;quot;motoko&amp;quot; port=&amp;quot;02&amp;quot; action=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
				&amp;lt;/method&amp;gt;&lt;br /&gt;
			&amp;lt;/fence&amp;gt;&lt;br /&gt;
		&amp;lt;/clusternode&amp;gt;&lt;br /&gt;
	&amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
	&amp;lt;fencedevices&amp;gt;&lt;br /&gt;
		&amp;lt;fencedevice name=&amp;quot;motoko&amp;quot; agent=&amp;quot;fence_na&amp;quot; quiet=&amp;quot;true&amp;quot;&lt;br /&gt;
		ipaddr=&amp;quot;motoko.alteeve.com&amp;quot; login=&amp;quot;motoko&amp;quot; passwd=&amp;quot;secret&amp;quot;&amp;gt;&lt;br /&gt;
		&amp;lt;/fencedevice&amp;gt;&lt;br /&gt;
	&amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If the cluster manager determines that the node &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; needs to be fenced, it looks at the first (and only, in this case) &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fence&amp;gt;&amp;lt;/span&amp;gt; method and reads its device&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; in this case. It then looks in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fencedevices&amp;gt;&amp;lt;/span&amp;gt; section for the device with the matching &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;. From there, it gets the information needed to find and access the fence device. Once it connects to the fence device, it passes along the options set in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fence&amp;gt;&amp;lt;/span&amp;gt; entry.&lt;br /&gt;
&lt;br /&gt;
So in this example, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; looks up the details on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; Node Assassin fence device. It calls the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt; program, called a fence agent, and passes the following arguments:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr=motoko.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login=motoko&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd=secret&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;quiet=true&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port=2&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action=off&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
How the fence agent acts on these arguments varies depending on the fence device itself. In general terms, the &#039;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt;&#039; fence agent will create a connection to the device at the IP address (or resolvable name, as in this case) specified in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr&amp;lt;/span&amp;gt; argument. Once connected, it will authenticate using the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd&amp;lt;/span&amp;gt; arguments. Once authenticated, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port&amp;lt;/span&amp;gt; to act on, which could be a power jack, a power or reset button, a network switch port and so on. Finally, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action&amp;lt;/span&amp;gt; to take.&lt;br /&gt;
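&lt;br /&gt;
As a concrete illustration, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; hands these options to the agent as simple &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name=value&amp;lt;/span&amp;gt; lines on its standard input, one per line. The sketch below only builds and prints that list for the snippet above; piping it into &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt; by hand would mimic a real fence call, so don&#039;t do that against a node you care about.&lt;br /&gt;
&lt;br /&gt;
```shell
# The argument list fenced would pass to fence_na for the snippet above,
# one name=value pair per line on the agent's standard input.
args="$(printf '%s\n' \
    ipaddr=motoko.alteeve.com \
    login=motoko \
    passwd=secret \
    quiet=true \
    port=2 \
    action=off)"
echo "$args"
# By hand (dangerous on a live cluster):  echo "$args" | fence_na
```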
&lt;br /&gt;
Once the device completes, it returns a success or failed message. If the first attempt fails, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; will try the next &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fence&amp;gt;&amp;lt;/span&amp;gt; method, if a second exists. It will keep trying fence devices in the order they are found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file until it runs out of devices. If it fails to fence the node, most daemons will &amp;quot;block&amp;quot;, that is, lock up and stop responding until the issue is resolved. The logic for this is that a locked up cluster is better than a corrupted one.&lt;br /&gt;
&lt;br /&gt;
If any of the fence devices succeed though, the cluster will know that it is safe to proceed and will reconfigure the cluster without the defective node.&lt;br /&gt;
&lt;br /&gt;
== Fence Devices ==&lt;br /&gt;
&lt;br /&gt;
Many major [[OEM]]s have their own remote management devices that can serve as fence devices. Examples are [http://dell.ca Dell]&#039;s &#039;DRAC&#039; (Dell Remote Access Controller), [http://hp.ca HP]&#039;s &#039;iLO&#039; (Integrated Lights Out), [http://ibm.ca IBM]&#039;s &#039;RSA&#039; (Remote Supervisor Adapter), [http://sun.ca Sun]&#039;s &#039;SSP&#039; (System Service Processor) and so on. Smaller manufacturers implement remote management via [[IPMI]], the Intelligent Platform Management Interface.&lt;br /&gt;
&lt;br /&gt;
In the above devices, fencing is implemented via a built-in or integrated device inside the server. These devices are usually accessible even when the host server is powered off or hard locked. Via these devices, the host server can be powered off, reset and powered on remotely, regardless of the state of the host server.&lt;br /&gt;
&lt;br /&gt;
Block fencing is possible when the device connecting a node to shared resources, like a fiber-channel SAN switch, provides a method of logically &amp;quot;unplugging&amp;quot; a defective node from the shared resource, leaving the node itself alone.&lt;br /&gt;
&lt;br /&gt;
=== IPMI ===&lt;br /&gt;
&lt;br /&gt;
[[IPMI]] is perhaps the most common method used for fencing nodes in a cluster. It requires having a system board with an IPMI baseboard management controller (BMC). IPMI is often the underlying technology used by many OEM-branded remote access tools. If you don&#039;t see a specific fence agent for your server&#039;s remote access application, experiment with generic IPMI tools.&lt;br /&gt;
&lt;br /&gt;
To start, we need to install the IPMI user software:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install ipmitool&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
A cheap alternative is the [[Node Assassin]], an open-hardware, open source fence device. It was built to allow the use of commodity system boards that lacked remote management support found on more expensive, server class hardware.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Full Disclosure&#039;&#039;&#039;: Node Assassin was created by me, with much help from others, for this paper.&lt;br /&gt;
&lt;br /&gt;
= Clean Up =&lt;br /&gt;
&lt;br /&gt;
Some daemons may have been installed or dragged in during the setup of the cluster. Most notable is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;heartbeat&amp;lt;/span&amp;gt;, which is not compatible with [[RHCS]]. The following command will prevent various daemons from starting at boot.&lt;br /&gt;
&lt;br /&gt;
The command is safe to run even when some of the daemons aren&#039;t installed or have already been removed. Of course, skip the ones you actually want to use. This assumes that the nodes have been built according to this HowTo.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig heartbeat off; chkconfig iscsid off; chkconfig iptables off; chkconfig ip6tables off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
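&lt;br /&gt;
The same clean-up can be written as a loop, which is easier to extend and, with the guard shown, carries on cleanly even where a daemon was never installed. A sketch using the same service list:&lt;br /&gt;
&lt;br /&gt;
```shell
# Disable each unwanted daemon at boot; ignore any that are absent.
for svc in heartbeat iscsid iptables ip6tables; do
    chkconfig "$svc" off 2> /dev/null || :
done
echo "boot-time clean-up done"
```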
&lt;br /&gt;
= So You Have a Cluster - Now What? =&lt;br /&gt;
&lt;br /&gt;
Building the cluster infrastructure was pretty easy, wasn&#039;t it?&lt;br /&gt;
&lt;br /&gt;
But what will you &#039;&#039;&#039;&#039;&#039;do&#039;&#039;&#039;&#039;&#039; with it?&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;width: 75%; text-align: center; padding: 10px; border-spacing: 15px;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!style=&amp;quot;width: 100%; border: 1px solid #dfdfdf;&amp;quot; colspan=&amp;quot;3&amp;quot;|Choose Your Own &amp;lt;span style=&amp;quot;text-decoration: line-through; color: #7f7f7f;&amp;quot;&amp;gt;Adventure&amp;lt;/span&amp;gt; Cluster&lt;br /&gt;
|-&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|style=&amp;quot;width: 34%; border: 1px solid #dfdfdf;&amp;quot;| [[Two Node Fedora 13 Cluster - Xen-Based Virtual Machine Host on DRBD+CLVM|Xen-Based Virtual Machine Host on DRBD+CLVM]]&amp;lt;br /&amp;gt;High Availability VM Host&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Thanks =&lt;br /&gt;
&lt;br /&gt;
* A &#039;&#039;&#039;huge&#039;&#039;&#039; thanks to [http://iplink.net Interlink Connectivity]! They hire me as a contractor and have allowed me to extend these docs while working on their clusters. Development of these How-Tos would be much slower if not for them. If you need hosting or colo services, drop them a line. Their website is a bit out of date though, so disregard the prices. :)&lt;br /&gt;
* To &#039;&#039;&#039;sdake&#039;&#039;&#039; of [http://corosync.org corosync] for helping me sort out the &#039;&#039;&#039;plock&#039;&#039;&#039; component and corosync in general.&lt;br /&gt;
* To &#039;&#039;&#039;Angus Salkeld&#039;&#039;&#039; for helping me nail down the Corosync and OpenAIS differences.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013922.html HJ Lee] from the OpenAIS list for helping me understand the mechanisms controlling the Redundant Ring Protocol&#039;s failure detection types.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013925.html Steven Dake] for clarifying the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;to_x&amp;lt;/span&amp;gt; vs. &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;logoutput: x&amp;lt;/span&amp;gt; arguments in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais.conf&amp;lt;/span&amp;gt;.&lt;br /&gt;
* To [http://dk.linkedin.com/in/fabbione Fabio Massimo Di Nitto] for helping me get caught up with clustering and VMs.&lt;br /&gt;
&lt;br /&gt;
{{footer}}&lt;/div&gt;</summary>
		<author><name>Facade</name></author>
	</entry>
	<entry>
		<id>https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2113</id>
		<title>Abandoned - Two Node Fedora 13 Cluster</title>
		<link rel="alternate" type="text/html" href="https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2113"/>
		<updated>2010-09-09T22:25:51Z</updated>

		<summary type="html">&lt;p&gt;Facade: /* Remove NetworkManager */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{howto_header}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Notice&#039;&#039;&#039;&#039;&#039;, &#039;&#039;&#039;Sep. 08, 2010&#039;&#039;&#039; - &#039;&#039;Draft 1&#039;&#039;: This is the first draft of this HowTo. It has not been peer-reviewed yet, so please proceed with caution and on a test cluster. Any and all feedback is much appreciated.&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
&lt;br /&gt;
This paper has one goal;&lt;br /&gt;
&lt;br /&gt;
# How to assemble the simplest cluster possible, a &#039;&#039;&#039;2-Node Cluster&#039;&#039;&#039;, which you can then expand on for your own needs.&lt;br /&gt;
&lt;br /&gt;
With this completed, you can then jump into &amp;quot;step 2&amp;quot; papers that will show various uses of a two node cluster:&lt;br /&gt;
&lt;br /&gt;
# How to create a &amp;quot;floating&amp;quot; virtual machine that can move between the two nodes in the event of a node failure, maximizing up time.&lt;br /&gt;
# How to create simple resources that can move between nodes. Examples will be a simple PostgreSQL database, DHCP, DNS and web servers.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
&lt;br /&gt;
It is expected that you are already comfortable with the Linux command line, specifically &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[bash]]&amp;lt;/span&amp;gt;, and that you are familiar with general administrative tasks in Red Hat-based distributions, specifically [[Fedora]]. You will also need to be comfortable using editors like [[vim]], [[nano]] or similar. This paper uses &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;vim&amp;lt;/span&amp;gt; in examples. Simply substitute your favourite editor in its place.&lt;br /&gt;
&lt;br /&gt;
You are also expected to be comfortable with networking concepts: TCP/IP, [[multicast]], broadcast, subnets and netmasks, routing and other relatively basic networking topics. Please take the time to become familiar with these before proceeding.&lt;br /&gt;
&lt;br /&gt;
That said, where feasible, as much detail as possible will be provided. For example, all configuration file locations are shown and functioning sample files are provided.&lt;br /&gt;
&lt;br /&gt;
== Platform ==&lt;br /&gt;
&lt;br /&gt;
This paper will implement the [[Red Hat]] Cluster Suite using the Fedora v13 distribution. This paper uses the [[x86_64]] repositories; however, if you are on an [[i386]] (32-bit) system, you should be able to follow along fine. Simply replace &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;x86_64&amp;lt;/span&amp;gt; with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i386&amp;lt;/span&amp;gt; or &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i686&amp;lt;/span&amp;gt; in package names.&lt;br /&gt;
&lt;br /&gt;
You can either download the stock Fedora 13 DVD ISO (currently at version &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5.4&amp;lt;/span&amp;gt;, which is used in this document), or you can try out the alpha [[#AN!Cluster Install|AN!Cluster Install]] DVD (4.3GB ISO). If you use the latter, please test it out on a development or test cluster. If you have any problems with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;AN!Cluster&amp;lt;/span&amp;gt; variant Fedora distro, please [[Digimer|contact me]] and let me know what your trouble was.&lt;br /&gt;
&lt;br /&gt;
== Why Fedora 13? ==&lt;br /&gt;
&lt;br /&gt;
Generally speaking, I much prefer to use a server-oriented Linux distribution like [[CentOS]], [[Debian]] or similar. However, there have been many recent changes in the Linux-clustering world that have made all of the currently available server-class distributions obsolete. With luck, once [[Red Hat]] Enterprise Linux and [[CentOS]] version 6 are released, this will change.&lt;br /&gt;
&lt;br /&gt;
Until then, [[Fedora]] version 13 provides the most up-to-date binary releases of the new clustering stack. For this reason, FC13 is the best choice for clustering if you want to be current. To mitigate some of the issues introduced by using a workstation distribution, many packages will be stripped out of the default install.&lt;br /&gt;
&lt;br /&gt;
== Focus ==&lt;br /&gt;
&lt;br /&gt;
Clusters can serve to solve three problems: &#039;&#039;&#039;Reliability&#039;&#039;&#039;, &#039;&#039;&#039;Performance&#039;&#039;&#039; and &#039;&#039;&#039;Scalability&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
The focus of the cluster described in this paper is primarily &#039;&#039;&#039;reliability&#039;&#039;&#039;. Second to this, &#039;&#039;&#039;scalability&#039;&#039;&#039; is the priority, leaving &#039;&#039;&#039;performance&#039;&#039;&#039; to be addressed only when it does not impact the first two criteria. This is not to say that performance is not a valid priority; it simply isn&#039;t the priority of this paper.&lt;br /&gt;
&lt;br /&gt;
== Goal ==&lt;br /&gt;
&lt;br /&gt;
At the end of this paper, you should have a fully functioning two-node cluster capable of hosting &amp;quot;floating&amp;quot; resources; that is, resources that exist on one node and can be easily moved to the other with minimal effort and downtime. This should leave you with a solid foundation for adding more virtual servers, up to the limit of your cluster&#039;s resources.&lt;br /&gt;
&lt;br /&gt;
This paper should also serve to show how to build the foundation of any other cluster configuration. It introduces the main issues that come with clustering and should provide a foundation for cluster configurations outside its scope.&lt;br /&gt;
&lt;br /&gt;
= Begin = &lt;br /&gt;
&lt;br /&gt;
Let&#039;s begin!&lt;br /&gt;
&lt;br /&gt;
== Hardware ==&lt;br /&gt;
&lt;br /&gt;
We will need two physical servers each with the following hardware:&lt;br /&gt;
* One or more multi-core [[CPU]]s with Virtualization support.&lt;br /&gt;
* Three network cards; at least one should be gigabit or faster.&lt;br /&gt;
* One or more hard drives.&lt;br /&gt;
* Some form of a [[fence|fence device]], such as an [[IPMI]]-enabled server, a [http://nodeassassin.org Node Assassin], a fenceable [http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900&amp;amp;tab=features PDU] or similar.&lt;br /&gt;
&lt;br /&gt;
This paper uses the following hardware:&lt;br /&gt;
* ASUS [http://support.asus.com/search/search.aspx?keyword=m4a78l-m&amp;amp;SLanguage=en-us M4A78L-M]&lt;br /&gt;
* AMD Athlon II x2 250 &lt;br /&gt;
* 2GB Kingston DDR2 KVR800D2N6K2/4G (4GB kit split between the two nodes)&lt;br /&gt;
* 1x Intel 82540 PCI NIC&lt;br /&gt;
* 1x D-Link DGE-560T&lt;br /&gt;
* Node Assassin&lt;br /&gt;
&lt;br /&gt;
This is not an endorsement of the above hardware. I bought what was within my budget and would serve the purposes of creating this document. What you purchase shouldn&#039;t matter, so long as the minimum requirements are met.&lt;br /&gt;
&lt;br /&gt;
== Pre-Assembly ==&lt;br /&gt;
&lt;br /&gt;
With multiple NICs, it is quite likely that the mapping of physical devices to logical &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; devices will not be ideal. This is a particular issue if you decide to [[Setting Up a PXE Server in Fedora|network boot]] your install media; in that case, if the wrong NIC is chosen for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;, you will be presented with a list of MAC addresses from which to attempt setup.&lt;br /&gt;
&lt;br /&gt;
Before you assemble your servers, record their network cards&#039; [[MAC]] addresses. I like to keep simple text files like these:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node01.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:EA	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:53	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:46:E4	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node02.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:D8	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:5A	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:45:78	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Feel free to record the information in any way that suits you the best.&lt;br /&gt;
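As a sketch of how to gather these addresses on a running node, the sysfs entries can be read directly (interface names and MACs are machine-specific):

```shell
# Print "MAC<tab>device" for every network interface by reading sysfs.
# Output is machine-specific; "lo" is the loopback and can be ignored.
for nic in /sys/class/net/*/; do
    name="$(basename "$nic")"
    printf '%s\t%s\n' "$(cat "${nic}address")" "$name"
done
```

Redirect the output into a per-node file (such as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.mac&amp;lt;/span&amp;gt; file above) and annotate it by hand.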
&lt;br /&gt;
== OS Install ==&lt;br /&gt;
&lt;br /&gt;
Start with a stock Fedora 13 install. This How-To uses Fedora 13 x86_64; however, it should be fairly easy to adapt to other recent Fedora versions. This document also aims to be easily portable to [[CentOS]]/[[RHEL]] version 6 once it is released. It will not adapt well to CentOS/RHEL version 5, though; much of the cluster stack has changed dramatically since that version was released.&lt;br /&gt;
&lt;br /&gt;
These are the sample kickstart scripts used by this paper. Be sure to set your own password string and network settings.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Warning&#039;&#039;&#039;&#039;&#039;! These kickstart scripts &#039;&#039;&#039;&#039;&#039;will erase your hard drive&#039;&#039;&#039;&#039;&#039;! Adapt them, don&#039;t blindly use them.&lt;br /&gt;
&lt;br /&gt;
Generic cluster node kickstart scripts.&lt;br /&gt;
* [[Fedora13 KS an-node01.ks|an-node01.ks]]&lt;br /&gt;
* [[Fedora13 KS an-node02.ks|an-node02.ks]]&lt;br /&gt;
&lt;br /&gt;
== AN!Cluster Install ==&lt;br /&gt;
&lt;br /&gt;
If you are feeling brave, below is a link to a custom install DVD that contains the kickstart scripts to set up nodes and an &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-cluster&amp;lt;/span&amp;gt; directory with all the configuration files.&lt;br /&gt;
&lt;br /&gt;
* Download the custom &#039;&#039;&#039;AN!Cluster v0.2.001&#039;&#039;&#039; Install DVD. (4.5[[GiB]] iso). (Currently disabled - Reworking for F13)&lt;br /&gt;
&lt;br /&gt;
== Post OS Install ==&lt;br /&gt;
&lt;br /&gt;
Once the OS is installed, we need to do a few things:&lt;br /&gt;
&lt;br /&gt;
# Disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;.&lt;br /&gt;
# Set up networking.&lt;br /&gt;
# Change the default run-level.&lt;br /&gt;
&lt;br /&gt;
=== Disable &#039;selinux&#039; ===&lt;br /&gt;
&lt;br /&gt;
To allow this paper to focus on clustering, we will disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;. Obviously, this introduces security issues that you may not be comfortable with. If you wish to leave &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt; enabled, it will be up to you to sort out issues that crop up.&lt;br /&gt;
&lt;br /&gt;
To disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/selinux/config&amp;lt;/span&amp;gt; and change &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=enforcing&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=permissive&amp;lt;/span&amp;gt;. (Strictly speaking, permissive mode does not disable SELinux; violations are still logged, just no longer enforced.) You will need to reboot in order for the change to take effect.&lt;br /&gt;
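The edit can also be scripted. The sketch below demonstrates the substitution on a sample copy of the file so it can be run safely anywhere; on a real node you would point sed at /etc/selinux/config as root, then reboot.

```shell
# Demonstrate the change on a sample copy; on a real node, target
# /etc/selinux/config as root (and reboot afterwards).
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > /tmp/selinux-config.sample
sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /tmp/selinux-config.sample
cat /tmp/selinux-config.sample
```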
&lt;br /&gt;
=== Change the Default Run-Level ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. It only improves performance; it is not required.&lt;br /&gt;
&lt;br /&gt;
If you don&#039;t plan to work on your nodes directly, it makes sense to switch the default run level from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;3&amp;lt;/span&amp;gt;. This prevents the window manager, like Gnome or KDE, from starting at boot. This frees up a fair bit of memory and system resources and reduces the possible attack vectors.&lt;br /&gt;
&lt;br /&gt;
To do this, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/inittab&amp;lt;/span&amp;gt;, change the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:5:initdefault:&amp;lt;/span&amp;gt; line to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:3:initdefault:&amp;lt;/span&amp;gt; and then switch to run level 3:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/inittab&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
id:3:initdefault:&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
init 3&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Setup Networking ===&lt;br /&gt;
&lt;br /&gt;
We need to remove &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt;, enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt;, configure the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ifcfg-eth*&amp;lt;/span&amp;gt; files and then start the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; daemon.&lt;br /&gt;
&lt;br /&gt;
==== Network Layout ====&lt;br /&gt;
&lt;br /&gt;
This setup expects you to have three physical network cards connected to three independent networks. To have a common vernacular, let&#039;s use this table to describe them:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!class=&amp;quot;cell_all&amp;quot;|Network Description&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Short Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Device Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Suggested Subnet&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|NIC Properties&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Internet-Facing Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Remaining NIC should be used here.&amp;lt;br /&amp;gt;If using a [[Setting Up a PXE Server in Fedora|PXE server]], this should be a bootable NIC.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Storage Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth1&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Fastest NIC should be used here.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Back-Channel Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth2&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|NICs with [[IPMI]] piggy-back &#039;&#039;&#039;must&#039;&#039;&#039; be used here.&amp;lt;br /&amp;gt;Second-fastest NIC should be used here.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Take note of these concerns when planning which NIC to use on each subnet. These issues are presented in the order in which they must be addressed:&lt;br /&gt;
# If your nodes have [[IPMI]] piggy-backing on a normal NIC, that NIC &#039;&#039;&#039;&#039;&#039;must&#039;&#039;&#039;&#039;&#039; be used on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. Having your fence device accessible on a subnet that can be remotely accessed poses a &#039;&#039;major&#039;&#039; security risk.&lt;br /&gt;
# The fastest NIC should be used for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt; subnet. Be sure to know which NICs support the largest jumbo frames when considering this.&lt;br /&gt;
# If you still have two NICs to choose from, use the fastest remaining NIC for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. This will minimize the time it takes to perform tasks like hot-migration of live virtual machines.&lt;br /&gt;
# The final NIC should be used for the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt; subnet.&lt;br /&gt;
&lt;br /&gt;
==== Node IP Addresses ====&lt;br /&gt;
&lt;br /&gt;
Obviously, the IP addresses you give to your nodes should be ones that suit you best. In this example, the following IP addresses are used:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;amp;nbsp;&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Internet-Facing Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Storage Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Back-Channel Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node01&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node02&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==== Remove NetworkManager ====&lt;br /&gt;
&lt;br /&gt;
Some cluster software &#039;&#039;&#039;will not&#039;&#039;&#039; start with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; installed. This is because NetworkManager is designed to be a highly adaptive network system that can accommodate frequent changes in the network. To simplify these network transitions for end users, a lot of reconfiguration of the network is done behind the scenes.&lt;br /&gt;
&lt;br /&gt;
For workstations, this is wonderful. For clustering, this can be disastrous. Transient network outages are already a risk to a cluster&#039;s stability!&lt;br /&gt;
&lt;br /&gt;
So first up, make sure that &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is completely removed from your system. If you used the [[#OS Install|kickstart]] scripts, then it was not installed. Otherwise, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum remove NetworkManager NetworkManager-gnome NetworkManager-openconnect NetworkManager-openvpn NetworkManager-pptp NetworkManager-vpnc cnetworkmanager knetworkmanager knetworkmanager-openvpn knetworkmanager-pptp knetworkmanager-vpnc libproxy-networkmanager yum-NetworkManager-dispatcher&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup &#039;network&#039; ====&lt;br /&gt;
&lt;br /&gt;
Before proceeding with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; configuration, check whether your network cards are aligned to the proper &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; network names. If they need to be adjusted, please follow this How-To before proceeding:&lt;br /&gt;
&lt;br /&gt;
* [[Changing the ethX to Ethernet Device Mapping in Fedora]]&lt;br /&gt;
&lt;br /&gt;
There are a few ways to configure &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; in Fedora:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network&amp;lt;/span&amp;gt; (graphical)&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network-tui&amp;lt;/span&amp;gt; (ncurses)&lt;br /&gt;
* Directly editing the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/sysconfig/network-scripts/ifcfg-eth*&amp;lt;/span&amp;gt; files. (See: [http://docs.fedoraproject.org/en-US/Fedora/12/html/Deployment_Guide/s1-networkscripts-interfaces.html here] for a full list of options)&lt;br /&gt;
&lt;br /&gt;
Do not proceed until your node&#039;s networking is fully configured.&lt;br /&gt;
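For the direct-editing route, a minimal static configuration for an-node01&#039;s IFN interface might look like the sketch below. The HWADDR is the example MAC recorded earlier and the gateway address is an assumption for a typical 192.168.1.0/24 network; substitute your own values.

```
# /etc/sysconfig/network-scripts/ifcfg-eth0 (an-node01, IFN)
DEVICE=eth0
HWADDR=90:E6:BA:71:82:EA
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.1.71
NETMASK=255.255.255.0
GATEWAY=192.168.1.1
```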
&lt;br /&gt;
==== Update the Hosts File ====&lt;br /&gt;
&lt;br /&gt;
Some applications expect to be able to call nodes by their name. To accommodate this, and to ensure that inter-node communication takes place on the back-channel subnet, we remove any existing hostname entries and then add the following to the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Any pre-existing entries matching the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; must be removed from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt;. There is a good chance there will be an entry that resolves to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;127.0.0.1&amp;lt;/span&amp;gt; which would cause problems later.&lt;br /&gt;
&lt;br /&gt;
Obviously, adapt the names and IPs to match your nodes and subnets. The only critical thing is to make sure that the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; is resolvable to the back-channel subnet. I like to add a short-form name for convenience.&lt;br /&gt;
&lt;br /&gt;
The updated &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file should look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/hosts&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4&lt;br /&gt;
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6&lt;br /&gt;
&lt;br /&gt;
# Back-channel IPs to name mapping.&lt;br /&gt;
10.0.1.71	an-node01 an-node01.alteeve.com&lt;br /&gt;
10.0.1.72	an-node02 an-node02.alteeve.com&lt;br /&gt;
&lt;br /&gt;
# Storage network&lt;br /&gt;
10.0.0.71	an-node01.sn&lt;br /&gt;
10.0.0.72	an-node02.sn&lt;br /&gt;
&lt;br /&gt;
# Internet facing&lt;br /&gt;
192.168.1.71	an-node01.inet&lt;br /&gt;
192.168.1.72	an-node02.inet&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To test this, ping both nodes by name and make sure the ping packets are sent on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt; subnet:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node01&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node01 (10.0.1.71) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=1 ttl=64 time=0.049 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=2 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=3 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=4 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=5 ttl=64 time=0.055 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node02 (10.0.1.72) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=1 ttl=64 time=0.221 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=2 ttl=64 time=0.188 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=3 ttl=64 time=0.217 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=4 ttl=64 time=0.192 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=5 ttl=64 time=0.163 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Disable Firewalls ====&lt;br /&gt;
&lt;br /&gt;
Be sure to flush the netfilter tables and prevent &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ip6tables&amp;lt;/span&amp;gt; from starting on our nodes.&lt;br /&gt;
&lt;br /&gt;
There will be enough potential sources of problems as it is. Disabling the firewalls at this stage will minimize the chance of an errant &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; rule messing up our configuration. If, before launch, you wish to implement a firewall, feel free to do so, but be sure to thoroughly test your cluster to ensure no problems were introduced.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig --level 2345 iptables off&lt;br /&gt;
/etc/init.d/iptables stop&lt;br /&gt;
chkconfig --level 2345 ip6tables off&lt;br /&gt;
/etc/init.d/ip6tables stop&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup SSH Shared Keys ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. Setting up shared keys will allow your nodes to pass files between one another and execute commands remotely without needing to enter a password. This is obviously somewhat risky from a security point of view. As such, it is up to you whether you do this or not. This is not meant to be a security-focused How-To, so please independently study the risks.&lt;br /&gt;
&lt;br /&gt;
If you&#039;re a little new to [[SSH]], it can be a bit confusing keeping connections straight in your head. When you connect to a remote machine, you start the connection on your machine as the user you are logged in as; this is the source user. When you call the remote machine, you tell it what user you want to log in as; this is the remote user.&lt;br /&gt;
&lt;br /&gt;
You will need to create an SSH key for each source user, and then you will need to copy the newly generated public key to each remote machine&#039;s user directory that you want to connect to. In this example, we want to connect to either node, from either node, as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. So we will create a key for each node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user and then copy the generated public key to the &#039;&#039;other&#039;&#039; node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user&#039;s directory.&lt;br /&gt;
&lt;br /&gt;
For each user, on each machine you want to connect &#039;&#039;&#039;from&#039;&#039;&#039;, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh-keygen -t dsa -N &amp;quot;&amp;quot; -f ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
Generating public/private dsa key pair.&lt;br /&gt;
Your identification has been saved in /root/.ssh/id_dsa.&lt;br /&gt;
Your public key has been saved in /root/.ssh/id_dsa.pub.&lt;br /&gt;
The key fingerprint is:&lt;br /&gt;
c4:cc:10:0a:6b:60:24:e7:57:94:69:e5:10:b2:cb:1f root@an-node01.alteeve.com&lt;br /&gt;
The key&#039;s randomart image is:&lt;br /&gt;
+--[ DSA 1024]----+&lt;br /&gt;
|+oo ..B*.        |&lt;br /&gt;
|o+ o =+B         |&lt;br /&gt;
|  + +.  *        |&lt;br /&gt;
| . o . .         |&lt;br /&gt;
|    o E S        |&lt;br /&gt;
|     . .         |&lt;br /&gt;
|      .          |&lt;br /&gt;
|                 |&lt;br /&gt;
|                 |&lt;br /&gt;
+-----------------+&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will create two files: the private key, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa&amp;lt;/span&amp;gt;, and the public key, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa.pub&amp;lt;/span&amp;gt;. The private key &#039;&#039;&#039;&#039;&#039;must never&#039;&#039;&#039;&#039;&#039; be group- or world-readable! That is, it should be set to mode &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;0600&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The two files should look like:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Private key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
-----BEGIN DSA PRIVATE KEY-----&lt;br /&gt;
MIIBugIBAAKBgQCVo5nzeC/W8HWaFkD4pAmWO7ovf7S01f8HgocKKOYX/nMNxRui&lt;br /&gt;
wExBUUUt1b2TxYYQklTd9dn8UAzZ3UohN0CJNDrp//t2wfyACKKc2q+ewao5/svQ&lt;br /&gt;
xXTBu2ZPebKPcr1w9p0hq0xSr77Rg3v6Auc6AnC79DM9XkhYkVgNfgrvMwIVAKdY&lt;br /&gt;
SgTycvqNEWsJrCTTn5485yWvAoGANQ3rzIxFy2zHWyMlrpA9r5jllgxfaJVx3/Iw&lt;br /&gt;
DrEF82lr2dOmG8oYiKAhfvdgNyV7VeM2yxdjcdLPJAHHCx5BZZ7V9KjMzY26wS2r&lt;br /&gt;
d58TZJoWmO0nscmcwIbCv2at3fiMpxgr+nzD4tKxFOdWxYBwdmUpIjtJoW8wzY2H&lt;br /&gt;
uNghjL0CgYA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDN&lt;br /&gt;
bbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EE&lt;br /&gt;
D3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5AIUFX97yIig&lt;br /&gt;
n2SUltN+10NWUy4oYsA=&lt;br /&gt;
-----END DSA PRIVATE KEY-----&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Public key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa.pub&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Copy the public key and then &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; normally into the remote machine as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create a file called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; and paste in the key.&lt;br /&gt;
&lt;br /&gt;
From &#039;&#039;&#039;an-node01&#039;&#039;&#039;, type:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh root@an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
The authenticity of host &#039;an-node02 (10.0.1.72)&#039; can&#039;t be established.&lt;br /&gt;
RSA key fingerprint is d4:1b:68:5f:fa:ef:0f:0c:16:e7:f9:0a:8d:69:3e:c5.&lt;br /&gt;
Are you sure you want to continue connecting (yes/no)? yes&lt;br /&gt;
Warning: Permanently added &#039;an-node02,10.0.1.72&#039; (RSA) to the list of known hosts.&lt;br /&gt;
root@an-node02&#039;s password: &lt;br /&gt;
Last login: Tue Jul  6 22:28:19 2010 from 192.168.1.102&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will now be logged into &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02&amp;lt;/span&amp;gt; as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; file and paste into it the public key from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01&amp;lt;/span&amp;gt;. If the remote machine&#039;s user hasn&#039;t used &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; yet, their &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh&amp;lt;/span&amp;gt; directory will not exist.&lt;br /&gt;
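&lt;br /&gt;
If the directory needs to be created, keep its permissions strict; by default, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;sshd&amp;lt;/span&amp;gt; will refuse to use a key file if &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh&amp;lt;/span&amp;gt; or &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;authorized_keys&amp;lt;/span&amp;gt; is writable by anyone but the owner. For example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
mkdir -p ~/.ssh&lt;br /&gt;
chmod 700 ~/.ssh&lt;br /&gt;
touch ~/.ssh/authorized_keys&lt;br /&gt;
chmod 600 ~/.ssh/authorized_keys&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;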
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/authorized_keys&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now log out and then log back into the remote machine. This time, the connection should succeed without prompting for a password!&lt;br /&gt;
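&lt;br /&gt;
Alternatively, the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh-copy-id&amp;lt;/span&amp;gt; script that ships with OpenSSH performs the copy, file creation and permission steps above in one go:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh-copy-id -i ~/.ssh/id_dsa.pub root@an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;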
&lt;br /&gt;
= Initial Cluster Setup =&lt;br /&gt;
&lt;br /&gt;
Before we get into specifics, let&#039;s take a minute to talk about the major components used in our cluster.&lt;br /&gt;
&lt;br /&gt;
== Core Program Overviews ==&lt;br /&gt;
&lt;br /&gt;
These are the core programs, possibly new to you, that we will use to build our cluster. Before we configure them, let&#039;s take a minute to understand their roles.&lt;br /&gt;
&lt;br /&gt;
=== Cluster Manager ===&lt;br /&gt;
&lt;br /&gt;
The cluster manager, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, takes the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; configuration file and uses it to configure and start the various cluster elements. This is where nodes are defined, fence devices are configured and various cluster tolerances are set.&lt;br /&gt;
&lt;br /&gt;
=== Corosync ===&lt;br /&gt;
&lt;br /&gt;
[http://www.corosync.org Corosync] is, essentially, the clustering kernel which provides the essential services for cluster-aware applications and other cluster services. At its core is the [[totem]] protocol which provides cluster membership and ordered messaging. On top of this, it implements a number of core clustering services like [[#cpg; Closed Process Group|closed process group messaging]], [[quorum]] and [[confdb]] (an object database).&lt;br /&gt;
&lt;br /&gt;
Its goal is to provide a substantially simpler and more flexible set of APIs to facilitate clustering in Linux. It manages which nodes are in the cluster, triggers error messages when something fails, manages cluster locking and so on. Most other clustered applications rely on corosync to know when something has happened or to announce when a cluster-related action has taken place.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;corosync_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
==== cpg; Closed Process Group ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install corosynclib-devel&lt;br /&gt;
man cpg_overview&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The closed process group is a small process layer on top of the totem protocol provided by corosync. It handles the sending and delivery of messages amongst nodes in a consistent order. It adds [[PID]]s and group names to the membership layer. Only members of the group get the messages, thus it is a &amp;quot;closed&amp;quot; group.&lt;br /&gt;
&lt;br /&gt;
=== OpenAIS ===&lt;br /&gt;
&lt;br /&gt;
OpenAIS is now an extension to Corosync that adds an open-source implementation of the [http://www.saforum.org/ Service Availability (SA) Forum&#039;s] &#039;Application Interface Specification&#039; ([[AIS]]). It is an [[API]] and policy designed to be used by applications concerned with maintaining services during faults. AIS implements the &#039;Availability Management Framework&#039; ([[AMF]]) which, in turn, provides for application fail over, cluster management ([[CLM]]), Checkpointing ([[CKPT]]), Eventing ([[EVT]]), Messaging ([[MSG]]), and Distributed Locking ([[DLOCK]]).&lt;br /&gt;
&lt;br /&gt;
It does this by implementing pluggable shared libraries that get loaded into corosync at runtime. These libraries then can make use of corosync&#039;s internal API. Both corosync and openais services provide [[IPC]] interfaces so that application programmers can make use of their behaviour.&lt;br /&gt;
&lt;br /&gt;
In short, applications can use OpenAIS to be cluster-aware. Its libraries are used by some applications, including Pacemaker. In our application, we will only be using its libraries indirectly.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
[http://www.clusterlabs.org Pacemaker] is the cluster resource manager. It can be used to trigger scripts to control non cluster-aware applications. In this way, these non cluster-aware applications can be made more highly available. It is considered a replacement for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[rgmanager]]&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Setup Core Programs ==&lt;br /&gt;
&lt;br /&gt;
We now need to edit and create the configuration files for our cluster.&lt;br /&gt;
&lt;br /&gt;
=== Setup cman ===&lt;br /&gt;
&lt;br /&gt;
Depending on your cluster, this may be the only section you need to configure.&lt;br /&gt;
&lt;br /&gt;
First, install the cluster manager:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install cman&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once installed, we need to create and fill in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; file. This is the core configuration file of our cluster and is an [[XML]] formatted file.&lt;br /&gt;
&lt;br /&gt;
Please see: [[Two-Node Fedora 13 cluster.conf]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The example in that article details many of the common options you will want to use. For a more complete discussion of this file, please run &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man 5 cluster.conf&amp;lt;/span&amp;gt;. Also, please be aware that some arguments can overlap between [[#Setup Corosync|corosync]] and this file.&lt;br /&gt;
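&lt;br /&gt;
As a rough sketch only (the node names here are this paper&#039;s examples, and a real file must also define fencing, as covered below), a minimal two node &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; has this shape. The &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;two_node=&amp;quot;1&amp;quot;&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;expected_votes=&amp;quot;1&amp;quot;&amp;lt;/span&amp;gt; arguments tell &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; that a single vote is enough for [[quorum]], which a two node cluster needs in order to survive the loss of its peer:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;xml&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot;?&amp;gt;&lt;br /&gt;
&amp;lt;cluster name=&amp;quot;an-cluster&amp;quot; config_version=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
	&amp;lt;cman two_node=&amp;quot;1&amp;quot; expected_votes=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
	&amp;lt;clusternodes&amp;gt;&lt;br /&gt;
		&amp;lt;clusternode name=&amp;quot;an-node01.alteeve.com&amp;quot; nodeid=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
		&amp;lt;clusternode name=&amp;quot;an-node02.alteeve.com&amp;quot; nodeid=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
	&amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;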
&lt;br /&gt;
Once it&#039;s set up, we need to verify it. When you&#039;ve finished editing, be sure to run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
xmllint --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are errors, fix them. Once it is formatted properly, the contents of your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file will be returned, followed by &amp;quot;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf validates&amp;lt;/span&amp;gt;&amp;quot;. Once this is the case, you can move on.&lt;br /&gt;
&lt;br /&gt;
A note: Normally at this stage, you&#039;d use &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;chkconfig&amp;lt;/span&amp;gt; to enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; to start at boot. However, depending on how you use your cluster, you may need to modify the boot order. For now, let&#039;s leave it off until we&#039;ve got all the components enabled. This is particularly important if you will be using [[DRBD]], [[CLVM]] or [[Xen]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig cman off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To see if it&#039;s working, have a terminal open to both nodes and manually start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;. I suggest having two terminal windows open on each node. In one, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;tail&amp;lt;/span&amp;gt;ing &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/var/log/messages&amp;lt;/span&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
clear; tail -f -n 0 /var/log/messages&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now, with both log files being watched, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; on both nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: You MUST execute the following on both nodes within the time limit set by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;post_join_delay&amp;lt;/span&amp;gt; (the default is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;6&amp;lt;/span&amp;gt; seconds). If you wait longer than this, the first node started will [[fence]] the other node, assuming you have fencing configured properly. If fencing is &#039;&#039;&#039;not&#039;&#039;&#039; configured properly, your first node will hang.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
/etc/init.d/cman start&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Examine the log files to verify there were no errors. If there were, fix them. If there were none, fantastic!&lt;br /&gt;
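&lt;br /&gt;
You can also confirm cluster membership with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman_tool&amp;lt;/span&amp;gt;, which is installed as part of the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; package:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cman_tool nodes&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Both nodes should be listed with the status &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;M&amp;lt;/span&amp;gt;, meaning they are members of the cluster.&lt;br /&gt;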
&lt;br /&gt;
&#039;&#039;&#039;Warning&#039;&#039;&#039;: This last step will forcefully reboot your nodes. Do not do this on a production cluster! If it must be done on a production server, stop all non-essential services and be sure you have a good backup, &#039;&#039;just in case&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Lastly, test that your fencing is configured and working properly. We do this by using a program called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_node&amp;lt;/span&amp;gt;. With both nodes up and running &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, call a fence against the other node. Be sure to be watching the log files.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;From&#039;&#039;&#039; &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
fence_node an-node02.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
fence an-node02.alteeve.com success&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If it worked, when the node comes back up, repeat the process from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; against &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Setup Corosync ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: In our cluster, we will not need to configure Corosync.&lt;br /&gt;
&lt;br /&gt;
When using &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, as we are here, Corosync&#039;s configuration file is ignored.&lt;br /&gt;
&lt;br /&gt;
Most of the settings available in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 corosync.conf|corosync.conf]]&amp;lt;/span&amp;gt; can be found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|cluster.conf]]&amp;lt;/span&amp;gt; file. Please see the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; article for a comprehensive list of what is and is not supported.&lt;br /&gt;
&lt;br /&gt;
If you want to configure corosync and not use cman, please see the following article:&lt;br /&gt;
&lt;br /&gt;
* [[Two-Node Fedora 13 corosync.conf]]&lt;br /&gt;
&lt;br /&gt;
=== Setup Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
To do.&lt;br /&gt;
&lt;br /&gt;
= Fencing =&lt;br /&gt;
&lt;br /&gt;
Before proceeding with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file, you must understand what fencing is, how it is used in Red Hat/CentOS clusters and why it is so important.&lt;br /&gt;
&lt;br /&gt;
* The Cluster Admin&#039;s Mantra:&lt;br /&gt;
** &#039;&#039;&#039;The only thing you don&#039;t know is what you don&#039;t know&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Just because one node loses communication with another node, it &#039;&#039;&#039;cannot&#039;&#039;&#039; be assumed that the silent node is dead!&lt;br /&gt;
&lt;br /&gt;
== What is it? ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Fencing&amp;quot; is the act of isolating a malfunctioning node. The goal is to prevent a &#039;&#039;&#039;split-brain&#039;&#039;&#039; condition, where two nodes each think the other is dead and continue to use a shared resource. When this happens, file system corruption is almost guaranteed. Another dangerous scenario: one node pauses while writing to a disk, the other node decides it is dead and starts to replay the journal, then the first node recovers and completes the write. The results would be equally disastrous. If you are lucky enough not to lose the shared file system, you will be faced with the task of determining what data got written to which node, merging that data and/or overwriting the node you trust the least. This &#039;best case&#039; is still pretty lousy.&lt;br /&gt;
&lt;br /&gt;
Fencing, isolating a node from altering shared disks, can be accomplished in a couple of ways:&lt;br /&gt;
&lt;br /&gt;
* Power&lt;br /&gt;
** Power fencing is where a device is used to cut the power to a malfunctioning node. This is probably the most common type.&lt;br /&gt;
* Blocking&lt;br /&gt;
** Blocking is often implemented at the network level. This type of fencing leaves the node alone, but disconnects it from the storage network. Often this is done by a switch which prevents traffic coming from the fenced node.&lt;br /&gt;
&lt;br /&gt;
With power fencing, the term used is &amp;quot;STONITH&amp;quot;; literally, &#039;&#039;&#039;S&#039;&#039;&#039;hoot &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;O&#039;&#039;&#039;ther &#039;&#039;&#039;N&#039;&#039;&#039;ode &#039;&#039;&#039;I&#039;&#039;&#039;n &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;H&#039;&#039;&#039;ead. Picture it like an old west duel. If one node is dead, the other node is going to win the duel by default and the dead node will just be shot again. When both nodes are alive, however, the faster node will win and will &amp;quot;kill&amp;quot; (power off or reset) the slower node before it has a chance to fire. Once this duel is over, the surviving node can then access the shared resource confident that it is the only one working on it.&lt;br /&gt;
&lt;br /&gt;
== Misconception ==&lt;br /&gt;
&lt;br /&gt;
It is a &#039;&#039;&#039;very&#039;&#039;&#039; common mistake to ignore fencing when first starting to learn about clustering. Often people think &#039;&#039;&amp;quot;It&#039;s just for production systems, I don&#039;t need to worry about it yet because I don&#039;t care what happens to my test cluster.&amp;quot;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Wrong!&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For the most practical reason: the cluster software will block all I/O transactions when it can&#039;t guarantee that a fence operation succeeded. The result is that your cluster will essentially &amp;quot;lock up&amp;quot;. Likewise, [[cman]] and related daemons will fail if they can&#039;t find a fence agent to use.&lt;br /&gt;
&lt;br /&gt;
Secondly, testing our cluster will involve inducing errors. Without proper fencing, there is a high probability that our shared file system will be corrupted. That would force the need to start over, making your learning take a lot longer than it needs to.&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
How you implement fencing depends largely on what kind of fence device(s) you have. This is where I need help from you, dear reader. I only have access to [[Node Assassin]] fence devices on my lab&#039;s cluster. If you have other fence devices, like [[IPMI]], [[DRAC]] or so on, please let me know so that I can expand this section.&lt;br /&gt;
&lt;br /&gt;
=== Implementing Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
In Red Hat&#039;s cluster software, the fence device(s) are configured in the main &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf&amp;lt;/span&amp;gt; cluster configuration file. This configuration is then acted on via the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon. We&#039;ll cover the details of the [[cluster.conf]] file in a moment.&lt;br /&gt;
&lt;br /&gt;
When the cluster determines that a node needs to be fenced, the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon will consult the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file for information on how to access the fence device. Given this &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; snippet:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;xml&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;cluster name=&amp;quot;an-cluster&amp;quot; config_version=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
	&amp;lt;clusternodes&amp;gt;&lt;br /&gt;
		&amp;lt;clusternode name=&amp;quot;an-node02.alteeve.com&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
			&amp;lt;fence&amp;gt;&lt;br /&gt;
				&amp;lt;method name=&amp;quot;node_assassin&amp;quot;&amp;gt;&lt;br /&gt;
					&amp;lt;device name=&amp;quot;motoko&amp;quot; port=&amp;quot;02&amp;quot; action=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
				&amp;lt;/method&amp;gt;&lt;br /&gt;
			&amp;lt;/fence&amp;gt;&lt;br /&gt;
		&amp;lt;/clusternode&amp;gt;&lt;br /&gt;
	&amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
	&amp;lt;fencedevices&amp;gt;&lt;br /&gt;
		&amp;lt;fencedevice name=&amp;quot;motoko&amp;quot; agent=&amp;quot;fence_na&amp;quot; quiet=&amp;quot;true&amp;quot;&lt;br /&gt;
		ipaddr=&amp;quot;motoko.alteeve.com&amp;quot; login=&amp;quot;motoko&amp;quot; passwd=&amp;quot;secret&amp;quot;&amp;gt;&lt;br /&gt;
		&amp;lt;/fencedevice&amp;gt;&lt;br /&gt;
	&amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If the cluster manager determines that the node &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; needs to be fenced, it looks at the first (and only, in this case) &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;method&amp;gt;&amp;lt;/span&amp;gt; entry in that node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fence&amp;gt;&amp;lt;/span&amp;gt; section and reads its &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;device&amp;gt;&amp;lt;/span&amp;gt; entry&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;, which is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; in this case. It then looks in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fencedevices&amp;gt;&amp;lt;/span&amp;gt; section for the device with the matching &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;. From there, it gets the information needed to find and access the fence device. Once it connects to the fence device, it passes along the options set in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;device&amp;gt;&amp;lt;/span&amp;gt; entry.&lt;br /&gt;
&lt;br /&gt;
So in this example, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; looks up the details on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; Node Assassin fence device. It calls the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt; program, called a fence agent, and passes the following arguments:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr=motoko.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login=motoko&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd=secret&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;quiet=true&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port=2&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action=off&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
How the fence agent acts on these arguments varies depending on the fence device itself. In general terms, the &#039;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt;&#039; fence agent will create a connection to the device at the IP address (or resolvable name, as in this case) specified in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr&amp;lt;/span&amp;gt; argument. Once connected, it will authenticate using the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd&amp;lt;/span&amp;gt; arguments. Once authenticated, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port&amp;lt;/span&amp;gt; to act on, which could be a power jack, a power or reset button, a network switch port and so on. Finally, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action&amp;lt;/span&amp;gt; to take.&lt;br /&gt;
&lt;br /&gt;
Once the device completes, it returns a success or failed message. If the first attempt fails, the fence agent will try the next &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fence&amp;gt;&amp;lt;/span&amp;gt; method, if a second exists. It will keep trying fence devices in the order they are found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file until it runs out of devices. If it fails to fence the node, most daemons will &amp;quot;block&amp;quot;, that is, lock up and stop responding until the issue is resolved. The logic for this is that a locked up cluster is better than a corrupted one.&lt;br /&gt;
&lt;br /&gt;
If any of the fence devices succeed though, the cluster will know that it is safe to proceed and will reconfigure the cluster without the defective node.&lt;br /&gt;
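&lt;br /&gt;
As an aside, fence agents that follow Red Hat&#039;s fence agent API read these same &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;key=value&amp;lt;/span&amp;gt; arguments on standard input, which is how &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; invokes them. Assuming &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt; follows this convention, the fence call above could be mimicked by hand like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
printf &amp;quot;ipaddr=motoko.alteeve.com\nlogin=motoko\npasswd=secret\nport=2\naction=off\n&amp;quot; | fence_na&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;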
&lt;br /&gt;
== Fence Devices ==&lt;br /&gt;
&lt;br /&gt;
Many major [[OEM]]s have their own remote management devices that can serve as fence devices. Examples are [http://dell.ca Dell]&#039;s &#039;DRAC&#039; (Dell Remote Access Controller), [http://hp.ca HP]&#039;s &#039;iLO&#039; (Integrated Lights-Out), [http://ibm.ca IBM]&#039;s &#039;RSA&#039; (Remote Supervisor Adapter), [http://sun.ca Sun]&#039;s &#039;SSP&#039; (System Service Processor) and so on. Smaller manufacturers implement remote management via [[IPMI]], the Intelligent Platform Management Interface.&lt;br /&gt;
&lt;br /&gt;
In the above devices, fencing is implemented via a built-in or integrated device inside the server. These devices are usually accessible even when the host server is powered off or hard locked. Via these devices, the host server can be powered off, reset and powered on remotely, regardless of the state of the host server.&lt;br /&gt;
&lt;br /&gt;
Block fencing is possible when the device connecting a node to shared resources, like a fiber-channel SAN switch, provides a method of logically &amp;quot;unplugging&amp;quot; a defective node from the shared resource, leaving the node itself alone.&lt;br /&gt;
&lt;br /&gt;
=== IPMI ===&lt;br /&gt;
&lt;br /&gt;
[[IPMI]] is perhaps the most common method used for fencing nodes in a cluster. It requires having a system board with an IPMI baseboard management controller (BMC). IPMI is often the underlying technology used by many OEM-branded remote access tools. If you don&#039;t see a specific fence agent for your server&#039;s remote access application, experiment with generic IPMI tools.&lt;br /&gt;
&lt;br /&gt;
To start, we need to install the IPMI user software. On Fedora, the user-space tools are provided by the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipmitool&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;OpenIPMI&amp;lt;/span&amp;gt; packages:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install ipmitool OpenIPMI&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
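&lt;br /&gt;
As a quick sanity check (a sketch only; the BMC host name, user and password below are assumptions that must match your hardware&#039;s IPMI configuration), you can query a node&#039;s power state over the network with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipmitool&amp;lt;/span&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ipmitool -I lanplus -H an-node02-ipmi -U admin -P secret chassis power status&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The same &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;chassis power&amp;lt;/span&amp;gt; command accepts &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;off&amp;lt;/span&amp;gt;, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;on&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cycle&amp;lt;/span&amp;gt;, which is what a power fence ultimately does.&lt;br /&gt;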
&lt;br /&gt;
=== Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
A cheap alternative is the [[Node Assassin]], an open-hardware, open source fence device. It was built to allow the use of commodity system boards that lacked remote management support found on more expensive, server class hardware.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Full Disclosure&#039;&#039;&#039;: Node Assassin was created by me, with much help from others, for this paper.&lt;br /&gt;
&lt;br /&gt;
= Clean Up =&lt;br /&gt;
&lt;br /&gt;
Some daemons may have been installed or dragged in during the setup of the cluster. Most notable is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;heartbeat&amp;lt;/span&amp;gt;, which is not compatible with [[RHCS]]. The following command will disable various daemons from starting at boot.&lt;br /&gt;
&lt;br /&gt;
This command is safe to run even when some of these daemons aren&#039;t installed or have already been removed. Of course, skip any you actually want to use. This assumes that the nodes have been built according to this HowTo.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig heartbeat off; chkconfig iscsid off; chkconfig iptables off; chkconfig ip6tables off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= So You Have a Cluster - Now What? =&lt;br /&gt;
&lt;br /&gt;
Building the cluster infrastructure was pretty easy, wasn&#039;t it?&lt;br /&gt;
&lt;br /&gt;
But what will you &#039;&#039;&#039;&#039;&#039;do&#039;&#039;&#039;&#039;&#039; with it?&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;width: 75%; text-align: center; padding: 10px; border-spacing: 15px;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!style=&amp;quot;width: 100%; border: 1px solid #dfdfdf;&amp;quot; colspan=&amp;quot;3&amp;quot;|Choose Your Own &amp;lt;span style=&amp;quot;text-decoration: line-through; color: #7f7f7f;&amp;quot;&amp;gt;Adventure&amp;lt;/span&amp;gt; Cluster&lt;br /&gt;
|-&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|style=&amp;quot;width: 34%; border: 1px solid #dfdfdf;&amp;quot;| [[Two Node Fedora 13 Cluster - Xen-Based Virtual Machine Host on DRBD+CLVM|Xen-Based Virtual Machine Host on DRBD+CLVM]]&amp;lt;br /&amp;gt;High Availability VM Host&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Thanks =&lt;br /&gt;
&lt;br /&gt;
* A &#039;&#039;&#039;huge&#039;&#039;&#039; thanks to [http://iplink.net Interlink Connectivity]! They hire me as a contractor and have allowed me to extend these docs while working on their clusters. Development of these How-Tos would be much slower if not for them. If you need hosting or colo services, drop them a line. Their website is a bit out of date though, so disregard the prices. :)&lt;br /&gt;
* To &#039;&#039;&#039;sdake&#039;&#039;&#039; of [http://corosync.org corosync] for helping me sort out the &#039;&#039;&#039;plock&#039;&#039;&#039; component and corosync in general.&lt;br /&gt;
* To &#039;&#039;&#039;Angus Salkeld&#039;&#039;&#039; for helping me nail down the Corosync and OpenAIS differences.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013922.html HJ Lee] from the OpenAIS list for helping me understand the mechanisms controlling the Redundant Ring Protocol&#039;s failure detection types.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013925.html Steven Dake] for clarifying the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;to_x&amp;lt;/span&amp;gt; vs. &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;logoutput: x&amp;lt;/span&amp;gt; arguments in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais.conf&amp;lt;/span&amp;gt;.&lt;br /&gt;
* To [http://dk.linkedin.com/in/fabbione Fabio Massimo Di Nitto] for helping me get caught up with clustering and VMs.&lt;br /&gt;
&lt;br /&gt;
{{footer}}&lt;/div&gt;</summary>
		<author><name>Facade</name></author>
	</entry>
	<entry>
		<id>https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2112</id>
		<title>Abandoned - Two Node Fedora 13 Cluster</title>
		<link rel="alternate" type="text/html" href="https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2112"/>
		<updated>2010-09-09T22:21:16Z</updated>

		<summary type="html">&lt;p&gt;Facade: /* Change the Default Run-Level */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{howto_header}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Notice&#039;&#039;&#039;&#039;&#039;, &#039;&#039;&#039;Sep. 08, 2010&#039;&#039;&#039; - &#039;&#039;Draft 1&#039;&#039;: This is the first draft of this HowTo. It has not been peer-reviewed yet, so please proceed with caution and on a test cluster. Any and all feedback is much appreciated.&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
&lt;br /&gt;
This paper has one goal:&lt;br /&gt;
&lt;br /&gt;
# How to assemble the simplest cluster possible, a &#039;&#039;&#039;2-Node Cluster&#039;&#039;&#039;, which you can then expand on for your own needs.&lt;br /&gt;
&lt;br /&gt;
With this completed, you can then jump into &amp;quot;step 2&amp;quot; papers that will show various uses of a two node cluster:&lt;br /&gt;
&lt;br /&gt;
# How to create a &amp;quot;floating&amp;quot; virtual machine that can move between the two nodes in the event of a node failure, maximizing up time.&lt;br /&gt;
# How to create simple resources that can move between nodes. Examples will be a simple PostgreSQL database, DHCP, DNS and web servers.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
&lt;br /&gt;
It is expected that you are already comfortable with the Linux command line, specifically &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[bash]]&amp;lt;/span&amp;gt;, and that you are familiar with general administrative tasks in Red Hat-based distributions, specifically [[Fedora]]. You will also need to be comfortable using editors like [[vim]], [[nano]] or similar. This paper uses &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;vim&amp;lt;/span&amp;gt; in examples. Simply substitute your favourite editor in its place.&lt;br /&gt;
&lt;br /&gt;
You are also expected to be comfortable with networking concepts. You will be expected to understand TCP/IP, [[multicast]], broadcast, subnets and netmasks, routing and other relatively basic networking concepts. Please take the time to become familiar with these concepts before proceeding.&lt;br /&gt;
&lt;br /&gt;
That said, where feasible, as much detail as possible will be provided. For example, all configuration file locations will be shown and functioning sample files will be provided.&lt;br /&gt;
&lt;br /&gt;
== Platform ==&lt;br /&gt;
&lt;br /&gt;
This paper will implement the [[Red Hat]] Cluster Suite using the Fedora v13 distribution. This paper uses the [[x86_64]] repositories, however, if you are on an [[i386]] (32 bit) system, you should be able to follow along fine. Simply replace &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;x86_64&amp;lt;/span&amp;gt; with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i386&amp;lt;/span&amp;gt; or &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i686&amp;lt;/span&amp;gt; in package names. &lt;br /&gt;
&lt;br /&gt;
You can either download the stock Fedora 13 DVD ISO (currently at version &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5.4&amp;lt;/span&amp;gt; which is used in this document), or you can try out the alpha [[#AN!Cluster Install|AN!Cluster Install]] DVD. (4.3GB iso). If you use the latter, please test it out on a development or test cluster. If you have any problems with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;AN!Cluster&amp;lt;/span&amp;gt; variant Fedora distro, please [[Digimer|contact me]] and let me know what your trouble was.&lt;br /&gt;
&lt;br /&gt;
== Why Fedora 13? ==&lt;br /&gt;
&lt;br /&gt;
Generally speaking, I much prefer to use a server-oriented Linux distribution like [[CentOS]], [[Debian]] or similar. However, there have been many recent changes in the Linux clustering world that have made all of the currently available server-class distributions obsolete. With luck, once [[Red Hat]] Enterprise Linux and [[CentOS]] version 6 are released, this will change.&lt;br /&gt;
&lt;br /&gt;
Until then, [[Fedora]] version 13 provides the most up to date binary releases of the new implementation of the clustering stack available. For this reason, FC13 is the best choice in clustering, if you want to be current. To mitigate some of the issues introduced by using a workstation distribution, many packages will be stripped out of the default install.&lt;br /&gt;
&lt;br /&gt;
== Focus ==&lt;br /&gt;
&lt;br /&gt;
Clusters can serve to solve three problems; &#039;&#039;&#039;Reliability&#039;&#039;&#039;, &#039;&#039;&#039;Performance&#039;&#039;&#039; and &#039;&#039;&#039;Scalability&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
The focus of the cluster described in this paper is primarily &#039;&#039;&#039;reliability&#039;&#039;&#039;. Second to this, &#039;&#039;&#039;scalability&#039;&#039;&#039; is the priority, leaving &#039;&#039;&#039;performance&#039;&#039;&#039; to be addressed only when it does not impact the first two criteria. This is not to say that performance is not a valid priority; it simply isn&#039;t the priority of this paper.&lt;br /&gt;
&lt;br /&gt;
== Goal ==&lt;br /&gt;
&lt;br /&gt;
At the end of this paper, you should have a fully functioning two-node cluster capable of hosting &amp;quot;floating&amp;quot; resources. That is, resources that exist on one node and can be easily moved to the other with minimal effort and downtime. This should leave you with a solid foundation for adding more virtual servers, up to the limit of your cluster&#039;s resources.&lt;br /&gt;
&lt;br /&gt;
This paper should also show how to build the foundation of any other cluster configuration. Its core focus is introducing the main issues that come with clustering, in the hope of providing a starting point for configurations outside its scope.&lt;br /&gt;
&lt;br /&gt;
= Begin = &lt;br /&gt;
&lt;br /&gt;
Let&#039;s begin!&lt;br /&gt;
&lt;br /&gt;
== Hardware ==&lt;br /&gt;
&lt;br /&gt;
We will need two physical servers each with the following hardware:&lt;br /&gt;
* One or more multi-core [[CPU]]s with Virtualization support.&lt;br /&gt;
* Three network cards; at least one should be gigabit or faster.&lt;br /&gt;
* One or more hard drives.&lt;br /&gt;
* You will need some form of a [[fence|fence device]]. This can be an [[IPMI]]-enabled server, a [http://nodeassassin.org Node Assassin], a fenceable [http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900&amp;amp;tab=features PDU] or similar.&lt;br /&gt;
&lt;br /&gt;
This paper uses the following hardware:&lt;br /&gt;
* ASUS [http://support.asus.com/search/search.aspx?keyword=m4a78l-m&amp;amp;SLanguage=en-us M4A78L-M]&lt;br /&gt;
* AMD Athlon II x2 250 &lt;br /&gt;
* 2GB Kingston DDR2 KVR800D2N6K2/4G (4GB kit split between the two nodes)&lt;br /&gt;
* 1x Intel 82540 PCI NIC&lt;br /&gt;
* 1x D-Link DGE-560T&lt;br /&gt;
* Node Assassin&lt;br /&gt;
&lt;br /&gt;
This is not an endorsement of the above hardware. I bought what was within my budget that would serve the purposes of creating this document. What you purchase shouldn&#039;t matter, so long as the minimum requirements are met.&lt;br /&gt;
&lt;br /&gt;
== Pre-Assembly ==&lt;br /&gt;
&lt;br /&gt;
With multiple NICs, it is quite likely that the mapping of physical devices to logical &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; devices may not be ideal. This is a particular issue if you decide to [[Setting Up a PXE Server in Fedora|network boot]] your install media. In that case, if the wrong NIC is chosen for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;, you will be presented with a list of MAC addresses to attempt setup with.&lt;br /&gt;
&lt;br /&gt;
Before you assemble your servers, record their network cards&#039; [[MAC]] addresses. I like to keep simple text files like these:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node01.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:EA	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:53	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:46:E4	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node02.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:D8	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:5A	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:45:78	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Feel free to record the information in any way that suits you the best.&lt;br /&gt;
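&lt;br /&gt;
If you want to gather these addresses from a running system, the following loop is a minimal sketch; it assumes the standard &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;sysfs&amp;lt;/span&amp;gt; layout used by recent kernels:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Print the MAC address and name of each wired interface.&lt;br /&gt;
for nic in /sys/class/net/eth*; do&lt;br /&gt;
	echo &amp;quot;$(cat $nic/address) $(basename $nic)&amp;quot;&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;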
&lt;br /&gt;
== OS Install ==&lt;br /&gt;
&lt;br /&gt;
Start with a stock Fedora 13 install. This How-To uses Fedora 13 x86_64; however, it should be fairly easy to adapt to other recent Fedora versions. This document is also written so that it can be easily ported to [[CentOS]]/[[RHEL]] version 6 once it is released. It will not adapt well to CentOS/RHEL version 5, though, as much of the cluster stack has changed dramatically since that release.&lt;br /&gt;
&lt;br /&gt;
These are the sample kickstart scripts used by this paper. Be sure to set your own password string and network settings.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Warning&#039;&#039;&#039;&#039;&#039;! These kickstart scripts &#039;&#039;&#039;&#039;&#039;will erase your hard drive&#039;&#039;&#039;&#039;&#039;! Adapt them, don&#039;t blindly use them.&lt;br /&gt;
&lt;br /&gt;
Generic cluster node kickstart scripts.&lt;br /&gt;
* [[Fedora13 KS an-node01.ks|an-node01.ks]]&lt;br /&gt;
* [[Fedora13 KS an-node02.ks|an-node02.ks]]&lt;br /&gt;
&lt;br /&gt;
== AN!Cluster Install ==&lt;br /&gt;
&lt;br /&gt;
If you are feeling brave, below is a link to a custom install DVD that contains the kickstart scripts to setup nodes and an &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-cluster&amp;lt;/span&amp;gt; directory with all the configuration files.&lt;br /&gt;
&lt;br /&gt;
* Download the custom &#039;&#039;&#039;AN!Cluster v0.2.001&#039;&#039;&#039; Install DVD. (4.5[[GiB]] iso). (Currently disabled - Reworking for F13)&lt;br /&gt;
&lt;br /&gt;
== Post OS Install ==&lt;br /&gt;
&lt;br /&gt;
Once the OS is installed, we need to do a few things:&lt;br /&gt;
&lt;br /&gt;
# Disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;&lt;br /&gt;
# Setup networking.&lt;br /&gt;
# Change the default run-level.&lt;br /&gt;
&lt;br /&gt;
=== Disable &#039;selinux&#039; ===&lt;br /&gt;
&lt;br /&gt;
To allow this paper to focus on clustering, we will disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;. Obviously, this introduces security issues that you may not be comfortable with. If you wish to leave &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt; enabled, it will be up to you to sort out issues that crop up.&lt;br /&gt;
&lt;br /&gt;
To disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/selinux/config&amp;lt;/span&amp;gt; and change &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=enforcing&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=permissive&amp;lt;/span&amp;gt;. You will need to reboot in order for the changes to take effect.&lt;br /&gt;
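&lt;br /&gt;
If you prefer to make the change from the command line, this &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;sed&amp;lt;/span&amp;gt; call is a minimal sketch that performs the same edit:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Switch selinux from &#039;enforcing&#039; to &#039;permissive&#039;; takes effect after a reboot.&lt;br /&gt;
sed -i &#039;s/^SELINUX=enforcing/SELINUX=permissive/&#039; /etc/selinux/config&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;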
&lt;br /&gt;
=== Change the Default Run-Level ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. It only improves performance; it is not required.&lt;br /&gt;
&lt;br /&gt;
If you don&#039;t plan to work on your nodes directly, it makes sense to switch the default run level from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;3&amp;lt;/span&amp;gt;. This prevents the window manager, like Gnome or KDE, from starting at boot. This frees up a fair bit of memory and system resources and reduces possible attack vectors.&lt;br /&gt;
&lt;br /&gt;
To do this, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/inittab&amp;lt;/span&amp;gt;, change the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:5:initdefault:&amp;lt;/span&amp;gt; line to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:3:initdefault:&amp;lt;/span&amp;gt; and then switch to run level 3:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/inittab&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
id:3:initdefault:&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
init 3&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Setup Networking ===&lt;br /&gt;
&lt;br /&gt;
We need to remove &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt;, enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt;, configure the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ifcfg-eth*&amp;lt;/span&amp;gt; files and then start the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; daemon.&lt;br /&gt;
&lt;br /&gt;
==== Network Layout ====&lt;br /&gt;
&lt;br /&gt;
This setup expects you to have three physical network cards connected to three independent networks. To have a common vernacular, let&#039;s use this table to describe them:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!class=&amp;quot;cell_all&amp;quot;|Network Description&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Short Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Device Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Suggested Subnet&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|NIC Properties&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Internet-Facing Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Remaining NIC should be used here.&amp;lt;br /&amp;gt;If using a [[Setting Up a PXE Server in Fedora|PXE server]], this should be a bootable NIC.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Storage Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth1&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Fastest NIC should be used here.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Back-Channel Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth2&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|NICs with [[IPMI]] piggy-back &#039;&#039;&#039;must&#039;&#039;&#039; be used here.&amp;lt;br /&amp;gt;Second-fastest NIC should be used here.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Take note of these concerns when planning which NIC to use on each subnet. These issues are presented in the order in which they must be addressed:&lt;br /&gt;
# If your nodes have [[IPMI]] piggy-backing on a normal NIC, that NIC &#039;&#039;&#039;&#039;&#039;must&#039;&#039;&#039;&#039;&#039; be used on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. Having your fence device accessible on a subnet that can be remotely accessed can pose a &#039;&#039;major&#039;&#039; security risk.&lt;br /&gt;
# The fastest NIC should be used for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt; subnet. Be sure to know which NICs support the largest jumbo frames when considering this.&lt;br /&gt;
# If you still have two NICs to choose from, use the fastest remaining NIC for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. This will minimize the time it takes to perform tasks like hot-migration of live virtual machines.&lt;br /&gt;
# The final NIC should be used for the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt; subnet.&lt;br /&gt;
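&lt;br /&gt;
When ranking your NICs, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethtool&amp;lt;/span&amp;gt; can report the link speed, and you can probe jumbo frame support by attempting to raise the MTU. This is a minimal sketch; the device name is illustrative:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Report the negotiated link speed of eth1.&lt;br /&gt;
ethtool eth1 | grep -i speed&lt;br /&gt;
# Try a 9000-byte MTU; this fails if the NIC or driver lacks jumbo frame support.&lt;br /&gt;
ip link set eth1 mtu 9000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;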
&lt;br /&gt;
==== Node IP Addresses ====&lt;br /&gt;
&lt;br /&gt;
Obviously, the IP addresses you give to your nodes should be ones that suit you best. In this example, the following IP addresses are used:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;amp;nbsp;&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Internet-Facing Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Storage Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Back-Channel Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node01&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node02&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==== Remove NetworkManager ====&lt;br /&gt;
&lt;br /&gt;
Some cluster software &#039;&#039;&#039;will not&#039;&#039;&#039; start with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; installed. This is because it is designed to be a highly adaptive network system that can accommodate frequent changes in the network. To simplify these transitions for end users, a lot of network reconfiguration is done behind the scenes.&lt;br /&gt;
&lt;br /&gt;
For workstations, this is wonderful. For clustering, this can be disastrous. Transient network outages are already a risk to a cluster&#039;s stability!&lt;br /&gt;
&lt;br /&gt;
So first up, make sure that &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is completely removed from your system. If you used the [[#OS Install|kickstart]] scripts, then it was not installed. Otherwise, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum remove NetworkManager NetworkManager-gnome NetworkManager-openconnect NetworkManager-openvpn NetworkManager-pptp NetworkManager-vpnc cnetworkmanager knetworkmanager knetworkmanager-openvpn knetworkmanager-pptp knetworkmanager-vpnc libproxy-networkmanager yum-NetworkManager-dispatcher&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
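&lt;br /&gt;
With &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; removed, make sure the classic &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; init script will start at boot. A minimal sketch:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Enable the &#039;network&#039; service and confirm its run levels.&lt;br /&gt;
chkconfig network on&lt;br /&gt;
chkconfig --list network&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;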
&lt;br /&gt;
==== Setup &#039;network&#039; ====&lt;br /&gt;
&lt;br /&gt;
Before proceeding with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; configuration, check to see if your network cards are aligned to the proper &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; network names. If they need to be adjusted, please follow this How-To before proceeding:&lt;br /&gt;
&lt;br /&gt;
* [[Changing the ethX to Ethernet Device Mapping in Fedora]]&lt;br /&gt;
&lt;br /&gt;
There are a few ways to configure &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; in Fedora:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network&amp;lt;/span&amp;gt; (graphical)&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network-tui&amp;lt;/span&amp;gt; (ncurses)&lt;br /&gt;
* Directly editing the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/sysconfig/network-scripts/ifcfg-eth*&amp;lt;/span&amp;gt; files. (See: [http://docs.fedoraproject.org/en-US/Fedora/12/html/Deployment_Guide/s1-networkscripts-interfaces.html here] for a full list of options)&lt;br /&gt;
&lt;br /&gt;
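If you choose to edit the files directly, the example below is a minimal sketch of a static-IP &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ifcfg-eth0&amp;lt;/span&amp;gt;; the addresses and MAC are illustrative, so substitute your own values:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/sysconfig/network-scripts/ifcfg-eth0&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
DEVICE=eth0&lt;br /&gt;
HWADDR=90:E6:BA:71:82:EA&lt;br /&gt;
BOOTPROTO=none&lt;br /&gt;
ONBOOT=yes&lt;br /&gt;
IPADDR=192.168.1.71&lt;br /&gt;
NETMASK=255.255.255.0&lt;br /&gt;
GATEWAY=192.168.1.1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;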
Do not proceed until your node&#039;s networking is fully configured.&lt;br /&gt;
&lt;br /&gt;
==== Update the Hosts File ====&lt;br /&gt;
&lt;br /&gt;
Some applications expect to be able to call nodes by their name. To accommodate this, and to ensure that inter-node communication takes place on the back-channel subnet, we remove any existing hostname entries and then add the following to the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Any pre-existing entries matching the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; must be removed from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt;. There is a good chance there will be an entry that resolves to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;127.0.0.1&amp;lt;/span&amp;gt; which would cause problems later.&lt;br /&gt;
&lt;br /&gt;
Obviously, adapt the names and IPs to match your nodes and subnets. The only critical thing is to make sure that the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; is resolvable to the back-channel subnet. I like to add a short-form name for convenience.&lt;br /&gt;
&lt;br /&gt;
The updated &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file should look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/hosts&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4&lt;br /&gt;
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6&lt;br /&gt;
&lt;br /&gt;
# Back-channel IPs to name mapping.&lt;br /&gt;
10.0.1.71	an-node01 an-node01.alteeve.com&lt;br /&gt;
10.0.1.72	an-node02 an-node02.alteeve.com&lt;br /&gt;
&lt;br /&gt;
# Storage network&lt;br /&gt;
10.0.0.71	an-node01.sn&lt;br /&gt;
10.0.0.72	an-node02.sn&lt;br /&gt;
&lt;br /&gt;
# Internet facing&lt;br /&gt;
192.168.1.71	an-node01.inet&lt;br /&gt;
192.168.1.72	an-node02.inet&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To test this, ping both nodes by name and make sure the ping packets are sent on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt; subnet:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node01&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node01 (10.0.1.71) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=1 ttl=64 time=0.049 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=2 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=3 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=4 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=5 ttl=64 time=0.055 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node02 (10.0.1.72) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=1 ttl=64 time=0.221 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=2 ttl=64 time=0.188 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=3 ttl=64 time=0.217 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=4 ttl=64 time=0.192 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=5 ttl=64 time=0.163 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Disable Firewalls ====&lt;br /&gt;
&lt;br /&gt;
Be sure to flush the netfilter tables and prevent &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ip6tables&amp;lt;/span&amp;gt; from starting on our nodes.&lt;br /&gt;
&lt;br /&gt;
There will be enough potential sources of problems as it is. Disabling firewalls at this stage will minimize the chance of an errant &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; rule messing up our configuration. If, before launch, you wish to implement a firewall, feel free to do so, but be sure to thoroughly test your cluster to ensure no problems were introduced.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig --level 2345 iptables off&lt;br /&gt;
/etc/init.d/iptables stop&lt;br /&gt;
chkconfig --level 2345 ip6tables off&lt;br /&gt;
/etc/init.d/ip6tables stop&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup SSH Shared Keys ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. Setting up shared keys will allow your nodes to pass files between one another and execute commands remotely without needing to enter a password. This is obviously somewhat risky from a security point of view. As such, it is up to you whether you do this or not. This is not meant to be a security-focused How-To, so please independently study the risks.&lt;br /&gt;
&lt;br /&gt;
If you&#039;re a little new to [[SSH]], it can be a bit confusing keeping connections straight in your head. When you connect to a remote machine, you start the connection on your machine as the user you are logged in as. This is the source user. When you call the remote machine, you tell it which user you want to log in as. This is the remote user.&lt;br /&gt;
&lt;br /&gt;
You will need to create an SSH key for each source user, and then you will need to copy the newly generated public key to each remote machine&#039;s user directory that you want to connect to. In this example, we want to connect to either node, from either node, as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. So we will create a key for each node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user and then copy the generated public key to the &#039;&#039;other&#039;&#039; node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user&#039;s directory.&lt;br /&gt;
&lt;br /&gt;
For each user, on each machine you want to connect &#039;&#039;&#039;from&#039;&#039;&#039;, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh-keygen -t dsa -N &amp;quot;&amp;quot; -f ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
Generating public/private dsa key pair.&lt;br /&gt;
Your identification has been saved in /root/.ssh/id_dsa.&lt;br /&gt;
Your public key has been saved in /root/.ssh/id_dsa.pub.&lt;br /&gt;
The key fingerprint is:&lt;br /&gt;
c4:cc:10:0a:6b:60:24:e7:57:94:69:e5:10:b2:cb:1f root@an-node01.alteeve.com&lt;br /&gt;
The key&#039;s randomart image is:&lt;br /&gt;
+--[ DSA 1024]----+&lt;br /&gt;
|+oo ..B*.        |&lt;br /&gt;
|o+ o =+B         |&lt;br /&gt;
|  + +.  *        |&lt;br /&gt;
| . o . .         |&lt;br /&gt;
|    o E S        |&lt;br /&gt;
|     . .         |&lt;br /&gt;
|      .          |&lt;br /&gt;
|                 |&lt;br /&gt;
|                 |&lt;br /&gt;
+-----------------+&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will create two files: the private key, called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa&amp;lt;/span&amp;gt;, and the public key, called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa.pub&amp;lt;/span&amp;gt;. The private key &#039;&#039;&#039;&#039;&#039;must never&#039;&#039;&#039;&#039;&#039; be group or world readable! That is, it should be set to mode &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;0600&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The two files should look like:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Private key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
-----BEGIN DSA PRIVATE KEY-----&lt;br /&gt;
MIIBugIBAAKBgQCVo5nzeC/W8HWaFkD4pAmWO7ovf7S01f8HgocKKOYX/nMNxRui&lt;br /&gt;
wExBUUUt1b2TxYYQklTd9dn8UAzZ3UohN0CJNDrp//t2wfyACKKc2q+ewao5/svQ&lt;br /&gt;
xXTBu2ZPebKPcr1w9p0hq0xSr77Rg3v6Auc6AnC79DM9XkhYkVgNfgrvMwIVAKdY&lt;br /&gt;
SgTycvqNEWsJrCTTn5485yWvAoGANQ3rzIxFy2zHWyMlrpA9r5jllgxfaJVx3/Iw&lt;br /&gt;
DrEF82lr2dOmG8oYiKAhfvdgNyV7VeM2yxdjcdLPJAHHCx5BZZ7V9KjMzY26wS2r&lt;br /&gt;
d58TZJoWmO0nscmcwIbCv2at3fiMpxgr+nzD4tKxFOdWxYBwdmUpIjtJoW8wzY2H&lt;br /&gt;
uNghjL0CgYA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDN&lt;br /&gt;
bbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EE&lt;br /&gt;
D3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5AIUFX97yIig&lt;br /&gt;
n2SUltN+10NWUy4oYsA=&lt;br /&gt;
-----END DSA PRIVATE KEY-----&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Public key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa.pub&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Copy the public key and then &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; normally into the remote machine as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create a file called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; and paste in the key.&lt;br /&gt;
&lt;br /&gt;
From &#039;&#039;&#039;an-node01&#039;&#039;&#039;, type:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh root@an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
The authenticity of host &#039;an-node02 (10.0.1.72)&#039; can&#039;t be established.&lt;br /&gt;
RSA key fingerprint is d4:1b:68:5f:fa:ef:0f:0c:16:e7:f9:0a:8d:69:3e:c5.&lt;br /&gt;
Are you sure you want to continue connecting (yes/no)? yes&lt;br /&gt;
Warning: Permanently added &#039;an-node02,10.0.1.72&#039; (RSA) to the list of known hosts.&lt;br /&gt;
root@an-node02&#039;s password: &lt;br /&gt;
Last login: Tue Jul  6 22:28:19 2010 from 192.168.1.102&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will now be logged into &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02&amp;lt;/span&amp;gt; as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; file and paste into it the public key from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01&amp;lt;/span&amp;gt;. If the remote machine&#039;s user hasn&#039;t used &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; yet, their &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh&amp;lt;/span&amp;gt; directory will not exist.&lt;br /&gt;
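The creation of that key store can be sketched as follows. This is a minimal illustration run against a scratch directory so it is safe anywhere; on the real node you would work in root's actual home directory, and the key string here is a placeholder, not a real key. The permissions matter: sshd will silently ignore the keys if the directory or file is writable by others.

```shell
# Use a scratch directory as a stand-in HOME for this sketch.
HOME=$(mktemp -d)

# Create the key store; sshd expects ~/.ssh to be 0700.
mkdir -p "$HOME/.ssh"
chmod 700 "$HOME/.ssh"

# Paste in the public key copied from an-node01 (placeholder shown here),
# then restrict the file to the owner (0600).
echo "ssh-dss AAAA...placeholder... root@an-node01.alteeve.com" >> "$HOME/.ssh/authorized_keys"
chmod 600 "$HOME/.ssh/authorized_keys"

# Confirm the permissions.
ls -ld "$HOME/.ssh" "$HOME/.ssh/authorized_keys"
```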
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/authorized_keys&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now log out and then log back into the remote machine. This time, the connection should succeed without prompting for a password!&lt;br /&gt;

&lt;br /&gt;
= Initial Cluster Setup =&lt;br /&gt;
&lt;br /&gt;
Before we get into specifics, let&#039;s take a minute to talk about the major components used in our cluster.&lt;br /&gt;
&lt;br /&gt;
== Core Program Overviews ==&lt;br /&gt;
&lt;br /&gt;
These are the core programs, possibly new to you, that we will use to build our cluster. Before we configure them, let&#039;s take a minute to understand their roles.&lt;br /&gt;
&lt;br /&gt;
=== Cluster Manager ===&lt;br /&gt;
&lt;br /&gt;
The cluster manager, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, takes the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; configuration file and uses it to configure and start the various cluster elements. This is where nodes are defined, fence devices are configured and various cluster tolerances are set.&lt;br /&gt;
&lt;br /&gt;
=== Corosync ===&lt;br /&gt;
&lt;br /&gt;
[http://www.corosync.org Corosync] is, essentially, the clustering kernel which provides the essential services for cluster-aware applications and other cluster services. At its core is the [[totem]] protocol which provides cluster membership and ordered messaging. On top of this, it implements a number of core clustering services like [[#cpg; Closed Process Group|closed process group messaging]], [[quorum]] and [[confdb]] (an object database).&lt;br /&gt;
&lt;br /&gt;
Its goal is to provide a substantially simpler and more flexible set of APIs to facilitate clustering in Linux. It manages which nodes are in the cluster, triggers error messages when something fails, manages cluster locking and so on. Most other clustered applications rely on corosync to know when something has happened or to announce when a cluster-related action has taken place.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;corosync_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
==== cpg; Closed Process Group ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install corosynclib-devel&lt;br /&gt;
man cpg_overview&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The closed process group is a small process layer on top of the totem protocol provided by corosync. It handles the sending and delivery of messages amongst nodes in a consistent order. It adds [[PID]]s and group names to the membership layer. Only members of the group get the messages, thus it is a &amp;quot;closed&amp;quot; group.&lt;br /&gt;
&lt;br /&gt;
=== OpenAIS ===&lt;br /&gt;
&lt;br /&gt;
OpenAIS is now an extension to Corosync that adds an open-source implementation of the [http://www.saforum.org/ Service Availability (SA) Forum&#039;s] &#039;Application Interface Specification&#039; ([[AIS]]). It is an [[API]] and policy designed to be used by applications concerned with maintaining services during faults. AIS implements the &#039;Availability Management Framework&#039; ([[AMF]]) which, in turn, provides for application fail over, cluster management ([[CLM]]), Checkpointing ([[CKPT]]), Eventing ([[EVT]]), Messaging ([[MSG]]), and Distributed Locking ([[DLOCK]]).&lt;br /&gt;
&lt;br /&gt;
It does this by implementing pluggable shared libraries that get loaded into corosync at runtime. These libraries then can make use of corosync&#039;s internal API. Both corosync and openais services provide [[IPC]] interfaces so that application programmers can make use of their behaviour.&lt;br /&gt;
&lt;br /&gt;
In short, applications can use OpenAIS to be cluster-aware. Its libraries are used by some applications, including Pacemaker. In our application, we will only be using its libraries indirectly.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
[http://www.clusterlabs.org Pacemaker] is the cluster resource manager. It can be used to trigger scripts to control non cluster-aware applications. In this way, these non cluster-aware applications can be made more highly available. It is considered a replacement for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[rgmanager]]&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Setup Core Programs ==&lt;br /&gt;
&lt;br /&gt;
We now need to edit and create the configuration files for our cluster.&lt;br /&gt;
&lt;br /&gt;
=== Setup cman ===&lt;br /&gt;
&lt;br /&gt;
For a basic cluster, this may be the only section you need to configure.&lt;br /&gt;
&lt;br /&gt;
First, install the cluster manager:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install cman&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once installed, we need to create and fill in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; file. This is the core configuration file of our cluster and is an [[XML]] formatted file.&lt;br /&gt;
&lt;br /&gt;
Please see: [[Two-Node Fedora 13 cluster.conf]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The example in that article details many of the common options you will want to use. For a more complete discussion of this file, please run &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man 5 cluster.conf&amp;lt;/span&amp;gt;. Also, please be aware that some arguments can overlap between [[#Setup Corosync|corosync]] and this file.&lt;br /&gt;
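For orientation, here is a minimal two-node skeleton showing the shape such a file takes. This is a sketch only, not a working configuration: the node names follow this HowTo's examples, the `two_node="1" expected_votes="1"` attributes are what allow quorum to survive with a single node, and the fence configuration (discussed in the Fencing section) is deliberately left empty.

```xml
<?xml version="1.0"?>
<cluster name="an-cluster" config_version="1">
	<!-- Required in a two-node cluster: lets one surviving node keep quorum. -->
	<cman two_node="1" expected_votes="1"/>
	<clusternodes>
		<clusternode name="an-node01.alteeve.com" nodeid="1"/>
		<clusternode name="an-node02.alteeve.com" nodeid="2"/>
	</clusternodes>
	<!-- Fence devices must be added here before the cluster is usable. -->
	<fencedevices/>
</cluster>
```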
&lt;br /&gt;
Once it is set up, we need to verify it. When you have finished editing, be sure to run this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
xmllint --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are errors, fix them. Once it is formatted properly, the contents of your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file will be returned, followed by &amp;quot;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf validates&amp;lt;/span&amp;gt;&amp;quot;. Once this is the case, you can move on.&lt;br /&gt;
&lt;br /&gt;
A note: Normally at this stage, you&#039;d use &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;chkconfig&amp;lt;/span&amp;gt; to enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; to start at boot. However, depending on how you use your cluster, you may need to modify the boot order. For now, let&#039;s leave it off until we&#039;ve got all the components enabled. This is particularly important if you will be using [[DRBD]], [[CLVM]] or [[Xen]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig cman off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To see if it&#039;s working though, have a terminal open to both nodes and manually start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;. I suggest having two terminal windows open on each node. In one, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;tail&amp;lt;/span&amp;gt;ing &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/var/log/messages&amp;lt;/span&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
clear; tail -f -n 0 /var/log/messages&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now, with both log files being watched, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; on both nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: You MUST execute the following on both nodes within the time limit set by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;post_join_delay&amp;lt;/span&amp;gt; (the default is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;6&amp;lt;/span&amp;gt; seconds). If you wait longer than this, the first node started will [[fence]] the other node, assuming you have fencing configured properly. If fencing is &#039;&#039;&#039;not&#039;&#039;&#039; configured properly, your first node will hang.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
/etc/init.d/cman start&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Examine the log files to verify there were no errors. If there were, fix them. If there were none, fantastic!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Warning&#039;&#039;&#039;: This last step will forcefully reboot your nodes. Do not do this on a production cluster! If it must be done on a production server, stop all non-essential services and be sure you have a good backup, &#039;&#039;just in case&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Lastly, test that your fencing is configured and working properly. We do this by using a program called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_node&amp;lt;/span&amp;gt;. With both nodes up and running &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, call a fence against the other node. Be sure to be watching the log files.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;From&#039;&#039;&#039; &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
fence_node an-node02.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
fence an-node02.alteeve.com success&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If it worked, when the node comes back up, repeat the process from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; against &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Setup Corosync ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: In our cluster, we will not need to configure Corosync.&lt;br /&gt;
&lt;br /&gt;
When using &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, as we are here, Corosync&#039;s configuration file is ignored.&lt;br /&gt;
&lt;br /&gt;
Most of the settings available in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 corosync.conf|corosync.conf]]&amp;lt;/span&amp;gt; can be found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|cluster.conf]]&amp;lt;/span&amp;gt; file. Please see the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; article for a comprehensive list of what is and is not supported.&lt;br /&gt;
&lt;br /&gt;
If you want to configure corosync and not use cman, please see the following article:&lt;br /&gt;
&lt;br /&gt;
* [[Two-Node Fedora 13 corosync.conf]]&lt;br /&gt;
&lt;br /&gt;
=== Setup Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
ToDo.&lt;br /&gt;
&lt;br /&gt;
= Fencing =&lt;br /&gt;
&lt;br /&gt;
Before proceeding with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file, you must understand what fencing is, how it is used in Red Hat/CentOS clusters and why it is so important.&lt;br /&gt;
&lt;br /&gt;
* The Cluster Admin&#039;s Mantra:&lt;br /&gt;
** &#039;&#039;&#039;The only thing you don&#039;t know is what you don&#039;t know&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Just because one node loses communication with another node, it &#039;&#039;&#039;cannot&#039;&#039;&#039; be assumed that the silent node is dead!&lt;br /&gt;
&lt;br /&gt;
== What is it? ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Fencing&amp;quot; is the act of isolating a malfunctioning node. The goal is to prevent a &#039;&#039;&#039;split-brain&#039;&#039;&#039; condition where two nodes think the other member is dead and continue to use a shared resource. When this happens, file system corruption is almost guaranteed. Another dangerous scenario: one node pauses while writing to a disk, the other node decides it is dead and starts to replay the journal, then the first node recovers and completes the write. The results would be equally disastrous. If you are lucky enough to not lose the shared file system, you will be faced with the task of determining what data got written to which node, merging that data and/or overwriting the node you trust the least. This &#039;best case&#039; is still pretty lousy.&lt;br /&gt;
&lt;br /&gt;
Fencing, isolating a node from altering shared disks, can be accomplished in a couple of ways:&lt;br /&gt;
&lt;br /&gt;
* Power&lt;br /&gt;
** Power fencing is where a device is used to cut the power to a malfunctioning node. This is probably the most common type.&lt;br /&gt;
* Blocking&lt;br /&gt;
** Blocking is often implemented at the network level. This type of fencing leaves the node alone, but disconnects it from the storage network. Often this is done by a switch which prevents traffic coming from the fenced node.&lt;br /&gt;
&lt;br /&gt;
With power fencing, the term used is &amp;quot;STONITH&amp;quot;, literally, &#039;&#039;&#039;S&#039;&#039;&#039;hoot &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;O&#039;&#039;&#039;ther &#039;&#039;&#039;N&#039;&#039;&#039;ode &#039;&#039;&#039;I&#039;&#039;&#039;n &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;H&#039;&#039;&#039;ead. Picture it like an old west duel. If one node is dead, the other node is going to win the duel by default and the dead node will just be shot again. When both nodes are alive, however, the faster node will win and will &amp;quot;kill&amp;quot; (power off or reset) the slower node before it has a chance to fire. Once this duel is over, the surviving node can then access the shared resource confident that it is the only one working on it.&lt;br /&gt;
&lt;br /&gt;
== Misconception ==&lt;br /&gt;
&lt;br /&gt;
It is a &#039;&#039;&#039;very&#039;&#039;&#039; common mistake to ignore fencing when first starting to learn about clustering. Often people think &#039;&#039;&amp;quot;It&#039;s just for production systems, I don&#039;t need to worry about it yet because I don&#039;t care what happens to my test cluster.&amp;quot;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Wrong!&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For the most practical reason: the cluster software will block all I/O transactions when it can&#039;t guarantee a fence operation succeeded. The result is that your cluster will essentially &amp;quot;lock up&amp;quot;. Likewise, [[cman]] and related daemons will fail if they can&#039;t find a fence agent to use.&lt;br /&gt;
&lt;br /&gt;
Secondly, testing our cluster will involve inducing errors. Without proper fencing, there is a high probability that our shared file system will be corrupted. That would force the need to start over, making your learning take a lot longer than it needs to.&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
How you implement fencing depends largely on what kind of fence device(s) you have. This is where I need help from you, dear reader. I only have access to [[Node Assassin]] fence devices on my lab&#039;s cluster. If you have other fence devices, like [[IPMI]], [[DRAC]] or so on, please let me know so that I can expand this section.&lt;br /&gt;
&lt;br /&gt;
=== Implementing Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
In Red Hat&#039;s cluster software, the fence device(s) are configured in the main &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf&amp;lt;/span&amp;gt; cluster configuration file. This configuration is then acted on via the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon. We&#039;ll cover the details of the [[cluster.conf]] file in a moment.&lt;br /&gt;
&lt;br /&gt;
When the cluster determines that a node needs to be fenced, the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon will consult the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file for information on how to access the fence device. Given this &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; snippet:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;xml&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;cluster name=&amp;quot;an-cluster&amp;quot; config_version=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
	&amp;lt;clusternodes&amp;gt;&lt;br /&gt;
		&amp;lt;clusternode name=&amp;quot;an-node02.alteeve.com&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
			&amp;lt;fence&amp;gt;&lt;br /&gt;
				&amp;lt;method name=&amp;quot;node_assassin&amp;quot;&amp;gt;&lt;br /&gt;
					&amp;lt;device name=&amp;quot;motoko&amp;quot; port=&amp;quot;02&amp;quot; action=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
				&amp;lt;/method&amp;gt;&lt;br /&gt;
			&amp;lt;/fence&amp;gt;&lt;br /&gt;
		&amp;lt;/clusternode&amp;gt;&lt;br /&gt;
	&amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
	&amp;lt;fencedevices&amp;gt;&lt;br /&gt;
		&amp;lt;fencedevice name=&amp;quot;motoko&amp;quot; agent=&amp;quot;fence_na&amp;quot; quiet=&amp;quot;true&amp;quot;&lt;br /&gt;
		ipaddr=&amp;quot;motoko.alteeve.com&amp;quot; login=&amp;quot;motoko&amp;quot; passwd=&amp;quot;secret&amp;quot;&amp;gt;&lt;br /&gt;
		&amp;lt;/fencedevice&amp;gt;&lt;br /&gt;
	&amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If the cluster manager determines that the node &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; needs to be fenced, it looks at the first (and only, in this case) &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;method&amp;gt;&amp;lt;/span&amp;gt; entry in that node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fence&amp;gt;&amp;lt;/span&amp;gt; section and reads its device &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;, which is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; in this case. It then looks in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fencedevices&amp;gt;&amp;lt;/span&amp;gt; section for the device with the matching &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;. From there, it gets the information needed to find and access the fence device. Once it connects to the fence device, it then passes along the options set in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;device&amp;gt;&amp;lt;/span&amp;gt; entry.&lt;br /&gt;
&lt;br /&gt;
So in this example, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; looks up the details on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; Node Assassin fence device. It calls the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt; program, called a fence agent, and passes the following arguments:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr=motoko.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login=motoko&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd=secret&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;quiet=true&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port=2&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action=off&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
How the fence agent acts on these arguments varies depending on the fence device itself. In general terms, the &#039;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt;&#039; fence agent will create a connection to the device at the IP address (or resolvable name, as in this case) specified in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr&amp;lt;/span&amp;gt; argument. Once connected, it will authenticate using the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd&amp;lt;/span&amp;gt; arguments. Once authenticated, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port&amp;lt;/span&amp;gt; to act on, which could be a power jack, a power or reset button, a network switch port and so on. Finally, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action&amp;lt;/span&amp;gt; to take.&lt;br /&gt;
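Mechanically, Red Hat's fence agents receive these arguments as key=value pairs, one per line, on the agent's standard input. The stub below illustrates only that plumbing; the script itself and its output message are invented for this sketch, and a real agent such as `fence_na` would log in to the device and act on it rather than echo.

```shell
# A stand-in "fence agent" that parses stdin the way a real agent would.
# Stub for illustration only; no device is contacted.
agent=$(mktemp)
cat > "$agent" <<'EOF'
#!/bin/sh
# fenced sends one key=value pair per line on standard input.
while IFS='=' read -r key value; do
	case "$key" in
		port)   port="$value"   ;;
		action) action="$value" ;;
	esac
done
echo "would perform '$action' on port '$port'"
EOF
chmod +x "$agent"

# fenced would assemble this list from cluster.conf; here we do it by hand.
printf 'ipaddr=motoko.alteeve.com\nlogin=motoko\npasswd=secret\nport=02\naction=off\n' | "$agent"
# → would perform 'off' on port '02'
```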
&lt;br /&gt;
Once the device completes, it returns a success or failure message. If the first attempt fails, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; will try the next &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;method&amp;gt;&amp;lt;/span&amp;gt;, if a second exists. It will keep trying fence devices in the order they are found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file until it runs out of devices. If it fails to fence the node, most daemons will &amp;quot;block&amp;quot;, that is, lock up and stop responding until the issue is resolved. The logic for this is that a locked up cluster is better than a corrupted one.&lt;br /&gt;
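That try-each-method-in-order behaviour can be sketched as a simple loop. The function names here are invented stand-ins for real fence agent calls, which return zero on success and non-zero on failure.

```shell
# Stubs standing in for two fence methods listed in cluster.conf.
primary_fence() { return 1; }   # e.g. a power fence that did not respond
backup_fence()  { return 0; }   # e.g. a second method that worked

# Try each method in file order and stop at the first success.
# If every method fails, the cluster blocks rather than risk corruption.
result="blocked: no fence method succeeded"
for method in primary_fence backup_fence; do
	if "$method"; then
		result="fenced via $method"
		break
	fi
done
echo "$result"    # → fenced via backup_fence
```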
&lt;br /&gt;
If any of the fence devices succeed though, the cluster will know that it is safe to proceed and will reconfigure the cluster without the defective node.&lt;br /&gt;
&lt;br /&gt;
== Fence Devices ==&lt;br /&gt;
&lt;br /&gt;
Many major [[OEM]]s have their own remote management devices that can serve as fence devices. Examples are [http://dell.ca Dell]&#039;s &#039;DRAC&#039; (Dell Remote Access Controller), [http://hp.ca HP]&#039;s &#039;iLO&#039; (Integrated Lights-Out), [http://ibm.ca IBM]&#039;s &#039;RSA&#039; (Remote Supervisor Adapter), [http://sun.ca Sun]&#039;s &#039;SSP&#039; (System Service Processor) and so on. Smaller manufacturers implement remote management via [[IPMI]], the Intelligent Platform Management Interface.&lt;br /&gt;
&lt;br /&gt;
In the above devices, fencing is implemented via a built-in or integrated device inside the server. These devices are usually accessible even when the host server is powered off or hard locked. Via these devices, the host server can be powered off, reset and powered on remotely, regardless of the state of the host server.&lt;br /&gt;
&lt;br /&gt;
Block fencing is possible when the device connecting a node to shared resources, like a fiber-channel SAN switch, provides a method of logically &amp;quot;unplugging&amp;quot; a defective node from the shared resource, leaving the node itself alone.&lt;br /&gt;
&lt;br /&gt;
=== IPMI ===&lt;br /&gt;
&lt;br /&gt;
[[IPMI]] is perhaps the most common method used for fencing nodes in a cluster. It requires having a system board with an IPMI baseboard management controller (BMC). IPMI is often the underlying technology used by many OEM-branded remote access tools. If you don&#039;t see a specific fence agent for your server&#039;s remote access application, experiment with generic IPMI tools.&lt;br /&gt;
&lt;br /&gt;
To start, we need to install the IPMI user software:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install ipmitool&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
A cheap alternative is the [[Node Assassin]], an open-hardware, open source fence device. It was built to allow the use of commodity system boards that lacked remote management support found on more expensive, server class hardware.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Full Disclosure&#039;&#039;&#039;: Node Assassin was created by me, with much help from others, for this paper.&lt;br /&gt;
&lt;br /&gt;
= Clean Up =&lt;br /&gt;
&lt;br /&gt;
Some daemons may have been installed or dragged in during the setup of the cluster. Most notable is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;heartbeat&amp;lt;/span&amp;gt;, which is not compatible with [[RHCS]]. The following command will disable various daemons from starting at boot.&lt;br /&gt;
&lt;br /&gt;
It is safe to run even when some of these daemons aren&#039;t installed or have already been removed. Of course, skip the ones you actually want to use. This assumes that the nodes have been built according to this HowTo.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig heartbeat off; chkconfig iscsid off; chkconfig iptables off; chkconfig ip6tables off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= So You Have a Cluster - Now What? =&lt;br /&gt;
&lt;br /&gt;
Building the cluster infrastructure was pretty easy, wasn&#039;t it?&lt;br /&gt;
&lt;br /&gt;
But what will you &#039;&#039;&#039;&#039;&#039;do&#039;&#039;&#039;&#039;&#039; with it?&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;width: 75%; text-align: center; padding: 10px; border-spacing: 15px;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!style=&amp;quot;width: 100%; border: 1px solid #dfdfdf;&amp;quot; colspan=&amp;quot;3&amp;quot;|Choose Your Own &amp;lt;span style=&amp;quot;text-decoration: line-through; color: #7f7f7f;&amp;quot;&amp;gt;Adventure&amp;lt;/span&amp;gt; Cluster&lt;br /&gt;
|-&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|style=&amp;quot;width: 34%; border: 1px solid #dfdfdf;&amp;quot;| [[Two Node Fedora 13 Cluster - Xen-Based Virtual Machine Host on DRBD+CLVM|Xen-Based Virtual Machine Host on DRBD+CLVM]]&amp;lt;br /&amp;gt;High Availability VM Host&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Thanks =&lt;br /&gt;
&lt;br /&gt;
* A &#039;&#039;&#039;huge&#039;&#039;&#039; thanks to [http://iplink.net Interlink Connectivity]! They hire me as a contractor and have allowed me to extend these docs while working on their clusters. Development of these How-Tos would be much slower if not for them. If you need hosting or colo services, drop them a line. Their website is a bit out of date though, so disregard the prices. :)&lt;br /&gt;
* To &#039;&#039;&#039;sdake&#039;&#039;&#039; of [http://corosync.org corosync] for helping me sort out the &#039;&#039;&#039;plock&#039;&#039;&#039; component and corosync in general.&lt;br /&gt;
* To &#039;&#039;&#039;Angus Salkeld&#039;&#039;&#039; for helping me nail down the Corosync and OpenAIS differences.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013922.html HJ Lee] from the OpenAIS list for helping me understand the mechanisms controlling the Redundant Ring Protocol&#039;s failure detection types.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013925.html Steven Dake] for clarifying the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;to_x&amp;lt;/span&amp;gt; vs. &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;logoutput: x&amp;lt;/span&amp;gt; arguments in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais.conf&amp;lt;/span&amp;gt;.&lt;br /&gt;
* To [http://dk.linkedin.com/in/fabbione Fabio Massimo Di Nitto] for helping me get caught up with clustering and VMs.&lt;br /&gt;
&lt;br /&gt;
{{footer}}&lt;/div&gt;</summary>
		<author><name>Facade</name></author>
	</entry>
	<entry>
		<id>https://alteeve.com/w/index.php?title=User_talk:Facade&amp;diff=2104</id>
		<title>User talk:Facade</title>
		<link rel="alternate" type="text/html" href="https://alteeve.com/w/index.php?title=User_talk:Facade&amp;diff=2104"/>
		<updated>2010-09-09T13:13:56Z</updated>

		<summary type="html">&lt;p&gt;Facade: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Thank you for all the spelling mistake fixes!&lt;br /&gt;
&lt;br /&gt;
--[[User:Digimer|Digimer]] 13:01, 9 September 2010 (UTC)&lt;br /&gt;
&lt;br /&gt;
You&#039;re welcome!  Ran out of time this morning, will continue tonight if time permits.&lt;br /&gt;
&lt;br /&gt;
--[[User:Facade|Facade]] 13:13, 9 September 2010 (UTC)&lt;/div&gt;</summary>
		<author><name>Facade</name></author>
	</entry>
	<entry>
		<id>https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2102</id>
		<title>Abandoned - Two Node Fedora 13 Cluster</title>
		<link rel="alternate" type="text/html" href="https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2102"/>
		<updated>2010-09-09T06:24:08Z</updated>

		<summary type="html">&lt;p&gt;Facade: /* OS Install */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{howto_header}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Notice&#039;&#039;&#039;&#039;&#039;, &#039;&#039;&#039;Sep. 08, 2010&#039;&#039;&#039; - &#039;&#039;Draft 1&#039;&#039;: This is the first draft of this HowTo. It has not been peer-reviewed yet, so please proceed with caution and on a test cluster. Any and all feedback is much appreciated.&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
&lt;br /&gt;
This paper has one goal:&lt;br /&gt;
&lt;br /&gt;
# How to assemble the simplest cluster possible, a &#039;&#039;&#039;2-Node Cluster&#039;&#039;&#039;, which you can then expand on for your own needs.&lt;br /&gt;
&lt;br /&gt;
With this completed, you can then jump into &amp;quot;step 2&amp;quot; papers that will show various uses of a two node cluster:&lt;br /&gt;
&lt;br /&gt;
# How to create a &amp;quot;floating&amp;quot; virtual machine that can move between the two nodes in the event of a node failure, maximizing up time.&lt;br /&gt;
# How to create simple resources that can move between nodes. Examples will be a simple PostgreSQL database, DHCP, DNS and web servers.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
&lt;br /&gt;
It is expected that you are already comfortable with the Linux command line, specifically &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[bash]]&amp;lt;/span&amp;gt;, and that you are familiar with general administrative tasks in Red Hat-based distributions, specifically [[Fedora]]. You will also need to be comfortable using editors like [[vim]], [[nano]] or similar. This paper uses &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;vim&amp;lt;/span&amp;gt; in examples. Simply substitute your favourite editor in its place.&lt;br /&gt;
&lt;br /&gt;
You are also expected to be comfortable with networking concepts. You will be expected to understand TCP/IP, [[multicast]], broadcast, subnets and netmasks, routing and other relatively basic networking concepts. Please take the time to become familiar with these concepts before proceeding.&lt;br /&gt;
&lt;br /&gt;
That said, where feasible, as much detail as possible will be provided. For example, all configuration file locations will be shown and functioning sample files will be provided.&lt;br /&gt;
&lt;br /&gt;
== Platform ==&lt;br /&gt;
&lt;br /&gt;
This paper will implement the [[Red Hat]] Cluster Suite using the Fedora v13 distribution. This paper uses the [[x86_64]] repositories; however, if you are on an [[i386]] (32-bit) system, you should be able to follow along fine. Simply replace &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;x86_64&amp;lt;/span&amp;gt; with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i386&amp;lt;/span&amp;gt; or &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i686&amp;lt;/span&amp;gt; in package names.&lt;br /&gt;
&lt;br /&gt;
You can either download the stock Fedora 13 DVD ISO (currently at version &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5.4&amp;lt;/span&amp;gt;, which is used in this document), or you can try out the alpha [[#AN!Cluster Install|AN!Cluster Install]] DVD (4.3GB ISO). If you use the latter, please test it out on a development or test cluster. If you have any problems with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;AN!Cluster&amp;lt;/span&amp;gt; variant Fedora distro, please [[Digimer|contact me]] and let me know what your trouble was.&lt;br /&gt;
&lt;br /&gt;
== Why Fedora 13? ==&lt;br /&gt;
&lt;br /&gt;
Generally speaking, I much prefer to use a server-oriented Linux distribution like [[CentOS]], [[Debian]] or similar. However, there have been many recent changes in the Linux-Clustering world that have made all of the currently available server-class distributions obsolete. With luck, once [[Red Hat]] Enterprise Linux and [[CentOS]] version 6 are released, this will change.&lt;br /&gt;
&lt;br /&gt;
Until then, [[Fedora]] version 13 provides the most up-to-date binary releases of the new clustering stack. For this reason, FC13 is the best choice for clustering if you want to be current. To mitigate some of the issues introduced by using a workstation distribution, many packages will be stripped out of the default install.&lt;br /&gt;
&lt;br /&gt;
== Focus ==&lt;br /&gt;
&lt;br /&gt;
Clusters can serve to solve three problems: &#039;&#039;&#039;Reliability&#039;&#039;&#039;, &#039;&#039;&#039;Performance&#039;&#039;&#039; and &#039;&#039;&#039;Scalability&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
The focus of the cluster described in this paper is primarily &#039;&#039;&#039;reliability&#039;&#039;&#039;. Second to this, &#039;&#039;&#039;scalability&#039;&#039;&#039; will be the priority, leaving &#039;&#039;&#039;performance&#039;&#039;&#039; to be addressed only when it does not impact the first two criteria. This is not to say that performance is not a valid priority; it simply isn&#039;t the priority of this paper.&lt;br /&gt;
&lt;br /&gt;
== Goal ==&lt;br /&gt;
&lt;br /&gt;
At the end of this paper, you should have a fully functioning two-node array capable of hosting &amp;quot;floating&amp;quot; resources; that is, resources that exist on one node and can be easily moved to the other node with minimal effort and downtime. This should leave you with a solid foundation for adding more virtual servers, up to the limit of your cluster&#039;s resources.&lt;br /&gt;
&lt;br /&gt;
This paper should also show how to build the foundation of any other cluster configuration. Its core focus is introducing the main issues that come with clustering, so that it can serve as a starting point for cluster configurations outside its own scope.&lt;br /&gt;
&lt;br /&gt;
= Begin = &lt;br /&gt;
&lt;br /&gt;
Let&#039;s begin!&lt;br /&gt;
&lt;br /&gt;
== Hardware ==&lt;br /&gt;
&lt;br /&gt;
We will need two physical servers each with the following hardware:&lt;br /&gt;
* One or more multi-core [[CPU]]s with Virtualization support.&lt;br /&gt;
* Three network cards; at least one should be gigabit or faster.&lt;br /&gt;
* One or more hard drives.&lt;br /&gt;
* You will need some form of a [[fence|fence device]]. This can be an [[IPMI]]-enabled server, a [http://nodeassassin.org Node Assassin], a fenceable [http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900&amp;amp;tab=features PDU] or similar.&lt;br /&gt;
&lt;br /&gt;
This paper uses the following hardware:&lt;br /&gt;
* ASUS [http://support.asus.com/search/search.aspx?keyword=m4a78l-m&amp;amp;SLanguage=en-us M4A78L-M]&lt;br /&gt;
* AMD Athlon II x2 250 &lt;br /&gt;
* 2GB Kingston DDR2 KVR800D2N6K2/4G (4GB kit split between the two nodes)&lt;br /&gt;
* 1x Intel 82540 PCI NIC&lt;br /&gt;
* 1x D-Link DGE-560T&lt;br /&gt;
* Node Assassin&lt;br /&gt;
&lt;br /&gt;
This is not an endorsement of the above hardware. I bought what was within my budget that would serve the purposes of creating this document. What you purchase shouldn&#039;t matter, so long as the minimum requirements are met.&lt;br /&gt;
&lt;br /&gt;
== Pre-Assembly ==&lt;br /&gt;
&lt;br /&gt;
With multiple NICs, it is quite likely that the mapping of physical devices to logical &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; devices may not be ideal. This is a particular issue if you decide to [[Setting Up a PXE Server in Fedora|network boot]] your install media. In that case, if the wrong NIC is chosen for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;, you will be presented with a list of MAC addresses to attempt setup with.&lt;br /&gt;
&lt;br /&gt;
Before you assemble your servers, record their network cards&#039; [[MAC]] addresses. I like to keep simple text files like these:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node01.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:EA	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:53	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:46:E4	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node02.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:D8	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:5A	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:45:78	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Feel free to record the information in any way that suits you the best.&lt;br /&gt;
&lt;br /&gt;
== OS Install ==&lt;br /&gt;
&lt;br /&gt;
Start with a stock Fedora 13 install. This How-To uses Fedora 13 x86_64; however, it should be fairly easy to adapt to other recent Fedora versions. This document also attempts to be easily ported to [[CentOS]]/[[RHEL]] version 6 once it is released. It will not adapt well to CentOS/RHEL version 5, though; much of the cluster stack has changed dramatically since that release.&lt;br /&gt;
&lt;br /&gt;
These are the sample kickstart scripts used by this paper. Be sure to set your own password string and network settings.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Warning&#039;&#039;&#039;&#039;&#039;! These kickstart scripts &#039;&#039;&#039;&#039;&#039;will erase your hard drive&#039;&#039;&#039;&#039;&#039;! Adapt them, don&#039;t blindly use them.&lt;br /&gt;
&lt;br /&gt;
Generic cluster node kickstart scripts.&lt;br /&gt;
* [[Fedora13 KS an-node01.ks|an-node01.ks]]&lt;br /&gt;
* [[Fedora13 KS an-node02.ks|an-node02.ks]]&lt;br /&gt;
&lt;br /&gt;
== AN!Cluster Install ==&lt;br /&gt;
&lt;br /&gt;
If you are feeling brave, below is a link to a custom install DVD that contains the kickstart scripts to setup nodes and an &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-cluster&amp;lt;/span&amp;gt; directory with all the configuration files.&lt;br /&gt;
&lt;br /&gt;
* Download the custom &#039;&#039;&#039;AN!Cluster v0.2.001&#039;&#039;&#039; Install DVD. (4.5[[GiB]] iso). (Currently disabled - Reworking for F13)&lt;br /&gt;
&lt;br /&gt;
== Post OS Install ==&lt;br /&gt;
&lt;br /&gt;
Once the OS is installed, we need to do a few things:&lt;br /&gt;
&lt;br /&gt;
# Disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;.&lt;br /&gt;
# Setup networking.&lt;br /&gt;
# Change the default run-level.&lt;br /&gt;
&lt;br /&gt;
=== Disable &#039;selinux&#039; ===&lt;br /&gt;
&lt;br /&gt;
To allow this paper to focus on clustering, we will disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;. Obviously, this introduces security issues that you may not be comfortable with. If you wish to leave &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt; enabled, it will be up to you to sort out issues that crop up.&lt;br /&gt;
&lt;br /&gt;
To disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/selinux/config&amp;lt;/span&amp;gt; and change &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=enforcing&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=permissive&amp;lt;/span&amp;gt;. You will need to reboot in order for the changes to take effect.&lt;br /&gt;
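&lt;br /&gt;
For convenience, the same change can be made non-interactively. This is a minimal sketch, assuming the stock &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/selinux/config&amp;lt;/span&amp;gt; layout:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Switch SELinux to permissive mode; takes effect on the next reboot.&lt;br /&gt;
sed -i &#039;s/^SELINUX=enforcing$/SELINUX=permissive/&#039; /etc/selinux/config&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;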
&lt;br /&gt;
=== Change the Default Run-Level ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. It only improves performance; it is not required.&lt;br /&gt;
&lt;br /&gt;
If you don&#039;t plan to work on your nodes directly, it makes sense to switch the default run level from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;3&amp;lt;/span&amp;gt;. This prevents the window manager, like Gnome or KDE, from starting at boot. This frees up a fair bit of memory and system resources and reduces the possible attack vectors.&lt;br /&gt;
&lt;br /&gt;
To do this, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/inittab&amp;lt;/span&amp;gt;, change the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:5:initdefault:&amp;lt;/span&amp;gt; line to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:3:initdefault:&amp;lt;/span&amp;gt; and then switch to run level 3:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/inittab&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
id:3:initdefault:&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
init 3&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Setup Networking ===&lt;br /&gt;
&lt;br /&gt;
We need to remove &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt;, enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt;, configure the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ifcfg-eth*&amp;lt;/span&amp;gt; files and then start the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; daemon.&lt;br /&gt;
&lt;br /&gt;
==== Network Layout ====&lt;br /&gt;
&lt;br /&gt;
This setup expects you to have three physical network cards connected to three independent networks. To have a common vernacular, let&#039;s use this table to describe them:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!class=&amp;quot;cell_all&amp;quot;|Network Description&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Short Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Device Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Suggested Subnet&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|NIC Properties&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Internet-Facing Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Remaining NIC should be used here.&amp;lt;br /&amp;gt;If using a [[Setting Up a PXE Server in Fedora|PXE server]], this should be a bootable NIC.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Storage Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth1&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Fastest NIC should be used here.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Back-Channel Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth2&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|NICs with [[IPMI]] piggy-back &#039;&#039;&#039;must&#039;&#039;&#039; be used here.&amp;lt;br /&amp;gt;Second-fastest NIC should be used here.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Take note of these concerns when planning which NIC to use on each subnet. These issues are presented in the order that they must be addressed in:&lt;br /&gt;
# If your nodes have [[IPMI]] piggy-backing on a normal NIC, that NIC &#039;&#039;&#039;&#039;&#039;must&#039;&#039;&#039;&#039;&#039; be used on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. Having your fence device accessible on a subnet that can be remotely accessed can pose a &#039;&#039;major&#039;&#039; security risk.&lt;br /&gt;
# The fastest NIC should be used for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt; subnet. Be sure to know which NICs support the largest jumbo frames when considering this.&lt;br /&gt;
# If you still have two NICs to choose from, use the fastest remaining NIC for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. This will minimize the time it takes to perform tasks like hot-migration of live virtual machines.&lt;br /&gt;
# The final NIC should be used for the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt; subnet.&lt;br /&gt;
&lt;br /&gt;
==== Node IP Addresses ====&lt;br /&gt;
&lt;br /&gt;
Obviously, the IP addresses you give to your nodes should be ones that suit you best. In this example, the following IP addresses are used:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;amp;nbsp;&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Internet-Facing Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Storage Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Back-Channel Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node01&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node02&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==== Remove NetworkManager ====&lt;br /&gt;
&lt;br /&gt;
Some cluster software &#039;&#039;&#039;will not&#039;&#039;&#039; start with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; installed. This is because &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is designed to be a highly-adaptive network system that can accommodate frequent changes in the network. To simplify these transitions for end users, it does a lot of network reconfiguration behind the scenes.&lt;br /&gt;
&lt;br /&gt;
For workstations, this is wonderful. For clustering, this can be disastrous. Transient network outages are already a risk to a cluster&#039;s stability!&lt;br /&gt;
&lt;br /&gt;
So first up, make sure that &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is completely removed from your system. If you used the [[#OS Install|kickstart]] scripts, then it was not installed. Otherwise, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum remove NetworkManager NetworkManager-gnome NetworkManager-openconnect NetworkManager-openvpn NetworkManager-pptp NetworkManager-vpnc cnetworkmanager knetworkmanager knetworkmanager-openvpn knetworkmanager-pptp knetworkmanager-vpnc libproxy-networkmanager yum-NetworkManager-dispatcher&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup &#039;network&#039; ====&lt;br /&gt;
&lt;br /&gt;
Before proceeding with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; configuration, check to see if your network cards are aligned to the proper &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; network names. If they need to be adjusted, please follow this How-To before proceeding:&lt;br /&gt;
&lt;br /&gt;
* [[Changing the ethX to Ethernet Device Mapping in Fedora]]&lt;br /&gt;
&lt;br /&gt;
There are a few ways to configure &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; in Fedora:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network&amp;lt;/span&amp;gt; (graphical)&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network-tui&amp;lt;/span&amp;gt; (ncurses)&lt;br /&gt;
* Directly editing the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/sysconfig/network-scripts/ifcfg-eth*&amp;lt;/span&amp;gt; files. (See: [http://docs.fedoraproject.org/en-US/Fedora/12/html/Deployment_Guide/s1-networkscripts-interfaces.html here] for a full list of options)&lt;br /&gt;
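&lt;br /&gt;
For reference, a minimal static configuration for an-node01&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt; interface might look like this sketch. The &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;HWADDR&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;GATEWAY&amp;lt;/span&amp;gt; values are examples only; substitute your own:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
# /etc/sysconfig/network-scripts/ifcfg-eth0 (example only)&lt;br /&gt;
DEVICE=eth0&lt;br /&gt;
HWADDR=90:E6:BA:71:82:EA&lt;br /&gt;
BOOTPROTO=static&lt;br /&gt;
IPADDR=192.168.1.71&lt;br /&gt;
NETMASK=255.255.255.0&lt;br /&gt;
GATEWAY=192.168.1.1&lt;br /&gt;
ONBOOT=yes&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;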
&lt;br /&gt;
Do not proceed until your node&#039;s networking is fully configured.&lt;br /&gt;
&lt;br /&gt;
==== Update the Hosts File ====&lt;br /&gt;
&lt;br /&gt;
Some applications expect to be able to call nodes by their name. To accommodate this, and to ensure that inter-node communication takes place on the back-channel subnet, we remove any existing hostname entries and then add the following to the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Any pre-existing entries matching the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; must be removed from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt;. There is a good chance there will be an entry that resolves to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;127.0.0.1&amp;lt;/span&amp;gt; which would cause problems later.&lt;br /&gt;
&lt;br /&gt;
Obviously, adapt the names and IPs to match your nodes and subnets. The only critical thing is to make sure that the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; is resolvable to the back-channel subnet. I like to add a short-form name for convenience.&lt;br /&gt;
&lt;br /&gt;
The updated &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file should look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/hosts&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4&lt;br /&gt;
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6&lt;br /&gt;
&lt;br /&gt;
# Back-channel IPs to name mapping.&lt;br /&gt;
10.0.1.71	an-node01 an-node01.alteeve.com&lt;br /&gt;
10.0.1.72	an-node02 an-node02.alteeve.com&lt;br /&gt;
&lt;br /&gt;
# Storage network&lt;br /&gt;
10.0.0.71	an-node01.sn&lt;br /&gt;
10.0.0.72	an-node02.sn&lt;br /&gt;
&lt;br /&gt;
# Internet facing&lt;br /&gt;
192.168.1.71	an-node01.inet&lt;br /&gt;
192.168.1.72	an-node02.inet&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To test this, ping both nodes by name and make sure the ping packets are sent on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt; subnet:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node01&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node01 (10.0.1.71) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=1 ttl=64 time=0.049 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=2 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=3 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=4 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=5 ttl=64 time=0.055 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node02 (10.0.1.72) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=1 ttl=64 time=0.221 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=2 ttl=64 time=0.188 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=3 ttl=64 time=0.217 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=4 ttl=64 time=0.192 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=5 ttl=64 time=0.163 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Disable Firewalls ====&lt;br /&gt;
&lt;br /&gt;
Be sure to flush netfilter tables and disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ip6tables&amp;lt;/span&amp;gt; from starting on our nodes.&lt;br /&gt;
&lt;br /&gt;
There will be enough potential sources of problems as it is. Disabling firewalls at this stage will minimize the chance of an errant &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; rule messing up our configuration. If, before launch, you wish to implement a firewall, feel free to do so, but be sure to thoroughly test your cluster to ensure no problems were introduced.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig --level 2345 iptables off&lt;br /&gt;
/etc/init.d/iptables stop&lt;br /&gt;
chkconfig --level 2345 ip6tables off&lt;br /&gt;
/etc/init.d/ip6tables stop&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup SSH Shared Keys ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. Setting up shared keys will allow your nodes to pass files between one another and execute commands remotely without needing to enter a password. This is obviously somewhat risky from a security point of view. As such, it is up to you whether you do this or not. This is not meant to be a security-focused How-To, so please independently study the risks.&lt;br /&gt;
&lt;br /&gt;
If you&#039;re a little new to [[SSH]], it can be a bit confusing keeping connections straight in your head. When you connect to a remote machine, you start the connection on your machine as the user you are logged in as. This is the source user. When you call the remote machine, you tell the machine what user you want to log in as. This is the remote user.&lt;br /&gt;
&lt;br /&gt;
You will need to create an SSH key for each source user, and then you will need to copy the newly generated public key to each remote machine&#039;s user directory that you want to connect to. In this example, we want to connect to either node, from either node, as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. So we will create a key for each node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user and then copy the generated public key to the &#039;&#039;other&#039;&#039; node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user&#039;s directory.&lt;br /&gt;
&lt;br /&gt;
For each user, on each machine you want to connect &#039;&#039;&#039;from&#039;&#039;&#039;, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh-keygen -t dsa -N &amp;quot;&amp;quot; -f ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
Generating public/private dsa key pair.&lt;br /&gt;
Your identification has been saved in /root/.ssh/id_dsa.&lt;br /&gt;
Your public key has been saved in /root/.ssh/id_dsa.pub.&lt;br /&gt;
The key fingerprint is:&lt;br /&gt;
c4:cc:10:0a:6b:60:24:e7:57:94:69:e5:10:b2:cb:1f root@an-node01.alteeve.com&lt;br /&gt;
The key&#039;s randomart image is:&lt;br /&gt;
+--[ DSA 1024]----+&lt;br /&gt;
|+oo ..B*.        |&lt;br /&gt;
|o+ o =+B         |&lt;br /&gt;
|  + +.  *        |&lt;br /&gt;
| . o . .         |&lt;br /&gt;
|    o E S        |&lt;br /&gt;
|     . .         |&lt;br /&gt;
|      .          |&lt;br /&gt;
|                 |&lt;br /&gt;
|                 |&lt;br /&gt;
+-----------------+&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will create two files: the private key, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa&amp;lt;/span&amp;gt;, and the public key, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa.pub&amp;lt;/span&amp;gt;. The private key &#039;&#039;&#039;&#039;&#039;must never&#039;&#039;&#039;&#039;&#039; be group or world readable! That is, it should be set to mode &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;0600&amp;lt;/span&amp;gt;.&lt;br /&gt;
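&lt;br /&gt;
If the mode needs correcting, it can be set directly (assuming the default key location):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Restrict the private key to the owner only.&lt;br /&gt;
chmod 0600 ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;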
&lt;br /&gt;
The two files should look like:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Private key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
-----BEGIN DSA PRIVATE KEY-----&lt;br /&gt;
MIIBugIBAAKBgQCVo5nzeC/W8HWaFkD4pAmWO7ovf7S01f8HgocKKOYX/nMNxRui&lt;br /&gt;
wExBUUUt1b2TxYYQklTd9dn8UAzZ3UohN0CJNDrp//t2wfyACKKc2q+ewao5/svQ&lt;br /&gt;
xXTBu2ZPebKPcr1w9p0hq0xSr77Rg3v6Auc6AnC79DM9XkhYkVgNfgrvMwIVAKdY&lt;br /&gt;
SgTycvqNEWsJrCTTn5485yWvAoGANQ3rzIxFy2zHWyMlrpA9r5jllgxfaJVx3/Iw&lt;br /&gt;
DrEF82lr2dOmG8oYiKAhfvdgNyV7VeM2yxdjcdLPJAHHCx5BZZ7V9KjMzY26wS2r&lt;br /&gt;
d58TZJoWmO0nscmcwIbCv2at3fiMpxgr+nzD4tKxFOdWxYBwdmUpIjtJoW8wzY2H&lt;br /&gt;
uNghjL0CgYA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDN&lt;br /&gt;
bbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EE&lt;br /&gt;
D3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5AIUFX97yIig&lt;br /&gt;
n2SUltN+10NWUy4oYsA=&lt;br /&gt;
-----END DSA PRIVATE KEY-----&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Public key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa.pub&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Copy the public key and then &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; normally into the remote machine as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create a file called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; and paste in the key.&lt;br /&gt;
&lt;br /&gt;
From &#039;&#039;&#039;an-node01&#039;&#039;&#039;, type:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh root@an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
The authenticity of host &#039;an-node02 (10.0.1.72)&#039; can&#039;t be established.&lt;br /&gt;
RSA key fingerprint is d4:1b:68:5f:fa:ef:0f:0c:16:e7:f9:0a:8d:69:3e:c5.&lt;br /&gt;
Are you sure you want to continue connecting (yes/no)? yes&lt;br /&gt;
Warning: Permanently added &#039;an-node02,10.0.1.72&#039; (RSA) to the list of known hosts.&lt;br /&gt;
root@an-node02&#039;s password: &lt;br /&gt;
Last login: Tue Jul  6 22:28:19 2010 from 192.168.1.102&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will now be logged into &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02&amp;lt;/span&amp;gt; as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; file and paste into it the public key from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01&amp;lt;/span&amp;gt;. If the remote machine&#039;s user hasn&#039;t used &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; yet, their &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh&amp;lt;/span&amp;gt; directory will not exist and will need to be created first.&lt;br /&gt;
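&lt;br /&gt;
If the directory is missing, it can be created along with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;authorized_keys&amp;lt;/span&amp;gt; file, using the strict permissions &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;sshd&amp;lt;/span&amp;gt; expects before it will honour key authentication. A minimal sketch:&lt;br /&gt;
&lt;br /&gt;
```shell
# Create the .ssh directory and the authorized_keys file with
# the permissions sshd expects for key authentication.
mkdir -p ~/.ssh
chmod 700 ~/.ssh
touch ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```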
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/authorized_keys&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now log out and then log back into the remote machine. This time, the connection should succeed without prompting for a password!&lt;br /&gt;
&lt;br /&gt;
= Initial Cluster Setup =&lt;br /&gt;
&lt;br /&gt;
Before we get into specifics, let&#039;s take a minute to talk about the major components used in our cluster.&lt;br /&gt;
&lt;br /&gt;
== Core Program Overviews ==&lt;br /&gt;
&lt;br /&gt;
These are the core programs, possibly new to you, that we will use to build our cluster. Before we configure them, let&#039;s take a minute to understand their roles.&lt;br /&gt;
&lt;br /&gt;
=== Cluster Manager ===&lt;br /&gt;
&lt;br /&gt;
The cluster manager, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, reads the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; configuration file and uses it to configure and start the various cluster elements. This is where nodes are defined, fence devices are configured and various cluster tolerances are set.&lt;br /&gt;
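&lt;br /&gt;
The broad shape of the file, with the details left out, looks like this (a sketch only; a full, working example is in the linked article):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;xml&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot;?&amp;gt;&lt;br /&gt;
&amp;lt;cluster name=&amp;quot;an-cluster&amp;quot; config_version=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
	&amp;lt;clusternodes&amp;gt;&lt;br /&gt;
		&amp;lt;!-- One &amp;lt;clusternode&amp;gt; entry per node, with its fence methods --&amp;gt;&lt;br /&gt;
	&amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
	&amp;lt;fencedevices&amp;gt;&lt;br /&gt;
		&amp;lt;!-- One &amp;lt;fencedevice&amp;gt; entry per fence device --&amp;gt;&lt;br /&gt;
	&amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;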
&lt;br /&gt;
=== Corosync ===&lt;br /&gt;
&lt;br /&gt;
[http://www.corosync.org Corosync] is, essentially, the clustering kernel which provides the core services for cluster-aware applications and other cluster services. At its core is the [[totem]] protocol, which provides cluster membership and ordered messaging. On top of this, it implements a number of core clustering services like [[#cpg; Closed Process Group|closed process group messaging]], [[quorum]] and [[confdb]] (an object database).&lt;br /&gt;
&lt;br /&gt;
Its goal is to provide a substantially simpler and more flexible set of APIs to facilitate clustering in Linux. It manages which nodes are in the cluster, triggers error messages when something fails, manages cluster locking and so on. Most other clustered applications rely on corosync to know when something has happened or to announce when a cluster-related action has taken place.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;corosync_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
==== cpg; Closed Process Group ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install corosynclib-devel&lt;br /&gt;
man cpg_overview&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The closed process group is a small process layer on top of the totem protocol provided by corosync. It handles the sending and delivery of messages amongst nodes in a consistent order. It adds [[PID]]s and group names to the membership layer. Only members of the group get the messages, thus it is a &amp;quot;closed&amp;quot; group.&lt;br /&gt;
&lt;br /&gt;
=== OpenAIS ===&lt;br /&gt;
&lt;br /&gt;
OpenAIS is now an extension to Corosync that adds an open-source implementation of the [http://www.saforum.org/ Service Availability (SA) Forum&#039;s] &#039;Application Interface Specification&#039; ([[AIS]]). It is an [[API]] and policy designed to be used by applications concerned with maintaining services during faults. AIS implements the &#039;Availability Management Framework&#039; ([[AMF]]) which, in turn, provides for application fail over, cluster management ([[CLM]]), Checkpointing ([[CKPT]]), Eventing ([[EVT]]), Messaging ([[MSG]]), and Distributed Locking ([[DLOCK]]).&lt;br /&gt;
&lt;br /&gt;
It does this by implementing pluggable shared libraries that get loaded into corosync at runtime. These libraries then can make use of corosync&#039;s internal API. Both corosync and openais services provide [[IPC]] interfaces so that application programmers can make use of their behaviour.&lt;br /&gt;
&lt;br /&gt;
In short: applications can use OpenAIS to be cluster-aware. Its libraries are used by some applications, including Pacemaker. In our cluster, we will only be using its libraries indirectly.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
[http://www.clusterlabs.org Pacemaker] is the cluster resource manager. It can be used to trigger scripts to control non cluster-aware applications. In this way, these non cluster-aware applications can be made more highly available. It is considered a replacement for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[rgmanager]]&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Setup Core Programs ==&lt;br /&gt;
&lt;br /&gt;
We now need to edit and create the configuration files for our cluster.&lt;br /&gt;
&lt;br /&gt;
=== Setup cman ===&lt;br /&gt;
&lt;br /&gt;
For many clusters, this may be the only section you need to configure.&lt;br /&gt;
&lt;br /&gt;
First, install the cluster manager:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install cman&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once installed, we need to create and fill in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; file. This is the core configuration file of our cluster and is an [[XML]]-formatted file.&lt;br /&gt;
&lt;br /&gt;
Please see: [[Two-Node Fedora 13 cluster.conf]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The example in that article details many of the common options you will want to use. For a more complete discussion of this file, please run &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man 5 cluster.conf&amp;lt;/span&amp;gt;. Also, please be aware that some arguments can overlap between [[#Setup Corosync|corosync]] and this file.&lt;br /&gt;
&lt;br /&gt;
Once it is set up, we need to validate it. When you&#039;ve finished editing, be sure to run this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
xmllint --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are errors, fix them. Once the file is formatted properly, the contents of your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file will be returned, followed by &amp;quot;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf validates&amp;lt;/span&amp;gt;&amp;quot;. Once this is the case, you can move on.&lt;br /&gt;
&lt;br /&gt;
A note: normally at this stage, you&#039;d use &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;chkconfig&amp;lt;/span&amp;gt; to enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; to start at boot. However, depending on how you use your cluster, you may need to modify the boot order. For now, let&#039;s leave it off until we&#039;ve got all the components enabled. This is particularly important if you will be using [[DRBD]], [[CLVM]] or [[Xen]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig cman off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To see whether it&#039;s working, have a terminal open to both nodes and manually start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;. I suggest having two terminal windows open on each node. In one, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;tail&amp;lt;/span&amp;gt;ing &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/var/log/messages&amp;lt;/span&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
clear; tail -f -n 0 /var/log/messages&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now, with both log files being watched, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; on both nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: You MUST execute the following on both nodes within the time limit set by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;post_join_delay&amp;lt;/span&amp;gt; (the default is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;6&amp;lt;/span&amp;gt; seconds). If you wait longer than this, the first node started will [[fence]] the other node, assuming you have fencing configured properly. If fencing is &#039;&#039;&#039;not&#039;&#039;&#039; configured properly, your first node will hang.&lt;br /&gt;
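&lt;br /&gt;
If you need more time while testing, this delay can be raised via the fence daemon&#039;s settings in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt;. For example, the following element, placed directly inside the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;cluster&amp;gt;&amp;lt;/span&amp;gt; element, raises it to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;60&amp;lt;/span&amp;gt; seconds (a sketch; tune the value to taste):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;xml&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;fence_daemon post_join_delay=&amp;quot;60&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;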
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
/etc/init.d/cman start&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Examine the log files to verify there were no errors. If there were, fix them. If not, fantastic!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Warning&#039;&#039;&#039;: This last step will forcefully reboot your nodes. Do not do this on a production cluster! If it must be done on a production server, stop all non-essential services and be sure you have a good backup, &#039;&#039;just in case&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Lastly, test that your fencing is configured and working properly. We do this by using a program called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_node&amp;lt;/span&amp;gt;. With both nodes up and running &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, call a fence against the other node. Be sure to be watching the log files.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;From&#039;&#039;&#039; &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
fence_node an-node02.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
fence an-node02.alteeve.com success&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If it worked, when the node comes back up, repeat the process from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; against &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Setup Corosync ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: In our cluster, we will not need to configure Corosync.&lt;br /&gt;
&lt;br /&gt;
When using &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, as we are here, Corosync&#039;s configuration file is ignored.&lt;br /&gt;
&lt;br /&gt;
Most of the settings available in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 corosync.conf|corosync.conf]]&amp;lt;/span&amp;gt; can be found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|cluster.conf]]&amp;lt;/span&amp;gt; file. Please see the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; article for a comprehensive list of what is and is not supported.&lt;br /&gt;
&lt;br /&gt;
If you want to configure corosync and not use cman, please see the following article:&lt;br /&gt;
&lt;br /&gt;
* [[Two-Node Fedora 13 corosync.conf]]&lt;br /&gt;
&lt;br /&gt;
=== Setup Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
To do.&lt;br /&gt;
&lt;br /&gt;
= Fencing =&lt;br /&gt;
&lt;br /&gt;
Before proceeding with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file, you must understand what fencing is, how it is used in Red Hat/CentOS clusters and why it is so important.&lt;br /&gt;
&lt;br /&gt;
* The Cluster Admin&#039;s Mantra:&lt;br /&gt;
** &#039;&#039;&#039;The only thing you don&#039;t know is what you don&#039;t know&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Just because one node loses communication with another node, it &#039;&#039;&#039;cannot&#039;&#039;&#039; be assumed that the silent node is dead!&lt;br /&gt;
&lt;br /&gt;
== What is it? ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Fencing&amp;quot; is the act of isolating a malfunctioning node. The goal is to prevent a &#039;&#039;&#039;split-brain&#039;&#039;&#039; condition, where two nodes each think the other is dead and continue to use a shared resource. When this happens, file system corruption is almost guaranteed. Another dangerous scenario: one node pauses while writing to a disk, the other node decides it is dead and starts to replay the journal, then the first node recovers and completes the write. The results would be equally disastrous. If you are lucky enough not to lose the shared file system, you will be faced with the task of determining what data got written to which node, merging that data and/or overwriting the node you trust the least. This &#039;best case&#039; is still pretty lousy.&lt;br /&gt;
&lt;br /&gt;
Fencing, isolating a node from altering shared disks, can be accomplished in a couple of ways:&lt;br /&gt;
&lt;br /&gt;
* Power&lt;br /&gt;
** Power fencing is where a device is used to cut the power to a malfunctioning node. This is probably the most common type.&lt;br /&gt;
* Blocking&lt;br /&gt;
** Blocking is often implemented at the network level. This type of fencing leaves the node alone, but disconnects it from the storage network. Often this is done by a switch which prevents traffic coming from the fenced node.&lt;br /&gt;
&lt;br /&gt;
With power fencing, the term used is &amp;quot;STONITH&amp;quot;, literally, &#039;&#039;&#039;S&#039;&#039;&#039;hoot &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;O&#039;&#039;&#039;ther &#039;&#039;&#039;N&#039;&#039;&#039;ode &#039;&#039;&#039;I&#039;&#039;&#039;n &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;H&#039;&#039;&#039;ead. Picture it like an old west duel. If one node is dead, the other node is going to win the duel by default and the dead node will just be shot again. When both nodes are alive, however, the faster node will win and will &amp;quot;kill&amp;quot; (power off or reset) the slower node before it has a chance to fire. Once this duel is over, the surviving node can then access the shared resource, confident that it is the only one working on it.&lt;br /&gt;
&lt;br /&gt;
== Misconception ==&lt;br /&gt;
&lt;br /&gt;
It is a &#039;&#039;&#039;very&#039;&#039;&#039; common mistake to ignore fencing when first starting to learn about clustering. Often people think &#039;&#039;&amp;quot;It&#039;s just for production systems, I don&#039;t need to worry about it yet because I don&#039;t care what happens to my test cluster.&amp;quot;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Wrong!&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For the most practical of reasons: the cluster software will block all I/O transactions when it can&#039;t guarantee that a fence operation succeeded. The result is that your cluster will essentially &amp;quot;lock up&amp;quot;. Likewise, [[cman]] and related daemons will fail if they can&#039;t find a fence agent to use.&lt;br /&gt;
&lt;br /&gt;
Secondly: testing our cluster will involve inducing errors. Without proper fencing, there is a high probability that our shared file system will be corrupted. That would force us to start over, making your learning take a lot longer than it needs to.&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
How you implement fencing depends largely on what kind of fence device(s) you have. This is where I need help from you, dear reader. I only have access to [[Node Assassin]] fence devices on my lab&#039;s cluster. If you have other fence devices, like [[IPMI]], [[DRAC]] or so on, please let me know so that I can expand this section.&lt;br /&gt;
&lt;br /&gt;
=== Implementing Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In Red Hat&#039;s cluster software, the fence device(s) are configured in the main &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf&amp;lt;/span&amp;gt; cluster configuration file. This configuration is then acted on by the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon. We&#039;ll cover the details of the [[cluster.conf]] file in a moment.&lt;br /&gt;
&lt;br /&gt;
When the cluster determines that a node needs to be fenced, the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon will consult the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file for information on how to access the fence device. Given this &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; snippet:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;xml&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;cluster name=&amp;quot;an-cluster&amp;quot; config_version=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
	&amp;lt;clusternodes&amp;gt;&lt;br /&gt;
		&amp;lt;clusternode name=&amp;quot;an-node02.alteeve.com&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
			&amp;lt;fence&amp;gt;&lt;br /&gt;
				&amp;lt;method name=&amp;quot;node_assassin&amp;quot;&amp;gt;&lt;br /&gt;
					&amp;lt;device name=&amp;quot;motoko&amp;quot; port=&amp;quot;02&amp;quot; action=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
				&amp;lt;/method&amp;gt;&lt;br /&gt;
			&amp;lt;/fence&amp;gt;&lt;br /&gt;
		&amp;lt;/clusternode&amp;gt;&lt;br /&gt;
	&amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
	&amp;lt;fencedevices&amp;gt;&lt;br /&gt;
		&amp;lt;fencedevice name=&amp;quot;motoko&amp;quot; agent=&amp;quot;fence_na&amp;quot; quiet=&amp;quot;true&amp;quot;&lt;br /&gt;
		ipaddr=&amp;quot;motoko.alteeve.com&amp;quot; login=&amp;quot;motoko&amp;quot; passwd=&amp;quot;secret&amp;quot;&amp;gt;&lt;br /&gt;
		&amp;lt;/fencedevice&amp;gt;&lt;br /&gt;
	&amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If the cluster manager determines that the node &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; needs to be fenced, it looks at the first (and only, in this case) &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;method&amp;gt;&amp;lt;/span&amp;gt; entry under that node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fence&amp;gt;&amp;lt;/span&amp;gt; element and reads the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt; of its &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;device&amp;gt;&amp;lt;/span&amp;gt;, which is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; in this case. It then looks in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fencedevices&amp;gt;&amp;lt;/span&amp;gt; section for the device with the matching &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;. From there, it gets the information needed to find and access the fence device. Once it connects to the fence device, it passes along the options set on &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;device&amp;gt;&amp;lt;/span&amp;gt; entry.&lt;br /&gt;
&lt;br /&gt;
So in this example, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; looks up the details on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; Node Assassin fence device. It calls the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt; program, called a fence agent, and passes the following arguments:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr=motoko.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login=motoko&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd=secret&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;quiet=true&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port=2&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action=off&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
How the fence agent acts on these arguments varies depending on the fence device itself. In general terms, the &#039;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt;&#039; fence agent will create a connection to the device at the IP address (or resolvable name, as in this case) specified in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr&amp;lt;/span&amp;gt; argument. Once connected, it will authenticate using the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd&amp;lt;/span&amp;gt; arguments. Once authenticated, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port&amp;lt;/span&amp;gt; to act on, which could be a power jack, a power or reset button, a network switch port and so on. Finally, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action&amp;lt;/span&amp;gt; to take.&lt;br /&gt;
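&lt;br /&gt;
Agents written to Red Hat&#039;s fence agent API receive these arguments as &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name=value&amp;lt;/span&amp;gt; pairs on their standard input, one per line, so a fence call can be reproduced by hand. A minimal sketch, reusing the example values from the snippet above:&lt;br /&gt;
&lt;br /&gt;
```shell
# Build the same name=value argument list that fenced would
# write to the fence agent on its standard input.
args="ipaddr=motoko.alteeve.com
login=motoko
passwd=secret
quiet=true
port=2
action=off"
printf '%s\n' "$args"
# To actually trigger the fence by hand, pipe the list into the agent:
#   printf '%s\n' "$args" | fence_na
```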
&lt;br /&gt;
Once the device completes its action, it returns a success or failure message. If the first attempt fails, the next &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fence&amp;gt;&amp;lt;/span&amp;gt; method will be tried, if a second exists. Fence devices will be tried in the order they are found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file until the cluster runs out of devices. If it fails to fence the node, most daemons will &amp;quot;block&amp;quot;; that is, lock up and stop responding until the issue is resolved. The logic for this is that a locked-up cluster is better than a corrupted one.&lt;br /&gt;
&lt;br /&gt;
If any of the fence devices succeed though, the cluster will know that it is safe to proceed and will reconfigure the cluster without the defective node.&lt;br /&gt;
&lt;br /&gt;
== Fence Devices ==&lt;br /&gt;
&lt;br /&gt;
Many major [[OEM]]s have their own remote management devices that can serve as fence devices. Examples are [http://dell.ca Dell]&#039;s &#039;DRAC&#039; (Dell Remote Access Controller), [http://hp.ca HP]&#039;s iLO (Integrated Lights-Out), [http://ibm.ca IBM]&#039;s &#039;RSA&#039; (Remote Supervisor Adapter), [http://sun.ca Sun]&#039;s &#039;SSP&#039; (System Service Processor) and so on. Smaller manufacturers implement remote management via [[IPMI]], the Intelligent Platform Management Interface.&lt;br /&gt;
&lt;br /&gt;
In the above devices, fencing is implemented via a built-in or integrated device inside the server. These devices are usually accessible even when the host server is powered off or hard locked. Via these devices, the host server can be powered off, reset and powered on remotely, regardless of the state of the host server.&lt;br /&gt;
&lt;br /&gt;
Block fencing is possible when the device connecting a node to shared resources, like a fiber-channel SAN switch, provides a method of logically &amp;quot;unplugging&amp;quot; a defective node from the shared resource, leaving the node itself alone.&lt;br /&gt;
&lt;br /&gt;
=== IPMI ===&lt;br /&gt;
&lt;br /&gt;
[[IPMI]] is perhaps the most common method used for fencing nodes in a cluster. It requires having a system board with an IPMI baseboard management controller (BMC). IPMI is often the underlying technology used by many OEM-branded remote access tools. If you don&#039;t see a specific fence agent for your server&#039;s remote access application, experiment with generic IPMI tools.&lt;br /&gt;
&lt;br /&gt;
To start, we need to install the IPMI user software:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install OpenIPMI ipmitool&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
A cheap alternative is the [[Node Assassin]], an open-hardware, open-source fence device. It was built to allow the use of commodity system boards that lack the remote management support found on more expensive, server-class hardware.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Full Disclosure&#039;&#039;&#039;: Node Assassin was created by me, with much help from others, for this paper.&lt;br /&gt;
&lt;br /&gt;
= Clean Up =&lt;br /&gt;
&lt;br /&gt;
Some daemons may have been installed or dragged in during the setup of the cluster. Most notable is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;heartbeat&amp;lt;/span&amp;gt; which is not compatible with [[RHCS]]. The following command will disable various daemons from starting at boot. &lt;br /&gt;
&lt;br /&gt;
The command is safe to run even when some of these daemons aren&#039;t installed or have already been removed. Of course, skip the ones you actually want to use. This assumes that the nodes have been built according to this HowTo.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig heartbeat off; chkconfig iscsid off; chkconfig iptables off; chkconfig ip6tables off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= So You Have a Cluster - Now What? =&lt;br /&gt;
&lt;br /&gt;
Building the cluster infrastructure was pretty easy, wasn&#039;t it?&lt;br /&gt;
&lt;br /&gt;
But what will you &#039;&#039;&#039;&#039;&#039;do&#039;&#039;&#039;&#039;&#039; with it?&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;width: 75%; text-align: center; padding: 10px; border-spacing: 15px;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!style=&amp;quot;width: 100%; border: 1px solid #dfdfdf;&amp;quot; colspan=&amp;quot;3&amp;quot;|Choose Your Own &amp;lt;span style=&amp;quot;text-decoration: line-through; color: #7f7f7f;&amp;quot;&amp;gt;Adventure&amp;lt;/span&amp;gt; Cluster&lt;br /&gt;
|-&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|style=&amp;quot;width: 34%; border: 1px solid #dfdfdf;&amp;quot;| [[Two Node Fedora 13 Cluster - Xen-Based Virtual Machine Host on DRBD+CLVM|Xen-Based Virtual Machine Host on DRBD+CLVM]]&amp;lt;br /&amp;gt;High Availability VM Host&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Thanks =&lt;br /&gt;
&lt;br /&gt;
* A &#039;&#039;&#039;huge&#039;&#039;&#039; thanks to [http://iplink.net Interlink Connectivity]! They hire me as a contractor and have allowed me to extend these docs while working on their clusters. Development of these How-Tos would be much slower if not for them. If you need hosting or colo services, drop them a line. Their website is a bit out of date though, so disregard the prices. :)&lt;br /&gt;
* To &#039;&#039;&#039;sdake&#039;&#039;&#039; of [http://corosync.org corosync] for helping me sort out the &#039;&#039;&#039;plock&#039;&#039;&#039; component and corosync in general.&lt;br /&gt;
* To &#039;&#039;&#039;Angus Salkeld&#039;&#039;&#039; for helping me nail down the Corosync and OpenAIS differences.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013922.html HJ Lee] from the OpenAIS list for helping me understand the mechanisms controlling the Redundant Ring Protocol&#039;s failure detection types.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013925.html Steven Dake] for clarifying the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;to_x&amp;lt;/span&amp;gt; vs. &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;logoutput: x&amp;lt;/span&amp;gt; arguments in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais.conf&amp;lt;/span&amp;gt;.&lt;br /&gt;
* To [http://dk.linkedin.com/in/fabbione Fabio Massimo Di Nitto] for helping me get caught up with clustering and VMs.&lt;br /&gt;
&lt;br /&gt;
{{footer}}&lt;/div&gt;</summary>
		<author><name>Facade</name></author>
	</entry>
	<entry>
		<id>https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2101</id>
		<title>Abandoned - Two Node Fedora 13 Cluster</title>
		<link rel="alternate" type="text/html" href="https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2101"/>
		<updated>2010-09-09T06:22:58Z</updated>

		<summary type="html">&lt;p&gt;Facade: /* Pre-Assembly */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{howto_header}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Notice&#039;&#039;&#039;&#039;&#039;, &#039;&#039;&#039;Sep. 08, 2010&#039;&#039;&#039; - &#039;&#039;Draft 1&#039;&#039;: This is the first draft of this HowTo. It has not been peer-reviewed yet, so please proceed with caution and on a test cluster. Any and all feedback is much appreciated.&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
&lt;br /&gt;
This paper has one goal:&lt;br /&gt;
&lt;br /&gt;
# How to assemble the simplest cluster possible, a &#039;&#039;&#039;2-Node Cluster&#039;&#039;&#039;, which you can then expand on for your own needs.&lt;br /&gt;
&lt;br /&gt;
With this completed, you can then jump into &amp;quot;step 2&amp;quot; papers that will show various uses of a two node cluster:&lt;br /&gt;
&lt;br /&gt;
# How to create a &amp;quot;floating&amp;quot; virtual machine that can move between the two nodes in the event of a node failure, maximizing up time.&lt;br /&gt;
# How to create simple resources that can move between nodes. Examples will be a simple PostgreSQL database, DHCP, DNS and web servers.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
&lt;br /&gt;
It is expected that you are already comfortable with the Linux command line, specifically &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[bash]]&amp;lt;/span&amp;gt;, and that you are familiar with general administrative tasks in Red Hat based distributions, specifically [[Fedora]]. You will also need to be comfortable using editors like [[vim]], [[nano]] or similar. This paper uses &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;vim&amp;lt;/span&amp;gt; in examples. Simply substitute your favourite editor in its place.&lt;br /&gt;
&lt;br /&gt;
You are also expected to be comfortable with networking concepts. You will be expected to understand TCP/IP, [[multicast]], broadcast, subnets and netmasks, routing and other relatively basic networking concepts. Please take the time to become familiar with these concepts before proceeding.&lt;br /&gt;
&lt;br /&gt;
This said, where feasible, as much detail as is possible will be provided. For example, all configuration file locations will be shown and functioning sample files will be provided.&lt;br /&gt;
&lt;br /&gt;
== Platform ==&lt;br /&gt;
&lt;br /&gt;
This paper will implement the [[Red Hat]] Cluster Suite using the Fedora v13 distribution. This paper uses the [[x86_64]] repositories, however, if you are on an [[i386]] (32 bit) system, you should be able to follow along fine. Simply replace &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;x86_64&amp;lt;/span&amp;gt; with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i386&amp;lt;/span&amp;gt; or &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i686&amp;lt;/span&amp;gt; in package names. &lt;br /&gt;
&lt;br /&gt;
You can either download the stock Fedora 13 DVD ISO (currently at version &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5.4&amp;lt;/span&amp;gt; which is used in this document), or you can try out the alpha [[#AN!Cluster Install|AN!Cluster Install]] DVD. (4.3GB iso). If you use the latter, please test it out on a development or test cluster. If you have any problems with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;AN!Cluster&amp;lt;/span&amp;gt; variant Fedora distro, please [[Digimer|contact me]] and let me know what your trouble was.&lt;br /&gt;
&lt;br /&gt;
== Why Fedora 13? ==&lt;br /&gt;
&lt;br /&gt;
Generally speaking, I much prefer to use a server-oriented Linux distribution like [[CentOS]], [[Debian]] or similar. However, there have been many recent changes in the Linux-Clustering world that have made all of the currently available server-class distributions obsolete. With luck, once [[Red Hat]] Enterprise Linux and [[CentOS]] version 6 are released, this will change.&lt;br /&gt;
&lt;br /&gt;
Until then, [[Fedora]] version 13 provides the most up to date binary releases of the new implementation of the clustering stack available. For this reason, FC13 is the best choice in clustering, if you want to be current. To mitigate some of the issues introduced by using a workstation distribution, many packages will be stripped out of the default install.&lt;br /&gt;
&lt;br /&gt;
== Focus ==&lt;br /&gt;
&lt;br /&gt;
Clusters can serve to solve three problems: &#039;&#039;&#039;Reliability&#039;&#039;&#039;, &#039;&#039;&#039;Performance&#039;&#039;&#039; and &#039;&#039;&#039;Scalability&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
The focus of the cluster described in this paper is primarily &#039;&#039;&#039;reliability&#039;&#039;&#039;. Second to this, &#039;&#039;&#039;scalability&#039;&#039;&#039; is the priority, leaving &#039;&#039;&#039;performance&#039;&#039;&#039; to be addressed only when it does not impact the first two criteria. This is not to say that performance is not a valid priority; it simply isn&#039;t the priority of this paper.&lt;br /&gt;
&lt;br /&gt;
== Goal ==&lt;br /&gt;
&lt;br /&gt;
At the end of this paper, you should have a fully functioning two-node array capable of hosting &amp;quot;floating&amp;quot; resources. That is, resources that exist on one node and can be easily moved to the other node with minimal effort and downtime. This should conclude with a solid foundation for adding more virtual servers, up to the limit of your cluster&#039;s resources.&lt;br /&gt;
&lt;br /&gt;
This paper should also serve to show how to build the foundation of any other cluster configuration. Its core focus is to introduce the main issues that come with clustering, so that it can serve as a starting point for cluster configurations outside its own scope.&lt;br /&gt;
&lt;br /&gt;
= Begin = &lt;br /&gt;
&lt;br /&gt;
Let&#039;s begin!&lt;br /&gt;
&lt;br /&gt;
== Hardware ==&lt;br /&gt;
&lt;br /&gt;
We will need two physical servers each with the following hardware:&lt;br /&gt;
* One or more multi-core [[CPU]]s with Virtualization support.&lt;br /&gt;
* Three network cards; at least one should be gigabit or faster.&lt;br /&gt;
* One or more hard drives.&lt;br /&gt;
* You will need some form of a [[fence|fence device]]. This can be an [[IPMI]]-enabled server, a [http://nodeassassin.org Node Assassin], a fenceable [http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900&amp;amp;tab=features PDU] or similar.&lt;br /&gt;
&lt;br /&gt;
This paper uses the following hardware:&lt;br /&gt;
* ASUS [http://support.asus.com/search/search.aspx?keyword=m4a78l-m&amp;amp;SLanguage=en-us M4A78L-M]&lt;br /&gt;
* AMD Athlon II x2 250 &lt;br /&gt;
* 2GB Kingston DDR2 KVR800D2N6K2/4G (4GB kit split between the two nodes)&lt;br /&gt;
* 1x Intel 82540 PCI NICs&lt;br /&gt;
* 1x D-Link DGE-560T&lt;br /&gt;
* Node Assassin&lt;br /&gt;
&lt;br /&gt;
This is not an endorsement of the above hardware. I bought what was within my budget that would serve the purposes of creating this document. What you purchase shouldn&#039;t matter, so long as the minimum requirements are met.&lt;br /&gt;
&lt;br /&gt;
== Pre-Assembly ==&lt;br /&gt;
&lt;br /&gt;
With multiple NICs, it is quite likely that the mapping of physical devices to logical &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; devices may not be ideal. This is a particular issue if you decide to [[Setting Up a PXE Server in Fedora|network boot]] your install media. In that case, if the wrong NIC is chosen for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;, you will be presented with a list of MAC addresses to attempt setup with.&lt;br /&gt;
&lt;br /&gt;
Before you assemble your servers, record their network cards&#039; [[MAC]] addresses. I like to keep simple text files like these:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node01.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:EA	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:53	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:46:E4	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node02.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:D8	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:5A	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:45:78	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Feel free to record the information in any way that suits you the best.&lt;br /&gt;
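&lt;br /&gt;
If a node is already running some Linux environment, such as a live CD, the MAC-to-device mapping can be pulled from the running system rather than read off the cards. A minimal sketch, assuming the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ip&amp;lt;/span&amp;gt; tool is installed:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Print each interface name alongside its MAC address.&lt;br /&gt;
ip -o link show | awk &#039;{print $2, $(NF-2)}&#039;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;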
&lt;br /&gt;
== OS Install ==&lt;br /&gt;
&lt;br /&gt;
Start with a stock Fedora 13 install. This How-To uses Fedora 13 x86_64, however it should be fairly easy to adapt to other recent Fedora versions. This document is also intended to be easily ported to [[CentOS]]/[[RHEL]] version 6 once it is released. It will not adapt well to CentOS/RHEL version 5, though; much of the cluster stack has changed dramatically since version 5 was initially released.&lt;br /&gt;
&lt;br /&gt;
These are the sample kickstart scripts used by this paper. Be sure to set your own password string and network settings.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Warning&#039;&#039;&#039;&#039;&#039;! These kickstart scripts &#039;&#039;&#039;&#039;&#039;will erase your hard drive&#039;&#039;&#039;&#039;&#039;! Adapt them, don&#039;t blindly use them.&lt;br /&gt;
&lt;br /&gt;
Generic cluster node kickstart scripts.&lt;br /&gt;
* [[Fedora13 KS an-node01.ks|an-node01.ks]]&lt;br /&gt;
* [[Fedora13 KS an-node02.ks|an-node02.ks]]&lt;br /&gt;
&lt;br /&gt;
== AN!Cluster Install ==&lt;br /&gt;
&lt;br /&gt;
If you are feeling brave, below is a link to a custom install DVD that contains the kickstart scripts to setup nodes and an &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-cluster&amp;lt;/span&amp;gt; directory with all the configuration files.&lt;br /&gt;
&lt;br /&gt;
* Download the custom &#039;&#039;&#039;AN!Cluster v0.2.001&#039;&#039;&#039; Install DVD. (4.5[[GiB]] iso). (Currently disabled - Reworking for F13)&lt;br /&gt;
&lt;br /&gt;
== Post OS Install ==&lt;br /&gt;
&lt;br /&gt;
Once the OS is installed, we need to do a few things:&lt;br /&gt;
&lt;br /&gt;
# Disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;&lt;br /&gt;
# Setup networking.&lt;br /&gt;
# Change the default run-level.&lt;br /&gt;
&lt;br /&gt;
=== Disable &#039;selinux&#039; ===&lt;br /&gt;
&lt;br /&gt;
To allow this paper to focus on clustering, we will disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;. Obviously, this introduces security issues that you may not be comfortable with. If you wish to leave &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt; enabled, it will be up to you to sort out issues that crop up.&lt;br /&gt;
&lt;br /&gt;
To disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/selinux/config&amp;lt;/span&amp;gt; and change &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=enforcing&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=permissive&amp;lt;/span&amp;gt;. You will need to reboot in order for the changes to take effect.&lt;br /&gt;
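&lt;br /&gt;
As a sketch, the same edit can be made non-interactively; adapt the pattern if your file differs:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Switch SELinux from enforcing to permissive, then confirm the change.&lt;br /&gt;
sed -i &#039;s/^SELINUX=enforcing/SELINUX=permissive/&#039; /etc/selinux/config&lt;br /&gt;
grep &#039;^SELINUX=&#039; /etc/selinux/config&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Running &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;setenforce 0&amp;lt;/span&amp;gt; will also switch the running system to permissive mode immediately, without waiting for the reboot.&lt;br /&gt;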
&lt;br /&gt;
=== Change the Default Run-Level ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. It improves performance only; it is not required.&lt;br /&gt;
&lt;br /&gt;
If you don&#039;t plan to work on your nodes directly, it makes sense to switch the default run level from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;3&amp;lt;/span&amp;gt;. This prevents the window manager, like Gnome or KDE, from starting at boot. This frees up a fair bit of memory and system resources, and reduces the possible attack vectors.&lt;br /&gt;
&lt;br /&gt;
To do this, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/inittab&amp;lt;/span&amp;gt;, change the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:5:initdefault:&amp;lt;/span&amp;gt; line to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:3:initdefault:&amp;lt;/span&amp;gt; and then switch to run level 3:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/inittab&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
id:3:initdefault:&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
init 3&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Setup Networking ===&lt;br /&gt;
&lt;br /&gt;
We need to remove &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt;, enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt;, configure the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ifcfg-eth*&amp;lt;/span&amp;gt; files and then start the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; daemon.&lt;br /&gt;
&lt;br /&gt;
==== Network Layout ====&lt;br /&gt;
&lt;br /&gt;
This setup expects you to have three physical network cards connected to three independent networks. To have a common vernacular, let&#039;s use this table to describe them:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!class=&amp;quot;cell_all&amp;quot;|Network Description&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Short Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Device Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Suggested Subnet&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|NIC Properties&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Internet-Facing Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Remaining NIC should be used here.&amp;lt;br /&amp;gt;If using a [[Setting Up a PXE Server in Fedora|PXE server]], this should be a bootable NIC.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Storage Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth1&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Fastest NIC should be used here.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Back-Channel Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth2&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|NICs with [[IPMI]] piggy-back &#039;&#039;&#039;must&#039;&#039;&#039; be used here.&amp;lt;br /&amp;gt;Second-fastest NIC should be used here.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Take note of these concerns when planning which NIC to use on each subnet. These issues are presented in the order in which they must be addressed:&lt;br /&gt;
# If your nodes have [[IPMI]] piggy-backing on a normal NIC, that NIC &#039;&#039;&#039;&#039;&#039;must&#039;&#039;&#039;&#039;&#039; be used on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. Having your fence device accessible on a subnet that can be remotely accessed can pose a &#039;&#039;major&#039;&#039; security risk.&lt;br /&gt;
# The fastest NIC should be used for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt; subnet. Be sure to know which NICs support the largest jumbo frames when considering this.&lt;br /&gt;
# If you still have two NICs to choose from, use the fastest remaining NIC for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. This will minimize the time it takes to perform tasks like hot-migration of live virtual machines.&lt;br /&gt;
# The final NIC should be used for the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt; subnet.&lt;br /&gt;
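&lt;br /&gt;
When ranking your NICs by speed, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethtool&amp;lt;/span&amp;gt; can report the negotiated link speed and the driver in use. A sketch, assuming &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethtool&amp;lt;/span&amp;gt; is installed:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Report the current link speed, then the driver details, for eth0.&lt;br /&gt;
ethtool eth0 | grep Speed&lt;br /&gt;
ethtool -i eth0&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;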
&lt;br /&gt;
==== Node IP Addresses ====&lt;br /&gt;
&lt;br /&gt;
Obviously, the IP addresses you give to your nodes should be ones that suit you best. In this example, the following IP addresses are used:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;amp;nbsp;&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Internet-Facing Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Storage Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Back-Channel Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node01&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node02&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==== Remove NetworkManager ====&lt;br /&gt;
&lt;br /&gt;
Some cluster software &#039;&#039;&#039;will not&#039;&#039;&#039; start with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; installed. This is because &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is designed to be a highly-adaptive network system that can accommodate frequent changes in the network. To simplify these network transitions for end-users, a lot of reconfiguration of the network is done behind the scenes.&lt;br /&gt;
&lt;br /&gt;
For workstations, this is wonderful. For clustering, this can be disastrous. Transient network outages are already a risk to a cluster&#039;s stability!&lt;br /&gt;
&lt;br /&gt;
So first up, make sure that &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is completely removed from your system. If you used the [[#OS Install|kickstart]] scripts, then it was not installed. Otherwise, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum remove NetworkManager NetworkManager-gnome NetworkManager-openconnect NetworkManager-openvpn NetworkManager-pptp NetworkManager-vpnc cnetworkmanager knetworkmanager knetworkmanager-openvpn knetworkmanager-pptp knetworkmanager-vpnc libproxy-networkmanager yum-NetworkManager-dispatcher&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup &#039;network&#039; ====&lt;br /&gt;
&lt;br /&gt;
Before proceeding with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; configuration, check to see if your network cards are aligned to the proper &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; network names. If they need to be adjusted, please follow this How-To before proceeding:&lt;br /&gt;
&lt;br /&gt;
* [[Changing the ethX to Ethernet Device Mapping in Fedora]]&lt;br /&gt;
&lt;br /&gt;
There are a few ways to configure &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; in Fedora:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network&amp;lt;/span&amp;gt; (graphical)&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network-tui&amp;lt;/span&amp;gt; (ncurses)&lt;br /&gt;
* Directly editing the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/sysconfig/network-scripts/ifcfg-eth*&amp;lt;/span&amp;gt; files. (See: [http://docs.fedoraproject.org/en-US/Fedora/12/html/Deployment_Guide/s1-networkscripts-interfaces.html here] for a full list of options)&lt;br /&gt;
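&lt;br /&gt;
If you edit the files directly, a minimal static configuration might look like the following. The address shown is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01&amp;lt;/span&amp;gt;&#039;s IFN value from the tables above, and the gateway is a hypothetical example, so adapt everything to your own network:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/sysconfig/network-scripts/ifcfg-eth0&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
DEVICE=eth0&lt;br /&gt;
BOOTPROTO=static&lt;br /&gt;
ONBOOT=yes&lt;br /&gt;
IPADDR=192.168.1.71&lt;br /&gt;
NETMASK=255.255.255.0&lt;br /&gt;
GATEWAY=192.168.1.1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;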
&lt;br /&gt;
Do not proceed until your node&#039;s networking is fully configured.&lt;br /&gt;
&lt;br /&gt;
==== Update the Hosts File ====&lt;br /&gt;
&lt;br /&gt;
Some applications expect to be able to call nodes by their name. To accommodate this, and to ensure that inter-node communication takes place on the back-channel subnet, we remove any existing hostname entries and then add the following to the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Any pre-existing entries matching the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; must be removed from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt;. There is a good chance there will be an entry that resolves to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;127.0.0.1&amp;lt;/span&amp;gt; which would cause problems later.&lt;br /&gt;
&lt;br /&gt;
Obviously, adapt the names and IPs to match your nodes and subnets. The only critical thing is to make sure that the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; is resolvable to the back-channel subnet. I like to add a short-form name for convenience.&lt;br /&gt;
&lt;br /&gt;
The updated &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file should look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/hosts&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4&lt;br /&gt;
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6&lt;br /&gt;
&lt;br /&gt;
# Back-channel IPs to name mapping.&lt;br /&gt;
10.0.1.71	an-node01 an-node01.alteeve.com&lt;br /&gt;
10.0.1.72	an-node02 an-node02.alteeve.com&lt;br /&gt;
&lt;br /&gt;
# Storage network&lt;br /&gt;
10.0.0.71	an-node01.sn&lt;br /&gt;
10.0.0.72	an-node02.sn&lt;br /&gt;
&lt;br /&gt;
# Internet facing&lt;br /&gt;
192.168.1.71	an-node01.inet&lt;br /&gt;
192.168.1.72	an-node02.inet&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To test this, ping both nodes by name and make sure the ping packets are sent on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt; subnet:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node01&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node01 (10.0.1.71) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=1 ttl=64 time=0.049 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=2 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=3 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=4 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=5 ttl=64 time=0.055 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node02 (10.0.1.72) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=1 ttl=64 time=0.221 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=2 ttl=64 time=0.188 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=3 ttl=64 time=0.217 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=4 ttl=64 time=0.192 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=5 ttl=64 time=0.163 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Disable Firewalls ====&lt;br /&gt;
&lt;br /&gt;
Be sure to flush netfilter tables and disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ip6tables&amp;lt;/span&amp;gt; from starting on our nodes.&lt;br /&gt;
&lt;br /&gt;
There will be enough potential sources of problems as it is. Disabling firewalls at this stage will minimize the chance of an errant &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; rule messing up our configuration. If, before launch, you wish to implement a firewall, feel free to do so, but be sure to thoroughly test your cluster to ensure no problems were introduced.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig --level 2345 iptables off&lt;br /&gt;
/etc/init.d/iptables stop&lt;br /&gt;
chkconfig --level 2345 ip6tables off&lt;br /&gt;
/etc/init.d/ip6tables stop&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup SSH Shared Keys ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. Setting up shared keys will allow your nodes to pass files between one another and execute commands remotely without needing to enter a password. This is obviously somewhat risky from a security point of view. As such, it is up to you whether you do this or not. This is not meant to be a security-focused How-To, so please independently study the risks.&lt;br /&gt;
&lt;br /&gt;
If you&#039;re a little new to [[SSH]], it can be a bit confusing keeping connections straight in your head. When you connect to a remote machine, you start the connection on your machine as the user you are logged in as. This is the source user. When you call the remote machine, you tell it what user you want to log in as. This is the remote user.&lt;br /&gt;
&lt;br /&gt;
You will need to create an SSH key for each source user, and then you will need to copy the newly generated public key to each remote machine&#039;s user directory that you want to connect to. In this example, we want to connect to either node, from either node, as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. So we will create a key for each node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user and then copy the generated public key to the &#039;&#039;other&#039;&#039; node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user&#039;s directory.&lt;br /&gt;
&lt;br /&gt;
For each user, on each machine you want to connect &#039;&#039;&#039;from&#039;&#039;&#039;, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh-keygen -t dsa -N &amp;quot;&amp;quot; -f ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
Generating public/private dsa key pair.&lt;br /&gt;
Your identification has been saved in /root/.ssh/id_dsa.&lt;br /&gt;
Your public key has been saved in /root/.ssh/id_dsa.pub.&lt;br /&gt;
The key fingerprint is:&lt;br /&gt;
c4:cc:10:0a:6b:60:24:e7:57:94:69:e5:10:b2:cb:1f root@an-node01.alteeve.com&lt;br /&gt;
The key&#039;s randomart image is:&lt;br /&gt;
+--[ DSA 1024]----+&lt;br /&gt;
|+oo ..B*.        |&lt;br /&gt;
|o+ o =+B         |&lt;br /&gt;
|  + +.  *        |&lt;br /&gt;
| . o . .         |&lt;br /&gt;
|    o E S        |&lt;br /&gt;
|     . .         |&lt;br /&gt;
|      .          |&lt;br /&gt;
|                 |&lt;br /&gt;
|                 |&lt;br /&gt;
+-----------------+&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will create two files: the private key called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa&amp;lt;/span&amp;gt; and the public key called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa.pub&amp;lt;/span&amp;gt;. The private key &#039;&#039;&#039;&#039;&#039;must never&#039;&#039;&#039;&#039;&#039; be group or world readable! That is, it should be set to mode &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;0600&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The two files should look like:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Private key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
-----BEGIN DSA PRIVATE KEY-----&lt;br /&gt;
MIIBugIBAAKBgQCVo5nzeC/W8HWaFkD4pAmWO7ovf7S01f8HgocKKOYX/nMNxRui&lt;br /&gt;
wExBUUUt1b2TxYYQklTd9dn8UAzZ3UohN0CJNDrp//t2wfyACKKc2q+ewao5/svQ&lt;br /&gt;
xXTBu2ZPebKPcr1w9p0hq0xSr77Rg3v6Auc6AnC79DM9XkhYkVgNfgrvMwIVAKdY&lt;br /&gt;
SgTycvqNEWsJrCTTn5485yWvAoGANQ3rzIxFy2zHWyMlrpA9r5jllgxfaJVx3/Iw&lt;br /&gt;
DrEF82lr2dOmG8oYiKAhfvdgNyV7VeM2yxdjcdLPJAHHCx5BZZ7V9KjMzY26wS2r&lt;br /&gt;
d58TZJoWmO0nscmcwIbCv2at3fiMpxgr+nzD4tKxFOdWxYBwdmUpIjtJoW8wzY2H&lt;br /&gt;
uNghjL0CgYA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDN&lt;br /&gt;
bbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EE&lt;br /&gt;
D3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5AIUFX97yIig&lt;br /&gt;
n2SUltN+10NWUy4oYsA=&lt;br /&gt;
-----END DSA PRIVATE KEY-----&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Public key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa.pub&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Copy the public key and then &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; normally into the remote machine as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create a file called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; and paste in the key.&lt;br /&gt;
&lt;br /&gt;
From &#039;&#039;&#039;an-node01&#039;&#039;&#039;, type:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh root@an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
The authenticity of host &#039;an-node02 (10.0.1.72)&#039; can&#039;t be established.&lt;br /&gt;
RSA key fingerprint is d4:1b:68:5f:fa:ef:0f:0c:16:e7:f9:0a:8d:69:3e:c5.&lt;br /&gt;
Are you sure you want to continue connecting (yes/no)? yes&lt;br /&gt;
Warning: Permanently added &#039;an-node02,10.0.1.72&#039; (RSA) to the list of known hosts.&lt;br /&gt;
root@an-node02&#039;s password: &lt;br /&gt;
Last login: Tue Jul  6 22:28:19 2010 from 192.168.1.102&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will now be logged into &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02&amp;lt;/span&amp;gt; as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; file and paste into it the public key from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01&amp;lt;/span&amp;gt;. If the remote machine&#039;s user hasn&#039;t used &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; yet, their &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh&amp;lt;/span&amp;gt; directory will not exist, so you will need to create it first.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/authorized_keys&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now log out and then log back into the remote machine. This time, the connection should succeed without prompting for a password!&lt;br /&gt;
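&lt;br /&gt;
As a convenience, the directory and file can be prepared ahead of time with the permissions &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;sshd&amp;lt;/span&amp;gt; expects. This is a minimal sketch of one way to do it on the remote node:&lt;br /&gt;
&lt;br /&gt;
```shell
# A sketch only: prepare ~/.ssh on the remote node before pasting in the key.
# sshd will ignore authorized_keys if the directory or file permissions are
# too loose.
mkdir -p ~/.ssh                     # create the directory if it doesn't exist
chmod 700 ~/.ssh                    # owner-only access to the directory
touch ~/.ssh/authorized_keys        # create the file if it doesn't exist
chmod 600 ~/.ssh/authorized_keys    # owner-only read/write on the key file
```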
&lt;br /&gt;
= Initial Cluster Setup =&lt;br /&gt;
&lt;br /&gt;
Before we get into specifics, let&#039;s take a minute to talk about the major components used in our cluster.&lt;br /&gt;
&lt;br /&gt;
== Core Program Overviews ==&lt;br /&gt;
&lt;br /&gt;
These are the core programs, likely new to you, that we will use to build our cluster. Before we configure them, let&#039;s take a minute to understand their roles.&lt;br /&gt;
&lt;br /&gt;
=== Cluster Manager ===&lt;br /&gt;
&lt;br /&gt;
The cluster manager, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, takes the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; configuration file and uses it to configure and start the various cluster elements. This is where nodes are defined, fence devices are configured and various cluster tolerances are set.&lt;br /&gt;
&lt;br /&gt;
=== Corosync ===&lt;br /&gt;
&lt;br /&gt;
[http://www.corosync.org Corosync] is, essentially, the clustering kernel which provides the essential services for cluster-aware applications and other cluster services. At its core is the [[totem]] protocol, which provides cluster membership and ordered messaging. On top of this, it implements a number of core clustering services like [[#cpg; Closed Process Group|closed process group messaging]], [[quorum]] and [[confdb]] (an object database).&lt;br /&gt;
&lt;br /&gt;
Its goal is to provide a substantially simpler and more flexible set of APIs to facilitate clustering in Linux. It manages which nodes are in the cluster, triggers error messages when something fails, manages cluster locking and so on. Most other clustered applications rely on corosync to know when something has happened or to announce when a cluster-related action has taken place.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;corosync_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
==== cpg; Closed Process Group ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install corosynclib-devel&lt;br /&gt;
man cpg_overview&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The closed process group is a small process layer on top of the totem protocol provided by corosync. It handles the sending and delivery of messages amongst nodes in a consistent order. It adds [[PID]]s and group names to the membership layer. Only members of the group get the messages, thus it is a &amp;quot;closed&amp;quot; group.&lt;br /&gt;
&lt;br /&gt;
=== OpenAIS ===&lt;br /&gt;
&lt;br /&gt;
OpenAIS is now an extension to Corosync that adds an open-source implementation of the [http://www.saforum.org/ Service Availability (SA) Forum&#039;s] &#039;Application Interface Specification&#039; ([[AIS]]). It is an [[API]] and policy designed to be used by applications concerned with maintaining services during faults. AIS implements the &#039;Availability Management Framework&#039; ([[AMF]]) which, in turn, provides for application fail over, cluster management ([[CLM]]), Checkpointing ([[CKPT]]), Eventing ([[EVT]]), Messaging ([[MSG]]), and Distributed Locking ([[DLOCK]]).&lt;br /&gt;
&lt;br /&gt;
It does this by implementing pluggable shared libraries that get loaded into corosync at runtime. These libraries then can make use of corosync&#039;s internal API. Both corosync and openais services provide [[IPC]] interfaces so that application programmers can make use of their behaviour.&lt;br /&gt;
&lt;br /&gt;
In short: applications can use OpenAIS to be cluster-aware. Its libraries are used by some applications, including Pacemaker. In our application, we will only be using its libraries indirectly.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
[http://www.clusterlabs.org Pacemaker] is the cluster resource manager. It can be used to trigger scripts to control non cluster-aware applications. In this way, these non cluster-aware applications can be made more highly available. It is considered a replacement for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[rgmanager]]&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Setup Core Programs ==&lt;br /&gt;
&lt;br /&gt;
We now need to edit and create the configuration files for our cluster.&lt;br /&gt;
&lt;br /&gt;
=== Setup cman ===&lt;br /&gt;
&lt;br /&gt;
Depending on your cluster, this may be the only section you need to configure.&lt;br /&gt;
&lt;br /&gt;
First, install the cluster manager:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install cman&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once installed, we need to create and fill in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; file. This is the core configuration file of our cluster and is an [[XML]] formatted file.&lt;br /&gt;
&lt;br /&gt;
Please see: [[Two-Node Fedora 13 cluster.conf]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The example in that article details many of the common options you will want to use. For a more complete discussion of this file, please run &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man 5 cluster.conf&amp;lt;/span&amp;gt;. Also, please be aware that some arguments can overlap between [[#Setup Corosync|corosync]] and this file.&lt;br /&gt;
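&lt;br /&gt;
For orientation, the skeleton of a two-node &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; is quite small. The sketch below reuses this HowTo&#039;s node names and deliberately omits the fencing configuration that a real cluster must have; see the linked article for a complete file.&lt;br /&gt;
&lt;br /&gt;
```xml
<?xml version="1.0"?>
<!-- Minimal sketch only; fencing omitted. Bump config_version on every edit. -->
<cluster name="an-cluster" config_version="1">
	<!-- two_node="1" with expected_votes="1" lets a two-node cluster keep
	     quorum when one node dies. -->
	<cman two_node="1" expected_votes="1"/>
	<clusternodes>
		<clusternode name="an-node01.alteeve.com" nodeid="1"/>
		<clusternode name="an-node02.alteeve.com" nodeid="2"/>
	</clusternodes>
</cluster>
```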
&lt;br /&gt;
Once it&#039;s set up, we need to verify it. Be sure to run this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
xmllint --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are errors, fix them. Once it is formatted properly, the contents of your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file will be returned, followed by &amp;quot;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf validates&amp;lt;/span&amp;gt;&amp;quot;. Once this is the case, you can move on.&lt;br /&gt;
&lt;br /&gt;
A note: normally at this stage, you&#039;d use &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;chkconfig&amp;lt;/span&amp;gt; to enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; to start at boot. However, depending on how you use your cluster, you may need to modify the boot order, so for now, let&#039;s leave it off until we&#039;ve got all the components enabled. This is particularly important if you will be using [[DRBD]], [[CLVM]] or [[Xen]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig cman off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To see if it&#039;s working, have a terminal open to both nodes and manually start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;. I suggest having two terminal windows open on each node. In one, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;tail&amp;lt;/span&amp;gt;ing &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/var/log/messages&amp;lt;/span&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
clear; tail -f -n 0 /var/log/messages&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now, with both log files being watched, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; on both nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: You MUST execute the following on both nodes within the time limit set by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;post_join_delay&amp;lt;/span&amp;gt; (the default is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;6&amp;lt;/span&amp;gt; seconds). If you wait longer than this, the first node started will [[fence]] the other node, assuming you have fencing configured properly. If fencing is &#039;&#039;&#039;not&#039;&#039;&#039; configured properly, your first node will hang.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
/etc/init.d/cman start&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Examine the log files to verify there were no errors. If there were, fix them. If there were none, fantastic!&lt;br /&gt;
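&lt;br /&gt;
Beyond the log files, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; ships a small tool for checking cluster state. Assuming both nodes started cleanly, these two calls are handy for confirming that each node sees the other:&lt;br /&gt;
&lt;br /&gt;
```text
cman_tool status   # quorum state, expected votes and node count
cman_tool nodes    # membership list; both nodes should show as joined
```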
&lt;br /&gt;
&#039;&#039;&#039;Warning&#039;&#039;&#039;: This last step will forcefully reboot your nodes. Do not do this on a production cluster! If it must be done on a production server, stop all non-essential services and be sure you have a good backup, &#039;&#039;just in case&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Lastly, test that your fencing is configured and working properly. We do this by using a program called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_node&amp;lt;/span&amp;gt;. With both nodes up and running &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, call a fence against the other node. Be sure to be watching the log files.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;From&#039;&#039;&#039; &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
fence_node an-node02.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
fence an-node02.alteeve.com success&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If it worked, when the node comes back up, repeat the process from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; against &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Setup Corosync ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: In our cluster, we will not need to configure Corosync.&lt;br /&gt;
&lt;br /&gt;
When using &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, as we are here, Corosync&#039;s configuration file is ignored.&lt;br /&gt;
&lt;br /&gt;
Most of the settings available in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 corosync.conf|corosync.conf]]&amp;lt;/span&amp;gt; can be found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|cluster.conf]]&amp;lt;/span&amp;gt; file. Please see the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; article for a comprehensive list of what is and is not supported.&lt;br /&gt;
&lt;br /&gt;
If you want to configure corosync and not use cman, please see the following article:&lt;br /&gt;
&lt;br /&gt;
* [[Two-Node Fedora 13 corosync.conf]]&lt;br /&gt;
&lt;br /&gt;
=== Setup Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
ToDo.&lt;br /&gt;
&lt;br /&gt;
= Fencing =&lt;br /&gt;
&lt;br /&gt;
Before proceeding with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file, you must understand what fencing is, how it is used in Red Hat/CentOS clusters and why it is so important.&lt;br /&gt;
&lt;br /&gt;
* The Cluster Admin&#039;s Mantra:&lt;br /&gt;
** &#039;&#039;&#039;The only thing you don&#039;t know is what you don&#039;t know&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Just because one node loses communication with another node, it &#039;&#039;&#039;cannot&#039;&#039;&#039; be assumed that the silent node is dead!&lt;br /&gt;
&lt;br /&gt;
== What is it? ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Fencing&amp;quot; is the act of isolating a malfunctioning node. The goal is to prevent a &#039;&#039;&#039;split-brain&#039;&#039;&#039; condition where two nodes think the other member is dead and continue to use a shared resource. When this happens, file system corruption is almost guaranteed. Another dangerous scenario would be if one node paused while writing to a disk, the other node decides it&#039;s dead and starts to replay the journal, then the first node recovers and completes the write. The results would be equally disastrous. If you are lucky enough to not lose the shared file system, you will be faced with the task of determining what data got written to which node, merging that data and/or overwriting the node you trust the least. This &#039;best case&#039; is still pretty lousy.&lt;br /&gt;
&lt;br /&gt;
Fencing, isolating a node from altering shared disks, can be accomplished in a couple of ways:&lt;br /&gt;
&lt;br /&gt;
* Power&lt;br /&gt;
** Power fencing is where a device is used to cut the power to a malfunctioning node. This is probably the most common type.&lt;br /&gt;
* Blocking&lt;br /&gt;
** Blocking is often implemented at the network level. This type of fencing leaves the node alone, but disconnects it from the storage network. Often this is done by a switch which prevents traffic coming from the fenced node.&lt;br /&gt;
&lt;br /&gt;
With power fencing, the term used is &amp;quot;STONITH&amp;quot;, literally, &#039;&#039;&#039;S&#039;&#039;&#039;hoot &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;O&#039;&#039;&#039;ther &#039;&#039;&#039;N&#039;&#039;&#039;ode &#039;&#039;&#039;I&#039;&#039;&#039;n &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;H&#039;&#039;&#039;ead. Picture it like an old west duel. If one node is dead, the other node wins the duel by default and the dead node simply gets shot again. When both nodes are alive, however, the faster node will win and will &amp;quot;kill&amp;quot; (power off or reset) the slower node before it has a chance to fire. Once the duel is over, the surviving node can then access the shared resource, confident that it is the only one working on it.&lt;br /&gt;
&lt;br /&gt;
== Misconception ==&lt;br /&gt;
&lt;br /&gt;
It is a &#039;&#039;&#039;very&#039;&#039;&#039; common mistake to ignore fencing when first starting to learn about clustering. Often people think &#039;&#039;&amp;quot;It&#039;s just for production systems, I don&#039;t need to worry about it yet because I don&#039;t care what happens to my test cluster.&amp;quot;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Wrong!&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For the most practical reason: the cluster software will block all I/O transactions when it can&#039;t guarantee that a fence operation succeeded. The result is that your cluster will essentially &amp;quot;lock up&amp;quot;. Likewise, [[cman]] and related daemons will fail if they can&#039;t find a fence agent to use.&lt;br /&gt;
&lt;br /&gt;
Secondly: testing our cluster will involve inducing errors. Without proper fencing, there is a high probability that our shared file system will be corrupted. That would force you to start over, making your learning take a lot longer than it needs to.&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
How you implement fencing depends largely on what kind of fence device(s) you have. This is where I need help from you, dear reader. I only have access to [[Node Assassin]] fence devices on my lab&#039;s cluster. If you have other fence devices, like [[IPMI]], [[DRAC]] or so on, please let me know so that I can expand this section.&lt;br /&gt;
&lt;br /&gt;
=== Implementing Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In Red Hat&#039;s cluster software, the fence device(s) are configured in the main &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf&amp;lt;/span&amp;gt; cluster configuration file. This configuration is then acted on via the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon. We&#039;ll cover the details of the [[cluster.conf]] file in a moment.&lt;br /&gt;
&lt;br /&gt;
When the cluster determines that a node needs to be fenced, the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon will consult the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file for information on how to access the fence device. Given this &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; snippet:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;xml&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;cluster name=&amp;quot;an-cluster&amp;quot; config_version=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
	&amp;lt;clusternodes&amp;gt;&lt;br /&gt;
		&amp;lt;clusternode name=&amp;quot;an-node02.alteeve.com&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
			&amp;lt;fence&amp;gt;&lt;br /&gt;
				&amp;lt;method name=&amp;quot;node_assassin&amp;quot;&amp;gt;&lt;br /&gt;
					&amp;lt;device name=&amp;quot;motoko&amp;quot; port=&amp;quot;02&amp;quot; action=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
				&amp;lt;/method&amp;gt;&lt;br /&gt;
			&amp;lt;/fence&amp;gt;&lt;br /&gt;
		&amp;lt;/clusternode&amp;gt;&lt;br /&gt;
	&amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
	&amp;lt;fencedevices&amp;gt;&lt;br /&gt;
		&amp;lt;fencedevice name=&amp;quot;motoko&amp;quot; agent=&amp;quot;fence_na&amp;quot; quiet=&amp;quot;true&amp;quot;&lt;br /&gt;
		ipaddr=&amp;quot;motoko.alteeve.com&amp;quot; login=&amp;quot;motoko&amp;quot; passwd=&amp;quot;secret&amp;quot;&amp;gt;&lt;br /&gt;
		&amp;lt;/fencedevice&amp;gt;&lt;br /&gt;
	&amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If the cluster manager determines that the node &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; needs to be fenced, it looks at the first (and only, in this case) &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;method&amp;gt;&amp;lt;/span&amp;gt; entry and reads the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt; of its &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;device&amp;gt;&amp;lt;/span&amp;gt;, which is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; in this case. It then looks in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fencedevices&amp;gt;&amp;lt;/span&amp;gt; section for the device with the matching &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;. From there, it gets the information needed to find and access the fence device. Once it connects to the fence device, it then passes along the options set in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;device&amp;gt;&amp;lt;/span&amp;gt; entry.&lt;br /&gt;
&lt;br /&gt;
So in this example, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; looks up the details on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; Node Assassin fence device. It calls the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt; program, called a fence agent, and passes the following arguments:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr=motoko.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login=motoko&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd=secret&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;quiet=true&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port=2&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action=off&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
How the fence agent acts on these arguments varies depending on the fence device itself. In general terms, the &#039;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt;&#039; fence agent will create a connection to the device at the IP address (or resolvable name, as in this case) specified in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr&amp;lt;/span&amp;gt; argument. Once connected, it will authenticate using the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd&amp;lt;/span&amp;gt; arguments. Once authenticated, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port&amp;lt;/span&amp;gt; to act on, which could be a power jack, a power or reset button, a network switch port and so on. Finally, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action&amp;lt;/span&amp;gt; to take.&lt;br /&gt;
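&lt;br /&gt;
For the curious, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; hands these options to the agent as plain &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name=value&amp;lt;/span&amp;gt; pairs on the agent&#039;s standard input, one per line. For the example above, the agent would receive roughly the following (a sketch, not captured output):&lt;br /&gt;
&lt;br /&gt;
```text
ipaddr=motoko.alteeve.com
login=motoko
passwd=secret
quiet=true
port=2
action=off
```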
&lt;br /&gt;
Once the device completes, it returns a success or failure message. If the first attempt fails, the fence daemon will try the next &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;method&amp;gt;&amp;lt;/span&amp;gt;, if a second exists. It will keep trying fence devices in the order they are found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file until it runs out of devices. If it fails to fence the node, most daemons will &amp;quot;block&amp;quot;, that is, lock up and stop responding until the issue is resolved. The logic for this is that a locked up cluster is better than a corrupted one.&lt;br /&gt;
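&lt;br /&gt;
That fallback order is expressed simply by listing more than one &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;method&amp;gt;&amp;lt;/span&amp;gt; under a node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fence&amp;gt;&amp;lt;/span&amp;gt; element. A sketch, with a hypothetical IPMI entry as a backup to the Node Assassin example above:&lt;br /&gt;
&lt;br /&gt;
```xml
<clusternode name="an-node02.alteeve.com" nodeid="2">
	<fence>
		<!-- Tried first. -->
		<method name="node_assassin">
			<device name="motoko" port="02" action="off"/>
		</method>
		<!-- Tried only if every device in the first method fails. The
		     "ipmi_an02" device is hypothetical and would need a matching
		     <fencedevice> entry of its own. -->
		<method name="backup_ipmi">
			<device name="ipmi_an02"/>
		</method>
	</fence>
</clusternode>
```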
&lt;br /&gt;
If any of the fence devices succeed though, the cluster will know that it is safe to proceed and will reconfigure the cluster without the defective node.&lt;br /&gt;
&lt;br /&gt;
== Fence Devices ==&lt;br /&gt;
&lt;br /&gt;
Many major [[OEM]]s have their own remote management devices that can serve as fence devices. Examples are [http://dell.ca Dell]&#039;s &#039;DRAC&#039; (Dell Remote Access Controller), [http://hp.ca HP]&#039;s &#039;iLO&#039; (Integrated Lights-Out), [http://ibm.ca IBM]&#039;s &#039;RSA&#039; (Remote Supervisor Adapter), [http://sun.ca Sun]&#039;s &#039;SSP&#039; (System Service Processor) and so on. Smaller manufacturers implement remote management via [[IPMI]], the Intelligent Platform Management Interface.&lt;br /&gt;
&lt;br /&gt;
In the above devices, fencing is implemented via a built-in or integrated device inside the server. These devices are usually accessible even when the host server is powered off or hard locked. Via these devices, the host server can be powered off, reset and powered on remotely, regardless of the state of the host server.&lt;br /&gt;
&lt;br /&gt;
Block fencing is possible when the device connecting a node to shared resources, like a fiber-channel SAN switch, provides a method of logically &amp;quot;unplugging&amp;quot; a defective node from the shared resource, leaving the node itself alone.&lt;br /&gt;
&lt;br /&gt;
=== IPMI ===&lt;br /&gt;
&lt;br /&gt;
[[IPMI]] is perhaps the most common method used for fencing nodes in a cluster. It requires having a system board with an IPMI baseboard management controller (BMC). IPMI is often the underlying technology used by many OEM-branded remote access tools. If you don&#039;t see a specific fence agent for your server&#039;s remote access application, experiment with generic IPMI tools.&lt;br /&gt;
&lt;br /&gt;
To start, we need to install the IPMI user software:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install ipmitool&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
A cheap alternative is the [[Node Assassin]], an open-hardware, open source fence device. It was built to allow the use of commodity system boards that lack the remote management support found on more expensive, server class hardware.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Full Disclosure&#039;&#039;&#039;: Node Assassin was created by me, with much help from others, for this paper.&lt;br /&gt;
&lt;br /&gt;
= Clean Up =&lt;br /&gt;
&lt;br /&gt;
Some daemons may have been installed or dragged in as dependencies during the setup of the cluster. Most notable is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;heartbeat&amp;lt;/span&amp;gt;, which is not compatible with [[RHCS]]. The following command will disable various daemons from starting at boot.&lt;br /&gt;
&lt;br /&gt;
It is safe to run even when the daemons aren&#039;t installed or have already been removed. Of course, skip the ones you actually want to use. This assumes that the nodes have been built according to this HowTo.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig heartbeat off; chkconfig iscsid off; chkconfig iptables off; chkconfig ip6tables off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= So You Have a Cluster - Now What? =&lt;br /&gt;
&lt;br /&gt;
Building the cluster infrastructure was pretty easy, wasn&#039;t it?&lt;br /&gt;
&lt;br /&gt;
But what will you &#039;&#039;&#039;&#039;&#039;do&#039;&#039;&#039;&#039;&#039; with it?&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;width: 75%; text-align: center; padding: 10px; border-spacing: 15px;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!style=&amp;quot;width: 100%; border: 1px solid #dfdfdf;&amp;quot; colspan=&amp;quot;3&amp;quot;|Choose Your Own &amp;lt;span style=&amp;quot;text-decoration: line-through; color: #7f7f7f;&amp;quot;&amp;gt;Adventure&amp;lt;/span&amp;gt; Cluster&lt;br /&gt;
|-&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|style=&amp;quot;width: 34%; border: 1px solid #dfdfdf;&amp;quot;| [[Two Node Fedora 13 Cluster - Xen-Based Virtual Machine Host on DRBD+CLVM|Xen-Based Virtual Machine Host on DRBD+CLVM]]&amp;lt;br /&amp;gt;High Availability VM Host&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Thanks =&lt;br /&gt;
&lt;br /&gt;
* A &#039;&#039;&#039;huge&#039;&#039;&#039; thanks to [http://iplink.net Interlink Connectivity]! They hire me as a contractor and have allowed me to extend these docs while working on their clusters. Development of these How-Tos would be much slower if not for them. If you need hosting or colo services, drop them a line. Their website is a bit out of date though, so disregard the prices. :)&lt;br /&gt;
* To &#039;&#039;&#039;sdake&#039;&#039;&#039; of [http://corosync.org corosync] for helping me sort out the &#039;&#039;&#039;plock&#039;&#039;&#039; component and corosync in general.&lt;br /&gt;
* To &#039;&#039;&#039;Angus Salkeld&#039;&#039;&#039; for helping me nail down the Corosync and OpenAIS differences.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013922.html HJ Lee] from the OpenAIS list for helping me understand the mechanisms controlling the Redundant Ring Protocol&#039;s failure detection types.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013925.html Steven Dake] for clarifying the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;to_x&amp;lt;/span&amp;gt; vs. &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;logoutput: x&amp;lt;/span&amp;gt; arguments in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais.conf&amp;lt;/span&amp;gt;.&lt;br /&gt;
* To [http://dk.linkedin.com/in/fabbione Fabio Massimo Di Nitto] for helping me get caught up with clustering and VMs.&lt;br /&gt;
&lt;br /&gt;
{{footer}}&lt;/div&gt;</summary>
		<author><name>Facade</name></author>
	</entry>
	<entry>
		<id>https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2100</id>
		<title>Abandoned - Two Node Fedora 13 Cluster</title>
		<link rel="alternate" type="text/html" href="https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2100"/>
		<updated>2010-09-09T06:21:58Z</updated>

		<summary type="html">&lt;p&gt;Facade: /* Hardware */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{howto_header}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Notice&#039;&#039;&#039;&#039;&#039;, &#039;&#039;&#039;Sep. 08, 2010&#039;&#039;&#039; - &#039;&#039;Draft 1&#039;&#039;: This is the first draft of this HowTo. It has not been peer-reviewed yet, so please proceed with caution and on a test cluster. Any and all feedback is much appreciated.&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
&lt;br /&gt;
This paper has one goal:&lt;br /&gt;
&lt;br /&gt;
# How to assemble the simplest cluster possible, a &#039;&#039;&#039;2-Node Cluster&#039;&#039;&#039;, which you can then expand on for your own needs.&lt;br /&gt;
&lt;br /&gt;
With this completed, you can then jump into &amp;quot;step 2&amp;quot; papers that will show various uses of a two node cluster:&lt;br /&gt;
&lt;br /&gt;
# How to create a &amp;quot;floating&amp;quot; virtual machine that can move between the two nodes in the event of a node failure, maximizing up time.&lt;br /&gt;
# How to create simple resources that can move between nodes. Examples will be a simple PostgreSQL database, DHCP, DNS and web servers.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
&lt;br /&gt;
It is expected that you are already comfortable with the Linux command line, specifically &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[bash]]&amp;lt;/span&amp;gt;, and that you are familiar with general administrative tasks in Red Hat-based distributions, specifically [[Fedora]]. You will also need to be comfortable using editors like [[vim]], [[nano]] or similar. This paper uses &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;vim&amp;lt;/span&amp;gt; in examples. Simply substitute your favourite editor in its place.&lt;br /&gt;
&lt;br /&gt;
You are also expected to be comfortable with networking concepts. You will be expected to understand TCP/IP, [[multicast]], broadcast, subnets and netmasks, routing and other relatively basic networking concepts. Please take the time to become familiar with these concepts before proceeding.&lt;br /&gt;
&lt;br /&gt;
That said, where feasible, as much detail as possible will be provided. For example, all configuration file locations will be shown and functioning sample files will be provided.&lt;br /&gt;
&lt;br /&gt;
== Platform ==&lt;br /&gt;
&lt;br /&gt;
This paper will implement the [[Red Hat]] Cluster Suite using the Fedora v13 distribution. This paper uses the [[x86_64]] repositories; however, if you are on an [[i386]] (32-bit) system, you should be able to follow along fine. Simply replace &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;x86_64&amp;lt;/span&amp;gt; with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i386&amp;lt;/span&amp;gt; or &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i686&amp;lt;/span&amp;gt; in package names.&lt;br /&gt;
&lt;br /&gt;
You can either download the stock Fedora 13 DVD ISO (currently at version &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5.4&amp;lt;/span&amp;gt; which is used in this document), or you can try out the alpha [[#AN!Cluster Install|AN!Cluster Install]] DVD. (4.3GB iso). If you use the latter, please test it out on a development or test cluster. If you have any problems with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;AN!Cluster&amp;lt;/span&amp;gt; variant Fedora distro, please [[Digimer|contact me]] and let me know what your trouble was.&lt;br /&gt;
&lt;br /&gt;
== Why Fedora 13? ==&lt;br /&gt;
&lt;br /&gt;
Generally speaking, I much prefer to use a server-oriented Linux distribution like [[CentOS]], [[Debian]] or similar. However, there have been many recent changes in the Linux-clustering world that have made all of the currently available server-class distributions obsolete. With luck, once [[Red Hat]] Enterprise Linux and [[CentOS]] version 6 are released, this will change.&lt;br /&gt;
&lt;br /&gt;
Until then, [[Fedora]] version 13 provides the most up-to-date binary releases of the new clustering stack. For this reason, FC13 is the best choice for clustering if you want to be current. To mitigate some of the issues introduced by using a workstation distribution, many packages will be stripped out of the default install.&lt;br /&gt;
&lt;br /&gt;
== Focus ==&lt;br /&gt;
&lt;br /&gt;
Clusters can serve to solve three problems: &#039;&#039;&#039;Reliability&#039;&#039;&#039;, &#039;&#039;&#039;Performance&#039;&#039;&#039; and &#039;&#039;&#039;Scalability&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
The focus of the cluster described in this paper is primarily &#039;&#039;&#039;reliability&#039;&#039;&#039;. Second to this is &#039;&#039;&#039;scalability&#039;&#039;&#039;, leaving &#039;&#039;&#039;performance&#039;&#039;&#039; to be addressed only when it does not impact the first two criteria. This is not to say that performance is not a valid priority; it simply isn&#039;t the priority of this paper.&lt;br /&gt;
&lt;br /&gt;
== Goal ==&lt;br /&gt;
&lt;br /&gt;
At the end of this paper, you should have a fully functioning two-node cluster capable of hosting &amp;quot;floating&amp;quot; resources; that is, resources that exist on one node and can be easily moved to the other node with minimal effort and down time. This will leave you with a solid foundation for adding more virtual servers, up to the limit of your cluster&#039;s resources.&lt;br /&gt;
&lt;br /&gt;
This paper should also serve to show how to build the foundation of any other cluster configuration. Its core focus is introducing the main issues that come with clustering, so that it can serve as a starting point for cluster configurations outside its scope.&lt;br /&gt;
&lt;br /&gt;
= Begin = &lt;br /&gt;
&lt;br /&gt;
Let&#039;s begin!&lt;br /&gt;
&lt;br /&gt;
== Hardware ==&lt;br /&gt;
&lt;br /&gt;
We will need two physical servers each with the following hardware:&lt;br /&gt;
* One or more multi-core [[CPU]]s with Virtualization support.&lt;br /&gt;
* Three network cards; at least one should be gigabit or faster.&lt;br /&gt;
* One or more hard drives.&lt;br /&gt;
* Some form of [[fence|fence device]]. This can be an [[IPMI]]-enabled server, a [http://nodeassassin.org Node Assassin], a fenceable [http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900&amp;amp;tab=features PDU] or similar.&lt;br /&gt;
&lt;br /&gt;
This paper uses the following hardware:&lt;br /&gt;
* ASUS [http://support.asus.com/search/search.aspx?keyword=m4a78l-m&amp;amp;SLanguage=en-us M4A78L-M]&lt;br /&gt;
* AMD Athlon II x2 250 &lt;br /&gt;
* 2GB Kingston DDR2 KVR800D2N6K2/4G (4GB kit split between the two nodes)&lt;br /&gt;
* 1x Intel 82540 PCI NICs&lt;br /&gt;
* 1x D-Link DGE-560T&lt;br /&gt;
* Node Assassin&lt;br /&gt;
&lt;br /&gt;
This is not an endorsement of the above hardware. I bought what was within my budget that would serve the purposes of creating this document. What you purchase shouldn&#039;t matter, so long as the minimum requirements are met.&lt;br /&gt;
&lt;br /&gt;
== Pre-Assembly ==&lt;br /&gt;
&lt;br /&gt;
With multiple NICs, it is quite likely that the mapping of physical devices to logical &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; devices may not be ideal. This is a particular issue if you decide to [[Setting Up a PXE Server in Fedora|network boot]] your install media. In that case, if the wrong NIC is chosen for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;, you will be presented with a list of MAC addresses to attempt setup with.&lt;br /&gt;
&lt;br /&gt;
Before you assemble your servers, record their network cards&#039; [[MAC]] addresses. I like to keep simple text files like these:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node01.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:EA	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:53	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:46:E4	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node02.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:D8	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:5A	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:45:78	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Feel free to record the information in any way that suits you the best.&lt;br /&gt;
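&lt;br /&gt;
If you want to gather these addresses from a running system, one convenient way (assuming the kernel has already recognized all three interfaces) is to list each device alongside its hardware address:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ip link show | grep -B 1 link/ether&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;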
&lt;br /&gt;
== OS Install ==&lt;br /&gt;
&lt;br /&gt;
Start with a stock Fedora 13 install. This How-To uses Fedora 13 x86_64, however it should be fairly easy to adapt to other recent Fedora versions. This document is also meant to be easily ported to [[CentOS]]/[[RHEL]] version 6 once it is released. It will not adapt well to CentOS/RHEL version 5, though; much of the cluster stack has changed dramatically since that version was initially released.&lt;br /&gt;
&lt;br /&gt;
These are the sample kickstart scripts used by this paper. Be sure to set your own password string and network settings.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Warning&#039;&#039;&#039;&#039;&#039;! These kickstart scripts &#039;&#039;&#039;&#039;&#039;will erase your hard drive&#039;&#039;&#039;&#039;&#039;! Adapt them, don&#039;t blindly use them.&lt;br /&gt;
&lt;br /&gt;
Generic cluster node kickstart scripts.&lt;br /&gt;
* [[Fedora13 KS an-node01.ks|an-node01.ks]]&lt;br /&gt;
* [[Fedora13 KS an-node02.ks|an-node02.ks]]&lt;br /&gt;
&lt;br /&gt;
== AN!Cluster Install ==&lt;br /&gt;
&lt;br /&gt;
If you are feeling brave, below is a link to a custom install DVD that contains the kickstart scripts to setup nodes and an &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-cluster&amp;lt;/span&amp;gt; directory with all the configuration files.&lt;br /&gt;
&lt;br /&gt;
* Download the custom &#039;&#039;&#039;AN!Cluster v0.2.001&#039;&#039;&#039; Install DVD. (4.5[[GiB]] iso). (Currently disabled - Reworking for F13)&lt;br /&gt;
&lt;br /&gt;
== Post OS Install ==&lt;br /&gt;
&lt;br /&gt;
Once the OS is installed, we need to do a few things:&lt;br /&gt;
&lt;br /&gt;
# Disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;&lt;br /&gt;
# Setup networking.&lt;br /&gt;
# Change the default run-level.&lt;br /&gt;
&lt;br /&gt;
=== Disable &#039;selinux&#039; ===&lt;br /&gt;
&lt;br /&gt;
To allow this paper to focus on clustering, we will disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;. Obviously, this introduces security issues that you may not be comfortable with. If you wish to leave &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt; enabled, it will be up to you to sort out issues that crop up.&lt;br /&gt;
&lt;br /&gt;
To disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/selinux/config&amp;lt;/span&amp;gt; and change &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=enforcing&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=permissive&amp;lt;/span&amp;gt;. You will need to reboot in order for the change to take effect at boot; alternatively, running &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;setenforce 0&amp;lt;/span&amp;gt; will switch the running system to permissive mode immediately.&lt;br /&gt;
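&lt;br /&gt;
The edit follows the same pattern as the other configuration changes in this paper:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/selinux/config&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
SELINUX=permissive&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;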
&lt;br /&gt;
=== Change the Default Run-Level ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. It improves performance only; it is not a required step.&lt;br /&gt;
&lt;br /&gt;
If you don&#039;t plan to work on your nodes directly, it makes sense to switch the default run level from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;3&amp;lt;/span&amp;gt;. This prevents the window manager, like Gnome or KDE, from starting at boot, which frees up a fair bit of memory and system resources and reduces the possible attack vectors.&lt;br /&gt;
&lt;br /&gt;
To do this, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/inittab&amp;lt;/span&amp;gt;, change the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:5:initdefault:&amp;lt;/span&amp;gt; line to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:3:initdefault:&amp;lt;/span&amp;gt; and then switch to run level 3:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/inittab&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
id:3:initdefault:&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
init 3&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Setup Networking ===&lt;br /&gt;
&lt;br /&gt;
We need to remove &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt;, enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt;, configure the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ifcfg-eth*&amp;lt;/span&amp;gt; files and then start the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; daemon.&lt;br /&gt;
&lt;br /&gt;
==== Network Layout ====&lt;br /&gt;
&lt;br /&gt;
This setup expects you to have three physical network cards connected to three independent networks. To have a common vernacular, let&#039;s use this table to describe them:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!class=&amp;quot;cell_all&amp;quot;|Network Description&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Short Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Device Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Suggested Subnet&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|NIC Properties&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Internet-Facing Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Remaining NIC should be used here.&amp;lt;br /&amp;gt;If using a [[Setting Up a PXE Server in Fedora|PXE server]], this should be a bootable NIC.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Storage Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth1&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Fastest NIC should be used here.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Back-Channel Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth2&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|NICs with [[IPMI]] piggy-back &#039;&#039;&#039;must&#039;&#039;&#039; be used here.&amp;lt;br /&amp;gt;Second-fastest NIC should be used here.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Take note of these concerns when planning which NIC to use on each subnet. These issues are presented in the order in which they must be addressed:&lt;br /&gt;
# If your nodes have [[IPMI]] piggy-backing on a normal NIC, that NIC &#039;&#039;&#039;&#039;&#039;must&#039;&#039;&#039;&#039;&#039; be used on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. Having your fence device accessible on a subnet that can be remotely accessed can pose a &#039;&#039;major&#039;&#039; security risk.&lt;br /&gt;
# The fastest NIC should be used for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt; subnet. Be sure to know which NICs support the largest jumbo frames when considering this.&lt;br /&gt;
# If you still have two NICs to choose from, use the fastest remaining NIC for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. This will minimize the time it takes to perform tasks like hot-migration of live virtual machines.&lt;br /&gt;
# The final NIC should be used for the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt; subnet.&lt;br /&gt;
&lt;br /&gt;
==== Node IP Addresses ====&lt;br /&gt;
&lt;br /&gt;
Obviously, the IP addresses you give to your nodes should be ones that suit you best. In this example, the following IP addresses are used:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;amp;nbsp;&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Internet-Facing Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Storage Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Back-Channel Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node01&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node02&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==== Remove NetworkManager ====&lt;br /&gt;
&lt;br /&gt;
Some cluster software &#039;&#039;&#039;will not&#039;&#039;&#039; start with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; installed. This is because it is designed to be a highly-adaptive network system that can accommodate frequent changes in the network. To simplify these network transitions for end-users, a lot of network reconfiguration is done behind the scenes.&lt;br /&gt;
&lt;br /&gt;
For workstations, this is wonderful. For clustering, this can be disastrous. Transient network outages are already a risk to a cluster&#039;s stability!&lt;br /&gt;
&lt;br /&gt;
So first up, make sure that &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is completely removed from your system. If you used the [[#OS Install|kickstart]] scripts, then it was not installed. Otherwise, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum remove NetworkManager NetworkManager-gnome NetworkManager-openconnect NetworkManager-openvpn NetworkManager-pptp NetworkManager-vpnc cnetworkmanager knetworkmanager knetworkmanager-openvpn knetworkmanager-pptp knetworkmanager-vpnc libproxy-networkmanager yum-NetworkManager-dispatcher&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup &#039;network&#039; ====&lt;br /&gt;
&lt;br /&gt;
Before proceeding with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; configuration, check to see if your network cards are aligned to the proper &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; network names. If they need to be adjusted, please follow this How-To before proceeding:&lt;br /&gt;
&lt;br /&gt;
* [[Changing the ethX to Ethernet Device Mapping in Fedora]]&lt;br /&gt;
&lt;br /&gt;
There are a few ways to configure &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; in Fedora:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network&amp;lt;/span&amp;gt; (graphical)&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network-tui&amp;lt;/span&amp;gt; (ncurses)&lt;br /&gt;
* Directly editing the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/sysconfig/network-scripts/ifcfg-eth*&amp;lt;/span&amp;gt; files. (See: [http://docs.fedoraproject.org/en-US/Fedora/12/html/Deployment_Guide/s1-networkscripts-interfaces.html here] for a full list of options)&lt;br /&gt;
&lt;br /&gt;
Do not proceed until your node&#039;s networking is fully configured.&lt;br /&gt;
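&lt;br /&gt;
For reference, a minimal static configuration for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt; on &#039;&#039;&#039;an-node01&#039;&#039;&#039; might look like the following. The &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;GATEWAY&amp;lt;/span&amp;gt; value here is only an example; substitute the values for your own network and hardware.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/sysconfig/network-scripts/ifcfg-eth0&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
DEVICE=eth0&lt;br /&gt;
HWADDR=90:E6:BA:71:82:EA&lt;br /&gt;
BOOTPROTO=none&lt;br /&gt;
ONBOOT=yes&lt;br /&gt;
IPADDR=192.168.1.71&lt;br /&gt;
NETMASK=255.255.255.0&lt;br /&gt;
GATEWAY=192.168.1.1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;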
&lt;br /&gt;
==== Update the Hosts File ====&lt;br /&gt;
&lt;br /&gt;
Some applications expect to be able to call nodes by their name. To accommodate this, and to ensure that inter-node communication takes place on the back-channel subnet, we remove any existing hostname entries and then add the following to the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Any pre-existing entries matching the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; must be removed from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt;. There is a good chance there will be an entry that resolves to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;127.0.0.1&amp;lt;/span&amp;gt; which would cause problems later.&lt;br /&gt;
&lt;br /&gt;
Obviously, adapt the names and IPs to match your nodes and subnets. The only critical thing is to make sure that the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; is resolvable to the back-channel subnet. I like to add a short-form name for convenience.&lt;br /&gt;
&lt;br /&gt;
The updated &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file should look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/hosts&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4&lt;br /&gt;
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6&lt;br /&gt;
&lt;br /&gt;
# Back-channel IPs to name mapping.&lt;br /&gt;
10.0.1.71	an-node01 an-node01.alteeve.com&lt;br /&gt;
10.0.1.72	an-node02 an-node02.alteeve.com&lt;br /&gt;
&lt;br /&gt;
# Storage network&lt;br /&gt;
10.0.0.71	an-node01.sn&lt;br /&gt;
10.0.0.72	an-node02.sn&lt;br /&gt;
&lt;br /&gt;
# Internet facing&lt;br /&gt;
192.168.1.71	an-node01.inet&lt;br /&gt;
192.168.1.72	an-node02.inet&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To test this, ping both nodes by name and make sure the ping packets are sent on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt; subnet:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node01&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node01 (10.0.1.71) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=1 ttl=64 time=0.049 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=2 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=3 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=4 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=5 ttl=64 time=0.055 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node02 (10.0.1.72) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=1 ttl=64 time=0.221 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=2 ttl=64 time=0.188 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=3 ttl=64 time=0.217 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=4 ttl=64 time=0.192 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=5 ttl=64 time=0.163 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Disable Firewalls ====&lt;br /&gt;
&lt;br /&gt;
Be sure to flush the netfilter tables and prevent &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ip6tables&amp;lt;/span&amp;gt; from starting on our nodes.&lt;br /&gt;
&lt;br /&gt;
There will be enough potential sources of problems as it is. Disabling firewalls at this stage will minimize the chance of an errant &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; rule messing up our configuration. If, before launch, you wish to implement a firewall, feel free to do so, but be sure to thoroughly test your cluster to ensure no problems were introduced.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig --level 2345 iptables off&lt;br /&gt;
/etc/init.d/iptables stop&lt;br /&gt;
chkconfig --level 2345 ip6tables off&lt;br /&gt;
/etc/init.d/ip6tables stop&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup SSH Shared Keys ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. Setting up shared keys will allow your nodes to pass files between one another and execute commands remotely without needing to enter a password. This is obviously somewhat risky from a security point of view. As such, it is up to you whether you do this or not. This is not meant to be a security-focused How-To, so please independently study the risks.&lt;br /&gt;
&lt;br /&gt;
If you&#039;re a little new to [[SSH]], it can be a bit confusing keeping connections straight in your head. When you connect to a remote machine, you start the connection on your machine as the user you are logged in as. This is the source user. When you call the remote machine, you tell it what user you want to log in as. This is the remote user.&lt;br /&gt;
&lt;br /&gt;
You will need to create an SSH key for each source user, and then you will need to copy the newly generated public key to each remote machine&#039;s user directory that you want to connect to. In this example, we want to connect to either node, from either node, as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. So we will create a key for each node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user and then copy the generated public key to the &#039;&#039;other&#039;&#039; node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user&#039;s directory.&lt;br /&gt;
&lt;br /&gt;
For each user, on each machine you want to connect &#039;&#039;&#039;from&#039;&#039;&#039;, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh-keygen -t dsa -N &amp;quot;&amp;quot; -f ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
Generating public/private dsa key pair.&lt;br /&gt;
Your identification has been saved in /root/.ssh/id_dsa.&lt;br /&gt;
Your public key has been saved in /root/.ssh/id_dsa.pub.&lt;br /&gt;
The key fingerprint is:&lt;br /&gt;
c4:cc:10:0a:6b:60:24:e7:57:94:69:e5:10:b2:cb:1f root@an-node01.alteeve.com&lt;br /&gt;
The key&#039;s randomart image is:&lt;br /&gt;
+--[ DSA 1024]----+&lt;br /&gt;
|+oo ..B*.        |&lt;br /&gt;
|o+ o =+B         |&lt;br /&gt;
|  + +.  *        |&lt;br /&gt;
| . o . .         |&lt;br /&gt;
|    o E S        |&lt;br /&gt;
|     . .         |&lt;br /&gt;
|      .          |&lt;br /&gt;
|                 |&lt;br /&gt;
|                 |&lt;br /&gt;
+-----------------+&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will create two files; the private key, called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa&amp;lt;/span&amp;gt;, and the public key, called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa.pub&amp;lt;/span&amp;gt;. The private key &#039;&#039;&#039;&#039;&#039;must never&#039;&#039;&#039;&#039;&#039; be group or world readable! That is, it should be set to mode &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;0600&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The two files should look like:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Private key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
-----BEGIN DSA PRIVATE KEY-----&lt;br /&gt;
MIIBugIBAAKBgQCVo5nzeC/W8HWaFkD4pAmWO7ovf7S01f8HgocKKOYX/nMNxRui&lt;br /&gt;
wExBUUUt1b2TxYYQklTd9dn8UAzZ3UohN0CJNDrp//t2wfyACKKc2q+ewao5/svQ&lt;br /&gt;
xXTBu2ZPebKPcr1w9p0hq0xSr77Rg3v6Auc6AnC79DM9XkhYkVgNfgrvMwIVAKdY&lt;br /&gt;
SgTycvqNEWsJrCTTn5485yWvAoGANQ3rzIxFy2zHWyMlrpA9r5jllgxfaJVx3/Iw&lt;br /&gt;
DrEF82lr2dOmG8oYiKAhfvdgNyV7VeM2yxdjcdLPJAHHCx5BZZ7V9KjMzY26wS2r&lt;br /&gt;
d58TZJoWmO0nscmcwIbCv2at3fiMpxgr+nzD4tKxFOdWxYBwdmUpIjtJoW8wzY2H&lt;br /&gt;
uNghjL0CgYA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDN&lt;br /&gt;
bbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EE&lt;br /&gt;
D3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5AIUFX97yIig&lt;br /&gt;
n2SUltN+10NWUy4oYsA=&lt;br /&gt;
-----END DSA PRIVATE KEY-----&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Public key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa.pub&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Copy the public key and then &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; normally into the remote machine as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create a file called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; and paste in the key.&lt;br /&gt;
&lt;br /&gt;
From &#039;&#039;&#039;an-node01&#039;&#039;&#039;, type:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh root@an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
The authenticity of host &#039;an-node02 (10.0.1.72)&#039; can&#039;t be established.&lt;br /&gt;
RSA key fingerprint is d4:1b:68:5f:fa:ef:0f:0c:16:e7:f9:0a:8d:69:3e:c5.&lt;br /&gt;
Are you sure you want to continue connecting (yes/no)? yes&lt;br /&gt;
Warning: Permanently added &#039;an-node02,10.0.1.72&#039; (RSA) to the list of known hosts.&lt;br /&gt;
root@an-node02&#039;s password: &lt;br /&gt;
Last login: Tue Jul  6 22:28:19 2010 from 192.168.1.102&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will now be logged into &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02&amp;lt;/span&amp;gt; as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; file and paste into it the public key from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01&amp;lt;/span&amp;gt;. If the remote machine&#039;s user hasn&#039;t used &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; yet, their &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh&amp;lt;/span&amp;gt; directory will not exist.&lt;br /&gt;
&lt;br /&gt;
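If the directory is missing, it can be created by hand. A minimal sketch; note that &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;sshd&amp;lt;/span&amp;gt; will ignore the key file if it, or the directory, is group or world writable, so set the permissions explicitly:&lt;br /&gt;
&lt;br /&gt;
```shell
# Create ~/.ssh and authorized_keys with the permissions sshd expects.
# sshd refuses keys whose file or directory is group/world writable.
mkdir -p ~/.ssh
chmod 700 ~/.ssh
touch ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```
&lt;br /&gt;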
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/authorized_keys&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now log out and then log back into the remote machine. This time, the connection should succeed without prompting for a password!&lt;br /&gt;
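&lt;br /&gt;
To confirm the key is actually being used, you can tell &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; to fail rather than fall back to a password prompt. A quick check; &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BatchMode&amp;lt;/span&amp;gt; disables all interactive prompting:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# This prints the remote host name if, and only if, key-based login works.&lt;br /&gt;
ssh -o BatchMode=yes root@an-node02 hostname&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;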
&lt;br /&gt;
= Initial Cluster Setup =&lt;br /&gt;
&lt;br /&gt;
Before we get into specifics, let&#039;s take a minute to talk about the major components used in our cluster.&lt;br /&gt;
&lt;br /&gt;
== Core Program Overviews ==&lt;br /&gt;
&lt;br /&gt;
These are the core programs, possibly new to you, that we will use to build our cluster. Before we configure them, let&#039;s take a minute to understand their roles.&lt;br /&gt;
&lt;br /&gt;
=== Cluster Manager ===&lt;br /&gt;
&lt;br /&gt;
The cluster manager, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, reads the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; configuration file and uses it to configure and start the various cluster elements. This is where nodes are defined, fence devices are configured and various cluster tolerances are set.&lt;br /&gt;
&lt;br /&gt;
=== Corosync ===&lt;br /&gt;
&lt;br /&gt;
[http://www.corosync.org Corosync] is, essentially, the clustering kernel which provides the essential services for cluster-aware applications and other cluster services. At its core is the [[totem]] protocol which provides cluster membership and ordered messaging. On top of this, it implements a number of core clustering services like [[#cpg; Closed Process Group|closed process group messaging]], [[quorum]] and [[confdb]] (an object database).&lt;br /&gt;
&lt;br /&gt;
Its goal is to provide a substantially simpler and more flexible set of APIs to facilitate clustering in Linux. It manages which nodes are in the cluster, triggers error messages when something fails, manages cluster locking and so on. Most other clustered applications rely on corosync to know when something has happened or to announce when a cluster-related action has taken place.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;corosync_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
==== cpg; Closed Process Group ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install corosynclib-devel&lt;br /&gt;
man cpg_overview&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The closed process group is a small process layer on top of the totem protocol provided by corosync. It handles the sending and delivery of messages amongst nodes in a consistent order. It adds [[PID]]s and group names to the membership layer. Only members of the group get the messages, thus it is a &amp;quot;closed&amp;quot; group.&lt;br /&gt;
&lt;br /&gt;
=== OpenAIS ===&lt;br /&gt;
&lt;br /&gt;
OpenAIS is now an extension to Corosync that adds an open-source implementation of the [http://www.saforum.org/ Service Availability (SA) Forum&#039;s] &#039;Application Interface Specification&#039; ([[AIS]]). It is an [[API]] and policy designed to be used by applications concerned with maintaining services during faults. AIS implements the &#039;Availability Management Framework&#039; ([[AMF]]) which, in turn, provides for application fail over, cluster management ([[CLM]]), Checkpointing ([[CKPT]]), Eventing ([[EVT]]), Messaging ([[MSG]]), and Distributed Locking ([[DLOCK]]).&lt;br /&gt;
&lt;br /&gt;
It does this by implementing pluggable shared libraries that get loaded into corosync at runtime. These libraries then can make use of corosync&#039;s internal API. Both corosync and openais services provide [[IPC]] interfaces so that application programmers can make use of their behaviour.&lt;br /&gt;
&lt;br /&gt;
In short: applications can use OpenAIS to be cluster-aware. Its libraries are used by some applications, including Pacemaker. In our application, we will only be using its libraries indirectly.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
[http://www.clusterlabs.org Pacemaker] is the cluster resource manager. It can be used to trigger scripts to control non cluster-aware applications. In this way, these non cluster-aware applications can be made more highly available. It is considered a replacement for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[rgmanager]]&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Setup Core Programs ==&lt;br /&gt;
&lt;br /&gt;
We now need to edit and create the configuration files for our cluster.&lt;br /&gt;
&lt;br /&gt;
=== Setup cman ===&lt;br /&gt;
&lt;br /&gt;
Depending on your needs, this may be the only section you need to configure.&lt;br /&gt;
&lt;br /&gt;
First, install the cluster manager:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install cman&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once installed, we need to create and fill in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; file. This is the core configuration file of our cluster and is an [[XML]] formatted file.&lt;br /&gt;
&lt;br /&gt;
Please see: [[Two-Node Fedora 13 cluster.conf]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The example in that article details many of the common options you will want to use. For a more complete discussion of this file, please run &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man 5 cluster.conf&amp;lt;/span&amp;gt;. Also, please be aware that some arguments can overlap between [[#Setup Corosync|corosync]] and this file.&lt;br /&gt;
&lt;br /&gt;
Once it&#039;s set up, we need to verify it. When you&#039;ve finished editing, be sure to run this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
xmllint --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are errors, fix them. Once it is formatted properly, the contents of your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file will be returned, followed by &amp;quot;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf validates&amp;lt;/span&amp;gt;&amp;quot;. Once this is the case, you can move on.&lt;br /&gt;
&lt;br /&gt;
A note: normally at this stage, you&#039;d use &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;chkconfig&amp;lt;/span&amp;gt; to enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; to start at boot. However, depending on how you use your cluster, you may need to modify the boot order. For now, let&#039;s leave it off until we&#039;ve got all the components enabled. This is particularly important if you will be using [[DRBD]], [[CLVM]] or [[Xen]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig cman off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To see if it&#039;s working though, have a terminal open to both nodes and manually start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;. I suggest having two terminal windows open on each node. In one, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;tail&amp;lt;/span&amp;gt;ing &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/var/log/messages&amp;lt;/span&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
clear; tail -f -n 0 /var/log/messages&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now, with both log files being watched, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; on both nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: You MUST execute the following on both nodes within the time limit set by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;post_join_delay&amp;lt;/span&amp;gt; (the default is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;6&amp;lt;/span&amp;gt; seconds). If you wait longer than this, the first node started will [[fence]] the other node, assuming you have fencing configured properly. If fencing is &#039;&#039;&#039;not&#039;&#039;&#039; configured properly, your first node will hang.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
/etc/init.d/cman start&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
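&lt;br /&gt;
Should the 6 second window prove too tight when starting both nodes by hand, the delay can be raised in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt;. A minimal sketch, assuming the attribute is set on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_daemon&amp;lt;/span&amp;gt; element; the value &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;60&amp;lt;/span&amp;gt; is only an example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;xml&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;!-- Give yourself a 60 second window before the first node fences the other. --&amp;gt;&lt;br /&gt;
&amp;lt;fence_daemon post_join_delay=&amp;quot;60&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;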
&lt;br /&gt;
Examine the log files to verify there were no errors. If there were, fix them. If not, fantastic!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Warning&#039;&#039;&#039;: This last step will forcefully reboot your nodes. Do not do this on a production cluster! If it must be done on a production server, stop all non-essential services and be sure you have a good backup, &#039;&#039;just in case&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Lastly, test that your fencing is configured and working properly. We do this by using a program called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_node&amp;lt;/span&amp;gt;. With both nodes up and running &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, call a fence against the other node. Be sure to be watching the log files.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;From&#039;&#039;&#039; &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
fence_node an-node02.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
fence an-node02.alteeve.com success&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If it worked, when the node comes back up, repeat the process from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; against &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Setup Corosync ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: In our cluster, we will not need to configure Corosync.&lt;br /&gt;
&lt;br /&gt;
When using &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, as we are here, Corosync&#039;s configuration file is ignored.&lt;br /&gt;
&lt;br /&gt;
Most of the settings available in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 corosync.conf|corosync.conf]]&amp;lt;/span&amp;gt; can be found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|cluster.conf]]&amp;lt;/span&amp;gt; file. Please see the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; article for a comprehensive list of what is and is not supported.&lt;br /&gt;
&lt;br /&gt;
If you want to configure corosync and not use cman, please see the following article:&lt;br /&gt;
&lt;br /&gt;
* [[Two-Node Fedora 13 corosync.conf]]&lt;br /&gt;
&lt;br /&gt;
=== Setup Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
To do.&lt;br /&gt;
&lt;br /&gt;
= Fencing =&lt;br /&gt;
&lt;br /&gt;
Before proceeding with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file, you must understand what fencing is, how it is used in Red Hat/CentOS clusters and why it is so important.&lt;br /&gt;
&lt;br /&gt;
* The Cluster Admin&#039;s Mantra:&lt;br /&gt;
** &#039;&#039;&#039;The only thing you don&#039;t know is what you don&#039;t know&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Just because one node loses communication with another node, it &#039;&#039;&#039;cannot&#039;&#039;&#039; be assumed that the silent node is dead!&lt;br /&gt;
&lt;br /&gt;
== What is it? ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Fencing&amp;quot; is the act of isolating a malfunctioning node. The goal is to prevent a &#039;&#039;&#039;split-brain&#039;&#039;&#039; condition where two nodes think the other member is dead and continue to use a shared resource. When this happens, file system corruption is almost guaranteed. Another dangerous scenario would be if one node paused while writing to a disk, the other node decides it&#039;s dead and starts to replay the journal, then the first node recovers and completes the write. The results would be equally disastrous. If you are lucky enough to not lose the shared file system, you will be faced with the task of determining what data got written to which node, merging that data and/or overwriting the node you trust the least. This &#039;best case&#039; is still pretty lousy.&lt;br /&gt;
&lt;br /&gt;
Fencing, isolating a node from altering shared disks, can be accomplished in a couple of ways:&lt;br /&gt;
&lt;br /&gt;
* Power&lt;br /&gt;
** Power fencing is where a device is used to cut the power to a malfunctioning node. This is probably the most common type.&lt;br /&gt;
* Blocking&lt;br /&gt;
** Blocking is often implemented at the network level. This type of fencing leaves the node alone, but disconnects it from the storage network. Often this is done by a switch which prevents traffic coming from the fenced node.&lt;br /&gt;
&lt;br /&gt;
With power fencing, the term used is &amp;quot;STONITH&amp;quot;, literally, &#039;&#039;&#039;S&#039;&#039;&#039;hoot &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;O&#039;&#039;&#039;ther &#039;&#039;&#039;N&#039;&#039;&#039;ode &#039;&#039;&#039;I&#039;&#039;&#039;n &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;H&#039;&#039;&#039;ead. Picture it like an old west duel. If one node is dead, the other node is going to win the duel by default and the dead node will just be shot again. When both nodes are alive, however, the faster node will win and will &amp;quot;kill&amp;quot; (power off or reset) the slower node before it has a chance to fire. Once this duel is over, the surviving node can then access the shared resource confident that it is the only one working on it.&lt;br /&gt;
&lt;br /&gt;
== Misconception ==&lt;br /&gt;
&lt;br /&gt;
It is a &#039;&#039;&#039;very&#039;&#039;&#039; common mistake to ignore fencing when first starting to learn about clustering. Often people think &#039;&#039;&amp;quot;It&#039;s just for production systems; I don&#039;t need to worry about it yet because I don&#039;t care what happens to my test cluster.&amp;quot;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Wrong!&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For the most practical reason: the cluster software will block all I/O transactions when it can&#039;t guarantee a fence operation succeeded. The result is that your cluster will essentially &amp;quot;lock up&amp;quot;. Likewise, [[cman]] and related daemons will fail if they can&#039;t find a fence agent to use.&lt;br /&gt;
&lt;br /&gt;
Secondly, testing our cluster will involve inducing errors. Without proper fencing, there is a high probability that our shared file system will be corrupted. That would force the need to start over, making your learning take a lot longer than it needs to.&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
How you implement fencing depends largely on what kind of fence device(s) you have. This is where I need help from you, dear reader. I only have access to [[Node Assassin]] fence devices on my lab&#039;s cluster. If you have other fence devices, like [[IPMI]], [[DRAC]] or so on, please let me know so that I can expand this section.&lt;br /&gt;
&lt;br /&gt;
=== Implementing Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
In Red Hat&#039;s cluster software, the fence device(s) are configured in the main &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf&amp;lt;/span&amp;gt; cluster configuration file. This configuration is then acted on via the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon. We&#039;ll cover the details of the [[cluster.conf]] file in a moment.&lt;br /&gt;
&lt;br /&gt;
When the cluster determines that a node needs to be fenced, the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon will consult the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file for information on how to access the fence device. Given this &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; snippet:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;xml&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;cluster name=&amp;quot;an-cluster&amp;quot; config_version=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
	&amp;lt;clusternodes&amp;gt;&lt;br /&gt;
		&amp;lt;clusternode name=&amp;quot;an-node02.alteeve.com&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
			&amp;lt;fence&amp;gt;&lt;br /&gt;
				&amp;lt;method name=&amp;quot;node_assassin&amp;quot;&amp;gt;&lt;br /&gt;
					&amp;lt;device name=&amp;quot;motoko&amp;quot; port=&amp;quot;02&amp;quot; action=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
				&amp;lt;/method&amp;gt;&lt;br /&gt;
			&amp;lt;/fence&amp;gt;&lt;br /&gt;
		&amp;lt;/clusternode&amp;gt;&lt;br /&gt;
	&amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
	&amp;lt;fencedevices&amp;gt;&lt;br /&gt;
		&amp;lt;fencedevice name=&amp;quot;motoko&amp;quot; agent=&amp;quot;fence_na&amp;quot; quiet=&amp;quot;true&amp;quot;&lt;br /&gt;
		ipaddr=&amp;quot;motoko.alteeve.com&amp;quot; login=&amp;quot;motoko&amp;quot; passwd=&amp;quot;secret&amp;quot;&amp;gt;&lt;br /&gt;
		&amp;lt;/fencedevice&amp;gt;&lt;br /&gt;
	&amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If the cluster manager determines that the node &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; needs to be fenced, it looks at the first (and only, in this case) &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;method&amp;gt;&amp;lt;/span&amp;gt; entry and reads its device&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;, which is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; here. It then looks in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fencedevices&amp;gt;&amp;lt;/span&amp;gt; section for the device with the matching &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;. From there, it gets the information needed to find and access the fence device. Once it connects to the fence device, it then passes along the options set in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fence&amp;gt;&amp;lt;/span&amp;gt; entry.&lt;br /&gt;
&lt;br /&gt;
So in this example, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; looks up the details on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; Node Assassin fence device. It calls the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt; program, called a fence agent, and passes the following arguments:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr=motoko.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login=motoko&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd=secret&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;quiet=true&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port=2&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action=off&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
How the fence agent acts on these arguments varies depending on the fence device itself. In general terms, the &#039;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt;&#039; fence agent will create a connection to the device at the IP address (or resolvable name, as in this case) specified in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr&amp;lt;/span&amp;gt; argument. Once connected, it will authenticate using the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd&amp;lt;/span&amp;gt; arguments. Once authenticated, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port&amp;lt;/span&amp;gt; to act on, which could be a power jack, a power or reset button, a network switch port and so on. Finally, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action&amp;lt;/span&amp;gt; to take.&lt;br /&gt;
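&lt;br /&gt;
Fence agents of this generation read those arguments as &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;key=value&amp;lt;/span&amp;gt; pairs, one per line, on standard input. The loop below only mimics that parsing for illustration; it is not the real &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
```shell
# fenced hands a fence agent its arguments as key=value pairs,
# one per line, on standard input. This loop mimics that parsing.
printf 'ipaddr=motoko.alteeve.com\nport=2\naction=off\n' |
while IFS='=' read -r key value; do
    echo "agent received: $key = $value"
done
```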
&lt;br /&gt;
Once the device completes, it returns a success or failed message. If the first attempt fails, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; will try the next &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;method&amp;gt;&amp;lt;/span&amp;gt;, if a second exists. It will keep trying fence devices in the order they are found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file until it runs out of devices. If it fails to fence the node, most daemons will &amp;quot;block&amp;quot;, that is, lock up and stop responding until the issue is resolved. The logic for this is that a locked-up cluster is better than a corrupted one.&lt;br /&gt;
&lt;br /&gt;
If any of the fence devices succeed though, the cluster will know that it is safe to proceed and will reconfigure the cluster without the defective node.&lt;br /&gt;
&lt;br /&gt;
== Fence Devices ==&lt;br /&gt;
&lt;br /&gt;
Many major [[OEM]]s have their own remote management devices that can serve as fence devices. Examples are [http://dell.ca Dell]&#039;s &#039;DRAC&#039; (Dell Remote Access Controller), [http://hp.ca HP]&#039;s &#039;iLO&#039; (Integrated Lights-Out), [http://ibm.ca IBM]&#039;s &#039;RSA&#039; (Remote Supervisor Adapter), [http://sun.ca Sun]&#039;s &#039;SSP&#039; (System Service Processor) and so on. Smaller manufacturers implement remote management via [[IPMI]], the Intelligent Platform Management Interface.&lt;br /&gt;
&lt;br /&gt;
With the above devices, fencing is implemented via a built-in or integrated device inside the server. These devices are usually accessible even when the host server is powered off or hard locked. Via these devices, the host server can be powered off, reset and powered on remotely, regardless of the state of the host server.&lt;br /&gt;
&lt;br /&gt;
Block fencing is possible when the device connecting a node to shared resources, like a fiber-channel SAN switch, provides a method of logically &amp;quot;unplugging&amp;quot; a defective node from the shared resource, leaving the node itself alone.&lt;br /&gt;
&lt;br /&gt;
=== IPMI ===&lt;br /&gt;
&lt;br /&gt;
[[IPMI]] is perhaps the most common method used for fencing nodes in a cluster. It requires having a system board with an IPMI baseboard management controller (BMC). IPMI is often the underlying technology used by many OEM-branded remote access tools. If you don&#039;t see a specific fence agent for your server&#039;s remote access application, experiment with generic IPMI tools.&lt;br /&gt;
&lt;br /&gt;
To start, we need to install the IPMI user software:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install ipmitool&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
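&lt;br /&gt;
Once installed, a quick way to confirm that a node&#039;s BMC is reachable over the network is to query its power state. This is only a sketch; the address, user and password below are placeholders for your own BMC&#039;s settings:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Query the chassis power state (the host name and credentials are examples).&lt;br /&gt;
ipmitool -I lanplus -H an-node02-ipmi.alteeve.com -U admin -P secret chassis power status&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;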
&lt;br /&gt;
=== Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
A cheap alternative is the [[Node Assassin]], an open-hardware, open source fence device. It was built to allow the use of commodity system boards that lacked remote management support found on more expensive, server class hardware.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Full Disclosure&#039;&#039;&#039;: Node Assassin was created by me, with much help from others, for this paper.&lt;br /&gt;
&lt;br /&gt;
= Clean Up =&lt;br /&gt;
&lt;br /&gt;
Some daemons may have been installed or dragged in during the setup of the cluster. Most notable is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;heartbeat&amp;lt;/span&amp;gt;, which is not compatible with [[RHCS]]. The following command will disable various daemons from starting at boot.&lt;br /&gt;
&lt;br /&gt;
The command is safe to run even when some of these daemons aren&#039;t installed or have already been removed. Of course, skip the ones you actually want to use. This assumes that the nodes have been built according to this HowTo.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig heartbeat off; chkconfig iscsid off; chkconfig iptables off; chkconfig ip6tables off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= So You Have a Cluster - Now What? =&lt;br /&gt;
&lt;br /&gt;
Building the cluster infrastructure was pretty easy, wasn&#039;t it?&lt;br /&gt;
&lt;br /&gt;
But what will you &#039;&#039;&#039;&#039;&#039;do&#039;&#039;&#039;&#039;&#039; with it?&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;width: 75%; text-align: center; padding: 10px; border-spacing: 15px;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!style=&amp;quot;width: 100%; border: 1px solid #dfdfdf;&amp;quot; colspan=&amp;quot;3&amp;quot;|Choose Your Own &amp;lt;span style=&amp;quot;text-decoration: line-through; color: #7f7f7f;&amp;quot;&amp;gt;Adventure&amp;lt;/span&amp;gt; Cluster&lt;br /&gt;
|-&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|style=&amp;quot;width: 34%; border: 1px solid #dfdfdf;&amp;quot;| [[Two Node Fedora 13 Cluster - Xen-Based Virtual Machine Host on DRBD+CLVM|Xen-Based Virtual Machine Host on DRBD+CLVM]]&amp;lt;br /&amp;gt;High Availability VM Host&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Thanks =&lt;br /&gt;
&lt;br /&gt;
* A &#039;&#039;&#039;huge&#039;&#039;&#039; thanks to [http://iplink.net Interlink Connectivity]! They hire me as a contractor and have allowed me to extend these docs while working on their clusters. Development of these How-Tos would be much slower if not for them. If you need hosting or colo services, drop them a line. Their website is a bit out of date though, so disregard the prices. :)&lt;br /&gt;
* To &#039;&#039;&#039;sdake&#039;&#039;&#039; of [http://corosync.org corosync] for helping me sort out the &#039;&#039;&#039;plock&#039;&#039;&#039; component and corosync in general.&lt;br /&gt;
* To &#039;&#039;&#039;Angus Salkeld&#039;&#039;&#039; for helping me nail down the Corosync and OpenAIS differences.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013922.html HJ Lee] from the OpenAIS list for helping me understand the mechanisms controlling the Redundant Ring Protocol&#039;s failure detection types.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013925.html Steven Dake] for clarifying the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;to_x&amp;lt;/span&amp;gt; vs. &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;logoutput: x&amp;lt;/span&amp;gt; arguments in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais.conf&amp;lt;/span&amp;gt;.&lt;br /&gt;
* To [http://dk.linkedin.com/in/fabbione Fabio Massimo Di Nitto] for helping me get caught up with clustering and VMs.&lt;br /&gt;
&lt;br /&gt;
{{footer}}&lt;/div&gt;</summary>
		<author><name>Facade</name></author>
	</entry>
	<entry>
		<id>https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2099</id>
		<title>Abandoned - Two Node Fedora 13 Cluster</title>
		<link rel="alternate" type="text/html" href="https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2099"/>
		<updated>2010-09-09T06:20:44Z</updated>

		<summary type="html">&lt;p&gt;Facade: /* Goal */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{howto_header}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Notice&#039;&#039;&#039;&#039;&#039;, &#039;&#039;&#039;Sep. 08, 2010&#039;&#039;&#039; - &#039;&#039;Draft 1&#039;&#039;: This is the first draft of this HowTo. It has not been peer-reviewed yet, so please proceed with caution and on a test cluster. Any and all feedback is much appreciated.&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
&lt;br /&gt;
This paper has one goal:&lt;br /&gt;
&lt;br /&gt;
# How to assemble the simplest cluster possible, a &#039;&#039;&#039;2-Node Cluster&#039;&#039;&#039;, which you can then expand on for your own needs.&lt;br /&gt;
&lt;br /&gt;
With this completed, you can then jump into &amp;quot;step 2&amp;quot; papers that will show various uses of a two node cluster:&lt;br /&gt;
&lt;br /&gt;
# How to create a &amp;quot;floating&amp;quot; virtual machine that can move between the two nodes in the event of a node failure, maximizing up time.&lt;br /&gt;
# How to create simple resources that can move between nodes. Examples will be a simple PostgreSQL database, DHCP, DNS and web servers.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
&lt;br /&gt;
It is expected that you are already comfortable with the Linux command line, specifically &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[bash]]&amp;lt;/span&amp;gt;, and that you are familiar with general administrative tasks in Red Hat based distributions, specifically [[Fedora]]. You will also need to be comfortable using editors like [[vim]], [[nano]] or similar. This paper uses &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;vim&amp;lt;/span&amp;gt; in examples. Simply substitute your favourite editor in its place.&lt;br /&gt;
&lt;br /&gt;
You are also expected to be comfortable with networking concepts. You will be expected to understand TCP/IP, [[multicast]], broadcast, subnets and netmasks, routing and other relatively basic networking concepts. Please take the time to become familiar with these concepts before proceeding.&lt;br /&gt;
&lt;br /&gt;
This said, where feasible, as much detail as is possible will be provided. For example, all configuration file locations will be shown and functioning sample files will be provided.&lt;br /&gt;
&lt;br /&gt;
== Platform ==&lt;br /&gt;
&lt;br /&gt;
This paper will implement the [[Red Hat]] Cluster Suite using the Fedora v13 distribution. This paper uses the [[x86_64]] repositories, however, if you are on an [[i386]] (32 bit) system, you should be able to follow along fine. Simply replace &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;x86_64&amp;lt;/span&amp;gt; with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i386&amp;lt;/span&amp;gt; or &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i686&amp;lt;/span&amp;gt; in package names. &lt;br /&gt;
&lt;br /&gt;
You can either download the stock Fedora 13 DVD ISO (currently at version &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5.4&amp;lt;/span&amp;gt; which is used in this document), or you can try out the alpha [[#AN!Cluster Install|AN!Cluster Install]] DVD (4.3GB iso). If you use the latter, please test it out on a development or test cluster. If you have any problems with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;AN!Cluster&amp;lt;/span&amp;gt; variant Fedora distro, please [[Digimer|contact me]] and let me know what your trouble was.&lt;br /&gt;
&lt;br /&gt;
== Why Fedora 13? ==&lt;br /&gt;
&lt;br /&gt;
Generally speaking, I much prefer to use a server-oriented Linux distribution like [[CentOS]], [[Debian]] or similar. However, there have been many recent changes in the Linux-Clustering world that have made all of the currently available server-class distributions obsolete. With luck, once [[Red Hat]] Enterprise Linux and [[CentOS]] version 6 are released, this will change.&lt;br /&gt;
&lt;br /&gt;
Until then, [[Fedora]] version 13 provides the most up-to-date binary releases of the new implementation of the clustering stack. For this reason, Fedora 13 is the best choice for clustering if you want to be current. To mitigate some of the issues introduced by using a workstation distribution, many packages will be stripped out of the default install.&lt;br /&gt;
&lt;br /&gt;
== Focus ==&lt;br /&gt;
&lt;br /&gt;
Clusters can serve to solve three problems: &#039;&#039;&#039;Reliability&#039;&#039;&#039;, &#039;&#039;&#039;Performance&#039;&#039;&#039; and &#039;&#039;&#039;Scalability&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
The focus of the cluster described in this paper is primarily &#039;&#039;&#039;reliability&#039;&#039;&#039;. Second to this, &#039;&#039;&#039;scalability&#039;&#039;&#039; is the priority, leaving &#039;&#039;&#039;performance&#039;&#039;&#039; to be addressed only when it does not impact the first two criteria. This is not to say that performance is not a valid priority; it simply isn&#039;t the priority of this paper.&lt;br /&gt;
&lt;br /&gt;
== Goal ==&lt;br /&gt;
&lt;br /&gt;
At the end of this paper, you should have a fully functioning two-node cluster capable of hosting &amp;quot;floating&amp;quot; resources. That is, resources that exist on one node and can be easily moved to the other node with minimal effort and down time. This will leave you with a solid foundation for adding more virtual servers, up to the limit of your cluster&#039;s resources.&lt;br /&gt;
&lt;br /&gt;
This paper should also show how to build the foundation of any other cluster configuration. Its core focus is introducing the main issues that come with clustering, so that it can serve as a starting point for cluster configurations outside its scope.&lt;br /&gt;
&lt;br /&gt;
= Begin = &lt;br /&gt;
&lt;br /&gt;
Let&#039;s begin!&lt;br /&gt;
&lt;br /&gt;
== Hardware ==&lt;br /&gt;
&lt;br /&gt;
We will need two physical servers each with the following hardware:&lt;br /&gt;
* One or more multi-core [[CPU]]s with Virtualization support.&lt;br /&gt;
* Three network cards; at least one should be gigabit or faster.&lt;br /&gt;
* One or more hard drives.&lt;br /&gt;
* You will need some form of a [[fence|fence device]]. This can be an [[IPMI]]-enabled server, a [http://nodeassassin.org Node Assassin], a fenceable [http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900&amp;amp;tab=features PDU] or similar.&lt;br /&gt;
&lt;br /&gt;
This paper uses the following hardware:&lt;br /&gt;
* ASUS [http://support.asus.com/search/search.aspx?keyword=m4a78l-m&amp;amp;SLanguage=en-us M4A78L-M]&lt;br /&gt;
* AMD Athlon II x2 250 &lt;br /&gt;
* 2GB Kingston DDR2 KVR800D2N6K2/4G (4GB kit split between the two nodes)&lt;br /&gt;
* 1x Intel 82540 PCI NICs&lt;br /&gt;
* 1x D-Link DGE-560T&lt;br /&gt;
* Node Assassin&lt;br /&gt;
&lt;br /&gt;
This is not an endorsement of the above hardware. I bought what was within my budget that would serve the purposes of creating this document. What you purchase shouldn&#039;t matter, so long as the minimum requirements are met.&lt;br /&gt;
&lt;br /&gt;
== Pre-Assembly ==&lt;br /&gt;
&lt;br /&gt;
With multiple NICs, it is quite likely that the mapping of physical devices to logical &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; devices may not be ideal. This is a particular issue if you decide to [[Setting Up a PXE Server in Fedora|network boot]] your install media. In that case, if the wrong NIC is chosen for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;, you will be presented with a list of MAC addresses to attempt setup with.&lt;br /&gt;
&lt;br /&gt;
Before you assemble your servers, record their network cards&#039; [[MAC]] addresses. I like to keep simple text files like these:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node01.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:EA	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:53	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:46:E4	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node02.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:D8	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:5A	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:45:78	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Feel free to record the information in any way that suits you the best.&lt;br /&gt;
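&lt;br /&gt;
If you would rather generate these files than type them by hand, a minimal sketch like the following works on a Linux node. It assumes the kernel exposes interfaces under &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/sys/class/net&amp;lt;/span&amp;gt;; the vendor comments shown above would still need to be added manually:&lt;br /&gt;

```shell
# List each interface's MAC address and name, one per line,
# in roughly the same layout as the .mac files above.
for nic in /sys/class/net/*; do
    # Skip entries that do not expose a hardware address.
    [ -e "$nic/address" ] || continue
    printf '%s\t%s\n' "$(cat "$nic/address")" "$(basename "$nic")"
done
```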
&lt;br /&gt;
== OS Install ==&lt;br /&gt;
&lt;br /&gt;
Start with a stock Fedora 13 install. This How-To uses Fedora 13 x86_64, however it should be fairly easy to adapt to other recent Fedora versions. This document also aims to be easily ported to [[CentOS]]/[[RHEL]] version 6 once it is released. It will not adapt well to CentOS/RHEL version 5, though; much of the cluster stack has changed dramatically since version 5 was released.&lt;br /&gt;
&lt;br /&gt;
These are the sample kickstart scripts used by this paper. Be sure to set your own password string and network settings.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Warning&#039;&#039;&#039;&#039;&#039;! These kickstart scripts &#039;&#039;&#039;&#039;&#039;will erase your hard drive&#039;&#039;&#039;&#039;&#039;! Adapt them, don&#039;t blindly use them.&lt;br /&gt;
&lt;br /&gt;
Generic cluster node kickstart scripts.&lt;br /&gt;
* [[Fedora13 KS an-node01.ks|an-node01.ks]]&lt;br /&gt;
* [[Fedora13 KS an-node02.ks|an-node02.ks]]&lt;br /&gt;
&lt;br /&gt;
== AN!Cluster Install ==&lt;br /&gt;
&lt;br /&gt;
If you are feeling brave, below is a link to a custom install DVD that contains the kickstart scripts to setup nodes and an &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-cluster&amp;lt;/span&amp;gt; directory with all the configuration files.&lt;br /&gt;
&lt;br /&gt;
* Download the custom &#039;&#039;&#039;AN!Cluster v0.2.001&#039;&#039;&#039; Install DVD. (4.5[[GiB]] iso). (Currently disabled - Reworking for F13)&lt;br /&gt;
&lt;br /&gt;
== Post OS Install ==&lt;br /&gt;
&lt;br /&gt;
Once the OS is installed, we need to do a few things:&lt;br /&gt;
&lt;br /&gt;
# Disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;.&lt;br /&gt;
# Change the default run-level.&lt;br /&gt;
# Setup networking.&lt;br /&gt;
&lt;br /&gt;
=== Disable &#039;selinux&#039; ===&lt;br /&gt;
&lt;br /&gt;
To allow this paper to focus on clustering, we will disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;. Obviously, this introduces security issues that you may not be comfortable with. If you wish to leave &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt; enabled, it will be up to you to sort out issues that crop up.&lt;br /&gt;
&lt;br /&gt;
To disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/selinux/config&amp;lt;/span&amp;gt; and change &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=enforcing&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=permissive&amp;lt;/span&amp;gt;. You will need to reboot in order for the changes to take effect.&lt;br /&gt;
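&lt;br /&gt;
The edit can also be made non-interactively. This is only a sketch; it assumes the stock &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=enforcing&amp;lt;/span&amp;gt; line is present in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/selinux/config&amp;lt;/span&amp;gt;, and it keeps a backup of the original file:&lt;br /&gt;

```shell
# Switch SELinux from enforcing to permissive in the config file.
# A backup of the original is kept as /etc/selinux/config.bak.
sed -i.bak 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config
```

As above, the change takes effect on the next reboot.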
&lt;br /&gt;
=== Change the Default Run-Level ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. It improves performance only; it is not a required step.&lt;br /&gt;
&lt;br /&gt;
If you don&#039;t plan to work on your nodes directly, it makes sense to switch the default run level from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;3&amp;lt;/span&amp;gt;. This prevents the window manager, like Gnome or KDE, from starting at boot, which frees up a fair bit of memory and system resources and reduces the possible attack vectors.&lt;br /&gt;
&lt;br /&gt;
To do this, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/inittab&amp;lt;/span&amp;gt;, change the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:5:initdefault:&amp;lt;/span&amp;gt; line to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:3:initdefault:&amp;lt;/span&amp;gt; and then switch to run level 3:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/inittab&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
id:3:initdefault:&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
init 3&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Setup Networking ===&lt;br /&gt;
&lt;br /&gt;
We need to remove &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt;, enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt;, configure the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ifcfg-eth*&amp;lt;/span&amp;gt; files and then start the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; daemon.&lt;br /&gt;
&lt;br /&gt;
==== Network Layout ====&lt;br /&gt;
&lt;br /&gt;
This setup expects you to have three physical network cards connected to three independent networks. To have a common vernacular, let&#039;s use this table to describe them:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!class=&amp;quot;cell_all&amp;quot;|Network Description&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Short Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Device Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Suggested Subnet&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|NIC Properties&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Internet-Facing Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Remaining NIC should be used here.&amp;lt;br /&amp;gt;If using a [[Setting Up a PXE Server in Fedora|PXE server]], this should be a bootable NIC.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Storage Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth1&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Fastest NIC should be used here.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Back-Channel Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth2&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|NICs with [[IPMI]] piggy-back &#039;&#039;&#039;must&#039;&#039;&#039; be used here.&amp;lt;br /&amp;gt;Second-fastest NIC should be used here.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Take note of these concerns when planning which NIC to use on each subnet. These issues are presented in the order that they must be addressed in:&lt;br /&gt;
# If your nodes have [[IPMI]] piggy-backing on a normal NIC, that NIC &#039;&#039;&#039;&#039;&#039;must&#039;&#039;&#039;&#039;&#039; be used on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. Having your fence device accessible on a subnet that can be remotely accessed can pose a &#039;&#039;major&#039;&#039; security risk.&lt;br /&gt;
# The fastest NIC should be used for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt; subnet. Be sure to know which NICs support the largest jumbo frames when considering this.&lt;br /&gt;
# If you still have two NICs to choose from, use the fastest remaining NIC for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. This will minimize the time it takes to perform tasks like hot-migration of live virtual machines.&lt;br /&gt;
# The final NIC should be used for the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt; subnet.&lt;br /&gt;
&lt;br /&gt;
==== Node IP Addresses ====&lt;br /&gt;
&lt;br /&gt;
Obviously, the IP addresses you give to your nodes should be ones that suit you best. In this example, the following IP addresses are used:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;amp;nbsp;&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Internet-Facing Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Storage Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Back-Channel Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node01&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node02&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==== Remove NetworkManager ====&lt;br /&gt;
&lt;br /&gt;
Some cluster software &#039;&#039;&#039;will not&#039;&#039;&#039; start with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; installed. This is because &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is designed to be a highly-adaptive network system that can accommodate frequent changes in the network. To simplify these network transitions for end-users, a lot of reconfiguration of the network is done behind the scenes.&lt;br /&gt;
&lt;br /&gt;
For workstations, this is wonderful. For clustering, this can be disastrous. Transient network outages are already a risk to a cluster&#039;s stability!&lt;br /&gt;
&lt;br /&gt;
So first up, make sure that &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is completely removed from your system. If you used the [[#OS Install|kickstart]] scripts, then it was not installed. Otherwise, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum remove NetworkManager NetworkManager-gnome NetworkManager-openconnect NetworkManager-openvpn NetworkManager-pptp NetworkManager-vpnc cnetworkmanager knetworkmanager knetworkmanager-openvpn knetworkmanager-pptp knetworkmanager-vpnc libproxy-networkmanager yum-NetworkManager-dispatcher&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup &#039;network&#039; ====&lt;br /&gt;
&lt;br /&gt;
Before proceeding with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; configuration, check to see if your network cards are aligned to the proper &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; network names. If they need to be adjusted, please follow this How-To first:&lt;br /&gt;
&lt;br /&gt;
* [[Changing the ethX to Ethernet Device Mapping in Fedora]]&lt;br /&gt;
&lt;br /&gt;
There are a few ways to configure &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; in Fedora:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network&amp;lt;/span&amp;gt; (graphical)&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network-tui&amp;lt;/span&amp;gt; (ncurses)&lt;br /&gt;
* Directly editing the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/sysconfig/network-scripts/ifcfg-eth*&amp;lt;/span&amp;gt; files. (See: [http://docs.fedoraproject.org/en-US/Fedora/12/html/Deployment_Guide/s1-networkscripts-interfaces.html here] for a full list of options)&lt;br /&gt;
&lt;br /&gt;
Do not proceed until your node&#039;s networking is fully configured.&lt;br /&gt;
&lt;br /&gt;
==== Update the Hosts File ====&lt;br /&gt;
&lt;br /&gt;
Some applications expect to be able to call nodes by their name. To accommodate this, and to ensure that inter-node communication takes place on the back-channel subnet, we remove any existing hostname entries and then add the following to the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Any pre-existing entries matching the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; must be removed from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt;. There is a good chance there will be an entry that resolves to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;127.0.0.1&amp;lt;/span&amp;gt; which would cause problems later.&lt;br /&gt;
&lt;br /&gt;
Obviously, adapt the names and IPs to match your nodes and subnets. The only critical thing is to make sure that the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; is resolvable to the back-channel subnet. I like to add a short-form name for convenience.&lt;br /&gt;
&lt;br /&gt;
The updated &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file should look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/hosts&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4&lt;br /&gt;
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6&lt;br /&gt;
&lt;br /&gt;
# Back-channel IPs to name mapping.&lt;br /&gt;
10.0.1.71	an-node01 an-node01.alteeve.com&lt;br /&gt;
10.0.1.72	an-node02 an-node02.alteeve.com&lt;br /&gt;
&lt;br /&gt;
# Storage network&lt;br /&gt;
10.0.0.71	an-node01.sn&lt;br /&gt;
10.0.0.72	an-node02.sn&lt;br /&gt;
&lt;br /&gt;
# Internet facing&lt;br /&gt;
192.168.1.71	an-node01.inet&lt;br /&gt;
192.168.1.72	an-node02.inet&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To test this, ping both nodes by name and make sure the ping packets are sent on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt; subnet:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node01&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node01 (10.0.1.71) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=1 ttl=64 time=0.049 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=2 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=3 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=4 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=5 ttl=64 time=0.055 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node02 (10.0.1.72) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=1 ttl=64 time=0.221 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=2 ttl=64 time=0.188 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=3 ttl=64 time=0.217 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=4 ttl=64 time=0.192 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=5 ttl=64 time=0.163 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
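&lt;br /&gt;
As an extra check, you can confirm that the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; resolves to the back-channel IP. This is a small sketch; &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;getent&amp;lt;/span&amp;gt; consults the same resolver order (via &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;nsswitch.conf&amp;lt;/span&amp;gt;, including &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt;) that most applications use:&lt;br /&gt;

```shell
# Resolve this node's own hostname the same way applications will.
# On an-node01, the output should show 10.0.1.71, not 127.0.0.1.
getent hosts "$(uname -n)"
```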
&lt;br /&gt;
==== Disable Firewalls ====&lt;br /&gt;
&lt;br /&gt;
Be sure to flush netfilter tables and disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ip6tables&amp;lt;/span&amp;gt; from starting on our nodes.&lt;br /&gt;
&lt;br /&gt;
There will be enough potential sources of problems as it is. Disabling firewalls at this stage will minimize the chance of an errant &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; rule messing up our configuration. If, before launch, you wish to implement a firewall, feel free to do so, but be sure to thoroughly test your cluster to ensure no problems were introduced.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig --level 2345 iptables off&lt;br /&gt;
/etc/init.d/iptables stop&lt;br /&gt;
chkconfig --level 2345 ip6tables off&lt;br /&gt;
/etc/init.d/ip6tables stop&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup SSH Shared Keys ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. Setting up shared keys will allow your nodes to pass files between one another and execute commands remotely without needing to enter a password. This is obviously somewhat risky from a security point of view. As such, it is up to you whether you do this or not. This is not meant to be a security-focused How-To, so please independently study the risks.&lt;br /&gt;
&lt;br /&gt;
If you&#039;re a little new to [[SSH]], it can be a bit confusing keeping connections straight in your head. When you connect to a remote machine, you start the connection on your machine as the user you are logged in as. This is the source user. When you call the remote machine, you tell it what user you want to log in as. This is the remote user.&lt;br /&gt;
&lt;br /&gt;
You will need to create an SSH key for each source user, and then you will need to copy the newly generated public key to each remote machine&#039;s user directory that you want to connect to. In this example, we want to connect to either node, from either node, as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. So we will create a key for each node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user and then copy the generated public key to the &#039;&#039;other&#039;&#039; node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user&#039;s directory.&lt;br /&gt;
&lt;br /&gt;
For each user, on each machine you want to connect &#039;&#039;&#039;from&#039;&#039;&#039;, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh-keygen -t dsa -N &amp;quot;&amp;quot; -f ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
Generating public/private dsa key pair.&lt;br /&gt;
Your identification has been saved in /root/.ssh/id_dsa.&lt;br /&gt;
Your public key has been saved in /root/.ssh/id_dsa.pub.&lt;br /&gt;
The key fingerprint is:&lt;br /&gt;
c4:cc:10:0a:6b:60:24:e7:57:94:69:e5:10:b2:cb:1f root@an-node01.alteeve.com&lt;br /&gt;
The key&#039;s randomart image is:&lt;br /&gt;
+--[ DSA 1024]----+&lt;br /&gt;
|+oo ..B*.        |&lt;br /&gt;
|o+ o =+B         |&lt;br /&gt;
|  + +.  *        |&lt;br /&gt;
| . o . .         |&lt;br /&gt;
|    o E S        |&lt;br /&gt;
|     . .         |&lt;br /&gt;
|      .          |&lt;br /&gt;
|                 |&lt;br /&gt;
|                 |&lt;br /&gt;
+-----------------+&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will create two files; the private key called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa&amp;lt;/span&amp;gt; and the public key called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa.pub&amp;lt;/span&amp;gt;. The private key &#039;&#039;&#039;&#039;&#039;must never&#039;&#039;&#039;&#039;&#039; be group or world readable! That is, it should be set to mode &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;0600&amp;lt;/span&amp;gt;.&lt;br /&gt;
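&lt;br /&gt;
If the mode is ever wrong, it can be checked and corrected like so. This is only a quick sketch; &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh-keygen&amp;lt;/span&amp;gt; normally sets the correct mode for you:&lt;br /&gt;

```shell
# Ensure the private key is readable and writable by its owner only.
chmod 600 ~/.ssh/id_dsa
# Verify; the listing should start with -rw-------
ls -l ~/.ssh/id_dsa
```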
&lt;br /&gt;
The two files should look like:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Private key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
-----BEGIN DSA PRIVATE KEY-----&lt;br /&gt;
MIIBugIBAAKBgQCVo5nzeC/W8HWaFkD4pAmWO7ovf7S01f8HgocKKOYX/nMNxRui&lt;br /&gt;
wExBUUUt1b2TxYYQklTd9dn8UAzZ3UohN0CJNDrp//t2wfyACKKc2q+ewao5/svQ&lt;br /&gt;
xXTBu2ZPebKPcr1w9p0hq0xSr77Rg3v6Auc6AnC79DM9XkhYkVgNfgrvMwIVAKdY&lt;br /&gt;
SgTycvqNEWsJrCTTn5485yWvAoGANQ3rzIxFy2zHWyMlrpA9r5jllgxfaJVx3/Iw&lt;br /&gt;
DrEF82lr2dOmG8oYiKAhfvdgNyV7VeM2yxdjcdLPJAHHCx5BZZ7V9KjMzY26wS2r&lt;br /&gt;
d58TZJoWmO0nscmcwIbCv2at3fiMpxgr+nzD4tKxFOdWxYBwdmUpIjtJoW8wzY2H&lt;br /&gt;
uNghjL0CgYA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDN&lt;br /&gt;
bbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EE&lt;br /&gt;
D3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5AIUFX97yIig&lt;br /&gt;
n2SUltN+10NWUy4oYsA=&lt;br /&gt;
-----END DSA PRIVATE KEY-----&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Public key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa.pub&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Copy the public key and then &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; normally into the remote machine as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create a file called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; and paste in the key.&lt;br /&gt;
&lt;br /&gt;
From &#039;&#039;&#039;an-node01&#039;&#039;&#039;, type:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh root@an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
The authenticity of host &#039;an-node02 (10.0.1.72)&#039; can&#039;t be established.&lt;br /&gt;
RSA key fingerprint is d4:1b:68:5f:fa:ef:0f:0c:16:e7:f9:0a:8d:69:3e:c5.&lt;br /&gt;
Are you sure you want to continue connecting (yes/no)? yes&lt;br /&gt;
Warning: Permanently added &#039;an-node02,10.0.1.72&#039; (RSA) to the list of known hosts.&lt;br /&gt;
root@an-node02&#039;s password: &lt;br /&gt;
Last login: Tue Jul  6 22:28:19 2010 from 192.168.1.102&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will now be logged into &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02&amp;lt;/span&amp;gt; as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; file and paste into it the public key from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01&amp;lt;/span&amp;gt;. If the remote machine&#039;s user hasn&#039;t used &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; yet, their &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh&amp;lt;/span&amp;gt; directory will not exist.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/authorized_keys&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now log out and then log back into the remote machine. This time, the connection should succeed without prompting for a password!&lt;br /&gt;
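&lt;br /&gt;
As a convenience, most OpenSSH client packages also ship a helper, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh-copy-id&amp;lt;/span&amp;gt;, that automates the copy-and-paste steps above. A sketch of the same key copy, run from &#039;&#039;&#039;an-node01&#039;&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Appends the public key to root@an-node02&#039;s ~/.ssh/authorized_keys,&lt;br /&gt;
# creating the remote ~/.ssh directory if it does not yet exist.&lt;br /&gt;
ssh-copy-id -i ~/.ssh/id_dsa.pub root@an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;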
&lt;br /&gt;
= Initial Cluster Setup =&lt;br /&gt;
&lt;br /&gt;
Before we get into specifics, let&#039;s take a minute to talk about the major components used in our cluster.&lt;br /&gt;
&lt;br /&gt;
== Core Program Overviews ==&lt;br /&gt;
&lt;br /&gt;
These are the core programs we will use to build our cluster, some of which may be new to you. Before we configure them, let&#039;s take a minute to understand their roles.&lt;br /&gt;
&lt;br /&gt;
=== Cluster Manager ===&lt;br /&gt;
&lt;br /&gt;
The cluster manager, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, reads the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; configuration file and uses it to configure and start the various cluster elements. This is where nodes are defined, fence devices are configured and various cluster tolerances are set.&lt;br /&gt;
&lt;br /&gt;
=== Corosync ===&lt;br /&gt;
&lt;br /&gt;
[http://www.corosync.org Corosync] is, essentially, the clustering kernel which provides the essential services for cluster-aware applications and other cluster services. At its core is the [[totem]] protocol which provides cluster membership and ordered messaging. On top of this, it implements a number of core clustering services like [[#cpg; Closed Process Group|closed process group messaging]], [[quorum]] and [[confdb]] (an object database).&lt;br /&gt;
&lt;br /&gt;
Its goal is to provide a substantially simpler and more flexible set of APIs to facilitate clustering in Linux. It manages which nodes are in the cluster, triggers error messages when something fails, manages cluster locking and so on. Most other clustered applications rely on corosync to know when something has happened or to announce when a cluster-related action has taken place.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;corosync_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
==== cpg; Closed Process Group ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install corosynclib-devel&lt;br /&gt;
man cpg_overview&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The closed process group is a small process layer on top of the totem protocol provided by corosync. It handles the sending and delivery of messages amongst nodes in a consistent order. It adds [[PID]]s and group names to the membership layer. Only members of the group get the messages, thus it is a &amp;quot;closed&amp;quot; group.&lt;br /&gt;
&lt;br /&gt;
=== OpenAIS ===&lt;br /&gt;
&lt;br /&gt;
OpenAIS is now an extension to Corosync that adds an open-source implementation of the [http://www.saforum.org/ Service Availability (SA) Forum&#039;s] &#039;Application Interface Specification&#039; ([[AIS]]). It is an [[API]] and policy designed to be used by applications concerned with maintaining services during faults. AIS implements the &#039;Availability Management Framework&#039; ([[AMF]]) which, in turn, provides for application fail over, cluster management ([[CLM]]), Checkpointing ([[CKPT]]), Eventing ([[EVT]]), Messaging ([[MSG]]), and Distributed Locking ([[DLOCK]]).&lt;br /&gt;
&lt;br /&gt;
It does this by implementing pluggable shared libraries that get loaded into corosync at runtime. These libraries then can make use of corosync&#039;s internal API. Both corosync and openais services provide [[IPC]] interfaces so that application programmers can make use of their behaviour.&lt;br /&gt;
&lt;br /&gt;
In short, applications can use OpenAIS to be cluster-aware. Its libraries are used by some applications, including Pacemaker. In our application, we will only be using its libraries indirectly.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
[http://www.clusterlabs.org Pacemaker] is the cluster resource manager. It can be used to trigger scripts to control non cluster-aware applications. In this way, these non cluster-aware applications can be made more highly available. It is considered a replacement for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[rgmanager]]&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Setup Core Programs ==&lt;br /&gt;
&lt;br /&gt;
We now need to edit and create the configuration files for our cluster.&lt;br /&gt;
&lt;br /&gt;
=== Setup cman ===&lt;br /&gt;
&lt;br /&gt;
Depending on your needs, this may be the only section you need to configure.&lt;br /&gt;
&lt;br /&gt;
First, install the cluster manager:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install cman&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once installed, we need to create and fill in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; file. This is the core configuration file of our cluster and is an [[XML]] formatted file.&lt;br /&gt;
&lt;br /&gt;
Please see: [[Two-Node Fedora 13 cluster.conf]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The example in that article details many of the common options you will want to use. For a more complete discussion of this file, please run &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man 5 cluster.conf&amp;lt;/span&amp;gt;. Also, please be aware that some arguments can overlap between [[#Setup Corosync|corosync]] and this file.&lt;br /&gt;
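&lt;br /&gt;
For orientation only, a minimal two-node skeleton might look like the following. The node names are this paper&#039;s examples, fencing is omitted for brevity, and the linked article remains the authoritative reference. The &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;two_node=&amp;quot;1&amp;quot; expected_votes=&amp;quot;1&amp;quot;&amp;lt;/span&amp;gt; arguments tell &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; that a single surviving node is enough to retain [[quorum]], which a two-node cluster needs:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;xml&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot;?&amp;gt;&lt;br /&gt;
&amp;lt;cluster name=&amp;quot;an-cluster&amp;quot; config_version=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
	&amp;lt;cman two_node=&amp;quot;1&amp;quot; expected_votes=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
	&amp;lt;clusternodes&amp;gt;&lt;br /&gt;
		&amp;lt;clusternode name=&amp;quot;an-node01.alteeve.com&amp;quot; nodeid=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
		&amp;lt;clusternode name=&amp;quot;an-node02.alteeve.com&amp;quot; nodeid=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
	&amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;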
&lt;br /&gt;
Once the file is set up, we need to verify it. When you&#039;ve finished editing, be sure to run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
xmllint --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are errors, fix them. Once the file is formatted properly, the contents of your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file will be returned, followed by &amp;quot;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf validates&amp;lt;/span&amp;gt;&amp;quot;. Once this is the case, you can move on.&lt;br /&gt;
&lt;br /&gt;
A note: normally at this stage, you&#039;d use &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;chkconfig&amp;lt;/span&amp;gt; to enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; to start at boot. However, depending on how you use your cluster, you may need to modify the boot order. For now, let&#039;s leave it off until we&#039;ve got all the components enabled. This is particularly important if you will be using [[DRBD]], [[CLVM]] or [[Xen]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig cman off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To see whether it&#039;s working, have a terminal open to both nodes and manually start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;. I suggest having two terminal windows open on each node. In one, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;tail&amp;lt;/span&amp;gt;ing &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/var/log/messages&amp;lt;/span&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
clear; tail -f -n 0 /var/log/messages&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now, with both log files being watched, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; on both nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: You MUST execute the following on both nodes within the time limit set by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;post_join_delay&amp;lt;/span&amp;gt; (the default is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;6&amp;lt;/span&amp;gt; seconds). If you wait longer than this, the first node started will [[fence]] the other node, assuming you have fencing configured properly. If fencing is &#039;&#039;&#039;not&#039;&#039;&#039; configured properly, your first node will hang.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
/etc/init.d/cman start&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Examine the log files to verify there were no errors. If there were, fix them. If not, fantastic!&lt;br /&gt;
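&lt;br /&gt;
Once &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; is running, you can also sanity-check cluster membership from either node using the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman_tool&amp;lt;/span&amp;gt; program that ships with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Show the cluster name, quorum state and vote counts.&lt;br /&gt;
cman_tool status&lt;br /&gt;
# List the member nodes and their join state.&lt;br /&gt;
cman_tool nodes&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Both nodes should appear in the node list with a member (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;M&amp;lt;/span&amp;gt;) status.&lt;br /&gt;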
&lt;br /&gt;
&#039;&#039;&#039;Warning&#039;&#039;&#039;: This last step will forcefully reboot your nodes. Do not do this on a production cluster! If it must be done on a production server, stop all non-essential services and be sure you have a good backup, &#039;&#039;just in case&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Lastly, test that your fencing is configured and working properly. We do this by using a program called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_node&amp;lt;/span&amp;gt;. With both nodes up and running &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, call a fence against the other node. Be sure to be watching the log files.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;From&#039;&#039;&#039; &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
fence_node an-node02.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
fence an-node02.alteeve.com success&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If it worked, when the node comes back up, repeat the process from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; against &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Setup Corosync ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: In our cluster, we will not need to configure Corosync.&lt;br /&gt;
&lt;br /&gt;
When using &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, as we are here, Corosync&#039;s configuration file is ignored.&lt;br /&gt;
&lt;br /&gt;
Most of the settings available in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 corosync.conf|corosync.conf]]&amp;lt;/span&amp;gt; can be found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|cluster.conf]]&amp;lt;/span&amp;gt; file. Please see the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; article for a comprehensive list of what is and is not supported.&lt;br /&gt;
&lt;br /&gt;
If you want to configure corosync and not use cman, please see the following article:&lt;br /&gt;
&lt;br /&gt;
* [[Two-Node Fedora 13 corosync.conf]]&lt;br /&gt;
&lt;br /&gt;
=== Setup Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
To do.&lt;br /&gt;
&lt;br /&gt;
= Fencing =&lt;br /&gt;
&lt;br /&gt;
Before proceeding with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file, you must understand what fencing is, how it is used in Red Hat/CentOS clusters and why it is so important.&lt;br /&gt;
&lt;br /&gt;
* The Cluster Admin&#039;s Mantra:&lt;br /&gt;
** &#039;&#039;&#039;The only thing you don&#039;t know is what you don&#039;t know&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Just because one node loses communication with another node, it &#039;&#039;&#039;cannot&#039;&#039;&#039; be assumed that the silent node is dead!&lt;br /&gt;
&lt;br /&gt;
== What is it? ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Fencing&amp;quot; is the act of isolating a malfunctioning node. The goal is to prevent a &#039;&#039;&#039;split-brain&#039;&#039;&#039; condition, where two nodes each think the other is dead and continue to use a shared resource. When this happens, file system corruption is almost guaranteed. Another dangerous scenario: one node pauses while writing to a disk, the other node decides it is dead and starts to replay the journal, then the first node recovers and completes the write. The results would be equally disastrous. If you are lucky enough to not lose the shared file system, you will be faced with the task of determining what data got written to which node, merging that data and/or overwriting the node you trust the least. This &#039;best case&#039; is still pretty lousy.&lt;br /&gt;
&lt;br /&gt;
Fencing, isolating a node from altering shared disks, can be accomplished in a couple of ways:&lt;br /&gt;
&lt;br /&gt;
* Power&lt;br /&gt;
** Power fencing is where a device is used to cut the power to a malfunctioning node. This is probably the most common type.&lt;br /&gt;
* Blocking&lt;br /&gt;
** Blocking is often implemented at the network level. This type of fencing leaves the node alone, but disconnects it from the storage network. Often this is done by a switch which prevents traffic coming from the fenced node.&lt;br /&gt;
&lt;br /&gt;
With power fencing, the term used is &amp;quot;STONITH&amp;quot;, literally, &#039;&#039;&#039;S&#039;&#039;&#039;hoot &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;O&#039;&#039;&#039;ther &#039;&#039;&#039;N&#039;&#039;&#039;ode &#039;&#039;&#039;I&#039;&#039;&#039;n &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;H&#039;&#039;&#039;ead. Picture it like an old west duel. If one node is dead, the other node is going to win the duel by default and the dead node will just be shot again. When both nodes are alive, however, the faster node will win and will &amp;quot;kill&amp;quot; (power off or reset) the slower node before it has a chance to fire. Once this duel is over, the surviving node can then access the shared resource confident that it is the only one working on it.&lt;br /&gt;
&lt;br /&gt;
== Misconception ==&lt;br /&gt;
&lt;br /&gt;
It is a &#039;&#039;&#039;very&#039;&#039;&#039; common mistake to ignore fencing when first starting to learn about clustering. Often people think &#039;&#039;&amp;quot;It&#039;s just for production systems, I don&#039;t need to worry about it yet because I don&#039;t care what happens to my test cluster.&amp;quot;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Wrong!&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For the most practical of reasons: the cluster software will block all I/O transactions when it can&#039;t guarantee a fence operation succeeded. The result is that your cluster will essentially &amp;quot;lock up&amp;quot;. Likewise, [[cman]] and related daemons will fail if they can&#039;t find a fence agent to use.&lt;br /&gt;
&lt;br /&gt;
Second, testing our cluster will involve inducing errors. Without proper fencing, there is a high probability that our shared file system will be corrupted. That would force the need to start over, making your learning take a lot longer than it needs to.&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
How you implement fencing depends largely on what kind of fence device(s) you have. This is where I need help from you, dear reader. I only have access to [[Node Assassin]] fence devices on my lab&#039;s cluster. If you have other fence devices, like [[IPMI]], [[DRAC]] or so on, please let me know so that I can expand this section.&lt;br /&gt;
&lt;br /&gt;
=== Implementing Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
In Red Hat&#039;s cluster software, the fence device(s) are configured in the main &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf&amp;lt;/span&amp;gt; cluster configuration file. This configuration is then acted on by the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon. We&#039;ll cover the details of the [[cluster.conf]] file in a moment.&lt;br /&gt;
&lt;br /&gt;
When the cluster determines that a node needs to be fenced, the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon will consult the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file for information on how to access the fence device. Given this &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; snippet:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;xml&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;cluster name=&amp;quot;an-cluster&amp;quot; config_version=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
	&amp;lt;clusternodes&amp;gt;&lt;br /&gt;
		&amp;lt;clusternode name=&amp;quot;an-node02.alteeve.com&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
			&amp;lt;fence&amp;gt;&lt;br /&gt;
				&amp;lt;method name=&amp;quot;node_assassin&amp;quot;&amp;gt;&lt;br /&gt;
					&amp;lt;device name=&amp;quot;motoko&amp;quot; port=&amp;quot;02&amp;quot; action=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
				&amp;lt;/method&amp;gt;&lt;br /&gt;
			&amp;lt;/fence&amp;gt;&lt;br /&gt;
		&amp;lt;/clusternode&amp;gt;&lt;br /&gt;
	&amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
	&amp;lt;fencedevices&amp;gt;&lt;br /&gt;
		&amp;lt;fencedevice name=&amp;quot;motoko&amp;quot; agent=&amp;quot;fence_na&amp;quot; quiet=&amp;quot;true&amp;quot;&lt;br /&gt;
		ipaddr=&amp;quot;motoko.alteeve.com&amp;quot; login=&amp;quot;motoko&amp;quot; passwd=&amp;quot;secret&amp;quot;&amp;gt;&lt;br /&gt;
		&amp;lt;/fencedevice&amp;gt;&lt;br /&gt;
	&amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If the cluster manager determines that the node &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; needs to be fenced, it looks at the first (and, in this case, only) &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;method&amp;gt;&amp;lt;/span&amp;gt; entry under that node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fence&amp;gt;&amp;lt;/span&amp;gt; section and reads its &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;device&amp;gt;&amp;lt;/span&amp;gt; &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;, which here is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt;. It then looks in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fencedevices&amp;gt;&amp;lt;/span&amp;gt; section for the device with the matching &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;. From there, it gets the information needed to find and access the fence device. Once it connects to the fence device, it then passes the options set in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;device&amp;gt;&amp;lt;/span&amp;gt; entry.&lt;br /&gt;
&lt;br /&gt;
So in this example, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; looks up the details on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; Node Assassin fence device. It calls the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt; program, called a fence agent, and passes the following arguments:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr=motoko.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login=motoko&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd=secret&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;quiet=true&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port=2&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action=off&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
How the fence agent acts on these arguments varies depending on the fence device itself. In general terms, the &#039;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt;&#039; fence agent will create a connection to the device at the IP address (or resolvable name, as in this case) specified in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr&amp;lt;/span&amp;gt; argument. Once connected, it will authenticate using the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd&amp;lt;/span&amp;gt; arguments. Once authenticated, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port&amp;lt;/span&amp;gt; to act on, which could be a power jack, a power or reset button, a network switch port and so on. Finally, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action&amp;lt;/span&amp;gt; to take.&lt;br /&gt;
&lt;br /&gt;
Once the device completes, it returns a success or failed message. If the first attempt fails, the fence agent will try the next &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fence&amp;gt;&amp;lt;/span&amp;gt; method, if a second exists. It will keep trying fence devices in the order they are found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file until it runs out of devices. If it fails to fence the node, most daemons will &amp;quot;block&amp;quot;, that is, lock up and stop responding until the issue is resolved. The logic for this is that a locked up cluster is better than a corrupted one.&lt;br /&gt;
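&lt;br /&gt;
As an aside, Red Hat-style fence agents generally read their arguments as &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name=value&amp;lt;/span&amp;gt; pairs, one per line, on standard input, which is how &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; invokes them. Assuming your agent follows this convention, it can be exercised by hand; a sketch using the example values above:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Feed the same arguments fenced would pass, one name=value pair per line.&lt;br /&gt;
printf &#039;ipaddr=motoko.alteeve.com\nlogin=motoko\npasswd=secret\nport=2\naction=off\n&#039; | fence_na&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;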
&lt;br /&gt;
If any of the fence devices succeed though, the cluster will know that it is safe to proceed and will reconfigure the cluster without the defective node.&lt;br /&gt;
&lt;br /&gt;
== Fence Devices ==&lt;br /&gt;
&lt;br /&gt;
Many major [[OEM]]s have their own remote management devices that can serve as fence devices. Examples are [http://dell.ca Dell]&#039;s &#039;DRAC&#039; (Dell Remote Access Controller), [http://hp.ca HP]&#039;s iLO (Integrated Lights-Out), [http://ibm.ca IBM]&#039;s &#039;RSA&#039; (Remote Supervisor Adapter), [http://sun.ca Sun]&#039;s &#039;SSP&#039; (System Service Processor) and so on. Smaller manufacturers implement remote management via [[IPMI]], the Intelligent Platform Management Interface.&lt;br /&gt;
&lt;br /&gt;
In the above devices, fencing is implemented via a built-in or integrated device inside the server. These devices are usually accessible even when the host server is powered off or hard locked. Via these devices, the host server can be powered off, reset and powered on remotely, regardless of the state of the host server.&lt;br /&gt;
&lt;br /&gt;
Block fencing is possible when the device connecting a node to shared resources, like a fiber-channel SAN switch, provides a method of logically &amp;quot;unplugging&amp;quot; a defective node from the shared resource, leaving the node itself alone.&lt;br /&gt;
&lt;br /&gt;
=== IPMI ===&lt;br /&gt;
&lt;br /&gt;
[[IPMI]] is perhaps the most common method used for fencing nodes in a cluster. It requires having a system board with an IPMI baseboard management controller (BMC). IPMI is often the underlying technology used by many OEM-branded remote access tools. If you don&#039;t see a specific fence agent for your server&#039;s remote access application, experiment with generic IPMI tools.&lt;br /&gt;
&lt;br /&gt;
To start, we need to install the IPMI user-space tools. On Fedora, these are typically provided by the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;OpenIPMI&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipmitool&amp;lt;/span&amp;gt; packages:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install OpenIPMI ipmitool&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
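&lt;br /&gt;
Once the user-space tools are in place, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipmitool&amp;lt;/span&amp;gt; can query and control a BMC over the network. A sketch; the host name, user and password here are placeholders for your BMC&#039;s actual details:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Ask the remote BMC for the chassis power state (placeholder credentials).&lt;br /&gt;
ipmitool -I lanplus -H bmc.example.com -U admin -P secret chassis power status&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;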
&lt;br /&gt;
=== Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
A cheap alternative is the [[Node Assassin]], an open-hardware, open source fence device. It was built to allow the use of commodity system boards that lacked remote management support found on more expensive, server class hardware.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Full Disclosure&#039;&#039;&#039;: Node Assassin was created by me, with much help from others, for this paper.&lt;br /&gt;
&lt;br /&gt;
= Clean Up =&lt;br /&gt;
&lt;br /&gt;
Some daemons may have been installed or dragged in during the setup of the cluster. Most notable is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;heartbeat&amp;lt;/span&amp;gt;, which is not compatible with [[RHCS]]. The following command will disable various daemons from starting at boot.&lt;br /&gt;
&lt;br /&gt;
The command is safe to run even when some of these daemons aren&#039;t installed or have already been removed. Of course, skip the ones you actually want to use. This assumes that the nodes have been built according to this HowTo.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig heartbeat off; chkconfig iscsid off; chkconfig iptables off; chkconfig ip6tables off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= So You Have a Cluster - Now What? =&lt;br /&gt;
&lt;br /&gt;
Building the cluster infrastructure was pretty easy, wasn&#039;t it?&lt;br /&gt;
&lt;br /&gt;
But what will you &#039;&#039;&#039;&#039;&#039;do&#039;&#039;&#039;&#039;&#039; with it?&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;width: 75%; text-align: center; padding: 10px; border-spacing: 15px;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!style=&amp;quot;width: 100%; border: 1px solid #dfdfdf;&amp;quot; colspan=&amp;quot;3&amp;quot;|Choose Your Own &amp;lt;span style=&amp;quot;text-decoration: line-through; color: #7f7f7f;&amp;quot;&amp;gt;Adventure&amp;lt;/span&amp;gt; Cluster&lt;br /&gt;
|-&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|style=&amp;quot;width: 34%; border: 1px solid #dfdfdf;&amp;quot;| [[Two Node Fedora 13 Cluster - Xen-Based Virtual Machine Host on DRBD+CLVM|Xen-Based Virtual Machine Host on DRBD+CLVM]]&amp;lt;br /&amp;gt;High Availability VM Host&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Thanks =&lt;br /&gt;
&lt;br /&gt;
* A &#039;&#039;&#039;huge&#039;&#039;&#039; thanks to [http://iplink.net Interlink Connectivity]! They hire me as a contractor and have allowed me to extend these docs while working on their clusters. Development of these How-Tos would be much slower if not for them. If you need hosting or colo services, drop them a line. Their website is a bit out of date though, so disregard the prices. :)&lt;br /&gt;
* To &#039;&#039;&#039;sdake&#039;&#039;&#039; of [http://corosync.org corosync] for helping me sort out the &#039;&#039;&#039;plock&#039;&#039;&#039; component and corosync in general.&lt;br /&gt;
* To &#039;&#039;&#039;Angus Salkeld&#039;&#039;&#039; for helping me nail down the Corosync and OpenAIS differences.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013922.html HJ Lee] from the OpenAIS list for helping me understand the mechanisms controlling the Redundant Ring Protocol&#039;s failure detection types.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013925.html Steven Dake] for clarifying the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;to_x&amp;lt;/span&amp;gt; vs. &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;logoutput: x&amp;lt;/span&amp;gt; arguments in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais.conf&amp;lt;/span&amp;gt;.&lt;br /&gt;
* To [http://dk.linkedin.com/in/fabbione Fabio Massimo Di Nitto] for helping me get caught up with clustering and VMs.&lt;br /&gt;
&lt;br /&gt;
{{footer}}&lt;/div&gt;</summary>
		<author><name>Facade</name></author>
	</entry>
	<entry>
		<id>https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2098</id>
		<title>Abandoned - Two Node Fedora 13 Cluster</title>
		<link rel="alternate" type="text/html" href="https://alteeve.com/w/index.php?title=Abandoned_-_Two_Node_Fedora_13_Cluster&amp;diff=2098"/>
		<updated>2010-09-09T06:18:21Z</updated>

		<summary type="html">&lt;p&gt;Facade: /* Platform */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{howto_header}}&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Notice&#039;&#039;&#039;&#039;&#039;, &#039;&#039;&#039;Sep. 08, 2010&#039;&#039;&#039; - &#039;&#039;Draft 1&#039;&#039;: This is the first draft of this HowTo. It has not been peer-reviewed yet, so please proceed with caution and on a test cluster. Any and all feedback is much appreciated.&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
&lt;br /&gt;
This paper has one goal;&lt;br /&gt;
&lt;br /&gt;
# How to assemble the simplest cluster possible, a &#039;&#039;&#039;2-Node Cluster&#039;&#039;&#039;, which you can then expand on for your own needs.&lt;br /&gt;
&lt;br /&gt;
With this completed, you can then jump into &amp;quot;step 2&amp;quot; papers that will show various uses of a two node cluster:&lt;br /&gt;
&lt;br /&gt;
# How to create a &amp;quot;floating&amp;quot; virtual machine that can move between the two nodes in the event of a node failure, maximizing up time.&lt;br /&gt;
# How to create simple resources that can move between nodes. Examples will be a simple PostgreSQL database, DHCP, DNS and web servers.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
&lt;br /&gt;
It is expected that you are already comfortable with the Linux command line, specifically &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[bash]]&amp;lt;/span&amp;gt;, and that you are familiar with general administrative tasks in Red Hat based distributions, specifically [[Fedora]]. You will also need to be comfortable using editors like [[vim]], [[nano]] or similar. This paper uses &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;vim&amp;lt;/span&amp;gt; in examples; simply substitute your favourite editor in its place.&lt;br /&gt;
&lt;br /&gt;
You are also expected to be comfortable with networking concepts. You will be expected to understand TCP/IP, [[multicast]], broadcast, subnets and netmasks, routing and other relatively basic networking concepts. Please take the time to become familiar with these concepts before proceeding.&lt;br /&gt;
&lt;br /&gt;
That said, where feasible, as much detail as possible will be provided. For example, all configuration file locations will be shown and functioning sample files supplied.&lt;br /&gt;
&lt;br /&gt;
== Platform ==&lt;br /&gt;
&lt;br /&gt;
This paper will implement the [[Red Hat]] Cluster Suite using the Fedora v13 distribution. This paper uses the [[x86_64]] repositories; however, if you are on an [[i386]] (32-bit) system, you should be able to follow along fine. Simply replace &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;x86_64&amp;lt;/span&amp;gt; with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i386&amp;lt;/span&amp;gt; or &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;i686&amp;lt;/span&amp;gt; in package names.&lt;br /&gt;
&lt;br /&gt;
You can either download the stock Fedora 13 DVD ISO (currently at version &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5.4&amp;lt;/span&amp;gt;, which is used in this document), or you can try out the alpha [[#AN!Cluster Install|AN!Cluster Install]] DVD (4.3GB ISO). If you use the latter, please test it out on a development or test cluster. If you have any problems with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;AN!Cluster&amp;lt;/span&amp;gt; variant Fedora distro, please [[Digimer|contact me]] and let me know what your trouble was.&lt;br /&gt;
&lt;br /&gt;
== Why Fedora 13? ==&lt;br /&gt;
&lt;br /&gt;
Generally speaking, I much prefer to use a server-oriented Linux distribution like [[CentOS]], [[Debian]] or similar. However, there have been many recent changes in the Linux-Clustering world that have made all of the currently available server-class distributions obsolete. With luck, once [[Red Hat]] Enterprise Linux and [[CentOS]] version 6 are released, this will change.&lt;br /&gt;
&lt;br /&gt;
Until then, [[Fedora]] version 13 provides the most up-to-date binary releases of the new implementation of the clustering stack. For this reason, Fedora 13 is the best choice for clustering if you want to be current. To mitigate some of the issues introduced by using a workstation distribution, many packages will be stripped out of the default install.&lt;br /&gt;
&lt;br /&gt;
== Focus ==&lt;br /&gt;
&lt;br /&gt;
Clusters can serve to solve three problems: &#039;&#039;&#039;Reliability&#039;&#039;&#039;, &#039;&#039;&#039;Performance&#039;&#039;&#039; and &#039;&#039;&#039;Scalability&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
The focus of the cluster described in this paper is primarily &#039;&#039;&#039;reliability&#039;&#039;&#039;. Second to this, &#039;&#039;&#039;scalability&#039;&#039;&#039; is the priority, leaving &#039;&#039;&#039;performance&#039;&#039;&#039; to be addressed only where it does not impact the first two criteria. This is not to say that performance is not a valid priority; it simply isn&#039;t the priority of this paper.&lt;br /&gt;
&lt;br /&gt;
== Goal ==&lt;br /&gt;
&lt;br /&gt;
At the end of this paper, you should have a fully functioning two-node cluster capable of hosting &amp;quot;floating&amp;quot; resources. That is, resources that exist on one node and can be easily moved to the other node with minimal effort and downtime. This should leave you with a solid foundation for adding more virtual servers, up to the limit of your cluster&#039;s resources.&lt;br /&gt;
&lt;br /&gt;
This paper should also serve to show how to build the foundation of any other cluster configuration. Its core focus is introducing the main issues that come with clustering, so it should work as a starting point for cluster configurations outside its own scope.&lt;br /&gt;
&lt;br /&gt;
= Begin = &lt;br /&gt;
&lt;br /&gt;
Let&#039;s begin!&lt;br /&gt;
&lt;br /&gt;
== Hardware ==&lt;br /&gt;
&lt;br /&gt;
We will need two physical servers each with the following hardware:&lt;br /&gt;
* One or more multi-core [[CPU]]s with Virtualization support.&lt;br /&gt;
* Three network cards; at least one should be gigabit or faster.&lt;br /&gt;
* One or more hard drives.&lt;br /&gt;
* You will need some form of a [[fence|fence device]]. This can be an [[IPMI]]-enabled server, a [http://nodeassassin.org Node Assassin], a fenceable [http://www.apc.com/products/resource/include/techspec_index.cfm?base_sku=AP7900&amp;amp;tab=features PDU] or similar.&lt;br /&gt;
&lt;br /&gt;
This paper uses the following hardware:&lt;br /&gt;
* ASUS [http://support.asus.com/search/search.aspx?keyword=m4a78l-m&amp;amp;SLanguage=en-us M4A78L-M]&lt;br /&gt;
* AMD Athlon II x2 250 &lt;br /&gt;
* 2GB Kingston DDR2 KVR800D2N6K2/4G (4GB kit split between the two nodes)&lt;br /&gt;
* 1x Intel 82540 PCI NIC&lt;br /&gt;
* 1x D-Link DGE-560T&lt;br /&gt;
* Node Assassin&lt;br /&gt;
&lt;br /&gt;
This is not an endorsement of the above hardware. I bought what was within my budget that would serve the purpose of creating this document. What you purchase shouldn&#039;t matter, so long as the minimum requirements are met.&lt;br /&gt;
&lt;br /&gt;
== Pre-Assembly ==&lt;br /&gt;
&lt;br /&gt;
With multiple NICs, it is quite likely that the mapping of physical devices to logical &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; devices may not be ideal. This is a particular issue if you decide to [[Setting Up a PXE Server in Fedora|network boot]] your install media. In that case, if the wrong NIC is chosen for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;, you will be presented with a list of MAC addresses to attempt setup with.&lt;br /&gt;
&lt;br /&gt;
Before you assemble your servers, record their network cards&#039; [[MAC]] addresses. I like to keep simple text files like these:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node01.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:EA	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:53	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:46:E4	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat an-node02.mac&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
90:E6:BA:71:82:D8	eth0	# Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller&lt;br /&gt;
00:21:91:19:96:5A	eth1	# D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter&lt;br /&gt;
00:0E:0C:59:45:78	eth2	# Intel Corporation 82540EM Gigabit Ethernet Controller&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Feel free to record the information in any way that suits you the best.&lt;br /&gt;
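If you have shell access to a node before final assembly, the kernel already exposes every MAC address under &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/sys&amp;lt;/span&amp;gt;. This small sketch (an illustration, not part of the stock install) prints one interface and MAC pair per line, ready to be redirected into files like the ones above:&lt;br /&gt;

```shell
# Print each network interface with its MAC address, one per line.
# Redirect the output into a file such as an-node01.mac for your records.
for nic in /sys/class/net/*; do
    printf '%s\t%s\n' "$(basename "$nic")" "$(cat "$nic/address")"
done
```

On the hardware above you would see &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt; through &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth2&amp;lt;/span&amp;gt; plus &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;lo&amp;lt;/span&amp;gt;; the loopback device reports the all-zero MAC and can be ignored.&lt;br /&gt;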
&lt;br /&gt;
== OS Install ==&lt;br /&gt;
&lt;br /&gt;
Start with a stock Fedora 13 install. This How-To uses Fedora 13 x86_64, however it should be fairly easy to adapt to other recent Fedora versions. This document also attempts to be easily ported to [[CentOS]]/[[RHEL]] version 6 once it is released. It will not adapt well to CentOS/RHEL version 5, though; much of the cluster stack has changed dramatically since version 5 was released.&lt;br /&gt;
&lt;br /&gt;
These are the sample kickstart scripts used by this paper. Be sure to set your own password string and network settings.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Warning&#039;&#039;&#039;&#039;&#039;! These kickstart scripts &#039;&#039;&#039;&#039;&#039;will erase your hard drive&#039;&#039;&#039;&#039;&#039;! Adapt them, don&#039;t blindly use them.&lt;br /&gt;
&lt;br /&gt;
Generic cluster node kickstart scripts.&lt;br /&gt;
* [[Fedora13 KS an-node01.ks|an-node01.ks]]&lt;br /&gt;
* [[Fedora13 KS an-node02.ks|an-node02.ks]]&lt;br /&gt;
&lt;br /&gt;
== AN!Cluster Install ==&lt;br /&gt;
&lt;br /&gt;
If you are feeling brave, below is a link to a custom install DVD that contains the kickstart scripts to set up nodes, plus an &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-cluster&amp;lt;/span&amp;gt; directory with all the configuration files.&lt;br /&gt;
&lt;br /&gt;
* Download the custom &#039;&#039;&#039;AN!Cluster v0.2.001&#039;&#039;&#039; Install DVD. (4.5[[GiB]] iso). (Currently disabled - Reworking for F13)&lt;br /&gt;
&lt;br /&gt;
== Post OS Install ==&lt;br /&gt;
&lt;br /&gt;
Once the OS is installed, we need to do a few things:&lt;br /&gt;
&lt;br /&gt;
# Disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;.&lt;br /&gt;
# Change the default run-level.&lt;br /&gt;
# Set up networking.&lt;br /&gt;
&lt;br /&gt;
=== Disable &#039;selinux&#039; ===&lt;br /&gt;
&lt;br /&gt;
To allow this paper to focus on clustering, we will disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;. Obviously, this introduces security issues that you may not be comfortable with. If you wish to leave &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt; enabled, it will be up to you to sort out issues that crop up.&lt;br /&gt;
&lt;br /&gt;
To disable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;selinux&amp;lt;/span&amp;gt;, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/selinux/config&amp;lt;/span&amp;gt; and change &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=enforcing&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SELINUX=permissive&amp;lt;/span&amp;gt;. You will need to reboot in order for the changes to take effect.&lt;br /&gt;
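The same edit can be scripted with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;sed&amp;lt;/span&amp;gt;. The sketch below works on a temporary copy of the file so it is safe to run anywhere; on a real node you would point it at &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/selinux/config&amp;lt;/span&amp;gt; instead:&lt;br /&gt;

```shell
# Demonstrate the SELINUX=enforcing to SELINUX=permissive change on a
# temporary copy; substitute /etc/selinux/config on an actual node.
cfg=$(mktemp)
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$cfg"
sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' "$cfg"
grep '^SELINUX=' "$cfg"    # prints: SELINUX=permissive
```

Remember that the new mode only applies after a reboot.&lt;br /&gt;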
&lt;br /&gt;
=== Change the Default Run-Level ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. It improves performance only; it is not required.&lt;br /&gt;
&lt;br /&gt;
If you don&#039;t plan to work on your nodes directly, it makes sense to switch the default run level from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;5&amp;lt;/span&amp;gt; to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;3&amp;lt;/span&amp;gt;. This prevents the window manager, like Gnome or KDE, from starting at boot. This frees up a fair bit of memory and system resources, and reduces the possible attack vectors.&lt;br /&gt;
&lt;br /&gt;
To do this, edit &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/inittab&amp;lt;/span&amp;gt;, change the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:5:initdefault:&amp;lt;/span&amp;gt; line to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;id:3:initdefault:&amp;lt;/span&amp;gt; and then switch to run level 3:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/inittab&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
id:3:initdefault:&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
init 3&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Setup Networking ===&lt;br /&gt;
&lt;br /&gt;
We need to remove &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt;, enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt;, configure the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ifcfg-eth*&amp;lt;/span&amp;gt; files and then start the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; daemon.&lt;br /&gt;
&lt;br /&gt;
==== Network Layout ====&lt;br /&gt;
&lt;br /&gt;
This setup expects you to have three physical network cards connected to three independent networks. To have a common vernacular, let&#039;s use this table to describe them:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!class=&amp;quot;cell_all&amp;quot;|Network Description&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Short Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Device Name&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|Suggested Subnet&lt;br /&gt;
!class=&amp;quot;cell_tbr&amp;quot;|NIC Properties&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Internet-Facing Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Remaining NIC should be used here.&amp;lt;br /&amp;gt;If using a [[Setting Up a PXE Server in Fedora|PXE server]], this should be a bootable NIC.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Storage Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth1&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|Fastest NIC should be used here.&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|Back-Channel Network&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth2&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|NICs with [[IPMI]] piggy-back &#039;&#039;&#039;must&#039;&#039;&#039; be used here.&amp;lt;br /&amp;gt;Second-fastest NIC should be used here.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Take note of these concerns when planning which NIC to use on each subnet. These issues are presented in the order that they must be addressed in:&lt;br /&gt;
# If your nodes have [[IPMI]] piggy-backing on a normal NIC, that NIC &#039;&#039;&#039;&#039;&#039;must&#039;&#039;&#039;&#039;&#039; be used on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. Having your fence device accessible on a remotely reachable subnet poses a &#039;&#039;major&#039;&#039; security risk.&lt;br /&gt;
# The fastest NIC should be used for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt; subnet. Be sure to know which NICs support the largest jumbo frames when considering this.&lt;br /&gt;
# If you still have two NICs to choose from, use the fastest remaining NIC for your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt; subnet. This will minimize the time it takes to perform tasks like hot-migration of live virtual machines.&lt;br /&gt;
# The final NIC should be used for the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt; subnet.&lt;br /&gt;
&lt;br /&gt;
==== Node IP Addresses ====&lt;br /&gt;
&lt;br /&gt;
Obviously, the IP addresses you give to your nodes should be ones that suit you best. In this example, the following IP addresses are used:&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;amp;nbsp;&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Internet-Facing Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;IFN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Storage Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;SN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|class=&amp;quot;cell_tbr&amp;quot;|&#039;&#039;&#039;Back-Channel Network&#039;&#039;&#039; (&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;BCN&amp;lt;/span&amp;gt;)&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node01&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.71&amp;lt;/span&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
|class=&amp;quot;cell_blr&amp;quot;|&#039;&#039;&#039;an-node02&#039;&#039;&#039;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;192.168.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.0.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|class=&amp;quot;cell_br&amp;quot;|&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.72&amp;lt;/span&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==== Remove NetworkManager ====&lt;br /&gt;
&lt;br /&gt;
Some cluster software &#039;&#039;&#039;will not&#039;&#039;&#039; start with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; installed. This is because it is designed to be a highly-adaptive network system that can accommodate frequent changes in the network. To simplify these network transitions for end-users, a lot of reconfiguration of the network is done behind the scenes.&lt;br /&gt;
&lt;br /&gt;
For workstations, this is wonderful. For clustering, this can be disastrous. Transient network outages are already a risk to a cluster&#039;s stability!&lt;br /&gt;
&lt;br /&gt;
So first up, make sure that &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;NetworkManager&amp;lt;/span&amp;gt; is completely removed from your system. If you used the [[#OS Install|kickstart]] scripts, then it was not installed. Otherwise, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum remove NetworkManager NetworkManager-gnome NetworkManager-openconnect NetworkManager-openvpn NetworkManager-pptp NetworkManager-vpnc cnetworkmanager knetworkmanager knetworkmanager-openvpn knetworkmanager-pptp knetworkmanager-vpnc libproxy-networkmanager yum-NetworkManager-dispatcher&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup &#039;network&#039; ====&lt;br /&gt;
&lt;br /&gt;
Before proceeding with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; configuration, check to see if your network cards are aligned to the proper &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ethX&amp;lt;/span&amp;gt; network names. If they need to be adjusted, please follow this How-To before proceeding:&lt;br /&gt;
&lt;br /&gt;
* [[Changing the ethX to Ethernet Device Mapping in Fedora]]&lt;br /&gt;
&lt;br /&gt;
There are a few ways to configure &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;network&amp;lt;/span&amp;gt; in Fedora:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network&amp;lt;/span&amp;gt; (graphical)&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;system-config-network-tui&amp;lt;/span&amp;gt; (ncurses)&lt;br /&gt;
* Directly editing the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/sysconfig/network-scripts/ifcfg-eth*&amp;lt;/span&amp;gt; files. (See: [http://docs.fedoraproject.org/en-US/Fedora/12/html/Deployment_Guide/s1-networkscripts-interfaces.html here] for a full list of options)&lt;br /&gt;
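&lt;br /&gt;
As one illustration of the third option, a static configuration for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;eth0&amp;lt;/span&amp;gt; on &#039;&#039;&#039;an-node01&#039;&#039;&#039; might look like the sketch below. The values are taken from the tables above, except for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;GATEWAY&amp;lt;/span&amp;gt;, which is an assumed address for this example; adjust everything to your own network:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
# /etc/sysconfig/network-scripts/ifcfg-eth0 on an-node01 (IFN)&lt;br /&gt;
DEVICE=eth0&lt;br /&gt;
HWADDR=90:E6:BA:71:82:EA&lt;br /&gt;
BOOTPROTO=static&lt;br /&gt;
ONBOOT=yes&lt;br /&gt;
IPADDR=192.168.1.71&lt;br /&gt;
NETMASK=255.255.255.0&lt;br /&gt;
GATEWAY=192.168.1.1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;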
&lt;br /&gt;
Do not proceed until your node&#039;s networking is fully configured.&lt;br /&gt;
&lt;br /&gt;
==== Update the Hosts File ====&lt;br /&gt;
&lt;br /&gt;
Some applications expect to be able to call nodes by their name. To accommodate this, and to ensure that inter-node communication takes place on the back-channel subnet, we remove any existing hostname entries and then add the following to the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Any pre-existing entries matching the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; must be removed from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt;. There is a good chance there will be an entry that resolves to &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;127.0.0.1&amp;lt;/span&amp;gt; which would cause problems later.&lt;br /&gt;
&lt;br /&gt;
Obviously, adapt the names and IPs to match your nodes and subnets. The only critical thing is to make sure that the name returned by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;uname -n&amp;lt;/span&amp;gt; is resolvable to the back-channel subnet. I like to add a short-form name for convenience.&lt;br /&gt;
&lt;br /&gt;
The updated &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/hosts&amp;lt;/span&amp;gt; file should look like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/hosts&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4&lt;br /&gt;
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6&lt;br /&gt;
&lt;br /&gt;
# Back-channel IPs to name mapping.&lt;br /&gt;
10.0.1.71	an-node01 an-node01.alteeve.com&lt;br /&gt;
10.0.1.72	an-node02 an-node02.alteeve.com&lt;br /&gt;
&lt;br /&gt;
# Storage network&lt;br /&gt;
10.0.0.71	an-node01.sn&lt;br /&gt;
10.0.0.72	an-node02.sn&lt;br /&gt;
&lt;br /&gt;
# Internet facing&lt;br /&gt;
192.168.1.71	an-node01.inet&lt;br /&gt;
192.168.1.72	an-node02.inet&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To test this, ping both nodes by name and make sure the ping packets are sent on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;10.0.1.0/24&amp;lt;/span&amp;gt; subnet:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node01&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node01 (10.0.1.71) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=1 ttl=64 time=0.049 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=2 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=3 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=4 ttl=64 time=0.055 ms&lt;br /&gt;
64 bytes from an-node01 (10.0.1.71): icmp_seq=5 ttl=64 time=0.055 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ping -c 5 an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
PING an-node02 (10.0.1.72) 56(84) bytes of data.&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=1 ttl=64 time=0.221 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=2 ttl=64 time=0.188 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=3 ttl=64 time=0.217 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=4 ttl=64 time=0.192 ms&lt;br /&gt;
64 bytes from an-node02 (10.0.1.72): icmp_seq=5 ttl=64 time=0.163 ms&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Disable Firewalls ====&lt;br /&gt;
&lt;br /&gt;
Be sure to flush the netfilter tables and prevent &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ip6tables&amp;lt;/span&amp;gt; from starting on our nodes.&lt;br /&gt;
&lt;br /&gt;
There will be enough potential sources of problems as it is. Disabling firewalls at this stage will minimize the chance of an errant &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;iptables&amp;lt;/span&amp;gt; rule messing up our configuration. If, before launch, you wish to implement a firewall, feel free to do so, but be sure to thoroughly test your cluster to ensure no problems were introduced.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig --level 2345 iptables off&lt;br /&gt;
/etc/init.d/iptables stop&lt;br /&gt;
chkconfig --level 2345 ip6tables off&lt;br /&gt;
/etc/init.d/ip6tables stop&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Setup SSH Shared Keys ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;This is an optional step&#039;&#039;&#039;. Setting up shared keys will allow your nodes to pass files between one another and execute commands remotely without needing to enter a password. This is obviously somewhat risky from a security point of view. As such, it is up to you whether you do this or not. This is not meant to be a security-focused How-To, so please independently study the risks.&lt;br /&gt;
&lt;br /&gt;
If you&#039;re a little new to [[SSH]], it can be a bit confusing keeping connections straight in your head. When you connect to a remote machine, you start the connection on your machine as the user you are logged in as. This is the source user. When you call the remote machine, you tell it what user you want to log in as. This is the remote user.&lt;br /&gt;
&lt;br /&gt;
You will need to create an SSH key for each source user, and then you will need to copy the newly generated public key to each remote machine&#039;s user directory that you want to connect to. In this example, we want to connect to either node, from either node, as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. So we will create a key for each node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user and then copy the generated public key to the &#039;&#039;other&#039;&#039; node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user&#039;s directory.&lt;br /&gt;
&lt;br /&gt;
For each user, on each machine you want to connect &#039;&#039;&#039;from&#039;&#039;&#039;, run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh-keygen -t dsa -N &amp;quot;&amp;quot; -f ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
Generating public/private dsa key pair.&lt;br /&gt;
Your identification has been saved in /root/.ssh/id_dsa.&lt;br /&gt;
Your public key has been saved in /root/.ssh/id_dsa.pub.&lt;br /&gt;
The key fingerprint is:&lt;br /&gt;
c4:cc:10:0a:6b:60:24:e7:57:94:69:e5:10:b2:cb:1f root@an-node01.alteeve.com&lt;br /&gt;
The key&#039;s randomart image is:&lt;br /&gt;
+--[ DSA 1024]----+&lt;br /&gt;
|+oo ..B*.        |&lt;br /&gt;
|o+ o =+B         |&lt;br /&gt;
|  + +.  *        |&lt;br /&gt;
| . o . .         |&lt;br /&gt;
|    o E S        |&lt;br /&gt;
|     . .         |&lt;br /&gt;
|      .          |&lt;br /&gt;
|                 |&lt;br /&gt;
|                 |&lt;br /&gt;
+-----------------+&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will create two files: the private key, called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa&amp;lt;/span&amp;gt;, and the public key, called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa.pub&amp;lt;/span&amp;gt;. The private key &#039;&#039;&#039;&#039;&#039;must never&#039;&#039;&#039;&#039;&#039; be group or world readable! That is, it should be set to mode &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;0600&amp;lt;/span&amp;gt;.&lt;br /&gt;
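&lt;br /&gt;
A quick way to confirm the mode is with &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;chmod&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;stat&amp;lt;/span&amp;gt;. A temporary file stands in for the key here so the sketch can be run anywhere; on a node, operate on &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/id_dsa&amp;lt;/span&amp;gt; itself:&lt;br /&gt;

```shell
# Tighten and verify the private key's permissions (mode 0600).
key=$(mktemp)              # stand-in for ~/.ssh/id_dsa in this sketch
chmod 600 "$key"
stat -c '%a' "$key"        # prints: 600
```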
&lt;br /&gt;
The two files should look like:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Private key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
-----BEGIN DSA PRIVATE KEY-----&lt;br /&gt;
MIIBugIBAAKBgQCVo5nzeC/W8HWaFkD4pAmWO7ovf7S01f8HgocKKOYX/nMNxRui&lt;br /&gt;
wExBUUUt1b2TxYYQklTd9dn8UAzZ3UohN0CJNDrp//t2wfyACKKc2q+ewao5/svQ&lt;br /&gt;
xXTBu2ZPebKPcr1w9p0hq0xSr77Rg3v6Auc6AnC79DM9XkhYkVgNfgrvMwIVAKdY&lt;br /&gt;
SgTycvqNEWsJrCTTn5485yWvAoGANQ3rzIxFy2zHWyMlrpA9r5jllgxfaJVx3/Iw&lt;br /&gt;
DrEF82lr2dOmG8oYiKAhfvdgNyV7VeM2yxdjcdLPJAHHCx5BZZ7V9KjMzY26wS2r&lt;br /&gt;
d58TZJoWmO0nscmcwIbCv2at3fiMpxgr+nzD4tKxFOdWxYBwdmUpIjtJoW8wzY2H&lt;br /&gt;
uNghjL0CgYA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDN&lt;br /&gt;
bbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EE&lt;br /&gt;
D3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5AIUFX97yIig&lt;br /&gt;
n2SUltN+10NWUy4oYsA=&lt;br /&gt;
-----END DSA PRIVATE KEY-----&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Public key&#039;&#039;&#039;:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/id_dsa.pub&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Copy the public key and then &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; normally into the remote machine as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create a file called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; and paste in the key.&lt;br /&gt;
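&lt;br /&gt;
If it is available, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh-copy-id&amp;lt;/span&amp;gt;, shipped with OpenSSH, automates the same append, for example &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh-copy-id -i ~/.ssh/id_dsa.pub root@an-node02&amp;lt;/span&amp;gt;. The sketch below simulates what the paste (or ssh-copy-id) does using local stand-in paths, so you can see the mechanism without a second machine:&lt;br /&gt;

```shell
# Append a public key to an authorized_keys file; the paths here are
# local stand-ins for /root/.ssh/ on the remote node.
workdir=$(mktemp -d)
echo 'ssh-dss AAAAB3...example... root@an-node01' > "$workdir/id_dsa.pub"
mkdir -p "$workdir/remote_ssh"
cat "$workdir/id_dsa.pub" >> "$workdir/remote_ssh/authorized_keys"
chmod 600 "$workdir/remote_ssh/authorized_keys"
```

Keep the modes tight; by default, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;sshd&amp;lt;/span&amp;gt; ignores an &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;authorized_keys&amp;lt;/span&amp;gt; file that is group- or world-writable.&lt;br /&gt;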
&lt;br /&gt;
From &#039;&#039;&#039;an-node01&#039;&#039;&#039;, type:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
ssh root@an-node02&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
The authenticity of host &#039;an-node02 (10.0.1.72)&#039; can&#039;t be established.&lt;br /&gt;
RSA key fingerprint is d4:1b:68:5f:fa:ef:0f:0c:16:e7:f9:0a:8d:69:3e:c5.&lt;br /&gt;
Are you sure you want to continue connecting (yes/no)? yes&lt;br /&gt;
Warning: Permanently added &#039;an-node02,10.0.1.72&#039; (RSA) to the list of known hosts.&lt;br /&gt;
root@an-node02&#039;s password: &lt;br /&gt;
Last login: Tue Jul  6 22:28:19 2010 from 192.168.1.102&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will now be logged into &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02&amp;lt;/span&amp;gt; as the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;root&amp;lt;/span&amp;gt; user. Create the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh/authorized_keys&amp;lt;/span&amp;gt; file and paste into it the public key from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01&amp;lt;/span&amp;gt;. If the remote machine&#039;s user hasn&#039;t used &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ssh&amp;lt;/span&amp;gt; yet, their &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;~/.ssh&amp;lt;/span&amp;gt; directory will not exist.&lt;br /&gt;
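&lt;br /&gt;
If the directory is missing, create it with the permissions &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;sshd&amp;lt;/span&amp;gt; expects; with the default &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;StrictModes&amp;lt;/span&amp;gt; setting it will ignore &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;authorized_keys&amp;lt;/span&amp;gt; if the directory or file is group- or world-writable. A minimal sketch:&lt;br /&gt;

```shell
#!/bin/sh
# Create ~/.ssh with the strict permissions sshd requires.
mkdir -p ~/.ssh
chmod 700 ~/.ssh
touch ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# Then paste (or append) the public key copied from an-node01, e.g.:
#   cat id_dsa.pub >> ~/.ssh/authorized_keys
```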
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
cat ~/.ssh/authorized_keys&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
ssh-dss AAAAB3NzaC1kc3MAAACBAJWjmfN4L9bwdZoWQPikCZY7ui9/tLTV/weChwoo5hf+cw3FG6LATEFRRS3VvZPFhhCSVN312fxQDNndSiE3QIk0Oun/+3bB/IAIopzar57Bqjn+y9DFdMG7Zk95so9yvXD2nSGrTFKvvtGDe/oC5zoCcLv0Mz1eSFiRWA1+Cu8zAAAAFQCnWEoE8nL6jRFrCawk05+ePOclrwAAAIA1DevMjEXLbMdbIyWukD2vmOWWDF9olXHf8jAOsQXzaWvZ06YbyhiIoCF+92A3JXtV4zbLF2Nx0s8kAccLHkFlntX0qMzNjbrBLat3nxNkmhaY7SexyZzAhsK/Zq3d+IynGCv6fMPi0rEU51bFgHB2ZSkiO0mhbzDNjYe42CGMvQAAAIA+bW0A/fZaUP6nk00tCeVzv+TUNxN5OJ3fiMuLVswvMKE2WaDnUgDNbbvzMJGf6v6irAbkJk9EbGGld91WAdtWzjBhZhuuHXptKiwAP3+YMJS7EJRs84EED3Sy3xCD+Wp3UkxrMZvaLC9onCZU+BqWjlD8wz6XJNYob/piF+jG5A== root@an-node01.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now log out and log back into the remote machine. This time, the connection should succeed without asking for a password!&lt;br /&gt;
&lt;br /&gt;
= Initial Cluster Setup =&lt;br /&gt;
&lt;br /&gt;
Before we get into specifics, let&#039;s take a minute to talk about the major components used in our cluster.&lt;br /&gt;
&lt;br /&gt;
== Core Program Overviews ==&lt;br /&gt;
&lt;br /&gt;
These are the core programs, possibly new to you, that we will use to build our cluster. Before we configure them, let&#039;s take a minute to understand their roles.&lt;br /&gt;
&lt;br /&gt;
=== Cluster Manager ===&lt;br /&gt;
&lt;br /&gt;
The cluster manager, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, takes the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; configuration file and uses it to configure and start the various cluster elements. This is where nodes are defined, fence devices are configured and various cluster tolerances are set.&lt;br /&gt;
&lt;br /&gt;
=== Corosync ===&lt;br /&gt;
&lt;br /&gt;
[http://www.corosync.org Corosync] is, essentially, the clustering kernel, providing the fundamental services needed by cluster-aware applications and other cluster services. At its core is the [[totem]] protocol, which provides cluster membership and ordered messaging. On top of this, it implements a number of core clustering services like [[#cpg; Closed Process Group|closed process group messaging]], [[quorum]] and [[confdb]] (an object database).&lt;br /&gt;
&lt;br /&gt;
Its goal is to provide a substantially simpler and more flexible set of APIs to facilitate clustering in Linux. It manages which nodes are in the cluster, triggers error messages when something fails, manages cluster locking and so on. Most other clustered applications rely on corosync to know when something has happened or to announce when a cluster-related action has taken place.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;corosync_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
==== cpg; Closed Process Group ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install corosynclib-devel&lt;br /&gt;
man cpg_overview&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The closed process group is a small process layer on top of the totem protocol provided by corosync. It handles the sending and delivery of messages amongst nodes in a consistent order. It adds [[PID]]s and group names to the membership layer. Only members of the group get the messages, thus it is a &amp;quot;closed&amp;quot; group.&lt;br /&gt;
&lt;br /&gt;
=== OpenAIS ===&lt;br /&gt;
&lt;br /&gt;
OpenAIS is now an extension to Corosync that adds an open-source implementation of the [http://www.saforum.org/ Service Availability (SA) Forum&#039;s] &#039;Application Interface Specification&#039; ([[AIS]]). It is an [[API]] and policy designed to be used by applications concerned with maintaining services during faults. AIS implements the &#039;Availability Management Framework&#039; ([[AMF]]) which, in turn, provides for application fail over, cluster management ([[CLM]]), Checkpointing ([[CKPT]]), Eventing ([[EVT]]), Messaging ([[MSG]]), and Distributed Locking ([[DLOCK]]).&lt;br /&gt;
&lt;br /&gt;
It does this by implementing pluggable shared libraries that get loaded into corosync at runtime. These libraries then can make use of corosync&#039;s internal API. Both corosync and openais services provide [[IPC]] interfaces so that application programmers can make use of their behaviour.&lt;br /&gt;
&lt;br /&gt;
In short: applications can use OpenAIS to be cluster-aware. Its libraries are used by some applications, including Pacemaker. In our cluster, we will only be using its libraries indirectly.&lt;br /&gt;
&lt;br /&gt;
Please note that the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais_overview&amp;lt;/span&amp;gt; man page is considered out of date at the time of this writing.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: Only install the following if you wish to review the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man&amp;lt;/span&amp;gt; page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install openais&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
[http://www.clusterlabs.org Pacemaker] is the cluster resource manager. It can be used to trigger scripts to control non cluster-aware applications. In this way, these non cluster-aware applications can be made more highly available. It is considered a replacement for &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[rgmanager]]&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Setup Core Programs ==&lt;br /&gt;
&lt;br /&gt;
We now need to edit and create the configuration files for our cluster.&lt;br /&gt;
&lt;br /&gt;
=== Setup cman ===&lt;br /&gt;
&lt;br /&gt;
Depending on your needs, this may be the only section you need to configure.&lt;br /&gt;
&lt;br /&gt;
First, install the cluster manager:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
yum install cman&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once installed, we need to create and fill in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|/etc/cluster/cluster.conf]]&amp;lt;/span&amp;gt; file. This is the core configuration file of our cluster, formatted as [[XML]].&lt;br /&gt;
&lt;br /&gt;
Please see: [[Two-Node Fedora 13 cluster.conf]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
vim /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The example in that article details many of the common options you will want to use. For a more complete discussion of this file, please run &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man 5 cluster.conf&amp;lt;/span&amp;gt;. Also, please be aware that some arguments can overlap between [[#Setup Corosync|corosync]] and this file.&lt;br /&gt;
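&lt;br /&gt;
For reference, the two-node case needs special handling in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt;: with only two votes available, [[quorum]] must be relaxed or neither node could ever form a quorate cluster on its own. A minimal sketch of the skeleton (the node and cluster names follow this HowTo; see the linked article for the full file):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;xml&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;?xml version=&amp;quot;1.0&amp;quot;?&amp;gt;&lt;br /&gt;
&amp;lt;cluster name=&amp;quot;an-cluster&amp;quot; config_version=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
	&amp;lt;!-- Tell cman this is a two-node cluster; one vote is enough to operate. --&amp;gt;&lt;br /&gt;
	&amp;lt;cman two_node=&amp;quot;1&amp;quot; expected_votes=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
	&amp;lt;clusternodes&amp;gt;&lt;br /&gt;
		&amp;lt;clusternode name=&amp;quot;an-node01.alteeve.com&amp;quot; nodeid=&amp;quot;1&amp;quot;/&amp;gt;&lt;br /&gt;
		&amp;lt;clusternode name=&amp;quot;an-node02.alteeve.com&amp;quot; nodeid=&amp;quot;2&amp;quot;/&amp;gt;&lt;br /&gt;
	&amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;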
&lt;br /&gt;
Once it&#039;s set up, we need to validate it. When you&#039;ve finished editing, be sure to run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
xmllint --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If there are errors, fix them. Once it is formatted properly, the contents of your &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file will be returned, followed by &amp;quot;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf validates&amp;lt;/span&amp;gt;&amp;quot;. Once this is the case, you can move on.&lt;br /&gt;
&lt;br /&gt;
A note: normally at this stage, you&#039;d use &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;chkconfig&amp;lt;/span&amp;gt; to enable &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; to start at boot. However, depending on how you use your cluster, you may need to modify the boot order. For now, let&#039;s leave it off until we&#039;ve got all the components enabled. This is particularly important if you will be using [[DRBD]], [[CLVM]] or [[Xen]].&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig cman off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To see if it&#039;s working, manually start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; on both nodes. I suggest having two terminal windows open on each node. In one, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;tail&amp;lt;/span&amp;gt;ing &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/var/log/messages&amp;lt;/span&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
clear; tail -f -n 0 /var/log/messages&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now, with both log files being watched, start &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt; on both nodes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: You MUST execute the following on both nodes within the time limit set by &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;post_join_delay&amp;lt;/span&amp;gt; (the default is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;6&amp;lt;/span&amp;gt; seconds). If you wait longer than this, the first node started will [[fence]] the other node, assuming you have fencing configured properly. If fencing is &#039;&#039;&#039;not&#039;&#039;&#039; configured properly, your first node will hang.&lt;br /&gt;
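&lt;br /&gt;
If starting both nodes by hand within six seconds is too tight, the delay can be raised in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt;. A sketch (the value of &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;30&amp;lt;/span&amp;gt; seconds is an arbitrary example; see &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;man fenced&amp;lt;/span&amp;gt; for details):&lt;br /&gt;
&amp;lt;source lang=&amp;quot;xml&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;!-- Inside the &amp;lt;cluster&amp;gt; element; wait 30 seconds for peers at join time before fencing. --&amp;gt;&lt;br /&gt;
&amp;lt;fence_daemon post_join_delay=&amp;quot;30&amp;quot;/&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;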
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
/etc/init.d/cman start&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Examine the log files to verify there were no errors. If there were, fix them. If not, fantastic!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Warning&#039;&#039;&#039;: This last step will forcefully reboot your nodes. Do not do this on a production cluster! If it must be done on a production server, stop all non-essential services and be sure you have a good backup, &#039;&#039;just in case&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Lastly, test that your fencing is configured and working properly. We do this by using a program called &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_node&amp;lt;/span&amp;gt;. With both nodes up and running &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, call a fence against the other node. Be sure to be watching the log files.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;From&#039;&#039;&#039; &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
fence_node an-node02.alteeve.com&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&amp;lt;source lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
fence an-node02.alteeve.com success&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If it worked, when the node comes back up, repeat the process from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; against &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node01.alteeve.com&amp;lt;/span&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=== Setup Corosync ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note&#039;&#039;&#039;: In our cluster, we will not need to configure Corosync.&lt;br /&gt;
&lt;br /&gt;
When using &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cman&amp;lt;/span&amp;gt;, as we are here, Corosync&#039;s configuration file is ignored.&lt;br /&gt;
&lt;br /&gt;
Most of the settings available in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 corosync.conf|corosync.conf]]&amp;lt;/span&amp;gt; can be found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;[[Two-Node Fedora 13 cluster.conf|cluster.conf]]&amp;lt;/span&amp;gt; file. Please see the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; article for a comprehensive list of what is and is not supported.&lt;br /&gt;
&lt;br /&gt;
If you want to configure corosync and not use cman, please see the following article:&lt;br /&gt;
&lt;br /&gt;
* [[Two-Node Fedora 13 corosync.conf]]&lt;br /&gt;
&lt;br /&gt;
=== Setup Pacemaker ===&lt;br /&gt;
&lt;br /&gt;
To do.&lt;br /&gt;
&lt;br /&gt;
= Fencing =&lt;br /&gt;
&lt;br /&gt;
Before proceeding with the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file, you must understand what fencing is, how it is used in Red Hat/CentOS clusters and why it is so important.&lt;br /&gt;
&lt;br /&gt;
* The Cluster Admin&#039;s Mantra:&lt;br /&gt;
** &#039;&#039;&#039;The only thing you don&#039;t know is what you don&#039;t know&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Just because one node loses communication with another node, it &#039;&#039;&#039;cannot&#039;&#039;&#039; be assumed that the silent node is dead!&lt;br /&gt;
&lt;br /&gt;
== What is it? ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Fencing&amp;quot; is the act of isolating a malfunctioning node. The goal is to prevent a &#039;&#039;&#039;split-brain&#039;&#039;&#039; condition, where two nodes each think the other is dead and continue to use a shared resource. When this happens, file system corruption is almost guaranteed. Another dangerous scenario: one node pauses while writing to a disk, the other node decides it is dead and starts to replay the journal, then the first node recovers and completes its write. The results would be equally disastrous. If you are lucky enough to not lose the shared file system, you will be faced with the task of determining what data got written to which node, merging that data and/or overwriting the node you trust the least. This &#039;best case&#039; is still pretty lousy.&lt;br /&gt;
&lt;br /&gt;
Fencing, isolating a node from altering shared disks, can be accomplished in a couple of ways:&lt;br /&gt;
&lt;br /&gt;
* Power&lt;br /&gt;
** Power fencing is where a device is used to cut the power to a malfunctioning node. This is probably the most common type.&lt;br /&gt;
* Blocking&lt;br /&gt;
** Blocking is often implemented at the network level. This type of fencing leaves the node alone, but disconnects it from the storage network. Often this is done by a switch which prevents traffic coming from the fenced node.&lt;br /&gt;
&lt;br /&gt;
With power fencing, the term used is &amp;quot;STONITH&amp;quot;; literally, &#039;&#039;&#039;S&#039;&#039;&#039;hoot &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;O&#039;&#039;&#039;ther &#039;&#039;&#039;N&#039;&#039;&#039;ode &#039;&#039;&#039;I&#039;&#039;&#039;n &#039;&#039;&#039;T&#039;&#039;&#039;he &#039;&#039;&#039;H&#039;&#039;&#039;ead. Picture it like an old west duel. If one node is dead, the other node will win the duel by default and the dead node will just be shot again. When both nodes are alive, however, the faster node will win and will &amp;quot;kill&amp;quot; (power off or reset) the slower node before it has a chance to fire. Once the duel is over, the surviving node can access the shared resource confident that it is the only one working on it.&lt;br /&gt;
&lt;br /&gt;
== Misconception ==&lt;br /&gt;
&lt;br /&gt;
It is a &#039;&#039;&#039;very&#039;&#039;&#039; common mistake to ignore fencing when first starting to learn about clustering. Often people think &#039;&#039;&amp;quot;It&#039;s just for production systems, I don&#039;t need to worry about it yet because I don&#039;t care what happens to my test cluster.&amp;quot;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Wrong!&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For the most practical of reasons: the cluster software will block all I/O transactions when it can&#039;t guarantee that a fence operation succeeded. The result is that your cluster will essentially &amp;quot;lock up&amp;quot;. Likewise, [[cman]] and related daemons will fail if they can&#039;t find a fence agent to use.&lt;br /&gt;
&lt;br /&gt;
Secondly: testing our cluster will involve inducing errors. Without proper fencing, there is a high probability that our shared file system will be corrupted. That would force you to start over, making your learning take a lot longer than it needs to.&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
How you implement fencing depends largely on what kind of fence device(s) you have. This is where I need help from you, dear reader. I only have access to [[Node Assassin]] fence devices on my lab&#039;s cluster. If you have other fence devices, like [[IPMI]], [[DRAC]] or so on, please let me know so that I can expand this section.&lt;br /&gt;
&lt;br /&gt;
=== Implementing Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
In Red Hat&#039;s cluster software, the fence device(s) are configured in the main &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;/etc/cluster/cluster.conf&amp;lt;/span&amp;gt; cluster configuration file. This configuration is then acted on by the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon. We&#039;ll cover the details of the [[cluster.conf]] file in a moment.&lt;br /&gt;
&lt;br /&gt;
When the cluster determines that a node needs to be fenced, the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; daemon will consult the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file for information on how to access the fence device. Given this &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; snippet:&lt;br /&gt;
&amp;lt;source lang=&amp;quot;xml&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;cluster name=&amp;quot;an-cluster&amp;quot; config_version=&amp;quot;1&amp;quot;&amp;gt;&lt;br /&gt;
	&amp;lt;clusternodes&amp;gt;&lt;br /&gt;
		&amp;lt;clusternode name=&amp;quot;an-node02.alteeve.com&amp;quot; nodeid=&amp;quot;2&amp;quot;&amp;gt;&lt;br /&gt;
			&amp;lt;fence&amp;gt;&lt;br /&gt;
				&amp;lt;method name=&amp;quot;node_assassin&amp;quot;&amp;gt;&lt;br /&gt;
					&amp;lt;device name=&amp;quot;motoko&amp;quot; port=&amp;quot;02&amp;quot; action=&amp;quot;off&amp;quot;/&amp;gt;&lt;br /&gt;
				&amp;lt;/method&amp;gt;&lt;br /&gt;
			&amp;lt;/fence&amp;gt;&lt;br /&gt;
		&amp;lt;/clusternode&amp;gt;&lt;br /&gt;
	&amp;lt;/clusternodes&amp;gt;&lt;br /&gt;
	&amp;lt;fencedevices&amp;gt;&lt;br /&gt;
		&amp;lt;fencedevice name=&amp;quot;motoko&amp;quot; agent=&amp;quot;fence_na&amp;quot; quiet=&amp;quot;true&amp;quot;&lt;br /&gt;
		ipaddr=&amp;quot;motoko.alteeve.com&amp;quot; login=&amp;quot;motoko&amp;quot; passwd=&amp;quot;secret&amp;quot;&amp;gt;&lt;br /&gt;
		&amp;lt;/fencedevice&amp;gt;&lt;br /&gt;
	&amp;lt;/fencedevices&amp;gt;&lt;br /&gt;
&amp;lt;/cluster&amp;gt;&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If the cluster manager determines that the node &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt; needs to be fenced, it looks at the first (and only, in this case) &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;method&amp;gt;&amp;lt;/span&amp;gt; entry in that node&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fence&amp;gt;&amp;lt;/span&amp;gt; section and reads its &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;device&amp;gt;&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;, which is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; in this case. It then looks in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fencedevices&amp;gt;&amp;lt;/span&amp;gt; section for the device with the matching &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;name&amp;lt;/span&amp;gt;. From there, it gets the information needed to find and access the fence device. Once it connects to the fence device, it passes along the options set in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;an-node02.alteeve.com&amp;lt;/span&amp;gt;&#039;s &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;device&amp;gt;&amp;lt;/span&amp;gt; entry.&lt;br /&gt;
&lt;br /&gt;
So in this example, &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; looks up the details on the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;motoko&amp;lt;/span&amp;gt; Node Assassin fence device. It calls the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt; program, called a fence agent, and passes the following arguments:&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr=motoko.alteeve.com&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login=motoko&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd=secret&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;quiet=true&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port=2&amp;lt;/span&amp;gt;&lt;br /&gt;
* &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action=off&amp;lt;/span&amp;gt;&lt;br /&gt;
&lt;br /&gt;
How the fence agent acts on these arguments varies depending on the fence device itself. In general terms, the &#039;&amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt;&#039; fence agent will create a connection to the device at the IP address (or resolvable name, as in this case) specified in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;ipaddr&amp;lt;/span&amp;gt; argument. Once connected, it will authenticate using the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;login&amp;lt;/span&amp;gt; and &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;passwd&amp;lt;/span&amp;gt; arguments. Once authenticated, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;port&amp;lt;/span&amp;gt; to act on, which could be a power jack, a power or reset button, a network switch port and so on. Finally, it tells the device what &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;action&amp;lt;/span&amp;gt; to take.&lt;br /&gt;
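&lt;br /&gt;
The hand-off from &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fenced&amp;lt;/span&amp;gt; to a fence agent is simple: the arguments above are written to the agent&#039;s standard input, one &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;key=value&amp;lt;/span&amp;gt; pair per line. A sketch that just builds and prints that payload (piping it into &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;fence_na&amp;lt;/span&amp;gt; would perform the actual fence, so don&#039;t do that casually):&lt;br /&gt;

```shell
#!/bin/sh
# Build the stdin payload that fenced would pass to the fence agent.
# These values mirror the cluster.conf example above.
build_fence_args() {
	printf '%s\n' \
		'ipaddr=motoko.alteeve.com' \
		'login=motoko' \
		'passwd=secret' \
		'quiet=true' \
		'port=2' \
		'action=off'
}

# Show the payload; to actually fence, you would run:
#   build_fence_args | fence_na
build_fence_args
```

Most fence agents in the Red Hat cluster suite accept this same stdin convention, which makes it easy to test an agent by hand when debugging.&lt;br /&gt;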
&lt;br /&gt;
Once the device completes, it returns a success or failed message. If the first attempt fails, the fence agent will try the next &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;&amp;lt;fence&amp;gt;&amp;lt;/span&amp;gt; method, if a second exists. It will keep trying fence devices in the order they are found in the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;cluster.conf&amp;lt;/span&amp;gt; file until it runs out of devices. If it fails to fence the node, most daemons will &amp;quot;block&amp;quot;, that is, lock up and stop responding until the issue is resolved. The logic for this is that a locked up cluster is better than a corrupted one.&lt;br /&gt;
&lt;br /&gt;
If any of the fence devices succeed though, the cluster will know that it is safe to proceed and will reconfigure the cluster without the defective node.&lt;br /&gt;
&lt;br /&gt;
== Fence Devices ==&lt;br /&gt;
&lt;br /&gt;
Many major [[OEM]]s have their own remote management devices that can serve as fence devices. Examples are [http://dell.ca Dell]&#039;s &#039;DRAC&#039; (Dell Remote Access Controller), [http://hp.ca HP]&#039;s &#039;iLO&#039; (Integrated Lights-Out), [http://ibm.ca IBM]&#039;s &#039;RSA&#039; (Remote Supervisor Adapter), [http://sun.ca Sun]&#039;s &#039;SSP&#039; (System Service Processor) and so on. Smaller manufacturers implement remote management via [[IPMI]], the Intelligent Platform Management Interface.&lt;br /&gt;
&lt;br /&gt;
With the above devices, fencing is implemented via a built-in or integrated device inside the server. These devices are usually accessible even when the host server is powered off or hard-locked. Via these devices, the host server can be powered off, reset and powered on remotely, regardless of the state of the host server.&lt;br /&gt;
&lt;br /&gt;
Block fencing is possible when the device connecting a node to shared resources, like a Fibre Channel SAN switch, provides a method of logically &amp;quot;unplugging&amp;quot; a defective node from the shared resource, leaving the node itself alone.&lt;br /&gt;
&lt;br /&gt;
=== IPMI ===&lt;br /&gt;
&lt;br /&gt;
[[IPMI]] is perhaps the most common method used for fencing nodes in a cluster. It requires having a system board with an IPMI baseboard management controller (BMC). IPMI is often the underlying technology used by many OEM-branded remote access tools. If you don&#039;t see a specific fence agent for your server&#039;s remote access application, experiment with generic IPMI tools.&lt;br /&gt;
&lt;br /&gt;
To start, we need to install the IPMI user software:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Typical packages on Fedora; adjust for your hardware&#039;s IPMI stack.&lt;br /&gt;
yum install OpenIPMI ipmitool&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Node Assassin ===&lt;br /&gt;
&lt;br /&gt;
A cheap alternative is the [[Node Assassin]], an open-hardware, open-source fence device. It was built to allow the use of commodity system boards that lack the remote management support found on more expensive, server-class hardware.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Full Disclosure&#039;&#039;&#039;: Node Assassin was created by me, with much help from others, for this paper.&lt;br /&gt;
&lt;br /&gt;
= Clean Up =&lt;br /&gt;
&lt;br /&gt;
Some daemons may have been installed or dragged in during the setup of the cluster. Most notable is &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;heartbeat&amp;lt;/span&amp;gt;, which is not compatible with [[RHCS]]. The following command will disable various daemons from starting at boot.&lt;br /&gt;
&lt;br /&gt;
The command is safe to run even when some of these daemons aren&#039;t installed or have already been removed. Of course, skip the ones you actually want to use. This assumes that the nodes have been built according to this HowTo.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
chkconfig heartbeat off; chkconfig iscsid off; chkconfig iptables off; chkconfig ip6tables off&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= So You Have a Cluster - Now What? =&lt;br /&gt;
&lt;br /&gt;
Building the cluster infrastructure was pretty easy, wasn&#039;t it?&lt;br /&gt;
&lt;br /&gt;
But what will you &#039;&#039;&#039;&#039;&#039;do&#039;&#039;&#039;&#039;&#039; with it?&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;width: 75%; text-align: center; padding: 10px; border-spacing: 15px;&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!style=&amp;quot;width: 100%; border: 1px solid #dfdfdf;&amp;quot; colspan=&amp;quot;3&amp;quot;|Choose Your Own &amp;lt;span style=&amp;quot;text-decoration: line-through; color: #7f7f7f;&amp;quot;&amp;gt;Adventure&amp;lt;/span&amp;gt; Cluster&lt;br /&gt;
|-&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|style=&amp;quot;width: 34%; border: 1px solid #dfdfdf;&amp;quot;| [[Two Node Fedora 13 Cluster - Xen-Based Virtual Machine Host on DRBD+CLVM|Xen-Based Virtual Machine Host on DRBD+CLVM]]&amp;lt;br /&amp;gt;High Availability VM Host&lt;br /&gt;
|style=&amp;quot;width: 33%; border: 1px solid #dfdfdf;&amp;quot;| &amp;amp;nbsp;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Thanks =&lt;br /&gt;
&lt;br /&gt;
* A &#039;&#039;&#039;huge&#039;&#039;&#039; thanks to [http://iplink.net Interlink Connectivity]! They hire me as a contractor and have allowed me to extend these docs while working on their clusters. Development of these How-Tos would be much slower if not for them. If you need hosting or colo services, drop them a line. Their website is a bit out of date though, so disregard the prices. :)&lt;br /&gt;
* To &#039;&#039;&#039;sdake&#039;&#039;&#039; of [http://corosync.org corosync] for helping me sort out the &#039;&#039;&#039;plock&#039;&#039;&#039; component and corosync in general.&lt;br /&gt;
* To &#039;&#039;&#039;Angus Salkeld&#039;&#039;&#039; for helping me nail down the Corosync and OpenAIS differences.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013922.html HJ Lee] from the OpenAIS list for helping me understand the mechanisms controlling the Redundant Ring Protocol&#039;s failure detection types.&lt;br /&gt;
* To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013925.html Steven Dake] for clarifying the &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;to_x&amp;lt;/span&amp;gt; vs. &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;logoutput: x&amp;lt;/span&amp;gt; arguments in &amp;lt;span class=&amp;quot;code&amp;quot;&amp;gt;openais.conf&amp;lt;/span&amp;gt;.&lt;br /&gt;
* To [http://dk.linkedin.com/in/fabbione Fabio Massimo Di Nitto] for helping me get caught up with clustering and VMs.&lt;br /&gt;
&lt;br /&gt;
{{footer}}&lt;/div&gt;</summary>
		<author><name>Facade</name></author>
	</entry>
</feed>