Abandoned - 2-Node EL5 Cluster: Difference between revisions
| m moved 2-Node EL5 Cluster to Abandoned - 2-Node EL5 Cluster | |||
| (133 intermediate revisions by the same user not shown) | |||
| Line 1: | Line 1: | ||
| {{howto_header}} | {{howto_header}} | ||
| This  | {{warning|1=This is an older article and should only be used for historical reference. The only up to date clustering tutorial is: [[Red Hat Cluster Service 2 Tutorial]].}} | ||
| ---- | |||
| I've restarted this tutorial from scratch in the [[Red Hat Cluster Service 2 Tutorial]]. This tutorial is for reference only and will be deleted once I finish the new tutorial. Please do not follow this tutorial, but it should be ok to use as a reference. | |||
| - '''Mar. 14, 2011'''. | |||
| ---- | |||
| Related articles: | |||
| * [[Node Assassin]] | |||
| ** The prototype from this project is complete and is used as the fence device in this article. | |||
| = Progress = | |||
| '''May 25, 2010''': Happy Towel Day! Despite the lack of updates right here, a lot has been happening. I gave my first talk based on this paper to TLUG and am now working to expand it into a full [[Cluster Workshop 2010|cluster workshop]]. I'm still sorting out where and how I'll split this talk off, but until then, this should be useful now up to the working cluster stage. | |||
| '''Mar. 26, 2010''': I've been side-tracked this week getting the next version of [[Node Assassin]]'s hardware done. This is the version that will implement independent power sensing to tell when a node is truly on or off. Once done, I'll be back to finish this paper. | |||
| '''Mar. 17, 2010''': Sorted out and mostly completed the LVM section. Just need to sort out some sample cluster-aware formatting examples before moving on to the Xen virtual machine provisioning.  | |||
| '''Mar. 15, 2010''': Finished working through the paper as it is so far. Tomorrow or Tuesday I will begin expanding on it. | |||
| '''Mar. 14, 2010''': Happy π-day all! I've moved the body of this paper to [[2-Node CentOS5 Cluster working|here]] and will start moving pieces back. I'm switching over to more of "choose your own adventure" style, given the various possible configurations available in clustering. I won't pretend to cover them all, but I will at least be able to insert pointers to different layouts at points where you may want to branch out from this document's path. Also, I have changed my approach to providing iSCSI/SAN and thus will remove discussion of that component until the [[3+ Node CentOS5 Cluster + SoftSAN]] paper where this paper will act as a prerequisite. | |||
| = Overview = | = Overview = | ||
| This  | This paper has two goals; | ||
| # How to assemble the simplest cluster possible, a '''2 Node Cluster''', which you can then expand on for your own needs. | |||
| # How to create a "floating" virtual machine that can move between the two nodes in the event of a node failure, maximizing up time. | |||
| == Prerequisites == | |||
| =  | It is expected that you are already comfortable with the Linux command line, specifically <span class="code">[[bash]]</span>, and that you are familiar with general administrative tasks in Red Hat based distributions, specifically [[CentOS]]. You will also need to be comfortable using editors like [[vim]], [[nano]] or similar. This paper uses <span class="code">vim</span> in examples. Simply substitute your favourite editor in its place. | ||
| You are also expected to be comfortable with networking concepts. You will be expected to understand TCP/IP, [[multicast]], broadcast, subnets and netmasks, routing and other relatively basic networking concepts. Please take the time to become familiar with these concepts before proceeding. | |||
| This said, where feasible, as much detail as is possible will be provided. For example, all configuration file locations will be shown and functioning sample files will be provided. | |||
| == Platform == | |||
| This paper will implement the [[Red Hat]] Cluster Suite using the [[CentOS]] binary-compatible distribution. This paper uses the [[x86_64]] repositories; however, if you are on an [[i386]] (32 bit) system, you should be able to follow along fine. Simply replace <span class="code">x86_64</span> with <span class="code">i386</span> in package names. | |||
| You can either download the stock CentOS 5-series DVD ISO (currently at version <span class="code">5.4</span> which is used in this document), or you can try out the alpha [[#AN!Cluster Install|AN!Cluster Install]] DVD. (4.3GB iso). If you use the latter, please test it out on a development or test cluster. If you have any problems with the <span class="code">AN!Cluster</span> variant CentOS distro, please [[Digimer|contact me]] and let me know what your trouble was. | |||
| == Focus == | |||
| Clusters can serve to solve three problems; '''Reliability''', '''Performance''' and '''Scalability'''. | |||
| The focus of the cluster described in this paper is primarily '''reliability'''. Second to this, '''scalability''' will be the priority, leaving '''performance''' to be addressed only when it does not impact the first two criteria. This is not to indicate that performance is not a valid priority; it simply isn't the priority of this paper. | |||
| == Goal == | |||
| At the end of this paper, you should have a fully functioning two-node array capable of hosting a "floating" virtual machine. That is, a virtual machine that exists on one node and can be easily moved to the other node with minimal effort and down time. This should conclude with a solid foundation for adding more virtual servers up to the limit of your cluster's resources. | |||
| This paper should also serve to show how to build the foundation of any other cluster configuration. This paper has a core focus of introducing the main issues that come with clustering and hopes to serve as a foundation for any cluster configuration outside the scope of this paper. | |||
| =  | = Begin =   | ||
| Let's begin! | |||
| ==  | == Hardware == | ||
| We will need two physical servers each with the following hardware: | |||
| * One or more multi-core [[CPU]]s with Virtualization support. | |||
| * Three network cards; At least one should be gigabit or faster. | |||
| * One or more hard drives. | |||
| This paper uses the following hardware: | |||
| * ASUS M4A78L-M | * ASUS [http://support.asus.com/search/search.aspx?keyword=m4a78l-m&SLanguage=en-us M4A78L-M] | ||
| * AMD Athlon II x2 250   | * AMD Athlon II x2 250   | ||
| * 2GB Kingston DDR2 KVR800D2N6K2/4G (split between the two nodes) | * 2GB Kingston DDR2 KVR800D2N6K2/4G (4GB kit split between the two nodes) | ||
| *  | * 1x Intel 82540 PCI NIC | ||
| * 1x D-Link DGE-560T | |||
| This is not an endorsement of the above hardware. I bought what was within my budget that would serve the purposes of creating this document. What you purchase shouldn't matter, so long as the minimum requirements are met. | |||
| ==  | == OS Install == | ||
| This  | Start with a stock CentOS 5.x install. This How-To uses CentOS 5.4 x86_64, however it should be fairly easy to adapt to other CentOS 5*, [[RHEL]]5 or other RHEL5-based distributions. | ||
| These are the sample kickstart scripts used by this paper. Be sure to set your own password string and network settings. | |||
| '''''Warning'''''! These kickstart scripts '''''will erase your hard drive'''''! Adapt them, don't blindly use them. | |||
| Generic cluster node kickstart scripts. | |||
| * [[an-node01.ks]] | |||
| * [[an-node02.ks]] | |||
| * [[c5_generic_node.ks]] - New kickstart for automatically detecting and configuring storage. | |||
| == AN!Cluster Install == | |||
| If you are feeling brave, below is a link to a custom install DVD that contains the kickstart scripts to setup nodes and an <span class="code">an-cluster</span> directory with all the configuration files. | |||
| * Download the custom '''AN!Cluster v0.1.006''' Install DVD. (4.5[[GiB]] iso). (Currently disabled - Reworking for F13) | |||
| == Post OS Install == | == Post OS Install == | ||
| Line 154: | Line 100: | ||
| Once the OS is installed, we need to do some ground work.   | Once the OS is installed, we need to do some ground work.   | ||
| # Setup networking. | |||
| # Limit [[dom0]]'s memory. | # Limit [[dom0]]'s memory. | ||
| # Change the default run-level. | # Change the default run-level. | ||
| # Change when '''xend''' starts. | |||
| #  | |||
| === Post-Install Network Configuration === | |||
| This cluster uses Xen, which fairly dramatically impacts networking. Terms you need to be familiar with are: | |||
| * dom0 | |||
| ** This is the "first" virtual machine with special access to the underlying hardware. This looks like the host operating system but is in fact just another virtual server running under Xen. This is also the virtual machine that can directly see the Xen networking infrastructure. | |||
| * domu | |||
| ** These are the virtual servers setup in and managed by the dom0 virtual machine. These are what most people think of when talking about "virtual servers" under Xen. | |||
| ==== Ethernet Devices and Subnets ==== | |||
| The most important thing to do after the install is to identify which <span class="code">ethX</span> device matches which network card. This is important in two cases; | |||
| * The fastest network card should be allocated to the [[DRBD]] subnet. | |||
| * If you have [[IPMI]] piggy-backed on a physical network card, it should be allocated to the back-channel subnet. | |||
| This paper has the following configuration: | |||
| * <span class="code">eth0</span>; Internet-polluted subnet. | |||
| * <span class="code">eth1</span>; [[DRBD]] subnet. | |||
| * <span class="code">eth2</span>; Back-channel subnet. | |||
| To change which <span class="code">ethX</span> device maps to which ethernet card, please see: | |||
| * [[Changing the ethX to Ethernet Device Mapping in Red Hat/CentOS]] | |||
| If you are unfamiliar with how networking works in Xen, please read this article: | |||
| * [[Networking in Xen]] | |||
| ==== Choosing your Subnets ==== | |||
| There will be three subnets in our two node cluster; | |||
| * Internet-polluted subnet; <span class="code">192.168.1.0/24</span> | |||
| ** This subnet will ultimately be directly accessible only by the firewall virtual server. All other virtual machines and the nodes' dom0s will access the internet via the firewall for security reasons. During setup though, the 'dom0' servers will directly access this subnet. | |||
| * [[DRBD]] subnet; <span class="code">10.0.0.0/24</span> | |||
| ** Only the two 'dom0' servers will have access to this subnet. It is used for DRBD communication and as a backup for the totem ring protocol. | |||
| * Back-channel; <span class="code">10.0.1.0/24</span> | |||
| ** This is the private subnet used for communication between the '''dom0''' and '''domU''' virtual servers. This subnet will have no direct access to the internet. This paper will use the <span class="code">10.0.1.0/24</span> subnet for this  | |||
| I like to assign the same last octet to a given node's subnets. This helps me keep track of which node I am working with at any given time. Here is how I setup my two nodes: | |||
| * an-node01 | |||
| ** eth0: 192.168.1.71 | |||
| ** eth1: 10.0.0.71 | |||
| ** eth2: 10.0.1.71 | |||
| * an-node02 | |||
| ** eth0: 192.168.1.72 | |||
| ** eth1: 10.0.0.72 | |||
| ** eth2: 10.0.1.72 | |||
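The "same last octet" scheme above is easy to generate rather than type by hand. This is only a small sketch; the function name <span class="code">node_ips</span> is made up for this example, and the three subnets are simply the ones chosen in this paper.

```bash
#!/bin/bash
# Hypothetical helper: print the three addresses for a node, given its last octet.
# The subnets are the ones chosen in this paper; change them to suit your network.
node_ips() {
    local octet="$1"
    printf 'eth0: 192.168.1.%s\n' "$octet"   # Internet-polluted subnet
    printf 'eth1: 10.0.0.%s\n' "$octet"      # DRBD subnet
    printf 'eth2: 10.0.1.%s\n' "$octet"      # Back-channel subnet
}

# Addresses for an-node01:
node_ips 71
```

Calling it with <span class="code">72</span> gives the matching addresses for the second node.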
| ==== /etc/hosts ==== | |||
| Some applications expect to be able to call nodes by their name. To accommodate this, and to ensure that inter-node communication takes place on the back-channel subnet, we add the following to the <span class="code">/etc/hosts</span> file: | |||
| <source lang="bash"> | |||
| vim /etc/hosts | |||
| </source> | |||
| <source lang="text"> | |||
| # Back-channel IP to name mapping. | |||
| 10.0.1.71	an-node01 an-node01.alteeve.com | |||
| 10.0.1.72	an-node02 an-node02.alteeve.com | |||
| </source> | |||
| '''Note''': Delete any pre-existing entries matching the name returned by <span class="code">uname -n</span>. There is a good chance there will be an entry that resolves to <span class="code">127.0.0.1</span> which would cause problems later. | |||
| Obviously, adapt the names and IPs to match your nodes and subnets. The only critical thing is to make sure that the name returned by <span class="code">uname -n</span> is resolvable to the back-channel subnet. I like to add a short-form name for convenience. | |||
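A quick way to confirm that the node's own name maps to the back-channel subnet is to look it up in the hosts file directly. The sketch below is an illustration, not part of the original setup; <span class="code">hosts_ip_for</span> and <span class="code">on_back_channel</span> are hypothetical helper names, and the <span class="code">10.0.1.</span> prefix assumes the back-channel subnet used in this paper.

```bash
#!/bin/bash
# Hypothetical helper: print the first IP that a hosts-format file maps to a name.
# Usage: hosts_ip_for NAME [FILE]   (FILE defaults to /etc/hosts)
hosts_ip_for() {
    local name="$1" file="${2:-/etc/hosts}"
    awk -v n="$name" '
        { sub(/#.*/, "") }    # drop comments; awk re-splits the fields afterwards
        { for (i = 2; i <= NF; i++) if ($i == n) { print $1; exit } }
    ' "$file"
}

# Hypothetical helper: succeed only if the IP is in this paper's back-channel subnet.
on_back_channel() {
    case "$1" in
        10.0.1.*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example use on a node:
#   ip=$(hosts_ip_for "$(uname -n)")
#   on_back_channel "$ip" || echo "$(uname -n) resolves to '$ip', not the back-channel"
```

If the name still resolves to <span class="code">127.0.0.1</span>, the stale entry mentioned in the note above has not been removed.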
| ==== iptables ==== | |||
| Be sure to flush netfilter tables and disable <span class="code">iptables</span> and <span class="code">ip6tables</span> from starting on your nodes. This is because the 'dom0' servers will not be connected directly to the Internet and we want to minimize the chance of an errant <span class="code">iptables</span> rule messing up our configuration. If, before launch, you wish to implement a firewall, feel free to do so but be sure to thoroughly test your cluster to ensure no problems were introduced. | |||
| <source lang="bash"> | |||
| chkconfig --level 2345 iptables off | |||
| /etc/init.d/iptables stop | |||
| chkconfig --level 2345 ip6tables off | |||
| /etc/init.d/ip6tables stop | |||
| </source> | |||
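To confirm that neither firewall service will come back at boot, the output of <span class="code">chkconfig --list</span> can be condensed into the run levels where a service is still on. This is a sketch only; <span class="code">enabled_runlevels</span> is a hypothetical helper that reads the usual <span class="code">name 0:off 1:off 2:on ...</span> layout on standard input.

```bash
#!/bin/bash
# Hypothetical helper: read "chkconfig --list" lines on stdin and print, for each
# service, the run levels in which it is still switched on ("none" if there are none).
enabled_runlevels() {
    awk '{
        on = ""
        for (i = 2; i <= NF; i++)
            if ($i ~ /^[0-6]:on$/) on = on substr($i, 1, 1)
        print $1 ": " (on == "" ? "none" : on)
    }'
}

# Example use on a node:
#   chkconfig --list iptables  | enabled_runlevels
#   chkconfig --list ip6tables | enabled_runlevels
```

After the commands above, both services should report <span class="code">none</span>.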
| === Limit dom0's Memory === | === Limit dom0's Memory === | ||
| Normally, dom0 will claim and use memory not allocated to virtual machines. This can cause trouble  | Normally, 'dom0' will claim and use memory not allocated to virtual machines. This can cause trouble if, for example, you've moved a [[VM]] off of a node and then want to move it or another VM back. For a period of time, the node will claim that there is not enough free memory for the migration. By setting a hard limit of dom0's memory usage, this scenario won't happen and you will not need to delay migrations. | ||
| To do this, add <span class="code">dom0_mem=512M</span> to the Xen kernel image's first <span class="code">module</span> line in [[grub]]. For example, you should have a line like: | To do this, add <span class="code">dom0_mem=512M</span> to the <span class="code">kernel</span> line that loads the Xen hypervisor (<span class="code">xen.gz</span>) in [[grub]]. It is a hypervisor option, so it does not belong on the <span class="code">module</span> lines that load the dom0 kernel and initrd. For example, you should have an entry like this in your grub configuration file: | ||
| <source lang="bash"> | |||
| vim /boot/grub/menu.lst | |||
| </source> | |||
| <source lang="text"> | <source lang="text"> | ||
| title CentOS (2.6.18-164. | title CentOS (2.6.18-164.15.1.el5xen) | ||
| 	root (hd0,0) | |||
| 	kernel /xen.gz-2.6.18-164.15.1.el5 dom0_mem=512M | |||
| 	module /vmlinuz-2.6.18-164.15.1.el5xen ro root=/dev/an-lvm01/lv01 rhgb quiet | |||
| 	module /initrd-2.6.18-164.15.1.el5xen.img | |||
| </source> | </source> | ||
| You can change the '<span class="code">512M</span>' with the amount of RAM you want to allocate to dom0. | You can replace '<span class="code">512M</span>' with the amount of RAM you want to allocate to dom0. Note that if you used the AN!Cluster install DVD or the AN!Cluster kickstart files, this should already be set for you. | ||
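Because the grub file is easy to get wrong, it can help to check what limit is actually configured before rebooting. The following is a minimal sketch; <span class="code">dom0_mem_setting</span> is a hypothetical helper, and it only reports the first uncommented <span class="code">dom0_mem=</span> option it finds, not which boot entry it belongs to.

```bash
#!/bin/bash
# Hypothetical helper: print the first active dom0_mem option in a grub config.
# Usage: dom0_mem_setting [FILE]   (FILE defaults to /boot/grub/menu.lst)
dom0_mem_setting() {
    local file="${1:-/boot/grub/menu.lst}"
    grep -v '^[[:space:]]*#' "$file" |          # skip commented-out lines
        grep -o 'dom0_mem=[^[:space:]]*' |      # keep only the option itself
        head -n 1
}

# Example use on a node:
#   dom0_mem_setting
```

An empty result means no limit is set and dom0 will keep claiming unallocated memory.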
| '''''REMEMBER'''''! | '''''REMEMBER'''''! | ||
| Line 181: | Line 202: | ||
| === Change the Default Run-Level === | === Change the Default Run-Level === | ||
| If you don't plan to work on your nodes directly, it makes sense to switch the default run | If you don't plan to work on your nodes directly, it makes sense to switch the default run level from <span class="code">5</span> to <span class="code">3</span>. This prevents Gnome from starting at boot, thus freeing up a lot of memory and system resources and reducing the possible attack vectors. | ||
| To do this, edit  | To do this, edit <span class="code">/etc/inittab</span>, change the <span class="code">id:5:initdefault:</span> line to <span class="code">id:3:initdefault:</span> and then switch to run level 3: | ||
| <source lang="bash"> | |||
| vim /etc/inittab | |||
| </source> | |||
| <source lang="text"> | |||
| id:3:initdefault: | |||
| </source> | |||
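To double-check the edit without opening the file again, the default run level can be read straight out of <span class="code">inittab</span>. This is a sketch under the assumption that the file uses the standard <span class="code">id:N:initdefault:</span> line; <span class="code">default_runlevel</span> is a hypothetical helper name.

```bash
#!/bin/bash
# Hypothetical helper: print the default run level configured in an inittab file.
# Usage: default_runlevel [FILE]   (FILE defaults to /etc/inittab)
default_runlevel() {
    local file="${1:-/etc/inittab}"
    awk -F: '
        /^[[:space:]]*#/      { next }            # ignore comment lines
        $3 == "initdefault"   { print $2; exit }  # field 2 holds the run level
    ' "$file"
}

# Example use on a node:
#   default_runlevel      # should print 3 after the change
```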
| <source lang="bash"> | <source lang="bash"> | ||
| init 3 | init 3 | ||
| Line 191: | Line 218: | ||
| ==== Change when xend starts ==== | ==== Change when xend starts ==== | ||
| Normally, <span class="code">xend</span> starts at priority <span class="code">98</span> in <span class="code">/etc/rc.X/</span>. This can cause problems with other packages that expect the network to be stable. This is because <span class="code">xend</span> takes all the networks down when it starts. To prevent these problems, we will move the <span class="code">xend</span> init script to  | Normally, <span class="code">xend</span> starts at priority <span class="code">98</span> in <span class="code">/etc/rcX.d/</span>. This can cause problems with other packages that expect the network to be stable, because <span class="code">xend</span> takes all the networks down when it starts. To prevent these problems, we will move the <span class="code">xend</span> init script to start priority <span class="code">11</span>. We'll also adapt the stop priority to <span class="code">89</span>, though this is less critical. | ||
| First, edit the actual initialization script and change the line '<span class="code"># chkconfig: 2345 98 01</span>' to '<span class="code">chkconfig: 2345 11 89</span>'. | First, edit the actual initialization script and change the line '<span class="code"># chkconfig: 2345 98 01</span>' to '<span class="code"># chkconfig: 2345 11 89</span>'. | ||
| <source lang="bash"> | <source lang="bash"> | ||
| vim /etc/init.d/xend | vim /etc/init.d/xend | ||
| </source> | </source> | ||
| <source lang="bash"> | <source lang="bash"> | ||
| # chkconfig: 2345 11 89 | # chkconfig: 2345 11 89 | ||
| </source> | </source> | ||
| Now, use <span class="code">chkconfig</span> to change the  | Now, use <span class="code">chkconfig</span> to apply the changes: | ||
| <source lang="bash"> | <source lang="bash"> | ||
| Line 216: | Line 236: | ||
| </source> | </source> | ||
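| You should now see the  | You should now see the symlinks <span class="code">/etc/rc3.d/S11xend</span> and <span class="code">/etc/rc0.d/K89xend</span>. Start links are created in the run levels where the service is on, and kill links in the remaining ones. | ||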
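Rather than checking each run-level directory by hand, the start (<span class="code">S</span>) and kill (<span class="code">K</span>) links for a service can be listed in one pass. This is a sketch only; <span class="code">init_links</span> is a hypothetical helper, and it assumes the usual <span class="code">rcN.d</span> directories under <span class="code">/etc</span>.

```bash
#!/bin/bash
# Hypothetical helper: list the S/K links for a service in every rcN.d directory.
# Usage: init_links SERVICE [BASE]   (BASE defaults to /etc)
init_links() {
    local service="$1" base="${2:-/etc}"
    local link
    for link in "$base"/rc[0-6].d/[SK][0-9][0-9]"$service"; do
        # An unmatched glob is left as literal text, so skip anything that is not there.
        [ -e "$link" ] || [ -L "$link" ] || continue
        printf '%s\n' "${link#"$base"/}"
    done
}

# Example use on a node:
#   init_links xend
```

Every <span class="code">S</span> entry should now carry priority <span class="code">11</span> and every <span class="code">K</span> entry priority <span class="code">89</span>; a leftover <span class="code">S98xend</span> means the change has not been applied.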
| = Initial Cluster Setup = | |||
| Before we get into specifics, let's take a minute to talk about the major components used in our cluster. | |||
| == Core Programs == | |||
| These are the core programs that may be new to you that we will use to build our cluster. | |||
| ===  | === OpenAIS/Corosync === | ||
| === Pacemaker === | |||
| === DRBD === | |||
| === LVM === | |||
| ===  | === Xen === | ||
| == dom0 Setup == | |||
| Some things, like cluster-aware [[LVM]], won't work until the cluster is setup. For this reason, we need to setup the cluster infrastructure before going any further. | |||
| If you didn't read up on how [[Networking in Xen]] works, now would be a very good time to do so. A lot of the networking from here on in will seem cryptic otherwise, when it's actually fairly straightforward. | |||
| === Adding New NICs to Xen === | |||
| ==  | By default, <span class="code">xend</span> only manages <span class="code">eth0</span>. We need to add <span class="code">eth2</span> and, if you wish, <span class="code">eth1</span>. Personally, I like to put all my ethernet devices under Xen's control for future flexibility, but this opens a possible security vector as a bridge is created for the DRBD subnet. Whether you add it or not I will leave to your preferences. | ||
| You can see which devices are under Xen's control by running <span class="code">ifconfig</span> and checking to see if there is a <span class="code">pethX</span> corresponding to each <span class="code">ethX</span> device. For example, here is what you would see if only <span class="code">eth0</span> was under Xen's control: | |||
| <source lang="bash"> | <source lang="bash"> | ||
| Line 279: | Line 281: | ||
|            inet6 addr: fe80::92e6:baff:fe71:82d8/64 Scope:Link |            inet6 addr: fe80::92e6:baff:fe71:82d8/64 Scope:Link | ||
|            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 |            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 | ||
|            RX packets: |            RX packets:121 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets: |            TX packets:97 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen:0   |            collisions:0 txqueuelen:0   | ||
|            RX bytes: |            RX bytes:20605 (20.1 KiB)  TX bytes:16270 (15.8 KiB) | ||
| eth1      Link encap:Ethernet  HWaddr 00: | eth1      Link encap:Ethernet  HWaddr 00:21:91:19:96:5A    | ||
|            inet addr:10.0.0.71  Bcast:10.0.0.255  Mask:255.255.255.0 |            inet addr:10.0.0.71  Bcast:10.0.0.255  Mask:255.255.255.0 | ||
|            inet6 addr: fe80:: |            inet6 addr: fe80::221:91ff:fe19:965a/64 Scope:Link | ||
|            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 |            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 | ||
|            RX packets: |            RX packets:45 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets: |            TX packets:53 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen:1000   |            collisions:0 txqueuelen:1000   | ||
|            RX bytes: |            RX bytes:9139 (8.9 KiB)  TX bytes:10259 (10.0 KiB) | ||
|            Interrupt:16  | |||
| eth2      Link encap:Ethernet  HWaddr 00: | eth2      Link encap:Ethernet  HWaddr 00:0E:0C:59:45:78    | ||
|            inet addr:10.0.1.71  Bcast:10.0.1.255  Mask:255.255.255.0 |            inet addr:10.0.1.71  Bcast:10.0.1.255  Mask:255.255.255.0 | ||
|            inet6 addr: fe80:: |            inet6 addr: fe80::20e:cff:fe59:4578/64 Scope:Link | ||
|            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 |            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 | ||
|            RX packets: |            RX packets:45 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets: |            TX packets:62 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen: |            collisions:0 txqueuelen:100  | ||
|            RX bytes: |            RX bytes:9790 (9.5 KiB)  TX bytes:11102 (10.8 KiB) | ||
|            Base address:0xec00 Memory:febe0000-fec00000  | |||
| lo        Link encap:Local Loopback    | lo        Link encap:Local Loopback    | ||
| Line 308: | Line 310: | ||
|            inet6 addr: ::1/128 Scope:Host |            inet6 addr: ::1/128 Scope:Host | ||
|            UP LOOPBACK RUNNING  MTU:16436  Metric:1 |            UP LOOPBACK RUNNING  MTU:16436  Metric:1 | ||
|            RX packets: |            RX packets:8 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets: |            TX packets:8 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen:0   |            collisions:0 txqueuelen:0   | ||
|            RX bytes: |            RX bytes:560 (560.0 b)  TX bytes:560 (560.0 b) | ||
| peth0     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF    | peth0     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF    | ||
|            inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link |            inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link | ||
|            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 |            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 | ||
|            RX packets: |            RX packets:126 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets: |            TX packets:110 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen:1000   |            collisions:0 txqueuelen:1000   | ||
|            RX bytes: |            RX bytes:20923 (20.4 KiB)  TX bytes:18352 (17.9 KiB) | ||
|            Interrupt:252 Base address:0x6000   |            Interrupt:252 Base address:0x6000   | ||
| Line 325: | Line 327: | ||
|            inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link |            inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link | ||
|            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 |            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 | ||
|            RX packets: |            RX packets:103 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets: |            TX packets:126 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen:0   |            collisions:0 txqueuelen:0   | ||
|            RX bytes: |            RX bytes:19306 (18.8 KiB)  TX bytes:20935 (20.4 KiB) | ||
| virbr0    Link encap:Ethernet  HWaddr 00:00:00:00:00:00    | virbr0    Link encap:Ethernet  HWaddr 00:00:00:00:00:00    | ||
| Line 335: | Line 337: | ||
|            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 |            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 | ||
|            RX packets:0 errors:0 dropped:0 overruns:0 frame:0 |            RX packets:0 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets: |            TX packets:49 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen:0   |            collisions:0 txqueuelen:0   | ||
|            RX bytes:0 (0.0 b)  TX bytes: |            RX bytes:0 (0.0 b)  TX bytes:9640 (9.4 KiB) | ||
| xenbr0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF    | xenbr0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF    | ||
|            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 |            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 | ||
|            RX packets: |            RX packets:148 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 |            TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen:0   |            collisions:0 txqueuelen:0   | ||
|            RX bytes: |            RX bytes:24256 (23.6 KiB)  TX bytes:0 (0.0 b) | ||
| </source> | </source> | ||
| You'll notice that there is no <span class="code">peth1</span> or <span class="code">peth2</span> device, nor their associated virtual devices or bridges | You'll notice that there is no <span class="code">peth1</span> or <span class="code">peth2</span> device, nor their associated virtual devices or bridges. | ||
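The same check can be scripted instead of read off by eye. The sketch below takes <span class="code">ifconfig</span> output on standard input and prints every <span class="code">ethX</span> device that has no matching <span class="code">pethX</span>; <span class="code">unbridged_nics</span> is a hypothetical helper name, and it assumes the classic layout where interface names start in the first column.

```bash
#!/bin/bash
# Hypothetical helper: read ifconfig output on stdin and print each ethX device
# that has no corresponding pethX device, i.e. is not yet under Xen's control.
unbridged_nics() {
    awk '
        /^[a-z]/ { seen[$1] = 1 }     # interface names start in column one
        END {
            for (dev in seen)
                if (dev ~ /^eth[0-9]+$/ && !(("p" dev) in seen))
                    print dev
        }
    ' | sort
}

# Example use on a node:
#   ifconfig | unbridged_nics
```

With only <span class="code">eth0</span> under Xen's control, this would list <span class="code">eth1</span> and <span class="code">eth2</span>.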
| === Create /etc/xen/scripts/an-network-script === | === Create /etc/xen/scripts/an-network-script === | ||
| Line 353: | Line 355: | ||
| This script will be used by Xen to create bridges for all NICs. | This script will be used by Xen to create bridges for all NICs. | ||
| Please note  | Please note three things; | ||
| # You don't need to use the name '<span class="code">an-network-script</span>'. I suggest this name mainly to keep in line with the rest of the 'AN!x' naming used here. | # You don't need to use the name '<span class="code">an-network-script</span>'. I suggest this name mainly to keep in line with the rest of the 'AN!x' naming used here. | ||
| # If you install <span class="code">convirt</span>, it will create its own bridge script called <span class="code">convirt-xen-multibridge</span>.   | # If you install <span class="code">convirt</span>, it will create its own bridge script called <span class="code">convirt-xen-multibridge</span>. Other tools may do something similar. | ||
| # Adding <span class="code">eth1</span> is optional, as we know ahead of time that <span class="code">eth1</span> will not be made available to any virtual machines as it is dedicated to [[DRBD]] and <span class="code">totem</span>. I'm adding it here because I like having things consistent; Do whichever makes more sense to you. | |||
| First, <span class="code">touch</span> the file and then <span class="code">chmod</span> it to be executable. | First, <span class="code">touch</span> the file and then <span class="code">chmod</span> it to be executable. | ||
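For orientation, a multi-bridge script of this kind is usually a thin wrapper that calls Xen's stock <span class="code">network-bridge</span> script once per NIC. The sketch below is not the script used in this paper; <span class="code">bridge_args</span> is a hypothetical helper, and the <span class="code">xenbrN</span> bridge names and <span class="code">vifnum=N</span> numbering are assumptions that simply follow the <span class="code">ethN</span> index.

```bash
#!/bin/bash
# Hypothetical helper: build the key=value arguments that the stock network-bridge
# script accepts for one device, numbering the bridge after the ethN index.
bridge_args() {
    local dev="$1"
    local num="${dev#eth}"    # "eth2" -> "2"
    printf 'vifnum=%s netdev=%s bridge=xenbr%s\n' "$num" "$dev" "$num"
}

# A wrapper in /etc/xen/scripts/ could then pass xend's action ("start", "stop"
# or "status") through to the stock script once per device, for example:
#
#   dir=$(dirname "$0")
#   for dev in eth0 eth1 eth2; do
#       "$dir/network-bridge" "$@" $(bridge_args "$dev")
#   done
```

Leaving <span class="code">eth1</span> out of the device list keeps the DRBD subnet off a bridge, which matches the optional choice described in the list above.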
| Line 376: | Line 379: | ||
| </source> | </source> | ||
| Now tell Xen to reference that script by editing <span class="code">/etc/xen/xend-config.sxp</span>: | Now tell Xen to reference that script by editing the <span class="code">/etc/xen/xend-config.sxp</span> file and changing the <span class="code">network-script</span> argument to point to this new script (this is line 91 in the default <span class="code">xend-config.sxp</span> file): | ||
| <source lang="bash"> | <source lang="bash"> | ||
| vim /etc/xen/xend-config.sxp | vim /etc/xen/xend-config.sxp | ||
| </source> | </source> | ||
| <source lang="text"> | <source lang="text"> | ||
| #(network-script network-bridge) | #(network-script network-bridge) | ||
| Line 396: | Line 392: | ||
| <source lang="bash"> | <source lang="bash"> | ||
| /etc/init.d/xend  | /etc/init.d/xend restart | ||
| restart  | |||
| </source> | </source> | ||
| Line 412: | Line 405: | ||
|            inet6 addr: fe80::92e6:baff:fe71:82d8/64 Scope:Link |            inet6 addr: fe80::92e6:baff:fe71:82d8/64 Scope:Link | ||
|            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 |            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 | ||
|            RX packets: |            RX packets:274 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets: |            TX packets:190 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen:0   |            collisions:0 txqueuelen:0   | ||
|            RX bytes: |            RX bytes:33479 (32.6 KiB)  TX bytes:33376 (32.5 KiB) | ||
| eth1      Link encap:Ethernet  HWaddr 00: | eth1      Link encap:Ethernet  HWaddr 00:21:91:19:96:5A    | ||
|            inet addr:10.0.0.71  Bcast:10.0.0.255  Mask:255.255.255.0 |            inet addr:10.0.0.71  Bcast:10.0.0.255  Mask:255.255.255.0 | ||
|            inet6 addr: fe80:: |            inet6 addr: fe80::221:91ff:fe19:965a/64 Scope:Link | ||
|            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 |            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 | ||
|            RX packets: |            RX packets:0 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets: |            TX packets:33 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen:0   |            collisions:0 txqueuelen:0   | ||
|            RX bytes: |            RX bytes:0 (0.0 b)  TX bytes:10393 (10.1 KiB) | ||
| eth2      Link encap:Ethernet  HWaddr 00: | eth2      Link encap:Ethernet  HWaddr 00:0E:0C:59:45:78    | ||
|            inet addr:10.0.1.71  Bcast:10.0.1.255  Mask:255.255.255.0 |            inet addr:10.0.1.71  Bcast:10.0.1.255  Mask:255.255.255.0 | ||
|            inet6 addr: fe80:: |            inet6 addr: fe80::20e:cff:fe59:4578/64 Scope:Link | ||
|            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 |            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 | ||
|            RX packets:0 errors:0 dropped:0 overruns:0 frame:0 |            RX packets:0 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets: |            TX packets:28 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen:0   |            collisions:0 txqueuelen:0   | ||
|            RX bytes:0 (0.0 b)  TX bytes: |            RX bytes:0 (0.0 b)  TX bytes:9964 (9.7 KiB) | ||
| lo        Link encap:Local Loopback    | lo        Link encap:Local Loopback    | ||
| Line 439: | Line 432: | ||
|            inet6 addr: ::1/128 Scope:Host |            inet6 addr: ::1/128 Scope:Host | ||
|            UP LOOPBACK RUNNING  MTU:16436  Metric:1 |            UP LOOPBACK RUNNING  MTU:16436  Metric:1 | ||
|            RX packets: |            RX packets:8 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets: |            TX packets:8 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen:0   |            collisions:0 txqueuelen:0   | ||
|            RX bytes: |            RX bytes:560 (560.0 b)  TX bytes:560 (560.0 b) | ||
| peth0     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF    | peth0     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF    | ||
|            inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link |            inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link | ||
|            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 |            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 | ||
|            RX packets: |            RX packets:281 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets: |            TX packets:204 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen:1000   |            collisions:0 txqueuelen:1000   | ||
|            RX bytes: |            RX bytes:33929 (33.1 KiB)  TX bytes:35540 (34.7 KiB) | ||
|            Interrupt:252 Base address:0x6000   |            Interrupt:252 Base address:0x6000   | ||
| Line 456: | Line 449: | ||
|            inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link |            inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link | ||
|            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 |            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 | ||
|            RX packets: |            RX packets:45 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets: |            TX packets:86 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen:1000   |            collisions:0 txqueuelen:1000   | ||
|            RX bytes: |            RX bytes:9139 (8.9 KiB)  TX bytes:20652 (20.1 KiB) | ||
|            Interrupt:16  | |||
| peth2     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF    | peth2     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF    | ||
|            inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link |            inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link | ||
|            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 |            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 | ||
|            RX packets: |            RX packets:45 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets: |            TX packets:90 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen: |            collisions:0 txqueuelen:100  | ||
|            RX bytes: |            RX bytes:9790 (9.5 KiB)  TX bytes:21066 (20.5 KiB) | ||
|            Base address:0xec00 Memory:febe0000-fec00000  | |||
| vif0.0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF    | vif0.0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF    | ||
|            inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link |            inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link | ||
|            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 |            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 | ||
|            RX packets: |            RX packets:200 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets: |            TX packets:281 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen:0   |            collisions:0 txqueuelen:0   | ||
|            RX bytes: |            RX bytes:37668 (36.7 KiB)  TX bytes:33941 (33.1 KiB) | ||
| vif0.1    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF    | vif0.1    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF    | ||
|            inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link |            inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link | ||
|            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 |            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 | ||
|            RX packets: |            RX packets:33 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets: |            TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen:0   |            collisions:0 txqueuelen:0   | ||
|            RX bytes: |            RX bytes:10393 (10.1 KiB)  TX bytes:0 (0.0 b) | ||
| vif0.2    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF    | vif0.2    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF    | ||
|            inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link |            inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link | ||
|            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 |            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 | ||
|            RX packets: |            RX packets:28 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 |            TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen:0   |            collisions:0 txqueuelen:0   | ||
|            RX bytes: |            RX bytes:9964 (9.7 KiB)  TX bytes:0 (0.0 b) | ||
| virbr0    Link encap:Ethernet  HWaddr 00:00:00:00:00:00    | virbr0    Link encap:Ethernet  HWaddr 00:00:00:00:00:00    | ||
| Line 500: | Line 493: | ||
|            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 |            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1 | ||
|            RX packets:0 errors:0 dropped:0 overruns:0 frame:0 |            RX packets:0 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets: |            TX packets:49 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen:0   |            collisions:0 txqueuelen:0   | ||
|            RX bytes:0 (0.0 b)  TX bytes: |            RX bytes:0 (0.0 b)  TX bytes:9640 (9.4 KiB) | ||
| xenbr0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF    | xenbr0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF    | ||
|            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 |            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 | ||
|            RX packets: |            RX packets:151 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 |            TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen:0   |            collisions:0 txqueuelen:0   | ||
|            RX bytes: |            RX bytes:24426 (23.8 KiB)  TX bytes:0 (0.0 b) | ||
| xenbr1    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF    | xenbr1    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF    | ||
|            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 |            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 | ||
|            RX packets: |            RX packets:33 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 |            TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen:0   |            collisions:0 txqueuelen:0   | ||
|            RX bytes: |            RX bytes:9931 (9.6 KiB)  TX bytes:0 (0.0 b) | ||
| xenbr2    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF    | xenbr2    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF    | ||
|            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 |            UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1 | ||
|            RX packets: |            RX packets:28 errors:0 dropped:0 overruns:0 frame:0 | ||
|            TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 |            TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 | ||
|            collisions:0 txqueuelen:0   |            collisions:0 txqueuelen:0   | ||
|            RX bytes: |            RX bytes:9572 (9.3 KiB)  TX bytes:0 (0.0 b) | ||
| </source> | </source> | ||
| == cluster.conf  | == Fencing == | ||
| Before proceeding with the <span class="code">cluster.conf</span> file, you must understand what fencing is, how it is used in Red Hat/CentOS clusters and why it is so important. | |||
| The  | * The Cluster Admin's Mantra: | ||
| ** '''The only thing you don't know is what you don't know'''. | |||
| Just because one node loses communication with another node, it '''cannot''' be assumed that the silent node is dead! | |||
| === What is it? === | |||
| "Fencing" is the act of isolating a malfunctioning node. The goal is to prevent a '''split-brain''' condition where two nodes think the other member is dead and continue to use a shared resource. When this happens, file system corruption is almost guaranteed. Another dangerous scenario would be if one node paused while writing to a disk, the other node decides it's dead and starts to replay the journal, then the first node recovers and completes the write. The results would be equally disastrous. If you are lucky enough to not lose the shared file system, you will be faced with the task of determining what data got written to which node, merging that data and/or overwriting the node you trust the least. This 'best case' is still pretty lousy. | |||
| Fencing, isolating a node from altering shared disks, can be accomplished in a couple of ways: | |||
| * Power | |||
| ** Power fencing is where a device is used to cut the power to a malfunctioning node. This is probably the most common type. | |||
| * Blocking | |||
| ** Blocking is often implemented at the network level. This type of fencing leaves the node alone, but disconnects it from the storage network. Often this is done by a switch which prevents traffic coming from the fenced node. | |||
| With power fencing, the term used is "STONITH", literally, '''S'''hoot '''T'''he '''O'''ther '''N'''ode '''I'''n '''T'''he '''H'''ead. Picture it like an old west duel. If one node is dead, the other node is going to win the duel by default and the dead node will just be shot again. When both nodes are alive, however, the faster node will win and will "kill" (power off or reset) the slower node before it has a chance to fire. Once this duel is over, the surviving node can then access the shared resource, confident that it is the only one working on it. | |||
| === Misconception === | |||
| It is a '''very''' common mistake to ignore fencing when first starting to learn about clustering. Often people think ''"It's just for production systems, I don't need to worry about it yet because I don't care what happens to my test cluster."'' | |||
| '''''Wrong!''''' | |||
| First, and most practically, the cluster software will block all I/O transactions when it can't guarantee that a fence operation succeeded. The result is that your cluster will essentially "lock up". Likewise, [[cman]] and related daemons will fail if they can't find a fence agent to use. | |||
| Second, testing our cluster will involve inducing errors. Without proper fencing, there is a high probability that our shared file system will be corrupted. That would force you to start over, making your learning take a lot longer than it needs to. | |||
| === Implementation === | |||
| In Red Hat's cluster software, the fence device(s) are configured in the main <span class="code">/etc/cluster/cluster.conf</span> cluster configuration file. This configuration is then acted on by the <span class="code">fenced</span> daemon. We'll cover the details of the [[cluster.conf]] file in a moment. | |||
| When the cluster determines that a node needs to be fenced, the <span class="code">fenced</span> daemon will consult the <span class="code">cluster.conf</span> file for information on how to access the fence device. Given this <span class="code">cluster.conf</span> snippet: | |||
| <source lang="xml"> | <source lang="xml"> | ||
| < | <cluster name="an-cluster" config_version="1"> | ||
| 	<clusternodes> | 	<clusternodes> | ||
| 		<clusternode name="an-node02.alteeve.com" nodeid="2"> | |||
| 			<fence> | 			<fence> | ||
| 				<method name="node_assassin"> | |||
| 					<device name="motoko" port="02" action="off"/> | |||
| 				</method> | 				</method> | ||
| 			</fence> | 			</fence> | ||
| 		</clusternode> | 		</clusternode> | ||
| 	</clusternodes> | 	</clusternodes> | ||
| 	<fencedevices> | 	<fencedevices> | ||
| 		<fencedevice name="motoko" agent="fence_na" quiet="true" | |||
| 		ipaddr="motoko.alteeve.com" login="motoko" passwd="secret"> | |||
| 		</fencedevice> | 		</fencedevice> | ||
| 	</fencedevices> | 	</fencedevices> | ||
| </cluster> | </cluster> | ||
| </source> | </source> | ||
| Once  | If the cluster manager determines that the node <span class="code">an-node02.alteeve.com</span> needs to be fenced, it looks at the first (and only, in this case) <span class="code"><method></span> in that node's <span class="code"><fence></span> section and reads its <span class="code"><device></span> entry's <span class="code">name</span>, which is <span class="code">motoko</span>. It then looks in the <span class="code"><fencedevices></span> section for the device with the matching <span class="code">name</span>. From there, it gets the information needed to find and access the fence device. Once it connects to the fence device, it passes along the options set in <span class="code">an-node02.alteeve.com</span>'s <span class="code"><device></span> entry. | ||
| So in this example, <span class="code">fenced</span> looks up the details on the <span class="code">motoko</span> Node Assassin fence device. It calls the <span class="code">fence_na</span> program, called a fence agent, and passes the following arguments: | |||
| * <span class="code">ipaddr=motoko.alteeve.com</span> | |||
| * <span class="code">login=motoko</span> | |||
| * <span class="code">passwd=secret</span> | |||
| * <span class="code">quiet=true</span> | |||
| * <span class="code">port=02</span> | |||
| * <span class="code">action=off</span> | |||
| How the fence agent acts on these arguments varies depending on the fence device itself. In general terms, the '<span class="code">fence_na</span>' fence agent will create a connection to the device at the IP address (or resolvable name, as in this case) specified in the <span class="code">ipaddr</span> argument. Once connected, it will authenticate using the <span class="code">login</span> and <span class="code">passwd</span> arguments. Once authenticated, it tells the device what <span class="code">port</span> to act on, which could be a power jack, a power or reset button, a network switch port and so on. Finally, it tells the device what <span class="code">action</span> to take. | |||
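The hand-off described above can be sketched in a few lines of shell. As far as I know, agents started by <span class="code">fenced</span> receive their options as <span class="code">key=value</span> lines on standard input; the function name <span class="code">parse_fence_args</span> and the summary it prints are made up for illustration and are not part of <span class="code">fence_na</span>.

```shell
#!/bin/bash
# Hypothetical helper (not the real fence_na): read "key=value" lines from
# stdin, the way a fence agent is assumed to receive its options, and print
# a one-line summary of what it was asked to do.
parse_fence_args() {
    local key value ipaddr="" login="" port="" action=""
    while IFS='=' read -r key value; do
        case "$key" in
            ipaddr) ipaddr="$value" ;;
            login)  login="$value" ;;
            port)   port="$value" ;;
            action) action="$value" ;;
            *)      ;;  # passwd, quiet and unknown keys are read but not echoed
        esac
    done
    echo "connect to ${ipaddr} as ${login}, port ${port}, action ${action}"
}

# Feed it the values from the cluster.conf snippet above.
printf 'ipaddr=motoko.alteeve.com\nlogin=motoko\npasswd=secret\nquiet=true\nport=02\naction=off\n' | parse_fence_args
```

A real agent would open a connection and authenticate at this point; the sketch stops at showing how the arguments arrive.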
| Once the device completes, it returns a success or failure message. If the first attempt fails, <span class="code">fenced</span> will try the next <span class="code"><method></span> in the node's <span class="code"><fence></span> section, if one exists. It will keep trying fence devices in the order they are found in the <span class="code">cluster.conf</span> file until it runs out of devices. If it fails to fence the node, most daemons will "block", that is, lock up and stop responding until the issue is resolved. The logic for this is that a locked up cluster is better than a corrupted one. | |||
| If any of the fence devices succeed though, the cluster will know that it is safe to proceed and will reconfigure the cluster without the defective node. | |||
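The "try each method in order, block if they all fail" logic can be sketched like this. It is only an illustration of the control flow, not how <span class="code">fenced</span> is implemented; <span class="code">try_fence_methods</span>, <span class="code">method_primary</span> and <span class="code">method_backup</span> are hypothetical names standing in for real fence agents.

```shell
#!/bin/bash
# Illustrative only: run each "fence method" (here, any command or function
# name) in the order given and stop at the first one that succeeds.
try_fence_methods() {
    local method
    for method in "$@"; do
        if "$method"; then
            echo "fenced via ${method}"
            return 0
        fi
    done
    # Nothing worked, so the cluster has to block rather than risk corruption.
    echo "all fence methods failed, blocking"
    return 1
}

# Stand-ins for real fence agents: the first fails, the second succeeds.
method_primary() { return 1; }
method_backup()  { return 0; }

try_fence_methods method_primary method_backup
```

The non-zero return on total failure is the important part: it is the signal that it is '''not''' safe to touch the shared storage.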
| === Fence Devices === | |||
| Many major [[OEM]]s have their own remote management devices that can serve as fence devices. Examples are [http://dell.ca Dell]'s 'DRAC' (Dell Remote Access Controller), [http://hp.ca HP]'s 'iLO' (Integrated Lights-Out), [http://ibm.ca IBM]'s 'RSA' (Remote Supervisor Adapter), [http://sun.ca Sun]'s 'SSP' (System Service Processor) and so on. Smaller manufacturers implement remote management via [[IPMI]], the Intelligent Platform Management Interface. | |||
| In the above devices, fencing is implemented via a built-in or integrated device inside the server. These devices are usually accessible even when the host server is powered off or hard locked. Via these devices, the host server can be powered off, reset and powered on remotely, regardless of the state of the host server. | |||
| Block fencing is possible when the device connecting a node to shared resources, like a fiber-channel SAN switch, provides a method of logically "unplugging" a defective node from the shared resource, leaving the node itself alone. | |||
| === Node Assassin === | |||
| A cheap alternative is the [[Node Assassin]], an open-hardware, open source fence device. It was built to allow the use of commodity system boards that lacked remote management support found on more expensive, server class hardware. | |||
| '''Full Disclosure''': Node Assassin was created by me, with much help from others, for this paper. | |||
| == Core Files == | |||
| There are two main configuration files that need to be set up now. | |||
| === cluster.conf === | |||
| The core of the cluster is the <span class="code">/etc/cluster/cluster.conf</span> [[XML]] configuration file. It contains information about the cluster itself, what nodes are to be used, how to fence each node, what fence devices exist plus miscellaneous other configuration options. | |||
| By default, there is no <span class="code">cluster.conf</span>, so you need to start by creating it: | |||
| <source lang="bash"> | <source lang="bash"> | ||
| touch /etc/cluster/cluster.conf | |||
| </source> | </source> | ||
| Here is the one '''AN!Cluster''' uses, with in-line comments, mostly from the <span class="code">man cluster.conf</span> page. | |||
| * [[Two-Node CentOS 5 cluster.conf|cluster.conf]] | |||
| Once you're comfortable with your changes to the file, you need to validate it. Run: | |||
| <source lang="bash"> | <source lang="bash"> | ||
| xmllint --relaxng /usr/share/system-config-cluster/misc/cluster.ng /etc/cluster/cluster.conf | |||
| </source> | </source> | ||
| If there are errors, address them. Once you see <span class="code">/etc/cluster/cluster.conf validates</span>, you can proceed to the next step. | |||
| '''Note''': If you are using [[Node Assassin]] and the XML validation fails, be sure to get the updated [[Node_Assassin#XML_Validation_Support|cluster.ng]] validation file! | |||
| === openais.conf === | |||
| Where <span class="code">cluster.conf</span> is the core configuration file for the cluster, <span class="code">OpenAIS</span> is the master of ceremonies. It implements all the cluster functions, referencing first its own <span class="code">/etc/ais/openais.conf</span> file and then the <span class="code">/etc/cluster/cluster.conf</span> file. You can think of <span class="code">openais.conf</span> as a "low level" configuration file controlling the underlying mechanics of the cluster, where <span class="code">cluster.conf</span> contains the specific cluster configuration. | |||
| Unlike <span class="code">cluster.conf</span>, there is a default <span class="code">openais.conf</span> config file. It's a good habit to back default files up in case you need to start over. | |||
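A minimal sketch of that backup habit follows. The <span class="code">.orig</span> suffix and the <span class="code">backup_default</span> function name are my own choices, not a convention used by the cluster packages.

```shell
#!/bin/bash
# Hypothetical helper: keep a pristine copy of a default config file as
# "<file>.orig", and never overwrite an existing backup with an edited file.
backup_default() {
    local file="$1"
    local backup="${file}.orig"
    if [ -e "$backup" ]; then
        echo "backup already exists: ${backup}"
        return 0
    fi
    # -p preserves the mode, ownership and timestamps of the original.
    cp -p "$file" "$backup" && echo "saved ${backup}"
}
```

You would call it as <span class="code">backup_default /etc/ais/openais.conf</span> before opening the file in an editor. Refusing to overwrite an existing backup means running it twice by accident can't replace the pristine copy with an already-edited one.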
| When reviewing the <span class="code">openais.conf</span> file below, please take the time to read the comments in the file. There are many aspects of clustering that will make sense if you understand the various OpenAIS configuration options. | |||
| * [[Two-Node openais.conf|openais.conf]] | |||
| Once you are comfortable, back up and edit <span class="code">openais.conf</span>: | |||
| <source lang="bash"> | <source lang="bash"> | ||
| vim /etc/ais/openais.conf | |||
| </source> | |||
| <source lang="perl"> | |||
| # This is a skeleton example configuration file. | |||
| #  | |||
| # Totem Protocol options. | # Totem Protocol options. | ||
| totem { | totem { | ||
|         version: 2 | |||
|         secauth: off | |||
|         threads: 0 | |||
|         rrp_mode: passive | |||
|         interface { | |||
|                 # This is the back-channel subnet, which is the primary network | |||
|                 # for the totem protocol. | |||
|                 ringnumber: 0 | |||
|                 bindnetaddr: 10.0.1.0 | |||
|                 mcastaddr: 226.94.1.1 | |||
|                 mcastport: 5405 | |||
|         } | |||
|         interface { | |||
|                 # This is the DRBD subnet, which acts as a secondary, backup | |||
|                 # network for the totem protocol. | |||
|                 ringnumber: 1 | |||
|                 bindnetaddr: 10.0.0.0 | |||
|                 mcastaddr: 227.94.1.1 | |||
|                 mcastport: 5406 | |||
|         } | |||
| } | } | ||
| #  | # Enable logging. | ||
| logging { | logging { | ||
|         to_syslog: yes | |||
| } | } | ||
| # AMF,  | # Disable AMF, it's not supported yet. | ||
| amf { | amf { | ||
|         mode: disabled | |||
| } | } | ||
| </source> | </source> | ||
| == Cluster First Start == | |||
| If everything up until now was done right, you should be able to start your cluster for the first time. It can be useful to have a separate terminal window open with a <span class="code">tail</span> watching <span class="code">/var/log/messages</span> so that you can see if there are any problems. | |||
| On both nodes, in dedicated terminals, run: | |||
| <source lang="bash"> | |||
| clear; tail -f -n 0 /var/log/messages | |||
| </source> | |||
| This next step must be run on both nodes as close together as possible. If you start one node and wait too long to start the other, the first node will think there is a problem and will fence the second node. Remember the <span class="code"><fence_daemon post_join_delay="60"></fence_daemon></span> line in <span class="code">cluster.conf</span>? This is where it comes into play. The value you set is the "window" you have to start both nodes before a fence is issued. The default is <span class="code">6</span> seconds, and the above line changed that to <span class="code">60</span> seconds. | |||
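If you want to double-check how long that window is before starting <span class="code">cman</span>, a rough sketch like the following pulls the value out of the file. It is a plain text match rather than a real XML parse, <span class="code">get_post_join_delay</span> is a hypothetical name, and the fallback of <span class="code">6</span> is simply the default mentioned above.

```shell
#!/bin/bash
# Rough sketch: print the post_join_delay value found in a cluster.conf-style
# file, or 6 (the default mentioned in this article) when it is not set.
get_post_join_delay() {
    local conf="$1" value
    value=$(sed -n 's/.*post_join_delay="\([0-9][0-9]*\)".*/\1/p' "$conf" | head -n 1)
    echo "${value:-6}"
}
```

For example, <span class="code">get_post_join_delay /etc/cluster/cluster.conf</span> would print <span class="code">60</span> for the configuration used in this paper.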
| On both nodes, in different terminals, check that <span class="code">cman</span> is indeed stopped, then start it up: | |||
| <source lang="bash"> | |||
| /etc/init.d/cman status | |||
| </source> | |||
| <source lang="text"> | |||
| ccsd is stopped | |||
| </source> | |||
| <source lang="bash"> | |||
| /etc/init.d/cman start | |||
| </source> | |||
| If all goes well, you should see something like this in each node's <span class="code">/var/log/messages</span> file: | |||
| <source lang="text"> | |||
| May 10 20:54:32 an-node01 kernel: DLM (built Mar 17 2010 12:05:05) installed | |||
| May 10 20:54:32 an-node01 kernel: GFS2 (built Mar 17 2010 12:05:47) installed | |||
| May 10 20:54:32 an-node01 kernel: Lock_DLM (built Mar 17 2010 12:05:54) installed | |||
| May 10 20:54:33 an-node01 ccsd[11167]: Starting ccsd 2.0.115:  | |||
| May 10 20:54:33 an-node01 ccsd[11167]:  Built: Dec  8 2009 09:20:54  | |||
| May 10 20:54:33 an-node01 ccsd[11167]:  Copyright (C) Red Hat, Inc.  2004  All rights reserved.  | |||
| May 10 20:54:33 an-node01 ccsd[11167]: cluster.conf (cluster name = an-cluster, version = 1) found.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [MAIN ] AIS Executive Service RELEASE 'subrev 1887 version 0.80.6'  | |||
| May 10 20:54:35 an-node01 openais[11175]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [MAIN ] Copyright (C) 2006 Red Hat, Inc.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [MAIN ] AIS Executive Service: started and ready to provide service.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [MAIN ] Using default multicast address of 239.192.147.72  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Token Timeout (10000 ms) retransmit timeout (495 ms)  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] token hold (386 ms) retransmits before loss (20 retrans)  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] join (60 ms) send_join (0 ms) consensus (4800 ms) merge (200 ms)  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] downcheck (1000 ms) fail to recv const (50 msgs)  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] seqno unchanged const (30 rotations) Maximum network MTU 1500  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] window size per rotation (50 messages) maximum messages per rotation (17 messages)  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] send threads (0 threads)  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] RRP token expired timeout (495 ms)  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] RRP token problem counter (2000 ms)  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] RRP threshold (10 problem count)  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] RRP mode set to none.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] heartbeat_failures_allowed (0)  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] max_network_delay (50 ms)  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes).  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] The network interface [10.0.1.71] is now up.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Created or loaded sequence id 0.10.0.1.71 for this ring.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] entering GATHER state from 15.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CMAN ] CMAN 2.0.115 (built Dec  8 2009 09:20:58) started  | |||
| May 10 20:54:35 an-node01 openais[11175]: [MAIN ] Service initialized 'openais CMAN membership service 2.01'  | |||
| May 10 20:54:35 an-node01 openais[11175]: [SERV ] Service initialized 'openais extended virtual synchrony service'  | |||
| May 10 20:54:35 an-node01 openais[11175]: [SERV ] Service initialized 'openais cluster membership service B.01.01'  | |||
| May 10 20:54:35 an-node01 openais[11175]: [SERV ] Service initialized 'openais availability management framework B.01.01'  | |||
| May 10 20:54:35 an-node01 openais[11175]: [SERV ] Service initialized 'openais checkpoint service B.01.01'  | |||
| May 10 20:54:35 an-node01 openais[11175]: [SERV ] Service initialized 'openais event service B.01.01'  | |||
| May 10 20:54:35 an-node01 openais[11175]: [SERV ] Service initialized 'openais distributed locking service B.01.01'  | |||
| May 10 20:54:35 an-node01 openais[11175]: [SERV ] Service initialized 'openais message service B.01.01'  | |||
| May 10 20:54:35 an-node01 openais[11175]: [SERV ] Service initialized 'openais configuration service'  | |||
| May 10 20:54:35 an-node01 openais[11175]: [SERV ] Service initialized 'openais cluster closed process group service v1.01'  | |||
| May 10 20:54:35 an-node01 openais[11175]: [SERV ] Service initialized 'openais cluster config database access v1.01'  | |||
| May 10 20:54:35 an-node01 openais[11175]: [SYNC ] Not using a virtual synchrony filter.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Creating commit token because I am the rep.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Saving state aru 0 high seq received 0  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Storing new sequence id for ring 4  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] entering COMMIT state.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] entering RECOVERY state.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] position [0] member 10.0.1.71:  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] previous ring seq 0 rep 10.0.1.71  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] aru 0 high delivered 0 received flag 1  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Did not need to originate any messages in recovery.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Sending initial ORF token  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] CLM CONFIGURATION CHANGE  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] New Configuration:  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] Members Left:  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] Members Joined:  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] CLM CONFIGURATION CHANGE  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] New Configuration:  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] 	r(0) ip(10.0.1.71)   | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] Members Left:  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] Members Joined:  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] 	r(0) ip(10.0.1.71)   | |||
| May 10 20:54:35 an-node01 openais[11175]: [SYNC ] This node is within the primary component and will provide service.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] entering OPERATIONAL state.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CMAN ] quorum regained, resuming activity  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] got nodejoin message 10.0.1.71  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] entering GATHER state from 11.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Creating commit token because I am the rep.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Saving state aru a high seq received a  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Storing new sequence id for ring 8  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] entering COMMIT state.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] entering RECOVERY state.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] position [0] member 10.0.1.71:  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] previous ring seq 4 rep 10.0.1.71  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] aru a high delivered a received flag 1  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] position [1] member 10.0.1.72:  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] previous ring seq 4 rep 10.0.1.72  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] aru c high delivered c received flag 1  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Did not need to originate any messages in recovery.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Sending initial ORF token  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] CLM CONFIGURATION CHANGE  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] New Configuration:  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] 	r(0) ip(10.0.1.71)   | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] Members Left:  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] Members Joined:  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] CLM CONFIGURATION CHANGE  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] New Configuration:  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] 	r(0) ip(10.0.1.71)   | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] 	r(0) ip(10.0.1.72)   | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] Members Left:  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] Members Joined:  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] 	r(0) ip(10.0.1.72)   | |||
| May 10 20:54:35 an-node01 openais[11175]: [SYNC ] This node is within the primary component and will provide service.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [TOTEM] entering OPERATIONAL state.  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] got nodejoin message 10.0.1.71  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CLM  ] got nodejoin message 10.0.1.72  | |||
| May 10 20:54:35 an-node01 openais[11175]: [CPG  ] got joinlist message from node 2  | |||
| May 10 20:54:36 an-node01 ccsd[11167]: Initial status:: Quorate  | |||
| </source> | |||
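That is a lot of output to read by eye. As a rough aid, the two things worth confirming are that both nodes joined and that the cluster became quorate. The sketch below greps for the exact message strings shown in the log above; the function names are hypothetical, and other versions of the cluster software may word these messages differently.

```shell
#!/bin/bash
# Both functions read syslog text on stdin and rely on the message wording
# seen in the sample log above.

# Print each unique address that sent a "nodejoin" message.
joined_nodes() {
    sed -n 's/.*got nodejoin message \([0-9.]*\).*/\1/p' | sort -u
}

# Succeed (exit 0) only if ccsd reported the cluster as quorate.
is_quorate() {
    grep -q 'Initial status:: Quorate'
}
```

For example, <span class="code">joined_nodes < /var/log/messages</span> should list both <span class="code">10.0.1.71</span> and <span class="code">10.0.1.72</span> after a clean start, and <span class="code">is_quorate < /var/log/messages && echo ok</span> should print <span class="code">ok</span>.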
| == Set the Cluster to Start at Boot == | |||
| Simply use <span class="code">chkconfig</span> to tell <span class="code">cman</span> to start on boot: | |||
| <source lang="bash"> | |||
| chkconfig cman on | |||
| </source> | |||
| You should now see <span class="code">cman</span> at start level <span class="code">21</span>: | |||
| <source lang="bash"> | |||
| ls -lah /etc/rc3.d/ |grep cman | |||
| </source> | |||
| <source lang="text"> | |||
| lrwxrwxrwx  1 root root   14 Mar 16 09:45 S21cman -> ../init.d/cman | |||
| </source> | |||
| Done! You now have your first fully functioning cluster! | |||
| = DRBD = | = DRBD = | ||
| [[DRBD]] will be used to provide  | [[DRBD]] will be used to provide a real-time, redundant block device. On top of this, a new [[LVM]] [[PV]] will be created for a virtual machine that will be able to "float" between the two nodes. This way, should one of the nodes fail, the virtual machine can quickly be brought back up on the surviving node with minimal interruption. When you have planned down time, you will be able to "hot migrate" the virtual machine from one node to the other with nothing more than a short pause while the virtual machine's RAM is frozen and copied over to the other node, a process that usually takes a few seconds to a minute. | ||
| == Install == | == Install == | ||
| The <span class="code">drbd83</span> and <span class="code">kmod-drbd83-xen</span> packages are not included in the default CentOS installation media, so we will need to install them now: | |||
| <source lang="bash"> | <source lang="bash"> | ||
| yum -y install drbd83.x86_64 kmod-drbd83-xen.x86_64 | yum -y install drbd83.x86_64 kmod-drbd83-xen.x86_64 | ||
| </source> | </source> | ||
| Before we configure DRBD, we will need to create an LVM LV to host it. | |||
| == Create the LVM Logical Volume == | == Create the LVM Logical Volume == | ||
| Most of the remaining space on either node's [[LVM]] [[PV]] will be allocated to a new [[LV]]. This new LV will host either node's side of the DRBD resource | Most of the remaining space on either node's [[LVM]] [[PV]] will be allocated to a new [[LV]]. This new LV will host either node's side of the DRBD resource. | ||
| First, you need to see how much space you have left on your LVM PV: | |||
| <source lang="bash"> | |||
| pvscan | |||
| </source> | |||
| <source lang="text"> | |||
|   PV /dev/sda2   VG an-lvm01   lvm2 [465.50 GB / 443.97 GB free] | |||
|   Total: 1 [465.50 GB] / in use: 1 [465.50 GB] / in no VG: 0 [0   ] | |||
| </source> | |||
| On my nodes, each of which has a single 500GB drive, I've allocated only 20GB to dom0, so I've got over 440GB left free. I like to leave a bit of space unallocated because I never know where I might need it, so I will allocate an even 400GB to DRBD and keep the remaining 44GB set aside for future growth. How much space you have left and how you allocate it is something you must decide based on your own needs. | |||
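The arithmetic above can be checked with a short helper. This is only a sketch; <span class="code">remaining_gb</span> is a hypothetical name, and the two arguments are the free space reported by <span class="code">pvscan</span> and the size you plan to give the new LV, both in GB.

```bash
# Print the space (in GB) left free in a PV after allocating a new LV.
# $1 = free space reported by pvscan, $2 = size of the planned LV.
# NOTE: remaining_gb is a hypothetical helper used only for illustration.
remaining_gb() {
    awk -v free="$1" -v want="$2" 'BEGIN { printf "%.2f\n", free - want }'
}
```

With the numbers from my nodes, <span class="code">remaining_gb 443.97 400</span> prints <span class="code">43.97</span>.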
| ' | Next, check that the name you will give to the new LV isn't used yet: | ||
| <source lang="bash"> | <source lang="bash"> | ||
| lvscan | |||
| </source> | |||
| <source lang="text"> | |||
|   ACTIVE            '/dev/an-lvm01/lv01' [19.53 GB] inherit | |||
|   ACTIVE            '/dev/an-lvm01/lv00' [2.00 GB] inherit | |||
| </source> | </source> | ||
| I can see from the above output that <span class="code">lv00</span> and <span class="code">lv01</span> are used, so I will use <span class="code">lv02</span> for my DRBD partition. Of course, you can use <span class="code">drbd</span> or pretty much anything else you want. | |||
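If you would rather script the name check than read the output by eye, something like the following may work. It is a minimal sketch that assumes the <span class="code">lvscan</span> output format shown above; <span class="code">lv_name_in_use</span> is a made-up helper name.

```bash
# Succeed if an LV with the given name appears in lvscan-style output on stdin.
# $1 = LV name to look for (for example lv02).
# NOTE: lv_name_in_use is a hypothetical helper; it relies on lvscan quoting
# each path as '/dev/<vg>/<lv>'.
lv_name_in_use() {
    grep -F -q -- "/$1'"
}
```

For example, <span class="code">lvscan | lv_name_in_use lv02 || echo "lv02 is free"</span>.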
| Now that I know I want to create a 400GB logical volume called <span class="code">lv02</span>, I can proceed. | |||
| Create the Logical Volume for the DRBD device on each node. The next two commands show what I need to call on my nodes, and will match what you need to run if you used the [[#AN!Cluster Install|AN!Cluster Install]] DVD. If you ran your own install, be sure to edit the following arguments to match your nodes: | |||
| On <span class="code">an-node01</span>: | |||
| <source lang="bash"> | |||
| lvcreate -L 400G -n lv02 /dev/an-lvm01 | |||
| </source> | |||
| On <span class="code">an-node02</span>: | |||
| <source lang="bash"> | |||
| lvcreate -L 400G -n lv02 /dev/an-lvm02 | |||
| </source> | |||
| <source lang="text"> | |||
|   Logical volume "lv02" created | |||
| </source> | |||
| If I re-run <span class="code">lvscan</span> now, I will see the new volume: | |||
| <source lang="bash"> | <source lang="bash"> | ||
| lvscan | |||
| </source> | |||
| <source lang="text"> | |||
|   ACTIVE            '/dev/an-lvm01/lv01' [19.53 GB] inherit | |||
|   ACTIVE            '/dev/an-lvm01/lv00' [2.00 GB] inherit | |||
|   ACTIVE            '/dev/an-lvm01/lv02' [400.00 GB] inherit | |||
| </source> | </source> | ||
| We can now proceed with the DRBD setup! | |||
| == Create or Edit /etc/drbd.conf == | == Create or Edit /etc/drbd.conf == | ||
| DRBD is controlled from a single <span class="code">/etc/drbd.conf</span> configuration file that must be identical on both nodes. This file tells DRBD what devices to use on each node, what interface to use and so on. | |||
| * [[2-Node drbd.conf|drbd.conf]] | |||
| Full details on all the <span class="code">drbd.conf</span> configuration file directives and arguments can be found [http://www.drbd.org/users-guide/re-drbdconf.html here]. | |||
| <source lang="bash"> | <source lang="bash"> | ||
| global { | global { | ||
| 	usage-count yes; | |||
| } | } | ||
| common { | common { | ||
| 	protocol C; | |||
| 	syncer { | |||
| 		rate 15M; | |||
| 	} | |||
| } | } | ||
| resource r0 { | resource r0 { | ||
| 	device    /dev/drbd0; | |||
| 	net { | |||
| 		allow-two-primaries; | |||
| 		after-sb-0pri discard-zero-changes; | |||
| 		after-sb-1pri discard-secondary; | |||
| 		after-sb-2pri disconnect; | |||
| 	} | |||
| 	startup {   | |||
| 		become-primary-on both; | |||
| 		disk  | 	} | ||
| 	meta-disk	internal; | |||
| 	on an-node01.alteeve.com { | |||
| 		address		10.0.0.71:7789; | |||
| 		disk		/dev/an-lvm01/lv02; | |||
| 	} | |||
| 	on an-node02.alteeve.com { | |||
| 		address		10.0.0.72:7789; | |||
| 		disk		/dev/an-lvm02/lv02; | |||
| 	} | |||
| } | } | ||
| </source> | </source> | ||
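DRBD picks the <span class="code">on</span> section whose name matches the output of <span class="code">uname -n</span>, so it can be worth listing the host names the file declares and comparing them with each node's real name. The sketch below does only the listing; <span class="code">drbd_conf_hosts</span> is a hypothetical helper, and it assumes each section opens on a single line as in the example above.

```bash
# Print the host name of every "on <host> {" section found in drbd.conf-style
# text read from stdin.
# NOTE: drbd_conf_hosts is a hypothetical helper, not a DRBD tool. It assumes
# "on", the host name and "{" are on the same line.
drbd_conf_hosts() {
    awk '$1 == "on" && $3 == "{" { print $2 }'
}
```

Running <span class="code">drbd_conf_hosts < /etc/drbd.conf</span> on a node and looking for that node's <span class="code">uname -n</span> in the result is a quick sanity check before starting DRBD.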
| Line 1,075: | Line 959: | ||
| <source lang="text"> | <source lang="text"> | ||
|   --==  Thank you for participating in the global usage survey  ==-- | |||
| The server's response is: | |||
| you are the 10464th user to install this version | |||
| # /etc/drbd.conf | # /etc/drbd.conf | ||
| common { | common { | ||
| Line 1,083: | Line 971: | ||
| } | } | ||
| # resource r0 on  | # resource r0 on an-node01.alteeve.com: not ignored, not stacked | ||
| resource r0 { | resource r0 { | ||
|      on  |      on an-node01.alteeve.com { | ||
|          device           /dev/drbd0 minor 0; |          device           /dev/drbd0 minor 0; | ||
|          disk             /dev/ |          disk             /dev/an-lvm01/lv02; | ||
|          address          ipv4 10.0.0.71:7789; |          address          ipv4 10.0.0.71:7789; | ||
|          meta-disk        internal; |          meta-disk        internal; | ||
|      } |      } | ||
|      on  |      on an-node02.alteeve.com { | ||
|          device           /dev/drbd0 minor 0; |          device           /dev/drbd0 minor 0; | ||
|          disk             /dev/ |          disk             /dev/an-lvm02/lv02; | ||
|          address          ipv4 10.0.0.72:7789; |          address          ipv4 10.0.0.72:7789; | ||
|          meta-disk        internal; |          meta-disk        internal; | ||
| Line 1,118: | Line 1,006: | ||
| '''Both''' | '''Both''' | ||
| <source lang="bash"> | <source lang="bash"> | ||
| /etc/init.d/drbd restart | /etc/init.d/drbd restart | ||
| Line 1,124: | Line 1,013: | ||
| You should see output like this: | You should see output like this: | ||
| <source lang="text"> | <source lang="text"> | ||
| Restarting all DRBD resources: Could not stat("/proc/drbd"): No such file or directory | |||
| ERROR: Module drbd does not exist in /proc/modules | |||
| . | |||
| </source> | |||
| Don't worry about those errors. | |||
| You can verify that it started properly by checking the drbd daemon's status and by checking what is in <span class="code">/proc/drbd</span>. | |||
| Check the daemon: | |||
| <source lang="bash"> | |||
| /etc/init.d/drbd status | |||
| </source> | </source> | ||
| <source lang="text"> | |||
| drbd driver loaded OK; device status: | |||
| version: 8.3.2 (api:88/proto:86-90) | |||
| GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by mockbuild@v20z-x86-64.home.local, 2009-08-29 14:08:07 | |||
| m:res  cs         ro                   ds                 p  mounted  fstype | |||
| 0:r0   Connected  Secondary/Secondary  UpToDate/UpToDate  C | |||
| </source> | |||
| Check the special procfs file: | |||
| <source lang="bash"> | <source lang="bash"> | ||
| cat /proc/drbd | |||
| </source> | |||
| <source lang="text"> | |||
| GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by mockbuild@v20z-x86-64.home.local, 2009-08-29 14:08:07 | |||
|  0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r---- | |||
|     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0 | |||
| </source> | </source> | ||
| ''' | If you see the output above, you're good to proceed. | ||
| '''Both''': | |||
| '''''MADI''''': Try skipping this command on the next build, it may no longer be needed. | |||
| Initialise the device by running the following command: | |||
| <source lang="bash"> | <source lang="bash"> | ||
| drbdadm create-md r0 | drbdadm create-md r0   | ||
| </source> | |||
| <source lang="text"> | |||
| Device '0' is configured! | |||
| </source> | </source> | ||
| '''Primary''': Start the sync between the two nodes by calling: | '''Primary''': | ||
| Start the sync between the two nodes by calling: | |||
| <source lang="bash"> | <source lang="bash"> | ||
| drbdadm -- --overwrite-data-of-peer primary r0 | drbdadm -- --overwrite-data-of-peer primary r0 | ||
| </source> | </source> | ||
| '''Secondary''': At this point, we need to promote the secondary node to 'Primary' position. | '''Secondary''': | ||
| At this point, we need to promote the secondary node to 'Primary' position. | |||
| <source lang="bash"> | <source lang="bash"> | ||
| drbdadm primary r0 | drbdadm primary r0 | ||
| </source> | </source> | ||
| '''Both''': Make sure that both nodes are  | '''Both''':   | ||
| Make sure that both nodes are now Primary by running: | |||
| <source lang="bash"> | <source lang="bash"> | ||
| cat /proc/drbd | |||
| </source> | </source> | ||
| <source lang="text"> | <source lang="text"> | ||
| version: 8.3.2 (api:88/proto:86-90) | version: 8.3.2 (api:88/proto:86-90) | ||
| GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by mockbuild@v20z-x86-64.home.local, 2009-08-29 14:08:07 | GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by mockbuild@v20z-x86-64.home.local, 2009-08-29 14:08:07 | ||
|   0: cs: |   0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r---- | ||
|      ns: |      ns:524288 nr:0 dw:0 dr:524288 al:0 bm:127 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0 | ||
| </source> | </source> | ||
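The same check can be scripted. The following is a minimal sketch that assumes the <span class="code">/proc/drbd</span> layout shown above (DRBD 8.3); <span class="code">drbd_dual_primary_ok</span> is a hypothetical name.

```bash
# Succeed if /proc/drbd-style text on stdin shows the given minor device as
# Connected, Primary on both nodes and UpToDate on both nodes.
# $1 = DRBD minor number (0 for /dev/drbd0).
# NOTE: drbd_dual_primary_ok is a hypothetical helper, not part of DRBD.
drbd_dual_primary_ok() {
    awk -v minor="$1" '
        $1 == (minor ":") {
            found = 1
            ok = ($2 == "cs:Connected" &&
                  $3 == "ro:Primary/Primary" &&
                  $4 == "ds:UpToDate/UpToDate")
        }
        END { exit (found && ok) ? 0 : 1 }
    '
}
```

For example, <span class="code">drbd_dual_primary_ok 0 < /proc/drbd && echo "r0 is ready"</span>.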
| The DRBD partition has no file system, you should not see the devices sync'ing at this point. | |||
| = LVM = | = LVM = | ||
| [[LVM]]  | If you used the AN!Cluster kickstart files, or if you based your install on them, then you are already using [[LVM]] on the cluster nodes as the underlying system for all but the <span class="code">/boot</span> partition. Each node should have a [[VG]] named after the node (<span class="code">an-lvm01</span> on <span class="code">an-node01</span>, for example) with three [[LV]]s on it. | ||
| Now we will "stack" LVM by creating a [[PV]] on top of the new [[DRBD]] partition, <span class="code">/dev/drbd0</span>, that we created in the previous step. Since this new LVM [[PV]] will exist on top of the shared DRBD partition, whatever gets written to its logical volumes will be immediately available on either node, regardless of which node actually initiated the write. | |||
| This capability is the underlying reason for creating this cluster: neither machine is truly needed, so if one machine dies, anything on top of the DRBD partition will still be available. When the failed machine returns, the surviving node will have a list of what blocks changed while the other node was gone and can use this list to quickly re-sync the other server. | |||
| == Making LVM Cluster-Aware == | |||
| Normally, LVM is run on a single server. This means that at any time, LVM can write data to the underlying drive and not need to worry that any other device might change anything. In clusters, this isn't the case. The other node could try to write to the shared storage, so the nodes need to enable "locking" to prevent the two nodes from trying to work on the same bit of data at the same time. | |||
| The process of enabling this locking is known as making LVM "cluster-aware". | |||
| === Updating '/etc/lvm/lvm.conf' === | |||
| '''Note''': With [[EL5]].5, this step is only needed when using software [[RAID]], as LVM sees both the <span class="code">/dev/mdX</span> and <span class="code">/dev/drbdX</span> devices as LVM PVs and defaults to using the RAID device, which fails when creating LVs. See [https://bugzilla.redhat.com/show_bug.cgi?id=530881 here] for details. | |||
| To hide software RAID devices, we need to change the <span class="code">filter</span> in <span class="code">/etc/lvm/lvm.conf</span> to include a regular expression that matches the name of our DRBD device and rejects everything else. We created our DRBD device as <span class="code">/dev/drbd0</span>, so changing the <span class="code">filter</span> to <span class="code">filter = [ "a|drbd.*|", "a|sd.*|", "r|.*|" ]</span> ('''a'''ccept <span class="code">drbd</span> devices and devices with the <span class="code">sd*</span> "scsi" names, '''r'''eject everything else) will do this. Edit <span class="code">lvm.conf</span> and change it to match this: | |||
| <source lang="bash"> | |||
| vim /etc/lvm/lvm.conf | |||
| </source> | |||
| <source lang="text"> | |||
|     # By default we accept every block device: | |||
|     #filter = [ "a/.*/" ] | |||
|     filter = [ "a|drbd.*|", "a|sd.*|", "r|.*|" ] | |||
| </source> | |||
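To see what this filter does to a given device, it can help to walk through it by hand: the patterns are tried in order, the first one that matches anywhere in the device path decides the outcome, and the final <span class="code">r|.*|</span> rejects whatever is left. The sketch below is only a rough emulation of that behaviour for the three patterns above; <span class="code">lvm_filter_verdict</span> is a hypothetical helper and does not call LVM itself.

```bash
# Rough emulation of: filter = [ "a|drbd.*|", "a|sd.*|", "r|.*|" ]
# Patterns are tried in order; the first unanchored match wins.
# $1 = device path, e.g. /dev/drbd0
# NOTE: lvm_filter_verdict is a hypothetical helper for illustration only.
lvm_filter_verdict() {
    if printf '%s\n' "$1" | grep -Eq 'drbd.*'; then
        echo accept      # matched "a|drbd.*|"
    elif printf '%s\n' "$1" | grep -Eq 'sd.*'; then
        echo accept      # matched "a|sd.*|"
    else
        echo reject      # fell through to "r|.*|"
    fi
}
```

With this, <span class="code">/dev/drbd0</span> and <span class="code">/dev/sda2</span> are accepted, while a software RAID device such as <span class="code">/dev/md0</span> is rejected.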
| === Enabling Cluster Locking === | |||
| LVM has a built-in tool called <span class="code">lvmconf</span> that can be used to enable LVM locking. Simply run: | |||
| <source lang="bash"> | |||
| lvmconf --enable-cluster | |||
| </source> | |||
| There won't be any output from that command. | |||
| By default, <span class="code">clvmd</span>, the cluster LVM daemon, is stopped and not set to run on boot. Now that we've enabled LVM locking, we need to start it. First, check its current status: | |||
| <source lang="bash"> | <source lang="bash"> | ||
| /etc/init.d/clvmd status | |||
| </source> | |||
| <source lang="text"> | |||
| clvmd is stopped | |||
| active volumes: lv00 lv01 lv02 | |||
| </source> | </source> | ||
| As expected, it is stopped, so let's start it and then use <span class="code">chkconfig</span> to enable it at boot. | |||
| <source lang="bash"> | |||
| /etc/init.d/clvmd start | |||
| </source> | |||
| <source lang="text"> | |||
| Stopping clvm:                                             [  OK  ] | |||
| Starting clvmd:                                            [  OK  ] | |||
| Activating VGs:   3 logical volume(s) in volume group "an-lvm01" now active | |||
|                                                            [  OK  ] | |||
| </source> | |||
| <source lang="bash"> | <source lang="bash"> | ||
| chkconfig clvmd on | |||
| ls -lah /etc/rc3.d/ |grep clvmd | |||
| </source> | </source> | ||
| <source lang="text"> | <source lang="text"> | ||
|    - | lrwxrwxrwx  1 root root   15 Mar 16 12:48 S24clvmd -> ../init.d/clvmd | ||
| </source> | </source> | ||
| We can see that it is now set to start at position <span class="code">24</span>. | |||
| == Creating a new PV using the DRBD Partition == | |||
| We can now proceed with setting up the new DRBD-based LVM physical volume. Once the PV is created, we can create a new volume group and start allocating space to logical volumes. | |||
| '''Note''': As we will be using our DRBD device, and as it is a shared block device, most of the following commands only need to be run on one node. Once the block device changes in any way, those changes will near-instantly appear on the other node. For this reason, unless explicitly stated to do so, only run the following commands on one node. | |||
| To set up the DRBD partition as an LVM PV, run <span class="code">pvcreate</span>: | |||
| <source lang="bash"> | <source lang="bash"> | ||
| pvcreate /dev/drbd0 | pvcreate /dev/drbd0 | ||
| Line 1,232: | Line 1,175: | ||
| </source> | </source> | ||
| Now, on both nodes, check that the new physical volume is visible by using <span class="code">pvdisplay</span>: | |||
| <source lang="bash"> | <source lang="bash"> | ||
| pvdisplay | pvdisplay | ||
| Line 1,248: | Line 1,190: | ||
|    Free PE               1407 |    Free PE               1407 | ||
|    Allocated PE          13489 |    Allocated PE          13489 | ||
|    PV UUID                |    PV UUID               IpySTY-a9BY-31XE-Bxd4-H9sp-OEJG-kP7dtg | ||
|    "/dev/ |    "/dev/drbd0" is a new physical volume of "399.99 GB" | ||
|    --- NEW Physical volume --- |    --- NEW Physical volume --- | ||
|    PV Name               /dev/ |    PV Name               /dev/drbd0 | ||
|    VG Name                 |    VG Name                 | ||
|    PV Size               399.99 GB |    PV Size               399.99 GB | ||
| Line 1,260: | Line 1,202: | ||
|    Free PE               0 |    Free PE               0 | ||
|    Allocated PE          0 |    Allocated PE          0 | ||
|    PV UUID                |    PV UUID               S6OkVh-NlwQ-BaUn-k5LI-1iTo-pu8V-Uq3qE2 | ||
| </source> | |||
| If you see <span class="code">PV Name /dev/drbd0</span> on both nodes, then your DRBD setup and LVM configuration changes are working perfectly! | |||
| == Creating a VG on the new PV == | |||
| Now we need to create the volume group using the <span class="code">vgcreate</span> command: | |||
| <source lang="bash"> | |||
| vgcreate -c y drbd_vg0 /dev/drbd0 | |||
| </source> | |||
| <source lang="text"> | |||
|   Clustered volume group "drbd_vg0" successfully created | |||
| </source> | </source> | ||
| Now we'll check that the new VG is visible on both nodes using <span class="code">vgdisplay</span>: | |||
| ==  | <source lang="bash"> | ||
| vgdisplay | |||
| </source> | |||
| <source lang="text"> | |||
|   --- Volume group --- | |||
|   VG Name               an-lvm01 | |||
|   System ID              | |||
|   Format                lvm2 | |||
|   Metadata Areas        1 | |||
|   Metadata Sequence No  4 | |||
|   VG Access             read/write | |||
|   VG Status             resizable | |||
|   MAX LV                0 | |||
|   Cur LV                3 | |||
|   Open LV               3 | |||
|   Max PV                0 | |||
|   Cur PV                1 | |||
|   Act PV                1 | |||
|   VG Size               465.50 GB | |||
|   PE Size               32.00 MB | |||
|   Total PE              14896 | |||
|   Alloc PE / Size       13489 / 421.53 GB | |||
|   Free  PE / Size       1407 / 43.97 GB | |||
|   VG UUID               C0kHFA-OTo8-Gshr-3wIw-3Q0I-eT3X-A9Y0NA | |||
|   --- Volume group --- | |||
|   VG Name               drbd_vg0 | |||
|   System ID              | |||
|   Format                lvm2 | |||
|   Metadata Areas        1 | |||
|   Metadata Sequence No  2 | |||
|   VG Access             read/write | |||
|   VG Status             resizable | |||
|   Clustered             yes | |||
|   Shared                no | |||
|   MAX LV                0 | |||
|   Cur LV                1 | |||
|   Open LV               0 | |||
|   Max PV                0 | |||
|   Cur PV                1 | |||
|   Act PV                1 | |||
|   VG Size               399.98 GB | |||
|   PE Size               4.00 MB | |||
|   Total PE              102396 | |||
|   Alloc PE / Size       5120 / 20.00 GB | |||
|   Free  PE / Size       97276 / 379.98 GB | |||
|   VG UUID               TmlQmv-eViK-7Ubr-Dyck-0u86-uEWJ-rDOt9i | |||
| </source> | |||
| If the new VG is visible on both nodes, we are ready to create our first logical volume using the <span class="code">lvcreate</span> tool. | |||
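The sizes in the <span class="code">vgdisplay</span> output follow directly from the extent counts: size equals the number of physical extents multiplied by the PE size. The helper below is a small sketch of that conversion; <span class="code">extents_to_gb</span> is a made-up name.

```bash
# Convert a number of LVM extents to GB.
# $1 = number of extents, $2 = extent (PE) size in MB.
# NOTE: extents_to_gb is a hypothetical helper used only for illustration.
extents_to_gb() {
    awk -v n="$1" -v pe_mb="$2" 'BEGIN { printf "%.2f\n", n * pe_mb / 1024 }'
}
```

For <span class="code">drbd_vg0</span>, which uses 4 MB extents, <span class="code">extents_to_gb 102396 4</span> prints <span class="code">399.98</span> and <span class="code">extents_to_gb 5120 4</span> prints <span class="code">20.00</span>, matching the "VG Size" and "Alloc PE / Size" lines above.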
| == Creating the First Two LVs on the new VG == | |||
| Now we'll create two simple 20 GiB logical volumes. The first one will be a shared GFS store for source ISOs and the second will be used for our first virtual machine. | |||
| <source lang="bash"> | |||
| lvcreate -L 20G -n iso_store drbd_vg0 | |||
| lvcreate -L 20G -n vm01 drbd_vg0 | |||
| </source> | |||
| <source lang="text"> | |||
|   Logical volume "iso_store" created | |||
|   Logical volume "vm01" created | |||
| </source> | |||
| As before, we will check that the new logical volumes are visible from both nodes by using the <span class="code">lvdisplay</span> command: | |||
| <source lang="bash"> | |||
| lvdisplay | |||
| </source> | |||
| <source lang="text"> | |||
|   --- Logical volume --- | |||
|   LV Name                /dev/an-lvm02/lv01 | |||
|   VG Name                an-lvm02 | |||
|   LV UUID                Dy2MNa-EUxN-9x6f-ovkj-NCpk-nlV2-kr5QBb | |||
|   LV Write Access        read/write | |||
|   LV Status              available | |||
|   # open                 1 | |||
|   LV Size                19.53 GB | |||
|   Current LE             625 | |||
|   Segments               1 | |||
|   Allocation             inherit | |||
|   Read ahead sectors     auto | |||
|   - currently set to     256 | |||
|   Block device           253:0 | |||
|   --- Logical volume --- | |||
|   LV Name                /dev/an-lvm02/lv00 | |||
|   VG Name                an-lvm02 | |||
|   LV UUID                xkBu7j-wtOe-ORr3-68qJ-u0ux-Qif4-stw5SY | |||
|   LV Write Access        read/write | |||
|   LV Status              available | |||
|   # open                 1 | |||
|   LV Size                2.00 GB | |||
|   Current LE             64 | |||
|   Segments               1 | |||
|   Allocation             inherit | |||
|   Read ahead sectors     auto | |||
|   - currently set to     256 | |||
|   Block device           253:1 | |||
|   --- Logical volume --- | |||
|   LV Name                /dev/an-lvm02/lv02 | |||
|   VG Name                an-lvm02 | |||
|   LV UUID                R20GH1-wQKq-WgUR-x1gx-Yzzp-WjND-WHAjEO | |||
|   LV Write Access        read/write | |||
|   LV Status              available | |||
|   # open                 2 | |||
|   LV Size                400.00 GB | |||
|   Current LE             12800 | |||
|   Segments               1 | |||
|   Allocation             inherit | |||
|   Read ahead sectors     auto | |||
|   - currently set to     256 | |||
|   Block device           253:2 | |||
|   --- Logical volume --- | |||
|   LV Name                /dev/drbd_vg0/iso_store | |||
|   VG Name                drbd_vg0 | |||
|   LV UUID                svJx35-KDXK-ojD2-UDAA-Ah9t-UgUl-ijekhf | |||
|   LV Write Access        read/write | |||
|   LV Status              available | |||
|   # open                 0 | |||
|   LV Size                20.00 GB | |||
|   Current LE             5120 | |||
|   Segments               1 | |||
|   Allocation             inherit | |||
|   Read ahead sectors     auto | |||
|   - currently set to     256 | |||
|   Block device           253:3 | |||
|   --- Logical volume --- | |||
|   LV Name                /dev/drbd_vg0/vm01 | |||
|   VG Name                drbd_vg0 | |||
|   LV UUID                sceLmK-ZJIp-fN5g-RMaS-j5sq-NuY5-7hIwhP | |||
|   LV Write Access        read/write | |||
|   LV Status              available | |||
|   # open                 0 | |||
|   LV Size                20.00 GB | |||
|   Current LE             5120 | |||
|   Segments               1 | |||
|   Allocation             inherit | |||
|   Read ahead sectors     auto | |||
|   - currently set to     256 | |||
|   Block device           253:4 | |||
| </source> | |||
| The last two are the new logical volumes. | |||
| == Create the Shared GFS FileSystem == | |||
| GFS is a cluster-aware file system that can be mounted on two or more nodes at the same time. We will use it as a place to store ISOs that we'll use to provision our virtual machines. | |||
| The following example is designed for the cluster used in this paper.  | |||
| * If you have more than 2 nodes, increase the <span class="code">-j 2</span> to the number of nodes you want to mount this file system on. | |||
| * If your cluster is named something other than <span class="code">an-cluster</span> (as set in the <span class="code">cluster.conf</span> file), change <span class="code">-t an-cluster:iso_store</span> to match your cluster's name. The <span class="code">iso_store</span> can be whatever you like, but it must be unique in the cluster. I tend to use a name that matches the LV name, but this is my own preference and is not required. | |||
| To format the partition run: | |||
| <source lang="bash"> | |||
| mkfs.gfs2 -p lock_dlm -j 2 -t an-cluster:iso_store /dev/drbd_vg0/iso_store | |||
| </source> | |||
| If you are prompted, press <span class="code">y</span> to proceed. | |||
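If your cluster differs from the one in this paper, the two adjustments described above can be sketched as a helper that only prints the command so you can review it before running anything. The name <span class="code">gfs2_mkfs_command</span> is hypothetical; the flags are the ones used in the command above.

```bash
# Print (but do not run) the mkfs.gfs2 command for a given cluster layout.
# $1 = cluster name from cluster.conf
# $2 = file system name (must be unique in the cluster)
# $3 = number of journals, one per node that will mount the file system
# $4 = device to format
# NOTE: gfs2_mkfs_command is a hypothetical helper for illustration only.
gfs2_mkfs_command() {
    printf 'mkfs.gfs2 -p lock_dlm -j %s -t %s:%s %s\n' "$3" "$1" "$2" "$4"
}
```

For the cluster in this paper, <span class="code">gfs2_mkfs_command an-cluster iso_store 2 /dev/drbd_vg0/iso_store</span> prints the same command shown above.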
| Once the format completes, you can mount <span class="code">/dev/drbd_vg0/iso_store</span> as you would a normal file system. | |||
| '''Both''': | '''Both''': | ||
| To complete the example, let's mount the GFS2 partition we made just now on <span class="code">/shared</span>. | |||
| <source lang="bash"> | <source lang="bash"> | ||
| mkdir /shared | |||
| mount /dev/drbd_vg0/iso_store /shared | |||
| </source> | |||
| Done! | |||
| == Growing a GFS2 Partition == | |||
| To grow a GFS2 partition, you must know where it is mounted. You can not grow an unmounted GFS2 partition, as odd as that may seem at first. Also, you only need to run grow commands from one node. Once completed, all nodes will see and use the new free space automatically. | |||
| This requires two steps to complete: | |||
| # Extend the underlying LVM logical volume | |||
| # Grow the actual GFS2 partition | |||
| === Extend the LVM LV === | |||
| To keep things simple, we'll just use some of the free space we left on our <span class="code">/dev/drbd0</span> LVM physical volume. If you need to add more storage to your LVM first, please follow the instructions in the article: "[[Adding Space to an LVM]]" before proceeding. | |||
| Let's add <span class="code">50GB</span> to our GFS2 logical volume <span class="code">/dev/drbd_vg0/iso_store</span> from the <span class="code">/dev/drbd0</span> physical volume, which we know is available because we left more than that back when we first set up our LVM. To actually add the space, we need to use the <span class="code">lvextend</span> command: | |||
| <source lang="bash"> | |||
| lvextend -L +50G /dev/drbd_vg0/iso_store /dev/drbd0 | |||
| </source> | </source> | ||
| Which should return: | |||
| <source lang="text"> | <source lang="text"> | ||
|   Extending logical volume iso_store to 70.00 GB | |||
|   Logical volume iso_store successfully resized | |||
| </source> | </source> | ||
| If we run <span class="code">lvdisplay /dev/drbd_vg0/iso_store</span> now, we should see the extra space. | |||
| <source lang="text"> | <source lang="text"> | ||
|   --- Logical volume --- | |||
|   LV Name                /dev/drbd_vg0/iso_store | |||
|   VG Name                drbd_vg0 | |||
|   LV UUID                svJx35-KDXK-ojD2-UDAA-Ah9t-UgUl-ijekhf | |||
|   LV Write Access        read/write | |||
|   LV Status              available | |||
|   # open                 1 | |||
|   LV Size                70.00 GB | |||
|   Current LE             17920 | |||
|   Segments               2 | |||
|   Allocation             inherit | |||
|   Read ahead sectors     auto | |||
|   - currently set to     256 | |||
|   Block device           253:3 | |||
| </source> | </source> | ||
| You're now ready to proceed. | |||
| === Grow The GFS2 Partition === | |||
| This step is pretty simple, but you need to enter the commands exactly. Also, you'll want to do a dry-run first and address any resulting errors before issuing the final <span class="code">gfs2_grow</span> command. | |||
| To get the exact name to use when calling <span class="code">gfs2_grow</span>, run the following command: | |||
| <source lang="bash"> | |||
| gfs2_tool df | |||
| </source> | |||
| <source lang="text"> | <source lang="text"> | ||
| /shared: | |||
|   SB lock proto = "lock_dlm" | |||
|   SB lock table = "an-cluster:iso_store" | |||
|   SB ondisk format = 1801 | |||
|   SB multihost format = 1900 | |||
|   Block size = 4096 | |||
|   Journals = 2 | |||
|   Resource Groups = 80 | |||
|   Mounted lock proto = "lock_dlm" | |||
|   Mounted lock table = "an-cluster:iso_store" | |||
|   Mounted host data = "jid=1:id=196610:first=0" | |||
|   Journal number = 1 | |||
|   Lock module flags = 0 | |||
|   Local flocks = FALSE | |||
|   Local caching = FALSE | |||
|   Type           Total Blocks   Used Blocks    Free Blocks    use%            | |||
|   ------------------------------------------------------------------------ | |||
|   data           5242304        1773818        3468486        34% | |||
|   inodes         3468580        94             3468486        0% | |||
| </source> | |||
| From this output, we know that GFS2 expects the name "<span class="code">/shared</span>". Even adding something as simple as a trailing slash ''will not work''. The program we will use is called <span class="code">gfs2_grow</span> with the <span class="code">-T</span> switch to run the command as a test to work out possible errors. | |||
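One way to avoid the trailing-slash mistake is to copy the mount point exactly as the kernel records it. The sketch below reads <span class="code">/proc/mounts</span>-style text and prints the mount point of every GFS2 file system; <span class="code">gfs2_mount_points</span> is a hypothetical helper, and it assumes the usual field order of device, mount point, type and options.

```bash
# Print the mount point of every GFS2 file system listed in
# /proc/mounts-style text on stdin.
# NOTE: gfs2_mount_points is a hypothetical helper, not a GFS2 tool.
gfs2_mount_points() {
    awk '$3 == "gfs2" { print $2 }'
}
```

Use it as <span class="code">gfs2_mount_points < /proc/mounts</span> and pass the printed path, unchanged, to <span class="code">gfs2_grow</span>.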
| For example, if you added the trailing slash, this is the kind of error you would see: | |||
| '''Bad command''': | |||
| <source lang="bash"> | |||
| gfs2_grow -T /shared/ | |||
| </source> | |||
| <source lang="text"> | |||
| GFS Filesystem /shared/ not found | |||
| </source> | |||
| Once we get it right, it will look like this: | |||
| <source lang="bash"> | |||
| gfs2_grow -T /shared | |||
| </source> | |||
| <source lang="text"> | |||
| (Test mode--File system will not be changed) | |||
| FS: Mount Point: /shared | |||
| FS: Device:      /dev/mapper/drbd_vg0-iso_store | |||
| FS: Size:        5242878 (0x4ffffe) | |||
| FS: RG size:     65535 (0xffff) | |||
| DEV: Size:       18350080 (0x1180000) | |||
| The file system grew by 51200MB. | |||
| gfs2_grow complete. | |||
| </source> | |||
| This looks good! We're now ready to re-run the command without the <span class="code">-T</span> switch: | |||
| <source lang="bash"> | |||
| gfs2_grow /shared | |||
| </source> | |||
| <source lang="text"> | |||
| FS: Mount Point: /shared | |||
| FS: Device:      /dev/mapper/drbd_vg0-iso_store | |||
| FS: Size:        5242878 (0x4ffffe) | |||
| FS: RG size:     65535 (0xffff) | |||
| DEV: Size:       18350080 (0x1180000) | |||
| The file system grew by 51200MB. | |||
| gfs2_grow complete. | |||
| </source> | </source> | ||
| You can check that the new space is available on both nodes now using a simple call like <span class="code">df -h</span>. | |||
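The numbers reported by <span class="code">gfs2_grow</span> can also be cross-checked by hand. The "DEV: Size" value is a count of file system blocks, and <span class="code">gfs2_tool df</span> reported a block size of 4096 bytes, so a small conversion shows whether the device really is the size you extended it to. This is only a sketch and <span class="code">blocks_to_mb</span> is a made-up name.

```bash
# Convert a block count to whole MB.
# $1 = number of blocks, $2 = block size in bytes.
# NOTE: blocks_to_mb is a hypothetical helper used only for illustration.
blocks_to_mb() {
    awk -v blocks="$1" -v bs="$2" 'BEGIN { printf "%d\n", blocks * bs / 1048576 }'
}
```

Here, <span class="code">blocks_to_mb 18350080 4096</span> prints <span class="code">71680</span>, which is 70 GB. Subtracting the original 20 GB (20480 MB) leaves the 51200 MB that <span class="code">gfs2_grow</span> reported.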
| = Creating Our Floating VM = | |||
| = Convirt = | |||
| <source lang="bash"> | <source lang="bash"> | ||
| yum -y install pygtk2 vte vnc tunctl dnsmasq bridge-utils | |||
| cd /etc/yum.repos.d | |||
| wget --no-cache http://www.convirture.com/repos/definitions/rhel/5.x/convirt.repo | |||
| yum -y install convirt | |||
| /usr/share/convirt/install/managed_server/scripts/convirt-tool setup | |||
| </source> | |||
| After running 'convirt-tool setup', comment out the 'convirt-xen-multibridge' entry that it added to '/etc/xen/xend-config.sxp'. | |||
| Start 'convirt': | |||
| <source lang="bash"> | |||
| convirt & | |||
| </source> | </source> | ||
| Remove "QA Lab" and "Desktop" groups.  | |||
| Right-click on 'Servers' and choose 'Add Server'. For each node, enter its hostname (i.e. an-node01 and an-node02) and that machine's root password. Leave the 'Xen Protocol' as 'XML-RPC'. | |||
| = Pacemaker = | |||
| In short, Pacemaker is a cluster resource manager. | |||
| Pacemaker runs on top of [[OpenAIS]] and handles clustered resources. For example, it can be used to move around a shared IP address, bring up services on the surviving node after a failure and restore services when a cluster node rejoins. | |||
| == Installing Pacemaker == | |||
| First, add the [[Adding the DAG Repository to CentOS|DAG]] repositories to your system. | |||
| Download and install EPEL: | |||
| <source lang="bash"> | <source lang="bash"> | ||
| cd /etc/yum.repos.d/ | |||
| rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/x86_64/epel-release-5-3.noarch.rpm | |||
| wget -O /etc/yum.repos.d/pacemaker.repo http://clusterlabs.org/rpm/epel-5/clusterlabs.repo | |||
| yum install pacemaker heartbeat libibverbs librdmacm  | |||
| </source> | </source> | ||
| <source lang=" | |||
| = ToDo = | |||
| This is a list of "ToDo"s for this paper. | |||
| * Update and integrate [[Adding Space to an LVM]]. | |||
| * Once done, point to [[Creating a Custom CentOS-derived Distribution]]. | |||
| * Add a section of setting up [[SSH_Tutorial|shared ssh keys]]. | |||
| * [[Bridging in Fedora Core 13]] | |||
| * [[Setting Up a PXE Server in Fedora]] | |||
| = Random = | |||
| This is a sandbox for notes to be integrated later. | |||
| == Provision VM == | |||
| <source lang="bash"> | |||
| virt-install -n rhel6-01 -r 1024 --vcpus=1 --cpuset=1 --os-type=linux --os-variant=rhel6 -c /dev/sr0 --hvm --virt-type=kvm --disk /dev/vg_an-node02/lv_rhel6-01 --network bridge=br0 --vnc | |||
| </source> | </source> | ||
| Line 1,317: | Line 1,583: | ||
| * To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013922.html HJ Lee] from the OpenAIS list for helping me understand the mechanisms controlling the Redundant Ring Protocol's failure detection types. | * To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013922.html HJ Lee] from the OpenAIS list for helping me understand the mechanisms controlling the Redundant Ring Protocol's failure detection types. | ||
| * To [https://lists.linux-foundation.org/pipermail/openais/2010-February/013925.html Steven Dake] for clarifying the <span class="code">to_x</span> vs. <span class="code">logoutput: x</span> arguments in <span class="code">openais.conf</span>. | |||
| * To [http://dk.linkedin.com/in/fabbione Fabio Massimo Di Nitto] for helping me get caught up with clustering and VMs on FC13. | |||
| {{footer}} | {{footer}} | ||
Latest revision as of 16:08, 5 January 2012
Warning: This is an older article and should only be used for historical reference. The only up-to-date clustering tutorial is: Red Hat Cluster Service 2 Tutorial.
I've restarted this tutorial from scratch in the Red Hat Cluster Service 2 Tutorial. This tutorial is for reference only and will be deleted once I finish the new tutorial. Please do not follow this tutorial, but it should be ok to use as a reference. - Mar. 14, 2011.
Related articles:
- Node Assassin
- The prototype from this project is complete and is used as the fence device in this article.
 
Progress
May 25, 2010: Happy Towel Day! Despite the lack of updates right here, a lot has been happening. I gave my first talk based on this paper to TLUG and am now working to expand it into a full cluster workshop. I'm still sorting out where and how I'll split this talk off, but until then, this should be useful now up to the working cluster stage.
Mar. 26, 2010: I've been side-tracked this week getting the next version of Node Assassin's hardware done. This is the version that will implement independent power sensing to tell when a node is truly on or off. Once done, I'll be back to finish this paper.
Mar. 17, 2010: Sorted out and mostly completed the LVM section. Just need to sort out some sample cluster-aware formatting examples before moving on to the Xen virtual machine provisioning.
Mar. 15, 2010: Finished working through the paper as it is so far. Tomorrow or Tuesday I will begin expanding on it.
Mar. 14, 2010: Happy π-day all! I've moved the body of this paper to here and will start moving pieces back. I'm switching over to more of "choose your own adventure" style, given the various possible configurations available in clustering. I won't pretend to cover them all, but I will at least be able to insert pointers to different layouts at points where you may want to branch out from this document's path. Also, I have changed my approach to providing iSCSI/SAN and thus will remove discussion of that component until the 3+ Node CentOS5 Cluster + SoftSAN paper where this paper will act as a prerequisite.
Overview
This paper has two goals:
- How to assemble the simplest cluster possible, a 2 Node Cluster, which you can then expand on for your own needs.
- How to create a "floating" virtual machine that can move between the two nodes in the event of a node failure, maximizing uptime.
Prerequisites
It is expected that you are already comfortable with the Linux command line, specifically bash, and that you are familiar with general administrative tasks in Red Hat based distributions, specifically CentOS. You will also need to be comfortable using editors like vim, nano or similar. This paper uses vim in examples. Simply substitute your favourite editor in its place.
You are also expected to be comfortable with networking concepts. You will be expected to understand TCP/IP, multicast, broadcast, subnets and netmasks, routing and other relatively basic networking concepts. Please take the time to become familiar with these concepts before proceeding.
This said, where feasible, as much detail as is possible will be provided. For example, all configuration file locations will be shown and functioning sample files will be provided.
Platform
This paper will implement the Red Hat Cluster Suite using the CentOS binary-compatible distribution. This paper uses the x86_64 repositories, however, if you are on an i386 (32 bit) system, you should be able to follow along fine. Simply replace x86_64 with i386 in package names.
You can either download the stock CentOS 5-series DVD ISO (currently at version 5.4 which is used in this document), or you can try out the alpha AN!Cluster Install DVD. (4.3GB iso). If you use the latter, please test it out on a development or test cluster. If you have any problems with the AN!Cluster variant CentOS distro, please contact me and let me know what your trouble was.
Focus
Clusters can serve to solve three problems: Reliability, Performance and Scalability.
The focus of the cluster described in this paper is primarily reliability. Second to this, scalability will be the priority, leaving performance to be addressed only when it does not impact the first two criteria. This is not to indicate that performance is not a valid priority, it simply isn't the priority of this paper.
Goal
At the end of this paper, you should have a fully functioning two-node array capable of hosting a "floating" virtual machine. That is, a virtual machine that exists on one node and can be easily moved to the other node with minimal effort and downtime. This should conclude with a solid foundation for adding more virtual servers up to the limit of your cluster's resources.
This paper should also serve to show how to build the foundation of any other cluster configuration. This paper has a core focus of introducing the main issues that come with clustering and hopes to serve as a foundation for any cluster configuration outside the scope of this paper.
Begin
Let's begin!
Hardware
We will need two physical servers each with the following hardware:
- One or more multi-core CPUs with Virtualization support.
- Three network cards; at least one should be gigabit or faster.
- One or more hard drives.
This paper uses the following hardware:
- ASUS M4A78L-M
- AMD Athlon II x2 250
- 2GB Kingston DDR2 KVR800D2N6K2/4G (4GB kit split between the two nodes)
- 1x Intel 82540 PCI NIC
- 1x D-Link DGE-560T
This is not an endorsement of the above hardware. I bought what was within my budget that would serve the purposes of creating this document. What you purchase shouldn't matter, so long as the minimum requirements are met.
OS Install
Start with a stock CentOS 5.x install. This How-To uses CentOS 5.4 x86_64, however it should be fairly easy to adapt to other CentOS 5*, RHEL5 or other RHEL5-based distributions.
These are the sample kickstart scripts used by this paper. Be sure to set your own password string and network settings.
Warning! These kickstart scripts will erase your hard drive! Adapt them, don't blindly use them.
Generic cluster node kickstart scripts.
- an-node01.ks
- an-node02.ks
- c5_generic_node.ks - New kickstart for automatically detecting and configuring storage.
AN!Cluster Install
If you are feeling brave, below is a link to a custom install DVD that contains the kickstart scripts to set up nodes and an an-cluster directory with all the configuration files.
- Download the custom AN!Cluster v0.1.006 Install DVD. (4.5GiB iso). (Currently disabled - Reworking for F13)
Post OS Install
Once the OS is installed, we need to do some ground work.
- Set up networking.
- Limit dom0's memory.
- Change the default run-level.
- Change when xend starts.
Post-Install Network Configuration
This cluster uses Xen, which fairly dramatically impacts networking. Terms you need to be familiar with are:
- dom0
- This is the "first" virtual machine with special access to the underlying hardware. This looks like the host operating system but is in fact just another virtual server running under Xen. This is also the virtual machine that can directly see the Xen networking infrastructure.
 
- domU
- These are the virtual servers set up in and managed by the dom0 virtual machine. These are what most people think of when talking about "virtual servers" under Xen.
 
Ethernet Devices and Subnets
The most important thing to do after the install is to identify which ethX device matches which network card. This is important in two cases:
- The fastest network card should be allocated to the DRBD partition.
- If you have IPMI piggy-backed on a physical network card, it should be allocated to the back-channel subnet.
This paper has the following configuration:
- eth0; Internet-polluted subnet.
- eth1; DRBD subnet.
- eth2; Back-channel subnet.
To change which ethX device maps to which ethernet card, please see:
If you are unfamiliar with how networking works in Xen, please read this article:
Choosing your Subnets
There will be three subnets in our two node cluster:
- Internet-polluted subnet; 192.168.1.0/24
- This subnet will ultimately be directly accessible only by the firewall virtual server. All other virtual machines and the nodes' dom0s will access the internet via the firewall for security reasons. During setup though, the 'dom0' servers will directly access this subnet.
 
- DRBD subnet; 10.0.0.0/24
- Only the two 'dom0' servers will have access to this subnet. It is used for DRBD communication and as a backup for the totem ring protocol.
 
- Back-channel; 10.0.1.0/24
- This is the private subnet used for communication between the dom0 and domU virtual servers. This subnet will have no direct access to the internet.
 
I like to assign the same last octet to a given node's addresses on each subnet. This helps me keep track of which node I am working with at any given time. Here is how I set up my two nodes:
- an-node01
- eth0: 192.168.1.71
- eth1: 10.0.0.71
- eth2: 10.0.1.71
 
- an-node02
- eth0: 192.168.1.72
- eth1: 10.0.0.72
- eth2: 10.0.1.72
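The addressing scheme above can be sketched as a small helper. The function name node_addresses is made up for illustration; the three subnets are the ones chosen in this paper, so adapt them to your own.

```bash
#!/bin/bash
# Hypothetical helper: print the three addresses a node would get when the
# same last octet is reused on every subnet used in this paper.
node_addresses() {
    local octet="$1"
    printf 'eth0: 192.168.1.%s\n' "$octet"   # Internet-polluted subnet
    printf 'eth1: 10.0.0.%s\n' "$octet"      # DRBD subnet
    printf 'eth2: 10.0.1.%s\n' "$octet"      # Back-channel subnet
}
```

Calling node_addresses 71 should list the three addresses shown for an-node01 above.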
 
/etc/hosts
Some applications expect to be able to call nodes by their name. To accommodate this, and to ensure that inter-node communication takes place on the back-channel subnet, we add the following to the /etc/hosts file:
vim /etc/hosts
# Back-channel IP to name mapping.
10.0.1.71	an-node01 an-node01.alteeve.com
10.0.1.72	an-node02 an-node02.alteeve.com
Note: Delete any pre-existing entries matching the name returned by uname -n. There is a good chance there will be an entry that resolves to 127.0.0.1 which would cause problems later.
Obviously, adapt the names and IPs to match your nodes and subnets. The only critical thing is to make sure that the name returned by uname -n is resolvable to the back-channel subnet. I like to add a short-form name for convenience.
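A rough way to check this is to look up the node's name in the hosts file and confirm the first match sits on the back-channel subnet. This is only a sketch: hosts_lookup and on_back_channel are hypothetical helper names, and the 10.0.1. prefix assumes the subnets used in this paper.

```bash
#!/bin/bash
# Hypothetical helper: print the first address that a hosts-format file
# maps to the given name, skipping comments.
hosts_lookup() {
    local name="$1" file="${2:-/etc/hosts}"
    awk -v n="$name" '
        /^[[:space:]]*#/ { next }            # skip whole-line comments
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break         # ignore trailing comments
                if ($i == n) { print $1; exit }
            }
        }' "$file"
}

# Hypothetical helper: succeed only if the name's first match starts with
# the given prefix (10.0.1. is the back-channel subnet in this paper).
on_back_channel() {
    local ip
    ip=$(hosts_lookup "$1" "${3:-/etc/hosts}")
    case "$ip" in
        "$2"*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Example: on_back_channel "$(uname -n)" 10.0.1. || echo "check /etc/hosts"
```

Because only the first match is reported, a leftover 127.0.0.1 entry for the node's name that appears earlier in the file makes the check fail, which is the situation the note above warns about.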
iptables
Be sure to flush netfilter tables and disable iptables and ip6tables from starting on your nodes. This is because the 'dom0' servers will not be connected directly to the Internet and we want to minimize the chance of an errant iptables rule messing up our configuration. If, before launch, you wish to implement a firewall, feel free to do so but be sure to thoroughly test your cluster to ensure no problems were introduced.
chkconfig --level 2345 iptables off
/etc/init.d/iptables stop
chkconfig --level 2345 ip6tables off
/etc/init.d/ip6tables stop
Limit dom0's Memory
Normally, 'dom0' will claim and use memory not allocated to virtual machines. This can cause trouble if, for example, you've moved a VM off of a node and then want to move it or another VM back. For a period of time, the node will claim that there is not enough free memory for the migration. By setting a hard limit of dom0's memory usage, this scenario won't happen and you will not need to delay migrations.
To do this, add dom0_mem=512M to the Xen kernel image's first module line in grub. For example, you should have a line like this in your grub configuration file:
vim /boot/grub/menu.lst
title CentOS (2.6.18-164.15.1.el5)
	root (hd0,0)
	kernel /vmlinuz-2.6.18-164.15.1.el5 ro root=/dev/an-lvm01/lv01 rhgb quiet dom0_mem=512M
	initrd /initrd-2.6.18-164.15.1.el5.img
You can replace '512M' with the amount of RAM you want to allocate to dom0. Note that if you used the AN!Cluster install DVD or the AN!Cluster kickstart files, this should already be set for you.
REMEMBER!
If you update your kernel, be sure to re-add this argument to the new kernel's argument list.
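Since this is easy to forget, here is a minimal sketch of a check you could run after a kernel update. It assumes the argument sits on the kernel line, as in the sample above, and the function name kernels_missing_dom0_mem is made up for illustration.

```bash
#!/bin/bash
# Hypothetical helper: print the line number and text of every "kernel"
# line in a grub menu file that has no dom0_mem= argument.
# No output means nothing is missing.
kernels_missing_dom0_mem() {
    local file="${1:-/boot/grub/menu.lst}"
    grep -nE '^[[:space:]]*kernel[[:space:]]' "$file" | grep -v 'dom0_mem=' || true
}

# Example: kernels_missing_dom0_mem /boot/grub/menu.lst
```

Any line it prints belongs to a boot entry that still needs the argument added by hand.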
Change the Default Run-Level
If you don't plan to work on your nodes directly, it makes sense to switch the default run level from 5 to 3. This prevents Gnome from starting at boot, thus freeing up a lot of memory and system resources and reducing the possible attack vectors.
To do this, edit /etc/inittab, change the id:5:initdefault: line to id:3:initdefault: and then switch to run level 3:
vim /etc/inittab
id:3:initdefault:
init 3
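If you would rather script the edit than open vim on each node, something like the following should work. It is a sketch only: set_default_runlevel is a hypothetical name, and it relies on GNU sed's -i option, which keeps a .bak copy of the original file here.

```bash
#!/bin/bash
# Hypothetical helper: rewrite the initdefault line of an inittab-style
# file to the requested run level, keeping a .bak copy of the original.
set_default_runlevel() {
    local level="$1" file="${2:-/etc/inittab}"
    case "$level" in
        [1-5]) ;;                                   # 0 and 6 would halt or reboot
        *) echo "refusing run level '${level}'" >&2; return 1 ;;
    esac
    sed -i.bak "s/^id:[0-9]:initdefault:/id:${level}:initdefault:/" "$file"
}

# Example: set_default_runlevel 3 && init 3
```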
Change when xend starts
Normally, xend starts at priority 98 in /etc/rcX.d/. This can cause problems with other packages that expect the network to be stable. This is because xend takes all the networks down when it starts. To prevent these problems, we will move the xend init script to start priority 11. We'll also adapt the stop priority to 89, though this is less critical.
First, edit the actual initialization script and change the line '# chkconfig: 2345 98 01' to '# chkconfig: 2345 11 89'.
vim /etc/init.d/xend
# chkconfig: 2345 11 89
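The same edit can be made non-interactively. This is a minimal sketch assuming GNU sed; the function name set_chkconfig_priorities is hypothetical, and it leaves the run-level list on the header line untouched.

```bash
#!/bin/bash
# Hypothetical helper: change the start and stop priorities on an init
# script's "# chkconfig:" header line, keeping the run-level list as-is.
# A .bak copy of the original script is left beside it.
set_chkconfig_priorities() {
    local file="$1" start="$2" stop="$3"
    sed -i.bak \
        "s/^# chkconfig:[[:space:]]*\([0-9-]*\)[[:space:]].*/# chkconfig: \1 ${start} ${stop}/" \
        "$file"
}

# Example: set_chkconfig_priorities /etc/init.d/xend 11 89
```

The header only takes effect once the service is re-registered with chkconfig.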
Now use chkconfig to apply the changes:
chkconfig --del xend
chkconfig --add xend
You should now see the symlink /etc/rc3.d/S11xend, along with K89xend symlinks in the run levels where xend is stopped (such as /etc/rc0.d/).
Initial Cluster Setup
Before we get into specifics, let's take a minute to talk about the major components used in our cluster.
Core Programs
These are the core programs that may be new to you that we will use to build our cluster.
OpenAIS/Corosync
Pacemaker
DRBD
LVM
Xen
dom0 Setup
Some things, like cluster-aware LVM, won't work until the cluster is set up. For this reason, we need to set up the cluster infrastructure before going any further.
If you didn't read up on how networking in Xen works, now would be a very good time to do so. A lot of the networking from here on in will seem cryptic otherwise when it's actually fairly straightforward.
Adding New NICs to Xen
By default, xend only manages eth0. We need to add eth2 and, if you wish, eth1. Personally, I like to put all my ethernet devices under Xen's control for future flexibility, but this opens a possible security vector as a bridge is created for the DRBD subnet. Whether you add it or not I will leave to your preferences.
You can see which devices are under Xen's control by running ifconfig and checking to see if there is a pethX corresponding to each ethX device. For example, here is what you would see if only eth0 was under Xen's control:
ifconfig
eth0      Link encap:Ethernet  HWaddr 90:E6:BA:71:82:D8  
          inet addr:192.168.1.71  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::92e6:baff:fe71:82d8/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:121 errors:0 dropped:0 overruns:0 frame:0
          TX packets:97 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:20605 (20.1 KiB)  TX bytes:16270 (15.8 KiB)
eth1      Link encap:Ethernet  HWaddr 00:21:91:19:96:5A  
          inet addr:10.0.0.71  Bcast:10.0.0.255  Mask:255.255.255.0
          inet6 addr: fe80::221:91ff:fe19:965a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:45 errors:0 dropped:0 overruns:0 frame:0
          TX packets:53 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:9139 (8.9 KiB)  TX bytes:10259 (10.0 KiB)
          Interrupt:16 
eth2      Link encap:Ethernet  HWaddr 00:0E:0C:59:45:78  
          inet addr:10.0.1.71  Bcast:10.0.1.255  Mask:255.255.255.0
          inet6 addr: fe80::20e:cff:fe59:4578/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:45 errors:0 dropped:0 overruns:0 frame:0
          TX packets:62 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:9790 (9.5 KiB)  TX bytes:11102 (10.8 KiB)
          Base address:0xec00 Memory:febe0000-fec00000 
lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:8 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:560 (560.0 b)  TX bytes:560 (560.0 b)
peth0     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
          UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1
          RX packets:126 errors:0 dropped:0 overruns:0 frame:0
          TX packets:110 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:20923 (20.4 KiB)  TX bytes:18352 (17.9 KiB)
          Interrupt:252 Base address:0x6000 
vif0.0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
          UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1
          RX packets:103 errors:0 dropped:0 overruns:0 frame:0
          TX packets:126 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:19306 (18.8 KiB)  TX bytes:20935 (20.4 KiB)
virbr0    Link encap:Ethernet  HWaddr 00:00:00:00:00:00  
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:49 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:9640 (9.4 KiB)
xenbr0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1
          RX packets:148 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:24256 (23.6 KiB)  TX bytes:0 (0.0 b)
You'll notice that there is no peth1 or peth2 device, nor their associated virtual devices or bridges.
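Rather than reading the whole listing by eye, the check can be sketched as a small filter. It assumes the older ifconfig output format shown above, where interface names start in the first column with no trailing colon; the function name unbridged_nics is made up for illustration.

```bash
#!/bin/bash
# Hypothetical helper: read ifconfig output on stdin and print each ethX
# device that has no matching pethX device.
unbridged_nics() {
    awk '
        /^[^[:space:]]/ { seen[$1] = 1 }     # interface names start in column one
        END {
            for (name in seen) {
                peer = "p" name
                if (name ~ /^eth[0-9]+$/ && !(peer in seen))
                    print name
            }
        }' | sort
}

# Example: ifconfig | unbridged_nics
```

Against the listing above this should report eth1 and eth2. Once every NIC is under Xen's control, it should print nothing.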
Create /etc/xen/scripts/an-network-script
This script will be used by Xen to create bridges for all NICs.
Please note three things:
- You don't need to use the name 'an-network-script'. I suggest this name mainly to keep in line with the rest of the 'AN!x' naming used here.
- If you install convirt, it will create its own bridge script called convirt-xen-multibridge. Other tools may do something similar.
- Adding eth1 is optional, as we know ahead of time that eth1 will not be made available to any virtual machines as it is dedicated to DRBD and totem. I'm adding it here because I like having things consistent; do whichever makes more sense to you.
First, touch the file and then chmod it to be executable.
touch /etc/xen/scripts/an-network-script
chmod 755 /etc/xen/scripts/an-network-script
Now edit it to contain the following:
vim /etc/xen/scripts/an-network-script
#!/bin/sh
dir=$(dirname "$0")
"$dir/network-bridge" "$@" vifnum=0 netdev=eth0 bridge=xenbr0
"$dir/network-bridge" "$@" vifnum=1 netdev=eth1 bridge=xenbr1
"$dir/network-bridge" "$@" vifnum=2 netdev=eth2 bridge=xenbr2
Now tell Xen to reference that script by editing the /etc/xen/xend-config.sxp file and changing the network-script argument to point to this new script (this is line 91 in the default xend-config.sxp file):
vim /etc/xen/xend-config.sxp
#(network-script network-bridge)
(network-script an-network-script)
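For repeating this on the second node, the change can be sketched as a function. This assumes GNU sed and that the active entry starts in the first column, as it does in the lines above; use_network_script is a hypothetical name.

```bash
#!/bin/bash
# Hypothetical helper: comment out any active network-script entry in a
# xend-config.sxp style file and append one pointing at the given script.
# A .bak copy of the original file is left beside it.
use_network_script() {
    local script="$1" file="${2:-/etc/xen/xend-config.sxp}"
    # In a basic regular expression the parentheses are literal characters.
    sed -i.bak 's/^(network-script .*)$/#&/' "$file"
    echo "(network-script ${script})" >> "$file"
}

# Example: use_network_script an-network-script
```

Appending the new entry at the end is a simplification; editing by hand lets you keep it next to the commented-out original.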
Now restart xend:
/etc/init.d/xend restart
If everything worked, you should now be able to run ifconfig and see that all the ethX devices have matching pethX, virtual and bridge devices.
ifconfig
eth0      Link encap:Ethernet  HWaddr 90:E6:BA:71:82:D8  
          inet addr:192.168.1.71  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::92e6:baff:fe71:82d8/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:274 errors:0 dropped:0 overruns:0 frame:0
          TX packets:190 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:33479 (32.6 KiB)  TX bytes:33376 (32.5 KiB)
eth1      Link encap:Ethernet  HWaddr 00:21:91:19:96:5A  
          inet addr:10.0.0.71  Bcast:10.0.0.255  Mask:255.255.255.0
          inet6 addr: fe80::221:91ff:fe19:965a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:33 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:10393 (10.1 KiB)
eth2      Link encap:Ethernet  HWaddr 00:0E:0C:59:45:78  
          inet addr:10.0.1.71  Bcast:10.0.1.255  Mask:255.255.255.0
          inet6 addr: fe80::20e:cff:fe59:4578/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:28 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:9964 (9.7 KiB)
lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:8 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:560 (560.0 b)  TX bytes:560 (560.0 b)
peth0     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
          UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1
          RX packets:281 errors:0 dropped:0 overruns:0 frame:0
          TX packets:204 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:33929 (33.1 KiB)  TX bytes:35540 (34.7 KiB)
          Interrupt:252 Base address:0x6000 
peth1     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
          UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1
          RX packets:45 errors:0 dropped:0 overruns:0 frame:0
          TX packets:86 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:9139 (8.9 KiB)  TX bytes:20652 (20.1 KiB)
          Interrupt:16 
peth2     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
          UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1
          RX packets:45 errors:0 dropped:0 overruns:0 frame:0
          TX packets:90 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:9790 (9.5 KiB)  TX bytes:21066 (20.5 KiB)
          Base address:0xec00 Memory:febe0000-fec00000 
vif0.0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
          UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1
          RX packets:200 errors:0 dropped:0 overruns:0 frame:0
          TX packets:281 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:37668 (36.7 KiB)  TX bytes:33941 (33.1 KiB)
vif0.1    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
          UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1
          RX packets:33 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:10393 (10.1 KiB)  TX bytes:0 (0.0 b)
vif0.2    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
          UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1
          RX packets:28 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:9964 (9.7 KiB)  TX bytes:0 (0.0 b)
virbr0    Link encap:Ethernet  HWaddr 00:00:00:00:00:00  
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:49 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:9640 (9.4 KiB)
xenbr0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1
          RX packets:151 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:24426 (23.8 KiB)  TX bytes:0 (0.0 b)
xenbr1    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1
          RX packets:33 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:9931 (9.6 KiB)  TX bytes:0 (0.0 b)
xenbr2    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF  
          UP BROADCAST RUNNING NOARP  MTU:1500  Metric:1
          RX packets:28 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:9572 (9.3 KiB)  TX bytes:0 (0.0 b)
Fencing
Before proceeding with the cluster.conf file, you must understand what fencing is, how it is used in Red Hat/CentOS clusters and why it is so important.
- The Cluster Admin's Mantra:
- The only thing you don't know is what you don't know.
 
Just because one node loses communication with another node, it cannot be assumed that the silent node is dead!
What is it?
"Fencing" is the act of isolating a malfunctioning node. The goal is to prevent a split-brain condition where two nodes think the other member is dead and continue to use a shared resource. When this happens, file system corruption is almost guaranteed. Another dangerous scenario would be if one node paused while writing to a disk, the other node decides it's dead and starts to replay the journal, then the first node recovers and completes the write. The results would be equally disastrous. If you are lucky enough to not lose the shared file system, you will be faced with the task of determining what data got written to which node, merging that data and/or overwriting the node you trust the least. This 'best case' is still pretty lousy.
Fencing, isolating a node from altering shared disks, can be accomplished in a couple of ways:
- Power
- Power fencing is where a device is used to cut the power to a malfunctioning node. This is probably the most common type.
 
- Blocking
- Blocking is often implemented at the network level. This type of fencing leaves the node alone, but disconnects it from the storage network. Often this is done by a switch which prevents traffic coming from the fenced node.
 
With power fencing, the term used is "STONITH", literally, Shoot The Other Node In The Head. Picture it like an old west duel. If one node is dead, the other node is going to win the duel by default and the dead node will just be shot again. When both nodes are alive, however, the faster node will win and will "kill" (power off or reset) the slower node before it has a chance to fire. Once this duel is over, the surviving node can then access the shared resource confident that it is the only one working on it.
Misconception
It is a very common mistake to ignore fencing when first starting to learn about clustering. Often people think "It's just for production systems, I don't need to worry about it yet because I don't care what happens to my test cluster."
Wrong!
The most practical reason: the cluster software will block all I/O transactions when it can't guarantee a fence operation succeeded. The result is that your cluster will essentially "lock up". Likewise, cman and related daemons will fail if they can't find a fence agent to use.
Secondly, testing our cluster will involve inducing errors. Without proper fencing, there is a high probability that our shared file system will be corrupted. That would force the need to start over, making your learning take a lot longer than it needs to.
Implementation
In Red Hat's cluster software, the fence device(s) are configured in the main /etc/cluster/cluster.conf cluster configuration file. This configuration is then acted on via the fenced daemon. We'll cover the details of the cluster.conf file in a moment.
When the cluster determines that a node needs to be fenced, the fenced daemon will consult the cluster.conf file for information on how to access the fence device. Given this cluster.conf snippet:
<cluster name="an-cluster" config_version="1">
	<clusternodes>
		<clusternode name="an-node02.alteeve.com" nodeid="2">
			<fence>
				<method name="node_assassin">
					<device name="motoko" port="02" action="off"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="motoko" agent="fence_na" quiet="true"
		ipaddr="motoko.alteeve.com" login="motoko" passwd="secret">
		</fencedevice>
	</fencedevices>
</cluster>
If the cluster manager determines that the node an-node02.alteeve.com needs to be fenced, it looks at the first (and only, in this case) <method> inside that node's <fence> element and reads the name of its <device>, which is motoko in this case. It then looks in the <fencedevices> section for the device with the matching name. From there, it gets the information needed to find and access the fence device. Once it connects to the fence device, it then passes the options set in an-node02.alteeve.com's <device> entry.
So in this example, fenced looks up the details on the motoko Node Assassin fence device. It calls the fence_na program, called a fence agent, and passes the following arguments:
- ipaddr=motoko.alteeve.com
- login=motoko
- passwd=secret
- quiet=true
- port=02
- action=off
How the fence agent acts on these arguments varies depending on the fence device itself. In general terms, the 'fence_na' fence agent will create a connection to the device at the IP address (or resolvable name, as in this case) specified in the ipaddr argument. Once connected, it will authenticate using the login and passwd arguments. Once authenticated, it tells the device what port to act on, which could be a power jack, a power or reset button, a network switch port and so on. Finally, it tells the device what action to take.
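To make the hand-off more concrete, here is a toy stand-in for a fence agent. It assumes the usual Red Hat fence agent convention of receiving one name=value pair per line on standard input. It is not fence_na, it talks to no hardware, and the function name toy_fence_agent is invented for this sketch.

```bash
#!/bin/bash
# Toy stand-in for a fence agent; it is NOT fence_na and touches no hardware.
# It reads name=value pairs from stdin and reports what it would do.
toy_fence_agent() {
    local line key value ipaddr="" login="" passwd="" port="" action=""
    while IFS= read -r line; do
        case "$line" in
            ''|'#'*) continue ;;             # skip blank lines and comments
        esac
        key="${line%%=*}"                    # text before the first '='
        value="${line#*=}"                   # text after the first '='
        case "$key" in
            ipaddr) ipaddr="$value" ;;
            login)  login="$value" ;;
            passwd) passwd="$value" ;;
            port)   port="$value" ;;
            action) action="$value" ;;
            *)      ;;                       # ignore options this toy does not use
        esac
    done
    if [ -z "$ipaddr" ] || [ -z "$port" ] || [ -z "$action" ]; then
        echo "missing ipaddr, port or action" >&2
        return 1
    fi
    echo "would log in to ${ipaddr} as ${login} and set port ${port} ${action}"
}

# Example:
# printf 'ipaddr=motoko.alteeve.com\nlogin=motoko\npasswd=secret\nport=02\naction=off\n' | toy_fence_agent
```

A real agent would open the connection and authenticate at the point where this one only prints a message, and its exit status is what tells the caller whether the fence succeeded.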
Once the device completes, it returns a success or failure message. If the first attempt fails, the fenced daemon will try the next <method> in the node's <fence> section, if a second exists. It will keep trying fence devices in the order they are found in the cluster.conf file until it runs out of devices. If it fails to fence the node, most daemons will "block", that is, lock up and stop responding until the issue is resolved. The logic for this is that a locked up cluster is better than a corrupted one.
If any of the fence devices succeed though, the cluster will know that it is safe to proceed and will reconfigure the cluster without the defective node.
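The "try each device in order, block if all fail" behaviour can be modelled as a simple loop. This is a minimal sketch under stated assumptions: the second device (backup_device at backup.example.com) does not exist in this article's cluster and is invented purely to show the fallback, and fake_agent stands in for a real fence agent.

```python
def fence_node(methods, run_agent):
    """Try each fence method in order.

    Returns the name of the first method whose agent reports success,
    or None if every method failed (the cluster must then block).
    """
    for name, args in methods:
        if run_agent(args):
            return name
    return None


# Hypothetical methods; the second device is invented for illustration only.
methods = [
    ("node_assassin", {"ipaddr": "motoko.alteeve.com", "port": "2", "action": "off"}),
    ("backup_device", {"ipaddr": "backup.example.com", "port": "2", "action": "off"}),
]


def fake_agent(args):
    """Stand-in for a real fence agent: pretend motoko is unreachable."""
    return args["ipaddr"] != "motoko.alteeve.com"


result = fence_node(methods, fake_agent)              # falls through to the second device
blocked = fence_node(methods, lambda args: False)     # every device fails
print(result, blocked)
```

A return value of None corresponds to the case where the cluster blocks rather than risk corruption.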
Fence Devices
Many major OEMs have their own remote management devices that can serve as fence devices. Examples are Dell's 'DRAC' (Dell Remote Access Controller), HP's 'iLO' (Integrated Lights-Out), IBM's 'RSA' (Remote Supervisor Adapter), Sun's 'SSP' (System Service Processor) and so on. Smaller manufacturers implement remote management via IPMI, the Intelligent Platform Management Interface.
In the above devices, fencing is implemented via a built-in or integrated device inside the server. These devices are usually accessible even when the host server is powered off or hard locked. Via these devices, the host server can be powered off, reset and powered on remotely, regardless of the state of the host server.
Block fencing is possible when the device connecting a node to shared resources, like a fiber-channel SAN switch, provides a method of logically "unplugging" a defective node from the shared resource, leaving the node itself alone.
Node Assassin
A cheap alternative is the Node Assassin, an open-hardware, open-source fence device. It was built to allow the use of commodity system boards that lack the remote management support found on more expensive, server-class hardware.
Full Disclosure: Node Assassin was created by me, with much help from others, for this paper.
Core Files
There are two main configuration files that need to be set up now.
cluster.conf
The core of the cluster is the /etc/cluster/cluster.conf XML configuration file. It contains information about the cluster itself, what nodes are to be used, how to fence each node, what fence devices exist plus miscellaneous other configuration options.
By default, there is no cluster.conf, so you need to start by creating it:
touch /etc/cluster/cluster.conf
Here is the one AN!Cluster uses, with in-line comments, mostly from the man cluster.conf page.
Once you're comfortable with your changes to the file, you need to validate it. Run:
xmllint --relaxng /usr/share/system-config-cluster/misc/cluster.ng /etc/cluster/cluster.conf
If there are errors, address them. Once you see /etc/cluster/cluster.conf validates, you can proceed to the next step.
Note: If you are using Node Assassin and the XML validation fails, be sure to get the updated cluster.ng validation file!
openais.conf
Where cluster.conf is the core configuration file for the cluster, OpenAIS is the master of ceremonies. It implements all the cluster functions, referencing first its own /etc/ais/openais.conf file and then the /etc/cluster/cluster.conf file. You can think of openais.conf as a "low level" configuration file controlling the underlying mechanics of the cluster, where cluster.conf contains the specific cluster configuration.
Unlike cluster.conf, there is a default openais.conf config file. It's a good habit to back default files up in case you need to start over.
When reviewing the openais.conf file below, please take the time to read the comments in the file. There are many aspects of clustering that will make sense if you understand the various OpenAIS configuration options.
Once you are comfortable, back up and edit openais.conf:
vim /etc/ais/openais.conf
# This is a skeleton example configuration file.
# Totem Protocol options.
totem {
        version: 2
        secauth: off
        threads: 0
        rrp_mode: passive
        interface {
                # This is the back-channel subnet, which is the primary network
                # for the totem protocol.
                ringnumber: 0
                bindnetaddr: 10.0.1.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
        interface {
                # This is the DRBD subnet, which acts as a secondary, backup
                # network for the totem protocol.
                ringnumber: 1
                bindnetaddr: 10.0.0.0
                mcastaddr: 227.94.1.1
                mcastport: 5406
        }
}
# Enable logging.
logging {
        to_syslog: yes
}
# Disable AMF, it's not supported yet.
amf {
        mode: disabled
}
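Each interface block above ties a ring to a network via bindnetaddr, so a node needs a local address on that network for the ring to come up. The sketch below checks which ring a given address would belong to. The /24 prefix length is an assumption (the configuration above gives only the network addresses), and the helper names are invented for this example.

```python
import ipaddress

# bindnetaddr values from the openais.conf above. The /24 prefix length is an
# assumption; use the netmask actually configured on your interfaces.
RINGS = {0: "10.0.1.0/24", 1: "10.0.0.0/24"}
# mcastaddr values from the same file.
MCAST = {0: "226.94.1.1", 1: "227.94.1.1"}


def ring_for(address):
    """Return the ring number whose network contains address, or None."""
    ip = ipaddress.ip_address(address)
    for ring, network in sorted(RINGS.items()):
        if ip in ipaddress.ip_network(network):
            return ring
    return None


def all_multicast():
    """True if every configured mcastaddr is a valid multicast address."""
    return all(ipaddress.ip_address(a).is_multicast for a in MCAST.values())


print(ring_for("10.0.1.71"), ring_for("10.0.0.71"), all_multicast())
```

An address that returns None, such as one on an unrelated subnet, would not match either ring.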
Cluster First Start
If everything up until now was done right, you should be able to start your cluster for the first time. It can be useful to have a separate terminal window open with a tail watching /var/log/messages so that you can see if there are any problems.
On both nodes, in dedicated terminals, run:
clear; tail -f -n 0 /var/log/messages
This next step must be run on both nodes as close together in time as possible. If you try to start one node and wait too long to start the other node, the first node will think there is a problem and it will fence the second node. Remember the <fence_daemon post_join_delay="60"></fence_daemon> line in cluster.conf? This is where it comes into play. The value you set is the "window" you have to start both nodes before a fence is issued. The default is 6 seconds, and the above line changed that to 60 seconds.
On both nodes, in different terminals, check that cman is indeed stopped, then start it up:
/etc/init.d/cman status
ccsd is stopped
/etc/init.d/cman start
If all goes well, you should see something like this in each node's /var/log/messages file:
May 10 20:54:32 an-node01 kernel: DLM (built Mar 17 2010 12:05:05) installed
May 10 20:54:32 an-node01 kernel: GFS2 (built Mar 17 2010 12:05:47) installed
May 10 20:54:32 an-node01 kernel: Lock_DLM (built Mar 17 2010 12:05:54) installed
May 10 20:54:33 an-node01 ccsd[11167]: Starting ccsd 2.0.115: 
May 10 20:54:33 an-node01 ccsd[11167]:  Built: Dec  8 2009 09:20:54 
May 10 20:54:33 an-node01 ccsd[11167]:  Copyright (C) Red Hat, Inc.  2004  All rights reserved. 
May 10 20:54:33 an-node01 ccsd[11167]: cluster.conf (cluster name = an-cluster, version = 1) found. 
May 10 20:54:35 an-node01 openais[11175]: [MAIN ] AIS Executive Service RELEASE 'subrev 1887 version 0.80.6' 
May 10 20:54:35 an-node01 openais[11175]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors. 
May 10 20:54:35 an-node01 openais[11175]: [MAIN ] Copyright (C) 2006 Red Hat, Inc. 
May 10 20:54:35 an-node01 openais[11175]: [MAIN ] AIS Executive Service: started and ready to provide service. 
May 10 20:54:35 an-node01 openais[11175]: [MAIN ] Using default multicast address of 239.192.147.72 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Token Timeout (10000 ms) retransmit timeout (495 ms) 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] token hold (386 ms) retransmits before loss (20 retrans) 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] join (60 ms) send_join (0 ms) consensus (4800 ms) merge (200 ms) 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] downcheck (1000 ms) fail to recv const (50 msgs) 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] seqno unchanged const (30 rotations) Maximum network MTU 1500 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] window size per rotation (50 messages) maximum messages per rotation (17 messages) 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] send threads (0 threads) 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] RRP token expired timeout (495 ms) 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] RRP token problem counter (2000 ms) 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] RRP threshold (10 problem count) 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] RRP mode set to none. 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] heartbeat_failures_allowed (0) 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] max_network_delay (50 ms) 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes). 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] The network interface [10.0.1.71] is now up. 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Created or loaded sequence id 0.10.0.1.71 for this ring. 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] entering GATHER state from 15. 
May 10 20:54:35 an-node01 openais[11175]: [CMAN ] CMAN 2.0.115 (built Dec  8 2009 09:20:58) started 
May 10 20:54:35 an-node01 openais[11175]: [MAIN ] Service initialized 'openais CMAN membership service 2.01' 
May 10 20:54:35 an-node01 openais[11175]: [SERV ] Service initialized 'openais extended virtual synchrony service' 
May 10 20:54:35 an-node01 openais[11175]: [SERV ] Service initialized 'openais cluster membership service B.01.01' 
May 10 20:54:35 an-node01 openais[11175]: [SERV ] Service initialized 'openais availability management framework B.01.01' 
May 10 20:54:35 an-node01 openais[11175]: [SERV ] Service initialized 'openais checkpoint service B.01.01' 
May 10 20:54:35 an-node01 openais[11175]: [SERV ] Service initialized 'openais event service B.01.01' 
May 10 20:54:35 an-node01 openais[11175]: [SERV ] Service initialized 'openais distributed locking service B.01.01' 
May 10 20:54:35 an-node01 openais[11175]: [SERV ] Service initialized 'openais message service B.01.01' 
May 10 20:54:35 an-node01 openais[11175]: [SERV ] Service initialized 'openais configuration service' 
May 10 20:54:35 an-node01 openais[11175]: [SERV ] Service initialized 'openais cluster closed process group service v1.01' 
May 10 20:54:35 an-node01 openais[11175]: [SERV ] Service initialized 'openais cluster config database access v1.01' 
May 10 20:54:35 an-node01 openais[11175]: [SYNC ] Not using a virtual synchrony filter. 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Creating commit token because I am the rep. 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Saving state aru 0 high seq received 0 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Storing new sequence id for ring 4 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] entering COMMIT state. 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] entering RECOVERY state. 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] position [0] member 10.0.1.71: 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] previous ring seq 0 rep 10.0.1.71 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] aru 0 high delivered 0 received flag 1 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Did not need to originate any messages in recovery. 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Sending initial ORF token 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] CLM CONFIGURATION CHANGE 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] New Configuration: 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] Members Left: 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] Members Joined: 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] CLM CONFIGURATION CHANGE 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] New Configuration: 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] 	r(0) ip(10.0.1.71)  
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] Members Left: 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] Members Joined: 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] 	r(0) ip(10.0.1.71)  
May 10 20:54:35 an-node01 openais[11175]: [SYNC ] This node is within the primary component and will provide service. 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] entering OPERATIONAL state. 
May 10 20:54:35 an-node01 openais[11175]: [CMAN ] quorum regained, resuming activity 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] got nodejoin message 10.0.1.71 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] entering GATHER state from 11. 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Creating commit token because I am the rep. 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Saving state aru a high seq received a 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Storing new sequence id for ring 8 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] entering COMMIT state. 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] entering RECOVERY state. 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] position [0] member 10.0.1.71: 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] previous ring seq 4 rep 10.0.1.71 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] aru a high delivered a received flag 1 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] position [1] member 10.0.1.72: 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] previous ring seq 4 rep 10.0.1.72 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] aru c high delivered c received flag 1 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Did not need to originate any messages in recovery. 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] Sending initial ORF token 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] CLM CONFIGURATION CHANGE 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] New Configuration: 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] 	r(0) ip(10.0.1.71)  
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] Members Left: 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] Members Joined: 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] CLM CONFIGURATION CHANGE 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] New Configuration: 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] 	r(0) ip(10.0.1.71)  
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] 	r(0) ip(10.0.1.72)  
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] Members Left: 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] Members Joined: 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] 	r(0) ip(10.0.1.72)  
May 10 20:54:35 an-node01 openais[11175]: [SYNC ] This node is within the primary component and will provide service. 
May 10 20:54:35 an-node01 openais[11175]: [TOTEM] entering OPERATIONAL state. 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] got nodejoin message 10.0.1.71 
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] got nodejoin message 10.0.1.72 
May 10 20:54:35 an-node01 openais[11175]: [CPG  ] got joinlist message from node 2 
May 10 20:54:36 an-node01 ccsd[11167]: Initial status:: Quorate
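With this much output, it can help to pull out just the lines that matter: which members joined and whether the node reported itself quorate. The following is a rough sketch that scans log text for two message strings seen in the sample above. The sample is a short excerpt of that log, and the function name summarize is invented here. Other versions of cman and openais may word these messages differently.

```python
import re

# Short excerpt of the /var/log/messages output shown above.
SAMPLE_LOG = """\
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] got nodejoin message 10.0.1.71
May 10 20:54:35 an-node01 openais[11175]: [CLM  ] got nodejoin message 10.0.1.72
May 10 20:54:35 an-node01 openais[11175]: [CPG  ] got joinlist message from node 2
May 10 20:54:36 an-node01 ccsd[11167]: Initial status:: Quorate
"""


def summarize(log_text):
    """Return (sorted list of joined member addresses, quorate flag)."""
    joined = set(re.findall(r"got nodejoin message (\S+)", log_text))
    quorate = "Initial status:: Quorate" in log_text
    return sorted(joined), quorate


members, quorate = summarize(SAMPLE_LOG)
print(members, quorate)
```

If only one address shows up, or the quorate line is missing, the other node probably did not join in time.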
Set the Cluster to Start at Boot
Simply use chkconfig to tell cman to start on boot:
chkconfig cman on
You should now see cman at start level 21:
ls -lah /etc/rc3.d/ |grep cman
lrwxrwxrwx  1 root root   14 Mar 16 09:45 S21cman -> ../init.d/cman
Done! You now have your first fully functioning cluster!
DRBD
DRBD will be used to provide a real-time, redundant block device. On top of this, a new LVM PV will be created for a virtual machine that will be able to "float" between the two nodes. This way, should one of the nodes fail, the virtual machine can quickly be brought back up on the surviving node with minimal interruption. When you have planned down time, you will be able to "hot migrate" the virtual machine from one node to the other with nothing more than a short pause while the virtual machine's RAM is frozen and copied over to the other node, a process that usually takes a few seconds to a minute.
Install
The drbd83 and kmod-drbd83-xen packages are not included in the default CentOS installation media, so we will need to install them now:
yum -y install drbd83.x86_64 kmod-drbd83-xen.x86_64
Before we configure DRBD, we will need to create an LVM LV to host it.
Create the LVM Logical Volume
Most of the remaining space on each node's LVM PV will be allocated to a new LV. This new LV will host that node's side of the DRBD resource.
First, you need to see how much space you have left on your LVM PV:
pvscan
  PV /dev/sda2   VG an-lvm01   lvm2 [465.50 GB / 443.97 GB free]
  Total: 1 [465.50 GB] / in use: 1 [465.50 GB] / in no VG: 0 [0   ]
On my nodes, each of which has a single 500GB drive, I've allocated only 20GB to dom0 so I've got over 440GB left free. I like to leave a bit of space unallocated because I never know where I might need it, so I will allocate an even 400GB to DRBD and keep the remaining 44GB set aside for future growth. How much space you have left and how you allocate it is something you must decide based on your own needs.
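The budgeting above is simple arithmetic, sketched here so you can plug in your own numbers. The free-space figure comes from the pvscan output above. The helper name plan is invented for this example.

```python
def plan(free_gb, drbd_gb):
    """Return (reserve in GB, reserve as a percentage of the free space)."""
    if drbd_gb > free_gb:
        raise ValueError("not enough free space on the PV for the DRBD LV")
    reserve = round(free_gb - drbd_gb, 2)
    percent = round(100 * reserve / free_gb, 1)
    return reserve, percent


# 443.97 GB free according to pvscan, 400 GB set aside for DRBD.
reserve_gb, reserve_percent = plan(443.97, 400)
print(reserve_gb, reserve_percent)
```

With these numbers, roughly a tenth of the free space stays unallocated for later use.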
Next, check that the name you will give to the new LV isn't used yet:
lvscan
  ACTIVE            '/dev/an-lvm01/lv01' [19.53 GB] inherit
  ACTIVE            '/dev/an-lvm01/lv00' [2.00 GB] inherit
I can see from the above output that lv00 and lv01 are used, so I will use lv02 for my DRBD partition. Of course, you can use drbd or pretty much anything else you want.
Now that I know I want to create a 400GB logical volume called lv02, I can proceed.
Create the Logical Volume for the DRBD device on each node. The next two commands show what I need to call on my nodes, and will match what you need to run if you used the AN!Cluster Install DVD. If you ran your own install, be sure to edit the following arguments to match your nodes:
On an-node01:
lvcreate -L 400G -n lv02 /dev/an-lvm01
On an-node02:
lvcreate -L 400G -n lv02 /dev/an-lvm02
  Logical volume "lv02" created
If I re-run lvscan now, I will see the new volume:
lvscan
  ACTIVE            '/dev/an-lvm01/lv01' [19.53 GB] inherit
  ACTIVE            '/dev/an-lvm01/lv00' [2.00 GB] inherit
  ACTIVE            '/dev/an-lvm01/lv02' [400.00 GB] inherit
We can now proceed with the DRBD setup!
Create or Edit /etc/drbd.conf
DRBD is controlled from a single /etc/drbd.conf configuration file that must be identical on both nodes. This file tells DRBD what devices to use on each node, what interface to use and so on.
Full details on all the drbd.conf configuration file directives and arguments can be found here.
global {
	usage-count yes;
}
common {
	protocol C;
	
	syncer {
		rate 15M;
	}
}
resource r0 {
	device    /dev/drbd0;
	
	net {
		allow-two-primaries;
		after-sb-0pri discard-zero-changes;
		after-sb-1pri discard-secondary;
		after-sb-2pri disconnect;
	}
	
	startup { 
		become-primary-on both;
	}
	meta-disk	internal;
	
	on an-node01.alteeve.com {
		# Address on the DRBD subnet and the lv02 LV created above.
		address		10.0.0.71:7789;
		disk		/dev/an-lvm01/lv02;
	}
	
	on an-node02.alteeve.com {
		address		10.0.0.72:7789;
		disk		/dev/an-lvm02/lv02;
	}
}
The main things to note are:
- The 'on' argument must match the name returned by the 'uname -n' shell call on that node.
- 'Protocol C' tells DRBD to not tell the OS that a write was complete until both nodes have done so. This affects performance but is required for the later step when we will configure cluster-aware LVM.
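A quick way to catch the most common mistakes (a host name that doesn't match, or replication addresses on different networks) is to pull the 'on' sections out of the file and compare them. The sketch below is a rough, regex-based illustration and not a real drbd.conf parser. The sample text uses the host sections as reported by the drbdadm dump output in this section, the /24 prefix is an assumption, and the function names are invented for this example.

```python
import ipaddress
import re

# Sample host sections using the values reported by "drbdadm dump" in this
# section. Adjust to match your own drbd.conf.
DRBD_CONF = """
resource r0 {
    on an-node01.alteeve.com {
        address   10.0.0.71:7789;
        disk      /dev/an-lvm01/lv02;
    }
    on an-node02.alteeve.com {
        address   10.0.0.72:7789;
        disk      /dev/an-lvm02/lv02;
    }
}
"""


def host_sections(conf_text):
    """Map each 'on <hostname>' section to a dict of its settings."""
    sections = {}
    for host, body in re.findall(r"\bon\s+(\S+)\s*\{([^}]*)\}", conf_text):
        sections[host] = dict(re.findall(r"(\S+)\s+([^;]+);", body))
    return sections


def same_network(sections, prefix=24):
    """True if every replication address is in one network (prefix is assumed)."""
    networks = set()
    for settings in sections.values():
        ip = settings["address"].rsplit(":", 1)[0]
        networks.add(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))
    return len(networks) == 1


sections = host_sections(DRBD_CONF)
print(sorted(sections), same_network(sections))
```

On a real node you could additionally check that platform.node(), which typically reports the same name as 'uname -n', is one of the keys in sections.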
With that file in place on both nodes, run the following command and make sure the output is the contents of the file above in a somewhat altered syntax. If you get an error, address it before proceeding.
drbdadm dump
If it's all good, you should see something like this:
  --==  Thank you for participating in the global usage survey  ==--
The server's response is:
you are the 10464th user to install this version
# /etc/drbd.conf
common {
    protocol               C;
    syncer {
        rate             33M;
    }
}
# resource r0 on an-node01.alteeve.com: not ignored, not stacked
resource r0 {
    on an-node01.alteeve.com {
        device           /dev/drbd0 minor 0;
        disk             /dev/an-lvm01/lv02;
        address          ipv4 10.0.0.71:7789;
        meta-disk        internal;
    }
    on an-node02.alteeve.com {
        device           /dev/drbd0 minor 0;
        disk             /dev/an-lvm02/lv02;
        address          ipv4 10.0.0.72:7789;
        meta-disk        internal;
    }
    net {
        allow-two-primaries;
    }
    startup {
        become-primary-on both;
    }
}
Once you see this, you can proceed.
Setup the DRBD Resource r0
For the rest of this section, pay attention to whether you see
- Primary
- Secondary
- Both
These indicate which node to run the following commands on. There is no functional difference between the nodes, so just randomly choose one to be Primary and the other will be Secondary. Once you've chosen which is which, be consistent with which node you run the commands on. Of course, if a command block is preceded by Both, run the following code block on both nodes.
Both
/etc/init.d/drbd restart
You should see output like this:
Restarting all DRBD resources: Could not stat("/proc/drbd"): No such file or directory
ERROR: Module drbd does not exist in /proc/modules
.
Don't worry about those errors.
You can verify that it started properly by checking the drbd daemon's status and by checking what is in /proc/drbd.
Check the daemon:
/etc/init.d/drbd status
drbd driver loaded OK; device status:
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by mockbuild@v20z-x86-64.home.local, 2009-08-29 14:08:07
m:res  cs         ro                   ds                 p  mounted  fstype
0:r0   Connected  Secondary/Secondary  UpToDate/UpToDate  C
Check the special procfs file:
cat /proc/drbd
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by mockbuild@v20z-x86-64.home.local, 2009-08-29 14:08:07
 0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
If you see the output above, you're good to proceed.
Both:
MADI: Try skipping this command on the next build, it may no longer be needed.
Initialize the device's metadata by running the following command:
drbdadm create-md r0
Device '0' is configured!
Primary:
Start the sync between the two nodes by calling:
drbdadm -- --overwrite-data-of-peer primary r0
Secondary:
At this point, we need to promote the secondary node to 'Primary' position.
drbdadm primary r0
Both:
Make sure that both nodes are now Primary by running:
cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by mockbuild@v20z-x86-64.home.local, 2009-08-29 14:08:07
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----
    ns:524288 nr:0 dw:0 dr:524288 al:0 bm:127 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
The DRBD partition has no file system yet. You should not see the devices syncing at this point.
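If you want to check the state from a script rather than by eye, the status line in /proc/drbd is easy to pick apart. This is a minimal sketch that assumes the DRBD 8.3 line format shown above. The sample text is copied from that output, and the function names are invented for this example.

```python
import re

# Sample /proc/drbd content, copied from the output above.
PROC_DRBD = """\
version: 8.3.2 (api:88/proto:86-90)
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r----
    ns:524288 nr:0 dw:0 dr:524288 al:0 bm:127 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
"""

STATUS_LINE = re.compile(
    r"^\s*(\d+): cs:(\S+) ro:(\S+)/(\S+) ds:(\S+)/(\S+)", re.MULTILINE
)


def resource_states(text):
    """Map each DRBD minor number to its connection, role and disk states."""
    states = {}
    for minor, cs, ro_local, ro_peer, ds_local, ds_peer in STATUS_LINE.findall(text):
        states[int(minor)] = {
            "cs": cs,
            "roles": (ro_local, ro_peer),
            "disks": (ds_local, ds_peer),
        }
    return states


def dual_primary_ready(state):
    """True when connected, both sides Primary and both disks UpToDate."""
    return (
        state["cs"] == "Connected"
        and state["roles"] == ("Primary", "Primary")
        and state["disks"] == ("UpToDate", "UpToDate")
    )


states = resource_states(PROC_DRBD)
print(states[0], dual_primary_ready(states[0]))
```

On a live node you would read the text from /proc/drbd instead of using the sample string.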
LVM
If you used the AN!Cluster kickstart files, or if you based your install on them, then you are already using LVM on the cluster nodes as the underlying system for all but the /boot partition. Each node should have a single VG (an-lvm01 on an-node01, an-lvm02 on an-node02) with three LVs on it.
Now we will "stack" LVM by creating a PV on top of the new DRBD partition, /dev/drbd0, that we created in the previous step. Since this new LVM PV will exist on top of the shared DRBD partition, whatever gets written to its logical volumes will be immediately available on either node, regardless of which node actually initiated the write.
This capability is the underlying reason for creating this cluster; neither machine is truly needed, so if one machine dies, anything on top of the DRBD partition will still be available. When the failed machine returns, the surviving node will have a list of what blocks changed while the other node was gone and can use this list to quickly re-sync the other server.
Making LVM Cluster-Aware
Normally, LVM is run on a single server. This means that at any time, LVM can write data to the underlying drive and not need to worry that any other device might change anything. In clusters, this isn't the case. The other node could try to write to the shared storage, so the nodes need to enable "locking" to prevent the two nodes from trying to work on the same bit of data at the same time.
The process of enabling this locking is known as making LVM "cluster-aware".
Updating '/etc/lvm/lvm.conf'
Note: With EL5.5, this step is only needed when using software RAID, as LVM sees both the /dev/mdX and /dev/drbdX devices as LVM PVs and defaults to using the RAID device, which fails when creating LVs. See here for details.
To hide software RAID devices, we need to change the filter in /etc/lvm/lvm.conf to include a regular expression that matches the name of our DRBD device and rejects everything else. We created our DRBD device as /dev/drbd0, so changing the filter to filter = [ "a|drbd.*|", "a|sd.*|", "r|.*|" ] (accept drbd devices and devices with the sd* "scsi" names, reject everything else) will do this. Edit lvm.conf and change it to match this:
vim /etc/lvm/lvm.conf
    # By default we accept every block device:
    #filter = [ "a/.*/" ]
    filter = [ "a|drbd.*|", "a|sd.*|", "r|.*|" ]
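To see why this filter hides /dev/mdX devices while keeping /dev/drbd0, it helps to walk through how the entries are evaluated: each pattern is tried in order and the first one that matches decides whether the device is accepted (a) or rejected (r). The sketch below emulates that rule in Python as an illustration only. It ignores details of real LVM behaviour such as devices with several path aliases, and the function name accepts is invented here.

```python
import re

# The filter entries from the lvm.conf snippet above.
FILTER = ["a|drbd.*|", "a|sd.*|", "r|.*|"]


def accepts(device, patterns=FILTER):
    """Emulate first-match-wins evaluation of an LVM filter list."""
    for entry in patterns:
        action, delimiter = entry[0], entry[1]
        # The regex sits between the first and last delimiter characters.
        regex = entry[2:entry.rindex(delimiter)]
        if re.search(regex, device):
            return action == "a"
    # LVM accepts a device that matches none of the patterns.
    return True


for device in ("/dev/drbd0", "/dev/sda2", "/dev/md0"):
    print(device, accepts(device))
```

Because the final "r|.*|" entry matches everything, any device not explicitly accepted by an earlier entry is rejected.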
Enabling Cluster Locking
LVM has a built-in tool called lvmconf that can be used to enable LVM locking. Simply run:
lvmconf --enable-cluster
There won't be any output from that command.
By default, clvmd, the cluster lvm daemon, is stopped and not set to run on boot. Now that we've enabled LVM locking, we need to start it:
/etc/init.d/clvmd status
clvmd is stopped
active volumes: lv00 lv01 lv02
As expected, it is stopped, so let's start it and then use chkconfig to enable it at boot.
/etc/init.d/clvmd start
Stopping clvm:                                             [  OK  ]
Starting clvmd:                                            [  OK  ]
Activating VGs:   3 logical volume(s) in volume group "an-lvm01" now active
                                                           [  OK  ]
chkconfig clvmd on
ls -lah /etc/rc3.d/ |grep clvmd
lrwxrwxrwx  1 root root   15 Mar 16 12:48 S24clvmd -> ../init.d/clvmd
We can see that it is now set to start at position 24.
Creating a new PV using the DRBD Partition
We can now proceed with setting up the new DRBD-based LVM physical volume. Once the PV is created, we can create a new volume group and start allocating space to logical volumes.
Note: As we will be using our DRBD device, and as it is a shared block device, most of the following commands only need to be run on one node. Once the block device changes in any way, those changes will near-instantly appear on the other node. For this reason, unless explicitly stated to do so, only run the following commands on one node.
To set up the DRBD partition as an LVM PV, run pvcreate:
pvcreate /dev/drbd0
  Physical volume "/dev/drbd0" successfully created
Now, on both nodes, check that the new physical volume is visible by using pvdisplay:
pvdisplay
  --- Physical volume ---
  PV Name               /dev/sda2
  VG Name               san01
  PV Size               465.51 GB / not usable 14.52 MB
  Allocatable           yes 
  PE Size (KByte)       32768
  Total PE              14896
  Free PE               1407
  Allocated PE          13489
  PV UUID               IpySTY-a9BY-31XE-Bxd4-H9sp-OEJG-kP7dtg
   
  "/dev/drbd0" is a new physical volume of "399.99 GB"
  --- NEW Physical volume ---
  PV Name               /dev/drbd0
  VG Name               
  PV Size               399.99 GB
  Allocatable           NO
  PE Size (KByte)       0
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               S6OkVh-NlwQ-BaUn-k5LI-1iTo-pu8V-Uq3qE2
If you see PV Name /dev/drbd0 on both nodes, then your DRBD setup and LVM configuration changes are working perfectly!
Creating a VG on the new PV
Now we need to create the volume group using the vgcreate command:
vgcreate -c y drbd_vg0 /dev/drbd0
  Clustered volume group "drbd_vg0" successfully created
Now we'll check that the new VG is visible on both nodes using vgdisplay:
vgdisplay
  --- Volume group ---
  VG Name               an-lvm01
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  4
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                3
  Open LV               3
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               465.50 GB
  PE Size               32.00 MB
  Total PE              14896
  Alloc PE / Size       13489 / 421.53 GB
  Free  PE / Size       1407 / 43.97 GB
  VG UUID               C0kHFA-OTo8-Gshr-3wIw-3Q0I-eT3X-A9Y0NA
   
  --- Volume group ---
  VG Name               drbd_vg0
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  2
  VG Access             read/write
  VG Status             resizable
  Clustered             yes
  Shared                no
  MAX LV                0
  Cur LV                1
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               399.98 GB
  PE Size               4.00 MB
  Total PE              102396
  Alloc PE / Size       5120 / 20.00 GB
  Free  PE / Size       97276 / 379.98 GB
  VG UUID               TmlQmv-eViK-7Ubr-Dyck-0u86-uEWJ-rDOt9i
If the new VG is visible on both nodes, we are ready to create our first logical volume using the lvcreate tool.
Creating the First Two LVs on the new VG
Now we'll create two simple 20 GiB logical volumes. The first one will be a shared GFS2 store for source ISOs and the second will be used for our first virtual machine.
lvcreate -L 20G -n iso_store drbd_vg0
lvcreate -L 20G -n vm01 drbd_vg0
  Logical volume "iso_store" created
  Logical volume "vm01" created
As before, we will check that the new logical volumes are visible from both nodes by using the lvdisplay command:
lvdisplay
  --- Logical volume ---
  LV Name                /dev/an-lvm02/lv01
  VG Name                an-lvm02
  LV UUID                Dy2MNa-EUxN-9x6f-ovkj-NCpk-nlV2-kr5QBb
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                19.53 GB
  Current LE             625
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0
   
  --- Logical volume ---
  LV Name                /dev/an-lvm02/lv00
  VG Name                an-lvm02
  LV UUID                xkBu7j-wtOe-ORr3-68qJ-u0ux-Qif4-stw5SY
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                2.00 GB
  Current LE             64
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:1
   
  --- Logical volume ---
  LV Name                /dev/an-lvm02/lv02
  VG Name                an-lvm02
  LV UUID                R20GH1-wQKq-WgUR-x1gx-Yzzp-WjND-WHAjEO
  LV Write Access        read/write
  LV Status              available
  # open                 2
  LV Size                400.00 GB
  Current LE             12800
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2
   
  --- Logical volume ---
  LV Name                /dev/drbd_vg0/iso_store
  VG Name                drbd_vg0
  LV UUID                svJx35-KDXK-ojD2-UDAA-Ah9t-UgUl-ijekhf
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                20.00 GB
  Current LE             5120
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:3
   
  --- Logical volume ---
  LV Name                /dev/drbd_vg0/vm01
  VG Name                drbd_vg0
  LV UUID                sceLmK-ZJIp-fN5g-RMaS-j5sq-NuY5-7hIwhP
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                20.00 GB
  Current LE             5120
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:4
The last two are the new logical volumes.
GFS2 is a cluster-aware file system that can be mounted on two or more nodes at the same time. We will use it as a place to store ISOs that we'll use to provision our virtual machines.
The following example is designed for the cluster used in this paper.
- If you have more than 2 nodes, increase the -j 2 to the number of nodes you want to mount this file system on.
- If your cluster is named something other than an-cluster (as set in the cluster.conf file), change -t an-cluster:iso_store to match your cluster's name. The iso_store can be whatever you like, but it must be unique in the cluster. I tend to use a name that matches the LV name, but this is my own preference and is not required.
To format the partition run:
mkfs.gfs2 -p lock_dlm -j 2 -t an-cluster:iso_store /dev/drbd_vg0/iso_store
If you are prompted, press y to proceed.
Once the format completes, you can mount /dev/drbd_vg0/iso_store as you would a normal file system.
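As a minimal sketch of how the pieces of that command fit together, the helper below only prints the mkfs.gfs2 call for a given cluster name, file system name, journal count and device. The function name gfs2_mkfs_cmd is made up for this illustration; the flags are the ones used above.

```shell
#!/bin/sh
# Sketch: assemble (and only print) the mkfs.gfs2 call discussed above.
#   $1 = cluster name, must match the name set in cluster.conf
#   $2 = file system name, must be unique within the cluster
#   $3 = number of journals, one per node that will mount the file system
#   $4 = device to format
gfs2_mkfs_cmd() {
    cluster="$1"; fsname="$2"; journals="$3"; device="$4"
    printf 'mkfs.gfs2 -p lock_dlm -j %s -t %s:%s %s\n' \
        "$journals" "$cluster" "$fsname" "$device"
}

# Print the command for this paper's two-node cluster.
gfs2_mkfs_cmd an-cluster iso_store 2 /dev/drbd_vg0/iso_store
```

Printing the command rather than running it lets you review the lock table name before anything is formatted.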
Both:
To complete the example, let's mount the GFS2 file system we just made on /shared. Run these commands on both nodes.
mkdir /shared
mount /dev/drbd_vg0/iso_store /shared
Done!
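If you want to confirm the mount on each node, a check along these lines may help. It is only a sketch: the function name is_gfs2_mounted is invented here, and it simply looks for the mount point with type gfs2 in /proc/mounts.

```shell
#!/bin/sh
# Return 0 if the given mount point is listed with file system type gfs2.
# The second argument defaults to /proc/mounts; it is a parameter only so
# the function can also be tried against a saved copy of that table.
is_gfs2_mounted() {
    mountpoint="$1"
    table="${2:-/proc/mounts}"
    awk -v mp="$mountpoint" \
        '$2 == mp && $3 == "gfs2" { found = 1 } END { exit !found }' "$table"
}

if is_gfs2_mounted /shared; then
    echo "/shared is mounted as gfs2"
else
    echo "/shared is NOT mounted as gfs2"
fi
```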
Growing a GFS2 Partition
To grow a GFS2 partition, you must know where it is mounted. You cannot grow an unmounted GFS2 partition, as odd as that may seem at first. Also, you only need to run the grow commands from one node. Once completed, all nodes will see and use the new free space automatically.
This requires two steps to complete:
- Extend the underlying LVM logical volume
- Grow the actual GFS2 partition
Extend the LVM LV
To keep things simple, we'll just use some of the free space we left on our /dev/drbd0 LVM physical volume. If you need to add more storage to your LVM first, please follow the instructions in the article: "Adding Space to an LVM" before proceeding.
Let's add 50GB to our GFS2 logical volume /dev/drbd_vg0/iso_store from the /dev/drbd0 physical volume. We know the space is available because we left more than that free when we first set up our LVM. To actually add the space, we use the lvextend command:
lvextend -L +50G /dev/drbd_vg0/iso_store /dev/drbd0
Which should return:
  Extending logical volume iso_store to 70.00 GB
  Logical volume iso_store successfully resized
If we run lvdisplay /dev/drbd_vg0/iso_store now, we should see the extra space.
  --- Logical volume ---
  LV Name                /dev/drbd_vg0/iso_store
  VG Name                drbd_vg0
  LV UUID                svJx35-KDXK-ojD2-UDAA-Ah9t-UgUl-ijekhf
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                70.00 GB
  Current LE             17920
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:3
You're now ready to proceed.
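The "Current LE" value can be sanity-checked with a little arithmetic. The sketch below assumes a 4MB extent size, which is inferred from the earlier output (20.00 GB across 5120 extents); the function name expected_extents is just for illustration.

```shell
#!/bin/sh
# Expected number of logical extents for an LV of $1 gigabytes,
# given an extent size of $2 megabytes.
expected_extents() {
    size_gb="$1"; pe_mb="$2"
    echo $(( size_gb * 1024 / pe_mb ))
}

expected_extents 20 4   # before the extend: 5120
expected_extents 70 4   # after adding 50GB: 17920
```

Both results match the "Current LE" lines shown by lvdisplay before and after the extend.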
Grow The GFS2 Partition
This step is pretty simple, but you need to enter the commands exactly. Also, you'll want to do a dry-run first and address any resulting errors before issuing the final gfs2_grow command.
To get the exact name to use when calling gfs2_grow, run the following command:
gfs2_tool df
/shared:
  SB lock proto = "lock_dlm"
  SB lock table = "an-cluster:iso_store"
  SB ondisk format = 1801
  SB multihost format = 1900
  Block size = 4096
  Journals = 2
  Resource Groups = 80
  Mounted lock proto = "lock_dlm"
  Mounted lock table = "an-cluster:iso_store"
  Mounted host data = "jid=1:id=196610:first=0"
  Journal number = 1
  Lock module flags = 0
  Local flocks = FALSE
  Local caching = FALSE
  Type           Total Blocks   Used Blocks    Free Blocks    use%           
  ------------------------------------------------------------------------
  data           5242304        1773818        3468486        34%
  inodes         3468580        94             3468486        0%
From this output, we know that GFS2 expects the name "/shared". Even adding something as simple as a trailing slash will not work. The program we will use is called gfs2_grow with the -T switch to run the command as a test to work out possible errors.
For example, if you added the trailing slash, this is the kind of error you would see:
Bad command:
gfs2_grow -T /shared/
GFS Filesystem /shared/ not found
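If the mount point comes from a variable or from tab completion, a small guard like the following can strip any trailing slash before it reaches gfs2_grow. This is a sketch only, and normalize_mountpoint is a made-up helper name.

```shell
#!/bin/sh
# Strip trailing slashes so "/shared/" becomes "/shared".
# The root directory "/" is left alone.
normalize_mountpoint() {
    mp="$1"
    while [ "$mp" != "/" ] && [ "${mp%/}" != "$mp" ]; do
        mp="${mp%/}"
    done
    printf '%s\n' "$mp"
}

normalize_mountpoint /shared/   # prints /shared
```

The result can then be passed along, for example as gfs2_grow -T "$(normalize_mountpoint /shared/)".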
Once we get it right, it will look like this:
gfs2_grow -T /shared
(Test mode--File system will not be changed)
FS: Mount Point: /shared
FS: Device:      /dev/mapper/drbd_vg0-iso_store
FS: Size:        5242878 (0x4ffffe)
FS: RG size:     65535 (0xffff)
DEV: Size:       18350080 (0x1180000)
The file system grew by 51200MB.
gfs2_grow complete.
This looks good! We're now ready to re-run the command without the -T switch:
gfs2_grow /shared
FS: Mount Point: /shared
FS: Device:      /dev/mapper/drbd_vg0-iso_store
FS: Size:        5242878 (0x4ffffe)
FS: RG size:     65535 (0xffff)
DEV: Size:       18350080 (0x1180000)
The file system grew by 51200MB.
gfs2_grow complete.
You can check that the new space is available on both nodes now using a simple call like df -h.
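The reported growth can be cross-checked against the sizes in the gfs2_grow output. The sketch below reads "FS: Size" and "DEV: Size" as counts of 4096-byte blocks, an interpretation that is consistent with the 70.00 GB logical volume; gfs2_growth_mb is a hypothetical helper name.

```shell
#!/bin/sh
# Growth in MB = (device blocks - file system blocks) / blocks per MB.
#   $1 = "FS: Size" in blocks, $2 = "DEV: Size" in blocks,
#   $3 = block size in bytes (assumed to divide 1MB evenly)
gfs2_growth_mb() {
    fs_blocks="$1"; dev_blocks="$2"; block_size="$3"
    blocks_per_mb=$(( 1048576 / block_size ))
    echo $(( (dev_blocks - fs_blocks) / blocks_per_mb ))
}

gfs2_growth_mb 5242878 18350080 4096   # prints 51200
```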
Creating Our Floating VM
Convirt
yum -y install pygtk2 vte vnc tunctl dnsmasq bridge-utils
cd /etc/yum.repos.d
wget --no-cache http://www.convirture.com/repos/definitions/rhel/5.x/convirt.repo
yum -y install convirt
/usr/share/convirt/install/managed_server/scripts/convirt-tool setup
After running 'convirt-tool setup', edit '/etc/xen/xend-config.sxp' and comment out the 'convirt-xen-multibridge' entry that it added.
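If you would rather not edit the file by hand, something like this sed call may do the job. It is a sketch under assumptions: the exact form of the entry that convirt-tool writes is not shown here, so the pattern simply matches any uncommented line that mentions convirt-xen-multibridge, and the function name is invented.

```shell
#!/bin/sh
# Comment out any active line that mentions convirt-xen-multibridge.
# A backup of the original file is kept with a .bak suffix.
# Lines that are already commented are skipped, so re-running is harmless.
comment_out_multibridge() {
    conf="$1"
    sed -i.bak '/^[^#]*convirt-xen-multibridge/ s/^/#/' "$conf"
}

# Example call, left commented so nothing is changed by accident:
# comment_out_multibridge /etc/xen/xend-config.sxp
```

Check the result with grep before restarting xend.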
Start 'convirt':
convirt &
Remove the "QA Lab" and "Desktop" groups.
Right-click on 'Servers' and choose 'Add Server'. For each node, enter its hostname (ie: an-node01 and an-node02) and that machine's root password. Leave the 'Xen Protocol' as 'XML-RPC'.
Pacemaker
In short, Pacemaker is a cluster resource manager.
Pacemaker runs on top of OpenAIS and handles clustered resources. For example, it can be used to move around a shared IP address, bring up services on the surviving node after a failure and restore services when a cluster node rejoins.
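To make the shared IP example a little more concrete, here is a sketch that prints (but does not run) a crm shell command defining a floating IP resource. It assumes the crm shell and the ocf:heartbeat:IPaddr2 resource agent are available in your Pacemaker install; the resource name ClusterIP and the address are placeholders, not values from this cluster.

```shell
#!/bin/sh
# Print a crm shell command that would define a floating IP resource.
#   $1 = resource name (placeholder), $2 = IP address (placeholder),
#   $3 = netmask in CIDR notation
cluster_ip_cmd() {
    name="$1"; ip="$2"; netmask="$3"
    printf 'crm configure primitive %s ocf:heartbeat:IPaddr2 params ip=%s cidr_netmask=%s op monitor interval=30s\n' \
        "$name" "$ip" "$netmask"
}

cluster_ip_cmd ClusterIP 192.168.1.100 24
```

Nothing is changed on the cluster; the printed line is only meant to show the general shape of a resource definition.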
Installing Pacemaker
First, add the EPEL and ClusterLabs repositories to your system.
Download and install EPEL:
cd /etc/yum.repos.d/
rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/x86_64/epel-release-5-3.noarch.rpm
wget -O /etc/yum.repos.d/pacemaker.repo http://clusterlabs.org/rpm/epel-5/clusterlabs.repo
yum install pacemaker heartbeat libibverbs librdmacm
ToDo
This is a list of "ToDo"s for this paper.
- Update and integrate Adding Space to an LVM.
- Once done, point to Creating a Custom CentOS-derived Distribution.
- Add a section on setting up shared ssh keys.
- Bridging in Fedora Core 13
- Setting Up a PXE Server in Fedora
Random
This is a sandbox for notes to be integrated later.
Provision VM
virt-install -n rhel6-01 -r 1024 --vcpus=1 --cpuset=1 --os-type=linux --os-variant=rhel6 -c /dev/sr0 --hvm --virt-type=kvm --disk /dev/vg_an-node02/lv_rhel6-01 --network bridge=br0 --vnc
Thanks
- To HJ Lee from the OpenAIS list for helping me understand the mechanisms controlling the Redundant Ring Protocol's failure detection types.
- To Steven Dake for clarifying the to_x vs. logoutput: x arguments in openais.conf.
- To Fabio Massimo Di Nitto for helping me get caught up with clustering and VMs on FC13.
© 2025 Alteeve. Intelligent Availability® is a registered trademark of Alteeve's Niche! Inc. 1997-2025
legal stuff: All info is provided "As-Is". Do not use anything here unless you are willing and able to take responsibility for your own actions.