Two Node Fedora 13 Cluster - Xen-Based Virtual Machine Host on DRBD+CLVM

= Building the DRBD Array =
Building the DRBD array requires a few steps. First, raw space on either node must be prepared. Next, DRBD must be told that it is to create a resource using this newly configured raw space. Finally, the new array must be initialized.
== A Map of the Cluster's Storage ==
The layout of the storage in the cluster can quickly become difficult to follow. Below is an [[ASCII]] drawing which should help you see how DRBD will tie in to the rest of the cluster's storage. This map assumes a simple [[TLUG_Talk:_Storage_Technologies_and_Theory#Level_1|RAID level 1]] array underlying each node. If your node has a single hard drive, simply collapse the first two layers into one. Similarly, if your underlying storage is a more complex RAID array, simply expand the number of physical devices at the top level.
<source lang="text">
               Node1                                Node2
           _____   _____                        _____   _____
          | sda | | sdb |                      | sda | | sdb |
          |_____| |_____|                      |_____| |_____|
             |_______|                            |_______|
     _______ ____|___ _______             _______ ____|___ _______
  __|__   __|__    __|__   __|__       __|__   __|__    __|__   __|__
 | md0 | | md1 |  | md2 | | md3 |     | md3 | | md2 |  | md1 | | md0 |
 |_____| |_____|  |_____| |_____|     |_____| |_____|  |_____| |_____|
    |       |        |       |           |       |        |       |
 ___|___   _|_   ____|____   |___________|   ____|____   _|_   ___|___
| /boot | | / | | <swap>  |        |        | <swap>  | | / | | /boot |
|_______| |___| |_________|  ______|______  |_________| |___| |_______|
                            | /dev/drbd0  |
                            |_____________|
                                   |
                               ____|______
                              | clvm PV   |
                              |___________|
                                   |
                              _____|_____
                             | drbd_vg0  |
                             |___________|
                                   |
                              _____|_____ ___...____
                             |           |          |
                          ___|___     ___|___    ___|___
                         | lv_X  |   | lv_Y  |  | lv_N  |
                         |_______|   |_______|  |_______|
</source>
== Install The DRBD Tools ==
DRBD has two components: the userspace applications and tools, and the kernel module. The tools provided directly by Fedora are sufficient for our use. You will need to install the following packages:
<source lang="bash">
yum install drbd.x86_64 drbd-xen.x86_64 drbd-utils.x86_64
</source>
== Install The DRBD Kernel Module ==
The kernel module '''must''' match the [[dom0]] kernel that is running. If you update the kernel and neglect to update the DRBD kernel module, the DRBD array '''will not start'''.
To help simplify things, links to pre-compiled DRBD kernel modules are provided. If the pre-compiled modules do not match the dom0 kernel you have installed, instructions on rebuilding the DRBD kernel module from the source RPM are provided as well.
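Before choosing either route, it is worth confirming which dom0 kernel is actually running and whether a matching DRBD module package is already installed. A minimal check (the exact package names on your system may differ):
<source lang="bash">
# The DRBD kernel module must be built against this exact kernel version.
uname -r
# List any DRBD kernel module packages that are already installed, if any.
rpm -qa | grep -i drbd-km
</source>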
=== Install Pre-Compiled DRBD Kernel Module RPMs ===
These are the two RPMs you will need to install. Note that these RPMs are compiled against myoung's <span class="code">2.6.32.21_167</span> kernel.
* [https://alteeve.com/files/an-cluster/drbd-km-2.6.32.21_167.xendom0.fc12.x86_64-8.3.7-12.fc13.x86_64.rpm drbd-km-2.6.32.21_167.xendom0.fc12.x86_64-8.3.7-12.fc13.x86_64.rpm] - DRBD kernel module for myoung's 2.6.32.21_167 dom0 kernel (897 KiB)
* [https://alteeve.com/files/an-cluster/drbd-km-debuginfo-8.3.7-12.fc13.x86_64.rpm drbd-km-debuginfo-8.3.7-12.fc13.x86_64.rpm] - Debug info for DRBD kernel module for myoung's 2.6.32.21_167 dom0 kernel (3.2 KiB)
You can install the two above RPMs with this command:
<source lang="bash">
rpm -ivh https://alteeve.com/files/an-cluster/drbd-km-2.6.32.21_167.xendom0.fc12.x86_64-8.3.7-12.fc13.x86_64.rpm https://alteeve.com/files/an-cluster/drbd-km-debuginfo-8.3.7-12.fc13.x86_64.rpm
</source>
=== Building DRBD Kernel Module RPMs From Source ===
If the above RPMs don't work, or if the dom0 kernel you are using differs in any way, please follow the steps here to create a DRBD kernel module matched to your running dom0.
First, install the build environment.
<source lang="bash">
yum -y groupinstall "Development Libraries"
yum -y groupinstall "Development Tools"
</source>
Install the kernel headers and development library for the dom0 kernel:
'''Note''': The following commands use <span class="code">--force</span> to get past the fact that the headers for the <span class="code">2.6.33</span> kernel are already installed, which makes RPM think that these older headers would conflict. Please proceed with caution.
<source lang="bash">
rpm -ivh --force http://fedorapeople.org/~myoung/dom0/x86_64/kernel-headers-2.6.32.21-167.xendom0.fc12.x86_64.rpm http://fedorapeople.org/~myoung/dom0/x86_64/kernel-devel-2.6.32.21-167.xendom0.fc12.x86_64.rpm
</source>
Download, prepare, build and install the source RPM:
<source lang="bash">
rpm -ivh http://fedora.mirror.iweb.ca/releases/13/Everything/source/SRPMS/drbd-8.3.7-2.fc13.src.rpm
cd /root/rpmbuild/SPECS/
rpmbuild -bp drbd.spec
cd /root/rpmbuild/BUILD/drbd-8.3.7/
./configure --enable-spec --with-km
cp /root/rpmbuild/BUILD/drbd-8.3.7/drbd-km.spec /root/rpmbuild/SPECS/
cd /root/rpmbuild/SPECS/
rpmbuild -ba drbd-km.spec
cd /root/rpmbuild/RPMS/x86_64
rpm -Uvh drbd-km-*
</source>
You should be good to go now!
== Allocating Raw Space For DRBD On Each Node ==
If you followed the setup steps provided in "[[Two Node Fedora 13 Cluster]]", you will have a set amount of unconfigured hard drive space. This is what we will use for the DRBD space on either node. If you've got a different setup, you will need to allocate some raw space before proceeding.
=== Creating a RAID level 1 'md' Device ===
This assumes that you have two raw drives, <span class="code">/dev/sda</span> and <span class="code">/dev/sdb</span>. It further assumes that you've created three partitions which have been assigned to three existing <span class="code">/dev/mdX</span> devices. With these assumptions, we will create <span class="code">/dev/sda4</span> and <span class="code">/dev/sdb4</span> and, using them, create a new <span class="code">/dev/md3</span> device that will host the DRBD partition.
If you do not have two drives, you can stop after creating a new partition. If you have multiple drives and plan to use a different [[TLUG_Talk:_Storage_Technologies_and_Theory#RAID_Levels|RAID level]], please adjust the following commands accordingly.
==== Creating The New Partitions ====
'''Warning''': The next steps will have you directly accessing your server's hard drive configuration. Please do not proceed on a live server until you've had a chance to work through these steps on a test server. One mistake can blow away '''''all your data'''''.
Start the <span class="code">fdisk</span> shell
<source lang="bash">
fdisk /dev/sda
</source>
<source lang="text">
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
        switch off the mode (command 'c') and change display units to
        sectors (command 'u').
Command (m for help):
</source>
View the current configuration with the <span class="code">p</span>rint option
<source lang="bash">
p
</source>
<source lang="text">
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000c6fe1
  Device Boot      Start        End      Blocks  Id  System
/dev/sda1              1        5100    40960000  fd  Linux raid autodetect
/dev/sda2            5100        5622    4194304  fd  Linux raid autodetect
/dev/sda3  *        5622        5654      256000  fd  Linux raid autodetect
Command (m for help):
</source>
Now we know for sure that the next free partition number is <span class="code">4</span>. We will now create the <span class="code">n</span>ew partition.
<source lang="bash">
n
</source>
<source lang="text">
Command action
  e  extended
  p  primary partition (1-4)
</source>
We will make it a <span class="code">p</span>rimary partition
<source lang="bash">
p
</source>
<source lang="text">
Selected partition 4
First cylinder (5654-60801, default 5654):
</source>
Then we simply hit <span class="code"><enter></span> to select the default starting block.
<source lang="bash">
<enter>
</source>
<source lang="text">
Using default value 5654
Last cylinder, +cylinders or +size{K,M,G} (5654-60801, default 60801):
</source>
Once again we will press <span class="code"><enter></span> to select the default ending block.
<source lang="bash">
<enter>
</source>
<source lang="text">
Using default value 60801
Command (m for help):
</source>
Now we need to change the <span class="code">t</span>ype of partition that it is.
<source lang="bash">
t
</source>
<source lang="text">
Partition number (1-4):
</source>
We know that we are modifying partition number <span class="code">4</span>.
<source lang="bash">
4
</source>
<source lang="text">
Hex code (type L to list codes):
</source>
Now we need to set the [[hex]] code for the [[Filesystem_List#List_of_Linux_Partition_Types|partition type]]. We want <span class="code">fd</span>, which defines <span class="code">Linux raid autodetect</span>.
<source lang="bash">
fd
</source>
<source lang="text">
Changed system type of partition 4 to fd (Linux raid autodetect)
</source>
Now check that everything went as expected by once again <span class="code">p</span>rinting the partition table.
<source lang="bash">
p
</source>
<source lang="text">
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000c6fe1
  Device Boot      Start        End      Blocks  Id  System
/dev/sda1              1        5100    40960000  fd  Linux raid autodetect
/dev/sda2            5100        5622    4194304  fd  Linux raid autodetect
/dev/sda3  *        5622        5654      256000  fd  Linux raid autodetect
/dev/sda4            5654      60801  442972704+  fd  Linux raid autodetect
Command (m for help):
</source>
There it is. So finally, we need to <span class="code">w</span>rite the changes to the disk.
<source lang="bash">
w
</source>
<source lang="text">
The partition table has been altered!
Calling ioctl() to re-read partition table.
WARNING: Re-reading the partition table failed with error 16: Device or resource busy.
The kernel still uses the old table. The new table will be used at
the next reboot or after you run partprobe(8) or kpartx(8)
Syncing disks.
</source>
If you see the above message, do '''not''' reboot until both drives have been set up; you might as well reboot only once.
Repeat these steps for the second drive, <span class="code">/dev/sdb</span>, and then reboot if needed.
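If you would rather not reboot right away, the <span class="code">partprobe</span> tool mentioned in the warning above can usually convince the kernel to re-read the partition tables. This is a minimal sketch and assumes both drives have already been repartitioned:
<source lang="bash">
# Ask the kernel to re-read the partition tables on both drives.
partprobe /dev/sda /dev/sdb
# Confirm that the new fourth partitions are now visible.
grep -E 'sd[ab]4' /proc/partitions
</source>
If the kernel still reports the old table, fall back to the reboot described above.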
==== Creating The New /dev/mdX Device ====
''If you only have one drive, skip this step.''
Now we need to use <span class="code">mdadm</span> to create the new [[TLUG_Talk:_Storage_Technologies_and_Theory#Level_1|RAID level 1]] device. This will be used as the device that DRBD will directly access.
<source lang="bash">
mdadm --create /dev/md3 --homehost=localhost.localdomain --raid-devices=2 --level=1 /dev/sda4 /dev/sdb4
</source>
<source lang="text">
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
</source>
Seeing as <span class="code">/boot</span> doesn't exist on this device, we can safely ignore this warning and answer <span class="code">y</span> to the <span class="code">Continue creating array?</span> prompt.
<source lang="bash">
y
</source>
<source lang="text">
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md3 started.
</source>
You can now <span class="code">cat /proc/mdstat</span> to verify that the array is indeed building. If you're interested, you could open a new terminal window and use <span class="code">watch cat /proc/mdstat</span> to watch the array build.
<source lang="bash">
cat /proc/mdstat
</source>
<source lang="text">
md3 : active raid1 sdb4[1] sda4[0]
      442971544 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.8% (3678976/442971544) finish=111.0min speed=65920K/sec
     
md2 : active raid1 sda2[0] sdb2[1]
      4193272 blocks super 1.1 [2/2] [UU]
     
md1 : active raid1 sda1[0] sdb1[1]
      40958908 blocks super 1.1 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk
md0 : active raid1 sda3[0] sdb3[1]
      255988 blocks super 1.0 [2/2] [UU]
     
unused devices: <none>
</source>
Finally, we need to make sure that the new array will start when the system boots. To do this, we'll again use <span class="code">mdadm</span>, but with different options that will have it output data in a format suitable for the <span class="code">/etc/mdadm.conf</span> file. We'll redirect this output to that config file, thus updating it.
<source lang="bash">
mdadm --detail --scan | grep md3 >> /etc/mdadm.conf
cat /etc/mdadm.conf
</source>
<source lang="text">
# mdadm.conf written out by anaconda
MAILADDR root
AUTO +imsm +1.x -all
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=b58df6d0:d925e7bb:c156168d:47c01718
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=ac2cf39c:77cd0314:fedb8407:9b945bb5
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=4e513936:4a966f4e:0dd8402e:6403d10d
ARRAY /dev/md3 metadata=1.2 name=localhost.localdomain:3 UUID=f0b6d0c1:490d47e7:91c7e63a:f8dacc21
</source>
You'll note that the last line, which we just added, is different from the previous lines. This isn't a concern, but you are welcome to re-write it to match the existing format if you wish.
Before you proceed, it is strongly advised that you reboot each node and then verify that the new array did in fact start with the system. You ''do not'' need to wait for the sync to finish before rebooting; it will pick up where it left off after the reboot.
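After the reboot, one way to confirm that the new array was assembled at boot is to query it directly; for example:
<source lang="bash">
# Show the state, members and sync progress of the new array.
mdadm --detail /dev/md3
# It should also be listed in the kernel's md status.
grep md3 /proc/mdstat
</source>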
== DRBD Configuration Files ==
DRBD uses a global configuration file, <span class="code">/etc/drbd.d/global_common.conf</span>, and one or more resource files. The resource files need to be created in the <span class="code">/etc/drbd.d/</span> directory and must have the suffix <span class="code">.res</span>. For this example, we will create a single resource called <span class="code">r0</span> which we will configure in <span class="code">/etc/drbd.d/r0.res</span>.
=== /etc/drbd.d/global_common.conf ===
The stock <span class="code">/etc/drbd.d/global_common.conf</span> is sane, so we won't bother altering it here.
Full details on all the <span class="code">drbd.conf</span> configuration file directives and arguments can be found [http://www.drbd.org/users-guide/re-drbdconf.html here]. '''Note''': That link doesn't show the newer split configuration format; please see [http://www.novell.com/documentation/sle_ha/book_sleha/?page=/documentation/sle_ha/book_sleha/data/sec_ha_drbd_configure.html Novell's documentation] for an example of it.
=== /etc/drbd.d/r0.res ===
This is the important part. This defines the resource to use, and must reflect the IP addresses and storage devices that DRBD will use for this resource.
<source lang="bash">
vim /etc/drbd.d/r0.res
</source>
<source lang="bash">
# This is the name of the resource and its settings. Generally, 'r0' is used
# as the name of the first resource. This is by convention only, though.
resource r0
{
        # This tells DRBD where to make the new resource available at on each
        # node. This is, again, by convention only.
        device    /dev/drbd0;
        # The main argument here tells DRBD that we will have proper locking
        # and fencing, and as such, to allow both nodes to set the resource to
        # 'primary' simultaneously.
        net
        {
                allow-two-primaries;
        }
        # This tells DRBD to automatically set both nodes to 'primary' when the
        # nodes start.
        startup
        {
                become-primary-on both;
        }
        # This tells DRBD to look for and store its meta-data on the resource
        # itself.
        meta-disk      internal;
        # The name below must match the output from `uname -n` on each node.
        on an-node01.alteeve.com
        {
                # This must be the IP address of the interface on the storage
                # network (an-node01.sn, in this case).
                address        10.0.0.71:7789;
                # This is the underlying partition to use for this resource on
                # this node.
                disk            /dev/md3;
        }
        # Repeat as above, but for the other node.
        on an-node02.alteeve.com
        {
                address        10.0.0.72:7789;
                disk            /dev/md3;
        }
}
</source>
This file must be copied to '''BOTH''' nodes and must match before you proceed.
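One simple way to get an identical copy onto the second node is to push it over SSH. This is just a sketch; it assumes you wrote the file on <span class="code">an-node01</span> and have root SSH access to <span class="code">an-node02</span>:
<source lang="bash">
# Push the resource definition to the other node.
scp /etc/drbd.d/r0.res root@an-node02.alteeve.com:/etc/drbd.d/
# On each node, confirm that DRBD parses the configuration cleanly.
drbdadm dump r0
</source>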
== Starting The DRBD Resource ==
For the rest of this section, pay attention to whether you see:
* '''Node1'''
* '''Node2'''
* '''Both'''
These indicate which node to run the following commands on. There is no functional difference between the two nodes, so just pick one to be '''Node1''' and the other will be '''Node2'''. Once you've chosen which is which, be consistent about which node you run the commands on. Of course, if a command block is preceded by '''Both''', run it on both nodes.
=== Initialize The Block Device ===
'''Node1'''
This step creates the DRBD meta-data on the new DRBD device. It is only needed when creating new DRBD partitions.
<source lang="bash">
drbdadm create-md r0
</source>
<source lang="text">
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success
</source>
=== Monitoring Progress ===
'''Both'''
I find it very useful to monitor DRBD while running the rest of the setup. To do this, open a second terminal on each node and use <span class="code">watch</span> to keep an eye on <span class="code">/proc/drbd</span>. This way you will be able to monitor the progress of the array in near-real time.
'''Both'''
<source lang="bash">
watch cat /proc/drbd
</source>
At this stage, it should look like this:
<source lang="text">
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@xenmaster002.iplink.net, 2010-09-07 16:02:46
0: cs:Unconfigured
</source>
=== Starting the Resource ===
'''Both'''
This will attach the backing device, <span class="code">/dev/md3</span> in our case, and then start the new resource <span class="code">r0</span>.
<source lang="bash">
drbdadm up r0
</source>
There will be no output at the command line. If you are <span class="code">watch</span>ing <span class="code">/proc/drbd</span> though, you should now see something like this:
<source lang="text">
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@xenmaster002.iplink.net, 2010-09-07 16:02:46
0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:442957988
</source>
That it is <span class="code">Secondary/Secondary</span> and <span class="code">Inconsistent/Inconsistent</span> is expected.
=== Setting the First Primary Node ===
'''Node1'''
As this is a totally new resource, DRBD doesn't know which side of the array is "more valid" than the other. In reality, neither is, as there is no existing data of note on either node. This means that we now need to choose a node and tell DRBD to treat it as the "source" node. This step will also tell DRBD to make the "source" node <span class="code">primary</span>. Once set, DRBD will begin syncing in the background.
<source lang="bash">
drbdadm -- --overwrite-data-of-peer primary r0
</source>
As before, there will be no output at the command line, but <span class="code">/proc/drbd</span> will change to show the following:
<source lang="text">
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@xenmaster002.iplink.net, 2010-09-07 16:02:46
0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
    ns:69024 nr:0 dw:0 dr:69232 al:0 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:442888964
        [>....................] sync'ed:  0.1% (432508/432576)M
        finish: 307:33:42 speed: 320 (320) K/sec
</source>
If you're watching the secondary node, the <span class="code">/proc/drbd</span> will show <span class="code">ro:Secondary/Primary ds:Inconsistent/UpToDate</span>. This is, as you can guess, simply a reflection of it being the "over-written" node.
=== Setting the Second Node to Primary ===
'''Node2'''
The last step to complete the array is to tell the second node to also become <span class="code">primary</span>.
<source lang="bash">
drbdadm primary r0
</source>
As with many <span class="code">drbdadm</span> commands, nothing will be printed to the console. If you're watching the <span class="code">/proc/drbd</span> though, you should see something like <span class="code">Primary/Primary ds:UpToDate/Inconsistent</span>. The <span class="code">Inconsistent</span> flag will remain until the sync is complete.
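If you prefer a one-shot check over <span class="code">watch</span>ing <span class="code">/proc/drbd</span>, <span class="code">drbdadm</span> can report the connection, role and disk states directly; for example:
<source lang="bash">
# Connection state, roles and disk states for resource 'r0'.
drbdadm cstate r0
drbdadm role r0
drbdadm dstate r0
</source>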
=== A Note On sync Speed ===
You will notice in the previous step that the <span class="code">sync</span> speed seems awfully slow at <span class="code">320 (320) K/sec</span>.
'''This is not a problem!'''
As actual data is written to either side of the array, that data will be immediately copied to both nodes. As such, both nodes will always contain up-to-date copies of the real data. Given this, the <span class="code">syncer</span> rate is intentionally set low so as not to put too much load on the underlying disks, which could cause slowdowns. If you still wish to increase the sync speed, you can do so with the following command.
<source lang="bash">
drbdsetup /dev/drbd0 syncer -r 100M
</source>
The speed-up will not be instant. It will take a little while for the speed to pick up. Once the sync is finished, it is a good idea to revert to the default sync rate.
<source lang="text">
drbdadm syncer r0
</source>
= Setting Up CLVM =
The goal of DRBD in the cluster is to provide clustered [[LVM]], referred to as [[CLVM]], to the nodes. This is done by turning the DRBD partition into a CLVM physical volume.
So now we will create a [[PV]] on top of the new [[DRBD]] partition, <span class="code">/dev/drbd0</span>, that we created in the previous step. Since this new LVM [[PV]] will exist on top of the shared DRBD partition, whatever gets written to its logical volumes will be immediately available on either node, regardless of which node actually initiated the write.
This capability is the underlying reason for creating this cluster: neither machine on its own is critical, so if one machine dies, anything on top of the DRBD partition will still be available. When the failed machine returns, the surviving node will have a list of which blocks changed while it was gone and can use this list to quickly re-sync the other server.
== Making LVM Cluster-Aware ==
Normally, LVM is run on a single server. This means that at any time, LVM can write data to the underlying drive without needing to worry that anything else might change it. In a cluster, this isn't the case; the other node could try to write to the shared storage at the same time, so the nodes need to enable "locking" to prevent them from working on the same bit of data simultaneously.
The process of enabling this locking is known as making LVM "cluster-aware".
LVM has a tool called <span class="code">lvmconf</span> that can be used to enable LVM locking. This is provided as part of the <span class="code">lvm2-cluster</span> package.
<source lang="bash">
yum install lvm2-cluster.x86_64
</source>
Now, to enable cluster awareness in LVM, run the following command.
<source lang="bash">
lvmconf --enable-cluster
</source>
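If you want to confirm what <span class="code">lvmconf</span> changed, the locking type in <span class="code">/etc/lvm/lvm.conf</span> should now be set to the clustered locking type, <span class="code">3</span>; a quick check:
<source lang="bash">
# Expect to see 'locking_type = 3' after enabling cluster locking.
grep -E '^[[:space:]]*locking_type' /etc/lvm/lvm.conf
</source>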
=== Enabling Cluster Locking ===
By default, <span class="code">clvmd</span>, the clustered LVM daemon, is stopped and not set to run on boot. Now that we've enabled LVM locking, we need to start it:
<source lang="bash">
/etc/init.d/clvmd status
</source>
<source lang="text">
clvmd is stopped
active volumes: lv_drbd lv_root lv_swap
</source>
As expected, it is stopped, so let's start it:
<source lang="bash">
/etc/init.d/clvmd start
</source>
<source lang="text">
Stopping clvm:                                            [  OK  ]
Starting clvmd:                                            [  OK  ]
Activating VGs:  3 logical volume(s) in volume group "an-lvm01" now active
                                                          [  OK  ]
</source>
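Note that <span class="code">clvmd</span> also needs to be enabled at boot. This is handled by the <span class="code">chkconfig</span> calls in the "[[#Applying The Changes|Applying The Changes]]" section below, but if you prefer, you can enable it right away:
<source lang="bash">
chkconfig clvmd on
</source>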
== Creating a new PV using the DRBD Partition ==
We can now proceed with setting up the new DRBD-based LVM physical volume. Once the PV is created, we can create a new volume group and start allocating space to logical volumes.
'''Note''': As we will be using our DRBD device, and as it is a shared block device, most of the following commands only need to be run on one node. Once the block device changes in any way, those changes will near-instantly appear on the other node. For this reason, unless explicitly stated to do so, only run the following commands on one node.
To setup the DRBD partition as an LVM PV, run <span class="code">pvcreate</span>:
<source lang="bash">
pvcreate /dev/drbd0
</source>
<source lang="text">
  Physical volume "/dev/drbd0" successfully created
</source>
Now, on both nodes, check that the new physical volume is visible by using <span class="code">pvdisplay</span>:
<source lang="bash">
pvdisplay
</source>
<source lang="text">
  --- Physical volume ---
  PV Name              /dev/md1
  VG Name              vg_01
  PV Size              465.52 GiB / not usable 15.87 MiB
  Allocatable          yes
  PE Size              32.00 MiB
  Total PE              14896
  Free PE              782
  Allocated PE          14114
  PV UUID              BuR5uh-R74O-kACb-S1YK-MHxd-9O69-yo1EKW
 
  "/dev/drbd0" is a new physical volume of "399.99 GiB"
  --- NEW Physical volume ---
  PV Name              /dev/drbd0
  VG Name             
  PV Size              399.99 GiB
  Allocatable          NO
  PE Size              0 
  Total PE              0
  Free PE              0
  Allocated PE          0
  PV UUID              LYOE1B-22fk-LfOn-pu9v-9lhG-g8vx-cjBnsY
</source>
If you see <span class="code">PV Name /dev/drbd0</span> on both nodes, then your DRBD setup and LVM configuration changes are working perfectly!
== Creating a VG on the new PV ==
Now we need to create the volume group using the <span class="code">vgcreate</span> command:
<source lang="bash">
vgcreate -c y drbd_vg0 /dev/drbd0
</source>
<source lang="text">
  Clustered volume group "drbd_vg0" successfully created
</source>
Now we'll check that the new VG is visible on both nodes using <span class="code">vgdisplay</span>:
<source lang="bash">
vgdisplay
</source>
<source lang="text">
  --- Volume group ---
  VG Name              vg_01
  System ID           
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  6
  VG Access            read/write
  VG Status            resizable
  MAX LV                0
  Cur LV                3
  Open LV              3
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size              465.50 GiB
  PE Size              32.00 MiB
  Total PE              14896
  Alloc PE / Size      14114 / 441.06 GiB
  Free  PE / Size      782 / 24.44 GiB
  VG UUID              YbHSKn-x64P-oEbe-8R0S-3PjZ-UNiR-gdEh6T
 
  --- Volume group ---
  VG Name              drbd_vg0
  System ID           
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  1
  VG Access            read/write
  VG Status            resizable
  Clustered            yes
  Shared                no
  MAX LV                0
  Cur LV                0
  Open LV              0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size              399.98 GiB
  PE Size              4.00 MiB
  Total PE              102396
  Alloc PE / Size      0 / 0 
  Free  PE / Size      102396 / 399.98 GiB
  VG UUID              NK00Or-t9Z7-9YHz-sDC8-VvBT-NPeg-glfLwy
</source>
If the new VG is visible on both nodes, we are ready to create our first logical volume using the <span class="code">lvcreate</span> tool.
== Creating the First LV on the new VG ==
Now we'll create a simple 20 GiB logical volume. We will use it as a shared GFS2 store for source ISOs (and Xen domU config files) later on.
<source lang="bash">
lvcreate -L 20G -n iso_store drbd_vg0
</source>
<source lang="text">
  Logical volume "iso_store" created
</source>
As before, we will check that the new logical volume is visible from both nodes by using the <span class="code">lvdisplay</span> command:
<source lang="bash">
lvdisplay
</source>
<source lang="text">
  --- Logical volume ---
  LV Name                /dev/vg_01/lv_root
  VG Name                vg_01
  LV UUID                dl6jxD-asN7-bGYL-H4yO-op6q-Nt6y-RxkPnt
  LV Write Access        read/write
  LV Status              available
  # open                1
  LV Size                39.06 GiB
  Current LE            1250
  Segments              1
  Allocation            inherit
  Read ahead sectors    auto
  - currently set to    256
  Block device          253:0
 
  --- Logical volume ---
  LV Name                /dev/vg_01/lv_swap
  VG Name                vg_01
  LV UUID                VL3G06-Ob0o-sEB9-qNX3-rIAJ-nzW5-Auf64W
  LV Write Access        read/write
  LV Status              available
  # open                1
  LV Size                2.00 GiB
  Current LE            64
  Segments              1
  Allocation            inherit
  Read ahead sectors    auto
  - currently set to    256
  Block device          253:1
 
  --- Logical volume ---
  LV Name                /dev/vg_01/lv_drbd
  VG Name                vg_01
  LV UUID                SRT3N5-kA84-I3Be-LI20-253s-qTGT-fuFPfr
  LV Write Access        read/write
  LV Status              available
  # open                2
  LV Size                400.00 GiB
  Current LE            12800
  Segments              1
  Allocation            inherit
  Read ahead sectors    auto
  - currently set to    256
  Block device          253:2
 
  --- Logical volume ---
  LV Name                /dev/drbd_vg0/iso_store
  VG Name                drbd_vg0
  LV UUID                H0M5fL-Wxb6-o8cb-Wb30-Rla3-fwzp-tzdR62
  LV Write Access        read/write
  LV Status              available
  # open                0
  LV Size                20.00 GiB
  Current LE            5120
  Segments              1
  Allocation            inherit
  Read ahead sectors    auto
  - currently set to    256
  Block device          253:3
</source>
The last entry is the new logical volume.
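If you just want a quick summary rather than the full <span class="code">lvdisplay</span> output, the <span class="code">lvs</span> command prints one line per logical volume; for example:
<source lang="bash">
# The new 'iso_store' LV should be listed under the 'drbd_vg0' VG on both nodes.
lvs
</source>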
= Creating A Shared GFS FileSystem =
GFS2 is a cluster-aware file system that can be mounted on two or more nodes at once. We will use it as a place to store ISOs that we'll use to provision our virtual machines.
Start by installing the <span class="code">[[GFS2]]</span> tools:
<source lang="bash">
yum install gfs2-utils.x86_64
</source>
As before, modify the <span class="code">gfs2</span> init script to start after <span class="code">clvmd</span>, and then modify <span class="code">xendomains</span> to start after <span class="code">gfs2</span> (the init script edits themselves are detailed in the "[[#Altering Start/Stop Orders|Altering Start/Stop Orders]]" section below). Finally, use <span class="code">chkconfig</span> to reconfigure the boot order:
<source lang="bash">
chkconfig xend off; chkconfig cman off; chkconfig drbd off; chkconfig clvmd off; chkconfig xendomains off; chkconfig gfs2 off
chkconfig xend on; chkconfig cman on; chkconfig drbd on; chkconfig clvmd on; chkconfig xendomains on; chkconfig gfs2 on
</source>
The following example is designed for the cluster used in this paper.
* If you have more than 2 nodes, increase the <span class="code">-j 2</span> to the number of nodes you want to mount this file system on.
* If your cluster is named something other than <span class="code">an-cluster</span> (as set in the <span class="code">cluster.conf</span> file), change <span class="code">-t an-cluster:iso_store</span> to match your cluster's name. The <span class="code">iso_store</span> part can be whatever you like, but it must be unique in the cluster. I tend to use a name that matches the LV name, but this is my own preference and is not required.
To format the partition run:
<source lang="bash">
mkfs.gfs2 -p lock_dlm -j 2 -t an-cluster:iso_store /dev/drbd_vg0/iso_store
</source>
If you are prompted, press <span class="code">y</span> to proceed.
Once the format completes, you can mount <span class="code">/dev/drbd_vg0/iso_store</span> as you would a normal file system.
'''Both''':
To complete the example, let's mount the GFS2 partition we just made on <span class="code">/shared</span>.
<source lang="bash">
mkdir /shared
mount /dev/drbd_vg0/iso_store /shared
</source>
Done!
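If you want the GFS2 partition mounted automatically at boot, the <span class="code">gfs2</span> init script (configured in the sections below) mounts any <span class="code">gfs2</span> file systems listed in <span class="code">/etc/fstab</span>. An entry like the following could be added on both nodes; the mount options shown here are only an example:
<source lang="bash">
echo "/dev/drbd_vg0/iso_store  /shared  gfs2  defaults,noatime  0 0" >> /etc/fstab
</source>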
== Growing a GFS2 Partition ==
To grow a GFS2 partition, you must know where it is mounted; you cannot grow an unmounted GFS2 partition, as odd as that may seem at first. Also, you only need to run the grow commands from one node. Once completed, all nodes will see and use the new free space automatically.
This requires two steps to complete:
# Extend the underlying LVM logical volume
# Grow the actual GFS2 partition
=== Extend the LVM LV ===
To keep things simple, we'll just use some of the free space we left on our <span class="code">/dev/drbd0</span> LVM physical volume. If you need to add more storage to your LVM first, please follow the instructions in the article: "[[Adding Space to an LVM]]" before proceeding.
Let's add <span class="code">50 GiB</span> to our GFS2 logical volume <span class="code">/dev/drbd_vg0/iso_store</span> from the <span class="code">/dev/drbd0</span> physical volume, which we know is available because we left more than that free when we first set up our LVM. To actually add the space, we use the <span class="code">lvextend</span> command:
<source lang="bash">
lvextend -L +50G /dev/drbd_vg0/iso_store /dev/drbd0
</source>
Which should return:
<source lang="text">
  Extending logical volume iso_store to 70.00 GB
  Logical volume iso_store successfully resized
</source>
If we run <span class="code">lvdisplay /dev/drbd_vg0/iso_store</span> now, we should see the extra space.
<source lang="text">
  --- Logical volume ---
  LV Name                /dev/drbd_vg0/iso_store
  VG Name                drbd_vg0
  LV UUID                svJx35-KDXK-ojD2-UDAA-Ah9t-UgUl-ijekhf
  LV Write Access        read/write
  LV Status              available
  # open                1
  LV Size                70.00 GB
  Current LE            17920
  Segments              2
  Allocation            inherit
  Read ahead sectors    auto
  - currently set to    256
  Block device          253:3
</source>
You're now ready to proceed.
=== Grow The GFS2 Partition ===
This step is pretty simple, but you need to enter the commands exactly. Also, you'll want to do a dry-run first and address any resulting errors before issuing the final <span class="code">gfs2_grow</span> command.
To get the exact name to use when calling <span class="code">gfs2_grow</span>, run the following command:
<source lang="bash">
gfs2_tool df
</source>
<source lang="text">
/shared:
  SB lock proto = "lock_dlm"
  SB lock table = "an-cluster:iso_store"
  SB ondisk format = 1801
  SB multihost format = 1900
  Block size = 4096
  Journals = 2
  Resource Groups = 80
  Mounted lock proto = "lock_dlm"
  Mounted lock table = "an-cluster:iso_store"
  Mounted host data = "jid=1:id=196610:first=0"
  Journal number = 1
  Lock module flags = 0
  Local flocks = FALSE
  Local caching = FALSE
  Type          Total Blocks  Used Blocks    Free Blocks    use%         
  ------------------------------------------------------------------------
  data          5242304        1773818        3468486        34%
  inodes        3468580        94            3468486        0%
</source>
From this output, we know that GFS2 expects the name "<span class="code">/shared</span>". Even adding something as simple as a trailing slash ''will not work''. The program we will use is <span class="code">gfs2_grow</span>; its <span class="code">-T</span> switch runs the command as a test so that you can work out possible errors first.
For example, if you added the trailing slash, this is the kind of error you would see:
'''Bad command''':
<source lang="bash">
gfs_grow -T /shared/
</source>
<source lang="bash">
GFS Filesystem /shared/ not found
</source>
Once we get it right, it will look like this:
<source lang="bash">
gfs_grow -T /shared
</source>
<source lang="bash">
(Test mode--File system will not be changed)
FS: Mount Point: /shared
FS: Device:      /dev/mapper/drbd_vg0-iso_store
FS: Size:        5242878 (0x4ffffe)
FS: RG size:    65535 (0xffff)
DEV: Size:      18350080 (0x1180000)
The file system grew by 51200MB.
gfs2_grow complete.
</source>
This looks good! We're now ready to re-run the command without the <span class="code">-T</span> switch:
<source lang="bash">
gfs_grow /shared
</source>
<source lang="bash">
FS: Mount Point: /shared
FS: Device:      /dev/mapper/drbd_vg0-iso_store
FS: Size:        5242878 (0x4ffffe)
FS: RG size:    65535 (0xffff)
DEV: Size:      18350080 (0x1180000)
The file system grew by 51200MB.
gfs2_grow complete.
</source>
You can check that the new space is available on both nodes now using a simple call like <span class="code">df -h</span>.
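For example, run the following on both nodes and compare the reported size and free space:
<source lang="bash">
df -h /shared
</source>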
= Provisioning Xen domU Virtual Machines =
To Do.
= Altering Start/Stop Orders =
It is important that the various daemons in use by our cluster start and stop in the right order. Most daemons expect another to be running, and will not operate reliably if shut down in the wrong order, possibly leaving your node(s) hung on reboot.
We need to make sure that <span class="code">xend</span> starts first so that the network is stable. Then <span class="code">cman</span> needs to start so that [[fencing]] and [[dlm]] are available. Next, <span class="code">drbd</span> starts so that the clustered storage is available. Then <span class="code">clvmd</span> must start so that the data on the DRBD resource is accessible. Next, <span class="code">gfs2</span> needs to start so that the Xen domU configuration files can be found, and finally <span class="code">xendomains</span> must start to boot the actual domU virtual machines. The shutdown order is simply the reverse.
To restate as a list, the start order, and reverse stop order, must be:
* <span class="code">xend</span>
* <span class="code">cman</span>
* <span class="code">drbd</span>
* <span class="code">clvmd</span>
* <span class="code">gfs2</span>
* <span class="code">xendomains</span>
To make sure the start order is sane then, we'll edit each of the six daemons' <span class="code">init</span> scripts and alter their <span class="code">Required-Start</span> and <span class="code">Required-Stop</span> lines. Finally, to make the changes take effect, we will use <span class="code">chkconfig</span> to remove and re-add them to the various start levels.
== Altering xend ==
This should already be done. If it isn't, please see "[[#Make xend play nice with clustering|Making xend play nice with clustering]]" above. If you are revisiting that section, you can skip the <span class="code">cman</span> edit as we will need to make another change in the next step.
== Altering cman ==
We edited <span class="code">/etc/init.d/cman</span> earlier. Now we will edit it again and tell it to stop after <span class="code">drbd</span>, in addition to the earlier change that told it to start after <span class="code">xend</span>.
<source lang="bash">
vim /etc/init.d/cman
</source>
<source lang="text">
#!/bin/bash
#
# cman - Cluster Manager init script
#
# chkconfig: - 21 79
# description: Starts and stops cman
#
#
### BEGIN INIT INFO
# Provides:            cman
# Required-Start:      $network $time xend
# Required-Stop:        $network $time drbd
# Default-Start:
# Default-Stop:
# Short-Description:    Starts and stops cman
# Description:          Starts and stops the Cluster Manager set of daemons
### END INIT INFO
</source>
== Altering drbd ==
Now we will tell <span class="code">drbd</span> to start after <span class="code">cman</span> and to not stop until <span class="code">clvmd</span> has stopped.
This requires the additional step of altering the <span class="code">chkconfig: - 70 08</span> line to instead read <span class="code">chkconfig: - 20 08</span>. This isn't strictly needed, but it gives <span class="code">chkconfig</span> more room to order the dependent daemons by allowing DRBD to be started as low as position <span class="code">20</span>, rather than waiting until position <span class="code">70</span>. This is somewhat more compatible with <span class="code">cman</span> and <span class="code">clvmd</span>, which normally start at positions <span class="code">21</span> and <span class="code">24</span>, respectively.
<source lang="bash">
vim /etc/init.d/drbd
</source>
<source lang="text">
#!/bin/bash
#
# chkconfig: - 20 08
# description: Loads and unloads the drbd module
#
# Copright 2001-2008 LINBIT Information Technologies
# Philipp Reisner, Lars Ellenberg
#
### BEGIN INIT INFO
# Provides: drbd
# Required-Start: $local_fs $network $syslog cman
# Required-Stop:  $local_fs $network $syslog clvmd
# Should-Start:  sshd multipathd
# Should-Stop:    sshd multipathd
# Default-Start:
# Default-Stop:
# Short-Description:    Control drbd resources.
### END INIT INFO
</source>
== Altering clvmd ==
Now we will tell <span class="code">clvmd</span> to start after <span class="code">drbd</span> and to not stop until <span class="code">gfs2</span> has stopped.
<source lang="bash">
vim /etc/init.d/clvmd
</source>
<source lang="text">
#!/bin/bash
#
# chkconfig: - 24 76
# description: Starts and stops clvmd
#
# For Red-Hat-based distributions such as Fedora, RHEL, CentOS.
#             
### BEGIN INIT INFO
# Provides: clvmd
# Required-Start: $local_fs drbd
# Required-Stop: $local_fs gfs2
# Default-Start:
# Default-Stop: 0 1 6
# Short-Description: Clustered LVM Daemon
### END INIT INFO
</source>
== Altering gfs2 ==
Now we will tell <span class="code">gfs2</span> to start after <span class="code">clvmd</span> and to not stop until <span class="code">xendomains</span> has stopped. You will notice that <span class="code">cman</span> is already listed under <span class="code">Required-Start</span> and <span class="code">Required-Stop</span>. It's true that <span class="code">cman</span> must be started, but we've created a chain here, so we can safely replace it with <span class="code">clvmd</span> in the start line.
As for the stop line, <span class="code">gfs2</span> should stop '''before''' <span class="code">cman</span>, as it relies on <span class="code">cman</span>'s [[DLM]] to operate safely. If anyone has insight into why <span class="code">gfs2</span> is set to stop first, please [mailto:digimer@alteeve.com let me know]. Regardless, we don't want GFS2 to stop before all domUs are down or gone, so we'll put <span class="code">xendomains</span> in its place.
<source lang="bash">
vim /etc/init.d/gfs2
</source>
<source lang="text">
#!/bin/bash
#
# gfs2 mount/unmount helper
#
# chkconfig: - 26 74
# description: mount/unmount gfs2 filesystems configured in /etc/fstab
### BEGIN INIT INFO
# Provides:            gfs2
# Required-Start:      $network clvmd
# Required-Stop:        $network xendomains
# Default-Start:
# Default-Stop:
# Short-Description:    mount/unmount gfs2 filesystems configured in /etc/fstab
# Description:          mount/unmount gfs2 filesystems configured in /etc/fstab
### END INIT INFO
</source>
== Altering xendomains ==
Finally, we will alter <span class="code">xendomains</span> so that it starts last, after <span class="code">gfs2</span>. It needs to be the first daemon to stop, so we will not require that anything else be stopped first. By default, <span class="code">xend</span> is set in both the start and stop lines. Thanks to our boot chain, we can again safely replace <span class="code">xend</span> with <span class="code">gfs2</span> in the start line, and we'll simply remove <span class="code">xend</span> from the stop line.
<source lang="bash">
vim /etc/init.d/xendomains
</source>
<source lang="text">
#!/bin/bash
#
# /etc/init.d/xendomains
# Start / stop domains automatically when domain 0 boots / shuts down.
#
# chkconfig: 345 99 00
# description: Start / stop Xen domains.
#
# This script offers fairly basic functionality.  It should work on Redhat
# but also on LSB-compliant SuSE releases and on Debian with the LSB package
# installed.  (LSB is the Linux Standard Base)
#
# Based on the example in the "Designing High Quality Integrated Linux
# Applications HOWTO" by Avi Alkalay
# <http://www.tldp.org/HOWTO/HighQuality-Apps-HOWTO/>
#
### BEGIN INIT INFO
# Provides:          xendomains
# Required-Start:    $syslog $remote_fs gfs2
# Should-Start:
# Required-Stop:    $syslog $remote_fs
# Should-Stop:
# Default-Start:    3 4 5
# Default-Stop:      0 1 2 6
# Default-Enabled:  yes
# Short-Description: Start/stop secondary xen domains
# Description:      Start / stop domains automatically when domain 0
#                    boots / shuts down.
### END INIT INFO
</source>
== Applying The Changes ==
Change the start order by removing and re-adding all cluster-related daemons using <span class="code">chkconfig</span>.
<source lang="bash">
chkconfig xend off; chkconfig cman off; chkconfig drbd off; chkconfig clvmd off; chkconfig gfs2 off; chkconfig xendomains off
chkconfig xendomains on; chkconfig gfs2 on; chkconfig clvmd on; chkconfig drbd on; chkconfig cman on; chkconfig xend on
</source>
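To confirm the resulting order, you can list the run-level 3 links as was done earlier for <span class="code">xend</span> and <span class="code">cman</span>; for example:
<source lang="bash">
# The S## prefixes show the final start order; xend should have the lowest
# number of the six and xendomains the highest.
ls /etc/rc3.d/ | grep -E 'xend|cman|drbd|clvmd|gfs2|xendomains'
</source>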






Warning: This is currently a dumping ground for notes. DO NOT FOLLOW THIS DOCUMENT'S INSTRUCTIONS. Seriously, it could blow up your computer or cause winter to come early.


This HowTo will walk you through setting up Xen VMs using DRBD and CLVM for high availability.

Prerequisite

This tutorial is an extension of the Two Node Fedora 13 Cluster HowTo. As such, you will be expected to have a freshly built two-node cluster with spare disk space on either node.

Please do not proceed until you have completed the first tutorial.

Overview

This tutorial will cover several topics: DRBD, CLVM, GFS2, Xen dom0 and domU VMs, and rgmanager. Their relationship is thus:

  • DRBD provides a mechanism to replicate data across both nodes in real time and guarantees a consistent view of that data from either node. Think of it like RAID level 1, but across machines.
  • CLVM sits on the DRBD partition and provides the underlying mechanism for allowing both nodes to access shared data in a clustered environment. It will host a shared filesystem by way of GFS2 as well as LVs that Xen's domU VMs will use as their disk space.
  • GFS2 will be the clustered file system used on one of the DRBD-backed, CLVM-managed partitions. Files that need to be shared between nodes, like the Xen VM configuration files, will exist on this partition.
  • Xen will be the hypervisor in use that will manage the various virtual machines. Each virtual machine will exist in an LVM LV.
    • Xen's dom0 is the special "host" virtual machine. In this case, dom0 will be the OS installed in the first HowTo.
    • Xen's domU virtual machines will be the "floating", highly available servers.
  • Lastly, rgmanager will be the component of cman that will be configured to manage the automatic migration of the virtual machines when failures occur and when nodes recover.

Setting Up Xen's dom0

It may seem odd to start with Xen at this stage, but it is going to rather fundamentally alter each node's "host" operating system.

At this point, each node's host OS is a traditional operating system running on the bare metal. When we install a dom0 kernel, we tell Xen to boot a mini operating system first, and then to boot our "host" operating system. In effect, this converts the host node's operating system into just another virtual machine, albeit with a special view of the underlying hardware and Xen hypervisor.

This conversion is somewhat disruptive, so I like to get it out of the way right away. We will then do the rest of the setup before returning to Xen later on to create the floating virtual machines.

A Note On The State Of Xen dom0 Support In Fedora

As of Fedora 8, support for Xen dom0 has been removed. This is temporary, and dom0 is expected to be restored as a supported option in Fedora 15 or 16.

The reason for the removal is that, at this time, much of the code needed to create a dom0 kernel needs to be applied as patches against a vanilla Linux kernel. This is a very time-consuming task that will be resolved once many of these patches are moved into the kernel proper. Once that happens, dom0 support will become native and the overhead will be significantly reduced for the Fedora developers.

What this means for us is that we need to use a non-standard dom0 kernel. Specifically, we will use a kernel created by myoung for Fedora 12. This kernel does not directly support DRBD, so be aware that we will need to build new DRBD kernel modules for his kernel and then rebuild the DRBD modules each time his kernel is updated.

Install The Hypervisor

The Xen hypervisor is the program that manages the virtual servers, provides the virtual hardware, routes access to the real hardware and so on. To install it, simply install the xen RPM package.

yum install xen.x86_64

Install myoung's dom0

This uses a kernel built for Fedora 12, but it works on Fedora 13. This step involves adding and enabling his repository.

To add the repository, download the myoung.dom0.repo into the /etc/yum.repos.d/ directory.

cd /etc/yum.repos.d/
wget -c http://myoung.fedorapeople.org/dom0/myoung.dom0.repo

To enable his repository, edit the repository file and change the two enabled=0 entries to enabled=1.

vim /etc/yum.repos.d/myoung.dom0.repo
[myoung-dom0]
name=myoung's repository of Fedora based dom0 kernels - $basearch
baseurl=http://fedorapeople.org/~myoung/dom0/$basearch/
enabled=1
gpgcheck=0

[myoung-dom0-source]
name=myoung's repository of Fedora based dom0 kernels - Source
baseurl=http://fedorapeople.org/~myoung/dom0/src/
enabled=1
gpgcheck=0

Install the Xen dom0 kernel (edit the version number with the updated version if it has changed).

yum install kernel-2.6.32.21-167.xendom0.fc12.x86_64

The entry in grub's /boot/grub/menu.lst won't work. You will need to edit it so that it calls the existing installed operating system as a module.

Note: Copy and modify the entry created by the RPM. Simply copying this entry will almost certainly not work! Your root= is likely different and your rd_MD_UUID= will definitely be different, even on the same machine across installs. Generally speaking, what follows the kernel /vmlinuz-2.6.32.21-167.xendom0.fc12.x86_64 ... entry made by the dom0 kernel can be copied after the module /vmlinuz-2.6.32.21-167.xendom0.fc12.x86_64 ... entry in the example below.

vim /boot/grub/menu.lst
title Xen 3.4.x, Linux kernel 2.6.32.21-167.xendom0.fc12.x86_64
	root   (hd0,2)
	kernel /xen.gz dom0_mem=1024M
	module /vmlinuz-2.6.32.21-167.xendom0.fc12.x86_64 ...
	module /initramfs-2.6.32.21-167.xendom0.fc12.x86_64.img

Lastly, we need to tell fstab to mount the virtual /proc/xen file system on boot. Do this by appending the following line to /etc/fstab

echo "xenfs                   /proc/xen               xenfs   defaults        0 0" >> /etc/fstab

Make xend play nice with clustering

By default under Fedora 13, cman will start before xend. This is a problem because xend takes the network down as part of its setup. This causes totem communication to fail, which leads to fencing.

To avoid this, edit /etc/init.d/xend and tell it to start earlier than position 98. This is done by changing the line chkconfig: 2345 98 01 to chkconfig: 2345 11 98.

We also don't want it to stop until cman has stopped. We accomplish this by adding cman to the Required-Stop line.

vim /etc/init.d/xend
#!/bin/bash
#
# xend          Script to start and stop the Xen control daemon.
#
# Author:       Keir Fraser <keir.fraser@cl.cam.ac.uk>
#
# chkconfig: 2345 11 98
# description: Starts and stops the Xen control daemon.
### BEGIN INIT INFO
# Provides:          xend
# Required-Start:    $syslog $remote_fs
# Should-Start:
# Required-Stop:     $syslog $remote_fs cman
# Should-Stop:
# Default-Start:     3 4 5
# Default-Stop:      0 1 2 6
# Default-Enabled:   yes
# Short-Description: Start/stop xend
# Description:       Starts and stops the Xen control daemon.
### END INIT INFO

With xend set to start at a position lower than 98, we now have room for chkconfig to put other daemons after it in the start order, which will be needed a little later. First and foremost, we now need to tell cman to not start until after xend is up.

As above, we will now edit cman's /etc/init.d/cman script. This time though, we will not edit its chkconfig line. Instead, we will simply add xend to the Required-Start line.

vim /etc/init.d/cman
#!/bin/bash
#
# cman - Cluster Manager init script
#
# chkconfig: - 21 79
# description: Starts and stops cman
#
#
### BEGIN INIT INFO
# Provides:             cman
# Required-Start:       $network $time xend
# Required-Stop:        $network $time
# Default-Start:
# Default-Stop:
# Short-Description:    Starts and stops cman
# Description:          Starts and stops the Cluster Manager set of daemons
### END INIT INFO

Finally, remove and re-add the xend and cman daemons to re-order them in the start list:

chkconfig xend off; chkconfig cman off; chkconfig xend on; chkconfig cman on

Confirm that the order has changed so that xend is earlier in the boot sequence than cman. Assuming you've switched to run-level 3, run:

ls -lah /etc/rc3.d/

Your start sequence should now look like:

lrwxrwxrwx.  1 root root   14 Sep  1 19:26 S26xend -> ../init.d/xend
lrwxrwxrwx.  1 root root   14 Sep  1 19:26 S27cman -> ../init.d/cman

Booting Into The New dom0

If everything went well, you should be able to boot the new dom0 operating system. If you watch closely, you will see that the boot process is different; you should now see the Xen hypervisor boot prior to handing off to the "host" operating system. This can be confirmed once the dom0 operating system has booted by checking that the file /proc/xen/capabilities exists. What it contains doesn't matter at this stage, only that it exists at all.

cat /proc/xen/capabilities
control_d

If you see something like this, then you are ready to proceed!

Building the DRBD Array

Building the DRBD array requires a few steps. First, raw space on either node must be prepared. Next, DRBD must be told that it is to create a resource using this newly configured raw space. Finally, the new array must be initialized.

A Map of the Cluster's Storage

The layout of the storage in the cluster can quickly become difficult to follow. Below is an ASCII drawing which should help you see how DRBD will tie in to the rest of the cluster's storage. This map assumes a simple RAID level 1 array underlying each node. If your node has a single hard drive, simply collapse the first two layers into one. Similarly, if your underlying storage is a more complex RAID array, simply expand the number of physical devices at the top level.

               Node1                                Node2
           _____   _____                        _____   _____
          | sda | | sdb |                      | sda | | sdb |
          |_____| |_____|                      |_____| |_____|
             |_______|                            |_______|
     _______ ____|___ _______             _______ ____|___ _______
  __|__   __|__    __|__   __|__       __|__   __|__    __|__   __|__
 | md0 | | md1 |  | md2 | | md3 |     | md3 | | md2 |  | md1 | | md0 |
 |_____| |_____|  |_____| |_____|     |_____| |_____|  |_____| |_____|
    |       |        |       |           |       |        |       |
 ___|___   _|_   ____|____   |___________|   ____|____   _|_   ___|___
| /boot | | / | | <swap>  |        |        | <swap>  | | / | | /boot |
|_______| |___| |_________|  ______|______  |_________| |___| |_______|
                            | /dev/drbd0  |
                            |_____________|
                                   |
                               ____|______
                              | clvm PV   |
                              |___________|
                                   |
                              _____|_____
                             | drbd_vg0  |
                             |___________|
                                   |
                              _____|_____ ___...____
                             |           |          |
                          ___|___     ___|___    ___|___
                         | lv_X  |   | lv_Y  |  | lv_N  |
                         |_______|   |_______|  |_______|

Install The DRBD Tools

DRBD has two components: the userland application and tools, and the kernel module. The tools provided directly by Fedora are sufficient for our use. You will need to install the following packages:

yum install drbd.x86_64 drbd-xen.x86_64 drbd-utils.x86_64

Install The DRBD Kernel Module

The kernel module must match the dom0 kernel that is running. If you update the kernel and neglect to update the DRBD kernel module, the DRBD array will not start.

To help simplify things, links to pre-compiled DRBD kernel modules are provided. If the pre-compiled module doesn't match the dom0 kernel you have installed, instructions on rebuilding the DRBD kernel module from the source RPM are provided as well.
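
Before choosing, check which dom0 kernel is actually running; the pre-compiled RPMs below only apply if your kernel matches theirs exactly. A quick check (the expected output below assumes myoung's dom0 kernel from the earlier setup):

uname -r
2.6.32.21-167.xendom0.fc12.x86_64

If your output differs, skip ahead to building the module from the source RPM.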

Install Pre-Compiled DRBD Kernel Module RPMs

These are the two RPMs you will need to install. Note that these RPMs are compiled against myoung's 2.6.32.21_167 kernel.

You can install the two above RPMs with this command:

rpm -ivh https://alteeve.com/files/an-cluster/drbd-km-2.6.32.21_167.xendom0.fc12.x86_64-8.3.7-12.fc13.x86_64.rpm https://alteeve.com/files/an-cluster/drbd-km-debuginfo-8.3.7-12.fc13.x86_64.rpm

Building DRBD Kernel Module RPMs From Source

If the above RPMs don't work or if the dom0 kernel you are using in any way differs, please follow the steps here to create a DRBD kernel module matched to your running dom0.

First, install the build environment.

yum -y groupinstall "Development Libraries"
yum -y groupinstall "Development Tools"

Install the kernel headers and development library for the dom0 kernel:

Note: The following commands use --force to get past the fact that the headers for the 2.6.33 kernel are already installed, which makes RPM believe these older headers would conflict. Please proceed with caution.

rpm -ivh --force http://fedorapeople.org/~myoung/dom0/x86_64/kernel-headers-2.6.32.21-167.xendom0.fc12.x86_64.rpm http://fedorapeople.org/~myoung/dom0/x86_64/kernel-devel-2.6.32.21-167.xendom0.fc12.x86_64.rpm

Download, prepare, build and install the source RPM:

rpm -ivh http://fedora.mirror.iweb.ca/releases/13/Everything/source/SRPMS/drbd-8.3.7-2.fc13.src.rpm
cd /root/rpmbuild/SPECS/
rpmbuild -bp drbd.spec 
cd /root/rpmbuild/BUILD/drbd-8.3.7/
./configure --enable-spec --with-km
cp /root/rpmbuild/BUILD/drbd-8.3.7/drbd-km.spec /root/rpmbuild/SPECS/
cd /root/rpmbuild/SPECS/
rpmbuild -ba drbd-km.spec
cd /root/rpmbuild/RPMS/x86_64
rpm -Uvh drbd-km-*

You should be good to go now!
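
If you want to confirm that the freshly built module actually matches the running dom0 kernel, a quick sanity check (a sketch; modprobe and modinfo are the standard kernel module tools):

modprobe drbd
modinfo drbd | grep -E '^(version|vermagic)'

The version line should read 8.3.7 and the vermagic line should name your running dom0 kernel.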

Allocating Raw Space For DRBD On Each Node

If you followed the setup steps provided in "Two Node Fedora 13 Cluster", you will have a set amount of unconfigured hard drive space. This is what we will use for the DRBD space on either node. If you've got a different setup, you will need to allocate some raw space before proceeding.

Creating a RAID level 1 'md' Device

This assumes that you have two raw drives, /dev/sda and /dev/sdb. It further assumes that you've created three partitions which have been assigned to three existing /dev/mdX devices. With these assumptions, we will create /dev/sda4 and /dev/sdb4 and, using them, create a new /dev/md3 device that will host the DRBD partition.

If you do not have two drives, you can stop after creating a new partition. If you have multiple drives and plan to use a different RAID level, please adjust the following commands accordingly.

Creating The New Partitions

Warning: The next steps will have you directly accessing your server's hard drive configuration. Please do not proceed on a live server until you've had a chance to work through these steps on a test server. One mistake can blow away all your data.

Start the fdisk shell

fdisk /dev/sda
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
         switch off the mode (command 'c') and change display units to
         sectors (command 'u').

Command (m for help):

View the current configuration with the print option

p
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000c6fe1

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1        5100    40960000   fd  Linux raid autodetect
/dev/sda2            5100        5622     4194304   fd  Linux raid autodetect
/dev/sda3   *        5622        5654      256000   fd  Linux raid autodetect

Command (m for help):

Now we know for sure that the next free partition number is 4. We will now create the new partition.

n
Command action
   e   extended
   p   primary partition (1-4)

We will make it a primary partition

p
Selected partition 4
First cylinder (5654-60801, default 5654):

Then we simply hit <enter> to select the default starting block.

<enter>
Using default value 5654
Last cylinder, +cylinders or +size{K,M,G} (5654-60801, default 60801):

Once again we will press <enter> to select the default ending block.

<enter>
Using default value 60801

Command (m for help):

Now we need to change the type of partition that it is.

t
Partition number (1-4):

We know that we are modifying partition number 4.

4
Hex code (type L to list codes):

Now we need to set the hex code for the new partition's type. We want fd, which defines Linux raid autodetect.

fd
Changed system type of partition 4 to fd (Linux raid autodetect)

Now check that everything went as expected by once again printing the partition table.

p
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000c6fe1

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1        5100    40960000   fd  Linux raid autodetect
/dev/sda2            5100        5622     4194304   fd  Linux raid autodetect
/dev/sda3   *        5622        5654      256000   fd  Linux raid autodetect
/dev/sda4            5654       60801   442972704+  fd  Linux raid autodetect

Command (m for help):

There it is. So finally, we need to write the changes to the disk.

w
The partition table has been altered!

Calling ioctl() to re-read partition table.

WARNING: Re-reading the partition table failed with error 16: Device or resource busy.
The kernel still uses the old table. The new table will be used at
the next reboot or after you run partprobe(8) or kpartx(8)
Syncing disks.

If you see the above message, do not reboot until both drives have been set up; that way you only need to reboot once.

Repeat these steps for the second drive, /dev/sdb and then reboot if needed.
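
If you would rather avoid the reboot, you may be able to have the kernel re-read the partition tables with partprobe, as the warning above suggests (a sketch; partprobe is part of the parted package):

partprobe /dev/sda /dev/sdb

If partprobe also complains that a device is busy, fall back to the single reboot.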

Creating The New /dev/mdX Device

If you only have one drive, skip this step.

Now we need to use mdadm to create the new RAID level 1 device. This will be used as the device that DRBD will directly access.

mdadm --create /dev/md3 --homehost=localhost.localdomain --raid-devices=2 --level=1 /dev/sda4 /dev/sdb4
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90

Seeing as /boot doesn't exist on this device, we can safely ignore this warning.

y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/md4 started.

You can now cat /proc/mdstat to verify that it indeed built. If you're interested, you could open a new terminal window and use watch cat /proc/mdstat and watch the array build.

cat /proc/mdstat
md3 : active raid1 sdb4[1] sda4[0]
      442971544 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.8% (3678976/442971544) finish=111.0min speed=65920K/sec
      
md2 : active raid1 sda2[0] sdb2[1]
      4193272 blocks super 1.1 [2/2] [UU]
      
md1 : active raid1 sda1[0] sdb1[1]
      40958908 blocks super 1.1 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md0 : active raid1 sda3[0] sdb3[1]
      255988 blocks super 1.0 [2/2] [UU]
      
unused devices: <none>

Finally, we need to make sure that the new array will start when the system boots. To do this, we'll again use mdadm, but with different options that will have it output data in a format suitable for the /etc/mdadm.conf file. We'll redirect this output to that config file, thus updating it.

mdadm --detail --scan | grep md3 >> /etc/mdadm.conf
cat /etc/mdadm.conf
# mdadm.conf written out by anaconda
MAILADDR root
AUTO +imsm +1.x -all
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=b58df6d0:d925e7bb:c156168d:47c01718
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=ac2cf39c:77cd0314:fedb8407:9b945bb5
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=4e513936:4a966f4e:0dd8402e:6403d10d
ARRAY /dev/md3 metadata=1.2 name=localhost.localdomain:3 UUID=f0b6d0c1:490d47e7:91c7e63a:f8dacc21

You'll note that the last line, which we just added, is different from the previous lines. This isn't a concern, but you are welcome to re-write it to match the existing format if you wish.

Before you proceed, it is strongly advised that you reboot each node and then verify that the new array did in fact start with the system. You do not need to wait for the sync to finish before rebooting. It will pick up where you left off once rebooted.
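
A quick way to confirm the array came back after the reboot (a sketch):

cat /proc/mdstat
mdadm --detail /dev/md3 | grep -E 'State|Active Devices'

You should see md3 active with both member devices present (it may still be resyncing, which is fine).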

DRBD Configuration Files

DRBD uses a global configuration file, /etc/drbd.d/global_common.conf, and one or more resource files. The resource files need to be created in the /etc/drbd.d/ directory and must have the suffix .res. For this example, we will create a single resource called r0 which we will configure in /etc/drbd.d/r0.res.

/etc/drbd.d/global_common.conf

The stock /etc/drbd.d/global_common.conf is sane, so we won't bother altering it here.

Full details on all the drbd.conf configuration file directives and arguments can be found here. Note: That link doesn't show this new configuration format. Please see Novell's link.

/etc/drbd.d/r0.res

This is the important part. This defines the resource to use, and must reflect the IP addresses and storage devices that DRBD will use for this resource.

vim /etc/drbd.d/r0.res
# This is the name of the resource and its settings. Generally, 'r0' is used
# as the name of the first resource. This is by convention only, though.
resource r0
{
        # This tells DRBD where to make the new resource available at on each
        # node. This is, again, by convention only.
        device    /dev/drbd0;

        # The main argument here tells DRBD that we will have proper locking 
        # and fencing, and as such, to allow both nodes to set the resource to
        # 'primary' simultaneously.
        net
        {
                allow-two-primaries;
        }

        # This tells DRBD to automatically set both nodes to 'primary' when the
        # nodes start.
        startup
        {
                become-primary-on both;
        }

        # This tells DRBD to look for and store its meta-data on the resource
        # itself.
        meta-disk       internal;

        # The name below must match the output from `uname -n` on each node.
        on an-node01.alteeve.com
        {
                # This must be the IP address of the interface on the storage 
                # network (an-node01.sn, in this case).
                address         10.0.0.71:7789;

                # This is the underlying partition to use for this resource on 
                # this node.
                disk            /dev/md3;
        }

        # Repeat as above, but for the other node.
        on an-node02.alteeve.com
        {
                address         10.0.0.72:7789;
                disk            /dev/md3;
        }
}

This file must be copied to BOTH nodes and must match before you proceed.
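
One way to push the file to the other node and confirm that both copies match (a sketch, assuming the file was written on an-node01 and that root SSH access between the nodes is available):

scp /etc/drbd.d/r0.res root@an-node02.alteeve.com:/etc/drbd.d/
md5sum /etc/drbd.d/r0.res
ssh root@an-node02.alteeve.com "md5sum /etc/drbd.d/r0.res"

The two md5sum values must be identical before you continue.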

Starting The DRBD Resource

For the rest of this section, pay attention to whether a step is marked:

  • Node1
  • Node2
  • Both

These indicate which node to run the following commands on. There is no functional difference between the two nodes, so simply choose one to be Node1 and the other will be Node2. Once you've chosen which is which, be consistent about which node you run the commands on. Of course, if a command block is preceded by Both, run it on both nodes.

Initialize The Block Device

Node1

This step creates the DRBD meta-data on the new DRBD device. It is only needed when creating new DRBD partitions.

drbdadm create-md r0
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
success

Monitoring Progress

Both

I find it very useful to monitor DRBD while running the rest of the setup. To do this, open a second terminal on each node and use watch to keep an eye on /proc/drbd. This way you will be able to monitor the progress of the array in near-real time.

Both

watch cat /proc/drbd

At this stage, it should look like this:

version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@xenmaster002.iplink.net, 2010-09-07 16:02:46
 0: cs:Unconfigured

Starting the Resource

Both

This will attach the backing device, /dev/md3 in our case, and then start the new resource r0.

drbdadm up r0

There will be no output at the command line. If you are watching /proc/drbd though, you should now see something like this:

version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@xenmaster002.iplink.net, 2010-09-07 16:02:46
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:442957988

That it is Secondary/Secondary and Inconsistent/Inconsistent is expected.

Setting the First Primary Node

Node1

As this is a totally new resource, DRBD doesn't know which side of the array is "more valid" than the other. In reality, neither is, as there was no existing data of note on either node. This means that we now need to choose a node and tell DRBD to treat it as the "source" node. This step will also tell DRBD to make the "source" node primary. Once set, DRBD will begin sync'ing in the background.

drbdadm -- --overwrite-data-of-peer primary r0

As before, there will be no output at the command line, but /proc/drbd will change to show the following:

GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root@xenmaster002.iplink.net, 2010-09-07 16:02:46
 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
    ns:69024 nr:0 dw:0 dr:69232 al:0 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:442888964
        [>....................] sync'ed:  0.1% (432508/432576)M
        finish: 307:33:42 speed: 320 (320) K/sec

If you're watching the secondary node, the /proc/drbd will show ro:Secondary/Primary ds:Inconsistent/UpToDate. This is, as you can guess, simply a reflection of it being the "over-written" node.

Setting the Second Node to Primary

Node2

The last step to complete the array is to tell the second node to also become primary.

drbdadm primary r0

As with many drbdadm commands, nothing will be printed to the console. If you're watching the /proc/drbd though, you should see something like Primary/Primary ds:UpToDate/Inconsistent. The Inconsistent flag will remain until the sync is complete.

A Note On sync Speed

You will notice in the previous step that the sync speed seems awfully slow at 320 (320) K/sec.

This is not a problem!

As actual data is written to either side of the array, that data is immediately copied to both nodes, so both nodes always contain up-to-date copies of the real data. Given this, the sync rate is intentionally set low so as not to put too much load on the underlying disks, which could cause slowdowns. If you still wish to increase the sync speed, you can do so with the following command.

drbdsetup /dev/drbd0 syncer -r 100M

The speed-up will not be instant. It will take a little while for the speed to pick up. Once the sync is finished, it is a good idea to revert to the default sync rate.

drbdadm syncer r0

Setting Up CLVM

The goal of DRBD in the cluster is to provide clustered LVM, referred to as CLVM, to the nodes. This is done by turning the DRBD partition into a CLVM physical volume.

So now we will create a PV on top of the new DRBD partition, /dev/drbd0, that we created in the previous step. Since this new LVM PV will exist on top of the shared DRBD partition, whatever gets written to its logical volumes will be immediately available on either node, regardless of which node actually initiated the write.

This capability is the underlying reason for creating this cluster; neither machine on its own is critical, so if one machine dies, anything on top of the DRBD partition will still be available. When the failed machine returns, the surviving node will have a list of the blocks that changed while its peer was gone, and can use this list to quickly re-sync it.

Making LVM Cluster-Aware

Normally, LVM runs on a single server. This means that LVM can write data to the underlying drive at any time without needing to worry that anything else might change the data. In a cluster, this isn't the case; the other node could try to write to the shared storage, so the nodes need to enable "locking" to prevent the two of them from working on the same bit of data at the same time.

The process of enabling this locking is known as making LVM "cluster-aware".

LVM has a tool called lvmconf that can be used to enable LVM locking. It is provided as part of the lvm2-cluster package.

yum install lvm2-cluster.x86_64

Now, to enable cluster awareness in LVM, run the following command.

lvmconf --enable-cluster
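
lvmconf works by editing /etc/lvm/lvm.conf. A quick way to confirm the change took effect (it should now report locking_type = 3, LVM's built-in clustered locking used with clvmd):

grep '^[[:space:]]*locking_type' /etc/lvm/lvm.conf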

Enabling Cluster Locking

By default, clvmd, the cluster lvm daemon, is stopped and not set to run on boot. Now that we've enabled LVM locking, we need to start it:

/etc/init.d/clvmd status
clvmd is stopped
active volumes: lv_drbd lv_root lv_swap

As expected, it is stopped, so let's start it:

/etc/init.d/clvmd start
Stopping clvm:                                             [  OK  ]
Starting clvmd:                                            [  OK  ]
Activating VGs:   3 logical volume(s) in volume group "an-lvm01" now active
                                                           [  OK  ]

Creating a new PV using the DRBD Partition

We can now proceed with setting up the new DRBD-based LVM physical volume. Once the PV is created, we can create a new volume group and start allocating space to logical volumes.

Note: As we will be using our DRBD device, and as it is a shared block device, most of the following commands only need to be run on one node. Once the block device changes in any way, those changes will near-instantly appear on the other node. For this reason, unless explicitly told otherwise, run the following commands on one node only.

To setup the DRBD partition as an LVM PV, run pvcreate:

pvcreate /dev/drbd0
  Physical volume "/dev/drbd0" successfully created

Now, on both nodes, check that the new physical volume is visible by using pvdisplay:

pvdisplay
  --- Physical volume ---
  PV Name               /dev/md1
  VG Name               vg_01
  PV Size               465.52 GiB / not usable 15.87 MiB
  Allocatable           yes 
  PE Size               32.00 MiB
  Total PE              14896
  Free PE               782
  Allocated PE          14114
  PV UUID               BuR5uh-R74O-kACb-S1YK-MHxd-9O69-yo1EKW
   
  "/dev/drbd0" is a new physical volume of "399.99 GiB"
  --- NEW Physical volume ---
  PV Name               /dev/drbd0
  VG Name               
  PV Size               399.99 GiB
  Allocatable           NO
  PE Size               0   
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               LYOE1B-22fk-LfOn-pu9v-9lhG-g8vx-cjBnsY

If you see PV Name /dev/drbd0 on both nodes, then your DRBD setup and LVM configuration changes are working perfectly!

Creating a VG on the new PV

Now we need to create the volume group using the vgcreate command:

vgcreate -c y drbd_vg0 /dev/drbd0
  Clustered volume group "drbd_vg0" successfully created

Now we'll check that the new VG is visible on both nodes using vgdisplay:

vgdisplay
  --- Volume group ---
  VG Name               vg_01
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  6
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                3
  Open LV               3
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               465.50 GiB
  PE Size               32.00 MiB
  Total PE              14896
  Alloc PE / Size       14114 / 441.06 GiB
  Free  PE / Size       782 / 24.44 GiB
  VG UUID               YbHSKn-x64P-oEbe-8R0S-3PjZ-UNiR-gdEh6T
   
  --- Volume group ---
  VG Name               drbd_vg0
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  Clustered             yes
  Shared                no
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               399.98 GiB
  PE Size               4.00 MiB
  Total PE              102396
  Alloc PE / Size       0 / 0   
  Free  PE / Size       102396 / 399.98 GiB
  VG UUID               NK00Or-t9Z7-9YHz-sDC8-VvBT-NPeg-glfLwy

If the new VG is visible on both nodes, we are ready to create our first logical volume using the lvcreate tool.

Creating the First LV on the new VG

Now we'll create a simple 20 GiB logical volume. We will use it as a shared GFS store for source ISOs (and Xen domU config files) later on.

lvcreate -L 20G -n iso_store drbd_vg0
  Logical volume "iso_store" created

As before, we will check that the new logical volume is visible from both nodes by using the lvdisplay command:

lvdisplay
  --- Logical volume ---
  LV Name                /dev/vg_01/lv_root
  VG Name                vg_01
  LV UUID                dl6jxD-asN7-bGYL-H4yO-op6q-Nt6y-RxkPnt
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                39.06 GiB
  Current LE             1250
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0
   
  --- Logical volume ---
  LV Name                /dev/vg_01/lv_swap
  VG Name                vg_01
  LV UUID                VL3G06-Ob0o-sEB9-qNX3-rIAJ-nzW5-Auf64W
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                2.00 GiB
  Current LE             64
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:1
   
  --- Logical volume ---
  LV Name                /dev/vg_01/lv_drbd
  VG Name                vg_01
  LV UUID                SRT3N5-kA84-I3Be-LI20-253s-qTGT-fuFPfr
  LV Write Access        read/write
  LV Status              available
  # open                 2
  LV Size                400.00 GiB
  Current LE             12800
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2
   
  --- Logical volume ---
  LV Name                /dev/drbd_vg0/iso_store
  VG Name                drbd_vg0
  LV UUID                H0M5fL-Wxb6-o8cb-Wb30-Rla3-fwzp-tzdR62
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                20.00 GiB
  Current LE             5120
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:3

The last entry, iso_store, is the new logical volume.

Creating A Shared GFS FileSystem

GFS is a cluster-aware file system that can be simultaneously mounted on two or more nodes at once. We will use it as a place to store ISOs that we'll use to provision our virtual machines.

Start by installing the GFS2 tools:

yum install gfs2-utils.x86_64

As before, modify the gfs2 init script to start after clvmd and the xendomains init script to start after gfs2 (the exact edits are covered in "Altering Start/Stop Orders" below). Then use chkconfig to reconfigure the boot order:

chkconfig xend off; chkconfig cman off; chkconfig drbd off; chkconfig clvmd off; chkconfig xendomains off; chkconfig gfs2 off
chkconfig xend on; chkconfig cman on; chkconfig drbd on; chkconfig clvmd on; chkconfig xendomains on; chkconfig gfs2 on

The following example is designed for the cluster used in this paper.

  • If you have more than 2 nodes, increase the -j 2 to the number of nodes you want to mount this file system on.
  • If your cluster is named something other than an-cluster (as set in the cluster.conf file), change -t an-cluster:iso_store to match your cluster's name. The iso_store part can be whatever you like, but it must be unique in the cluster. I tend to use a name that matches the LV name, but this is my own preference and is not required.

To format the partition run:

mkfs.gfs2 -p lock_dlm -j 2 -t an-cluster:iso_store /dev/drbd_vg0/iso_store

If you are prompted, press y to proceed.

Once the format completes, you can mount /dev/drbd_vg0/iso_store as you would a normal file system.

Both:

To complete the example, let's mount the GFS2 partition we just made on /shared.

mkdir /shared
mount /dev/drbd_vg0/iso_store /shared

Done!
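
Note that the gfs2 init script we enabled above mounts the GFS2 file systems listed in /etc/fstab, so if you want /shared mounted automatically at boot you will likely also want an fstab entry on both nodes. A minimal sketch (noatime is optional, but a common choice for GFS2):

echo "/dev/drbd_vg0/iso_store /shared                 gfs2    defaults,noatime 0 0" >> /etc/fstab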

Growing a GFS2 Partition

To grow a GFS2 partition, you must know where it is mounted. You cannot grow an unmounted GFS2 partition, as odd as that may seem at first. Also, you only need to run the grow commands from one node. Once completed, all nodes will see and use the new free space automatically.

This requires two steps to complete:

  1. Extend the underlying LVM logical volume
  2. Grow the actual GFS2 partition

Extend the LVM LV

To keep things simple, we'll just use some of the free space we left on our /dev/drbd0 LVM physical volume. If you need to add more storage to your LVM first, please follow the instructions in the article: "Adding Space to an LVM" before proceeding.

Let's add 50GB to our GFS2 logical volume /dev/drbd_vg0/iso_store from the /dev/drbd0 physical volume, which we know is available because we left more than that free back when we first set up our LVM. To actually add the space, we need to use the lvextend command:

lvextend -L +50G /dev/drbd_vg0/iso_store /dev/drbd0

Which should return:

  Extending logical volume iso_store to 70.00 GB
  Logical volume iso_store successfully resized

If we run lvdisplay /dev/drbd_vg0/iso_store now, we should see the extra space.

  --- Logical volume ---
  LV Name                /dev/drbd_vg0/iso_store
  VG Name                drbd_vg0
  LV UUID                svJx35-KDXK-ojD2-UDAA-Ah9t-UgUl-ijekhf
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                70.00 GB
  Current LE             17920
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:3

You're now ready to proceed.

Grow The GFS2 Partition

This step is pretty simple, but you need to enter the commands exactly. Also, you'll want to do a dry-run first and address any resulting errors before issuing the final gfs2_grow command.

To get the exact name to use when calling gfs2_grow, run the following command:

gfs2_tool df
/shared:
  SB lock proto = "lock_dlm"
  SB lock table = "an-cluster:iso_store"
  SB ondisk format = 1801
  SB multihost format = 1900
  Block size = 4096
  Journals = 2
  Resource Groups = 80
  Mounted lock proto = "lock_dlm"
  Mounted lock table = "an-cluster:iso_store"
  Mounted host data = "jid=1:id=196610:first=0"
  Journal number = 1
  Lock module flags = 0
  Local flocks = FALSE
  Local caching = FALSE

  Type           Total Blocks   Used Blocks    Free Blocks    use%           
  ------------------------------------------------------------------------
  data           5242304        1773818        3468486        34%
  inodes         3468580        94             3468486        0%

From this output, we know that GFS2 expects the name "/shared". Even adding something as simple as a trailing slash will not work. The program we will use is gfs2_grow; the -T switch runs it in test mode so we can work out possible errors before making any changes.

For example, if you added the trailing slash, this is the kind of error you would see:

Bad command:

gfs2_grow -T /shared/
GFS Filesystem /shared/ not found

Once we get it right, it will look like this:

gfs2_grow -T /shared
(Test mode--File system will not be changed)
FS: Mount Point: /shared
FS: Device:      /dev/mapper/drbd_vg0-iso_store
FS: Size:        5242878 (0x4ffffe)
FS: RG size:     65535 (0xffff)
DEV: Size:       18350080 (0x1180000)
The file system grew by 51200MB.
gfs2_grow complete.

This looks good! We're now ready to re-run the command without the -T switch:

gfs2_grow /shared
FS: Mount Point: /shared
FS: Device:      /dev/mapper/drbd_vg0-iso_store
FS: Size:        5242878 (0x4ffffe)
FS: RG size:     65535 (0xffff)
DEV: Size:       18350080 (0x1180000)
The file system grew by 51200MB.
gfs2_grow complete.

You can check that the new space is available on both nodes now using a simple call like df -h.
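
For example, on either node:

df -h /shared

Both nodes should now report roughly 70G of total space for /shared.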

Provisioning Xen domU Virtual Machines

To Do.

Altering Start/Stop Orders

It is important that the various daemons in use by our cluster start and stop in the right order. Most daemons will expect another to be running, and will not operate reliably if shut down in the wrong order, possibly leaving your node(s) hung on reboot.

We need to make sure that xend starts so that the network is stable. Then cman needs to start so that fencing and dlm are available. Next, drbd starts so that the clustered storage is available. Then clvmd must start so that the data on the DRBD resource is accessible. Now gfs2 needs to start so that the Xen domU configuration files can be found and finally xendomains must start to boot up the actual domU virtual machines. The shut down order needs to be in reverse order.

To restate as a list, the start order, and reverse stop order, must be:

  • xend
  • cman
  • drbd
  • clvmd
  • gfs2
  • xendomains

To make sure the start order is sane, we'll edit each of the six daemons' init scripts and alter their Required-Start and Required-Stop lines. Finally, to make the changes take effect, we will use chkconfig to remove and re-add them to the various start levels.

Altering xend

This should already be done. If it isn't, please see "Make xend play nice with clustering" above. If you are revisiting that section, you can skip the cman edit, as we will make another change to it in the next step.

Altering cman

We edited /etc/init.d/cman earlier to make it start after xend. Now we will edit it again and tell it not to stop until drbd has stopped.

vim /etc/init.d/cman
#!/bin/bash
#
# cman - Cluster Manager init script
#
# chkconfig: - 21 79
# description: Starts and stops cman
#
#
### BEGIN INIT INFO
# Provides:             cman
# Required-Start:       $network $time xend
# Required-Stop:        $network $time drbd
# Default-Start:
# Default-Stop:
# Short-Description:    Starts and stops cman
# Description:          Starts and stops the Cluster Manager set of daemons
### END INIT INFO

Altering drbd

Now we will tell drbd to start after cman and to not stop until clvmd has stopped.

This requires the additional step of altering the chkconfig: - 70 08 line to instead read chkconfig: - 20 08. This isn't strictly needed, but it gives chkconfig more room to order the dependent daemons by allowing DRBD to start as early as position 20, rather than waiting until position 70. This is somewhat more compatible with cman and clvmd, which normally start at positions 21 and 24, respectively.

vim /etc/init.d/drbd
#!/bin/bash
#
# chkconfig: - 20 08
# description: Loads and unloads the drbd module
#
# Copright 2001-2008 LINBIT Information Technologies
# Philipp Reisner, Lars Ellenberg
#
### BEGIN INIT INFO
# Provides: drbd
# Required-Start: $local_fs $network $syslog cman
# Required-Stop:  $local_fs $network $syslog clvmd
# Should-Start:   sshd multipathd
# Should-Stop:    sshd multipathd
# Default-Start:
# Default-Stop:
# Short-Description:    Control drbd resources.
### END INIT INFO

Altering clvmd

Now we will tell clvmd to start after drbd and not to stop until gfs2 has stopped.

vim /etc/init.d/clvmd
#!/bin/bash
#
# chkconfig: - 24 76
# description: Starts and stops clvmd
#
# For Red-Hat-based distributions such as Fedora, RHEL, CentOS.
#              
### BEGIN INIT INFO
# Provides: clvmd
# Required-Start: $local_fs drbd
# Required-Stop: $local_fs gfs2
# Default-Start:
# Default-Stop: 0 1 6
# Short-Description: Clustered LVM Daemon
### END INIT INFO

Altering gfs2

Now we will tell gfs2 to start after clvmd and not to stop until xendomains has stopped. You will notice that cman is already listed under Required-Start and Required-Stop. It's true that cman must be started, but we've created a chain here, so we can safely replace it with clvmd in the start line.

As for the stop line, gfs2 should stop before cman, as it relies on cman's DLM to operate safely. (If anyone has insight on why gfs2 is set to stop first by default, please let me know.) Regardless, we don't want GFS2 to stop before all domUs are down or gone, so we'll set xendomains in its place.

vim /etc/init.d/gfs2
#!/bin/bash
#
# gfs2 mount/unmount helper
#
# chkconfig: - 26 74
# description: mount/unmount gfs2 filesystems configured in /etc/fstab

### BEGIN INIT INFO
# Provides:             gfs2
# Required-Start:       $network clvmd
# Required-Stop:        $network xendomains
# Default-Start:
# Default-Stop:
# Short-Description:    mount/unmount gfs2 filesystems configured in /etc/fstab
# Description:          mount/unmount gfs2 filesystems configured in /etc/fstab
### END INIT INFO

Altering xendomains

Finally, we will alter xendomains so that it starts last, after gfs2. It needs to be the first daemon to stop, so we will not require anything else to be stopped first. By default, xend is listed in both the start and stop lines. Thanks to our boot chain, we can again safely replace xend with gfs2 in the start line, and we'll simply remove xend from the stop line.

vim /etc/init.d/xendomains
#!/bin/bash
#
# /etc/init.d/xendomains
# Start / stop domains automatically when domain 0 boots / shuts down.
#
# chkconfig: 345 99 00
# description: Start / stop Xen domains.
#
# This script offers fairly basic functionality.  It should work on Redhat
# but also on LSB-compliant SuSE releases and on Debian with the LSB package
# installed.  (LSB is the Linux Standard Base)
#
# Based on the example in the "Designing High Quality Integrated Linux
# Applications HOWTO" by Avi Alkalay
# <http://www.tldp.org/HOWTO/HighQuality-Apps-HOWTO/>
#
### BEGIN INIT INFO
# Provides:          xendomains
# Required-Start:    $syslog $remote_fs gfs2
# Should-Start:
# Required-Stop:     $syslog $remote_fs
# Should-Stop:
# Default-Start:     3 4 5
# Default-Stop:      0 1 2 6
# Default-Enabled:   yes
# Short-Description: Start/stop secondary xen domains
# Description:       Start / stop domains automatically when domain 0 
#                    boots / shuts down.
### END INIT INFO

Applying The Changes

Change the start order by removing and re-adding all cluster-related daemons using chkconfig.

chkconfig xend off; chkconfig cman off; chkconfig drbd off; chkconfig clvmd off; chkconfig gfs2 off; chkconfig xendomains off
chkconfig xendomains on; chkconfig gfs2 on; chkconfig clvmd on; chkconfig drbd on; chkconfig cman on; chkconfig xend on
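
As we did after re-ordering xend and cman earlier, you can confirm the resulting start order by listing the run-level 3 links (assuming you are still using run-level 3):

ls /etc/rc3.d/ | grep -E 'xend|cman|drbd|clvmd|gfs2|xendomains'

The S## numbers should increase in the order xend, cman, drbd, clvmd, gfs2 and finally xendomains.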


 
