2-Node Red Hat KVM Cluster Tutorial - Troubleshooting

From Alteeve Wiki

Warning: This document is old, abandoned and very out of date. DON'T USE ANYTHING HERE! Consider it only as historical note taking.

This is the troubleshooting section of the 2-Node Red Hat KVM Cluster Tutorial.

Troubleshooting

Here we will cover, in no particular order, some common clustering problems and their fixes.

[vm] error: internal error Attempt to migrate guest to the same host {uuid}

Note: See rhbz#770626. This now appears to be resolved; setting host_uuid is enough to work around the issue. The older wrapper-script work-around is kept further below in case setting host_uuid alone does not work for you.

This message will appear in the source node's syslog when trying to migrate a VM. Here is an example set of error messages.

Dec 27 22:00:46 an-node01 rgmanager[2492]: Migrating vm:vm0001-dev to an-node02.alteeve.ca
Dec 27 22:00:46 an-node01 rgmanager[22331]: [vm] Migrate vm0001-dev to an-node02.alteeve.ca failed:
Dec 27 22:00:46 an-node01 rgmanager[22353]: [vm] error: internal error Attempt to migrate guest to the same host 00020003-0004-0005-0006-000700080009
Dec 27 22:00:46 an-node01 rgmanager[2492]: migrate on vm "vm0001-dev" returned 150 (unspecified)
Dec 27 22:00:46 an-node01 rgmanager[2492]: Migration of vm:vm0001-dev to an-node02.alteeve.ca failed; return code 150

The migration fails because both nodes report the same host UUID. You can verify this by running virsh sysinfo | grep uuid on both nodes.

First node;

virsh sysinfo | grep uuid
    <entry name='uuid'>03000200-0400-0500-0006-000700080009</entry>

Second node;

virsh sysinfo | grep uuid
    <entry name='uuid'>03000200-0400-0500-0006-000700080009</entry>

This UUID comes from the mainboard, and you can confirm this with the following command;

dmidecode -s system-uuid
03000200-0400-0500-0006-000700080009

Alternatively, search the full dmidecode output (change the string passed to grep to a portion of your UUID);

dmidecode |grep 000700080009 -B 7 -A 4
Handle 0x0001, DMI type 1, 27 bytes
System Information
	Manufacturer: empty
	Product Name: empty
	Version: empty
	Serial Number: empty
	UUID: 03000200-0400-0500-0006-000700080009
	Wake-up Type: Power Switch
	SKU Number: To be filled by O.E.M.
	Family: To be filled by O.E.M.

This is the result of a lazy vendor re-using UUIDs across mainboards.
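
The duplicate-UUID check can be scripted so that it is easy to repeat after a change. The following is only a sketch; the function names extract_uuid and uuids_differ are made up for this example, and the ssh call in the comments assumes password-less root ssh between the nodes.

```shell
#!/bin/sh
# Print the host UUID found in `virsh sysinfo` XML read from stdin.
extract_uuid() {
    sed -n "s/.*<entry name='uuid'>\([^<]*\)<\/entry>.*/\1/p" | head -n 1
}

# Succeed only if both UUIDs are non-empty and differ (case-insensitive).
uuids_differ() {
    a=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    b=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ -n "$a" ] && [ -n "$b" ] && [ "$a" != "$b" ]
}

# Example use from the first node (host name taken from this tutorial):
# local_uuid=$(virsh sysinfo | extract_uuid)
# peer_uuid=$(ssh root@an-node02.alteeve.ca virsh sysinfo | extract_uuid)
# uuids_differ "$local_uuid" "$peer_uuid" || echo "Duplicate or missing host UUID"
```

If the last line prints its message, the nodes share a UUID (or one of them returned nothing) and the fix below applies.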

The fix is to specify a unique UUID in /etc/libvirt/libvirtd.conf using its host_uuid variable. We'll generate a new UUID for each node using the uuidgen command. Be sure to use a different UUID on each node!

On the first node;

cp /etc/libvirt/libvirtd.conf /etc/libvirt/libvirtd.conf.orig
uuidgen
31873b9e-1069-42ce-b950-137ae5eaa3d1

Change the UUID;

vim /etc/libvirt/libvirtd.conf
host_uuid = "31873b9e-1069-42ce-b950-137ae5eaa3d1"

Here's the diff;

diff -u /etc/libvirt/libvirtd.conf.orig /etc/libvirt/libvirtd.conf
--- /etc/libvirt/libvirtd.conf.orig	2011-12-27 22:29:01.243394880 -0500
+++ /etc/libvirt/libvirtd.conf	2011-12-27 22:33:44.309799253 -0500
@@ -365,4 +365,4 @@
 # NB This default all-zeros UUID will not work. Replace
 # it with the output of the 'uuidgen' command and then
 # uncomment this entry
-#host_uuid = "00000000-0000-0000-0000-000000000000"
+host_uuid = "31873b9e-1069-42ce-b950-137ae5eaa3d1"

Make the same change, with a new and unique UUID, on the second node.

cp /etc/libvirt/libvirtd.conf /etc/libvirt/libvirtd.conf.orig
uuidgen
90b8d280-c9ff-4e0e-867e-6d4f7d915995

Change the UUID;

vim /etc/libvirt/libvirtd.conf
host_uuid = "90b8d280-c9ff-4e0e-867e-6d4f7d915995"

Here's the diff;

diff -u /etc/libvirt/libvirtd.conf.orig /etc/libvirt/libvirtd.conf
--- /etc/libvirt/libvirtd.conf.orig	2011-12-27 22:35:45.975389858 -0500
+++ /etc/libvirt/libvirtd.conf	2011-12-27 22:36:28.325518880 -0500
@@ -365,4 +365,4 @@
 # NB This default all-zeros UUID will not work. Replace
 # it with the output of the 'uuidgen' command and then
 # uncomment this entry
-#host_uuid = "00000000-0000-0000-0000-000000000000"
+host_uuid = "90b8d280-c9ff-4e0e-867e-6d4f7d915995"
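
If you prefer not to edit the file by hand, the same change can be made with sed. This is a minimal sketch, not part of the original procedure; set_host_uuid is a hypothetical helper name, and it assumes the host_uuid line looks like the one shown in the diffs above.

```shell
#!/bin/sh
# Set host_uuid in a libvirtd.conf-style file, keeping a .orig backup.
# Usage: set_host_uuid /etc/libvirt/libvirtd.conf "$(uuidgen)"
set_host_uuid() {
    conf=$1
    uuid=$2
    # Refuse anything that is not a canonical 8-4-4-4-12 hex UUID.
    printf '%s\n' "$uuid" |
        grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$' || return 1
    # Keep only the first backup so re-runs do not overwrite the pristine copy.
    [ -e "$conf.orig" ] || cp "$conf" "$conf.orig" || return 1
    if grep -Eq '^#?host_uuid[[:space:]]*=' "$conf"; then
        # Replace the commented-out default or an earlier setting.
        sed "s/^#\{0,1\}host_uuid[[:space:]]*=.*/host_uuid = \"$uuid\"/" \
            "$conf" > "$conf.new" || return 1
    else
        # No existing entry, so append one.
        { cat "$conf"; printf 'host_uuid = "%s"\n' "$uuid"; } > "$conf.new" || return 1
    fi
    # Write back through the original file to keep its owner and mode.
    cat "$conf.new" > "$conf" && rm -f "$conf.new"
}
```

Run it once per node with a freshly generated UUID. The libvirtd restart described next is still needed afterwards.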

Now to reload the configuration, we need to restart libvirtd (a reload is not enough).

Warning: Be sure to stop all VMs on the node before proceeding!
/etc/init.d/libvirtd restart
Stopping libvirtd daemon:                                  [  OK  ]
Starting libvirtd daemon:                                  [  OK  ]
virsh sysinfo | grep uuid

This should show the new UUID. If it doesn't though, please apply the work-around below.

Setting host_uuid Didn't Work, What Now?

Warning: This work-around is not supported in any way by Red Hat or any other vendor. It is provided as-is until libvirt is fixed. - Dec. 28, 2011

The problem is that libvirt doesn't use libvirtd.conf's host_uuid if it sees the system UUID as being valid (not all 0 or all f).

The work-around is to create a wrapper script for dmidecode that intercepts dmidecode -q -t 0,1,4,17, reads libvirtd.conf and, if host_uuid is set, substitutes the UUID returned by dmidecode with the one set by host_uuid.

Note: You can look at the source of the wrapper script on pastebin.org.
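
To illustrate the idea, here is a rough sketch of how such a wrapper could work. It is not the script hosted at alteeve.ca; the function names read_host_uuid and substitute_uuid are invented for this example. Unlike the real wrapper, it filters every dmidecode call rather than only -q -t 0,1,4,17, and it does not pass through dmidecode's exit status.

```shell
#!/bin/sh
# Read host_uuid from a libvirtd.conf-style file; prints nothing if it is unset
# or commented out.
read_host_uuid() {
    sed -n 's/^host_uuid[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Filter dmidecode output on stdin, replacing the value on the "UUID:" line.
# With an empty argument the input is passed through untouched.
# Assumes the UUID holds only hex digits and dashes (safe inside the sed expression).
substitute_uuid() {
    if [ -n "$1" ]; then
        sed "s/^\([[:space:]]*UUID:\).*/\1 $1/"
    else
        cat
    fi
}

# Body of a hypothetical /usr/sbin/dmidecode wrapper, using the paths from this section:
# uuid=$(read_host_uuid /etc/libvirt/libvirtd.conf)
# /usr/sbin/dmidecode.orig "$@" | substitute_uuid "$uuid"
```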

To apply the work-around;

Check that the current dmidecode returns the bad UUID;

dmidecode -q -t 0,1,4,17 | grep UUID
	UUID: 03000200-0400-0500-0006-000700080009

Now we're going to rename dmidecode as dmidecode.orig, then download the wrapper script.

mv /usr/sbin/dmidecode /usr/sbin/dmidecode.orig
wget -c https://alteeve.ca/files/dmidecode -O /usr/sbin/dmidecode
--2011-12-28 13:44:27--  https://alteeve.ca/files/dmidecode
Resolving alteeve.ca... 192.139.81.121
Connecting to alteeve.ca|192.139.81.121|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1159 (1.1K) [text/plain]
Saving to: “/usr/sbin/dmidecode”

100%[======================================>] 1,159       --.-K/s   in 0s      

2011-12-28 13:44:28 (15.3 MB/s) - “/usr/sbin/dmidecode” saved [1159/1159]
chmod 755 /usr/sbin/dmidecode
ls -lah /usr/sbin/dmidecode
-rwxr-xr-x 1 root root 1.2K Dec 28 13:26 /usr/sbin/dmidecode

Now re-run the dmidecode call and see that the new UUID is used.

dmidecode -q -t 0,1,4,17 | grep UUID
	UUID: 31873b9e-1069-42ce-b950-137ae5eaa3d1

This matches what was set in /etc/libvirt/libvirtd.conf;

grep host_uuid /etc/libvirt/libvirtd.conf
host_uuid = "31873b9e-1069-42ce-b950-137ae5eaa3d1"

Now restart libvirtd and check virsh sysinfo to confirm that libvirtd now returns the proper UUID.

/etc/init.d/libvirtd restart
Stopping libvirtd daemon:                                  [  OK  ]
Starting libvirtd daemon:                                  [  OK  ]
virsh sysinfo | grep uuid
    <entry name='uuid'>31873b9e-1069-42ce-b950-137ae5eaa3d1</entry>

Done!

As soon as libvirtd is fixed, this section will be re-written.

[vm] error: Cannot recv data: Host key verification failed.#015: Connection reset by peer

This can show up when you try to live migrate a VM but your /root/.ssh/known_hosts file has not been populated. Effectively, the cluster was prompted to accept the fingerprint of the target node, was unable to answer, and so closed the connection.

The syslog entry will look something like this;

Dec 27 21:58:00 an-node02 rgmanager[2439]: Migrating vm:vm0003-db to an-node01.alteeve.ca
Dec 27 21:58:01 an-node02 rgmanager[18951]: [vm] Migrate vm0003-db to an-node01.alteeve.ca failed:
Dec 27 21:58:01 an-node02 rgmanager[18973]: [vm] error: Cannot recv data: Host key verification failed.#015: Connection reset by peer
Dec 27 21:58:01 an-node02 rgmanager[2439]: migrate on vm "vm0003-db" returned 150 (unspecified)
Dec 27 21:58:01 an-node02 rgmanager[2439]: Migration of vm:vm0003-db to an-node01.alteeve.ca failed; return code 150

To fix the problem, please return to Populating And Pushing ~/.ssh/known_hosts.

error: unknown OS type hvm

This can be caused by hardware virtualization support being disabled in your BIOS.

To check whether you have hardware virtualization support enabled, run;

egrep '(vmx|svm)' --color=always /proc/cpuinfo

On Intel machines, you should see this;

flags		: ... vmx ...

On AMD machines, you should see this;

flags		: ... svm ...

The above will have the vmx or svm highlighted and the flags line will be quite long. You will also see an entry for every CPU core (or hyperthreaded pseudo-core).

If you don't see a match to either vmx or svm, please consult your motherboard's manual for information on enabling hardware virtualization.
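
For use in a script, the same check can be reduced to a single word of output. This is a small sketch; virt_flag is a hypothetical helper name and it only looks at the first flags line, assuming all cores report the same flags.

```shell
#!/bin/sh
# Print "vmx" (Intel), "svm" (AMD) or "none" for a cpuinfo-style file.
# Defaults to /proc/cpuinfo when no file is given.
virt_flag() {
    # Take the flag list from the first "flags" line only.
    flags=$(sed -n '/^flags[[:space:]]*:/{s/^flags[[:space:]]*:[[:space:]]*//p;q;}' \
        "${1:-/proc/cpuinfo}")
    # Pad with spaces so only whole flag names match.
    case " $flags " in
        *" vmx "*) echo vmx ;;
        *" svm "*) echo svm ;;
        *)         echo none ;;
    esac
}

# Example:
# [ "$(virt_flag)" = "none" ] && echo "No hardware virtualization flag found"
```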

My VM Just Vanished!

Warning: If virsh tries to start a virtual machine but a referenced device or media is missing, it will react by completely undefining the virtual machine!

If you ever suddenly find that a virtual machine has vanished, it is probably because something the VM wanted to use couldn't be found. This can be as trivial as deleting an ISO that a VM had been defined to mount on boot.

Let's look at the example where an ISO was deleted, as this is a common issue.

Copy your last backup of the XML definition file for the affected VM and then edit it to remove the <source file='...'/> lines for the removed media. For example, change:

    <disk type='file' device='floppy'>
      <driver name='qemu' type='raw' cache='none' io='threads'/>
      <source file='/shared/files/virtio-win-1.1.16.vfd'/>
      <target dev='fda' bus='fdc'/>
      <alias name='fdc0-0-0'/>
      <address type='drive' controller='0' bus='0' unit='0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw' io='threads'/>
      <source file='/shared/files/Windows_Server_2008_R2_64Bit_SP1.iso'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
      <alias name='ide0-1-0'/>
      <address type='drive' controller='0' bus='1' unit='0'/>
    </disk>

To:

    <disk type='file' device='floppy'>
      <driver name='qemu' type='raw' cache='none' io='threads'/>
      <target dev='fda' bus='fdc'/>
      <alias name='fdc0-0-0'/>
      <address type='drive' controller='0' bus='0' unit='0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw' io='threads'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
      <alias name='ide0-1-0'/>
      <address type='drive' controller='0' bus='1' unit='0'/>
    </disk>

Then redefine the VM and you can safely restart it again.

virsh define /shared/definitions/vm0002-ms.xml

You should be back in business at this point.
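
To catch this before a VM disappears, you could check a definition file for referenced files that no longer exist. The following is a sketch only; missing_sources is a made-up name, and it assumes the single-quoted <source file='...'/> form shown in the examples above.

```shell
#!/bin/sh
# List every <source file='...'/> path in a libvirt XML definition
# that does not exist on disk. Prints nothing when all files are present.
missing_sources() {
    sed -n "s/.*<source file='\([^']*\)'.*/\1/p" "$1" |
    while IFS= read -r path; do
        [ -e "$path" ] || printf '%s\n' "$path"
    done
}

# Example, using the definition file from this section:
# missing_sources /shared/definitions/vm0002-ms.xml
```

Any path it prints is media that should be restored, or removed from the definition, before the VM is next started.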

Disabling rsyslog Rate Limiting

If you are getting messages like rsyslogd-2177: imuxsock lost 575 messages from pid 29288 due to rate-limiting in EL6.3+, it is because of the tighter message flood restrictions. You can disable the rate limiting by following the steps below.

Make a backup of the original rsyslog.conf, then edit /etc/rsyslog.conf and locate the line with $ModLoad imuxsock. Directly below it, add the following two entries, one per line; $SystemLogRateLimitInterval 0 and $SystemLogRateLimitBurst 0.

cp /etc/rsyslog.conf /etc/rsyslog.conf.orig
vim /etc/rsyslog.conf
$ModLoad imuxsock # provides support for local system logging (e.g. via logger command)
$SystemLogRateLimitInterval 0
$SystemLogRateLimitBurst 0

Save it and verify that the changes look sane by comparing against the original file:

diff -u /etc/rsyslog.conf.orig /etc/rsyslog.conf
--- /etc/rsyslog.conf.orig	2012-08-05 18:42:31.016783419 -0400
+++ /etc/rsyslog.conf	2012-08-05 18:42:17.609783118 -0400
@@ -6,6 +6,9 @@
 #### MODULES ####
 
 $ModLoad imuxsock # provides support for local system logging (e.g. via logger command)
+$SystemLogRateLimitInterval 0
+$SystemLogRateLimitBurst 0
+
 $ModLoad imklog   # provides kernel logging support (previously done by rklogd)
 #$ModLoad immark  # provides --MARK-- message capability

Restart the rsyslog daemon to make the changes take effect.

/etc/init.d/rsyslog restart
Shutting down system logger:                               [  OK  ]
Starting system logger:                                    [  OK  ]

Done!

You should no longer see rate limit messages.
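
The same edit can be applied non-interactively, which is handy when both nodes need it. This is a sketch under the assumption that the $ModLoad imuxsock line looks like the one shown above; disable_rate_limit is a hypothetical helper name.

```shell
#!/bin/sh
# Add the two rate-limit directives directly below "$ModLoad imuxsock".
# Usage: disable_rate_limit /etc/rsyslog.conf
disable_rate_limit() {
    conf=$1
    # Already applied, so leave the file alone.
    grep -q '^\$SystemLogRateLimitInterval' "$conf" && return 0
    # Keep only the first backup.
    [ -e "$conf.orig" ] || cp "$conf" "$conf.orig" || return 1
    awk '
        { print }
        /^\$ModLoad[ \t]+imuxsock/ {
            print "$SystemLogRateLimitInterval 0"
            print "$SystemLogRateLimitBurst 0"
        }
    ' "$conf" > "$conf.new" || return 1
    cat "$conf.new" > "$conf" && rm -f "$conf.new" || return 1
    # Fail if the imuxsock line was not found and nothing was added.
    grep -q '^\$SystemLogRateLimitInterval' "$conf"
}
```

As with the manual edit, rsyslog must be restarted afterwards for the change to take effect.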

FATAL: Module drbd not found

If you update your operating system's kernel, but it was added to /boot/grub/grub.conf in the wrong order, DRBD's kernel module will not load. In this case, you will see an error like:

modprobe drbd
FATAL: Module drbd not found.

Alternatively, if you are trying to install DRBD from source, you might see an error like this:

make
make -C drbd drbd_buildtag.c
make[1]: Entering directory `/root/drbd-8.3.15/drbd'
make[1]: Leaving directory `/root/drbd-8.3.15/drbd'
make[1]: Entering directory `/root/drbd-8.3.15/user'
flex -s -odrbdadm_scanner.c drbdadm_scanner.fl
gcc -g -O2 -Wall -I../drbd -I../drbd/compat   -c -o drbdadm_scanner.o drbdadm_scanner.c
gcc -g -O2 -Wall -I../drbd -I../drbd/compat   -c -o drbdadm_parser.o drbdadm_parser.c
gcc -g -O2 -Wall -I../drbd -I../drbd/compat   -c -o drbdadm_main.o drbdadm_main.c
gcc -g -O2 -Wall -I../drbd -I../drbd/compat   -c -o drbdadm_adjust.o drbdadm_adjust.c
gcc -g -O2 -Wall -I../drbd -I../drbd/compat   -c -o drbdtool_common.o drbdtool_common.c
gcc -g -O2 -Wall -I../drbd -I../drbd/compat   -c -o drbdadm_usage_cnt.o drbdadm_usage_cnt.c
cp ../drbd/drbd_buildtag.c drbd_buildtag.c
gcc -g -O2 -Wall -I../drbd -I../drbd/compat   -c -o drbd_buildtag.o drbd_buildtag.c
gcc -g -O2 -Wall -I../drbd -I../drbd/compat   -c -o drbdadm_minor_table.o drbdadm_minor_table.c
gcc  -o drbdadm drbdadm_scanner.o drbdadm_parser.o drbdadm_main.o drbdadm_adjust.o drbdtool_common.o drbdadm_usage_cnt.o drbd_buildtag.o drbdadm_minor_table.o
gcc -g -O2 -Wall -I../drbd -I../drbd/compat   -c -o drbdmeta.o drbdmeta.c
flex -s -odrbdmeta_scanner.c drbdmeta_scanner.fl
gcc -g -O2 -Wall -I../drbd -I../drbd/compat   -c -o drbdmeta_scanner.o drbdmeta_scanner.c
gcc  -o drbdmeta drbdmeta.o drbdmeta_scanner.o drbdtool_common.o drbd_buildtag.o
gcc -g -O2 -Wall -I../drbd -I../drbd/compat   -c -o drbdsetup.o drbdsetup.c
cp ../drbd/drbd_strings.c drbd_strings.c
gcc -g -O2 -Wall -I../drbd -I../drbd/compat   -c -o drbd_strings.o drbd_strings.c
gcc  -o drbdsetup drbdsetup.o drbdtool_common.o drbd_buildtag.o drbd_strings.o
make[1]: Leaving directory `/root/drbd-8.3.15/user'
make[1]: Entering directory `/root/drbd-8.3.15/scripts'
make[1]: Nothing to be done for `all'.
make[1]: Leaving directory `/root/drbd-8.3.15/scripts'
make[1]: Entering directory `/root/drbd-8.3.15/documentation'
To (re)make the documentation: make doc
make[1]: Leaving directory `/root/drbd-8.3.15/documentation'

	Userland tools build was successful.
    SORRY, kernel makefile not found.
    You need to tell me a correct KDIR,
    Or install the neccessary kernel source packages.

make: *** [check-kdir] Error 1

In either case, check your currently running kernel:

uname -a
Linux an-node01.alteeve.ca 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Note that there is no suffix after 2.6.32-279. This is the original kernel, not the updated one. You can confirm the mismatch by checking the version of the kernel-headers package:

yum list | grep kernel-headers
kernel-headers.x86_64                  2.6.32-279.19.1.el6         @updates

Note the version number of the header RPM is 2.6.32-279.19.1. The .19.1 suffix shows that it is a newer kernel than the one that is running. This is our problem.

This generally happens because of a bad order of kernels in /boot/grub/grub.conf. We can check this by looking at the file:

cat /boot/grub/grub.conf
# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE:  You have a /boot partition.  This means that
#          all kernel and initrd paths are relative to /boot/, eg.
#          root (hd0,0)
#          kernel /vmlinuz-version ro root=/dev/sda2
#          initrd /initrd-[generic-]version.img
#boot=/dev/sda
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title CentOS (2.6.32-279.el6.x86_64)
	root (hd0,0)
	kernel /vmlinuz-2.6.32-279.el6.x86_64 ro root=UUID=861b2d43-6c16-4bfa-ae59-a0a95eebf607 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet
	initrd /initramfs-2.6.32-279.el6.x86_64.img
title CentOS (2.6.32-279.19.1.el6.x86_64)
	root (hd0,0)
	kernel /vmlinuz-2.6.32-279.19.1.el6.x86_64 ro root=UUID=861b2d43-6c16-4bfa-ae59-a0a95eebf607 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet
	initrd /initramfs-2.6.32-279.19.1.el6.x86_64.img

Note that default=0, which tells us that the default kernel is the first one in the list. If you look at the kernel version described by the first title entry, it is the 2.6.32-279.el6.x86_64 version.
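
Rather than reading the file by eye, you can print the entry that GRUB will boot and compare it with uname -r. This is a sketch for the legacy grub.conf format shown above; default_grub_title is a made-up name, and it treats a missing or non-numeric default as 0.

```shell
#!/bin/sh
# Print the title of the entry that "default=N" selects in a legacy grub.conf.
# Entries are counted from 0, in the order their "title" lines appear.
default_grub_title() {
    awk '
        /^default=/ { split($0, kv, "="); want = kv[2] + 0 }
        /^title / {
            if (n == want) { sub(/^title /, ""); print; exit }
            n++
        }
    ' "$1"
}

# Example:
# default_grub_title /boot/grub/grub.conf
# uname -r
```

If the version in the printed title is older than the newest installed kernel, the entries are in the wrong order.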

You can fix this by changing the default value to 1, but you will likely miss future kernel updates. It is better instead to put the newer kernel version at the top of the list.

vim /boot/grub/grub.conf
# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE:  You have a /boot partition.  This means that
#          all kernel and initrd paths are relative to /boot/, eg.
#          root (hd0,0)
#          kernel /vmlinuz-version ro root=/dev/sda2
#          initrd /initrd-[generic-]version.img
#boot=/dev/sda
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title CentOS (2.6.32-279.19.1.el6.x86_64)
	root (hd0,0)
	kernel /vmlinuz-2.6.32-279.19.1.el6.x86_64 ro root=UUID=861b2d43-6c16-4bfa-ae59-a0a95eebf607 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet
	initrd /initramfs-2.6.32-279.19.1.el6.x86_64.img
title CentOS (2.6.32-279.el6.x86_64)
	root (hd0,0)
	kernel /vmlinuz-2.6.32-279.el6.x86_64 ro root=UUID=861b2d43-6c16-4bfa-ae59-a0a95eebf607 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet
	initrd /initramfs-2.6.32-279.el6.x86_64.img

Note now that the first entry is the 2.6.32-279.19.1.el6.x86_64 version. Once you reboot, you should see that you are now running the latest kernel;

uname -a
Linux an-node01.alteeve.ca 2.6.32-279.19.1.el6.x86_64 #1 SMP Wed Dec 19 07:05:20 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

You should now be able to load the drbd module. If you can't, please reinstall DRBD to ensure that its kernel module is built against the latest kernel.

Starting Cluster; Mounting configfs... mount: none already mounted or /sys/kernel/config busy

This error occurs when the / (root) file system is full. When there is no space on disk, cman fails to start with the following error;

/etc/init.d/cman start
Starting cluster: 
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs... mount: none already mounted or /sys/kernel/config busy
                                                           [FAILED]
Stopping cluster: 
   Leaving fence domain...                                 [  OK  ]
   Stopping gfs_controld...                                [  OK  ]
   Stopping dlm_controld...                                [  OK  ]
   Stopping fenced...                                      [  OK  ]
   Stopping cman...                                        [  OK  ]
   Unloading kernel modules...                             [  OK  ]
   Unmounting configfs...                                  [  OK  ]

Sure enough, if you run df, you will see no space left.

df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              40G   40G     0 100% /

Note that in this case, even the temp file systems failed to mount and /boot isn't visible. In this example, a bad RAM module caused a flood of errors in /var/log/, so deleting those log files freed up space. The system was so messed up that the node had to be fenced to reboot it as the reboot command failed.

Once free space was recovered and the node was rebooted, the cluster was able to start.
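
A simple check run from cron can warn you before the root file system fills up. The following is only a sketch; the function names and the default limit of 95 are assumptions for this example, and it relies on the POSIX output format of df -P.

```shell
#!/bin/sh
# Read `df -P` output on stdin and print the use percentage (digits only)
# from the first data line.
use_percent() {
    awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# Succeed if the file system holding PATH is at or above LIMIT percent used.
# Usage: nearly_full PATH [LIMIT]
nearly_full() {
    used=$(df -P "$1" | use_percent)
    [ -n "$used" ] && [ "$used" -ge "${2:-95}" ]
}

# Example:
# nearly_full / 90 && echo "Root file system is nearly full" | logger -t diskcheck
```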

 

© Alteeve's Niche! Inc. 1997-2024   Anvil! "Intelligent Availability®" Platform
legal stuff: All info is provided "As-Is". Do not use anything here unless you are willing and able to take responsibility for your own actions.