Failed Server
A server enters the "Failed" state when rgmanager receives an error from the virtual server resource agent. Normally, a recovery is attempted automatically; if recovery fails repeatedly, the server is marked 'failed' and no further action is taken.
This can also happen when the server was asked to shut down but did not power off within two minutes, either because something slowed the shutdown down or because the server ignored the request entirely.
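If you want to see why a server entered the 'failed' state, the rgmanager messages on the nodes are the first place to look. A minimal sketch, assuming the default syslog configuration where rgmanager writes to /var/log/messages, using the 'srv06-centos4' server from the examples below:

# Show the most recent rgmanager log entries mentioning the server.
grep rgmanager /var/log/messages | grep srv06-centos4 | tail -n 20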
If you have encountered this problem, please contact support.
Recovery Options
Manual Shutdown
If you manually shut down the server, ScanCore's scan-clustat scan agent will detect that the server is no longer running on either node and automatically clear the 'failed' state.
Likewise, if the server was shutting down but simply took a long time, the same mechanism will clear the 'failed' state once the shutdown finally completes.
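For a Linux guest, a manual shutdown simply means running a normal shutdown from inside the server. As a sketch, and assuming the guest honours ACPI power-button events (a server in this state may well be ignoring them, in which case log into the guest and shut it down from inside), the request can also be sent from the node hosting it:

# From inside the guest (preferred):
shutdown -h now

# Or, from the host node, send an ACPI shutdown request to the guest:
virsh shutdown srv06-centos4

# Watch for scan-clustat to clear the 'failed' flag once the guest is off:
watch -n 5 "clustat | grep srv06-centos4"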
Force Off
Warning: This is effectively the same as cutting the power to a traditional server; the operating system and software will have to recover just as they would after a power loss. Use this method only as a last resort.
If the server is not responding but still shows as running, you can click 'Force Off' in Striker. Once the server is terminated, its 'failed' state will also be cleared.
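For reference only, the comparable libvirt operation on the host node is shown below. This is a sketch, not necessarily what Striker runs under the hood, and the 'Force Off' button remains the supported path:

# Hard power-off of the guest; equivalent to pulling the power cord.
virsh destroy srv06-centos4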
Manual Recovery
Warning: This recovery method requires direct access to the Anvil! nodes and should not be used unless you are comfortable with the command line. These steps are run as the root user; mistakes could be catastrophic!
Both options below require knowing which node the server is currently running on. In the example below, the 'srv06-centos4' server is 'failed' and is still running on 'an-a01n01', so the manual recovery steps will be performed on that node.
Open a terminal to the node. This can be done from any Linux machine, including either Striker dashboard machine.
ssh root@10.20.10.1
Last login: Thu May 11 14:48:51 2017 from 111.222.33.44
[root@an-a01n01 ~]#
Restore Normal Monitoring
Warning: This recovery procedure requires extreme caution. A mistake could result in disk corruption that destroys the affected server. Manually shutting down the server is always the preferred recovery option.
Note: A service in a 'failed' state can only be disabled, so clearing the 'failed' state requires trying to disable the server a second time. This will again ask the server to shut down; if the server is now listening to ACPI power-button events, it will power off.
First, we will verify that the server is still showing as 'failed';
clustat
Cluster Status for an-anvil-01 @ Thu May 11 15:52:48 2017
Member Status: Quorate

 Member Name                          ID   Status
 ------ ----                          ---- ------
 an-a01n01.alteeve.com                   1 Online, Local, rgmanager
 an-a01n02.alteeve.com                   2 Online, rgmanager

 Service Name                 Owner (Last)                  State
 ------- ----                 ----- ------                  -----
 service:libvirtd_n01         an-a01n01.alteeve.com         started
 service:libvirtd_n02         an-a01n02.alteeve.com         started
 service:storage_n01          an-a01n01.alteeve.com         started
 service:storage_n02          an-a01n02.alteeve.com         started
 vm:srv01-rhel7               an-a01n01.alteeve.com         started
 vm:srv02-win2012             (an-a01n02.alteeve.com)       disabled
 vm:srv03-sles12              an-a01n01.alteeve.com         started
 vm:srv04-freebsd11           an-a01n01.alteeve.com         started
 vm:srv05-win2016             (an-a01n02.alteeve.com)       disabled
 vm:srv06-centos4             (an-a01n01.alteeve.com)       failed
Now we will try again to disable the srv06-centos4 server. If the server ignores the request again, the command will wait two minutes before giving up and clearing the 'failed' state.
Warning: This next step will again ask the server to shut down. If it honours the request, your server will power off!
clusvcadm -d srv06-centos4
Local machine disabling service:srv06-centos4...Warning; see system logs
Now we can verify that the server state is 'disabled'.
clustat
Cluster Status for an-anvil-01 @ Thu May 11 16:01:05 2017
Member Status: Quorate

 Member Name                          ID   Status
 ------ ----                          ---- ------
 an-a01n01.alteeve.com                   1 Online, Local, rgmanager
 an-a01n02.alteeve.com                   2 Online, rgmanager

 Service Name                 Owner (Last)                  State
 ------- ----                 ----- ------                  -----
 service:libvirtd_n01         an-a01n01.alteeve.com         started
 service:libvirtd_n02         an-a01n02.alteeve.com         started
 service:storage_n01          an-a01n01.alteeve.com         started
 service:storage_n02          an-a01n02.alteeve.com         started
 vm:srv01-rhel7               an-a01n01.alteeve.com         started
 vm:srv02-win2012             (an-a01n02.alteeve.com)       disabled
 vm:srv03-sles12              an-a01n01.alteeve.com         started
 vm:srv04-freebsd11           an-a01n01.alteeve.com         started
 vm:srv05-win2016             (an-a01n02.alteeve.com)       disabled
 vm:srv06-centos4             (an-a01n01.alteeve.com)       disabled
Verify that the server is still running;
virsh list --all
 Id    Name                           State
----------------------------------------------------
 5     srv06-centos4                  running
 6     srv04-freebsd11                running
 7     srv01-rhel7                    running
 8     srv03-sles12                   running
There it is, so we will re-enable it.
Warning: This command MUST specify the name of the host currently running the server. Omitting it might cause the server to boot on the wrong node, which could cause serious disk corruption!
clusvcadm -e srv06-centos4 -m an-a01n01.alteeve.com
Member an-a01n01.alteeve.com trying to enable service:srv06-centos4...Success
vm:srv06-centos4 is now running on an-a01n01.alteeve.com
Verify;
clustat
Cluster Status for an-anvil-01 @ Thu May 11 16:04:30 2017
Member Status: Quorate

 Member Name                          ID   Status
 ------ ----                          ---- ------
 an-a01n01.alteeve.com                   1 Online, Local, rgmanager
 an-a01n02.alteeve.com                   2 Online, rgmanager

 Service Name                 Owner (Last)                  State
 ------- ----                 ----- ------                  -----
 service:libvirtd_n01         an-a01n01.alteeve.com         started
 service:libvirtd_n02         an-a01n02.alteeve.com         started
 service:storage_n01          an-a01n01.alteeve.com         started
 service:storage_n02          an-a01n02.alteeve.com         started
 vm:srv01-rhel7               an-a01n01.alteeve.com         started
 vm:srv02-win2012             (an-a01n02.alteeve.com)       disabled
 vm:srv03-sles12              an-a01n01.alteeve.com         started
 vm:srv04-freebsd11           an-a01n01.alteeve.com         started
 vm:srv05-win2016             (an-a01n02.alteeve.com)       disabled
 vm:srv06-centos4             an-a01n01.alteeve.com         started
Done.
Manual Cold Migration Recovery
If the server failed to stop during a cold migration, recovering it through Striker will cause the server to restart on the original host.
To complete the migration manually, shut the server down and wait for ScanCore to clear the 'failed' state. Once the server is off and the state has cleared, you can boot it on the peer node using;
clusvcadm -e srv06-centos4 -m an-a01n02.alteeve.com
Member an-a01n02.alteeve.com trying to enable service:srv06-centos4...Success
vm:srv06-centos4 is now running on an-a01n02.alteeve.com
In this case, the server was told to start on 'an-a01n02'.
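As before, you can confirm the result from either node. A quick check, reusing the clustat pattern shown above; the expected line is an assumption based on the earlier output:

# Confirm the server is now running on the peer node.
clustat | grep srv06-centos4
# Expect something like: vm:srv06-centos4   an-a01n02.alteeve.com   started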