Difference between revisions of "Failed Server"
(Created page with "{{howto_header}} A server that has entered a "Failed" state is one where <span class="code">rgmanager</span> received an error state from the virtual server [[resource ag...") |
A server that has entered a "Failed" state is one where <span class="code">[[rgmanager]]</span> received an error state from the virtual server [[resource agent]]. Normally, when this occurs, a recovery will be attempted. However, if recovery fails repeatedly, the server is placed in a failed state and no further actions are taken.
− | + | [[Image:Failed_server_01.png|thumb|center|1098px|The <span class="code">srv06-centos4</span> server in the 'Failed' state.]] | |
+ | |||
+ | This can also happen when a request to shut down the server was made, but the server did not shut down within two minutes. This can occur because something is slowing down the shutdown, or because the server ignored the shutdown request. | ||
If you have encountered this problem, please contact [[support]].
+ | |||
+ | = Recovery Options = | ||
+ | |||
+ | == Manual Shutdown == | ||
+ | |||
+ | If you manually shut down the server, [[ScanCore]]'s [[scan-clustat]] scan agent will determine when the server is no longer running on either node and automatically clear the 'failed' state. | ||
+ | |||
+ | If the server was shutting down but simply took a long time, the same mechanism will clear the failed state once it finally shuts down. | ||
+ | |||
+ | == Force Off == | ||
+ | |||
+ | {{warning|1=This solution is effectively the same as cutting the power to a traditional server. The operating system and software will recover in the same way. This method should be the last option.}} | ||
+ | |||
+ | If the server is not responding but still shows as running, you can click on 'Force Off' in Striker. When the server is terminated, its 'failed' state will also be cleared. | ||
+ | |||
+ | == Manual Recover == | ||
+ | |||
+ | {{warning|1=This recovery method requires directly accessing the ''Anvil!'' nodes. This option should not be used unless you are comfortable with the command line. These steps require running commands as the <span class="code">root</span> user. Mistakes could be catastrophic!}} | ||
+ | |||
+ | Both options below require knowing which node the server is currently running on. In the example image above, we see that the '<span class="code">srv06-centos4</span>' server is 'failed' and it is still running on '<span class="code">an-a01n01</span>'. Given this, the manual recovery options will be performed on that node. | ||
+ | |||
+ | Open a terminal to the node. This can be done from any Linux machine, including by opening a terminal on either Striker dashboard machine. | ||
+ | |||
+ | <syntaxhighlight lang="bash"> | ||
+ | ssh root@10.20.10.1 | ||
+ | </syntaxhighlight> | ||
+ | <syntaxhighlight lang="text"> | ||
+ | Last login: Thu May 11 14:48:51 2017 from 111.222.33.44 | ||
+ | [root@an-a01n01 ~]# | ||
+ | </syntaxhighlight> | ||
+ | |||
+ | === Restore Normal Monitoring === | ||
+ | |||
+ | {{warning|1=This recovery procedure requires extreme caution. A mistake could result in disk corruption that destroys the affected server. Manually shutting down the server is always the preferred recovery option.}} | ||
+ | |||
+ | {{note|1=A service in a 'failed' state can only be [https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/s1-admin-manage-ha-services-cli-CA.html#s2-admin-manage-ha-services-clustat-cli-CA disabled]. Thus, clearing the 'failed' state requires trying to disable the server a second time. This will again request the server to shut down. If it is now listening to ACPI power button events, this method will cause the server to power off.}} | ||
+ | |||
+ | First, we will verify that the server is still showing as 'failed'; | ||
+ | |||
+ | <syntaxhighlight lang="bash"> | ||
+ | clustat | ||
+ | </syntaxhighlight> | ||
+ | <syntaxhighlight lang="text"> | ||
+ | Cluster Status for an-anvil-01 @ Thu May 11 15:52:48 2017 | ||
+ | Member Status: Quorate | ||
+ | |||
+ | Member Name ID Status | ||
+ | ------ ---- ---- ------ | ||
+ | an-a01n01.alteeve.com 1 Online, Local, rgmanager | ||
+ | an-a01n02.alteeve.com 2 Online, rgmanager | ||
+ | |||
+ | Service Name Owner (Last) State | ||
+ | ------- ---- ----- ------ ----- | ||
+ | service:libvirtd_n01 an-a01n01.alteeve.com started | ||
+ | service:libvirtd_n02 an-a01n02.alteeve.com started | ||
+ | service:storage_n01 an-a01n01.alteeve.com started | ||
+ | service:storage_n02 an-a01n02.alteeve.com started | ||
+ | vm:srv01-rhel7 an-a01n01.alteeve.com started | ||
+ | vm:srv02-win2012 (an-a01n02.alteeve.com) disabled | ||
+ | vm:srv03-sles12 an-a01n01.alteeve.com started | ||
+ | vm:srv04-freebsd11 an-a01n01.alteeve.com started | ||
+ | vm:srv05-win2016 (an-a01n02.alteeve.com) disabled | ||
+ | vm:srv06-centos4 (an-a01n01.alteeve.com) failed | ||
+ | </syntaxhighlight> | ||
+ | |||
+ | Now we will try again to disable the <span class="code">srv06-centos4</span> server. If the server ignores the request again, the command will wait for two minutes before giving up and clearing the failed state. | ||
+ | |||
+ | {{warning|1=Again, this next step will ask the server to shut down. If it listens to the request, your server will shut down!}} | ||
+ | |||
+ | <syntaxhighlight lang="bash"> | ||
+ | clusvcadm -d srv06-centos4 | ||
+ | </syntaxhighlight> | ||
+ | <syntaxhighlight lang="text"> | ||
+ | Local machine disabling service:srv06-centos4...Warning; see system logs | ||
+ | </syntaxhighlight> | ||
+ | |||
+ | Now we can verify that the server state is 'disabled'. | ||
+ | |||
+ | <syntaxhighlight lang="bash"> | ||
+ | clustat | ||
+ | </syntaxhighlight> | ||
+ | <syntaxhighlight lang="text"> | ||
+ | Cluster Status for an-anvil-01 @ Thu May 11 16:01:05 2017 | ||
+ | Member Status: Quorate | ||
+ | |||
+ | Member Name ID Status | ||
+ | ------ ---- ---- ------ | ||
+ | an-a01n01.alteeve.com 1 Online, Local, rgmanager | ||
+ | an-a01n02.alteeve.com 2 Online, rgmanager | ||
+ | |||
+ | Service Name Owner (Last) State | ||
+ | ------- ---- ----- ------ ----- | ||
+ | service:libvirtd_n01 an-a01n01.alteeve.com started | ||
+ | service:libvirtd_n02 an-a01n02.alteeve.com started | ||
+ | service:storage_n01 an-a01n01.alteeve.com started | ||
+ | service:storage_n02 an-a01n02.alteeve.com started | ||
+ | vm:srv01-rhel7 an-a01n01.alteeve.com started | ||
+ | vm:srv02-win2012 (an-a01n02.alteeve.com) disabled | ||
+ | vm:srv03-sles12 an-a01n01.alteeve.com started | ||
+ | vm:srv04-freebsd11 an-a01n01.alteeve.com started | ||
+ | vm:srv05-win2016 (an-a01n02.alteeve.com) disabled | ||
+ | vm:srv06-centos4 (an-a01n01.alteeve.com) disabled | ||
+ | </syntaxhighlight> | ||
+ | |||
+ | Verify that the server is still running; | ||
+ | |||
+ | <syntaxhighlight lang="bash"> | ||
+ | virsh list --all | ||
+ | </syntaxhighlight> | ||
+ | <syntaxhighlight lang="text"> | ||
+ | Id Name State | ||
+ | ---------------------------------------------------- | ||
+ | 5 srv06-centos4 running | ||
+ | 6 srv04-freebsd11 running | ||
+ | 7 srv01-rhel7 running | ||
+ | 8 srv03-sles12 running | ||
+ | </syntaxhighlight> | ||
+ | |||
+ | There it is, so we will re-enable it. | ||
+ | |||
+ | {{warning|1=This command '''''MUST''''' specify the name of the host running the server. Failing to do this might cause the server to boot on the wrong node, which could cause serious disk corruption!}} | ||
+ | |||
+ | <syntaxhighlight lang="bash"> | ||
+ | clusvcadm -e srv06-centos4 -m an-a01n01.alteeve.com | ||
+ | </syntaxhighlight> | ||
+ | <syntaxhighlight lang="text"> | ||
+ | Member an-a01n01.alteeve.com trying to enable service:srv06-centos4...Success | ||
+ | vm:srv06-centos4 is now running on an-a01n01.alteeve.com | ||
+ | </syntaxhighlight> | ||
+ | |||
+ | Verify; | ||
+ | |||
+ | <syntaxhighlight lang="bash"> | ||
+ | clustat | ||
+ | </syntaxhighlight> | ||
+ | <syntaxhighlight lang="text"> | ||
+ | Cluster Status for an-anvil-01 @ Thu May 11 16:04:30 2017 | ||
+ | Member Status: Quorate | ||
+ | |||
+ | Member Name ID Status | ||
+ | ------ ---- ---- ------ | ||
+ | an-a01n01.alteeve.com 1 Online, Local, rgmanager | ||
+ | an-a01n02.alteeve.com 2 Online, rgmanager | ||
+ | |||
+ | Service Name Owner (Last) State | ||
+ | ------- ---- ----- ------ ----- | ||
+ | service:libvirtd_n01 an-a01n01.alteeve.com started | ||
+ | service:libvirtd_n02 an-a01n02.alteeve.com started | ||
+ | service:storage_n01 an-a01n01.alteeve.com started | ||
+ | service:storage_n02 an-a01n02.alteeve.com started | ||
+ | vm:srv01-rhel7 an-a01n01.alteeve.com started | ||
+ | vm:srv02-win2012 (an-a01n02.alteeve.com) disabled | ||
+ | vm:srv03-sles12 an-a01n01.alteeve.com started | ||
+ | vm:srv04-freebsd11 an-a01n01.alteeve.com started | ||
+ | vm:srv05-win2016 (an-a01n02.alteeve.com) disabled | ||
+ | vm:srv06-centos4 an-a01n01.alteeve.com started | ||
+ | </syntaxhighlight> | ||
+ | |||
+ | Done. | ||
+ | |||
+ | === Manual Cold Migration Recovery === | ||
+ | |||
+ | If the server failed to stop during a cold migration, then recovery using Striker will cause the server to restart on the original host. | ||
+ | |||
+ | To manually migrate the server, you will need to manually shut down the server and wait for ScanCore to clear the failed state. Once the server is off and the state has cleared, you can manually force the server to boot on the peer node using; | ||
+ | |||
+ | <syntaxhighlight lang="bash"> | ||
+ | clusvcadm -e srv06-centos4 -m an-a01n02.alteeve.com | ||
+ | </syntaxhighlight> | ||
+ | <syntaxhighlight lang="text"> | ||
+ | Member an-a01n02.alteeve.com trying to enable service:srv06-centos4...Success | ||
+ | vm:srv06-centos4 is now running on an-a01n02.alteeve.com | ||
+ | </syntaxhighlight> | ||
+ | |||
+ | In this case, the server was told to start on '<span class="code">an-a01n02</span>'. | ||
{{footer}}
Latest revision as of 20:08, 11 May 2017
A server that has entered a "Failed" state is one where rgmanager received an error state from the virtual server resource agent. Normally, when this occurs, a recovery will be attempted. However, if recovery fails repeatedly, the server is placed in a failed state and no further actions are taken.
This can also happen when a request to shut down the server was made, but the server did not shut down within two minutes. This can occur because something is slowing down the shutdown, or because the server ignored the shutdown request.
If you have encountered this problem, please contact support.
Recovery Options
Manual Shutdown
If you manually shut down the server, ScanCore's scan-clustat scan agent will determine when the server is no longer running on either node and automatically clear the 'failed' state.
If the server was shutting down but simply took a long time, the same mechanism will clear the failed state once it finally shuts down.
Force Off
Warning: This solution is effectively the same as cutting the power to a traditional server. The operating system and software will recover in the same way. This method should be the last option.
If the server is not responding but still shows as running, you can click on 'Force Off' in Striker. When the server is terminated, its 'failed' state will also be cleared.
Manual Recover
Warning: This recovery method requires directly accessing the Anvil! nodes. This option should not be used unless you are comfortable with the command line. These steps require running commands as the root user. Mistakes could be catastrophic!
Both options below require knowing which node the server is currently running on. In the example image above, we see that the 'srv06-centos4' server is 'failed' and it is still running on 'an-a01n01'. Given this, the manual recovery options will be performed on that node.
Open a terminal to the node. This can be done from any Linux machine, including by opening a terminal on either Striker dashboard machine.
ssh root@10.20.10.1
Last login: Thu May 11 14:48:51 2017 from 111.222.33.44
[root@an-a01n01 ~]#
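Once logged in, the owner shown in Striker can be cross-checked from the command line. The following is only a sketch: the helper name `server_owner_and_state` is hypothetical (it is not part of the Anvil! tools), and it assumes the `clustat` service table has the three columns shown on this page. The sample line is copied from that output so the snippet runs without a cluster.

```bash
#!/bin/bash
# Hypothetical helper: print the owner (or last owner) and state of one
# "vm:" entry. Reads clustat-style text on stdin; $1 is the server name.
server_owner_and_state() {
    awk -v svc="vm:$1" '$1 == svc {
        owner = $2
        gsub(/[()]/, "", owner)   # clustat wraps the last owner in parentheses
        print owner, $3
    }'
}

# Sample line taken from the clustat output on this page.
printf '%s\n' ' vm:srv06-centos4 (an-a01n01.alteeve.com) failed' \
    | server_owner_and_state srv06-centos4
# prints: an-a01n01.alteeve.com failed
```

On a node, the same function would be fed with `clustat | server_owner_and_state srv06-centos4`. Note that an owner in parentheses is the last owner known to rgmanager, so it is worth confirming with `virsh list --all` that the server really is running there.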
Restore Normal Monitoring
Warning: This recovery procedure requires extreme caution. A mistake could result in disk corruption that destroys the affected server. Manually shutting down the server is always the preferred recovery option.
Note: A service in a 'failed' state can only be disabled. Thus, clearing the 'failed' state requires trying to disable the server a second time. This will again request the server to shut down. If it is now listening to ACPI power button events, this method will cause the server to power off.
First, we will verify that the server is still showing as 'failed';
clustat
Cluster Status for an-anvil-01 @ Thu May 11 15:52:48 2017
Member Status: Quorate

 Member Name                      ID   Status
 ------ ----                      ---- ------
 an-a01n01.alteeve.com            1    Online, Local, rgmanager
 an-a01n02.alteeve.com            2    Online, rgmanager

 Service Name                     Owner (Last)                     State
 ------- ----                     ----- ------                     -----
 service:libvirtd_n01             an-a01n01.alteeve.com            started
 service:libvirtd_n02             an-a01n02.alteeve.com            started
 service:storage_n01              an-a01n01.alteeve.com            started
 service:storage_n02              an-a01n02.alteeve.com            started
 vm:srv01-rhel7                   an-a01n01.alteeve.com            started
 vm:srv02-win2012                 (an-a01n02.alteeve.com)          disabled
 vm:srv03-sles12                  an-a01n01.alteeve.com            started
 vm:srv04-freebsd11               an-a01n01.alteeve.com            started
 vm:srv05-win2016                 (an-a01n02.alteeve.com)          disabled
 vm:srv06-centos4                 (an-a01n01.alteeve.com)          failed
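On a cluster with many servers, the 'failed' entries can be picked out of that table mechanically. This is a minimal sketch, assuming the state is always the last column of a `service:` or `vm:` row as in the output above; the function name `failed_services` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: list every service or server whose last column is
# "failed". Reads clustat-style text on stdin.
failed_services() {
    awk '$NF == "failed" && $1 ~ /^(vm|service):/ { print $1 }'
}

# Sample rows taken from the clustat output on this page.
failed_services <<'EOF'
 service:storage_n02 an-a01n02.alteeve.com started
 vm:srv05-win2016 (an-a01n02.alteeve.com) disabled
 vm:srv06-centos4 (an-a01n01.alteeve.com) failed
EOF
# prints: vm:srv06-centos4
```

On a node this would be run as `clustat | failed_services`.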
Now we will try again to disable the srv06-centos4 server. If the server ignores the request again, the command will wait for two minutes before giving up and clearing the failed state.
Warning: Again, this next step will ask the server to shut down. If it listens to the request, your server will shut down!
clusvcadm -d srv06-centos4
Local machine disabling service:srv06-centos4...Warning; see system logs
Now we can verify that the server state is 'disabled'.
clustat
Cluster Status for an-anvil-01 @ Thu May 11 16:01:05 2017
Member Status: Quorate

 Member Name                      ID   Status
 ------ ----                      ---- ------
 an-a01n01.alteeve.com            1    Online, Local, rgmanager
 an-a01n02.alteeve.com            2    Online, rgmanager

 Service Name                     Owner (Last)                     State
 ------- ----                     ----- ------                     -----
 service:libvirtd_n01             an-a01n01.alteeve.com            started
 service:libvirtd_n02             an-a01n02.alteeve.com            started
 service:storage_n01              an-a01n01.alteeve.com            started
 service:storage_n02              an-a01n02.alteeve.com            started
 vm:srv01-rhel7                   an-a01n01.alteeve.com            started
 vm:srv02-win2012                 (an-a01n02.alteeve.com)          disabled
 vm:srv03-sles12                  an-a01n01.alteeve.com            started
 vm:srv04-freebsd11               an-a01n01.alteeve.com            started
 vm:srv05-win2016                 (an-a01n02.alteeve.com)          disabled
 vm:srv06-centos4                 (an-a01n01.alteeve.com)          disabled
Verify that the server is still running;
virsh list --all
 Id    Name                           State
----------------------------------------------------
 5     srv06-centos4                  running
 6     srv04-freebsd11                running
 7     srv01-rhel7                    running
 8     srv03-sles12                   running
There it is, so we will re-enable it.
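Because re-enabling on the wrong node is dangerous, this check can be made explicit rather than done by eye. The sketch below assumes the three-column `virsh list --all` layout shown above; `is_running` is a hypothetical helper name, and the sample text stands in for real `virsh` output so the snippet runs anywhere.

```bash
#!/bin/bash
# Hypothetical helper: succeed only if the named server appears with the
# state "running". Reads `virsh list --all` style text on stdin.
is_running() {
    awk -v name="$1" '$2 == name && $3 == "running" { found = 1 }
                      END { exit !found }'
}

# Sample rows taken from the virsh output on this page.
sample=' Id Name State
----------------------------------------------------
 5 srv06-centos4 running
 6 srv04-freebsd11 running'

if printf '%s\n' "$sample" | is_running srv06-centos4; then
    echo "srv06-centos4 is running on this node"
else
    echo "srv06-centos4 is NOT running here; do not enable it on this node" >&2
fi
```

On the node itself, the check would be `virsh list --all | is_running srv06-centos4`.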
Warning: This command MUST specify the name of the host running the server. Failing to do this might cause the server to boot on the wrong node, which could cause serious disk corruption!
clusvcadm -e srv06-centos4 -m an-a01n01.alteeve.com
Member an-a01n01.alteeve.com trying to enable service:srv06-centos4...Success
vm:srv06-centos4 is now running on an-a01n01.alteeve.com
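One way to make it harder to forget the member name is to build the command through a small wrapper that refuses to proceed without one. This is an illustrative sketch only: `enable_command` is a hypothetical function, and it prints the `clusvcadm` command for review instead of running it.

```bash
#!/bin/bash
# Hypothetical wrapper: print the enable command only when both the server
# name and the target member are given. Nothing is executed.
enable_command() {
    local server="$1" member="$2"
    if [ -z "$server" ] || [ -z "$member" ]; then
        echo "usage: enable_command <server> <member>" >&2
        return 1
    fi
    printf 'clusvcadm -e %s -m %s\n' "$server" "$member"
}

enable_command srv06-centos4 an-a01n01.alteeve.com
# prints: clusvcadm -e srv06-centos4 -m an-a01n01.alteeve.com
```

Printing rather than executing keeps a human in the loop, which suits a step where a mistake could corrupt the server's disk.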
Verify;
clustat
Cluster Status for an-anvil-01 @ Thu May 11 16:04:30 2017
Member Status: Quorate

 Member Name                      ID   Status
 ------ ----                      ---- ------
 an-a01n01.alteeve.com            1    Online, Local, rgmanager
 an-a01n02.alteeve.com            2    Online, rgmanager

 Service Name                     Owner (Last)                     State
 ------- ----                     ----- ------                     -----
 service:libvirtd_n01             an-a01n01.alteeve.com            started
 service:libvirtd_n02             an-a01n02.alteeve.com            started
 service:storage_n01              an-a01n01.alteeve.com            started
 service:storage_n02              an-a01n02.alteeve.com            started
 vm:srv01-rhel7                   an-a01n01.alteeve.com            started
 vm:srv02-win2012                 (an-a01n02.alteeve.com)          disabled
 vm:srv03-sles12                  an-a01n01.alteeve.com            started
 vm:srv04-freebsd11               an-a01n01.alteeve.com            started
 vm:srv05-win2016                 (an-a01n02.alteeve.com)          disabled
 vm:srv06-centos4                 an-a01n01.alteeve.com            started
Done.
Manual Cold Migration Recovery
If the server failed to stop during a cold migration, then recovery using Striker will cause the server to restart on the original host.
To migrate the server manually, shut it down yourself and wait for ScanCore to clear the failed state. Once the server is off and the state has cleared, you can force the server to boot on the peer node using;
clusvcadm -e srv06-centos4 -m an-a01n02.alteeve.com
Member an-a01n02.alteeve.com trying to enable service:srv06-centos4...Success
vm:srv06-centos4 is now running on an-a01n02.alteeve.com
In this case, the server was told to start on 'an-a01n02'.
Any questions, feedback, advice, complaints or meanderings are welcome.
Us: Alteeve's Niche! | Support: Mailing List | IRC: #clusterlabs on Freenode | © Alteeve's Niche! Inc. 1997-2019
legal stuff: All info is provided "As-Is". Do not use anything here unless you are willing and able to take responsibility for your own actions.