Failed Server

From AN!Wiki
Revision as of 20:08, 11 May 2017 by Digimer (Talk | contribs)


A server that has entered a "Failed" state is one where rgmanager received an error state from the virtual server resource agent. Normally, when this occurs, a recovery will be attempted. However, if the recovery fails repeatedly, the server is placed in a failed state and no further actions are taken.

The srv06-centos4 server in the 'Failed' state.

This can also happen when a request to shut down the server was made, but it didn't shut down within two minutes. The cause is either that something is slowing down the shutdown or that the server ignored the shutdown request.

If you have encountered this problem, please contact support.

Recovery Options

Manual Shutdown

If you manually shut down the server, ScanCore's scan-clustat scan agent will determine when the server is no longer running on either node and automatically clear the 'failed' state.

If the server was shutting down but simply took a long time, the same mechanism will clear the failed state once it finally shuts down.
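If you would rather watch for the state to clear than re-check by hand, the wait can be sketched as a small polling loop. This is only an illustration: the function name `wait_until_cleared`, the polling interval and the retry limit are assumptions, not part of ScanCore or Striker, and it relies on the `clustat` output layout shown later on this page.

```shell
# Hypothetical helper: poll clustat until the named server is no longer 'failed'.
# Arguments: <server> [seconds between polls, default 10] [max polls, default 60]
# Returns 0 once the state has changed, 1 if it is still 'failed' at the end.
wait_until_cleared() {
    server="$1"; interval="${2:-10}"; max_tries="${3:-60}"
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        # The state is the last column of the server's 'vm:' line.
        state="$(clustat | awk -v svc="vm:$server" '$1 == svc { print $NF }')"
        # An empty state means clustat gave no answer; keep polling.
        if [ -n "$state" ] && [ "$state" != "failed" ]; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$interval"
    done
    return 1
}
```

On a node, `wait_until_cleared srv06-centos4` would then return once ScanCore has cleared the state, or give up after roughly ten minutes with the defaults.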

Force Off

Warning: This solution is effectively the same as cutting the power to a traditional server. The operating system and software will recover in the same way. This method should be the last option.

If the server is not responding but still shows as running, you will be able to click on 'Force Off' in Striker. When the server is terminated, its 'failed' state will also be cleared.

Manual Recover

Warning: This recovery method requires directly accessing the Anvil! nodes. This option should not be used unless you are comfortable with the command line. These steps require running commands as the root user. Mistakes could be catastrophic!

Both options below require knowing which node the server is currently running on. In the example image above, we see that the 'srv06-centos4' server is 'failed' and it is still running on 'an-a01n01'. Given this, the manual recovery options will be performed on that node.

Open a terminal to the node. This can be done from any Linux machine, including by opening a terminal on either Striker dashboard machine.

ssh root@10.20.10.1
Last login: Thu May 11 14:48:51 2017 from 111.222.33.44
[root@an-a01n01 ~]#

Restore Normal Monitoring

Warning: This recovery procedure requires extreme caution. A mistake could result in disk corruption that destroys the affected server. Manually shutting down the server is always the preferred recovery option.
Note: A service in a 'failed' state can only be disabled. Thus, clearing the 'failed' state requires trying to disable the server a second time. This will again request the server to shut down. If it is now listening to ACPI power button events, this method will cause the server to power off.

First, we will verify that the server is still showing as 'failed':

clustat
Cluster Status for an-anvil-01 @ Thu May 11 15:52:48 2017
Member Status: Quorate
 
 Member Name                                            ID   Status
 ------ ----                                            ---- ------
 an-a01n01.alteeve.com                                      1 Online, Local, rgmanager
 an-a01n02.alteeve.com                                      2 Online, rgmanager
 
 Service Name                                  Owner (Last)                                  State         
 ------- ----                                  ----- ------                                  -----         
 service:libvirtd_n01                          an-a01n01.alteeve.com                         started       
 service:libvirtd_n02                          an-a01n02.alteeve.com                         started       
 service:storage_n01                           an-a01n01.alteeve.com                         started       
 service:storage_n02                           an-a01n02.alteeve.com                         started       
 vm:srv01-rhel7                                an-a01n01.alteeve.com                         started       
 vm:srv02-win2012                              (an-a01n02.alteeve.com)                       disabled      
 vm:srv03-sles12                               an-a01n01.alteeve.com                         started       
 vm:srv04-freebsd11                            an-a01n01.alteeve.com                         started       
 vm:srv05-win2016                              (an-a01n02.alteeve.com)                       disabled      
 vm:srv06-centos4                              (an-a01n01.alteeve.com)                       failed
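On an Anvil! with many servers, picking the failed entry out of this listing by eye is error-prone. The filter below is a minimal sketch that prints each failed server together with the node it was last seen on. The function name `failed_servers` is made up for this example, and it assumes the column layout shown above (state in the last column, owner in the second, possibly wrapped in parentheses).

```shell
# Hypothetical helper: read clustat output on stdin and print
# "<server> <last owner>" for every vm: service in the 'failed' state.
failed_servers() {
    awk '$1 ~ /^vm:/ && $NF == "failed" {
        name = $1; sub(/^vm:/, "", name)       # drop the "vm:" prefix
        owner = $2; gsub(/[()]/, "", owner)    # drop the "(last owner)" parentheses
        print name, owner
    }'
}

# Usage on a node, as root:
#   clustat | failed_servers
```

With the listing above as input, this would report srv06-centos4 and an-a01n01.alteeve.com, which is the node the manual recovery must be run on.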

Now we will try again to disable the srv06-centos4 server. If the server ignores the request again, the command will wait for two minutes before giving up, after which the 'failed' state is cleared and the server shows as 'disabled'.

Warning: This next step will again ask the server to shut down. If it honours the request, your server will shut down!
clusvcadm -d srv06-centos4
Local machine disabling service:srv06-centos4...Warning; see system logs

Now we can verify that the server state is 'disabled'.

clustat
Cluster Status for an-anvil-01 @ Thu May 11 16:01:05 2017
Member Status: Quorate
 
 Member Name                                            ID   Status
 ------ ----                                            ---- ------
 an-a01n01.alteeve.com                                      1 Online, Local, rgmanager
 an-a01n02.alteeve.com                                      2 Online, rgmanager
 
 Service Name                                  Owner (Last)                                  State         
 ------- ----                                  ----- ------                                  -----         
 service:libvirtd_n01                          an-a01n01.alteeve.com                         started       
 service:libvirtd_n02                          an-a01n02.alteeve.com                         started       
 service:storage_n01                           an-a01n01.alteeve.com                         started       
 service:storage_n02                           an-a01n02.alteeve.com                         started       
 vm:srv01-rhel7                                an-a01n01.alteeve.com                         started       
 vm:srv02-win2012                              (an-a01n02.alteeve.com)                       disabled      
 vm:srv03-sles12                               an-a01n01.alteeve.com                         started       
 vm:srv04-freebsd11                            an-a01n01.alteeve.com                         started       
 vm:srv05-win2016                              (an-a01n02.alteeve.com)                       disabled      
 vm:srv06-centos4                              (an-a01n01.alteeve.com)                       disabled

Verify that the server is still running:

virsh list --all
 Id    Name                           State
----------------------------------------------------
 5     srv06-centos4                  running
 6     srv04-freebsd11                running
 7     srv01-rhel7                    running
 8     srv03-sles12                   running

There it is, so we will re-enable it.
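If this step is scripted, the same check can be made explicit so that the re-enable only proceeds when the server really is running on this node. The sketch below assumes the `virsh list --all` layout shown above (name in the second column, state in the third). The function name `server_is_running` is hypothetical.

```shell
# Hypothetical helper: read 'virsh list --all' output on stdin and succeed
# (exit 0) only if the named server is listed with the state 'running'.
server_is_running() {
    awk -v name="$1" '
        $2 == name && $3 == "running" { found = 1 }
        END { exit !found }'
}

# Usage on a node, as root:
#   virsh list --all | server_is_running srv06-centos4 && echo "still running here"
```

A server that is shut off, or that does not appear on this node at all, makes the function return non-zero. In that case the re-enable step should not be run from this node.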

Warning: This command MUST specify the name of the host running the server. Failing to do this might cause the server to boot on the wrong node, which could cause serious disk corruption!
clusvcadm -e srv06-centos4 -m an-a01n01.alteeve.com
Member an-a01n01.alteeve.com trying to enable service:srv06-centos4...Success
vm:srv06-centos4 is now running on an-a01n01.alteeve.com
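To reduce the risk of mistyping the member name, the enable command can be built from the owner column that `clustat` reports instead of being typed by hand. This is a sketch only: `reenable_command` is a made-up name, it prints the command rather than running it so that it can be reviewed first, and it assumes the last owner shown by `clustat` is the node where the server is actually running, which should still be confirmed with `virsh list --all` as above.

```shell
# Hypothetical helper: read clustat output on stdin and print the clusvcadm
# command that re-enables <server> on the node clustat lists as its owner.
# Exits non-zero if the server is not found, so nothing is printed by mistake.
reenable_command() {
    awk -v svc="vm:$1" '
        $1 == svc {
            owner = $2; gsub(/[()]/, "", owner)   # strip "(last owner)" parentheses
            print "clusvcadm -e " substr(svc, 4) " -m " owner
            found = 1
        }
        END { exit !found }'
}

# Usage on a node, as root:
#   clustat | reenable_command srv06-centos4
```

Printing instead of executing is deliberate: given the corruption risk described in the warning, a person should read the member name before the command is run.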

Verify:

clustat
Cluster Status for an-anvil-01 @ Thu May 11 16:04:30 2017
Member Status: Quorate
 
 Member Name                                            ID   Status
 ------ ----                                            ---- ------
 an-a01n01.alteeve.com                                      1 Online, Local, rgmanager
 an-a01n02.alteeve.com                                      2 Online, rgmanager
 
 Service Name                                  Owner (Last)                                  State         
 ------- ----                                  ----- ------                                  -----         
 service:libvirtd_n01                          an-a01n01.alteeve.com                         started       
 service:libvirtd_n02                          an-a01n02.alteeve.com                         started       
 service:storage_n01                           an-a01n01.alteeve.com                         started       
 service:storage_n02                           an-a01n02.alteeve.com                         started       
 vm:srv01-rhel7                                an-a01n01.alteeve.com                         started       
 vm:srv02-win2012                              (an-a01n02.alteeve.com)                       disabled      
 vm:srv03-sles12                               an-a01n01.alteeve.com                         started       
 vm:srv04-freebsd11                            an-a01n01.alteeve.com                         started       
 vm:srv05-win2016                              (an-a01n02.alteeve.com)                       disabled      
 vm:srv06-centos4                              an-a01n01.alteeve.com                         started

Done.

Manual Cold Migration Recovery

If the server failed to stop during a cold migration, then recovery using Striker will cause the server to restart on the original host.

To migrate the server by hand, shut it down manually and wait for ScanCore to clear the failed state. Once the server is off and the state has cleared, you can force the server to boot on the peer node using:

clusvcadm -e srv06-centos4 -m an-a01n02.alteeve.com
Member an-a01n02.alteeve.com trying to enable service:srv06-centos4...Success
vm:srv06-centos4 is now running on an-a01n02.alteeve.com

In this case, the server was told to start on 'an-a01n02'.
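The peer's name can also be read from the member list in `clustat` rather than typed from memory. The sketch below assumes the member table layout shown earlier on this page (name, numeric ID, then a status beginning with "Online"). The function name `peer_node` is hypothetical.

```shell
# Hypothetical helper: read clustat output on stdin and print the name of
# every online cluster member other than <host>.
peer_node() {
    awk -v host="$1" '
        # Member lines have a numeric ID in column 2 and "Online," in column 3.
        $2 ~ /^[0-9]+$/ && $3 ~ /^Online/ && $1 != host { print $1 }'
}

# Usage on a node, as root:
#   clustat | peer_node an-a01n01.alteeve.com
```

If this prints nothing, the peer is not online and the server cannot be started there.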

 

Any questions, feedback, advice, complaints or meanderings are welcome.
Us: Alteeve's Niche! Support: Mailing List IRC: #clusterlabs on Freenode   © Alteeve's Niche! Inc. 1997-2019
legal stuff: All info is provided "As-Is". Do not use anything here unless you are willing and able to take responsibility for your own actions.