Managing Drive Failures with AN!CDB

Note: At this time, only LSI-based controllers are supported. Please see this section of the AN!Cluster Tutorial 2 for required node configuration.

The AN!CDB dashboard supports basic drive management for nodes using LSI-based RAID controllers. This covers nearly all Fujitsu servers.

This guide will show you how to handle a few common storage management tasks easily and quickly.

Starting The Storage Manager

From the main AN!CDB page, under the "Cluster Nodes - Control" window, click on the name of the node you wish to manage.

AN!CDB main page with node names as clickable items.

This will open a new tab (or window) showing the current configuration and state of the node's storage.

Storage Display Window

The storage display window shows your storage controller(s), the auxiliary power supply for write-back caching (if installed), the logical disk(s), and each logical disk's constituent drives.

The auxiliary power and logical disks will be slightly indented under their parent controller.

The physical disks associated with a given logical disk are further indented, to show their association.

AN!CDB storage management page.

In this example, we have only one RAID controller. It has an auxiliary power pack, and a single logical volume has been created.

The logical volume is a RAID level 5 array with four physical disks.
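
If you'd like to cross-check what the dashboard is showing you from the node's command line, the same information can be pulled with LSI's MegaCli tool. This is a minimal sketch; the binary name and location vary by install ('MegaCli', 'MegaCli64' or '/opt/MegaRAID/MegaCli/MegaCli64').

 # Show the logical disk(s); RAID level, size and state:
 MegaCli64 -LDInfo -Lall -aALL
 
 # Show each physical disk, with its [enclosure:slot] address and firmware state:
 MegaCli64 -PDList -aALL
 
 # Show the state of the controller's battery-backed (auxiliary power) unit:
 MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL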

Controlling the Physical Disk Identification (ID) Light

The first task we will explore is using identification lights to match a physical disk listing with a physical drive in a node.

If a drive fails completely, its fault light will light up, making the failed drive easy to find. However, the AN!CDB alert system can notify us of pending failures. In these cases, the drive's fault light will not illuminate, so it becomes critical to identify the failing drive. Removing the wrong drive while another drive is unhealthy may well leave your node non-operational.

That's no fun.

Each physical drive will have a button labelled either Turn On or Turn Off, depending on the current state of the identification LED.

Illuminating a Drive's ID Light

Let's illuminate!

We will identify the drive with the somewhat-cryptic name '252:0'. (This is the drive's LSI address, in [enclosure]:[slot] format.)

Turning on the ID light for physical disk '252:0'.

The storage page will reload, indicating whether the command succeeded or not.

Physical disk '252:0' illuminated successfully.

If you now look at the front of your node, you should see one of the drives lit up.

Locating physical disk '252:0' on the node's front panel.

Most excellent.
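
For the curious, the Turn On button maps to the controller's drive-locate function. A sketch of the equivalent MegaCli command, assuming the drive address '252:0' and adapter '0' from this example:

 # Start flashing the ID light on the drive in enclosure 252, slot 0:
 MegaCli64 -PdLocate -start -PhysDrv [252:0] -a0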

Shutting off a Drive's ID Light

To turn the ID light off, simply click on the drive's Turn Off button.

Turning off the ID light for physical disk '252:0'.

As before, the success or failure will be reported.

Physical disk '252:0' ID light turned off successfully.
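
Likewise, the Turn Off button maps to stopping the drive-locate function; a sketch under the same assumptions:

 # Stop flashing the ID light on the drive in enclosure 252, slot 0:
 MegaCli64 -PdLocate -stop -PhysDrv [252:0] -a0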

Refreshing The Storage Page

Warning: AN!CDB doesn't (yet) use a command key to prevent a request from being sent again if a page is manually reloaded (ctrl + r, <f5>, etc). In most cases, this is harmless, as AN!CDB won't do something dangerous without verifying it is still safe to do so. Just the same, please always use the "reload" icon shown below.

After issuing a command to the storage manager, please do not use your browser's "refresh" function. It is always better to click on the reload icon.

Storage page "refresh" icon.

This will reload the page with the most up-to-date state of the storage in your node.

Storage page reloaded properly.

Failure and Recovery

There are many ways for hard drives to fail.

In this section, we're going to sort-of simulate four failure scenarios:

  1. Drive vanishes entirely
  2. Drive is failed but still online
  3. Good drive was ejected by accident, recovering
  4. Drive has not yet failed, but may soon

The first three are going to be sort of mashed together. We'll simply eject the drive while it's running, causing it to disappear and the array to degrade. If this happened in real life, you would simply eject it and insert a new drive.

For the second case, we'll re-insert the ejected drive. The drive will be listed as failed ("Unconfigured(bad)"). We'll tell the controller to "spin down" the drive, making it safe to remove. In the real world, we would then eject it and install a new drive.

In the third case, we will again eject the drive, and then re-insert it. In this case, we won't spin down the drive, but instead mark it as healthy again.

Lastly, we will discuss predictive failure. In these cases, the drive has not failed yet, but confidence in the drive has been lost, so pre-failure replacement will be done.

Drive Vanishes Entirely
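
When a drive vanishes, the logical disk drops to a "Degraded" state until a replacement is inserted and rebuilt. A sketch of how to confirm this with MegaCli, assuming adapter '0' and the example drive address '252:0':

 # The logical disk's 'State' will read 'Degraded' while a member is missing:
 MegaCli64 -LDInfo -Lall -aALL
 
 # After inserting the replacement drive, watch the rebuild progress:
 MegaCli64 -PDRbld -ShowProg -PhysDrv [252:0] -a0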

Drive is Failed but Still Online
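
As noted above, a failed-but-present drive will show the firmware state "Unconfigured(bad)". Spinning it down maps to MegaCli's "prepare for removal" command; a sketch, assuming adapter '0':

 # Spin down the drive in enclosure 252, slot 0, so it is safe to eject:
 MegaCli64 -PdPrpRmv -PhysDrv [252:0] -a0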

Recovering from Accidental Ejection of Good Drive
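
Marking an accidentally ejected, healthy drive as usable again maps to MegaCli's "make good" command; a sketch, assuming adapter '0':

 # Return the drive from 'Unconfigured(bad)' to 'Unconfigured(good)':
 MegaCli64 -PDMakeGood -PhysDrv [252:0] -a0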

Pre-Failure Drive Replacement
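
A drive that is predicted to fail will report a non-zero "Predictive Failure Count" in the physical disk listing. One way to check by hand, as a sketch:

 # List each drive's slot number alongside its predictive failure count:
 MegaCli64 -PDList -aALL | grep -iE 'slot|predictive'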

Managing Hot-Spares

In this final section, we will add a "Hot-Spare" drive to our node. A "Hot-Spare" is an unused drive that the controller can use to immediately replace a failed drive.

This is a popular option for people who want to return to a fully redundant state as soon after a failure as possible.

We will configure a Hot-Spare, show how it replaces a failed drive, and show how to unmark a drive as a Hot-Spare, in case the Hot-Spare itself goes bad.

Creating a Hot-Spare
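
Creating a Hot-Spare maps to MegaCli's hot-spare command. A sketch, marking a hypothetical unused drive at address '252:4' as a global hot-spare on adapter '0':

 # Mark the unused drive in enclosure 252, slot 4 as a global hot-spare:
 MegaCli64 -PDHSP -Set -PhysDrv [252:4] -a0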

Example of a Hot-Spare Working

Replacing a Failing Hot-Spare
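
Un-marking a failing Hot-Spare maps to removing the hot-spare flag; a sketch, using the same hypothetical drive address:

 # Remove the hot-spare flag so the drive can be spun down and replaced:
 MegaCli64 -PDHSP -Rmv -PhysDrv [252:4] -a0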

 
