Managing Drive Failures with AN!CDB

{{note|1=At this time, only LSI-based controllers are supported. Please see [[AN!Cluster_Tutorial_2#Monitoring_LSI-Based_RAID_Controllers_with_MegaCli|this section]] of the [[AN!Cluster Tutorial 2]] for required node configuration.}}


The most common repair needed on [[Anvil!]] nodes is the replacement of failing or failed physical disks. The [[AN!CDB]] dashboard supports basic drive management for nodes using LSI-based [[RAID]] controllers, which covers nearly all Fujitsu servers.

This guide will show you how to handle a few common storage management tasks quickly and easily. In this tutorial, we will physically eject a drive from a small, running logical volume, simulating a failure, and then recover from it.


= Starting The Storage Manager =

From the main [[AN!CDB]] page, under the "''Cluster Nodes - Control''" section, click on the name of the node you wish to manage.


[[Image:an-cdb_storage-control_01.png|thumb|800px|center|AN!CDB main page with node names as click-able items.]]


In our case, we will work on <span class="code">an-c05n01.alteeve.ca</span>. Clicking on the node's name will open a new tab (or window) showing the current configuration and state of the node's storage.


== Storage Display Window ==


The storage display window shows your storage controller(s), their auxiliary power supply for [[write-back caching]] (if installed), the logical disk(s) and each logical disk's constituent drives.

The auxiliary power pack and logical disk(s) are slightly indented under their parent controller. The physical disks associated with a given logical disk are further indented, to show their association.

''AN!CDB storage management page.''

In this example, we have only one RAID controller, it has an auxiliary power pack, and a single logical volume has been created. The logical volume is a [[RAID level 5]] array with four physical disks.
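AN!CDB reads this information from the controller with the MegaCli tool (which is why the node configuration linked in the note above is required). If you want to cross-check the display from the node's shell, something like the following sketch should work, assuming the standard <span class="code">MegaCli64</span> install path and a single adapter, <span class="code">#0</span>:

<source lang="bash">
# Assumed install path for the LSI MegaCli64 binary; adjust if yours differs.
MegaCli='/opt/MegaRAID/MegaCli/MegaCli64'

# Controller summary and the auxiliary power pack (BBU) status.
$MegaCli -AdpAllInfo -a0
$MegaCli -AdpBbuCmd -GetBbuStatus -a0

# Logical disk(s), and each logical disk with its member physical disks.
$MegaCli -LDInfo -LAll -a0
$MegaCli -LdPdInfo -a0
</source>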


= Controlling the Physical Disk Identification ID Light =


The first task we will explore is using identification lights to match a physical disk listing with a physical drive in a node.


If a drive fails completely, its fault light will light up, making the failed drive easy to find. However, the [[AN!CDB]] [[AN!Cluster_Tutorial_2#Setting_Up_Alerts|alert system]] can notify us of pending failures. In these cases, the drive's fault light will not illuminate, so it becomes critical to identify the failing drive. Removing the wrong drive, when another drive is unhealthy, may well leave your node non-operational.


That's no fun.


Each physical drive, whether in an array or unconfigured, will have a button labelled either <span class="code">Turn On</span> or <span class="code">Turn Off</span>, depending on the current state of the identification LED.
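These buttons are roughly equivalent to toggling the drive's locate LED on the LSI controller. A minimal sketch from the node's shell, assuming the standard <span class="code">MegaCli64</span> path, adapter <span class="code">#0</span> and our example drive <span class="code">252:0</span>:

<source lang="bash">
MegaCli='/opt/MegaRAID/MegaCli/MegaCli64'

# Roughly what 'Turn On' does: start the locate/ID LED on drive 252:0.
$MegaCli -PdLocate -Start -PhysDrv[252:0] -a0

# Roughly what 'Turn Off' does: stop the locate/ID LED again.
$MegaCli -PdLocate -Stop -PhysDrv[252:0] -a0
</source>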


=== Illuminating a Drive's ID Light ===

Let's illuminate!

We will identify the drive with the somewhat-cryptic name '<span class="code">252:0</span>'.

''Turning on the ID light for physical disk <span class="code">252:0</span>.''

The storage page will reload, indicating whether the command succeeded or not.

''Physical disk <span class="code">252:0</span> illuminated successfully.''

If you now look at the front of your node, you should see one of the drives lit up.

''Locating physical disk <span class="code">252:0</span> on the node's front panel.''

Most excellent.

=== Shutting off a Drive's ID Light ===

To turn the ID light off, simply click on the drive's <span class="code">Turn Off</span> button.

''Turning off the ID light for physical disk <span class="code">252:0</span>.''

As before, the success or failure will be reported.

''Physical disk <span class="code">252:0</span> ID light turned off successfully.''

== Refreshing The Storage Page ==

{{warning|1=AN!CDB doesn't (yet) use a command key to prevent a request being sent again if a page is manually reloaded (<span class="code">ctrl + r</span>, <span class="code">F5</span>, etc). In almost all cases this is harmless, as AN!CDB won't do something dangerous without verifying it is still safe to do so. Just the same, please always use the "reload" icon shown below.}}

After issuing a command to the storage manager, please do not use your browser's "refresh" function. It is always better to click on the reload icon.

''Storage page "refresh" icon.''

This will reload the page with the most up to date state of the storage in your node.

[[Image:an-cdb_storage-control_09.png|thumb|800px|center|Storage page reloaded properly.]]


= Failure and Recovery =

Now the fun part; breaking things!

There are many ways for hard drives to fail. In this section, we're going to simulate, more or less, four failure scenarios:

# Drive vanishes entirely
# Drive is failed but still online
# Good drive was ejected by accident, recovering
# Drive has not yet failed, but may soon

The first three are somewhat mashed together in the walk-through below. We'll simply eject the drive while it's running, causing it to disappear and the array to degrade. If this happened in real life, you would eject the failed drive and insert a new one.

For the second case, we'll re-insert the ejected drive. The drive will be listed as failed ("<span class="code">Unconfigured(bad)</span>"). We could then tell the controller to "spin down" the drive, making it safe to remove. In the real world, we would then eject it and install a new drive.

In the third case, having re-inserted the ejected drive, we won't spin down the drive, but will instead mark it as healthy again and add it back to the degraded array.

Lastly, there is predictive failure. In these cases, the drive has not failed yet, but confidence in the drive has been lost, so a pre-failure replacement will be done.
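For the second and fourth scenarios, the underlying controller operations are simple enough to show directly. A rough sketch of MegaCli calls that cover them, assuming the standard <span class="code">MegaCli64</span> install path, adapter <span class="code">#0</span> and our example drive <span class="code">252:0</span>:

<source lang="bash">
MegaCli='/opt/MegaRAID/MegaCli/MegaCli64'

# Scenario 2: a failed-but-online drive can be "prepared for removal",
# which spins it down so it is safe to pull from the bay.
$MegaCli -PdPrpRmv -PhysDrv[252:0] -a0

# Changed your mind? Undo the "prepare for removal".
$MegaCli -PdPrpRmv -Undo -PhysDrv[252:0] -a0

# Scenario 4: watch for drives the controller predicts will fail.
$MegaCli -PDList -a0 | grep -i -e '^Enclosure Device' -e '^Slot' -e 'Predictive Failure Count'
</source>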
 
== Failing a Drive ==
 
{{warning|1=Physically ejecting a drive that is powered up and running is very dangerous! The electrical connections will be live and there is a possibility of a short destroying components. Further, platter-based drives will be spinning at full speed, and moving the drive out of its current rotational plane risks a very destructive head crash. DO NOT EJECT A POWERED ON, SPINNING DRIVE! We understood the risks when writing this tutorial and, even then, used development nodes.}}
 
For this tutorial, we will physically eject the drive we identified earlier, called '<span class="code">252:0</span>', which is a member of logical drive <span class="code">#0</span>. This will cause the logical drive to enter a degraded state, and the ejected drive will completely vanish from the storage page.
 
[[Image:an-cdb_storage-control_10.png|thumb|800px|center|Storage page pre-failure state.]]
 
Now we will eject the drive.
 
[[Image:an-cdb_storage-control_11.png|thumb|800px|center|Logical drive <span class="code">#0</span> is now degraded and physical disk <span class="code">252:0</span> is missing.]]
 
Notice how <span class="code">252:0</span> is gone and how the logical drive's <span class="code">state</span> is now <span class="code" style="color: #a60a0a; font-style: italic;">Degraded</span>?
 
Underneath the "Degraded" state is more detail on which drive is missing and how big the replacement drive has to be.
 
== Recovering the Ejected Disk ==
 
In our case, we know that the drive we ejected is healthy, so we will re-insert it into the bay.


If your drive really has failed, then remove the failed drive and insert the new replacement drive.

Once inserted, the drive will be marked by the controller as <span class="code" style="color: #a60a0a; font-style: italic;">Unconfigured(bad)</span>. It will also be listed as an "''Unconfigured Physical Disk''".

[[Image:an-cdb_storage-control_12.png|thumb|800px|center|Physical drive <span class="code">252:0</span> is back, but it's had better days.]]

{{warning|1=If the drive was marked bad automatically by the controller, do not try to repair it. Replace it!}}

In this state, the physical drive is useless. Before we can use it, we must click on the <span class="code">Make Good</span> link beside the drive's <span class="code">State</span>.

[[Image:an-cdb_storage-control_13.png|thumb|800px|center|Physical drive <span class="code">252:0</span> in its degraded state, with its "<span class="code">Make Good</span>" button.]]

Once you click on <span class="code">Make Good</span>, the drive will be flagged as healthy.

[[Image:an-cdb_storage-control_14.png|thumb|800px|center|Physical drive <span class="code">252:0</span> made good and brought online.]]

Now the physical disk is usable again.
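The <span class="code">Make Good</span> link corresponds roughly to clearing the drive's "bad" flag on the controller. A minimal sketch from the node's shell, under the same MegaCli assumptions as above:

<source lang="bash">
# Rough CLI equivalent of 'Make Good': clear the 'Unconfigured(bad)' flag so
# the drive is reported as 'Unconfigured(good)' again.
MegaCli='/opt/MegaRAID/MegaCli/MegaCli64'

$MegaCli -PDMakeGood -PhysDrv[252:0] -a0
</source>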


== Adding the Recovered/Replacement Drive to the Degraded Array ==


{{note|1=Depending on the controller's configuration, inserting a fresh drive into a node with a degraded array may cause the drive to be automatically added to the array, with the rebuild starting immediately. In such a case, the steps below will not be needed.}}


Whether you replaced a truly bad drive or flagged an ejected drive as good again, the new drive may not automatically be added to the degraded array.


When the storage manager sees a degraded array, it looks for healthy drives not yet in an array. When found, the healthy drive's size is checked. If it is large enough to replace the lost drive, a button called "<span class="code">Add to Logical Disk #x</span>" will be shown. If there are multiple degraded arrays, the option of which array to add the disk to will be shown as multiple buttons.


[[Image:an-cdb_storage-control_15.png|thumb|800px|center|Healthy physical drive <span class="code">252:0</span> ready to be added to the degraded array.]]


Once you click on the "<span class="code">Add to Logical Disk #0</span>" button shown above, the drive will be added to the array, it will be brought online and the rebuild of the lost data will start.
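Under the hood, this corresponds roughly to putting the drive back into the array's missing slot and starting a rebuild. A rough MegaCli sketch, with the same path and adapter assumptions as above (the array and row numbers below are examples; check <span class="code">-PdGetMissing</span> for the real ones):

<source lang="bash">
MegaCli='/opt/MegaRAID/MegaCli/MegaCli64'

# Show which array/row the controller considers missing.
$MegaCli -PdGetMissing -a0

# Slot the recovered/replacement drive into that position (example: array 0, row 0),
# then kick off the rebuild.
$MegaCli -PdReplaceMissing -PhysDrv[252:0] -Array0 -row0 -a0
$MegaCli -PDRbld -Start -PhysDrv[252:0] -a0
</source>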


[[Image:an-cdb_storage-control_16.png|thumb|800px|center|Physical drive <span class="code">252:0</span> added to logical drive <span class="code">#0</span>.]]


Immediately, we'll see that the rebuild of the freshly inserted physical disk has begun.

[[Image:an-cdb_storage-control_17.png|thumb|800px|center|Physical drive <span class="code">252:0</span> rebuild started and the drive is back under logical drive <span class="code">#0</span>.]]

Good time to go grab a <span class="code">$drink</span>.

== Monitoring Drive Rebuild ==


{{note|1=The rebuild progress does not (yet) auto-update. To monitor the rebuild progress, please periodically [[#Refreshing_The_Storage_Page|refresh the page]].}}


Rebuilding can take a fair bit of time. How long exactly depends on the complexity of the RAID level used, the size of the physical disk that was inserted and the configuration of the controller.
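If you would rather watch the progress from the node's shell than refresh the page, a minimal sketch under the same MegaCli assumptions:

<source lang="bash">
# Show the rebuild progress (percent complete and elapsed time) for drive 252:0.
MegaCli='/opt/MegaRAID/MegaCli/MegaCli64'

$MegaCli -PDRbld -ShowProg -PhysDrv[252:0] -a0
</source>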


[[Image:an-cdb_storage-control_18.png|thumb|800px|center|Physical drive <span class="code">252:0</span> rebuild progressing...]]


In my case, the rebuild took about 30 minutes.


[[Image:an-cdb_storage-control_19.png|thumb|800px|center|Physical drive <span class="code">252:0</span> rebuilt!]]


That's it! The logical drive is fully restored.

= Managing Hot-Spares =

In this final section, we will add a "Hot-Spare" drive to our node. A "Hot-Spare" is an unused drive that the controller knows it can use to immediately replace a failed drive.

This is a popular option for people who want to return to a fully redundant state as soon after a failure as possible.

We will configure a Hot-Spare, show how it replaces a failed drive, and show how to unmark a drive as a Hot-Spare, in case the Hot-Spare itself goes bad.
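On the controller side, hot-spare management comes down to a couple of MegaCli calls. A minimal sketch, assuming the standard <span class="code">MegaCli64</span> path, adapter <span class="code">#0</span> and a hypothetical spare drive at <span class="code">252:4</span>:

<source lang="bash">
MegaCli='/opt/MegaRAID/MegaCli/MegaCli64'

# Mark the unused drive 252:4 (hypothetical example) as a global hot-spare.
$MegaCli -PDHSP -Set -PhysDrv[252:4] -a0

# Un-mark it again, for example if the hot-spare itself starts to fail.
$MegaCli -PDHSP -Rmv -PhysDrv[252:4] -a0
</source>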


== Creating a Hot-Spare ==

== Example of a Hot-Spare Working ==

== Replacing a Failing Hot-Spare ==

{{footer}}
