Managing Drive Failures with AN!CDB

Note: At this time, only LSI-based controllers are supported. Please see this section of the AN!Cluster Tutorial 2 for required node configuration.

The AN!CDB dashboard supports basic drive management for nodes using LSI-based RAID controllers. This covers all Fujitsu servers with hardware RAID.

This guide will show you how to handle a few common storage management tasks easily and quickly.

Starting the Storage Manager

From the main AN!CDB page, under the "Cluster Nodes - Control" window, click on the name of the node you wish to manage.

AN!CDB main page with node names as clickable items.

This will open a new tab (or window) showing the current configuration and state of the node's storage.

Storage Display Window

The storage display window shows your storage controller(s), their auxiliary power supply for write-back caching if installed, the logical disk(s) and each logical disk's constituent drives.

The auxiliary power and logical disks will be slightly indented under their parent controller.

The physical disks associated with a given logical disk are further indented to show their association.

AN!CDB storage management page.

In this example, we have a single RAID controller with an auxiliary power pack, and one logical volume has been created.

The logical volume is a RAID level 5 array with four physical disks.
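
Everything shown on this page is read from the node's LSI controller. If you ever want to cross-check it from a node's command line, the same inventory can be pulled with LSI's MegaCli tool. This is only a sketch; it assumes the MegaCli64 binary described in the AN!Cluster Tutorial 2 is installed and in your PATH, as this tutorial's nodes require.

# Show the RAID controller(s) and their configuration.
MegaCli64 -AdpAllInfo -aALL

# Show the logical disk(s); a healthy array reports a 'State' of 'Optimal'.
MegaCli64 -LDInfo -Lall -aALL

# Show every physical disk, including its '[enclosure:slot]' address (eg. '252:0').
MegaCli64 -PDList -aALL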

Controlling the Physical Disk Identification Light

The first task we will explore is using identification lights to match a physical disk listing with a physical drive in a node.

If a drive fails completely, its fault light will light up, making the failed drive easy to find. However, the AN!CDB alert system can notify us of pending failures. In these cases, the drive's fault light will not illuminate. Therefore, it becomes critical to identify the failing drive. Removing the wrong drive, when another drive is unhealthy, may well leave your node non-operational.

That's no fun.

Each physical drive will have a button labelled either Turn On or Turn Off, depending on the current state of the identification LED.

Illuminating a Drive's ID Light

Let's illuminate!

We will identify the drive with the somewhat-cryptic name '252:0'.

Turning on the ID light for physical disk '252:0'.

The storage page will reload, indicating whether the command succeeded or not.

Physical disk '252:0' illuminated successfully.

If you now look at the front of your node, you should see one of the drives lit up.

Locating physical disk '252:0' on the node's front panel.

Most excellent.

Shutting off a Drive's ID Light

To turn the ID light off, simply click on the drive's Turn Off button.

Turning off the ID light for physical disk '252:0'.

As before, the success or failure will be reported.

Physical disk '252:0' ID light turned off successfully.
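
For reference, the ID light buttons correspond to the controller's "locate" function. If you ever need to do the same from a node's shell, the MegaCli commands look roughly like this (a sketch, assuming adapter 0 and the '252:0' drive used above):

# Start flashing the identification LED on physical disk [252:0].
MegaCli64 -PdLocate -Start -PhysDrv [252:0] -a0

# Stop it again once the drive has been located.
MegaCli64 -PdLocate -Stop -PhysDrv [252:0] -a0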

Refreshing the Storage Page

Warning: AN!CDB doesn't (yet) use a command key to prevent a request being sent again if a page is manually reloaded (ctrl + r, <f5>, etc.). In almost all cases this is harmless, as AN!CDB won't do anything dangerous without first verifying that it is still safe to do so. Just the same, please always use the "reload" icon shown below.

After issuing a command to the storage manager, please do not use your browser's "refresh" function. It is always better to click on the reload icon.

Storage page "refresh" icon.

This will reload the page with the most up to date state of the storage in your node.

Storage page reloaded properly.

Failure and Recovery

There are many ways for hard drives to fail.

In this section, we're going to simulate, more or less, four failure scenarios:

  1. Drive vanishes entirely
  2. Drive is failed but still online
  3. Good drive was ejected by accident, recovering
  4. Drive has not yet failed, but may soon

The first three scenarios overlap somewhat. We'll simply eject the drive while it's running, causing it to disappear from the storage page and the array to degrade. If this happened in real life, you would simply remove the failed drive and insert a new one.

For the second case, we'll re-insert the ejected drive. The drive will be listed as failed ("Unconfigured(bad)"). We'll tell the controller to "spin down" the drive, making it safe to remove. In the real world, we would then eject it and install a new drive.

In the third case, we will again eject the drive, and then re-insert it. In this case, we won't spin down the drive, but instead mark it as healthy again.

Lastly, we will discuss predictive failure. In these cases, the drive has not failed yet, but confidence in the drive has been lost, so pre-failure replacement will be done.

Drive Vanishes Entirely

If a drive fails catastrophically, say the controller on the drive fails, you may find that the drive simply no longer appears in the list of physical disks under the logical disk.

Likewise, if the disk is physically ejected (by accident or otherwise), the same result will be seen.

The logical drive will list as 'Degraded', but none of the disks will show errors.

Degraded logical disk with no failed physical disks.

If you look under Logical Disk #X, you will see the number of physical disks in the logical disk beside 'Number of Drives'. If you count the disks actually displayed under the logical drive, though, you will see that fewer are shown.

In this situation, the best thing to do is locate the failed disk. The 'Failed' LED should be illuminated on the bay with the failed or missing disk. If it is not lit, you may need to light up the ID LED on the remaining good disks and use a simple process of elimination to find the dead drive.

Once you know which disk is dead, remove it and install a replacement disk. If 'Restore Hot-Spare on Insert' is set to 'Yes', then the drive should immediately be added to the logical disk and the rebuild process should start.
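
If you want to confirm from the command line which array slot the controller considers missing, MegaCli can report it directly. A rough sketch, again assuming adapter 0:

# List any missing drives, along with the array and row they belonged to.
MegaCli64 -PdGetMissing -a0

# Confirm the 'Degraded' state of the logical disk(s).
MegaCli64 -LDInfo -Lall -a0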

Inserting a Replacement Physical Disk

Replaced physical disk automatically added to logical disk and rebuild begun.

If the replacement drive doesn't automatically get added to the logical disk, you can add the new physical disk to the Degraded logical disk using the storage manager. The new disk will come up as 'Unconfigured(good), Spun Up'. Directly beneath that, you will see the option to 'Add to Logical Disk #X', where X is the logical disk number that is degraded.

Disk is 'Unconfigured(good), Spun Up' with option to add the physical disk to logical disk #0.

Click on 'Add to Logical Disk #X' and the drive will be added to the logical disk. The rebuild process will start immediately.

Replacement physical disk manually added to logical disk #0, rebuild process started.
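
Should the dashboard be unavailable, the same manual recovery can be sketched out with MegaCli. The array and row numbers come from '-PdGetMissing' (shown earlier), and the '252:4' address below is only an example; substitute the address of your replacement disk:

# Tell the controller that the new disk replaces the missing member of array 0, row 0.
MegaCli64 -PdReplaceMissing -PhysDrv [252:4] -Array0 -row0 -a0

# Start rebuilding onto the new disk.
MegaCli64 -PDRbld -Start -PhysDrv [252:4] -a0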

Drive is Failed but Still Online

If a drive has failed, but its controller is still working, the drive should appear as 'Unconfigured(bad)'.

This will be one of the most common failures you will have to deal with.

Physical disk failed and marked as 'Unconfigured(bad)'.

You will generally have two options here.

  • Prepare it for removal
  • Try to make the drive "good" again

Preparing a Failed Disk for Removal

A failed disk may still be "spun up", meaning that its platters are rotating. Moving the disk before it has spun down completely could cause a head crash. Further, the electrical connection to the drive will still be active, making it possible to cause a short when the drive is mechanically ejected, which could damage the drive's controller, the backplane or the RAID controller itself.

To avoid this risk, we must prepare the drive for removal before it is removed. This will spin down the disk and make it safe to eject with minimal risk of electrical shorts.

To prepare it for removal, you will click on 'Spin Down Disk'.

Failed physical disk with 'Spin Down Disk' option highlighted.
Note: In the example below, AN!CDB's first attempt to spin down the drive failed. So it attempted, successfully, to mark the disk as 'good' and then tried to spin it down a second time, which worked. Be sure to physically record the disk as failed!

Once spun down, the physical disk will be shown to be safe to remove.

Failed physical disk is ready to be removed.

Once you insert the new disk, it should automatically be added to the logical disk and the rebuild process should start. If it doesn't, please see "Inserting a Replacement Physical Disk" above.
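
The 'Spin Down Disk' button corresponds to the controller's prepare-for-removal function. A rough command-line equivalent, assuming adapter 0 and the failed drive at '252:0':

# Spin the disk down and prepare it for safe removal.
MegaCli64 -PDPrpRmv -PhysDrv [252:0] -a0

# If you change your mind before pulling the disk, undo the preparation.
MegaCli64 -PDPrpRmv -UnDo -PhysDrv [252:0] -a0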

Attempting to Recover a Failed Disk

Warning: If you do not know why the drive failed, replace it. Even if it appears to recover, it will likely fail again soon.

Depending on why the physical disk failed, it may be possible to mark it as "good" again. This should only be done in two cases:

  • You know why the disk was flagged as failed and you trust it is actually healthy
  • To try and restore redundancy to the logical disk while you wait for the replacement disk to arrive

Underneath the failed disk's 'Unconfigured(bad)' state will be the option 'Make Good'.

Failed disk with 'Make Good' option highlighted.
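
For completeness, the 'Make Good' button maps onto MegaCli's make-good operation. A hedged command-line sketch, assuming adapter 0 and the failed drive at '252:0':

# Clear the 'Unconfigured(bad)' flag so the disk becomes 'Unconfigured(good)'.
MegaCli64 -PDMakeGood -PhysDrv [252:0] -a0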

If this works, the disk state will change to 'Unconfigured(good), Spun Up'. Below this, one of two options will be available:

Physical disk marked good.

  • If one (or more) of the logical disks is degraded, 'Add to Logical Disk #X' will be available.
  • If all of the logical disks are 'Optimal', you will be able to mark the disk as a 'Hot-Spare'.

In this section, we will add the recovered physical disk to the degraded logical disk. Managing hot-spares is covered further below. Do note, though: if you mark a disk as a hot-spare while a logical disk is degraded, it will automatically be added to that degraded logical disk and the rebuild will begin.

If there are multiple degraded logical disks, you will see multiple 'Add to Logical Disk #X' options, one for each degraded logical disk. In our case, there is just one logical disk, so there is just one option.

Recovered physical disk's 'Add to Logical Disk #0' button highlighted.

Once you click on 'Add to Logical Disk #X', the physical disk will appear under the logical disk and rebuild will begin automatically.

Recovered physical disk is back in the logical disk and rebuild has begun.

Done.

Recovering from Accidental Ejection of Good Drive

Warning: Ejecting a physical disk that is currently operating risks damaging the physical disk, its controller and interface, the node's storage backplane and/or the RAID controller. Never eject a running drive! We did it here using a test node, and we accepted the risk of damaging the hardware.

There are numerous reasons why a perfectly good drive might get ejected from a node. In spite of the risks in the above warning, people often seem to eject a running disk, perhaps not realizing the system is running, or out of a desire to simulate a failure. In any case, it happens, and it is important to know how to recover from it.

As we discussed above, ejecting a disk will cause it to vanish from the list of physical disks and the logical disk it was in will become degraded.

Once re-inserted, the disk will be flagged as 'Unconfigured(bad)'. In this case, you know the disk is healthy, so marking it as good after reinserting it is safe.

Please jump up to the "Attempting to Recover a Failed Disk" section to finish recovering the physical disk.

Pre-Failure Drive Replacement

With the AN!CDB monitoring program, you may occasionally get an alert like:

Still healthy drive:
RAID 0's Physical Disk 5's "Other Error Count" has changed!
  0	-> 1

This is not always a concern. For example, every drive has an "unrecoverable read error (URE)" rating, usually one error in every 10^15 bits read for enterprise SAS drives. When the logical disk is healthy, a read error is not a problem, as the read can be recovered and the drive itself is still fine.

However, if we see a sudden spike in "Other Error Count", it may be a good idea to replace the drive.

A disk that may be failing and should be replaced:
RAID 0's Physical Disk 3's "Other Error Count" has changed!
  0	-> 4

In this case, we see that four errors occurred at almost the same time. This is cause for concern and warrants a pre-failure replacement.
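
The counters that the alert system watches are reported per physical disk by the controller itself. If you want to inspect them by hand, one (assumed) way is to filter MegaCli's physical disk listing:

# Show the media, other and predictive-failure error counters for every disk.
MegaCli64 -PDList -aALL | grep -i -E 'slot number|error count|predictive'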

You would leave the questionable disk in the logical disk until the replacement drive is in hand. Once ready, remove the drive from the logical disk, which will leave it ready to eject, and then insert the replacement physical disk.

(Simulated) "Other" Error Count showing 4 errors. "Remove from Logical Disk" highlighted.

When you click on 'Remove from Logical Disk', you will be asked to confirm the action. You will also be warned that this action will degrade the logical disk. In this case, the replacement drive is ready to insert, so we are safe to proceed.

Confirming the "Remove from Logical Disk" action.

Click on 'Confirm' and the physical disk will be taken offline, marked as missing and then spun down. Once completed, you will be able to safely eject the physical disk and install the replacement drive.

Physical disk out of the logical disk and is safe to remove.
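
The three steps described above (offline, mark missing, spin down) map onto individual MegaCli operations. If you ever needed to perform a pre-failure removal from the shell, a sketch would look like this, using adapter 0 and '252:3' purely as an example address:

# Take the suspect disk offline; the logical disk becomes degraded at this point.
MegaCli64 -PDOffline -PhysDrv [252:3] -a0

# Mark it as missing so the controller expects a replacement in that slot.
MegaCli64 -PDMarkMissing -PhysDrv [252:3] -a0

# Spin it down so it is safe to eject.
MegaCli64 -PDPrpRmv -PhysDrv [252:3] -a0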

When you insert the replacement disk, it should automatically be added to the logical disk and the rebuild should begin.

Note: If the freshly inserted drive is not automatically added to the logical disk, please see the "Inserting a Replacement Physical Disk" section above.
Replacement physical disk inserted and rebuild begun.

Done!

Managing Hot-Spares

In this final section, we will add a "hot-spare" drive to our node.

A "Hot-Spare" is an unused drive that the RAID controller knows can be used to immediately replace a failed drive. This is a popular option for people who want to return to a fully redundant state as soon after a failure as possible.

We will configure a hot-spare, show how it replaces a failed drive, and show how to unmark a drive as a hot-spare in case the hot-spare itself goes bad.

Creating a Hot-Spare

Up until this point, we've had four physical disks in our system. Here, we've inserted a fifth healthy disk. The logical disk is 'Optimal', so no default action was taken and no option to add the physical disk to a logical disk is presented.

Newly inserted physical disk.

The one option that is presented, however, is 'Make Hot-Spare'.

New physical disk with 'Make Hot-Spare' highlighted.

Once marked as a hot-spare, the disk will be automatically used to replace a physical drive, should one of an equal or smaller size fail.

Physical disk now marked as a 'Hot-Spare, Spun up'.

Done!

Now the node will recover from any disk failure as quickly as technically possible.
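
The 'Make Hot-Spare' button corresponds to MegaCli's hot-spare function. A rough command-line equivalent for creating a global hot-spare, assuming adapter 0 and the fifth disk at '252:5' used in this example:

# Mark physical disk [252:5] as a global hot-spare.
MegaCli64 -PDHSP -Set -PhysDrv [252:5] -a0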

Example of a Hot-Spare Working

As we saw earlier, there are many ways for a logical disk to degrade. The hot-spare will be used automatically as soon as the array is degraded, regardless of the cause.

For this reason, we'll simply remove a physical disk currently in the logical disk. We should see immediately that the hot-spare has entered the array and rebuild has begun.

I will remove physical disk '[252:6]' from the array.

Physical disk '[252:6]' removed from the logical disk, former hot-spare physical disk '[252:5]' already in the logical disk and rebuilding.

At this point, we could eject [252:6] or spin it back up. I know it's healthy, so I will spin it up and mark it as a hot spare in preparation for the next section.
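
If you would like to watch the rebuild onto the former hot-spare from the command line, MegaCli can report its progress. A sketch, using the '252:5' disk that is rebuilding in this example:

# Show the rebuild progress and estimated time remaining.
MegaCli64 -PDRbld -ShowProg -PhysDrv [252:5] -a0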

Replacing a Failing Hot-Spare

As we discussed earlier, a drive may show signs of failing before it's actually failed. Of course, it could just as well die and get marked as 'Unconfigured(bad)' or vanish entirely.

In this case, we'll pretend the drive is predicted to fail.

Before we can spin down and remove a hot-spare disk, we must first click on 'Unmark as Hot-Spare'.

Hot-spare physical disk's Unmark as Hot-Spare highlighted.

Once you click it, the physical disk reverts to being 'Unconfigured(good)'. From here, we can click on 'Spin Down Disk' to prepare it for removal.

Physical disk's 'Spin Down Disk' highlighted.
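
A rough command-line equivalent for retiring a hot-spare, assuming adapter 0 and the '252:6' disk used in this example:

# Remove the hot-spare flag; the disk reverts to 'Unconfigured(good)'.
MegaCli64 -PDHSP -Rmv -PhysDrv [252:6] -a0

# Spin it down so it can be ejected safely.
MegaCli64 -PDPrpRmv -PhysDrv [252:6] -a0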

Once spun down, the disk can safely be ejected and replaced.

Physical disk '[252:6]' ready to be removed.

Done!

 
