Hard drive has gone bad in DRBD


So you've lost or are losing a hard drive in one of the cluster nodes.

Steps needed:

  1. Identify the failed drive. This example will use /dev/sda on Node1.
  2. Migrate the hosted VMs to the healthy node. This document will migrate from Node1 to Node2.
  3. Break the RAID 1 mirror by removing the defective drive from the affected MD device. Here we will remove /dev/sda from the /dev/md0 device.
  4. Power off the defective server, physically replace the affected drive and power the repaired server back on.
  5. Add the replaced /dev/sda into /dev/md0 and begin the RAID 1 rebuild procedure.
  6. Migrate the virtual servers back onto the repaired node.

Identifying the Failed Drive

SMART Control

If it's not clear, check the drives' states using smartctl. For each questionable drive, run:

smartctl -a /dev/sda

Replace sda with the drive you want to examine. You should see output like:

Good Drive

Good Drive (/dev/sdb):

smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST31500341AS
Serial Number:    9VS1XL54
Firmware Version: CC1H
User Capacity:    1,500,301,910,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed Feb  3 12:35:54 2010 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 ( 617) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x103f)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   113   099   006    Pre-fail  Always       -       55840141
  3 Spin_Up_Time            0x0003   100   099   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       101
  5 Reallocated_Sector_Ct   0x0033   097   097   036    Pre-fail  Always       -       151
  7 Seek_Error_Rate         0x000f   079   060   030    Pre-fail  Always       -       89466942
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3584
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       102
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   093   000    Old_age   Always       -       71
189 High_Fly_Writes         0x003a   066   066   000    Old_age   Always       -       34
190 Airflow_Temperature_Cel 0x0022   069   052   045    Old_age   Always       -       31 (Lifetime Min/Max 23/48)
194 Temperature_Celsius     0x0022   031   048   000    Old_age   Always       -       31 (0 20 0 0)
195 Hardware_ECC_Recovered  0x001a   032   024   000    Old_age   Always       -       55840141
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       51737176051199
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       1591700793
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       95914747

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         3         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Bad Drive Output

Bad Drive (/dev/sda):

smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST31500341AS
Serial Number:    9VS1Q4Q3
Firmware Version: CC1H
User Capacity:    1,500,301,910,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed Feb  3 12:37:38 2010 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 ( 609) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x103f)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   099   006    Pre-fail  Always       -       229324280
  3 Spin_Up_Time            0x0003   100   099   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       95
  5 Reallocated_Sector_Ct   0x0033   096   096   036    Pre-fail  Always       -       191
  7 Seek_Error_Rate         0x000f   065   058   030    Pre-fail  Always       -       30092885574
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3541
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       95
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   076   076   000    Old_age   Always       -       24
188 Unknown_Attribute       0x0032   100   097   000    Old_age   Always       -       4295032890
189 High_Fly_Writes         0x003a   001   001   000    Old_age   Always       -       241
190 Airflow_Temperature_Cel 0x0022   067   054   045    Old_age   Always       -       33 (Lifetime Min/Max 23/46)
194 Temperature_Celsius     0x0022   033   046   000    Old_age   Always       -       33 (0 21 0 0)
195 Hardware_ECC_Recovered  0x001a   026   026   000    Old_age   Always       -       229324280
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       5
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       5
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       6
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       229084965637588
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       2647572671
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       3166798893

SMART Error Log Version: 1
ATA Error Count: 24 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 24 occurred at disk power-on lifetime: 1785 hours (74 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  19d+13:50:04.321  READ DMA EXT
  27 00 00 00 00 00 e0 00  19d+13:50:04.291  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02  19d+13:50:04.283  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02  19d+13:50:04.238  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  19d+13:50:04.115  READ NATIVE MAX ADDRESS EXT

Error 23 occurred at disk power-on lifetime: 1785 hours (74 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  19d+13:50:01.385  READ DMA EXT
  27 00 00 00 00 00 e0 00  19d+13:50:01.355  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02  19d+13:50:01.347  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02  19d+13:50:01.325  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  19d+13:50:01.275  READ NATIVE MAX ADDRESS EXT

Error 22 occurred at disk power-on lifetime: 1785 hours (74 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  19d+13:49:58.449  READ DMA EXT
  27 00 00 00 00 00 e0 00  19d+13:49:58.419  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02  19d+13:49:58.411  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02  19d+13:49:58.366  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  19d+13:49:58.247  READ NATIVE MAX ADDRESS EXT

Error 21 occurred at disk power-on lifetime: 1785 hours (74 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  19d+13:49:55.488  READ DMA EXT
  27 00 00 00 00 00 e0 00  19d+13:49:55.459  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02  19d+13:49:55.451  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02  19d+13:49:55.428  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  19d+13:49:55.379  READ NATIVE MAX ADDRESS EXT

Error 20 occurred at disk power-on lifetime: 1785 hours (74 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00  19d+13:49:52.540  READ DMA EXT
  27 00 00 00 00 00 e0 00  19d+13:49:52.511  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02  19d+13:49:52.488  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02  19d+13:49:52.463  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  19d+13:49:52.420  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

You will notice in the above example that the drive's SMART status is OK, but that it has generated errors. These errors were enough to degrade performance to the point where a fence was triggered against the affected node.

Locating the Physical Drive

Once you know the logical block device path, /dev/sda here, you will need to locate its physical position in the server. To do this, reference the docs for the affected server. This mapping should have been recorded when the node was built.

If it wasn't, first go kick the admin in the shins. Next, you will need to guess which drive is which. We can make an educated guess, though, because the above output includes the Serial Number (9VS1Q4Q3 above). In fact, reference the Serial Number anyway, in case the OS has changed things up at some point.

  • Node2
  • Node1

Using the serial number and the docs for Node1, we know that the affected drive is in Tray 1.
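
If the serial-to-tray mapping was never recorded, a short loop like the one below will print each drive's serial number so it can be matched against the tray labels once the node is open. This is only a sketch; adjust the drive list to match your node.

for drive in /dev/sda /dev/sdb; do
    echo -n "$drive: "
    smartctl -i $drive | grep "Serial Number"
done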

Migrate the VMs off of the Affected Node

Note: Re-write this to use clusvcadm.

From either node, preferably the good node, ssh into it with X-forwarding enabled and then start convirt. In our example, Node1 is affected, so we will connect to Node2.

From your PC:

ssh root@an-node02.alteeve.com -X

Once on Node2:

convirt &

Convirt

With convirt running, connect to each node by clicking on their names under the Servers item. This will prompt you to enter the node's root password.

Then, for each VM on the affected node, do the following:

  1. Click to highlight the VM.
  2. Click on Migrate.
  3. Select the good node as the destination. This is Node2 in this example.
  4. Confirm the live migration.
    1. Note: The migration could take some time, so be sure to warn John or whoever might be using the affected VM prior to initiating the migration. No processes will need to be stopped, but to the user, the VM will appear to "freeze" for the duration of the migration.

xm

If you do not want to use convirt, you can use the xm command line tool to perform the migration procedure. The syntax is:

UNTESTED!

#xm migrate [domain_id] [host] -l
xm migrate sql01 an-node02 -l
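
Once the migration completes, running xm list on both nodes should show the domain on Node2 only. Also, per the note above about clusvcadm, if the VMs are managed as rgmanager services then the cluster-aware way to move one is a call like the sketch below. The service name vm:sql01 is an assumption; match it to whatever is defined in your cluster.conf.

# Confirm where the domain is now running.
xm list

# Hedged rgmanager alternative; the 'vm:sql01' service name is an assumption.
clusvcadm -M vm:sql01 -m an-node02.alteeve.com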

Break the RAID Arrays

We will tell the array that the drive /dev/sda is no longer usable.

Confirming Your Partition Structure

Continuing our example, we will need to replace /dev/sda, which has four partitions:

/dev/sda1; In /dev/md0 -   250MB  '/boot'  partition
/dev/sda2; In /dev/md2 - 20000MB  '/'      partition
/dev/sda3; In /dev/md1 -  2000MB  '<swap>' partition
/dev/sda5; In /dev/md3 -   1.4TB  '<LVM>'  partition

Confirm the above configuration by checking /proc/mdstat:

cat /proc/mdstat

Which should show:

md0 : active raid1 sdb1[1] sda1[0]
      256896 blocks [2/2] [UU]
      
md1 : active raid1 sdb3[1] sda3[0]
      2048192 blocks [2/2] [UU]
      
md3 : active raid1 sdb5[1] sda5[0]
      1442347712 blocks [2/2] [UU]
      
md2 : active raid1 sdb2[1] sda2[0]
      20482752 blocks [2/2] [UU]
      
unused devices: <none>

Depending on how your drive has failed, you may see one or more entries with: [_U]. If this is the case, the corresponding partition may be absent or, if there, will look like: sda2[2](F).
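
If you want more detail on any one array than /proc/mdstat gives, mdadm can report on it directly. For example, to see the state of each member of /dev/md0:

mdadm --detail /dev/md0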

Failing The mdX Devices

Given these four partitions, we will need to run the following commands to mark each one as failed in its respective /dev/mdX device. Adapt this to your needs:

mdadm --fail /dev/md0 /dev/sda1
mdadm --fail /dev/md1 /dev/sda3
mdadm --fail /dev/md2 /dev/sda2
mdadm --fail /dev/md3 /dev/sda5

Confirm that all the arrays are now broken by again running:

cat /proc/mdstat

Which should show:

md0 : active raid1 sdb1[1] sda1[0](F)
      256896 blocks [2/1] [_U]
      
md1 : active raid1 sdb3[1] sda3[0](F)
      2048192 blocks [2/1] [_U]
      
md3 : active raid1 sdb5[1]
      1442347712 blocks [2/1] [_U]
      
md2 : active raid1 sdb2[1] sda2[0](F)
      20482752 blocks [2/1] [_U]
      
unused devices: <none>

In the above example, sda1, sda2 and sda3 were failed by the mdadm --fail call, while sda5 has failed in such a way that it is not visible at all.
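
The --fail calls above only mark the partitions as failed; they are still listed as members of their arrays. If you would rather have them dropped from the arrays entirely before the drive is pulled, an optional extra step (not part of the original procedure above, so adapt as needed) is to remove each failed partition as well. A partition that has already vanished from its array, like sda5 above, will simply return an error saying there is nothing to remove.

mdadm --remove /dev/md0 /dev/sda1
mdadm --remove /dev/md1 /dev/sda3
mdadm --remove /dev/md2 /dev/sda2
mdadm --remove /dev/md3 /dev/sda5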

Replace The Defective Drive

With the knowledge of the defective drive's serial number and port in hand, power off the server.

DO NOT POWER IT BACK ON WHILE CONNECTED TO THE NETWORK!!!

Under no circumstances do we want the cluster to re-assemble until after the defective drive has been replaced, re-added to the array and confirmed good!

Prepare the Replacement Drive

I prefer to pre-partition the replacement drive on a separate workstation, but this can be done safely in the server itself once it's been installed. If you wish to delay partitioning until then, skip to the next step and return here once you reach #Power the Node Back in SINGLE USER MODE below.

Ensure the New Drive is Blank

In my case, the replacement drive comes up on my workstation as /dev/sdb. If yours is different, simply replace sdb with your drive letter.

We will wipe the drive by writing 10,000 blocks from /dev/zero to it using dd. First, confirm the drive is where we expect it:

fdisk -l
Disk /dev/sda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x13662e6d

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1        1216     9767488+   7  HPFS/NTFS
/dev/sda2            1217        1340      996030   82  Linux swap / Solaris
/dev/sda3            1341        9729    67384642+  83  Linux

Disk /dev/sdb: 1500.3 GB, 1500301910016 bytes
255 heads, 63 sectors/track, 182401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x000f0012

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          32      257008+  fd  Linux raid autodetect
/dev/sdb2              33         287     2048287+  fd  Linux raid autodetect
/dev/sdb3             288      182401  1462830705   fd  Linux raid autodetect

In the case above, the replacement drive had three partitions on it. New drives will usually be blank. Also, I know that /dev/sdb is the right drive by looking at the capacities. I could further confirm this using smartctl -a /dev/sdb if I had any doubt.

Now blank the drive:

dd if=/dev/zero of=/dev/sdb count=10000
10000+0 records in
10000+0 records out
5120000 bytes (5.1 MB) copied, 1.40715 s, 3.6 MB/s

Now confirm that the drive is clear by re-running fdisk -l:

fdisk -l
Disk /dev/sda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x13662e6d

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1        1216     9767488+   7  HPFS/NTFS
/dev/sda2            1217        1340      996030   82  Linux swap / Solaris
/dev/sda3            1341        9729    67384642+  83  Linux

Disk /dev/sdb: 1500.3 GB, 1500301910016 bytes
255 heads, 63 sectors/track, 182401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

Disk /dev/sdb doesn't contain a valid partition table

Perfect!

Create a Duplicate Partition Structure

Now we need to create the new partitions in such a way that they identically mimic the old drive.

To do this, run fdisk against a good drive and take note of the start and end cylinders for each good partition. These will be our guide when re-creating the partition scheme on the replacement drive.

Here is the output from a good drive on the surviving node:

fdisk -l /dev/sda
Disk /dev/sda: 1500.3 GB, 1500301910016 bytes
255 heads, 63 sectors/track, 182401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          32      257008+  fd  Linux raid autodetect
/dev/sda2              33        2582    20482875   fd  Linux raid autodetect
/dev/sda3            2583        2837     2048287+  fd  Linux raid autodetect
/dev/sda4            2838      182401  1442347830    5  Extended
/dev/sda5            2838      182401  1442347798+  fd  Linux raid autodetect

The Start and End columns have the values we will need to set for the corresponding partitions.
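
As an alternative to re-entering these values by hand, the partition table can be copied with sfdisk. This is a sketch under the assumption that both drives are the same size and use standard MBR partition tables; the file name sda.parts is arbitrary.

# On the surviving node, dump the good drive's partition table:
sfdisk -d /dev/sda > sda.parts

# Copy sda.parts to wherever the replacement drive is installed, then write it out:
sfdisk /dev/sdb < sda.parts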

Below is a fairly large dump from my terminal using fdisk. I prefer this method over graphical tools as I can be very precise this way. I'll assume here that you are familiar with the fdisk shell. If not, GO LEARN before proceeding!

Start the fdisk shell:

fdisk /dev/sdb

And build the partitions:

Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0x2d841b1b.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.


The number of cylinders for this disk is set to 182401.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-182401, default 1): 
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-182401, default 182401): 32

Command (m for help): a
Partition number (1-4): 1

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 2
First cylinder (33-182401, default 33): 
Using default value 33
Last cylinder, +cylinders or +size{K,M,G} (33-182401, default 182401): 2582

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 3
First cylinder (2583-182401, default 2583): 
Using default value 2583
Last cylinder, +cylinders or +size{K,M,G} (2583-182401, default 182401): 2837

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
e
Selected partition 4
First cylinder (2838-182401, default 2838): 
Using default value 2838
Last cylinder, +cylinders or +size{K,M,G} (2838-182401, default 182401): 
Using default value 182401

Command (m for help): n
First cylinder (2838-182401, default 2838): 
Using default value 2838
Last cylinder, +cylinders or +size{K,M,G} (2838-182401, default 182401): 
Using default value 182401

Command (m for help): t
Partition number (1-5): 1
Hex code (type L to list codes): fd
Changed system type of partition 1 to fd (Linux raid autodetect)

Command (m for help): t
Partition number (1-5): 2
Hex code (type L to list codes): fd
Changed system type of partition 2 to fd (Linux raid autodetect)

Command (m for help): t
Partition number (1-5): 3
Hex code (type L to list codes): fd
Changed system type of partition 3 to fd (Linux raid autodetect)

Command (m for help): t
Partition number (1-5): 5
Hex code (type L to list codes): fd
Changed system type of partition 5 to fd (Linux raid autodetect)

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.

What was done above, in short, was:

  1. Created the first partition as primary, ending on cylinder 32.
  2. Set the first partition to be bootable.
  3. Created the remaining partitions as primary, primary, extended and, within the extended partition, a logical partition.
  4. Changed the type of partitions 1, 2, 3 and 5 to fd, Linux raid autodetect.

Confirm that the replacement drive now matches what the remaining good drive is partitioned as:

fdisk -l /dev/sdb
/dev/sdb1   *           1          32      257008+  fd  Linux raid autodetect
/dev/sdb2              33        2582    20482875   fd  Linux raid autodetect
/dev/sdb3            2583        2837     2048287+  fd  Linux raid autodetect
/dev/sdb4            2838      182401  1442347830    5  Extended
/dev/sdb5            2838      182401  1442347798+  fd  Linux raid autodetect

Perfect! Now we can install it in place of the defective drive.

Power Off and Unrack

poweroff

Unrack the server and move it to a work area.

Replace the Drive

Remove the drive you suspect to be the failed one. Confirm it is the right one by comparing the Serial Number reported by the smartctl -a /dev/sda call from step 1 (switch sda for your drive, of course). Once you have confirmed that the proper drive is in hand, remove it from its carrier and set it aside to process for RMA later. Install the replacement drive and UPDATE THE DOCS!

Power the Node Back in SINGLE USER MODE

With the server on your work-bench and not connected to any network, power on the server.

If You Don't Get the Grub Screen

If you replaced the first drive (sda), then there is a good chance the node will not boot but will instead appear to hang with a black screen. This happens because the replacement drive is flagged bootable but has no data. To get around this, bring up the Boot Device (BBS) prompt. On most systems, including Node2 and Node1, this is done by pressing <F8> during the POST. Once you get the Boot Device list, select the second hard drive to boot from and then proceed with the next step.

Boot as Single User

Interrupt the Grub boot screen by pressing <esc> at the appropriate time. With the default kernel selected, press e to edit it, then select the kernel line and press e again. Append the word " single" to the end of the line (note the leading space), press <enter> to accept the change, then press b to boot.

Recovering the RAID Arrays Manually

In this case, the node failed to boot. Booting from the rescue DVD, I was able to rebuild the arrays manually. This hasn't solved the boot problem yet, but I'll get back to that tomorrow.

RAID Rebuild in Rescue Mode

To add a replacement disk to a busted array under the CentOS DVD in rescue mode, you need to start by writing out a skeleton /etc/mdadm.conf file. Here is one compatible with the Node2 and Node1 nodes.

Note that in this case, /dev/sda had been replaced and /dev/md0 wouldn't assemble because, for some reason, mdadm was detecting a stale superblock on /dev/sda1. As you will see below, the /dev/sda1 entry has to be temporarily dropped from the /dev/md0 line to work around this.

Create the following /etc/mdadm.conf:

vi /etc/mdadm.conf
ARRAY /dev/md0 level=raid1 num-devices=2 devices=/dev/sdb1,/dev/sda1
ARRAY /dev/md1 level=raid1 num-devices=2 devices=/dev/sdb3,/dev/sda3
ARRAY /dev/md2 level=raid1 num-devices=2 devices=/dev/sdb2,/dev/sda2
ARRAY /dev/md3 level=raid1 num-devices=2 devices=/dev/sdb5,/dev/sda5

Now re-assemble the array:

mdadm --assemble --scan
mdadm: superblock on /dev/sda1 doesn't match others - assembly aborted
mdadm: /dev/md1 has been started with 1 drive (out of 2).
mdadm: /dev/md2 has been started with 1 drive (out of 2).
mdadm: /dev/md3 has been started with 1 drive (out of 2).

You noticed the error above? To fix this, edit /etc/mdadm.conf and remove the ,/dev/sda1 from the /dev/md0 line. It should now look like:

ARRAY /dev/md0 level=raid1 num-devices=2 devices=/dev/sdb1
ARRAY /dev/md1 level=raid1 num-devices=2 devices=/dev/sdb3,/dev/sda3
ARRAY /dev/md2 level=raid1 num-devices=2 devices=/dev/sdb2,/dev/sda2
ARRAY /dev/md3 level=raid1 num-devices=2 devices=/dev/sdb5,/dev/sda5

Zero-out the /dev/sda1 partition with this command:

dd if=/dev/zero of=/dev/sda1 count=1000
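
If you prefer a more targeted approach than dd, mdadm can clear a stale md superblock from a partition directly (the partition must not be in use by an array at the time):

mdadm --zero-superblock /dev/sda1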

Then re-assemble the /dev/md0 array with only /dev/sdb1 specified:

mdadm --assemble --scan /dev/md0
mdadm: /dev/md0 has been started with 1 drive (out of 2).

Good, now add the /dev/sda1 partition back to the /dev/md0 line in /etc/mdadm.conf. Once it's back, add /dev/sda1 to the /dev/md0 array to start the sync process.

mdadm --manage /dev/md0 --add /dev/sda1
mdadm: added /dev/sda1

If this worked, you should be able to cat /proc/mdstat and see something like:

md0 : active raid1 sda1[0] sdb1[1]
      256896 blocks [2/2] [UU]

The other entries will show degraded arrays. Now that we've gotten this far, add the new partitions to the rest of the arrays:

mdadm --manage /dev/md1 --add /dev/sda3
mdadm --manage /dev/md2 --add /dev/sda2
mdadm --manage /dev/md3 --add /dev/sda5

You can now watch the arrays sync with watch:

watch cat /proc/mdstat

Depending on the speed of your drives, you will probably see one of the arrays sync'ing and possibly one of the others waiting to sync.

Every 2s: /proc/mdstat Wed Feb  3 22:02:07 2010

Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] 
md3 : active raid1 sda5[0] sdb5[1]
      1442347712 blocks [2/1] [_U]
      [==>..................]  recovery = 10.4% (151696560/1442347712) finish=217.4min speed=105770K/sec

md2 : active raid1 sda2[0] sdb2[1]
      20482753 blocks [2/2] [UU]

md1 : active raid1 sda3[0] sdb3[1]
      2048192 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      256896 blocks [2/2] [UU]

You can see above how the /dev/md3 array is still sync'ing. You can reboot at this point, but I prefer to wait when I can afford the time.
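
If you do decide to wait, the rebuild progress can also be checked per-array with mdadm; once the rebuild finishes, the array's state returns to clean. For example:

mdadm --detail /dev/md3 | grep -E "State|Rebuild Status"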

More to come

...

 
