Replacing a Linux Software RAID1 Bootable Disk

I recently started to receive SMART e-mails indicating that one of the two drives in a bootable Linux software RAID1 array on my main desktop machine was likely to fail before too long. An Internet search produced several pages that included what were purported to be complete instructions on how to replace the failing drive. However, all seemed to be missing one or more steps, so I thought that I had better document exactly the steps that I followed to perform the replacement.

Note that my system has two md devices, each using two disks, and each disk has two partitions (see step 2 below).

1. Identify the device

The important part of one of the SMART e-mails looked like this:

This message was generated by the smartd daemon running on:

   host name:  homebrew
   DNS domain: [Empty]

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 1024 Currently unreadable (pending) sectors

Device info:
ST2000DM001-1CH164, S/N:Z340BANG, WWN:5-000c50-064a4855d, FW:CC27, 2.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Sun Mar 19 13:12:58 2017 MDT
Another message will be sent in 24 hours if the problem persists. 

It's a good idea to check that the identification of the device in /dev (in this case, /dev/sda) matches the information in the e-mail. This can be done with:
  ls -al /dev/disk/by-id

In this case, the failing device is /dev/sda, so that is the drive that will be replaced.
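The match can also be made mechanically, since the by-id symlink names encode the model and serial number. The listing below is an illustrative sample (the second entry is hypothetical), not output captured from my system:

```shell
# Match the serial number from the SMART e-mail (Z340BANG) against the
# by-id symlink names. The sample listing is illustrative; on a real
# system use:  ls -al /dev/disk/by-id
byid='ata-ST2000DM001-1CH164_Z340BANG -> ../../sda
ata-EXAMPLE_DRIVE_SERIAL0001 -> ../../sdb'
echo "$byid" | grep Z340BANG
```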

2. Determine which md devices use the failing drive

  cat /proc/mdstat

On my system, this produced:

Personalities : [raid1]
md1 : active raid1 sda2[3] sdb2[2]
      1937755968 blocks super 1.2 [2/2] [UU]
md0 : active raid1 sda1[3] sdb1[2]
      15615872 blocks super 1.2 [2/2] [UU]
unused devices: <none>

So the failing device (sda) is in use by md0 (which uses partition sda1) and md1 (which uses partition sda2).
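This mapping can also be extracted mechanically. Here is a sketch that runs awk over the mdstat lines shown above (embedded as sample input, since the live file varies by system):

```shell
# List each md device that uses a partition of the failing disk (sda).
# The sample input is the mdstat content shown above; on a live system,
# pipe in the real file instead:  cat /proc/mdstat
mdstat='md1 : active raid1 sda2[3] sdb2[2]
md0 : active raid1 sda1[3] sdb1[2]'
echo "$mdstat" | awk '/^md/ {
    for (i = 5; i <= NF; i++)
        if ($i ~ /^sda[0-9]/) { sub(/\[.*/, "", $i); print $1, "uses", $i }
}'
```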

3. Mark the drive as failed in the md devices

  mdadm --manage /dev/md0 --fail /dev/sda1
  mdadm --manage /dev/md1 --fail /dev/sda2

4. Mark the drive as removed in the md devices

  mdadm --manage /dev/md0 --remove /dev/sda1
  mdadm --manage /dev/md1 --remove /dev/sda2

5. Check the status

  cat /proc/mdstat
which should show that md0 and md1 are both no longer using /dev/sda.
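For reference, here is a sketch of what the degraded state looks like (the block counts are those from the earlier listing; the exact layout is illustrative). `[2/1]` and `[_U]` indicate that one of the two members of each array is missing:

```shell
# Illustrative degraded /proc/mdstat after sda1 and sda2 have been removed:
# [2/1] and [_U] mean one of the two array members is absent.
degraded='Personalities : [raid1]
md1 : active raid1 sdb2[2]
      1937755968 blocks super 1.2 [2/1] [_U]
md0 : active raid1 sdb1[2]
      15615872 blocks super 1.2 [2/1] [_U]
unused devices: <none>'
case "$degraded" in
    *sda*) echo "sda still present" ;;
    *)     echo "sda fully removed" ;;
esac
```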

6. Shut down

  shutdown -h now

or power off the system from the graphical desktop.

7. Replace the failing drive

Open the case; identify the failing drive using the identification information in the SMART e-mail; remove the drive; replace it with a new drive of at least the same size.

Note that drives that are nominally the same size may in fact have different usable capacities. When I created my RAID1 array, the instructions I followed contained no warning about this (I don't know whether that omission has since been corrected), so I partitioned the disks to use their entire capacity. That was a mistake, as I discovered: none of my spare drives was as large as the removed drive, even though they were all nominally the same size. Eventually, I ordered a new drive that was nominally 50% larger than the one I had removed (3 TB instead of 2 TB). Using such a large replacement drive wastes a lot of space, but it does ensure that the new drive will not fail at step 10 below.
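The trap is easy to demonstrate with exact byte counts. The numbers below are examples, not measurements from my drives; what matters is that the exact sizes of two nominally identical drives can differ:

```shell
# Two nominally "2 TB" drives with slightly different exact capacities
# (example byte counts, not real measurements). On a live system, get the
# exact size with:  blockdev --getsize64 /dev/sdX
old=2000398934016   # drive being replaced
new=2000365289472   # nominally identical spare
if [ "$new" -ge "$old" ]; then
    echo "spare is large enough"
else
    echo "spare is too small"
fi
```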

8. Power up the computer

9. Determine the new mapping between device identifiers and device letters

  cat /proc/mdstat
and note the sd<x> identifier for the good drive that is currently the (sole) drive in the arrays. We do this now because the sd<x> identifiers of the drives might have changed across the reboot (and in fact on my system, they did change).

I will call this drive sd<good>.

  ls -al /dev/disk/by-id 
and write down the /dev/sd<x> identifier of the new drive. I will call this drive sd<new>.

10. Copy the partition table from /dev/sd<good> to /dev/sd<new>

  sfdisk -d /dev/sd<good> | sfdisk /dev/sd<new>
If the drive sd<new> is not at least as large as the disk that it replaced, this command will fail. There are ways to work around a failure at this point, but none of them is pleasant; frankly, the best course is to do as I did and quickly obtain a drive that is known to be large enough.

11. Check that things look correct

  fdisk -l
This lists the partition information for all the drives on the system. Check that the partition tables for /dev/sd<good> and /dev/sd<new> are identical.
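The comparison can also be done with diff. The dumps below are illustrative samples of `sfdisk -d` output; on a live system, capture them with `sfdisk -d /dev/sd<good>` and `sfdisk -d /dev/sd<new>`. Since the device names differ between the two drives, compare only the fields after the colon (start, size, type):

```shell
# Check that two sfdisk dumps describe the same partition layout.
# The dumps below are illustrative samples, not real captures.
good_dump='/dev/sdb1 : start=2048, size=31231744, Id=fd
/dev/sdb2 : start=31233792, size=3875511936, Id=fd'
new_dump='/dev/sdc1 : start=2048, size=31231744, Id=fd
/dev/sdc2 : start=31233792, size=3875511936, Id=fd'
if diff <(echo "$good_dump" | cut -d: -f2-) \
        <(echo "$new_dump"  | cut -d: -f2-) > /dev/null
then
    echo "partition tables match"
else
    echo "partition tables differ"
fi
```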

12. Add the new drive to the RAID arrays

  mdadm --manage /dev/md0 --add /dev/sd<new>1
  mdadm --manage /dev/md1 --add /dev/sd<new>2

13. Check that things look correct

  cat /proc/mdstat
At this point, one of the RAID arrays will be resilvering; the other will either have already completed or be marked as DELAYED, which means that it is waiting for the first array to finish its resilver.
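Here is a sketch of what the rebuild looks like in /proc/mdstat (all the numbers are illustrative): one array shows a recovery progress line, the other is marked DELAYED until the first finishes:

```shell
# Illustrative /proc/mdstat content while the arrays rebuild; the progress
# figures are examples, not real output.
rebuilding='md1 : active raid1 sdc2[3] sdb2[2]
      1937755968 blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  1.4% (27900032/1937755968) finish=212.5min speed=149762K/sec
md0 : active raid1 sdc1[3] sdb1[2]
      15615872 blocks super 1.2 [2/1] [_U]
        resync=DELAYED'
echo "$rebuilding" | grep -E 'recovery|DELAYED'
```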

14. Make the new drive bootable 

  dd bs=512 count=1 if=/dev/sd<new> 2> /dev/null | strings

This should produce no output, as /dev/sd<new> is a clean disk.

Now execute:
  grub-install /dev/sd<new>
to install the boot loader on the new disk.

15. Check that the new drive is bootable

  dd bs=512 count=1 if=/dev/sd<new> 2> /dev/null | strings

This should now produce output that includes one or more references to GRUB, such as:

Hard Disk

At this point the replacement is complete and the md devices are functional. It usually takes a few hours for the resilvering to complete (i.e., to make the two drives in the RAID arrays identical), but this happens behind the scenes, and the system is fully usable in the meantime.
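The boot-sector check from step 15 can be automated. The sample below is typical of the strings found in a GRUB-equipped MBR, used here for illustration; on a live system, generate the real list with `dd bs=512 count=1 if=/dev/sd<new> 2> /dev/null | strings`:

```shell
# Sketch of an automated boot-sector check: look for "GRUB" among the
# strings in the first 512 bytes. Sample strings output, for illustration.
mbr_strings='GRUB
Geom
Hard Disk
Read
 Error'
echo "$mbr_strings" | grep -q GRUB && echo "GRUB boot code present"
```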

15a. Actually boot from the new drive

If you are paranoid (as I am), it is a good idea, after resilvering has completed, to reboot, changing the BIOS settings temporarily so that the system is forced to boot from the new disk. Once this has been confirmed, the BIOS settings can be returned to the original configuration.
