Friday, February 15, 2008

Linux RAID1: Failing disk in boot set, part 1

I'm running software RAID on Linux, using the md module.

I have everything RAIDed, including /boot and swap.

The /boot and / filesystems are on a RAID1 set. The reason for that layout is the bootstrap problem: the kernel must be loaded before the md_mod module can be loaded and RAID enabled, but the kernel resides on the RAID set - catch-22! However, since RAID1 is pure mirroring, things work out: the bootloader treats the boot disk as if it weren't part of a RAID set and loads the kernel from /dev/sda1 in read-only mode; once the kernel is up, it loads and activates the RAID module, assembles the RAID set, and continues bringing up the system using what is now the /dev/md0 device.
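
For illustration, once the system is up you can inspect the assembled set like this (the device name /dev/md0 is just the one from my example layout):

    # Show the state of all md arrays, including any resync in progress
    cat /proc/mdstat

    # Show the details of the mirror holding /boot and /
    mdadm --detail /dev/md0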

So, if a disk fails, find out which one (find its serial number), chuck that out, stick in a new one, and boot up. Upon trying to assemble the RAID sets, the RAID module will discover that one of the devices is gone and simply bring up the set in degraded mode. You can then partition the new disk exactly like the one already present and add the new device back into the RAID set, as sketched below.
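
As a rough sketch (assuming the surviving disk is /dev/sda, the replacement is /dev/sdb and the mirror is /dev/md0 - substitute your own devices), the replacement goes something like this:

    # Copy the partition table from the surviving disk onto the new, empty disk
    sfdisk -d /dev/sda | sfdisk /dev/sdb

    # Add the new partition back into the degraded mirror
    mdadm /dev/md0 --add /dev/sdb1

    # Watch the resync progress
    cat /proc/mdstat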

Simple!

However, there is a bootstrap problem before the one described, which is what is really called "the boot process": when the computer starts up, the BIOS reads the first sector of the boot hard disk (typically the first disk in its discovery sequence, possibly changeable in the BIOS setup) into memory and runs it. This is where the bootloader resides.
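
If you're curious, you can peek at that first sector yourself; this only reads it, nothing is changed (assuming the boot disk is /dev/sda):

    # Dump the 512-byte MBR: the bootloader stub plus the partition table
    dd if=/dev/sda bs=512 count=1 | hexdump -C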

The only problem is that a normal installation puts the bootloader on that one boot hard disk only. This part of the disk isn't handled by the RAID module, so it won't be mirrored over to the other disk. (The partition table lives in that same sector too, and it can't be mirrored blindly either: you can't hard-copy the partition table from one disk onto a differently sized one.)
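
The usual way around this is to install the bootloader on both disks up front. With GRUB legacy, for instance (that you're running GRUB, and the (hd0)/(hd1) and /dev/sdb mapping, are assumptions about the setup), it can be done from the grub shell roughly like this:

    grub
    grub> device (hd0) /dev/sdb   # temporarily map the second disk as (hd0)
    grub> root (hd0,0)            # point at its /boot partition
    grub> setup (hd0)             # write the bootloader to its MBR
    grub> quit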

So, if a disk fails, you'd better hope that it is the other disk, not the boot disk!

I wasn't that lucky...!
