
Saturday, April 23, 2016

Simplifying hard drive layout

Whew, it's been a while since I've posted here...Gotta love Gentoo, I'm still running my original install from sometime back in 2006 (my oldest raid superblock says the array was created "Thu Sep  7 18:41:05 2006").  The Gentoo install is probably older than that, because I know I didn't start out using kernel raid at all.  So I've gone from a single drive, to a 2-drive mirror, to adding a 4-disk raid5, to swapping the original mirror drives for bigger ones and creating a couple of Frankenstein arrays: those 2 new disks were partly mirrored and partly added to the original 4-disk raid5 array, which was then converted to raid6, giving me double redundancy and one more 320GB slice of capacity.  It is this mess that is the topic of today's post.  I realized I don't need as much storage as I have, the drives in my system are in some cases nearly 10 years old and way out of warranty, and I wanted a simpler setup that would use less electricity.  So I did some research, bought 2 new 1TB drives, and we're going to migrate everything onto a simple mirror of those 2 drives.

Sidebar: I was originally looking into getting NAS drives, but after a bit of research I decided on WD Black drives.  For a raid5/6 array you want NAS drives, but for a raid 0/1 setup you don't want NAS drives that support TLER.

Important Note: Before we do anything, we make sure we have backups of our important data, right?  And we all know that raid isn't a backup, right?  Moving on...
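
One more bit of prep before diving in: it doesn't hurt to stash the current raid and LVM layout somewhere safe before starting surgery.  A minimal sketch of what I mean (the file names are just examples, not from my actual session):
mdadm --detail --scan > /root/premigration-mdadm.txt
vgcfgbackup -f /root/premigration-vg.cfg vg
sfdisk -d /dev/sde > /root/premigration-sde-partitions.txt
cat /proc/mdstat /proc/swaps > /root/premigration-state.txt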

OK, so to recap, this is the current situation:

erma ~ # mdadm --detail /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Thu Sep  7 18:41:05 2006
     Raid Level : raid6
     Array Size : 1250274304 (1192.35 GiB 1280.28 GB)
  Used Dev Size : 312568576 (298.09 GiB 320.07 GB)
   Raid Devices : 6
  Total Devices : 5
Preferred Minor : 0
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Fri Apr 22 16:05:07 2016
          State : clean, degraded
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 128K

           UUID : fdc29307:ba90c91c:d9adde8d:723321bc
         Events : 0.2437894

    Number   Major   Minor   RaidDevice State
       0       8       68        0      active sync   /dev/sde4
       2       0        0        2      removed
       2       8       17        2      active sync   /dev/sdb1
       3       8       33        3      active sync   /dev/sdc1
       4       8        1        4      active sync   /dev/sda1
       5       8       84        5      active sync   /dev/sdf4
erma ~ # mdadm --detail /dev/md1
/dev/md1:
        Version : 0.90
  Creation Time : Sat Nov 23 10:12:58 2013
     Raid Level : raid1
     Array Size : 39040 (38.13 MiB 39.98 MB)
  Used Dev Size : 39040 (38.13 MiB 39.98 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Fri Apr 22 15:37:36 2016
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 3979d926:51bc94d9:cb201669:f728008a
         Events : 0.361

    Number   Major   Minor   RaidDevice State
       0       8       65        0      active sync   /dev/sde1
       1       8       81        1      active sync   /dev/sdf1
erma ~ # mdadm --detail /dev/md3
/dev/md3:
        Version : 0.90
  Creation Time : Sun Mar 18 17:18:46 2007
     Raid Level : raid1
     Array Size : 155244032 (148.05 GiB 158.97 GB)
  Used Dev Size : 155244032 (148.05 GiB 158.97 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 3
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Fri Apr 22 16:14:53 2016
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 950ef5a7:b7b41171:b9e86deb:7164bf0e
         Events : 0.2168580

    Number   Major   Minor   RaidDevice State
       0       8       67        0      active sync   /dev/sde3
       1       8       83        1      active sync   /dev/sdf3


So we have my bulk data array (raid6) and my boot and root arrays (raid1), respectively.  I am going to migrate all of this onto a new raid1 array.  You'll notice md0 is already running degraded: I previously shut the system down, removed one of the 320GB drives and replaced it with the first new 1TB drive.  (I don't have any free SATA ports, so I had to swap an old drive for a new one.  I pulled a drive that is only in the raid6 array, since that array has double redundancy, meaning I could still lose any other drive in the system and be OK.)

First, I need to find the new 1TB drive's device name; fdisk -l is our friend here:

Disk /dev/sdd: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes



This looks promising: it's about the right size and there's no partition table.  Let's check some more info just to be sure:
erma ~ # hdparm -i /dev/sdd

/dev/sdd:

 Model=WDC WD1003FZEX-00MK2A0, FwRev=01.01A01, SerialNo=WD-WCC3F7KJUCTH
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=0
 BuffType=unknown, BuffSize=unknown, MaxMultSect=16, MultSect=off
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1953525168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: Reserved:  ATA/ATAPI-1,2,3,4,5,6,7

 * signifies the current active mode
 

 Looks good.  Every other existing drive was a Seagate.  
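
As an aside, with this many drives in the box, lsblk or the /dev/disk/by-id symlinks can be a quicker way to match a model and serial number to a device letter.  A hedged example, not part of the original session (available columns vary with your util-linux version):
lsblk -o NAME,SIZE,MODEL,SERIAL
ls -l /dev/disk/by-id/ | grep -i WDC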

After some math and some trial and error I arrived at the following partition layout:
Device     Boot      Start        End    Sectors   Size Id Type
/dev/sdd1             2048     206847     204800   100M fd Linux raid autodetect
/dev/sdd2           206848 1947412848 1947206001 928.5G fd Linux raid autodetect
/dev/sdd3       1947414528 1953525167    6110640   2.9G 82 Linux swap / Solaris
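
However you arrive at the numbers (I poked at mine by hand), something like the following sfdisk script should reproduce the same layout non-interactively, assuming a util-linux new enough to accept the script syntax (sizes are lifted straight from the table above):
sfdisk /dev/sdd <<'EOF'
label: dos
start=2048,       size=204800,     type=fd
start=206848,     size=1947206001, type=fd
start=1947414528,                  type=82
EOF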


Let's create the 2 new arrays.
erma ~ # mdadm --create /dev/md10 --level=1 --metadata=0.90 --raid-devices=2 missing /dev/sdd1
mdadm: array /dev/md10 started.
erma ~ # mdadm --create /dev/md11 --level=1 --metadata=0.90 --raid-devices=2 missing /dev/sdd2
mdadm: array /dev/md11 started.



Format /dev/md10, which will become /boot, as ext2:

erma ~ # mkfs.ext2 /dev/md10

Create a single LVM PV out of the main data array, /dev/md11:
erma ~ # pvcreate /dev/md11

My old root filesystem wasn't LVM but the new one will be.  I also had several other mounts carved out from the "bulk" raid6 array.  I'll create a new LV for root, but for the existing LVs I can use the LVM tools to migrate the data from the old drives to the new one.

Add the new PV to the existing VG:
erma ~ # vgextend vg /dev/md11
  Volume group "vg" successfully extended
erma ~ # pvs
  PV         VG   Fmt  Attr PSize   PFree
  /dev/md0   vg   lvm2 a--    1.16t 245.85g
  /dev/md11  vg   lvm2 a--  928.50g 928.50g


Tell LVM to migrate the LVs from the old PV to the new PV (this is going to take a while...):
erma ~ # pvmove --atomic /dev/md0 /dev/md11
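
pvmove prints its progress periodically on its own; if you want more frequent updates, or want to see which LV it is currently shuffling, something along these lines works (hedged, not from my actual run):
pvmove -i 30 --atomic /dev/md0 /dev/md11   # -i N reports progress every N seconds
lvs -a -o +devices                         # run in another shell; shows the pvmove LV and what still lives on md0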

Now all the LVM data has been moved off of the /dev/md0 array and I can remove it from the VG and delete the array.
erma ~ # vgreduce vg /dev/md0
  Removed "/dev/md0" from volume group "vg"

erma ~ # pvremove /dev/md0
  Labels on physical volume "/dev/md0" successfully wiped
erma ~ # mdadm --stop /dev/md0
mdadm: stopped /dev/md0
erma ~ # mdadm --remove /dev/md0

Now I just need to copy the old boot and root filesystems, which weren't on LVM, to the new disk.  The new boot array is still not LVM, but the new root partition (which I'm about to create) will be an LV, so we'll just rsync the data over.
lvcreate -L 150G -n root vg /dev/md11
mkfs.xfs /dev/vg/root

mkdir /mnt/root
mount /dev/vg/root /mnt/root
rsync -avxHAXW --info=progress2 / /mnt/root/
Now, the system is still running, so some of the files being copied will change during or after the copy.  I'll need to shut down the system, boot off a livecd, mount the filesystems and re-run the same rsync command so it picks up anything that changed or was missed since the first run, but doing it this way minimizes down-time.
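
For reference, the livecd pass will look something along these lines (a sketch rather than a transcript, using the same device names as above; --delete is the important addition so files removed since the first pass don't linger on the new root):
mdadm --assemble --scan
vgchange -ay
mkdir -p /mnt/oldroot /mnt/newroot
mount /dev/md3 /mnt/oldroot        # old (non-LVM) root array
mount /dev/vg/root /mnt/newroot    # new root LV
rsync -avxHAXW --delete --info=progress2 /mnt/oldroot/ /mnt/newroot/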

At this point I am going to shut down the server, remove the remaining 3 320GB drives and add the second new 1TB drive.  After booting up, some drive letters will change (but everything will be fine and mount without issues, because you use UUIDs in your fstab instead of device names, right?): the first new drive (what was /dev/sdd before) is now /dev/sdb and the second (just added) new drive is /dev/sda.
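
If your fstab isn't already UUID-based, blkid will list UUID=... for every filesystem and swap signature it finds, and the fstab entries end up looking something like this (mount options here are placeholders, not copied from my real fstab):
UUID=<uuid-of-md10>     /boot   ext2   noauto,noatime   1 2
UUID=<uuid-of-root-lv>  /       xfs    noatime          0 1
UUID=<uuid-of-swap>     none    swap   sw,pri=1         0 0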

Let's copy the partition layout from sdb to the newly added sda:
erma ~ # sfdisk -d /dev/sdb | sfdisk /dev/sda

Add the 2 missing partitions to the boot and LVM arrays; they'll immediately start syncing to bring the arrays back to 100%.  I'll also issue a couple of commands to raise the sync speed limits since, by default, the resync doesn't run as fast as it can so that it doesn't put a large drag on your system.  I'm not doing anything else important and want the sync to finish ASAP.
erma ~ # mdadm --add /dev/md10 /dev/sda1
mdadm: added /dev/sda1
erma ~ # mdadm --add /dev/md11 /dev/sda2
mdadm: added /dev/sda2

erma ~ # echo 200000 > /proc/sys/dev/raid/speed_limit_max
erma ~ # echo 200000 > /proc/sys/dev/raid/speed_limit_min

erma ~ # cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md10 : active raid1 sda1[0] sdb1[1]
      102336 blocks [2/2] [UU]

md11 : active raid1 sda2[2] sdb2[1]
      973602880 blocks [2/1] [_U]
      [>....................]  recovery =  1.5% (15322176/973602880) finish=97.3min speed=164128K/sec
      bitmap: 6/8 pages [24KB], 65536KB chunk


Now go watch a movie until it's done syncing...According to iostat it's getting 150-170MB/sec transfer speed.
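
If you'd rather stare at it than watch a movie, these are the usual suspects for keeping an eye on the rebuild (iostat comes from the sysstat package):
watch -n 5 cat /proc/mdstat    # live progress/ETA for the resync
iostat -mx 5                   # per-device MB/s and utilization every 5 seconds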

In the meantime, I'm going to do some more housekeeping.  I'll activate the new swap partition, turn off the existing 2 swap partitions, format the second new swap partition and update the fstab (using UUIDs) so everything will be automatic when the system boots up.
erma ~ # swapon /dev/sdb3
erma ~ # swapoff /dev/sdd2
erma ~ # swapoff /dev/sdc2
erma ~ # mkswap /dev/sda3
Setting up swapspace version 1, size = 2.9 GiB (3128643584 bytes)
no label, UUID=570c612c-3209-4a34-89da-0b4e72357258
erma ~ # swapon /dev/sda3
erma ~ # cat /proc/swaps
Filename                                Type            Size    Used    Priority
/dev/sda3                               partition       3055316 0       -1
/dev/sdb3                               partition       3055316 0       1


Install GRUB on the new drives:
grub> device (hd0) /dev/sda
grub> device (hd1) /dev/sdb
grub> root (hd0,0)
grub> setup (hd0)
grub> root (hd1,0)
grub> setup (hd1)
grub> quit


Up next is booting the system to a live cd, rsyncing the root filesystem one last time, changing the fstab and grub.conf to point to the new arrays, and hoping she boots up.

Well, she didn't, at least not without a little more work.  I had overlooked that my current initramfs wasn't set up to handle kernel raid, so it wasn't assembling the arrays properly, which meant no logical volumes were found and no root filesystem.  A few searches later I found I needed to run genkernel with the --lvm and --mdadm flags.  I chrooted into the system using the normal Gentoo install process, ran genkernel with the proper flags, added domdadm to my kernel line in grub.conf, and after that everything worked fine.  I spent some more time cleaning up the old arrays, removing the last of the 6 old hard drives and getting the raid arrays renamed back to md0 and md1, which is the nice, simple naming I wanted.  The nice thing about Gentoo is that using it over a long period of time makes you learn things that help you fix problems a lot quicker than if you used an "easy" distribution.
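
Roughly, the fix from inside the livecd chroot looked like this (chroot set up per the normal Gentoo handbook steps; the grub.conf line below is illustrative rather than copied from my config):
genkernel --lvm --mdadm all
# then in /boot/grub/grub.conf, add domdadm (and dolvm, since root is on LVM) to the kernel line, e.g.:
#   kernel /kernel-genkernel-x86_64-... root=/dev/ram0 real_root=/dev/vg/root domdadm dolvm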

 So I'm finally left with:
erma ~ # cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid1 sda1[0] sdb1[1]
      102336 blocks [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
      973602880 blocks [2/2] [UU]
      bitmap: 1/8 pages [4KB], 65536KB chunk

erma ~ # pvs
  PV         VG   Fmt  Attr PSize   PFree
  /dev/md1   vg   lvm2 a--  928.50g 258.00g
 

I'm actively using just under 500GB of the array.  Some guys at work were questioning why, in the age of 8TB drives, I purchased 1TB drives.  I had about 1.5TB of storage before, but I had grown certain LVs over time without really needing the space, and I also deleted a bunch of stuff before migrating the LVs over to the new PV, so 1TB of total space (928.5G usable) is more than enough.  I still have the 2 500GB Seagates and can create another mirrored array if the need arises, but I don't think I'll need it.

That's all for now.

Sunday, March 14, 2010

Migrating Linux software raid from 4 device raid5 to 6 device raid6

In a previous post, I discussed migrating from a 2-device IDE mirror to a 2-device SATA mirror.  Since the old arrays were on 160GB drives and I bought 500GB drives, I figured I'd use the leftover space to add a couple more devices to my storage array.

Here's what I'm starting with:
# cat /proc/mdstat
md0 : active raid5 sdd1[1] sdb1[3] sdc1[2] sda1[0]
      937705728 blocks level 5, 128k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 0/150 pages [0KB], 1024KB chunk

It's a 4x320GB raid5 array.  I'm going to expand the array to include 2 more devices (sde4 and sdf4) and reshape it to a raid6 array at the same time.  I will end up gaining 1 device's worth of space (320GB) and 1 more drive of redundancy.  With raid6, the array will be able to survive 2 failures and still function instead of the 1 failure a raid5 array can survive.

Unfortunately, the current stable hardened kernel is 2.6.28-r9, and reshaping raid5 to raid6 requires at least a 2.6.31 kernel.  Additionally, mdadm >=3.1.0 is required and 3.0 is what's currently stable.  The mdadm side is reasonably easy to fix:
# echo "=sys-fs/mdadm-3.1.1-r1" >> /etc/portage/package.keywords 
# emerge -av mdadm

For the kernel, I installed layman and added the hardened-development overlay (not covered here) and unmasked the minimum required kernel:
# echo "=sys-kernel/hardened-sources-2.6.31-r11" >> /etc/portage/package.keywords
# emerge -av hardened-sources

I'm also not going to cover configuring/building/installing/booting to the new kernel.  If you're using Gentoo, you should know what you're doing already in that respect.

After all the prerequisites are taken care of (I created the partitions I'm using here during the array muddling in the previous blog post), we can move forward.

Add the 2 new devices to the array.
# mdadm /dev/md0 --add /dev/sde4 /dev/sdf4

At this point the new devices will be acting as "spares" as shown below (the (S) next to the device):
# cat /proc/mdstat
md0 : active raid5 sdf4[4](S) sde4[5](S) sdd1[1] sdb1[3] sdc1[2] sda1[0]
      937705728 blocks level 5, 128k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 1/150 pages [4KB], 1024KB chunk

Turn off the write-intent bitmap on the array temporarily.  This is necessary for the reshape to occur.  I was originally getting an error, and Neil Brown (the mdadm author, http://neil.brown.name) told me I needed to remove the bitmap while reshaping:
# mdadm --grow /dev/md0 --bitmap none

To speed up the sync process we're about to cause, issue the following:
# echo 200000 > /proc/sys/dev/raid/speed_limit_max
# echo 200000 > /proc/sys/dev/raid/speed_limit_min

Start the reshape:
# mdadm --grow /dev/md0 --level=6 --raid-devices=6 --backup-file=/root/raid-backup 
mdadm level of /dev/md0 changed to raid6 
mdadm: Need to backup 1536K of critical section..

Watch the *extremely* slow reshape (you can literally watch it with watch -n 1 cat /proc/mdstat):
# cat /proc/mdstat
md0 : active raid6 sda1[4] sdf4[0] sde4[5] sdb1[3] sdd1[1] sdc1[2]
      937705728 blocks super 0.91 level 6, 128k chunk, algorithm 18 [6/7] [UUUUUU]
      [====>................]  reshape = 22.6% (70662528/312568576) finish=286.1min speed=14088K/sec 


At this point, mdadm --detail output still shows my array as being the old size:
Array Size : 937705728 (894.27 GiB 960.21 GB)
Used Dev Size : 312568576 (298.09 GiB 320.07 GB) 


I was curious about this since I should have gained 320GB: my device size is 320GB, raid5 capacity is n-1 devices (4 drives: 320x3 = 960GB), and after the reshape raid6 capacity is n-2 devices (6 drives: 320x4 = 1280GB).  So I ran a test with some loopback devices, and it turns out the size of the array will be correct once the reshape is completed.
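
For the curious, the loopback test was along these lines, using tiny throwaway files just to watch the reported size change (device and file names are examples):
for i in 0 1 2 3 4 5; do
    dd if=/dev/zero of=/tmp/loop$i bs=1M count=64
    losetup /dev/loop$i /tmp/loop$i
done
mdadm --create /dev/md9 --level=5 --raid-devices=4 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
mdadm /dev/md9 --add /dev/loop4 /dev/loop5
mdadm --grow /dev/md9 --level=6 --raid-devices=6 --backup-file=/tmp/md9-backup
watch cat /proc/mdstat    # the array size only updates once the reshape finishes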

After the reshape, turn the write intent bitmap back on:
# mdadm --grow /dev/md0 --bitmap internal

As you can see, the array now has the proper 320x4 size (and the superblock version went back to 0.90):
# mdadm -D /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Thu Sep  7 18:41:05 2006
     Raid Level : raid6
     Array Size : 1250274304 (1192.35 GiB 1280.28 GB)
  Used Dev Size : 312568576 (298.09 GiB 320.07 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sun Mar 14 11:40:59 2010
          State : active
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 128K

           UUID : fdc29307:ba90c91c:d9adde8d:723321bc
         Events : 0.692377

    Number   Major   Minor   RaidDevice State
       0       8       84        0      active sync   /dev/sdf4
       1       8       49        1      active sync   /dev/sdd1
       2       8       33        2      active sync   /dev/sdc1
       3       8       17        3      active sync   /dev/sdb1
       4       8        1        4      active sync   /dev/sda1
       5       8       68        5      active sync   /dev/sde4 


Since I use LVM to chop up this array, I just need to grow my PV to make LVM aware of the new, larger size of the underlying raid array:
# pvresize /dev/md0

pvdisplay now shows the full size:
  PV Size               1.16 TiB / not usable 2.81 MiB

Similarly, vgdisplay shows the extra space available for allocation:
  Free  PE / Size       86234 / 336.85 GiB

And that's about it.  Big thanks to Neil for the tip on the write-intent bitmap.  The combination of Linux kernel raid and mdadm lets you do some pretty amazing things.  I was able to do both the raid1 migrations and this raid5 -> raid6 extend/reshape while the system was up and running with live filesystems.  That's pretty impressive.

Migrating a software raid1 array to 2 new harddrives

My server currently has 2 IDE drives mirrored for the boot/root and swap partitions.  They are 160 and 250GB drives; the 250 is a refurb sent to me by Seagate after the original matching 160 died, so the mirror has already saved me once.  I'm going to migrate to newer/faster/larger 500GB SATA discs.  Rather than use the whole 500GB for the root filesystem, which I don't need, I decided to keep the partition sizes the same and use the leftover space on the drives to add capacity and redundancy to my storage array, which is the subject of another blog post.

So here's what I'm starting with partition wise:
/dev/hda1          0+      4       5-     40131   fd  Linux raid autodetect
/dev/hda2          5     129     125    1004062+  82  Linux swap / Solaris
/dev/hda3        130   19456   19327  155244127+  fd  Linux raid autodetect

hdb obviously has the exact same partition layout.  
/dev/hda1 and hdb1 are raid1 /dev/md1 for /boot
/dev/hda2 and hdb2 are swap
/dev/hda3 and hdb3 are raid1 /dev/md3 for /

I originally had the swap partitions mirrored as well but after reading up a bit I decided I didn't need THAT level of protection so just added the 2 partitions individually.  (They used to be /dev/md2)

The first step (after adding the new drives to the system) is to partition new disks exactly like the old ones:
sfdisk -d /dev/hda | sfdisk /dev/sde  
sfdisk -d /dev/hda | sfdisk /dev/sdf

At this point I added a 4th partition to sde and sdf taking up the rest of the drives.  That's in preparation for the other migration.

set up swap:
# mkswap /dev/sde2
Setting up swapspace version 1, size = 1004056 KiB
no label, UUID=d9a6fd39-e768-4334-8496-2b0b5ab44bdf
# mkswap /dev/sdf2
Setting up swapspace version 1, size = 1004056 KiB
no label, UUID=529c0773-9a3e-434d-b6e4-16cb0e8f24a2
 

Turn on the new swaps (I mount stuff almost exclusively with UUIDs):
# swapon UUID=d9a6fd39-e768-4334-8496-2b0b5ab44bdf
# swapon UUID=529c0773-9a3e-434d-b6e4-16cb0e8f24a2


Add swaps to fstab (I removed the old ones at this point before I forgot):
UUID=d9a6fd39-e768-4334-8496-2b0b5ab44bdf  none  swap  sw,pri=1 0 0
UUID=529c0773-9a3e-434d-b6e4-16cb0e8f24a2  none  swap  sw,pri=1 0 0
Add 2 new devices to md1 (/boot):
mdadm /dev/md1 --add /dev/sde1 --add /dev/sdf1

Add 2 new devices to md3 (/):
mdadm /dev/md3 --add /dev/sde3 --add /dev/sdf3

Snipped mdadm detail output shows the new devices as spares:
# mdadm --detail /dev/md1
    Number   Major   Minor   RaidDevice State
       0       3        1        0      active sync   /dev/hda1
       1       3       65        1      active sync   /dev/hdb1

       2       8       81        -      spare   /dev/sdf1
       3       8       65        -      spare   /dev/sde1

# mdadm --detail /dev/md3
    Number   Major   Minor   RaidDevice State
       0       3        3        0      active sync   /dev/hda3
       1       3       67        1      active sync   /dev/hdb3

       2       8       83        -      spare   /dev/sdf3
       3       8       67        -      spare   /dev/sde3

To speed up the sync process we're about to cause, issue the following:
# echo 200000 > /proc/sys/dev/raid/speed_limit_max
# echo 200000 > /proc/sys/dev/raid/speed_limit_min

Before I continue: the next couple of steps are where the system doesn't have its full redundancy.  I'm going to mark one of the 2 IDE devices in the array as faulty, and the kernel will automatically grab a spare and start rebuilding the array onto it.  Personally, I'm not worried about this because, if something does happen, I can always re-add the drive I "failed".  Just making the point that when you manually degrade the array, your mirror isn't redundant until the rebuild/resync is complete.  Continuing on...
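
For the record, backing out at this stage would just be a matter of re-adding the partition I failed; with the write-intent bitmap in place the resync should be quick.  Something like:
mdadm /dev/md1 --re-add /dev/hda1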

Mark one of the old devices in a raid as faulty.  It's very important that you only mark one device faulty!  This will cause the array to grab a spare and start syncing from the remaining good device, hdb1 in this case, onto the new device.  You can watch the progress of the resync via cat /proc/mdstat.
# mdadm /dev/md1 -f /dev/hda1
mdadm: set /dev/hda1 faulty in /dev/md1 


Since this raid volume is all of 40MB, it resyncs before I can even look at the mdstat output.  Still, I check to make sure it's all synced up and then fault the other old partition:
# mdadm /dev/md1 -f /dev/hdb1
mdadm: set /dev/hdb1 faulty in /dev/md1


Again, check mdstat output and make sure it finishes.  It should look similar to this:
md1 : active raid1 sdf1[2] sde1[3] hdb1[1](F) hda1[1](F)
      40064 blocks [2/2] [UU]
      bitmap: 0/5 pages [0KB], 4KB chunk

Now all the data has been copied over to the new drives and we just need to remove the old ones from the array:
# mdadm /dev/md1 --remove /dev/hda1 --remove /dev/hdb1
mdadm: hot removed /dev/hda1
mdadm: hot removed /dev/hdb1

mdstat now says:
md1 : active raid1 sdf1[0] sde1[1]
      40064 blocks [2/2] [UU]
      bitmap: 0/5 pages [0KB], 4KB chunk

Now do the same steps for hda3 and hdb3 for the md3 array.

Fail one of the devices:
# mdadm /dev/md3 -f /dev/hda3
mdadm: set /dev/hda3 faulty in /dev/md3

This array is ~155GB so it takes a little longer to resync; here's the mdstat output:
md3 : active raid1 sdf3[2] sde3[3](S) hdb3[1] hda3[4](F)
      155244032 blocks [2/1] [_U]
      [=====>...............]  recovery = 29.2% (45338816/155244032) finish=38.1min speed=48002K/sec
      bitmap: 25/149 pages [100KB], 512KB chunk

Incidentally, here's some iostat output showing why I want to get rid of the old IDE harddrives...hdb is reading at 46MB/sec and is at 99.44% utilization.  Meanwhile sdf is writing at 46MB/sec and is only at 38% utilization:
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
hdb             649.20     0.00   93.60    1.40    46.34     0.01   999.07     3.45   36.33  10.47  99.44
sdf               0.00   646.80    0.00   96.20     0.00    46.44   988.74     0.41    4.24   3.98  38.32


When the array has resynced after failing the first partition, go ahead and mark the second original device as faulty:
# mdadm /dev/md3 -f /dev/hdb3
mdadm: set /dev/hdb3 faulty in /dev/md3


The new drives are a little over twice as fast as the old ones.  This time it's reading off of sdf to fill sde:
md3 : active raid1 sdf3[0] sde3[2] hdb3[3](F) hda3[4](F)
      155244032 blocks [2/1] [U_]
      [=====>...............]  recovery = 28.9% (44921984/155244032) finish=18.7min speed=97804K/sec
      bitmap: 39/149 pages [156KB], 512KB chunk 


Once the array is finished with the second sync, it's time to remove the 2 now-faulty devices from the array:

# mdadm /dev/md3 --remove /dev/hda3 --remove /dev/hdb3
mdadm: hot removed /dev/hda3
mdadm: hot removed /dev/hdb3


At this point I can turn off the swaps on the old drives:
swapoff /dev/hda2 /dev/hdb2

And the system is no longer "using" the old hard drives at all. 

I shut down the system and remove the old IDE hard drives.  Then I boot to System Rescue CD and chroot into the system (following the same procedure as initially entering the chroot during a Gentoo install) to run grub and install it into the MBR of both new drives.  Grub names drives differently than the kernel; it uses BIOS numbering.  That means grub sees the 2 new SATA drives (on the sil3132 controller) as the 5th and 6th drives, named hd4 and hd5 respectively.  The kernel, on the other hand, sees them as sde and sdf because it enumerates the 4 drives plugged into the motherboard first.

First start grub
# grub 

Install to first drive's MBR (boot partition is first partition on 5th drive)
grub> root (hd4,0)
 Filesystem type is ext2fs, partition type 0xfd
grub> setup (hd4) 


Install to second drive's MBR (boot partition is first partition on 6th drive)

grub> root (hd5,0)
 Filesystem type is ext2fs, partition type 0xfd
grub> setup (hd5)


Now I reboot and check the BIOS settings to tell the computer to boot off of one of the 2 new SATA drives and let 'er go.

The arrays and swap are now running exclusively on the new SATA drives:
# cat /proc/mdstat
md1 : active raid1 sdf1[0] sde1[1]
      40064 blocks [2/2] [UU]
      bitmap: 0/5 pages [0KB], 4KB chunk

md3 : active raid1 sdf3[0] sde3[1]
      155244032 blocks [2/2] [UU]
      bitmap: 24/149 pages [96KB], 512KB chunk

# cat /proc/swaps
Filename                                Type            Size    Used    Priority
/dev/sde2                               partition       1004052 0       1
/dev/sdf2                               partition       1004052 0       1

Friday, March 14, 2008

XFS fragmentation

Check your fragmentation levels:
# xfs_db -c frag -r /dev/vg/lv1
actual 37387, ideal 35541, fragmentation factor 4.94%
# xfs_db -c frag -r /dev/vg/lv2
actual 688725, ideal 667471, fragmentation factor 3.09%
# xfs_db -c frag -r /dev/md3
actual 631947, ideal 624800, fragmentation factor 1.13%


On Gentoo, xfs_db is in sys-fs/xfsprogs which, if you have an XFS filesystem, you should already have installed.

If you want to run the defragger, the command is xfs_fsr, and on Gentoo you need to install an additional package, sys-fs/xfsdump, to get it.  You can read the xfs_fsr manpage for more info, but the gist is: if you don't supply any command line params, it will start going through all of your XFS mountpoints and stop after either 10 passes or 7200 seconds.  It keeps track of where it was, so you can just run it again and it will pick up where it left off if it didn't make it through all 10 passes.
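
A couple of hedged invocations (the mount point is just an example; see the manpage for the full option list):
xfs_fsr -v /home        # reorganize a single mounted XFS filesystem, verbosely
xfs_fsr -v -t 3600      # walk every XFS mountpoint, but stop after an hour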

2025 Update:
The command now prints a little extra output that I find humorous (the "largely meaningless" line, in particular):

#  xfs_db -c frag -r /dev/vg/root
actual 849197, ideal 830570, fragmentation factor 2.19%
Note, this number is largely meaningless.
Files on this filesystem average 1.02 extents per file

Saturday, February 16, 2008

Migrate an existing Gentoo system to hardened profile

This post is about migrating a system running a current amd64 profile to a hardened profile and all the things entailed in setting up a reasonably "hardened" Gentoo system. I've been wanting to use hardened but in the past when I have looked into it, the process of switching would have required a downgrade of libc that portage doesn't want to allow. Currently the hardened profile uses the same libc that I already have so this presents the opportunity to do the switch.

Covering my ass...

Since this is a potentially deadly operation (the general consensus in #gentoo-hardened was that some people have done it and it's probably ok, BUT Bad Things© could happen), they don't really recommend doing it.  Because of this, I'm doing the following to help mitigate data loss.

I shut down most of my services (switching to single user mode would be better, but I was too lazy to hook up a monitor/kb/mouse to the server...) and ran a backup to get a snapshot of the system.  My /boot and / partitions are mirrored using kernel raid, and I told mdadm to kick the second drive out of each of those arrays:
# mdadm /dev/md1 -f /dev/hde1
# mdadm /dev/md3 -f /dev/hde3
# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid1 hdg1[0]
      40064 blocks [2/1] [U_]
      bitmap: 2/5 pages [8KB], 4KB chunk

md0 : active raid5 sdd1[1] sdc1[2] sdb1[0] sda1[3]
      937705728 blocks level 5, 128k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 0/150 pages [0KB], 1024KB chunk

md3 : active raid1 hdg3[0]
      155244032 blocks [2/1] [U_]
      bitmap: 57/149 pages [228KB], 512KB chunk

unused devices: <none>

So now all of my changes for hardened will be occurring on one drive.  If I completely screw up the system and can't fix it, I can just switch to the "faulty" drive and rebuild the arrays.  If all goes well, I just re-add the partitions to the 2 raids, they'll resync, and all will be rosy.

Getting to the task at hand...

First, switch to the new profile.  You can do this with eselect.  Let's see what's available on the system:
# eselect profile list
Available profile symlink targets:
[1] default-linux/amd64/2006.1
[2] default-linux/amd64/2006.1/desktop
[3] default-linux/amd64/2006.0/no-symlinks
[4] default-linux/amd64/2006.1/no-multilib
[5] default-linux/amd64/2007.0 *
[6] default-linux/amd64/2007.0/desktop
[7] default-linux/amd64/2007.0/no-multilib
[8] default-linux/amd64/2007.0/server
[9] hardened/amd64
[10] hardened/amd64/multilib
[11] selinux/2007.0/amd64
[12] selinux/2007.0/amd64/hardened

Change the profile:
# eselect profile set 10

Next, you need to build the hardened toolchain:
# emerge -av --oneshot binutils gcc virtual/libc

Tell the system to use the new (older) hardened gcc profile:
# gcc-config -l
[1] x86_64-pc-linux-gnu-3.4.6
[2] x86_64-pc-linux-gnu-3.4.6-hardenednopie
[3] x86_64-pc-linux-gnu-3.4.6-hardenednopiessp
[4] x86_64-pc-linux-gnu-3.4.6-hardenednossp
[5] x86_64-pc-linux-gnu-3.4.6-vanilla
[6] x86_64-pc-linux-gnu-4.1.2 *
# gcc-config x86_64-pc-linux-gnu-3.4.6
* Switching native-compiler to x86_64-pc-linux-gnu-3.4.6 ...
>>> Regenerating /etc/ld.so.cache... [ ok ]

* If you intend to use the gcc from the new profile in an already
* running shell, please remember to do:

* # source /etc/profile

# source /etc/profile

Slight change to the /etc/make.conf CFLAGS: I'm adding -fforce-addr.  I don't know what it does, but if you download a hardened stage tarball it's set in the make.conf by default, so I'm adding it here.  Substitute my march for yours, of course:
CFLAGS="-march=k8 -pipe -O2 -fforce-addr"

Next, I do a test emerge and look for green (use flags that are changing state).  The reason you need to do this is that each profile has its own set of profile-defined USE defaults; the new hardened profile added a couple and removed a few in my case.  So basically, do an emerge -ave world and look for green and a *, which signifies a change in a use flag since the last time you merged that package.  Add or remove the corresponding use flags in /etc/make.conf (or use app-portage/ufed as I do).  Keep running the emerge -ave world and saying n until you are happy with the output, then hit y to actually start merging.
# emerge -ave world

If you run into any snags (a package fails to build), just note the package that failed and restart the emerge with "emerge -ave world --resume --skipfirst".  Obviously things can get a little tricky if the problem package is a core system library or something, but if you don't use --resume, it's going to start rebuilding the WHOLE system again.  In the past I've found it's relatively safe to "fix" the problem in another shell while the primary shell keeps building.

So about 9 hours and 312 packages later, it's done.  I restarted most of my network services just to make sure they wouldn't blow up right off the bat, and everything seemed alright so far.  I had emerged hardened-sources while the world was rebuilding, so I kicked off genkernel to configure (according to the various hardened guides), build and install the new kernel from the hardened sources.  After that I rebooted and everything still came up OK.

After testing things out a bit, I re-added the second drive to the mirrors and let the arrays resync:
# mdadm /dev/md1 -a /dev/hde1
# mdadm /dev/md3 -a /dev/hde3
So those are the basic steps to switch over to hardened. Remember, always have backups ready before you do something like this.

Friday, January 25, 2008

I can't find my UUID!

In the last post I showed how to reference a partition in /etc/fstab using a UUID and ran through a couple real-world scenarios for wanting to do so.

This post is about the trouble I ran into looking for said UUIDs...

I have 6 hard drives in my server: 2 have 3 partitions each (boot, swap, root) and are raid1 mirrored, and the other 4 have 1 partition each and are set up as raid5.  I'm only concerned with the first 2 for this post.  After seeing the error about mounting the swap on the last bootup, I figured it would be an easy fix.  I knew how to list the UUIDs, so I typed in that command and was greeted with a seriously lacking list of partitions:
# ls -l /dev/disk/by-uuid/
total 0
lrwxrwxrwx 1 root root 10 2008-01-25 19:44 c2ceffb9-90be-4564-a946-9d37de7725ba -> ../../hdg2
lrwxrwxrwx 1 root root 22 2008-01-25 19:44 ca583626-4a25-4af7-b6c5-8e59a502dbc2 -> ../../mapper/vg-ballzy
lrwxrwxrwx 1 root root 22 2008-01-25 19:44 f5cc881f-210a-431f-8d52-f1e5b512b57b -> ../../mapper/vg-backup
As you can see, none of the partitions from hde are listed, and only the one from hdg is listed. I'm not sure how or why the other ones are not listed.

Since the swap partitions weren't mounted I first tried mkswap to just "reformat" the swap:
# mkswap /dev/hde2
Setting up swapspace version 1, size = 1028153 kB
no label, UUID=56c2f2af-86dd-4390-ae1a-c7fb71e6ed05
OK, looks good so far.  Let's try turning it on:
# swapon UUID=56c2f2af-86dd-4390-ae1a-c7fb71e6ed05
swapon: cannot find the device for UUID=56c2f2af-86dd-4390-ae1a-c7fb71e6ed05
But...mkswap just told me the UUID...how can it not be found?!?!?

After a little digging, I came up with the vol_id command and it clued me in to the problem:
# vol_id /dev/hde2
ID_FS_USAGE=raid
ID_FS_TYPE=linux_raid_member
ID_FS_VERSION=0.90.0
ID_FS_UUID=5088bad5:89d678b2:c125e369:2e0dbcdd
ID_FS_UUID_ENC=5088bad5:89d678b2:c125e369:2e0dbcdd
ID_FS_LABEL=
ID_FS_LABEL_ENC=
ID_FS_LABEL_SAFE=
Raid member? OK, I admit, I used to mirror my 2 swap partitions, but after seeing the performance I decided against the protection it afforded and just went back to adding 2 separate swap partitions. It seems the raid superblock was still in the partition and mkswap wasn't overwriting it for whatever reason.

After another quick google search for deleting a raid superblock, I found the proper command and here are the results:
# mdadm --zero-superblock /dev/hde2
# vol_id /dev/hde2
ID_FS_USAGE=other
ID_FS_TYPE=swap
ID_FS_VERSION=2
ID_FS_UUID=fe6bffd9-5b6b-4db9-8929-cf1575a72d67
ID_FS_UUID_ENC=fe6bffd9-5b6b-4db9-8929-cf1575a72d67
ID_FS_LABEL=
ID_FS_LABEL_ENC=
ID_FS_LABEL_SAFE=
Ahhh, the real UUID, and it sees it as swap as well.  I then proceeded to update /etc/fstab, after which swapon -a correctly enabled both swaps.  As to why the UUIDs are not listed under /dev/disk/by-uuid, I don't know.  Maybe after a reboot the other swap will show up?  The other /dev/disk/by-* listings show all the partitions properly.

EDIT: Since I'm running Gentoo, a simple udevstart causes udev to restart. Now,
ls -l /dev/disk/by-uuid/
shows both hde2 and hdg2 :).

The mystical UUID

Every filesystem (partition?) should have a UUID.  On modern Linux systems you can see them with the following command:
# ls -l /dev/disk/by-uuid/
total 0
lrwxrwxrwx 1 root root 10 2008-01-25 19:44 c2ceffb9-90be-4564-a946-9d37de7725ba -> ../../hdg2
UUIDs, while a bit cumbersome to look at, are extremely nice because you can use them in a lot of places instead of a normal device name (such as /dev/hdg2 in the above example).
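
A couple of related commands worth knowing (hedged examples, not something I used here): blkid prints the UUID, label and filesystem type for a device, and findfs resolves a UUID back to a device name.
blkid /dev/hdg2
findfs UUID=c2ceffb9-90be-4564-a946-9d37de7725ba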

Tonight I moved the 2 hard drives I had plugged into the onboard IDE controller of my motherboard onto a Promise Ultra100 card.  Because of this, the kernel renamed the devices from /dev/hda and /dev/hdc to /dev/hde and /dev/hdg.  Upon booting the system I saw the following:
swapon: cannot canonicalize /dev/hda2: No such file or directory
swapon: cannot stat /dev/hda2: No such file or directory
swapon: cannot canonicalize /dev/hdc2: No such file or directory
swapon: cannot stat /dev/hdc2: No such file or directory
UUIDs will help this to never happen again.

Here are the relevant lines from my old /etc/fstab:
/dev/hda2   none    swap    sw,pri=1    0 0
/dev/hdc2 none swap sw,pri=1 0 0
And the new lines:
UUID=fe6bffd9-5b6b-4db9-8929-cf1575a72d67   none    swap    sw,pri=1    0 0
UUID=e2992cf5-bc3a-4b3a-a920-d9dfbe7a5a9a none swap sw,pri=1 0 0
As I said, it doesn't look as pretty, but look what happens with the old /etc/fstab:
# swapon -a
# cat /proc/swaps
Filename                                Type            Size    Used    Priority
and the new:
# swapon -a
erma ~ # cat /proc/swaps
Filename                                Type            Size    Used    Priority
/dev/hde2                               partition       1004052 0       1
/dev/hdg2                               partition       1004052 0       1
If you haven't figured it out by now: by specifying partitions by UUID, you remove the dependency on where they are physically plugged into the motherboard and on any kernel naming conventions.  I recently had my SATA drives move around a bit after a BIOS update, so UUIDs would help out there as well.

As it happens I had some trouble finding the (correct) UUID of one of my swap partitions but that's the topic of my next post.