Crotchety Linux - Gentoo Stuff

crotchety
adj - subject to whims, crankiness, or ill temper

I use Gentoo Linux both at home and at work. Every so often I hit some snag and I'd like to detail the fix here both for my benefit and to possibly help anyone else having a similar issue.

Sunday, March 14, 2010

Migrating Linux software raid from 4 device raid5 to 6 device raid6

In a previous post, I discussed migrating from 2xIDE device mirror to a 2xSATA device mirror.  Since the old arrays were using 160GB and I bought 500GB drives I figured I'd use the space left over to add a couple more devices to my storage array.

Here's what I'm starting with:
# cat /proc/mdstat
md0 : active raid5 sdd1[1] sdb1[3] sdc1[2] sda1[0]
      937705728 blocks level 5, 128k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 0/150 pages [0KB], 1024KB chunk

It's a 4x320GB raid5 array.  I'm going to expand the array to include 2 more devices (sde4 and sdf4) and reshape it to a raid6 array at the same time.  I will end up gaining 1 device's worth of space (320GB) and 1 more drive of redundancy.  With raid6, the array will be able to survive 2 failures and still function instead of the 1 failure a raid5 array can survive.

Unfortunately, the current stable hardened kernel is 2.6.28-r9, and reshaping a raid5 to raid6 requires at least a 2.6.31 kernel.  Additionally, mdadm >=3.1.0 is required and 3.0 is currently stable.  The second is reasonably easy to fix:
# echo "=sys-fs/mdadm-3.1.1-r1" >> /etc/portage/package.keywords 
# emerge -av mdadm

For the kernel, I installed layman and added the hardened-development overlay (not covered here) and unmasked the minimum required kernel:
# echo "=sys-kernel/hardened-sources-2.6.31-r11" >> /etc/portage/package.keywords
# emerge -av hardened-sources

I'm also not going to cover configuring/building/installing/booting to the new kernel.  If you're using Gentoo, you should know what you're doing already in that respect.
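The two version requirements above (kernel at least 2.6.31, mdadm at least 3.1.0) can be checked with a small helper before starting.  This is only a sketch: version_ge is a made-up name, and it assumes GNU sort with its -V (version sort) option is available.

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Assumption: GNU sort is installed, since -V (version sort) is a GNU extension.
version_ge() {
    # The smaller of the two versions sorts first; A >= B when that one is B.
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical usage: cut off a "-hardened-r11" style suffix before comparing.
#   version_ge "$(uname -r | cut -d- -f1)" 2.6.31 || echo "kernel too old for raid5 -> raid6"
```

The same helper works for the mdadm requirement once you have the installed version number in hand, for example version_ge 3.1.1 3.1.0.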

After all the prerequisites are taken care of (I created the partitions I'm using here during the previous array muddling in the previous blog post) we can move forward.

Add the 2 new devices to the array.
# mdadm /dev/md0 --add /dev/sde4 /dev/sdf4

At this point the new devices will be acting as "spares" as shown below (the (S) next to the device):
# cat /proc/mdstat
md0 : active raid5 sdf4[4](S) sde4[5](S) sdd1[1] sdb1[3] sdc1[2] sda1[0]
      937705728 blocks level 5, 128k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 1/150 pages [4KB], 1024KB chunk

Turn off the write-intent bitmap on the array temporarily.  This is necessary for the reshape to occur.  I originally was getting an error and Neil Brown (mdadm author http://neil.brown.name) told me I needed to remove the bitmap while reshaping:
# mdadm --grow /dev/md0 --bitmap none

To speed up the sync process we're about to cause, issue the following:
# echo 200000 > /proc/sys/dev/raid/speed_limit_max
# echo 200000 > /proc/sys/dev/raid/speed_limit_min
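If you'd rather put the limits back to what they were once the sync is done, something like the following could remember the old values first.  It's a sketch, not something I ran during this migration: the function names and the RAID_SYSCTL_DIR variable are made up for illustration, and the variable only exists so the functions can be tried against scratch files instead of /proc.

```shell
# Directory holding the md speed limits (hypothetical override for testing).
RAID_SYSCTL_DIR=${RAID_SYSCTL_DIR:-/proc/sys/dev/raid}

# raise_speed_limits KIB_PER_SEC: set both limits, print the previous "min max" pair.
raise_speed_limits() {
    old_min=$(cat "$RAID_SYSCTL_DIR/speed_limit_min") || return 1
    old_max=$(cat "$RAID_SYSCTL_DIR/speed_limit_max") || return 1
    echo "$1" > "$RAID_SYSCTL_DIR/speed_limit_max"
    echo "$1" > "$RAID_SYSCTL_DIR/speed_limit_min"
    echo "$old_min $old_max"
}

# restore_speed_limits MIN MAX: write back the values printed by raise_speed_limits.
restore_speed_limits() {
    echo "$1" > "$RAID_SYSCTL_DIR/speed_limit_min"
    echo "$2" > "$RAID_SYSCTL_DIR/speed_limit_max"
}

# Usage (as root):
#   saved=$(raise_speed_limits 200000)
#   ... wait for the sync to finish ...
#   restore_speed_limits $saved      # left unquoted on purpose so it splits into MIN MAX
```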

Start the reshape:
# mdadm --grow /dev/md0 --level=6 --raid-devices=6 --backup-file=/root/raid-backup 
mdadm level of /dev/md0 changed to raid6 
mdadm: Need to backup 1536K of critical section..

Watch the *extremely* slow reshape (you can literally watch it with watch -n 1 cat /proc/mdstat):
# cat /proc/mdstat
md0 : active raid6 sda1[4] sdf4[0] sde4[5] sdb1[3] sdd1[1] sdc1[2]
      937705728 blocks super 0.91 level 6, 128k chunk, algorithm 18 [6/7] [UUUUUU]
      [====>................]  reshape = 22.6% (70662528/312568576) finish=286.1min speed=14088K/sec 
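If you only want the interesting numbers out of that output, a small awk filter can pull the percentage and the estimated finish time for any array that is reshaping, recovering or resyncing.  This is a rough sketch that assumes the mdstat layout shown above; mdstat_progress is a made-up name.

```shell
# mdstat_progress [FILE]: print "array percent finish" for each array with a
# reshape/recovery/resync in flight.  FILE defaults to /proc/mdstat.
mdstat_progress() {
    awk '
        /^md[0-9]+ :/ { dev = $1 }                    # remember which array we are in
        /(reshape|recovery|resync) =/ {
            pct = ""; fin = ""
            for (i = 1; i <= NF; i++) {
                if ($i == "=") pct = $(i + 1)         # the field after "=" is e.g. 22.6%
                if ($i ~ /^finish=/) { fin = $i; sub(/^finish=/, "", fin) }
            }
            print dev, pct, fin
        }
    ' "${1:-/proc/mdstat}"
}

# Usage: watch -n 60 is overkill for this, a plain loop does the job:
#   while sleep 60; do mdstat_progress; done
```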


At this point, mdadm --detail output still shows my array as being the old size:
Array Size : 937705728 (894.27 GiB 960.21 GB)
Used Dev Size : 312568576 (298.09 GiB 320.07 GB) 


I was curious about this as I should have gained 320GB.  My device size is 320GB, raid5 capacity is n-1 devices: 320x3 = 960GB.  After the reshape it will be n-2 devices: 320x4 = 1280GB.  So I ran a test with some loopback devices and confirmed that the size of the array is correct once the reshape has completed.
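The same arithmetic works with the exact block counts mdadm reports.  A minimal sketch (raid_capacity is a made-up helper that only knows about raid5 and raid6):

```shell
# raid_capacity LEVEL NDEVICES DEVICE_SIZE: usable size, in the same unit as DEVICE_SIZE.
raid_capacity() {
    case "$1" in
        5) parity=1 ;;      # raid5 gives up one device's worth of space to parity
        6) parity=2 ;;      # raid6 gives up two
        *) echo "unsupported raid level: $1" >&2; return 1 ;;
    esac
    echo $(( ($2 - parity) * $3 ))
}

# Using the "Used Dev Size" of 312568576 blocks from mdadm --detail:
#   raid_capacity 5 4 312568576    # the old 4 device raid5
#   raid_capacity 6 6 312568576    # the new 6 device raid6
```

With my device size, the first call gives 937705728, the Array Size mdadm reports for the raid5, and the second gives 1250274304, which is what the array should report once the reshape is done.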

After the reshape, turn the write intent bitmap back on:
# mdadm --grow /dev/md0 --bitmap internal

As you can see, the array now has the proper 320x4 size (and the superblock version went back to 0.90):
# mdadm -D /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Thu Sep  7 18:41:05 2006
     Raid Level : raid6
     Array Size : 1250274304 (1192.35 GiB 1280.28 GB)
  Used Dev Size : 312568576 (298.09 GiB 320.07 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sun Mar 14 11:40:59 2010
          State : active
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 128K

           UUID : fdc29307:ba90c91c:d9adde8d:723321bc
         Events : 0.692377

    Number   Major   Minor   RaidDevice State
       0       8       84        0      active sync   /dev/sdf4
       1       8       49        1      active sync   /dev/sdd1
       2       8       33        2      active sync   /dev/sdc1
       3       8       17        3      active sync   /dev/sdb1
       4       8        1        4      active sync   /dev/sda1
       5       8       68        5      active sync   /dev/sde4 


Since I use LVM to chop up this array, I just need to grow my pv to make LVM aware of the new, larger size of the underlying raid array:
# pvresize /dev/md0

pvdisplay now shows the full size:
  PV Size               1.16 TiB / not usable 2.81 MiB

Similarly, vgdisplay shows the extra space available for allocation:
  Free  PE / Size       86234 / 336.85 GiB

And that's about it.  Big thanks to Neil for the tip on the write-intent bitmap.  The combination of Linux kernel raid and mdadm lets you do some pretty amazing things.  I was able to do both the raid1 migrations and this raid5 -> raid6 extend/reshape while the system was up and running with live filesystems.  That's pretty impressive.

Migrating a software raid1 array to 2 new harddrives

My server currently has 2 IDE drives mirrored for the boot/root and swap partitions.  They are 160 and 250GB drives.  The 250 is a refurb sent to me by Seagate after the original matching 160 died.  So the mirror has already saved me once.  I'm going to migrate to newer/faster/larger 500GB SATA disks.  Rather than use the whole 500 for the root filesystem, which I don't need, I decided to keep the partition sizes all the same and use the leftover space on the drives to add capacity and redundancy to my storage array, which is the subject of another blog post.

So here's what I'm starting with partition wise:
/dev/hda1          0+      4       5-     40131   fd  Linux raid autodetect
/dev/hda2          5     129     125    1004062+  82  Linux swap / Solaris
/dev/hda3        130   19456   19327  155244127+  fd  Linux raid autodetect

hdb obviously has the exact same partition layout.  
/dev/hda1 and hdb1 are raid1 /dev/md1 for /boot
/dev/hda2 and hdb2 are swap
/dev/hda3 and hdb3 are raid1 /dev/md3 for /

I originally had the swap partitions mirrored as well but after reading up a bit I decided I didn't need THAT level of protection so just added the 2 partitions individually.  (They used to be /dev/md2)

The first step (after adding the new drives to the system) is to partition the new disks exactly like the old ones:
sfdisk -d /dev/hda | sfdisk /dev/sde  
sfdisk -d /dev/hda | sfdisk /dev/sdf

At this point I added a 4th partition to sde and sdf taking up the rest of the drives.  That's in preparation for the other migration.

Set up swap:
# mkswap /dev/sde2
Setting up swapspace version 1, size = 1004056 KiB
no label, UUID=d9a6fd39-e768-4334-8496-2b0b5ab44bdf
# mkswap /dev/sdf2
Setting up swapspace version 1, size = 1004056 KiB
no label, UUID=529c0773-9a3e-434d-b6e4-16cb0e8f24a2
 

Turn on the new swaps (I mount stuff almost exclusively with UUIDs):
# swapon UUID=d9a6fd39-e768-4334-8496-2b0b5ab44bdf
# swapon UUID=529c0773-9a3e-434d-b6e4-16cb0e8f24a2


Add swaps to fstab (I removed the old ones at this point before I forgot):
UUID=d9a6fd39-e768-4334-8496-2b0b5ab44bdf  none  swap  sw,pri=1 0 0
UUID=529c0773-9a3e-434d-b6e4-16cb0e8f24a2  none  swap  sw,pri=1 0 0

Add 2 new devices to md1 (/boot):
mdadm /dev/md1 --add /dev/sde1 --add /dev/sdf1

Add 2 new devices to md3 (/):
mdadm /dev/md3 --add /dev/sde3 --add /dev/sdf3

Snipped mdadm detail output shows the new devices as spares:
# mdadm --detail /dev/md1
    Number   Major   Minor   RaidDevice State
       0       3        1        0      active sync   /dev/hda1
       1       3       65        1      active sync   /dev/hdb1

       2       8       81        -      spare   /dev/sdf1
       3       8       65        -      spare   /dev/sde1

# mdadm --detail /dev/md3
    Number   Major   Minor   RaidDevice State
       0       3        3        0      active sync   /dev/hda3
       1       3       67        1      active sync   /dev/hdb3

       2       8       83        -      spare   /dev/sdf3
       3       8       67        -      spare   /dev/sde3

To speed up the sync process we're about to cause, issue the following:
# echo 200000 > /proc/sys/dev/raid/speed_limit_max
# echo 200000 > /proc/sys/dev/raid/speed_limit_min

Before I continue: the next couple of steps are where the system doesn't have full redundancy.  I'm going to mark one of the 2 IDE devices in the array as faulty.  The kernel will automatically grab a spare and start rebuilding the array.  Personally, I'm not worried about this because, if something does happen, I can always re-add the drive I "failed".  I'm just making the point that when you manually degrade the array, your mirror isn't redundant until the rebuild/resync is complete.  Continuing on...

Mark one of the old devices in a raid as faulty.  It's very important that you only mark one device faulty!  This will cause the array to grab a spare and start syncing the remaining good device, hdb1 in this case, to the new device.  You can watch the progress of the resync via cat /proc/mdstat.
# mdadm /dev/md1 -f /dev/hda1
mdadm: set /dev/hda1 faulty in /dev/md1 


Since this raid volume is all of 40MB, it resyncs before I can even look at the mdstat output.  Still, I check and make sure it's all synced up and fault the other old partition:
# mdadm /dev/md1 -f /dev/hdb1
mdadm: set /dev/hdb1 faulty in /dev/md1


Again, check mdstat output and make sure it finishes.  It should look similar to this:
md1 : active raid1 sdf1[2] sde1[3] hdb1[1](F) hda1[1](F)
      40064 blocks [2/2] [UU]
      bitmap: 0/5 pages [0KB], 4KB chunk
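Because the second old device must only be failed once the array is fully redundant again, the "check mdstat" step can be scripted.  This is a sketch assuming the mdstat layout shown above; md_in_sync is a made-up name.  It succeeds only when the array's status brackets contain no "_" and no recovery, resync or reshape line is present.

```shell
# md_in_sync ARRAY [FILE]: exit 0 when ARRAY (e.g. md1) has all members up and
# nothing rebuilding.  FILE defaults to /proc/mdstat.
md_in_sync() {
    awk -v dev="$1" '
        $1 == dev && $2 == ":"                  { in_dev = 1; next }
        in_dev && NF == 0                       { exit }            # blank line ends the stanza
        in_dev && /blocks/                      { status = $NF }    # e.g. [UU] or [_U]
        in_dev && /(recovery|resync|reshape) =/ { busy = 1 }
        END {
            if (status != "" && status !~ /_/ && !busy) exit 0
            exit 1
        }
    ' "${2:-/proc/mdstat}"
}

# Usage: wait for the rebuild before touching the second old device.
#   until md_in_sync md1; do sleep 5; done
#   mdadm /dev/md1 -f /dev/hdb1
```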

Now all the data has been copied over to the new drives and we just need to remove the old ones from the array:
# mdadm /dev/md1 --remove /dev/hda1 --remove /dev/hdb1
mdadm: hot removed /dev/hda1
mdadm: hot removed /dev/hdb1

mdstat now says:
md1 : active raid1 sdf1[0] sde1[1]
      40064 blocks [2/2] [UU]
      bitmap: 0/5 pages [0KB], 4KB chunk

Now do the same steps for hda3 and hdb3 for the md3 array.

Fail one of the devices:
# mdadm /dev/md3 -f /dev/hda3
mdadm: set /dev/hda3 faulty in /dev/md3

This array is ~155GB so it takes a little longer to resync, here's the mdstat output:
md3 : active raid1 sdf3[2] sde3[3](S) hdb3[1] hda3[4](F)
      155244032 blocks [2/1] [_U]
      [=====>...............]  recovery = 29.2% (45338816/155244032) finish=38.1min speed=48002K/sec
      bitmap: 25/149 pages [100KB], 512KB chunk

Incidentally, here's some iostat output showing why I want to get rid of the old IDE hard drives: hdb is reading 46MB/sec and is at 99.44% utilization.  Meanwhile sdf is writing at 46MB/sec and is only at 38% utilization:
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
hdb             649.20     0.00   93.60    1.40    46.34     0.01   999.07     3.45   36.33  10.47  99.44
sdf               0.00   646.80    0.00   96.20     0.00    46.44   988.74     0.41    4.24   3.98  38.32


When the array has resynced after failing the first partition, go ahead and mark the second original device as faulted:
# mdadm /dev/md3 -f /dev/hdb3
mdadm: set /dev/hdb3 faulty in /dev/md3


The new drives are a little over twice as fast as the old.  This time it's reading off of sdf to fill sde:
md3 : active raid1 sdf3[0] sde3[2] hdb3[3](F) hda3[4](F)
      155244032 blocks [2/1] [U_]
      [=====>...............]  recovery = 28.9% (44921984/155244032) finish=18.7min speed=97804K/sec
      bitmap: 39/149 pages [156KB], 512KB chunk 


Once the array has finished the second sync, it's time to remove the 2 now-faulty devices from the array:

# mdadm /dev/md3 --remove /dev/hda3 --remove /dev/hdb3
mdadm: hot removed /dev/hda3
mdadm: hot removed /dev/hdb3


At this point I can turn off the swaps on the old drives:
swapoff /dev/hda2 /dev/hdb2

And the system is no longer "using" the old hard drives at all. 

I shut down the system and remove the old IDE hard drives.  Then I boot to system rescue cd and chroot into the system (following the same procedure as initially entering the chroot of your system from the Gentoo Handbook) to run grub and install it into the MBR of both new drives.  Grub names drives differently than the kernel, and it uses BIOS numbering.  That means grub sees the 2 new SATA drives (on the sil3132 controller) as the 5th and 6th drives, named hd4 and hd5 respectively.  The kernel, on the other hand, sees them as sde and sdf because it enumerates the 4 drives plugged into the motherboard first.
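On this machine the BIOS order happens to line up with the kernel's sdX order, so the translation is just "letter position, counting from zero".  Here is a sketch of that translation; grub_legacy_name is a made-up helper, and the matching order is an assumption that will not hold on every system, so check it against what grub actually sees.

```shell
# grub_legacy_name DEVICE [PARTITION]: translate a kernel name such as sde or
# /dev/sde into grub (legacy) notation.
# ASSUMPTION: the BIOS enumerates the disks in the same order as the kernel.
grub_legacy_name() {
    letter=${1#/dev/}       # "/dev/sde" -> "sde"
    letter=${letter#sd}     # "sde" -> "e"
    case "$letter" in
        [a-z]) ;;
        *) echo "expected a device like sde, got: $1" >&2; return 1 ;;
    esac
    index=$(( $(printf '%d' "'$letter") - 97 ))    # a=0, b=1, ...
    if [ -n "$2" ]; then
        echo "(hd$index,$(( $2 - 1 )))"             # grub counts partitions from 0
    else
        echo "(hd$index)"
    fi
}

# Usage:
#   grub_legacy_name /dev/sde 1     # the /boot partition on the first new drive
#   grub_legacy_name sdf            # the second new drive as a whole
```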

First start grub
# grub 

Install to first drive's MBR (boot partition is first partition on 5th drive)
grub> root (hd4,0)
 Filesystem type is ext2fs, partition type 0xfd
grub> setup (hd4) 


Install to second drive's MBR (boot partition is first partition on 6th drive)

grub> root (hd5,0)
 Filesystem type is ext2fs, partition type 0xfd
grub> setup (hd5)


Now I reboot and check the BIOS settings to tell the computer to boot off of one of the 2 new SATA drives and let 'er go.

The arrays and swap are now running exclusively on the new SATA drives:
# cat /proc/mdstat
md1 : active raid1 sdf1[0] sde1[1]
      40064 blocks [2/2] [UU]
      bitmap: 0/5 pages [0KB], 4KB chunk

md3 : active raid1 sdf3[0] sde3[1]
      155244032 blocks [2/2] [UU]
      bitmap: 24/149 pages [96KB], 512KB chunk

# cat /proc/swaps
Filename                                Type            Size    Used    Priority
/dev/sde2                               partition       1004052 0       1
/dev/sdf2                               partition       1004052 0       1

Friday, November 13, 2009

How to keep Gentoo safely updated

When I first started using Gentoo, I broke the system a few times by updating it. I've noticed that I have far fewer issues keeping the system up to date these days. I'm sure part of it is the great job the Gentoo Devs are doing but I think the other part is following a procedure each time I update the system.

Update portage
I use app-portage/eix so I run eix-sync. This gives me a nice diff at the end of the changes. I can quickly see if any packages I have installed have updates. An alternative is the tried and true emerge --sync. Some sample output of eix-sync below:
  • [>] == x11-themes/mythtv-themes-extra (0.21_p17416 -> 0.21_p18657): A collection of themes for the MythTV project. (Has been updated, but I don't have it installed)
  • < (Has been removed from portage)
  • [N] >> dev-java/piccolo2d (~1.2.1!t): A Structured 2D Graphics Framework (A new package to portage)
  • [U] == net-libs/gnutls (2.8.3@09/22/09; 2.8.3 -> 2.8.4): A TLS 1.0 and SSL 3.0 implementation for the GNU project (Has been updated to 2.8.4. I have 2.8.3 which was installed on 9/22/09)

Update packages
I use emerge -avuND world. The options are as follows:
  • a means ask. It's like doing a (p)retend but instead of calculating dependencies twice, it's only once.
  • v is verbose output. I like to see the use flags listed.
  • u is update.
  • N is new-use. This means rebuild any package whose use-flags have changed since it was last installed.
  • D is deep. This looks further into the dependencies and will end up keeping more stuff up to date. I find this still doesn't catch *every* little package update, but it's good enough for me.
  • world is the world package set. This means potentially update any package on the system.

Update configuration files
I use dispatch-conf instead of etc-update as, over time, it saves me a lot of time. dispatch-conf is a lot more powerful and you can turn on a lot of options that aren't enabled by default in its configuration file, /etc/dispatch-conf.conf. The easy ones to turn on, IMO, are the "automerge" options.

Check for broken packages
This one is easy. revdep-rebuild, which is a part of app-portage/gentoolkit.

Restart services referencing old versions of shared libraries

I can't take credit for coming up with this little snippet, but I use it every time. It's especially important after you've applied an update that was a security fix for a network service.
lsof | grep 'DEL.*lib' | cut -f 1 -d ' ' | sort -u

Example:
# lsof | grep 'DEL.*lib' | cut -f 1 -d ' ' | sort -u
console-k
hald
hald-addo
hald-runn
syslog-ng
# /etc/init.d/hald restart
* Caching service dependencies... [ ok ]
* Stopping Hardware Abstraction Layer daemon ... [ ok ]
* Starting Hardware Abstraction Layer daemon ... [ ok ]
# /etc/init.d/syslog-ng restart
* Stopping syslog-ng ... [ ok ]
* Starting syslog-ng ... [ ok ]
# /etc/init.d/consolekit restart
* Stopping ConsoleKit daemon ... [ ok ]
* Starting ConsoleKit daemon ... [ ok ]
# lsof | grep 'DEL.*lib' | cut -f 1 -d ' ' | sort -u
#
Essentially, it searches for open files that the filesystem has tagged as deleted, finds the process that has opened the deleted library and alphabetizes the list. You need to manually restart any service that shows up in that list. A little gotcha is that if sshd shows up in the list, you can restart it and you should stay connected to your session. If you re-run the command above, sshd would still be listed as your current session is still using the old process/version of sshd. If you log out and then log back in and run the command, all should be clear at that point.

Wrap up
Something else that I use that I'm not going to cover here is portage's elog feature. You configure this in /etc/make.conf and you can enable several forms of reporting errors, warnings, info, etc that ebuilds output to you. I have it set up to mail them to me, one per package. After an update, I look over the mails and check to make sure there isn't some important information in there about some action I have to take. In the make.conf manpage it says "Please see /usr/share/portage/config/make.conf.example for elog documentation."

So there you have it. This is the process I follow both at home and at work and things have been running smoothly for the past few years.
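For reference, the whole procedure can be strung together in one function.  This is only a sketch of the order of operations, not a script I actually run: run, stale_lib_users, update_system and the DRY_RUN variable are made-up names, and emerge -a will still stop and ask before building anything.

```shell
# run CMD...: execute CMD, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# List processes still holding deleted shared libraries open.
stale_lib_users() {
    lsof | grep 'DEL.*lib' | cut -f 1 -d ' ' | sort -u
}

# The update steps in the order described above; stops at the first failure.
update_system() {
    run eix-sync             || return 1
    run emerge -avuND world  || return 1
    run dispatch-conf        || return 1
    run revdep-rebuild       || return 1
    run stale_lib_users
}

# Usage:
#   DRY_RUN=1; update_system     # only print the steps
#   DRY_RUN=0; update_system     # really run them (as root)
```

Restarting whatever the last step lists is still a manual job, since the process names don't always match the init script names (console-k vs. consolekit, for example).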

Friday, October 9, 2009

Upgrade to net-mail/courier-imap-4.5.0

After upgrading this package and running dispatch-conf I had to update /etc/courier-imap/imapd. While doing so, I merged in the following new block:
##NAME: IMAP_MAILBOX_SANITY_CHECK:0
#
# Sanity check -- make sure home directory and maildir's ownership matches
# the IMAP server's effective uid and gid

IMAP_MAILBOX_SANITY_CHECK=1

I was a little concerned, and sure enough after restarting courier and trying to check my mail, I couldn't get any messages. I checked the mail log and saw the following:
Oct  9 07:42:39 erma imapd-ssl: Connection, ip=[xxx.xxx.xxx.xxx]
Oct 9 07:42:40 erma imapd-ssl: xxxx: Account's mailbox directory is not owned by the correct uid or gid
Rather than just disable the feature (I figured a "sanity check" is a good thing), I searched around a bit and saw some discussion about people having issues when the group membership of the maildir wasn't the user's primary group. So I checked the permissions on my maildir:
drwx------ 29 dstutz root   486 2009-10-08 07:13 .
I tried chgrp -R users .maildir and tried to check my mail again:
Oct  9 07:53:33 erma imapd-ssl: Connection, ip=[xxx.xxx.xxx.xxx]
Oct 9 07:53:33 erma imapd-ssl: LOGIN, user=xxxx, ip=[xxx.xxx.xxx.xxx], port=[19177], protocol=IMAP
Yay! So I did a preemptive chgrp for all the other users on my system and hopefully all will be well going forward. I find it interesting that it even cares about the group membership since the maildir has 700 permissions.
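To find out ahead of time which maildirs would trip the check, you can compare a directory's numeric owner and group against the user's uid and primary gid.  This is a sketch that assumes GNU stat (for the -c option); maildir_owner_ok is a made-up name, and the path in the usage line is only an example.

```shell
# maildir_owner_ok DIR UID GID: succeed when DIR is owned by exactly UID:GID.
# Assumption: GNU coreutils stat, which understands -c '%u:%g'.
maildir_owner_ok() {
    [ "$(stat -c '%u:%g' "$1")" = "$2:$3" ]
}

# Hypothetical usage for one user (the maildir path is an example):
#   u=dstutz
#   maildir_owner_ok "/home/$u/.maildir" "$(id -u "$u")" "$(id -g "$u")" ||
#       echo "$u: maildir ownership does not match uid/primary gid"
```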

Wednesday, September 17, 2008

VNC-ish CLI

This isn't really Gentoo-specific, but it's a nice trick that you could use to share an ssh session with someone. It works like VNC where both people have full control over the session at the same time.

You need to have GNU screen installed:
emerge -av app-misc/screen

Start a screen session, giving it a name (-S requires one):
screen -S <name>

Have the other person attach to it using:
screen -x <name>

Friday, March 14, 2008

XFS fragmentation

Check your fragmentation levels:
# xfs_db -c frag -r /dev/vg/lv1
actual 37387, ideal 35541, fragmentation factor 4.94%
# xfs_db -c frag -r /dev/vg/lv2
actual 688725, ideal 667471, fragmentation factor 3.09%
# xfs_db -c frag -r /dev/md3
actual 631947, ideal 624800, fragmentation factor 1.13%
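As far as I can tell, the fragmentation factor is just the share of extents beyond the ideal count, (actual - ideal) / actual, expressed as a percentage; that reproduces all three figures above.  A sketch (frag_factor is a made-up name):

```shell
# frag_factor ACTUAL IDEAL: fragmentation percentage, rounded to two decimals.
frag_factor() {
    awk -v actual="$1" -v ideal="$2" \
        'BEGIN { printf "%.2f\n", (actual - ideal) * 100 / actual }'
}

# Usage with the numbers reported for /dev/vg/lv1:
#   frag_factor 37387 35541
```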

On Gentoo, xfs_db is in sys-fs/xfsprogs which, if you have an XFS filesystem, you should already have installed.

If you want to run the defragger, the command is xfs_fsr and on Gentoo you need to install an additional package, sys-fs/xfsdump, to get it. You can read the manpage on xfs_fsr for more info, but the gist is if you don't otherwise supply command line params it will start going through all of your xfs mountpoints and stop after either 10 passes or 7200 seconds. It keeps track of where it was so you can just run it again and it will pick up where it left off if it didn't make it through all 10 passes.

Sunday, February 17, 2008

dmraid != kernel raid

I didn't mention it in the previous post about migrating to hardened gentoo, but initially when I went to re-add the drive back to the mirror I was getting some errors and it wouldn't let me. mdadm told me this:
mdadm: Cannot open /dev/hde1: Device or resource busy
I got similar output trying to add hde3 back to /dev/md3. Those partitions are only used for raid so it was really bothering me that it said they were in use. I googled around a bit and found a reference to the device mapper (which starts up on bootup for me because I use LVM) creating some devices based on a motherboard raid controller. You can get a listing of what the device mapper has created using dmsetup ls. I ran the command and sure enough there were 4 devices listed starting with nvidia_. I was able to remove 3 of the 4 using dmsetup -C but one was saying it was busy and in use and I still couldn't add the partitions to the raid. So I went back and edited grub.conf to remove dodmraid from the kernel line and restarted the system. After it came back up I was able to hot add the 2 partitions and get the mirrors back up and running in a non-degraded state. Also, dmsetup ls now only shows my LVM VGs. I went back and edited my genkernel.conf to tell it to stop adding dmraid support to my initrd in the future.