The second of mjollnir's disks, /dev/sdb, just started hitting 90% utilization with basically no usage, like this:
Device:  rrqm/s  wrqm/s   r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.30  102.20  2.30  73.00   0.09   0.68     20.93      0.30   3.97   2.03  15.30
sdb        1.10   89.30  2.10  55.70   0.19   0.56     26.62      2.41  40.88  16.12  93.20
sdc        0.70   88.70  4.10  33.30   0.14   0.48     33.88      0.16   4.20   2.22   8.30
sdd        0.20  101.10  6.10  50.10   0.09   0.59     24.83      0.23   4.13   1.83  10.30
Notice how sdb read 0.19 MB and wrote 0.56 MB per second in this ten-second sample, but was still 93.20% utilized. Its await (40.88 ms) and svctm (16.12 ms) are also several times higher than on the other three disks. The site started slowing down a lot and I was getting occasional 503s and load spikes. It looks like pretty much all disk activity halted. I removed sdb from the array so the site would work again, and started running a SMART test. No errors seem to be logged anywhere. This disk has seen several thousand hours of usage with no problems, so this failure comes out of the blue.
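For anyone wondering how a disk can be 93% busy while moving well under 1 MB/s: as far as I know, iostat's %util is roughly the number of requests per second times the average service time, so a slow svctm is enough to saturate the disk even when almost no data is moving. Here is a quick Python sketch of that arithmetic using the numbers from the iostat sample. The function name and the list layout are my own, just for illustration.

```python
# Rows copied from the iostat sample in this post:
# (device, r/s, w/s, svctm in ms, reported %util)
SAMPLE = [
    ("sda", 2.30, 73.00, 2.03, 15.30),
    ("sdb", 2.10, 55.70, 16.12, 93.20),
    ("sdc", 4.10, 33.30, 2.22, 8.30),
    ("sdd", 6.10, 50.10, 1.83, 10.30),
]


def estimated_util(reads_per_s, writes_per_s, svctm_ms):
    """Estimate %util as (requests per second) * (service time per request)."""
    busy_ms_per_s = (reads_per_s + writes_per_s) * svctm_ms
    # 1000 ms of busy time per second would be 100% utilization.
    return busy_ms_per_s / 1000.0 * 100.0


for dev, r, w, svctm, reported in SAMPLE:
    est = estimated_util(r, w, svctm)
    print(f"{dev}: estimated {est:.1f}% vs reported {reported:.1f}%")
```

The estimates land within a fraction of a percent of what iostat reported for all four disks. sdb is handling about the same number of requests per second as the others, but each one takes around 16 ms to service instead of around 2 ms, and that is where the 93% comes from.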
If the SMART test comes out okay, I'll try re-adding the disk. If we get a SMART test failure, or the disk acts up again when re-added, we're going to have to replace it too. I really hate these things.
For now, mjollnir is running degraded. sdc will handle sdb's read load in the meantime; if sdc fails too, mjollnir will die, but I don't find that likely. Even then, we still have thor to switch to, so I'm not worried at all.
I'm going to e-mail Garb and GED about what our budget is like. If we can afford to get another disk or two for backup in case of something like this, we should really consider it.