JAN 18
2009

A tale of data loss, narrowly averted

Last week, my file server started behaving very erratically. I checked the logs, and found lots of seek errors for one of the hard drives. Uh oh!

I fired up smartmontools to check the drive's SMART stats. Sure enough, the drive was reporting a lot of nasty errors. To prevent further errors, I immediately unmounted the filesystem. I intended to remount it read-only, but it refused!

I ran xfs_repair (this is an XFS file system) and, partway through the run, heard that most dreaded of sounds for a hard drive owner: clicking.

Some Googling led me to consider trying the freezer trick. People say this works if the drive's heads are stuck. That path looks like a one-way street, though, so I decided to run the badblocks command first. It reported only about 50 bad blocks on the entire drive.
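To put a count like that in perspective, here is a minimal Python sketch that summarizes a badblocks result. It assumes the list is the plain one-block-number-per-line output that badblocks produces, and that the scan used the default block size of 1024 bytes (in which case 50 bad blocks is only about 50 KiB of unreadable data). The function name and the sample numbers are made up for illustration.

```python
def summarize_bad_blocks(text, block_size=1024):
    """Summarize a badblocks-style list (one block number per line).

    Returns (runs, damaged_bytes), where runs is a list of
    (first_block, last_block) tuples for each contiguous stretch.
    block_size=1024 is an assumption: it is the badblocks default,
    but a scan run with -b would use a different value.
    """
    blocks = sorted({int(token) for token in text.split()})
    runs = []
    for block in blocks:
        if runs and block == runs[-1][1] + 1:
            runs[-1][1] = block          # extends the current run
        else:
            runs.append([block, block])  # starts a new run
    damaged_bytes = len(blocks) * block_size
    return [tuple(run) for run in runs], damaged_bytes


if __name__ == "__main__":
    # Hypothetical sample, not the output from the drive in this post.
    sample = "100\n101\n102\n5000\n7\n"
    runs, damaged = summarize_bad_blocks(sample)
    print(runs, damaged)
```

A few tight runs suggest localized surface damage; bad blocks scattered across the whole drive are a worse sign.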

Unlike some other file systems, XFS has no facility for marking bad blocks and working around them. At first I thought this was a deficiency, but now I think it's the right call: by isolating the defective drive and writing to it as little as possible, you stand the best chance of recovering its data.

I decided to simply buy another drive of the exact same size and make an exact image of the defective drive onto the new one. dd_rescue is a great tool for doing this. After about 6 hours, I had an exact replica of the old drive, sans the defective blocks. (The data on the defective blocks is lost forever, of course.)
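The core idea behind this kind of rescue copy can be sketched in a few lines of Python. This is an illustration of the technique, not dd_rescue's actual implementation (the real tool is considerably more sophisticated): read the source one block at a time, and when a read fails, write zeros in its place and carry on, so every surviving block lands at the same offset as on the original drive. The names rescue_copy and FlakyDisk, and the 512-byte block size, are assumptions for the example; FlakyDisk is only an in-memory stand-in for a failing drive.

```python
import io

BLOCK_SIZE = 512  # assumed block size for this sketch


def rescue_copy(src, dst, total_bytes, block_size=BLOCK_SIZE):
    """Copy src to dst block by block, zero-filling unreadable blocks.

    Returns the byte offsets of the blocks that could not be read.
    """
    bad_offsets = []
    for offset in range(0, total_bytes, block_size):
        length = min(block_size, total_bytes - offset)
        try:
            src.seek(offset)
            data = src.read(length)
        except OSError:
            # Unreadable block: note it and keep going instead of aborting.
            bad_offsets.append(offset)
            data = b""
        dst.seek(offset)
        # Pad with zeros so later blocks keep their original offsets.
        dst.write(data.ljust(length, b"\0"))
    return bad_offsets


class FlakyDisk(io.BytesIO):
    """In-memory stand-in for a failing drive (hypothetical test helper)."""

    def __init__(self, data, bad_blocks, block_size=BLOCK_SIZE):
        super().__init__(data)
        self.bad_blocks = set(bad_blocks)
        self.block_size = block_size

    def read(self, size=-1):
        # Any read that starts inside a "bad" block fails like a real I/O error.
        if self.tell() // self.block_size in self.bad_blocks:
            raise OSError(5, "Input/output error")
        return super().read(size)


if __name__ == "__main__":
    original = bytes(range(256)) * 8           # 2048 bytes = 4 blocks
    disk = FlakyDisk(original, bad_blocks=[2])
    image = io.BytesIO()
    print(rescue_copy(disk, image, len(original)))
```

Because the copy keeps every block at its original offset, the filesystem on the new drive looks the same as the old one, minus the zeroed holes that a repair tool then has to deal with.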

I pulled the old drive, inserted the new drive in its place, rebooted, and ran xfs_repair to fix up the file system around the missing blocks. Everything is great!

(Lest you worry, all my truly important data is backed up in at least two separate places, so I was never in danger of losing anything precious. But data loss of any kind is always a pain, if only because of the time wasted trying to figure out what you've lost.)

In summary:

  • Don't panic!
  • Immediately stop writing to the damaged drive. If possible, put it in read-only mode (e.g., with blockdev --setro; smartctl only reports on the drive and cannot write-protect it).
  • Run badblocks on the unmounted drive to see how bad the damage is.
  • If the drive works, but just has a lot of bad blocks, then get another drive of the same size (or larger) and use dd_rescue to create a copy.
  • Re-mount the copied drive in your system and run whatever filesystem repair tools you have to.
  • If the damaged drive has no life at all, then try the freezer trick... what else do you have to lose?!
  • Have a good backup strategy for all of your data!
