Thursday, August 04, 2011

Ext4 Corruption and Alternative Partition Backup Solutions

After my utter failure restoring partitions with Paragon's toolset, I've been looking into alternatives. Unfortunately, the damage I apparently did to my Ubuntu ext4 file system with the Paragon tools was deeper and longer-lasting than I expected.

Apparently, during the failed restore, the tool wrote a number of deeply corrupted files and directories, and now I can't delete them. Booting from a live CD and running a disk check and repair reveals no errors. The drive's SMART status is fine, and writing and reading large amounts of data elsewhere in the file system has worked without problems.

Some of the restored files were generated in a hierarchy that starts HardDisk0/Volume1. Trying to remove that directory (with sudo) produces the following:
rm: cannot remove `HardDisk0/Volume1/home/potts/.gksu.lock': Input/output error
rm: cannot remove `HardDisk0/Volume1/home/potts/.sudo_as_admin_successful': Input/output error
rm: cannot remove `HardDisk0/Volume1/etc/apt/secring.gpg': Input/output error
rm: cannot remove `HardDisk0/Volume1/etc/.pwd.lock': Input/output error
(and a few more similar errors). When I try to examine the file stats, I get something like this:
potts@potts-xeon-1:/sandboxes/HardDisk0/Volume1/home/potts$ ls -la
ls: cannot access .gksu.lock: Input/output error
ls: cannot access .sudo_as_admin_successful: Input/output error
total 8
drwxr-xr-x 2 root root 4096 2011-08-03 19:05 .
drwxr-xr-x 3 root root 4096 2010-09-01 16:44 ..
-????????? ? ? ? ? ? .gksu.lock
-????????? ? ? ? ? ? .sudo_as_admin_successful
(When ls can't even tell you anything about a file, that's generally considered a bad sign.) It looks like Paragon's tools really screwed the pooch, but I can't put the blame entirely on them, as it shouldn't even be possible to do this to an ext4 file system.
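To get a complete list of the damaged entries rather than finding them one rm error at a time, something like the following might help. This is only a minimal sketch: the function name find_unreadable_entries is my own invention, and it assumes the damage shows up the way it does in the ls output above, as a failed stat call on the entry.

```python
import errno
import os
import stat


def find_unreadable_entries(root):
    """Walk the tree under root and return (path, errno_name) pairs
    for every entry whose metadata cannot be read."""
    problems = []
    stack = [root]
    while stack:
        directory = stack.pop()
        try:
            names = os.listdir(directory)
        except OSError as exc:
            # The directory itself could not be listed.
            code = errno.errorcode.get(exc.errno, str(exc.errno))
            problems.append((directory, code))
            continue
        for name in names:
            path = os.path.join(directory, name)
            try:
                # lstat does not follow symlinks, so a dangling
                # link is not reported as damage.
                info = os.lstat(path)
            except OSError as exc:
                code = errno.errorcode.get(exc.errno, str(exc.errno))
                problems.append((path, code))
                continue
            if stat.S_ISDIR(info.st_mode):
                stack.append(path)
    return problems
```

Calling find_unreadable_entries("HardDisk0/Volume1") and printing each pair should report entries like .gksu.lock alongside an error name such as EIO, assuming the Input/output error shown above is what the stat call returns. It only reads metadata and never tries to modify or delete anything.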

It appears that a number of hidden files or files with special permissions were turned into corrupt inodes or some such; I'm not really an expert on Linux file systems. The troubling part is that e2fsck finds no issues to fix, even when run from a live CD.

This suggests that perhaps I am putting more faith in ext4 than is warranted at present. A robust filesystem ought to be able to recover from anything up to and including bad sectors that cause data loss, isolating that data loss so that it is as minimal as possible. It looks like I may need to wipe this partition yet again if I'm to trust it. Should I drop back to ext3? If ext4 has known problems like this, and I see from some Googling that it does, why is it the default file system for Ubuntu 10.04 LTS?

Anyway, on to other backup tools. I'm still looking for some combination of tools that will allow me to reliably back up the file systems on whole partitions and reliably shuffle and restore them. This does not seem like it is too much to ask for.

The following started out as a comment on the previous blog entry but I'm promoting it to a post here.

I wanted to look into some tools that would support ext4. Partclone looked like it would do the right thing, but the docs were a little too short on examples for me to understand easily. Clonezilla seems to be a curses-based interface to drive these tools, so I decided to try that.

Clonezilla from the PartedMagic 6.5 ISO seems to work for backing up a partition, and it is really fast (under 20 minutes, as opposed to seven hours with Paragon), albeit awkward: it keeps trying to mount my backup USB drive, after which I can't unmount it and the program won't allow me to use it as a destination. I'm sure there must be a way around this, but I haven't figured it out yet.

However, I just ran an experiment to try to restore a partition and the results were ugly. If you want to restore to a partition with a different number, for example sda2 instead of sda5, you can't do it directly. It fails without an error per se, but does point you at the FAQ. There is a workaround where you change the partition number as it is encoded in multiple filenames inside the actual backup, which makes me want to scream. There's another workaround involving creating multiple symbolic links, but when I read it, my monocle fell out in horror and I can't bring myself to describe how stupid and ugly it is.
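For what it's worth, the filename workaround could at least be scripted. The following is a rough sketch under stated assumptions: it assumes the partition name appears as a prefix of the affected filenames, the function names plan_renames and apply_renames are hypothetical, and it touches only file names, not anything stored inside the files. I would try it on a copy of the backup directory first.

```python
import os


def plan_renames(image_dir, old_part, new_part):
    """Return (old_name, new_name) pairs for files in image_dir whose
    names start with old_part. Nothing is renamed here."""
    plan = []
    for name in sorted(os.listdir(image_dir)):
        rest = name[len(old_part):]
        # Skip names like "sda50..." when looking for "sda5".
        if name.startswith(old_part) and not rest[:1].isdigit():
            plan.append((name, new_part + rest))
    return plan


def apply_renames(image_dir, plan):
    """Carry out a plan from plan_renames, refusing to overwrite."""
    for old_name, new_name in plan:
        target = os.path.join(image_dir, new_name)
        if os.path.exists(target):
            raise FileExistsError(target)
        os.rename(os.path.join(image_dir, old_name), target)
```

Printing the result of plan_renames(backup_dir, "sda5", "sda2") before calling apply_renames gives a chance to check the list by eye, which seems prudent given how this whole adventure has gone.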

But there is a bigger problem: you can't restore to a smaller partition. So I backed up a 450-GiB partition, and only 60 GiB were used by the file system. The compressed image was about 18 GiB. I wanted to restore this to a 125 GiB partition, which ought to have plenty of room to hold the contents of the file system I'm copying, but apparently that's not allowed. In this case I want to do this as a test, but it seems like migrating to a smaller hard drive is a pretty ordinary real-world scenario. For example, wouldn't it be nice if I could use a partition image to take a file system from a hard drive to an SSD?

But the partclone format seems to store only used blocks, and it seems to be unable to rearrange them into an unfragmented file system upon restore, so it insists on having the same 450-GiB partition (or larger) on the destination drive.
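The arithmetic behind my complaint can be put in a few lines. This is a toy model of my own understanding, not a description of how partclone actually works: it assumes a used-blocks image puts each block back at its original offset, while a file-level copy only needs room for the data plus some headroom. The 10% headroom figure is an arbitrary guess.

```python
GIB = 1024 ** 3


def fits_block_image(original_partition_bytes, target_partition_bytes):
    """Blocks go back at their original offsets, so the target must be
    at least as large as the partition the image was taken from."""
    return target_partition_bytes >= original_partition_bytes


def fits_file_copy(used_bytes, target_partition_bytes, headroom=0.10):
    """A file-level copy needs room for the data plus some headroom
    for file system metadata (the 10% default is arbitrary)."""
    return used_bytes * (1 + headroom) <= target_partition_bytes


original = 450 * GIB  # partition that was backed up
used = 60 * GIB       # space actually used by the file system
target = 125 * GIB    # partition I wanted to restore to

print(fits_block_image(original, target))  # False
print(fits_file_copy(used, target))        # True
```

Under those assumptions, the 125 GiB target has about twice the space the data needs, yet the image restore is still ruled out because only the original partition size counts.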

And finally, apparently you can't dig into a backup image to view the hierarchy or pull out one file or directory. This is something Paragon's tools give you (although that was pretty much the only part of performing a restore that I could get working). I could perhaps live with that although it does make it very inconvenient and time-consuming to rescue a single file, something I could easily do with Retrospect on the Mac almost 20 years ago. Meanwhile, we have sparse image support and a better disk utility that comes standard with Mac OS X, one which makes all this seem pretty horrifically primitive.

Maybe I'll have to stick to grsync, but I was hoping to use this tool not just on this server, but on my Windows laptop which is multiple-boot, with Windows 7 and two versions of Ubuntu, and which I would like to rearrange to recover some disk space (hence the desire to restore to a partition that isn't the same number I backed up from). Why is this so hard?

1 comment:

Paul R. Potts said...

Gparted has been an extremely reliable and steadfast friend through this. It scared the crap out of me tonight, though -- I left it completing a lengthy resize and move operation on multiple partitions, a series of five steps. It apparently quit after step 3. No error message, no nothing, just closed up the GUI. On the plus side, it appeared to complete the third step successfully and didn't appear to actually corrupt anything, as it might have if it had failed halfway through a partition move or resize.