Monday, August 01, 2011

Your Backup is Not A Backup if You Can't Restore It

Or at least, it may as well not be.

In the process of tweaking an Ubuntu system, I decided to modify some partitions, secure in a couple of facts:

1. I had only a small set of recent local changes that comprised important data I wasn't willing to live without.

2. I had a complete backup created with a commercial tool, Paragon's Hard Disk Manager Suite 2011.

Well, although I've been doing this sort of thing for a number of years, I apparently forgot some of my own rules about backups. First, one backup is not enough. And second, if you haven't tested the restoration process, your backup could very well be completely useless in a pinch. I ought to add a third rule: don't break a working configuration just to tweak it -- but I know myself well enough to know that I'm unlikely to live by that rule. It's often how I learn. So I'll propose a limited version of that rule: don't break a working configuration just to tweak it without carefully considering the expense, time, and effort required to reconstruct it. This was my own system and I estimated that the time and effort would be minimal. Of course, I was hopelessly optimistic about that. But on the other hand, if I weren't generally optimistic about this sort of thing, I'd grow to hate it, and once that happens, work becomes misery.
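The second rule can be made concrete with a small script. This is only a sketch, assuming you have done a test restore into a scratch directory and still have the original tree to compare against. The function names tree_digests and compare_trees are made up for illustration and have nothing to do with any particular backup tool. Symbolic links are skipped, which is a simplification.

```python
import hashlib
import os


def tree_digests(root):
    """Map each regular file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # simplification: symlinks are not compared
            h = hashlib.sha256()
            with open(full, "rb") as f:
                # Read in 1 MiB chunks so large files don't fill memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[os.path.relpath(full, root)] = h.hexdigest()
    return digests


def compare_trees(original, restored):
    """Return (missing, extra, changed) lists of relative paths, each sorted."""
    a = tree_digests(original)
    b = tree_digests(restored)
    missing = sorted(set(a) - set(b))   # in the original, absent from the restore
    extra = sorted(set(b) - set(a))     # in the restore, absent from the original
    changed = sorted(p for p in set(a) & set(b) if a[p] != b[p])
    return missing, extra, changed
```

Calling compare_trees on the original directory and the test-restore directory should give three empty lists if the restore is faithful. Anything else means the backup needs another look before you depend on it.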

I have generally been very satisfied with Paragon's partition manager, and when I had the chance to upgrade to their whole Hard Disk Manager Suite 2011 for $30, it sounded like a pretty good deal. I did this and then spent some time making partition backups. That all seemed to go well, although it can be quite slow. It took about seven hours to write 70 GiB or so to an uncompressed backup.

The problem came when I wanted to use one of these backups.

The backup in question lived on a Seagate external USB hard drive. It was in Paragon's proprietary archive format, which is in the form of a directory, arc_270711011814809, with a series of files inside with the same name and different extensions: .PBF, .pfm, .001, .002, etc. The idea (I think) is that no physical file is larger than 4 GiB. My entire backup set here is about 70 GiB. It represents a set of sandboxes of code trees checked out from a Subversion repository, with a few uncommitted local changes.

The restore process gives you a GUI that lets you find one of these backups and do something with it. Unfortunately Paragon does not seem to be very good at responsive GUIs. To wit, it's the type of "wizard" GUI that tries to drive you through a basic process, steering you through each step and then allowing you to move forward with a familiar "Next" button. But sometimes that "Next" button is dim, nothing else in the GUI will respond, and there is no busy cursor or animation or "please wait" or what-have-you at all, for several minutes. The only indication I had that the processes behind the GUI were not actually dead or in an endless loop was that my external hard drive light was flickering, and I could place my hand on the case and feel the heads moving.

It wouldn't bug me much if this were the case for five or ten seconds. But when it takes ten minutes, that's a pretty bad user interface implementation. Let's set that aside for now; eventually the GUI let me choose the .PBF file for my backup set and proceed.

The first thing I wanted to do was tell it where to put the restored data. I had created a new set of partitions on the original drive and there was a partition all set up and waiting. But apparently I had only two options: restore the backed-up partition contents (the file system) to the original partition, as recorded by the backup process, or restore it directory-and-file-wise.

That's really a head-scratcher. If I'm resorting to a backup, there's a very good chance that I've lost a hard drive. In that case, the original partition doesn't exist any more. I may have recreated the partition table of the original drive to the letter, using a printout of the partition table or something, but I think it's quite likely that I might have made some changes, and all I really want is to get those files back at the same mount point, so I want to restore the file system to whatever partition I specify, as long as it has enough room for the file system. I'm baffled that I can't do that. So I was unable to test that particular feature.

The next-best-thing is, I suppose, to look inside the backup and restore chunks of it. You have a hierarchical check-box interface that (slowly) churns through the backup file system tree and allows you to select what you'd like to restore.

The problem is that it doesn't work. Or, at least, I was not able to get it to work. Not with either of two separate backup images; not from two separate backup drives; not to a second external drive; not to the same external drive; not to a partition formatted with the same file system; not to a partition formatted with a different file system.

Let me amend that; I eventually was able to get two restore operations to work, when the restore operations were of a very small subset of my actual backup, consisting of only a few files, or a few hundred files, a few tens of mebibytes. These were (I think) where my critical uncommitted change set lived. I hope there wasn't anything else that was important.

The first thing I tried to do was just restore about 70 GiB. I started a restore in the morning. About four hours later, the visual progress indicator had made it about 5% of the way across its bar. The estimate for the remainder bounced around wildly, between 30 seconds and 25 hours. As a result, I had no useful estimate at all of how long the restore would take -- but the visual progress bar was not at all encouraging. On another attempt to restore a relatively small subset of the data, the display showed no visual progress bar at all but a spinning circle, with text that kept changing, generally apologetic in tone but reassuring me that the operation would take only a few more seconds. Three hours later I had to kill it.
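For what it's worth, the progress bar itself implies a number. Here is a back-of-the-envelope sketch, assuming (and it is a big assumption) that the restore proceeds at a constant rate and that the bar is honest. The function name projected_total_hours is just for illustration, and the inputs are my rough readings of "about four hours" and "about 5%".

```python
def projected_total_hours(elapsed_hours, percent_done):
    """Linear extrapolation: total time if progress continues at the same rate."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    return elapsed_hours * 100 / percent_done


elapsed = 4    # hours, approximate
percent = 5    # percent of the bar, approximate

total = projected_total_hours(elapsed, percent)
print(f"projected total: {total:.0f} hours, remaining: {total - elapsed:.0f} hours")
```

With those rough inputs the projection comes out to about 80 hours in total, well outside the 30-second-to-25-hour range the tool itself was reporting.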

My computer is a Xeon with a Seagate server-class internal hard drive. It's a year old and it's not slow. I use it to do large software builds.

I killed this restore, and did an experiment -- it took well under an hour to copy 70 GiB from the external hard drive to the internal hard drive using cp on the command line. Neither file system was corrupt. The USB connection worked normally.
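To put the numbers side by side, here is a rough throughput calculation. It is a sketch using only the approximate figures from this post: about 70 GiB written in about seven hours by the original backup, and the same 70 GiB copied with cp in under an hour (so one hour is treated as a worst case). The helper name throughput_mib_per_s is made up for illustration.

```python
MIB_PER_GIB = 1024
SECONDS_PER_HOUR = 3600


def throughput_mib_per_s(size_gib, hours):
    """Average transfer rate in MiB/s for size_gib moved in the given hours."""
    return size_gib * MIB_PER_GIB / (hours * SECONDS_PER_HOUR)


backup_rate = throughput_mib_per_s(70, 7)    # Paragon backup, approximate
cp_rate_floor = throughput_mib_per_s(70, 1)  # cp took under an hour, so this is a lower bound

print(f"backup write: about {backup_rate:.2f} MiB/s")
print(f"plain cp:     at least {cp_rate_floor:.2f} MiB/s")
print(f"cp was at least {cp_rate_floor / backup_rate:.0f} times faster")
```

That works out to under 3 MiB/s for the backup against roughly 20 MiB/s or better for cp, over the same USB connection and the same drives.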

I had a four-day weekend coming up, so I tried again. After three full days of checking on the restore operation periodically, the visual progress bar was still far short of the halfway mark. When I checked on it on day 4, the Windows system it runs on top of had crashed with a black screen of death reporting a non-specific I/O error; the options to retry didn't do anything.

Now, I wasn't watching, so I'm not sure what happened when it actually crashed. But I do know that the longer a process takes, the more likely it seems that something in the real world will interfere with it -- for example, it is summer in Saginaw and we get occasional severe thunderstorms. When that happens I want to shut down my computers and turn off their various power strips, which range from cheap ones to rack-mount Furman strips with voltage monitors. If a restore operation is going to take 72 hours or more to complete I can't do that. It also makes a mockery of the idea of having a spare drive on hand so I can bring the server back up quickly.

My work often has real deadlines with real paying clients. My time is, in fact, money under those circumstances -- or at least if enough of it is lost, real money is at risk of being lost too. All I can say is that I got the message that this backup solution is not reliable at a time when I wasn't cranking on an urgent deadline and the stakes were not high.

I've tried various permutations: copying the backup files to a partition on the same drive, and attempting the restore again; the result was the same. I had two backup images to work with: my 70 GiB backup and a much smaller one of about 5 GiB. I had similar results with both of them, although, as I mentioned, by selecting a very small subset of the small backup I was able to complete extraction of a single directory containing a few files.

I don't know if the backup is corrupt in some way; I never saw any kind of message indicating that it was, and the original backup processes seemed to complete without any problem. But right now, while I still like Paragon's partition manager, I very strongly advise you against trying to use their backup solution, and I'll be extremely hesitant to experiment again with my Windows system.

I'm going to make a concerted trial of some other backup solutions. Partimage seems to be out of the question now, as it does not support ext4, which is the default for recent versions of Ubuntu. I'll be testing partclone. And quite likely I'll be working something up with good old rsync as well. But right now, I've unfortunately got several days to spend babysitting checkouts from a Subversion repository and manually merging the few files I did manage to salvage from this slow-motion disaster.


Dave Leigh said...

Add this to your rules... never ever ever under any circumstances use any backup software that utilizes a proprietary format. They are unnecessary and useless when the backup software disappears along with the trashed partition. What's worse, I've seen situations where a new version of backup software won't read data written with an older version. This is just stupid. Avoid, avoid, avoid. Stick with tarballs or zip.

Paul R. Potts said...

Thanks, Dave. Right now I have gotten my system rebuilt and am experimenting with partclone and clonezilla and the PartedMagic live CD to clone partitions. One thing's for sure, it doesn't take any 7 hours to clone 60-something GiB. It's more like 18 minutes. I'll post a followup if I get one successfully restored.