The Week the Hard Drives Died
Dropbox, Google Drive, Amazon Glacier, Crashplan, Backblaze, oh my…
We've all seen the social media posts and online articles about the importance of backing up your data, the promos for various cloud backup services, the tips on avoiding data loss. Not to mention the flamewars for and against such and such a service.
So I thought I’d share a very recent experience of how badly things can go wrong once they start going wrong. Here’s my cautionary tale of photography IT disaster and recovery.
Like all stories, it really started a long time ago
I’ve been using computers professionally since 1992. That was the era of the Apple Macintosh II, Adobe Photoshop 2.01 (before layers and multiple undos, but with CMYK) and also when your boss would tell you to set your colour monitor to black and white so your computer would be faster and you would be a more productive employee.
In all that time, I’ve been shooting photographs and collecting high resolution scans and then digital originals on my hard drives. Thousands and thousands of them. In all that time, I’ve never had a Macintosh hard drive die. Not. One. Ever.
My latest computer, a custom-built late 2013 27” iMac, has been a Lightroom/Photoshop/web development workhorse since the minute I plugged it in. Maxed out with an eight-core 3 GHz processor, 32 GB of RAM and a 4 GB video card, it was a monster in 2013 and still acts like one. It’s been running OS X, but also Ubuntu Linux and sometimes Windows 7 and XP in multiple virtual machines. The hard drive has been used and abused.
About a month or so ago, in preparation for the winter semester when I’ll be teaching photography again, I decided to dump WordPress and put up a new photo portfolio, this time on SmugMug.
My ambitions for the new portfolio were modest: upload 500 images to start.
My images were for the most part ready to upload. Many had been prepped for submission to Corbis, who were my stock agency before they were gobbled up by Getty. But the theme I selected on SmugMug required consistent use of metadata for the titles and descriptions displayed under each image, as well as well-organized keywords for efficient search. So, I fired up Adobe Lightroom and bulk-added the metadata wherever it was missing. Not just to the selected final TIFFs, but also to the parent raw and PSD files. And to all the sibling files from each shoot.
In other words, I tidied up and standardized my Adobe Lightroom library in a big way.
On any given day of that process, I would add or edit metadata on several thousand files weighing anywhere from 30 GB to 500 GB in total. That’s a lot of data writing and a lot of files to back up. Raw files are easy because Lightroom writes to sidecar .XMP text files, which weigh next to nothing. But Lightroom directly embeds metadata into files of other formats such as DNG, JPG, PSD or TIF. In other words, you can add a few bytes of metadata to a TIF but have to back up the entire 120 MB picture file. (We’ll discuss incremental backups later…)
At first everything seemed fine
Fire up Lightroom, select a set of images, add titles, keywords and captions. It’s actually a process I enjoy. I’ve set up a productive workflow over the years, so time goes by quickly and I am often surprised by how many images I have edited by the end of the day.
All was progressing normally until one day my cloud backup app, Crashplan, would never complete its backup routine. I checked the status info to find out what was happening: it was choking on one file. I tried opening the offending image in Photoshop but got an input/output error. So I restored it from my external USB hard drive backup. Crashplan was happy again.
At this point, I had no idea where that particular image’s problem came from. Was it a recently corrupted file? Was it an older file that suffered from bit rot? Was Lightroom failing to write to disk correctly 100% of the time? Was it an HFS+ filesystem error? I was suspicious, but had few clues to follow. Thankfully, I had backups to fall back on.
Backup, backup, backup
Speaking of backups, it’s worth mentioning that at that time I was using five external USB drives in addition to the computer’s 3 TB internal drive:
- 3 TB: documents (school, personal, software archives, music, movies, virtual machines).
- 4 TB: backup of all original raw files from my photo outings.
- 4 TB: backup of my internal drive’s photography folder, which is my Lightroom library folder with raw files, HDR and panoramic DNGs, PSDs and final TIFFs.
- 1 TB: bootable backup of my operating system and apps, plus the virtual machines and music.
- 2 TB: backup of my documents and movies.
I use Econ Technologies’ ChronoSync app to run all these disk-to-disk backup routines with the safe copy option turned on, so that the copied data is verified as being error-free. I could easily cut down on the number of external disks by buying newer, larger ones (which I eventually did, read on…), but my next backup strategy will be a UNIX-based NAS file server using the ZFS filesystem. That build is coming in the next few months. And yes, ChronoSync supports saving to a NAS.
What about Time Machine?
I’ve always been a fan of Apple’s Time Machine backup software. However, dealing with so many terabytes of photos (plus all my other crap) means that I run out of single disk hard drive space quickly. Time Machine is really designed for single disk use – although it does support backing up to multiple drives. But the interface to configure multiple drives is clunky and tedious as you have to tell it which folders not to back up. One by one.
Using ChronoSync is much more straightforward. Create an action. Tell it which files go where. Done. Start over for a different set of files. Simple. Plus, you don’t have to keep shelling out extra cash for the biggest drives on the market.
So back to the photography story. Editing files. Adding several keywords from a hierarchical controlled-vocabulary keyword set with a single click. Saving to all my backup disks and to Crashplan. All seemed well until another corrupt file appeared. And then another. I checked my disks with Apple’s built-in Disk Utility, which returned a few bad blocks. I fixed those.
More photography. Still more bad files.
And yet more restores from backup were needed. In a matter of hours on that last day, the reliability of my system became substantially worse. I installed the command-line smartmontools SMART utilities. Still all green lights.
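For reference, this is roughly how you query a drive’s health from the terminal. It’s a sketch only: the device names (`disk0`, `disk2`, …) are examples and vary from machine to machine, so check `diskutil list` first.

```shell
# Install smartmontools (via Homebrew here), then ask a drive for its
# SMART self-assessment. The device name below is an example only.
brew install smartmontools
smartctl -H /dev/disk0    # overall health verdict: PASSED or FAILED
smartctl -a /dev/disk0    # full attribute table, error log and self-test log
```

As this story shows, a PASSED verdict is no guarantee: SMART can keep reporting green right up until the drive gives out.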
Of course, I knew something was definitely wrong and going south fast. But was it hardware or software? I suspected hardware because the OS X Console caught the i/o errors. I had the symptoms but not the definitive cause. So I installed EtreCheck. On its third run, I finally had the smoking gun: disk failure. And a console screen full of disk2 i/o errors. I printed both and went to the Apple Store.
As per their protocol, Apple tested the iMac with their own diagnostic software which gave the big “Failed” conclusion you can see in the image at the top of this post. Apple replaced the 3 terabyte fusion drive within 48 hours.
Although I had excellent service from Apple during this experience, one issue left a sour taste in my mouth. When I asked for my old dead hard drive back, they wanted to charge me $100 to hand it back!
The staff and manager tried to convince me that Apple was taking care of recycling the drive and taking that burden of responsibility away from me, but I still don’t buy it. I paid for that hard disk when I bought the computer. It had my data on it. I paid for a new drive and for the service to install it. You can just hand back my dead hardware without any fuss, thank you very much.
Brand new empty computer
When you have catastrophic drive failure, you are basically starting your filesystem from scratch just as if you had bought a brand new computer. As a colleague of mine once said quite wisely to one of our students: "You back up when things are going well on your system. Not when things have gone bad." You need a copy of your working system, at all times.
Don't trust that new drive!
Once back in my office, and before loading my data back onto the machine, I booted off my cloned backup boot disk and certified both the old SSD and the new SATA disk drive using SoftRAID’s certification feature. This test scans the disk sector by sector a minimum of three times over to guarantee the disk has no problems. Disk manufacturers don’t perform these tests simply because they take too long. My 3 TB drive took about 18 hours to certify. Just because it’s new doesn’t mean it’s trustworthy.
I then re-fused the SSD and the SATA disk into a Fusion Drive (SoftRAID’s certification process had erased the Core Storage configuration that made the two drives appear as one) using the coreStorage verbs of the diskutil command-line utility, then reinstalled OS X.
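For the curious, the re-fusing looked roughly like the sketch below. The disk identifiers and volume name are examples and will differ on your machine, and these commands erase both drives, so treat this as illustration only.

```shell
# Run from Recovery or an external boot disk. In this example disk0 is
# the SSD and disk2 the HDD -- verify with `diskutil list` first.
diskutil list
diskutil coreStorage create "Macintosh HD" disk0 disk2   # fuse into one logical volume group
diskutil coreStorage list                                # note the Logical Volume Group UUID
diskutil coreStorage createVolume <lvgUUID> jhfs+ "Macintosh HD" 100%
```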
Once booted into my new OS, I used Apple’s Migration tool to import data from the bootable clone backup disk. This included all application software and user accounts and everything inside those user accounts: desktop contents, documents, email - all of it. That took the better part of a day.
Then came the restoration of my photography files. I used ChronoSync to safe-copy everything from the external USB disk back to my photography folder. That took just over sixteen hours. I was now back to normal, with minimal discomfort.
It ain't over until there's more swearing
Back to normal. Or so I thought.
By default, many external USB/FireWire hard drives don't send SMART (Self-Monitoring, Analysis and Reporting Technology) reports to the operating system. However, certain disk utility packages such as SoftRAID and SMARTAlec can fetch more information than is usually available to the operating system. And, to my horror, that is exactly what they did.
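smartmontools can often pull off the same trick. As a sketch (the device name is an example), the SAT pass-through device type asks the USB bridge to forward SMART commands to the drive behind it:

```shell
# Many USB enclosures hide SMART from the OS; "-d sat" tells smartctl to
# use SCSI-to-ATA Translation pass-through. disk4 is an example name.
smartctl -d sat -H /dev/disk4    # health verdict for an external drive
smartctl -d sat -A /dev/disk4    # attributes: reallocated sectors, pending sectors, etc.
```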
Once I got my system back up and running after the internal SATA drive died, I had a flurry of impending external disk failure warnings from my freshly installed utilities. Multiple drives from both Seagate and Western Digital, some old (almost a decade) some relatively young (a year or two), were all flagged as potentially about to kick the bucket.
One of them, a Western Digital 4 terabyte My Book model with copies of all my original raw files (the default "second location" in Lightroom's Import dialog box), suddenly stopped mounting at all. On restart it would show up for a while and then quietly disappear for no apparent reason. Worse, it reported itself to the operating system (and therefore to Crashplan) as entirely empty. Had Crashplan's settings not been set to preserve deleted files indefinitely, all of those images would have been gone. Even in the cloud the disk appeared empty. Only by enabling "View deleted files" could I regain access to them.
So, in the end, I did replace my numerous external USB drives from last fall with an 8 TB external Seagate USB drive, which I have affectionately named "SeaMonster". The first files to go on it were the backups of the original raw files. For some unknown technical reason, I was able to mount the Western Digital My Book long enough to copy off the 1.5 TB of data (about six hours) before it unmounted itself yet again. I tested the files and they are good. Strangely enough.
My Plan B, as a last resort, would have been to restore all the files from Crashplan, which would have taken a few days.
Once I had all my files moved over to the SeaMonster, I certified all my externals and threw away any devices that failed the test. That's three drives. And one more WD Passport is likely to follow soon enough.
Thank God for incremental backups
The good news through all of this is that, even though I restored all my files to new devices with new labels ("disk names"), the Crashplan backup service is smart enough to upload only the new blocks of information from within each file.
So, for example, if I had modified an image by adding metadata to it before the drive failed, and then restored this updated copy from backup to a new hard drive, Crashplan would compare the new file on the new hard disk with the version in the cloud and upload only the new blocks of data containing the metadata.
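To make the idea concrete, here is a toy sketch of block-level change detection, the general technique behind this kind of incremental backup (the file names are invented, and real backup tools use more sophisticated rolling checksums):

```shell
cd "$(mktemp -d)"

# Fake a 64 KB "image", copy it, then simulate a small in-place metadata edit.
dd if=/dev/zero of=photo_v1.tif bs=1024 count=64 2>/dev/null
cp photo_v1.tif photo_v2.tif
dd if=/dev/urandom of=photo_v2.tif bs=1 count=16 seek=9000 conv=notrunc 2>/dev/null

# Hash each version in fixed 8 KB blocks (on macOS, use `md5 -r` in place
# of `md5sum`) and compare the manifests: only one block differs.
split -b 8192 photo_v1.tif v1_ && md5sum v1_* | awk '{print $1}' > v1.manifest
split -b 8192 photo_v2.tif v2_ && md5sum v2_* | awk '{print $1}' > v2.manifest
changed=$(diff v1.manifest v2.manifest | grep -c '^>')
echo "changed blocks: $changed out of 8"    # prints: changed blocks: 1 out of 8
```

Only that one changed block would need to travel to the cloud, which is why a month-long initial upload never has to be repeated.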
Considering my initial Crashplan backup took over a month to complete, it’s great to know that the application won’t re-upload everything just because a file’s location on disk has changed or a minor metadata update has been made to it.
In retrospect, I probably should have replaced my internal hard drive as soon as I saw the first i/o errors. However, as I mentioned before, this was the first time a Macintosh hard disk had died on me like this. So I was, as per usual, looking for software glitches, simply because of twenty-something years of computing experience. And since the progression from one error a day to dozens per day happened quickly, the initial symptoms were far more ambiguous than the later stream of kernel messages.
Also, multiple hardware checks were coming back clean, so my hunches were nothing more than just that: hunches. I wanted to walk into the Apple Store, hand them undeniable proof, get a new disk and get out asap. The EtreCheck report and the screen capture of the kernel i/o errors were just that. But the failure was gradual: it faded in slowly before kicking into overdrive right at the end.
Backup, backup, backup (bis)
The biggest lesson learned from this experience is that an aggressive backup strategy is essential. The minute I noticed my system was not stable, I started aggressively backing up all my files. When my hard drive finally did die, I had been backing up everything constantly for days, both online and to disk. The only files I lost were from the last few hours before the drive completely failed, and only because I intentionally interrupted the backup-to-disk routine for fear that data corruption was spreading from the failing drive. Cloud backups are slower, so they always lag slightly behind.
Dropbox me baby!
The only modification I made to my workflow since all this happened is that I’ve moved my Lightroom catalogue to my Dropbox folder so that it is continuously being backed up and there is no such thing as an out of date copy.
Technically, I left the Lightroom catalogue in its original location with my photography files, but created a UNIX symbolic link that acts as an alias within the Dropbox folder. You could do something similar with iCloud, but in the opposite direction: move the catalogue to the iCloud folder and make the symbolic link in your Pictures folder.
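In practice the link takes one command. Here is a sketch with hypothetical paths (adjust them to wherever your catalogue actually lives):

```shell
# The catalogue stays put; Dropbox just sees a folder that points at it.
LIBRARY="$HOME/Pictures/Lightroom"          # real home of the catalogue
mkdir -p "$LIBRARY" "$HOME/Dropbox"         # stand-ins for this illustration
touch "$LIBRARY/Portfolio.lrcat"            # stand-in for the real catalogue file

# Symbolic link inside Dropbox pointing back at the library folder.
ln -sfn "$LIBRARY" "$HOME/Dropbox/Lightroom"
ls -l "$HOME/Dropbox/Lightroom/"            # shows Portfolio.lrcat through the link
```

For the iCloud variant you simply reverse the direction: the catalogue lives in the iCloud folder and the symbolic link sits in Pictures.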
Remember the old adage: it's not a question of if your hard drive will fail, but when. If you are well prepared, there's not much to it: replace the faulty hardware and get back to work. If you aren't, it can be a business-ending disaster.