Disaster Recovery

It’s been quite a while since I’ve had to write a detailed post-mortem, and luckily this time the impact is minimal: I accidentally nuked the contents of the hard drive of my laptop, which I rarely use for any “serious” work. It’s made me reconsider disaster recovery plans, because mine didn’t quite stand the test.

The incident

I was flashing the embedded controller of my Thinkpad X230, to make it accept a brand-new but third-party battery. Doing this under OpenBSD was a bit of a challenge (mostly because the rest of the world tends to assume you’re running a GNU userland), but I found myself with an image file at the end.

The final step was dding it to a USB stick and booting from it:

doas dd if=patched.x230.img of=/dev/rsd2c bs=4M conv=fsync

On my system, sd0 was the main SSD, configured as a softraid(4) encrypted volume; hence the virtual sd1 was where the “real” operating system and data resided. The USB stick would show up as sd2.

I reboot to find that the stick doesn’t want to boot. Back to the OS, re-read the instructions carefully, remake the image, write it again, read the side notes, and notice:

Ensure your BIOS has been configured to boot from “Legacy” and not “UEFI” before trying to boot.

Ah of course, this must’ve been the culprit. I reboot to the BIOS setup utility, switch to legacy BIOS boot - the stick boots, flashing the EC works! I reboot, type in the passphrase, and find the OpenBSD bootloader unable to find the kernel. What?

I use another machine to write the OpenBSD installer to the USB stick to investigate. bioctl -cC -l sd0a softraid0 unlocks the drive, but I knew this would be the case already. I look at the disk label of sd1, and see there’s only a single DOS partition present. I mount the partition, and see… a bunch of BIOS update files?

I realize what happened. After the first reboot, with the USB stick still plugged in, it was enumerated by the kernel as sd1, while the decrypted root disk became sd2, because softraid kicked in at a later stage during the boot. I blindly ran dd from shell history, and it overwrote my precious bits. Ouch.
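A guard along these lines could have caught the swap before any bytes were written. It is only a sketch: it assumes the "label:" line printed by disklabel(8) is enough to tell the USB stick from the softraid volume, and disk_label, safe_write, and the expected label string are hypothetical names chosen for illustration.

```sh
# Extract the "label:" field from disklabel(8) output read on stdin,
# with trailing spaces removed.
disk_label() {
    sed -n 's/^label: *//p' | sed 's/ *$//' | head -n 1
}

# Write image $1 to disk $2 (e.g. sd2) only if the disk's label equals $3.
# safe_write is a hypothetical helper, not an existing tool.
safe_write() {
    img=$1 disk=$2 want=$3
    have=$(doas disklabel "$disk" | disk_label)
    if [ "$have" != "$want" ]; then
        echo "refusing: $disk is labelled '$have', expected '$want'" >&2
        return 1
    fi
    # Same dd invocation as above, but only reached after the check.
    doas dd if="$img" of="/dev/r${disk}c" bs=4M conv=fsync
}
```

Something like safe_write patched.x230.img sd2 "USB Flash Disk" would then refuse to run if sd2 had turned into a different disk after a reboot. The label string here is a placeholder; the real one would have to be read off the stick first.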

Impact

I have regular backups on Glados, aka my home NAS, which I turn on at least once a week. I don’t keep the box powered on 24x7, because the spinning rust and fans make a bit of noise; it’s not a lot, but it bothers me. My Mac Mini is on 24x7 (because it’s practically impossible to hear it, even at full load), so it always gets a chance to make a backup; but with the laptop spending a lot of its time asleep, the “scheduled” backups are pretty much purely opportunistic.

A reminder in my calendar pops up on the first Monday of each month, to tell me to verify the backups - there’s a very long checklist of things to look at (including work stuff), from which I usually pick 2-3 items at random (with a clear bias towards the work stuff). The first Monday of this month would be today (2022-05-02), and the incident happened the day before (because what better things to do on a Sunday than try to brick your EC). Well, it looks like I haven’t looked at the laptop backups for a very long time, because the freshest files I can find are from mid-2021.
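The staleness part of that checklist is easy to automate. The following is a rough sketch rather than a full verification: it only checks that some file in a backup directory was modified recently, and backup_is_fresh as well as the example path are hypothetical.

```sh
# Succeed if directory $1 contains at least one regular file modified
# within the last $2 days (default 7). backup_is_fresh is a hypothetical
# helper name.
backup_is_fresh() {
    dir=$1 days=${2:-7}
    [ -n "$(find "$dir" -type f -mtime "-$days" | head -n 1)" ]
}

# Example use (the path is a placeholder for wherever the backups land):
#   backup_is_fresh /mnt/glados/laptop 7 || echo "laptop backup is stale" >&2
```

A check like this says nothing about whether the backup is complete or restorable, but it would have flagged files from mid-2021 long before they were needed.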

It didn’t cross my mind to back up the laptop before messing with the BIOS/EC, because there was no actual data to lose - if I brick the laptop, I can always yoink the SSD and put it in another machine.

How much data did I lose? Well, I’ve been working lately on a cool side project - maybe it’s a topic for another post, but I’d like to build a new “hacker friendly” desktop environment, so there were a lot of quick notes, prototypes, and one-off scripts that I didn’t care to commit and/or push. The most important part was what I learned along the way, so even though all the code so far is lost, I don’t mind a fresh start.

It also never occurred to me to back up /etc, and as it turns out, there were a lot of things in there that are annoying to recreate, like SSH & ZeroTier keys, suspend/resume scripts, my xenodm greeter script, etc. So I’ll probably waste a weekend configuring things.
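A backup job that lists /etc explicitly would have covered these files. As a rough sketch, assuming Borg as the tool (it comes up again below) and a hypothetical repository URL; the run and backup_laptop helpers and the retention numbers are illustrative choices, not a recommendation:

```sh
# Hypothetical repository location; replace with a real, always-on host.
BORG_REPO=${BORG_REPO:-ssh://backup@offsite.example.org/srv/borg/laptop}

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Archive /etc and /home, then thin out old archives.
# Needs to run as root to read everything under /etc.
backup_laptop() {
    run borg create --stats --compression zstd \
        "$BORG_REPO::{hostname}-{now:%Y-%m-%d}" /etc /home &&
    run borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 "$BORG_REPO"
}
```

Running it once with DRY_RUN=1 prints the two borg commands without touching the repository, which is a cheap way to review what would be backed up.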

How could I have reduced the impact?

  1. Improve the backup strategy and implementation.

    Having complete and up-to-date backups for this laptop was the somewhat unlikely “happy path”; I also relied on a hacked-together rsync script. Manually verifying backups is tedious enough that I was doing a half-hearted job. I excused myself, because setting up a Time Machine server for macOS already took so much effort, and I didn’t want another involved solution for just one machine.

    I should at the very least use a “real” backup tool like Borg, and use a 24x7 off-site server as the backup repository. I’m considering moving Glados to a friendly server rack at a neighboring DC.

  2. This incident seems very preventable.

    Linux systems typically maintain a set of dynamic symlinks in /dev/disk/by-*, making it possible to address disks by physical device path, UUID, label, etc.

    OpenBSD does assign unique IDs to disks, but you have to look them up separately, and they’re not mapped back to the device files in /dev. You can use these IDs in places like fstab(5), but not with tools like dd(1). (Sounds like a project idea.)
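The lookup itself can be approximated in a few lines of shell. This sketch assumes the "name:duid" pairs reported by sysctl hw.disknames are the source of truth; duid_to_dev is a hypothetical helper, and its optional second argument exists only so the parsing can be tried without real disks.

```sh
# Print the device name (e.g. "sd2") that currently holds the disk with
# DUID $1. By default the disk list comes from `sysctl -n hw.disknames`
# (format: "sd0:<duid>,cd0:,sd1:<duid>"); $2 can override it for testing.
# Exits non-zero if no disk matches.
duid_to_dev() {
    duid=$1
    names=${2:-$(sysctl -n hw.disknames)}
    printf '%s\n' "$names" | tr ',' '\n' |
        awk -F: -v d="$duid" '
            # Append "" to force a string comparison, even for all-digit IDs.
            ($2 "") == (d "") { print $1; found = 1 }
            END { exit !found }'
}
```

With that, a script could resolve the stick's ID to its current device name right before writing, and bail out if the lookup fails, instead of trusting that sd2 still means what it meant before the last reboot.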

Silver lining

The hacked EC doesn’t dislike the new battery anymore. I think I can get another ten years out of this laptop.