
The Great DAB Drive Crash

Also known as "fw0 Drive Crash 2024-09-08", this page details the events of the drive crash on uryfw0 that occurred on the 7th/8th of September 2024.

Timeline of Events

Saturday, September 7th, The Day before the big DAB switch on show.

11:35 AM, David Hemingway reports that the web browsers in the studios are unable to connect to the internet (Proxy Server Refusing Connections). The issue goes unresolved for some time, as many members of the team are either not in the country, not in York, actively heading back to York, or working on the open day OBs. By that afternoon, however, Ren is able to narrow the issue down to a failure to connect to the proxy server on fw0 that the studio PCs' traffic is proxied through. After some discussion in Slack on the viability of restarting our firewall on the day of a major broadcast (foreshadowing is a literary device), Ash restarts the squid proxy service on fw0, which had crashed earlier that day, and the studio PCs' connection is restored... For now...

Sunday, September 8th, The Day of the big DAB switch on show.

10:02 AM, Joe Brearley, having gone into the station early to finish planning the DAB Switch On Show, reports that the browsers in the studio are once again unable to connect to the internet. Marks remotes into fw0 and confirms squid has once again crashed. They attempt to restart it, but it immediately crashes again. Restarted in the foreground so that the logs can be read, it exits immediately with this error:

marks@uryfw0:~$ sudo /usr/sbin/squid --foreground -sYC
WARNING: Cannot write log file: /var/log/squid/cache.log
/var/log/squid/cache.log: Read-only file system
         messages will be sent to 'stderr'.

Upon further investigation, Marks is able to determine that fw0's filesystem is not in a good place:

marks@uryfw0:~$ pwd
/home/marks
marks@uryfw0:~$ touch test
touch: cannot touch 'test': Read-only file system
marks@uryfw0:~$ sudo fsck /
sudo: unable to execute /sbin/fsck: Input/output error
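
The "Read-only file system" errors above are what you typically see once the kernel has remounted a filesystem read-only after I/O errors. As a rough sketch of how that state could be detected from a script (the function names `list_ro_mounts` and `root_is_ro` are made up for illustration and are not part of any existing URY tooling), the mount options in `/proc/mounts` can be checked for the `ro` flag:

```shell
#!/bin/sh
# Print every mount point that is currently mounted read-only.
# Reads a file in /proc/mounts format:
#   device mountpoint fstype options dump pass
# The optional path argument exists only so the function can be
# tried against a sample file; it defaults to /proc/mounts.
list_ro_mounts() {
    mounts_file="${1:-/proc/mounts}"
    while read -r _dev mountpoint _fstype options _rest; do
        # Wrap the options in commas so "ro" only matches as a whole
        # option, not as part of e.g. "errors=remount-ro".
        case ",$options," in
            *,ro,*) printf '%s\n' "$mountpoint" ;;
        esac
    done < "$mounts_file"
}

# Succeeds (exit status 0) only if the root filesystem is read-only.
root_is_ro() {
    list_ro_mounts "${1:-/proc/mounts}" | grep -qx '/'
}
```

Note that some mounts are read-only by design, so a check like this is most useful when restricted to filesystems that are expected to be writable, such as `/`.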

By 10:25 AM, Kory has arrived on the scene. They disable the proxy on the studio PCs to allow traffic to continue, and then confirm that fw0 is in fact not in a good place, with the systemd journal rapidly spitting out disk write errors.

Systemd Errors

A SMART check on fw0's boot drive confirms that the drive is pretty much gone, and fw0 is essentially running off of what was cached in RAM at the time of the drive failure:

$ sudo smartctl --info /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-30-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
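
For scripted checks, smartctl also reports problems through its exit status, which the smartctl(8) man page documents as a bitmask. The helper below is only an illustrative sketch (the name `describe_smartctl_status` is hypothetical, and the messages are paraphrased from the man page) showing how that status could be turned into readable text:

```shell
#!/bin/sh
# Decode the smartctl exit status bitmask, as documented in smartctl(8).
# Prints one line per set bit, or a single "no problems" line for 0.
describe_smartctl_status() {
    status="$1"
    if [ "$status" -eq 0 ]; then
        echo "no problems reported"
        return 0
    fi
    bit=0
    # Messages are listed in bit order, bit 0 first.
    for msg in \
        "command line did not parse" \
        "device open failed, or device did not identify itself" \
        "a SMART or other ATA command failed, or a checksum error occurred" \
        "SMART status check returned DISK FAILING" \
        "prefail attributes at or below threshold" \
        "attributes were at or below threshold at some time in the past" \
        "device error log contains records of errors" \
        "device self-test log contains records of errors"
    do
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $msg"
        fi
        bit=$((bit + 1))
    done
}

# Example usage (needs root and a real device, so it is left commented out):
#   sudo smartctl -H /dev/sda; describe_smartctl_status "$?"
```

A non-zero status does not always mean the drive is dead, so the decoded output is best read alongside smartctl's own messages.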

By 10:43, the decision is made that a new firewall will be built ahead of the DAB Switch On Show. First, Kory, having been unsuccessful in locating URY's Ventoy USB, loads a Debian 11 image onto their own personal Ventoy USB and is prepared to begin rebuilding the OS by 11:27 AM. Then, Jamie remotes into fmstl to set our FM output to bedbox ahead of the impending outage. He is initially impeded by a lack of login credentials, but is eventually supplied them by Michael Grace. In attempting to switch our FM broadcast to bedbox, Jamie instead causes us to broadcast silence for a few minutes, but eventually manages to put it on bedbox.

Soon after this, at 12:00 PM, Kory takes fw0 offline. Kory and Ren then begin removing drives from the powered-off fw0 until they find the caddy containing the boot drive. The two are confused, however, when they have removed all the hard drive caddies from the server and still haven't found a single disk. The two then remove fw0 from the rack and open it up, revealing a concealed SATA drive inside, containing the server's OS. After removing this drive and placing fw0 back in the rack, Kory begins to fit one of the spare SAS drives into a hard drive caddy in order to insert it into an open hard drive bay, or at least they would, had they not realised that URY doesn't own a screwdriver small enough to undo the screws in the hard drive caddy. As such, Kory and Ren make a detour to borrow YSTV's iFixit kit (YSTV's Treasurer was consulted and approved the use).

Once the SAS drive is mounted into the front of fw0, Kory powers it on and begins the installation process at around 12:45, with Marks remotely providing them with the necessary network configuration around 15 minutes later. After about 20 minutes, Kory and Ren come to realise that the installer has frozen while attempting to install the GRUB bootloader to the drive, and the two decide to restart the installation. At some point in this process, Ren leaves to attend a HackSoc meeting they were late to.

Once the installation freezes in the same place again, the decision is made to SMART test the drive. Kory then has some difficulty finding a live-bootable ISO that comes preinstalled with smartctl. Around this time, Mia from YSTV arrives to offer the suggestion of running NixOS on fw0. Disappointingly, Kory briefly considers it. The two then, luckily, discover that GParted Live contains smartctl and, after running a test, find that the drive passes.

Now more confused, they consider the suggestion that the boot drive was originally an internal SATA drive because of some problem with the SAS backplane, and that an install onto another SATA drive connected directly to the motherboard might succeed. Without a 2.5-inch SATA drive to hand, and with time running out before DAB Turn On, Kory removes a personal SATA drive from their bag, wipes the (luckily completed) project from it, and inserts it into fw0. This time (after a failed attempt at building a new USB, and falling back to Kory's Ventoy), the installation succeeds and fw0 is back up and running.

Next, at around 15:52, Ash and Ren arrive (having returned from the aforementioned HackSoc meeting) to witness the successful installation. The four are then able to transfer the necessary config files and scripts from Marks' Google Drive backup onto newfw0 and, by 16:17, bring the network back online, ending the outage just 40 minutes before the DAB Switch On Show was set to go on air.

Many things had come back online by now, but at 6pm MyRadio was still taking incredibly long to load. By 6:20, Marks diagnoses the issue as Docker DNS relying on fw0, which hadn't been configured correctly to handle this post-reinstall. Marks then installs Unbound, and MyRadio begins to work as normal once again. At this point, the incident is declared ended.

Monday, September 9th, Final Critical Recovery Tasks

Just after midnight, Ash returns to set up the DHCP server, and then fixes a misconfiguration in it in order to disable IPv6. Then, at around 1pm, Ash installs and configures Squid, before realising that IRN has removed IP-based authentication and that, as such, the Squid proxy is no longer needed. Ash then configures the firewall script to run on boot, completing all remaining critical recovery tasks.

===INCIDENT END===

Highlighted Future Action Points

  1. Investigate why monitoring didn't alert us earlier to a drive crash
  2. Ensure we have up to date backups of our config files
    • Stratford and Moyles store these
  3. Ensure break-glass accounts exist on all servers (with passwords stored on vault)
  4. Ensure all interested parties (e.g. Computing Officers/Computing Committee) have credentials for all necessary servers/machines
  5. Make sure the Ventoy (Operating System Install USB) is kept in a known location
  6. Ensure we have a good, usable set of screwdrivers/screwdriver bits
    • An iFixit kit has been approved as part of this year's grant and will be bought when the grant is received.