Lessons learnt after a server crash

My Linux server went down last weekend and would no longer boot. It had some sort of hardware issue with the disk controller card: when it rebooted it could no longer see the card. Strangely, after leaving it off for half an hour and trying again it came back to life. However, when I powered it back up the /etc/fstab file was missing, which caused plenty of problems during start-up: unable to execute init, couldn't find the kernel, couldn't find initrd, and so on.

So it should have been easy enough to recreate the fstab file and get going again, but it seems something had changed in the hardware and the main hard drive was now hde instead of hda. No idea why. It hangs off a separate controller card rather than the onboard IDE, so presumably hda and hdb are the first channel on the onboard controller, hdc and hdd the second, and hde the first channel on the controller card? I'm not sure.
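
For anyone in the same boat, a minimal fstab for a single-drive setup is only a few lines once you know the device name. Something along these lines (the partition numbers and mount points here are illustrative, not necessarily my actual layout):

    /dev/hde1    /        ext3    defaults    1 1
    /dev/hde2    swap     swap    defaults    0 0
    proc         /proc    proc    defaults    0 0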

Anyway, long story short, e2fsck spent about an hour fixing errors on the drive. At some point the ext3 file system got converted to ext2, and who knows what else.
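
Since ext3 is basically ext2 plus a journal, getting back to ext3 once e2fsck is happy should just be a matter of recreating the journal with tune2fs and changing the fstab entry back to ext3. Something like this, assuming the root partition is /dev/hde1:

    e2fsck -f /dev/hde1     # force a full check first
    tune2fs -j /dev/hde1    # recreate the journal, turning ext2 back into ext3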

Good job I had weekly backups of my MySQL databases, website files, and deployed webapps on JBoss. It wouldn't take too long to reinstall the server and start again, right? Wrong. Over time I had configured and tweaked so many things to get everything working just right that getting back to this up-and-running state has taken about 15 hours. Normally I take good notes on what I configure, but there were a few things I had no idea how I originally got working (for example, how I configured Kompete CMS to use DOMXML and DOMXSL on PHP 5.0).
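
A weekly backup along those lines can be as simple as a short script run from cron. A rough sketch (the paths, credentials and JBoss deploy directory are placeholders, not my actual setup):

    #!/bin/sh
    # Weekly backup sketch: dump all MySQL databases and tar up the web and JBoss deploy dirs
    DEST=/var/backups/weekly
    STAMP=$(date +%F)
    mysqldump --all-databases --user=root --password=secret > $DEST/mysql-$STAMP.sql
    tar czf $DEST/www-$STAMP.tar.gz /var/www
    tar czf $DEST/jboss-deploy-$STAMP.tar.gz /opt/jboss/server/default/deploy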

Ok, lessons learnt:

  • data file backups are great, but they are only the bare minimum
  • back up the whole system, not just the data files

I used Ghost to image the whole machine prior to reinstalling, so I have a snapshot of it in its somewhat corrupted state, and once everything is configured just as I want it I'll reimage again for my main backup.
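
If you don't have Ghost to hand, a raw image of the whole disk can be taken from a live CD with dd instead, for example (device name and destination are assumptions, and the destination needs at least as much free space as the source disk):

    dd if=/dev/hde of=/mnt/backup/hde-full.img bs=64k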

I guess this was a hard lesson to learn.

Also, I narrowly escaped a second disaster: while I was configuring services on the machine the power went out and it went down hard! Luckily it came back up fine with only minimal errors from e2fsck.

Next steps:

  • Reimage the whole machine in its 'up and running' state.
  • Buy a UPS
  • Buy an identical second hard drive and set up RAID 1 mirroring (see the mdadm sketch below)
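
For the RAID 1 step, software RAID via mdadm looks like the simplest option. A rough sketch of creating a fresh mirror from two partitions (device names are hypothetical, and migrating an existing filesystem onto RAID 1 normally means building a degraded array first and copying the data across):

    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hde1 /dev/hdg1
    mkfs.ext3 /dev/md0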

I guess on the whole this machine has done pretty well – before this crash it had been up for 152 days without a reboot and had been working fine with no problems whatsoever (try that on a Windows server).
