You might remember that a while ago I wrote a post about upgrading my office test server from Debian Sarge to the almost-ready Etch, and that everything went as smoothly as I’m used to with Debian distribution upgrades.
Well, on 2007-04-10 I upgraded my personal production server (which also hosts this blog) to the finally released Debian 4.0, code-named “Etch.” I stuck pretty closely to the release notes, which is always a good idea even if you’re an experienced user.
At first, all seemed to go well. I performed the pre-upgrade step to pull in the new libc6, and afterwards performed the dist-upgrade, which pulled in the remaining packages that had to be upgraded or newly installed in order to satisfy dependencies.
BTW, I used aptitude for the task, which is now the recommended package manager for the text console. dselect has been deprecated. I had used it until now, and could operate it blindly, but after using aptitude a couple of times I’m impressed by how easy and logical it is to use, and I now really love it.
aptitude had to upgrade around 580 packages, which went pretty fast since I have a 6 Mbps DSL line. Lots of the configuration work could be done by answering questions posed by dialog, the tool that provides the text console with dialog boxes containing entry fields, radio buttons, and most of the GUI elements you know from X or Windows.
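For readers who want to see the order of the steps, here is a rough sketch of the sequence described above. The exact commands are my assumption and are not copied from the release notes, so check the notes before running anything. The `run` helper and the `upgrade_to_etch` function are hypothetical names used only for this illustration, and by default the script just prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set (and you are root).
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

upgrade_to_etch() {
    run aptitude update          # refresh the package lists
    run aptitude install libc6   # pre-upgrade step: pull in the new libc6 first
    run aptitude dist-upgrade    # then upgrade everything else
}

upgrade_to_etch
```

Keeping the dry-run default makes it easy to review the plan before committing to an upgrade of several hundred packages.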
When this main phase of the upgrade was finished, I was about to leave aptitude when a nasty error message popped up: something about /var/lib/blah not being readable. I wasn’t sure what this message was trying to tell me, mostly because I was a first-time user of aptitude, so I simply confirmed the message and was dropped to the shell. I then wanted to check the directory in question, and to my horror the Tab command-line directory completion feature didn’t work, but just produced an error message.
Those of you who are experienced Unix users probably know by now that I was in trouble. 🙁 My /var XFS filesystem was damaged, and so the kernel had shut it down. I tried to unmount it to have it repaired upon remount, but the kernel wouldn’t let me do that because the filesystem was still in use by dozens of processes, mostly services running on my box. Properly shutting them down was impossible, which is no wonder since the system is basically unusable without a /var filesystem. So I had to kill the processes, and after that I was able to unmount it.
I tried to remount it, but couldn’t, obviously because the damage was more serious than I had hoped. So I tried
xfs_repair on it — which also failed. 🙁
My last resort was to zap the filesystem log (using the -L option of xfs_repair), which was a very dangerous thing to do because it could have destroyed the filesystem beyond repair. Fortunately, I was able to remount the filesystem afterwards.
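The repair sequence described above can be sketched roughly as follows. This is an illustration under assumptions, not a transcript of what I typed: the device name `/dev/hypothetical-var` is a placeholder, `run` and `repair_var` are hypothetical helper names, and using `fuser` to find and kill the processes is just one way of doing it. By default the script only prints the commands. In dry-run mode the first `xfs_repair` always "succeeds", so the `-L` fallback is not shown in the printed plan.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set (and you are root).
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

repair_var() {
    device="$1"      # block device holding the filesystem (placeholder)
    mountpoint="$2"  # where it is mounted, e.g. /var

    run fuser -vm "$mountpoint"   # list processes still using the filesystem
    run fuser -km "$mountpoint"   # kill them so that the unmount can succeed
    run umount "$mountpoint"

    # Try a normal repair first. Only if that fails, zero the log with -L,
    # which is dangerous: changes still sitting in the log are thrown away.
    run xfs_repair "$device" || run xfs_repair -L "$device"

    run mount "$mountpoint"       # relies on an /etc/fstab entry
}

repair_var /dev/hypothetical-var /var
```

If you ever need this for real, make a backup or an image of the device first if at all possible, because zeroing the log cannot be undone.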
The first thing I did after I had repaired /var was to reboot the machine. I had not upgraded the kernel so far, so I was still running 2.6.8, which was Sarge’s default kernel. The main reason for not upgrading the kernel at that time was that I didn’t want to change the system too much in a single step: if the system didn’t come up properly after the reboot, I could be sure it was a problem in one of the new packages, and not a kernel problem.
Luckily enough, the system did come up properly. I was still hesitant to upgrade to Etch’s new default kernel 2.6.18, because a) I wasn’t forced to by some dependency, and b) I was scared of the new initial RAM disk procedure now in use with Etch. In Sarge, Debian used
initrd-tools; now in Etch, the tool to be used is
So I kept 2.6.8 running…
Until I did some package maintenance with aptitude again a few days later, and ran into the same problem described above: my /var filesystem was damaged again. 🙁
I used the procedure described above to repair it, and I was lucky enough to be able to save it yet another time. I still have no explanation for why the old kernel could have damaged the filesystem, because I had been running it for about half a year without any problems at all. The filesystem code is all within the kernel, so I don’t think it could be a compatibility problem with a low-level system library such as libc6.
Anyway, this repeated disastrous experience made me change my mind, and I performed the upgrade to 2.6.18. To my great surprise, the box came up fine again after the reboot. I had expected to have to boot into the rescue system to revive the machine, because I have my root filesystem on software RAID-1 (an md device), which is always a delicate thing. But the Debian guys really did an excellent job, and all went smoothly.
So the mystery remains why my /var filesystem was corrupted twice, both times after using aptitude. I know that aptitude, as a user-space program, can’t have destroyed the filesystem itself, but maybe it tried to write to parts of the filesystem that were “semi-broken,” and that triggered a kernel bug that caused the filesystem driver to crash.
I would be very interested to hear your comments on my experience. Did your upgrade work smoothly? Did you have any unusual problems? Do you have any idea as to the filesystem problems I was facing?