A somewhat nightmarish Debian upgrade…

You might remember that a while ago I wrote a post about upgrading my office test server from Debian Sarge to the almost-ready Etch, and that everything went as smoothly as I’m used to from Debian distribution upgrades.

Well, on 2007-04-10 I upgraded my personal production server (which also hosts this blog) to the finally released Debian 4.0, code-named “Etch.” I stuck pretty closely to the release notes, which is always a good idea even if you’re an experienced user.

At first, all seemed to go well. I performed the pre-upgrade step to pull in the new libc6, and afterwards performed the dist-upgrade that pulled in the remaining packages to be upgraded or to be newly installed in order to satisfy dependencies.
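The two phases described above can be sketched roughly as follows. This is a dry-run sketch under assumptions, not the exact procedure from the release notes: it assumes /etc/apt/sources.list already points at Etch, and it assumes the libc6 step was a plain `aptitude install libc6`. The helper names `run` and `upgrade_to_etch` are made up for illustration.

```shell
#!/bin/sh
# Dry-run sketch of the two-phase upgrade described in the post.
# Assumption: /etc/apt/sources.list already points at Etch.
# With DRY_RUN=1 (the default) commands are only printed, never executed.
DRY_RUN="${DRY_RUN:-1}"

# run: hypothetical helper that prints a command in dry-run mode
# and executes it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

upgrade_to_etch() {
    run aptitude update           # refresh the package lists
    run aptitude install libc6    # pre-upgrade step (assumed form of the command)
    run aptitude dist-upgrade     # upgrade the rest and pull in new dependencies
}

upgrade_to_etch
```

Running the script as is only prints the three commands. Setting `DRY_RUN=0` would execute them, which should only be done as root and after reading the release notes.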

BTW, I used aptitude for the task, which is now the recommended package manager for the text console. dselect has been deprecated. I had used it until now and could operate it blindly, but after using aptitude a couple of times I’m impressed by how easy and logical it is to use, and I now really love it.

aptitude had to upgrade around 580 packages, which went pretty fast since I have a 6 Mbps DSL line. A lot of the configuration work could be done by answering questions posed by dialog, the tool that provides the text console with dialog boxes containing entry fields, radio buttons, and most of the other GUI elements you know from X or Windows.

When this main phase of the upgrade was finished, I was about to leave aptitude when a nasty error message popped up, something about /var/lib/blah not being readable. I wasn’t sure what this message was trying to tell me, mostly because I was a first-time user of aptitude, so I simply confirmed the message and was dropped to the shell. I then wanted to check the directory in question, and to my horror Tab completion of directory names didn’t work; it just produced the error message “Input/output error.”

Those of you who are experienced Unix users probably know by now that I was in trouble. 🙁

“Somehow” my /var XFS filesystem was damaged, and so the kernel had shut it down. I tried to unmount it to have it repaired upon remount, but the kernel wouldn’t let me do that because the filesystem was still in use by dozens of processes, mostly services running on my box. Properly shutting them down was impossible, which is no wonder since the system is basically unusable without a /var filesystem. So I had to kill the processes, and after that I was able to unmount /var.

I tried to remount it, but couldn’t, obviously because the damage was more serious than I had hoped. So I tried xfs_repair on it — which also failed. 🙁

My last resort was to zap the filesystem log (using xfs_repair’s -L option), which was a very dangerous thing to do because it could have destroyed the filesystem beyond repair. Fortunately, I was able to remount the filesystem afterwards.
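The recovery sequence above can be sketched roughly as follows. This is a dry-run sketch under assumptions: the device name `/dev/sdXN` is a placeholder (the post does not name the device), `mount /var` relies on a matching /etc/fstab entry, and the helper names `run` and `repair_var` are made up for illustration. Using `fuser` to find and kill the processes is one possible way to do it, not necessarily how it was done here.

```shell
#!/bin/sh
# Dry-run sketch of the /var recovery sequence described in the post.
# DEVICE is a placeholder (assumption); replace it with the real block device.
# With DRY_RUN=1 (the default) commands are only printed, never executed.
DEVICE="${DEVICE:-/dev/sdXN}"
MOUNTPOINT="${MOUNTPOINT:-/var}"
DRY_RUN="${DRY_RUN:-1}"

# run: hypothetical helper that prints a command in dry-run mode
# and executes it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

repair_var() {
    run fuser -vm "$MOUNTPOINT"   # list processes still using the filesystem
    run fuser -km "$MOUNTPOINT"   # kill them (only if a clean shutdown is impossible)
    run umount "$MOUNTPOINT"

    # Escalate step by step: plain remount, then xfs_repair,
    # then xfs_repair -L as the dangerous last resort (it discards the log).
    if ! run mount "$MOUNTPOINT"; then
        if ! run xfs_repair "$DEVICE"; then
            run xfs_repair -L "$DEVICE"
        fi
        run mount "$MOUNTPOINT"
    fi
}

repair_var
```

Because every step "succeeds" in dry-run mode, the default output stops at the first `mount`; the `xfs_repair` steps are only reached when an earlier step actually fails.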

First thing I did after I had repaired /var was to reboot the machine. I had not upgraded the kernel so far, so I was still running 2.6.8 which was Sarge’s default kernel. The main reason for not upgrading the kernel at that time was that I didn’t want to change the system too much in a single step, because in case the system didn’t come up properly after reboot I could be sure it was a problem in one of the new packages, and not a kernel problem.

Luckily enough, the kernel did come up properly. I was still hesitant to upgrade to Etch’s new default kernel 2.6.18, because a) I wasn’t forced to by some dependency, and b) I was scared by the new initial RAM disk procedure now in use with Etch. Sarge used initrd-tools; in Etch the tool to use is initramfs-tools.

So I kept 2.6.8 running…

Until I did some package maintenance using aptitude again, some days later… and ran into the same problem described above: my /var filesystem was damaged again. 🙁

I used the procedure described above to repair it, and I was lucky enough to be able to save it yet another time. I still have no explanation for why the old kernel could have damaged the filesystem, because I had been running it for about half a year without any problems at all. The filesystem code is all within the kernel, so I don’t think it could be a compatibility problem with a low-level system library such as libc6.

Anyway, this repeated disastrous experience made me change my mind, and I performed the upgrade to 2.6.18. To my very big surprise, the box came up fine after the reboot. I had expected to have to boot into the rescue system to revive the machine, because I have my root filesystem on software RAID-1 (md device), which is always a delicate thing. But the Debian guys really did an excellent job: all went smoothly.

So the mystery remains why my /var filesystem was corrupted twice, both times after using aptitude. I know that aptitude as a user-space program can’t have destroyed the filesystem, but maybe it tried to write to parts of the filesystem that were “semi-broken,” and that triggered a kernel bug that caused the filesystem driver to crash.

I would be very interested to hear your comments on my experience. Did your upgrade work smoothly? Did you have any unusual problems? Do you have any idea as to the filesystem problems I was facing?

4 thoughts on “A somewhat nightmarish Debian upgrade…”

  1. I just upgraded (*) my desktop last night, starting around 9 and finishing around 11. I used the netinstall ISO so most of the stuff had to be downloaded and I think this took about an hour, but I was on the phone so I didn’t make the best use of the 2 hours either.

    * I say “upgrade” because I actually did a fresh install. Fortunately I had the OS on a separate partition from /home, so while I did a back-up, it was just precautionary. I didn’t need to do a restore and I avoided overwriting /home during the installation process by choosing manual partitioning (during which I simply set the correct partitions for swap and boot).

    My theory is that some decisions get made during a fresh install that can’t be easily accomplished during an “apt-get dist-upgrade”. I could be wrong about this, but that has been my observation.

    The only glitch in my install was that my USB keyboard doesn’t seem to be recognized during the initial boot into the installation program. I had an old style keyboard that saved the day, but there was probably another workaround I could have used.

    Other than that, I had absolutely NO configuration issues to resolve. Everything was detected. I should add that my USB keyboard isn’t even detected by the BIOS configuration utility, so I’m not even sure this is a Linux problem.

    But I was actually surprised that plug-in cameras and other USB devices seem to work fine. I needed a Samsung driver for my printer which was a snap to install. I had to install Java by hand of course. I’ll still have to get Realplayer, Acrobat and a few other non-free things as I run into the need for them, but this has been by far the easiest Linux install I’ve ever done. Part of that may be that I’ve done quite a few of them by now too.

  2. Thanks for your comments.

    As I said I don’t have a logical explanation for my filesystem corruption. I don’t think it has anything to do with the upgrade procedure as such.

    I agree that a “clean” Debian setup differs substantially from an upgrade, but to the best of my knowledge you should be able to arrive at the same configuration settings both ways. Remember that you can always reconfigure a package using dpkg-reconfigure pkg-name.

    Here’s a hint for you re. your USB keyboard: In most BIOSes you have a setting like Enable DOS compatibility or Enable legacy support or something like that. This actually enables a BIOS extension that provides USB keyboard and mouse support to 16-bit programs like DOS applications. It may very well be the case that the Debian setup uses this to detect whether it needs to support USB keyboards. If I were you I would at least try that option.

    And another hint re. installing Java: I always install Java using java-package. This converts a Java package you downloaded from Sun to an installable Debian archive that properly registers within the package manager. Very nifty!

  3. Thank you in return for that info. Someone else mentioned that BIOS setting. I’ll give it a try (or maybe just wait 2 years and hope I remember it then).

    I was totally unaware of Java-package. I hadn’t turned “contrib” on at first and I think I searched for such a thing before those packages had been added to my list.

    Thanks again.

  4. @macbeach: You’re welcome.

    In case you should try out the BIOS settings I recommended above I would appreciate some feedback whether it helped or not. Good luck!
