
Startled by “component device mismatches” on RAID1 volumes

I was startled today by a message in syslog that seems to point to a problem with my RAID1 volumes:

Mar  1 01:13:54 server mdadm[961]: RebuildFinished event detected on md device /dev/md3, component device  mismatches found: 512

This value is reflected in the following counter:
root:/etc/mdadm# cat /sys/block/md3/md/mismatch_cnt
512

I tried to clarify this by googling around, but found no definitive answer as to whether this is an actual problem or not. However, I did find a way to resync the MD components so that no mismatches remain:

root:/etc/mdadm# echo repair >> /sys/block/md3/md/sync_action

After you execute the repair you will notice that the counter shows the same number of mismatches again:

root:/etc/mdadm# cat /sys/block/md3/md/mismatch_cnt
512

This was to be expected, because the repair corrected (and thus encountered) exactly this number of mismatches. So if you force another check, the counter should be down to 0:

root:/etc/mdadm# echo check >> /sys/block/md3/md/sync_action
root:/etc/mdadm# watch cat /proc/mdstat
[wait until check is finished]
root:/etc/mdadm# cat /sys/block/md3/md/mismatch_cnt
0
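The repair-then-recheck cycle above can be wrapped in a small script. This is only a sketch based on the steps in this post (array md3, actions written to sync_action); the SYS variable is an assumption of mine, there purely so the polling loop can be exercised without touching a real array:

```shell
#!/bin/sh
# Sketch: run an md sync action and wait for the array to go idle again.
# Assumptions: array md3 as above, 10 s poll interval.
SYS=${SYS:-/sys}
MD=${MD:-md3}

run_sync_action() {
    # Kick off the requested action (repair, check, ...)
    echo "$1" > "$SYS/block/$MD/md/sync_action"
    # Poll until the kernel reports the array idle again
    while [ "$(cat "$SYS/block/$MD/md/sync_action")" != "idle" ]; do
        sleep 10
    done
}

# Usage (as root):
#   run_sync_action repair
#   cat /sys/block/md3/md/mismatch_cnt   # still shows the corrected count
#   run_sync_action check
#   cat /sys/block/md3/md/mismatch_cnt   # should now read 0
```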

By Ralf Bergs

Geek, computer guy, licensed and certified electrical and computer engineer, husband, best daddy.

4 replies on “Startled by “component device mismatches” on RAID1 volumes”

Update: I found the following statement from Neil Brown, which seems to be a good explanation of why these differences may happen:

Suppose I memory-map a file and often modify the mapped memory. The system will at some point decide to write that block of the file to the device. It will send a request to raid1, which will send one request each to two different devices. They will each DMA the data out of that memory to the controller at different times so they could quite possibly get different data (if I changed the mapped memory between those two DMA requests). So the data on the two drives in a mirror can easily be different. If a ‘check’ happens at exactly this time it will notice.

Normally that block will be written out again (as it is still ‘dirty’) and again and again if necessary as long as I keep writing to the memory. Once I stop writing to the memory (e.g. close the file, unmount the filesystem) a final write will be made with the same data going to both devices. During this time we will never read that block from the filesystem, so the filesystem will never be able to see any difference between the two devices in a raid1.

So: if you are actively writing to a file while ‘check’ is running on a raid1, it could show up as a difference in mismatch_cnt. But you have to get the timing just right (or wrong).

It so happened that I saw this error again a couple of days ago. Since it startled me again, I googled for it and came across my own article above as the first hit. I had simply forgotten that I had already seen and investigated this error a while ago… 🙂

In my case mismatch_cnt was exactly 128 and appeared only for /boot partitions on RAID1. I think this is pretty normal, because the event sequence was as follows:
1) OS installed on 2 HDDs with /boot and / partitions on RAID1 via mdadm
2) /dev/sdb at some point became faulty and was replaced
3) sfdisk -d /dev/sda | sfdisk /dev/sdb
4) mdadm /dev/md0 --add /dev/sdb1
5) echo -e "root (hd1,0)\nsetup (hd1)\nquit" | grub --batch

At stage 5) grub writes directly to the HDD, bypassing the RAID1 layer, so /dev/sda1 and /dev/sdb1 go out of sync and the weekly cron job 99-raid-check complains about mismatches.
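A repair pass brings the two mirror halves back in sync after such a grub run. A minimal sketch, assuming the /boot array is md0 as in the steps above; the SYS variable is my own addition, only there so the logic can be dry-run against a fake tree instead of a live array:

```shell
#!/bin/sh
# Sketch: trigger a repair only if the check actually found mismatches.
# Assumption: the /boot array is md0, as in the replacement steps above.
SYS=${SYS:-/sys}
MD=${MD:-md0}

repair_if_mismatched() {
    count=$(cat "$SYS/block/$MD/md/mismatch_cnt")
    if [ "$count" -gt 0 ]; then
        # Rewrite mismatched blocks so both mirror halves agree again
        echo repair > "$SYS/block/$MD/md/sync_action"
        echo "repair started on $MD ($count mismatches)"
    else
        echo "$MD is clean"
    fi
}
```

Run as root after the grub reinstall (or from the weekly cron job) with `repair_if_mismatched`.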
