Startled by “component device mismatches” on RAID1 volumes

I was startled today by a message in syslog that seems to point to a problem with my RAID1 volumes:

Mar  1 01:13:54 server mdadm[961]: RebuildFinished event detected on md device /dev/md3, component device mismatches found: 512

This value is reflected in the following counter:
root:/etc/mdadm# cat /sys/block/md3/md/mismatch_cnt
512
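
If you have several arrays, a small loop over sysfs shows the counter for all of them at once. This is only a minimal sketch: the function name list_mismatches and its optional root argument are my own invention (the argument exists so the function can be pointed at a test directory), and it assumes the usual /sys/block/mdN/md/mismatch_cnt layout shown above.

```shell
# Print "<array> <mismatch_cnt>" for every md array under a sysfs block root.
# The optional first argument is a hypothetical override used for testing;
# on a real system the default /sys/block applies.
list_mismatches() {
    root="${1:-/sys/block}"
    for f in "$root"/md*/md/mismatch_cnt; do
        # Skip the unexpanded glob pattern when no md arrays exist.
        [ -r "$f" ] || continue
        dev=${f#"$root"/}   # strip the root prefix, leaving e.g. md3/md/mismatch_cnt
        dev=${dev%%/*}      # keep only the array name, e.g. md3
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
}
```

Calling list_mismatches without arguments reads the real /sys/block and prints one line per array.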

I searched around to clarify this, but found no definitive answer as to whether this is an actual problem. However, I did find a way to resync the MD components so that no mismatches remain:

root:/etc/mdadm# echo repair >> /sys/block/md3/md/sync_action

After the repair has finished, you will notice that the counter shows the same number of mismatches again:

root:/etc/mdadm# cat /sys/block/md3/md/mismatch_cnt
512

This is to be expected, because the repair pass counts every mismatch it encounters while correcting it. So, if you force a check afterwards, the counter should be down to 0:

root:/etc/mdadm# echo check >> /sys/block/md3/md/sync_action
root:/etc/mdadm# watch cat /proc/mdstat
[wait until check is finished]
root:/etc/mdadm# cat /sys/block/md3/md/mismatch_cnt
0
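
Instead of watching /proc/mdstat by hand, the waiting step can be scripted by polling sync_action until it reads "idle" again. The following is a rough sketch under that assumption, not a tested admin tool: the function names wait_idle and check_and_report and the 5-second default interval are hypothetical, and writing to sync_action requires root.

```shell
# Block until the array's sync_action reads "idle" again.
# $1: the array's md directory, e.g. /sys/block/md3/md
# $2: polling interval in seconds (hypothetical default: 5)
wait_idle() {
    md_dir="$1"
    interval="${2:-5}"
    # Bail out instead of looping forever if the path is wrong.
    [ -r "$md_dir/sync_action" ] || return 1
    while [ "$(cat "$md_dir/sync_action")" != "idle" ]; do
        sleep "$interval"
    done
}

# Start a check, wait for it to finish, then print the mismatch counter.
check_and_report() {
    md_dir="$1"
    echo check > "$md_dir/sync_action" || return 1
    wait_idle "$md_dir" "${2:-5}" || return 1
    cat "$md_dir/mismatch_cnt"
}
```

For the array used here, that would be: check_and_report /sys/block/md3/md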

2 Responses to “Startled by “component device mismatches” on RAID1 volumes”

  1. Administrator Says:

    Update: I found the following statement from Neil Brown, which seems to be a good explanation of why these differences may happen:

    Suppose I memory-map a file and often modify the mapped memory. The system will at some point decide to write that block of the file to the device. It will send a request to raid1, which will send one request each to two different devices. They will each DMA the data out of that memory to the controller at different times so they could quite possibly get different data (if I changed the mapped memory between those two DMA request). So the data on the two drives in a mirror can easily be different. If a ‘check’ happens at exactly this time it will notice.

    Normally that block will be written out again (as it is still ‘dirty’) and again and again if necessary as long as I keep writing to the memory. Once I stop writing to the memory (e.g. close the file, unmount the filesystem) a final write will be made with the same data going to both devices. During this time we will never read that block from the filesystem, so the filesystem will never be able to see any difference between the two devices in a raid1.

    So: if you are actively writing to a file while ‘check’ is running on a raid1, it could show up as a difference in mismatch_cnt. But you have to get the timing just right (or wrong).

  2. Christopher Says:

    Neil Brown also says that this problem is mostly found with swap on RAID1. The whole discussion is at: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=518834
