flaky SIMM likely the direct cause of a file corruption case

  • General discussion

  • I observed a file getting corrupted after running a bulk date-change operation with Photoshop Elements 6 on a large batch of files on a WHS share under PP1 RC4.


    I opened a Connect case and got prompt attention. Long story short: while trying to build a repro case, I ran the server (and my client, for that matter) at really high stress levels. (A dozen or so Explorer file copy threads were simultaneously copying files from the workstation and from several different places in the same WHS share to a test folder in the share.) This yielded--all by itself, apparently--at least 27 corrupted files out of 11,334 (25.3 GB).
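For anyone wanting to check their own copies the same way, here is a minimal sketch of how corrupted copies could be spotted by comparing hashes of each source file and its copy. This is not the tool used in the case above; the function names `sha256_of` and `find_mismatched_copies` are made up for illustration, and it assumes the copy tree mirrors the source tree's relative paths.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatched_copies(source_dir, copy_dir):
    """Return relative paths whose copy is missing or differs from the source.

    Assumption: copy_dir has the same relative layout as source_dir.
    """
    source_dir, copy_dir = Path(source_dir), Path(copy_dir)
    mismatched = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = copy_dir / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            mismatched.append(rel)
    return mismatched
```

Usage would be something like `find_mismatched_copies(r"D:\originals", r"\\server\share\test")` (both paths hypothetical). One caveat: if the checking machine itself has bad RAM, or the copy is read back from cache rather than disk, the result may not be trustworthy, so it is worth running the comparison more than once or from a second machine.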


    Microsoft took a good/bad file pair and looked at the binary differences that defined "corruption"--where, which bits, how many bits. A very definite pattern appeared in this data. They suggested revisiting the machine with memtest86. (You can google it...)
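I don't know what tooling Microsoft actually used for this, but the idea of looking at where, which bits, and how many bits differ can be sketched in a few lines. The names `bit_differences` and `summarize` are hypothetical, and the sketch assumes both files are small enough to read into memory and are the same length (extra trailing bytes in the longer file are ignored).

```python
from collections import Counter


def bit_differences(good_path, bad_path):
    """Return a list of (byte_offset, [flipped_bit_positions]) for differing bytes.

    Bit position 0 is the least significant bit of the byte.
    """
    with open(good_path, "rb") as f:
        good = f.read()
    with open(bad_path, "rb") as f:
        bad = f.read()

    diffs = []
    # zip stops at the shorter file, so a length mismatch is not reported here.
    for offset, (g, b) in enumerate(zip(good, bad)):
        changed = g ^ b  # XOR leaves a 1 in every bit that differs
        if changed:
            bits = [i for i in range(8) if (changed >> i) & 1]
            diffs.append((offset, bits))
    return diffs


def summarize(diffs):
    """Collapse the raw differences into offsets, a bit total, and a per-bit count."""
    per_bit = Counter(bit for _, bits in diffs for bit in bits)
    return {
        "offsets": [offset for offset, _ in diffs],
        "total_flipped_bits": sum(per_bit.values()),
        "flips_per_bit_position": dict(per_bit),
    }
```

If the same bit position keeps turning up, or the offsets are spaced at a regular stride, that kind of regularity would be consistent with a hardware fault rather than a software bug, though it is a hint and not proof.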


    Sure enough, memtest86 revealed that one of the SIMMs repeatably failed several of the tests. So that is **probably** the direct cause of this case: under the high memory stress, the server ended up using problematic memory that normally isn't even in use.

    The machine was tested when I built it and had appeared to run more or less fine for months (maybe owing to generally low memory use and low-stress conditions?). This gives a pretty good demo of the weakness of BIOS memory tests and of the value of ECC memory (I would always use it if more mobos and chipsets supported it), and it is a reminder of how randomly RAM problems can present themselves.


    Microsoft really turned to on this one and I thank them. Now if only they'd fix "Invalid File Handle" associated with shadow copies. Oh, and can we back the whole thing up someday? Even the system?

    Friday, June 27, 2008 6:27 PM