How do I troubleshoot a computer that is failing to complete its backups?

  • Question

  • I have a 64-bit Win7 machine that has had serious problems with PP3.
    In the past, this machine (which formerly ran 64-bit Vista with PP2) would occasionally (once every few weeks) reboot and/or freeze during its nightly backup cycle, but I could never find the cause of the problem (more on that below), so I just ignored it.
    Sadly, since PP3 came out the problem has gotten worse: on roughly 3 out of 5 nights now, my machine freezes and/or reboots during its backup cycle.

    All my machines (including the WHS itself) are running the latest bits - they're all set to keep up-to-date automatically and upon manual inspection, they all are. I'm not running any funky 3rd-party WHS plugins and my setup is pretty basic.

    All of my OTHER machines (I have about 8 in my house) back up just fine. It's just this one machine (which happens to be my main machine). It's a Gateway FX6000 series machine with about 2TB of drive space on it (of which more than half is free).

    Every night, after 2am, all my machines are scheduled to wake up and back up per the normal WHS policies. The backup always starts, and then usually fails. The Event Log shows nothing abnormal - things are starting, then the next thing you know, the machine has either locked up or rebooted. There is no blue screen. There is no memory dump. However, the backup doesn't succeed.

    Yesterday, I manually ran the backup process and when it got over 90% (somewhere around 92% I think), it froze again, so I guess it's not the automatic backup process, just the backup process in general.

    Because there is no blue screen or memory dump, Windows isn't helping me out much. There are no errors to report. There are no dump files to debug. There are no hints from the machine itself what is failing, or why. But, it's very frustrating to come down in the morning, and more often than not, my main machine has locked up. The Server itself simply indicates that one (or more) of the drives backed up, but one (or more) failed to, so I have a trail of broken backups on the WHS that I regularly go and clean out.

    There were no other major changes to my machine between the old and new behavior (I didn't add any memory, change any hardware, etc.). I have uninstalled and reinstalled the Connector software. I run Microsoft Security Essentials, and it reports my system is clean.

    I'm looking for any guidance about local logfiles I can look for, debugging flags I can enable, etc., that might give me some idea what is failing on my machine, so I can do something about it. Any guidance appreciated. Thanks!
    Monday, December 28, 2009 11:31 PM

All replies

  • Turguin:

    I would start by running chkdsk /r on each of the partitions on the problematic machine in order to rule out disk errors. Then I'd run Memtest86+ overnight in order to rule out memory errors. If you don't have Memtest then schedule and run the built-in Windows 7 memory test routine (Windows Memory Diagnostic). From your description I would be highly suspicious of a memory problem.
    Tuesday, December 29, 2009 1:36 AM
  • I've run chkdsk /f /r on all my hard drives to make sure they're clean, and the OS memory tests report no problems. (Running chkdsk was one of the first things I did, to make sure it wasn't a corruption problem with the hard drive.)

    I've looked around for a temporary directory where WHS backup logs might be stored locally, but haven't had any luck in finding them.
    Tuesday, December 29, 2009 1:38 AM
  • Logs are in C:\ProgramData\Microsoft\Windows Home Server\logs
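
    To see quickly what the client logged last, something like the following could list the newest files in that folder and print their final lines. This is only a sketch: the `*.log` file pattern and the text encoding are assumptions, the helper names `newest_logs` and `tail` are made up for illustration, and the folder is looked up under the standard ProgramData directory.

```python
import os
from pathlib import Path

# Assumption: the log folder lives under the machine's ProgramData directory.
LOG_DIR = (
    Path(os.environ.get("ProgramData", r"C:\ProgramData"))
    / "Microsoft" / "Windows Home Server" / "logs"
)


def newest_logs(log_dir, count=5, pattern="*.log"):
    """Return up to `count` files matching `pattern`, newest first.

    The "*.log" pattern is a guess; widen it to "*" if nothing matches.
    """
    log_dir = Path(log_dir)
    if not log_dir.is_dir():
        return []
    files = [p for p in log_dir.glob(pattern) if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:count]


def tail(path, lines=10):
    """Return the last `lines` lines of a text file, replacing undecodable bytes."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return text.splitlines()[-lines:]


if __name__ == "__main__":
    for log in newest_logs(LOG_DIR):
        print(f"== {log.name}")
        for line in tail(log):
            print("   ", line)
```

    Sorting by modification time means the file that was being written when the machine froze should come out first.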

    I'd still really recommend running Memtest86+ for an extended period of time, at least overnight. Some of the most frustrating problems that I've had with PCs were due to bad RAM and were only found after running many passes of Memtest. I had one with my server (before WHS when it was running Linux) that caused a hard lockup about once every 3 months. This one took a week of running Memtest before it showed its ugly little head.
    • Edited by Mark Wharton Tuesday, December 29, 2009 1:52 AM Add Memtest comment
    Tuesday, December 29, 2009 1:43 AM
  • Thank you for the pointer. I'm not sure that the logs tell us much. Here's the last thing the client was doing before it crashed:

    [1]091228.081222.3307: Status: Volume 4 of 4 is F: size 685628194816 used 55059623936
    [1]091228.081222.3537: Status: Start volume F: using \\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy13
    [1]091228.081223.8388: Status: Volume (shadow) size 685628190720 used 58164977664
    [1]091228.081224.1168: Status: File records: 1901 sent, 416 changed, 17 metadata, 1793 received (bytes sent=32768 received=34031)
    [1]091228.081224.1178: Status: Prepare fixups: total restore size 349105553408
    [1]091228.081225.3479: Status: Determine changed: scan 1612286 size 6603923456
    [1]091228.081342.4023: Status: Determine changed: scanned 1612286 of 4050991 total with 4 fixups (bytes sent=33827841 received=0)
    [1]091228.081740.0879: Status: Server phase Reorganize1 complete
    [1]091228.081740.0879: Status: Send changed: 668242 requested of 1612286 total

    If I look at the logs from the previous crashing nights, they all end at the same point: the 'Send changed: ... requested' line that immediately follows 'Server phase Reorganize1 complete'.

    By comparison, here's the tail of a log from a completed, healthy backup on the 21st, started from the same point (Volume 4 of 4):
    [1]091221.014842.3871: Status: Volume 4 of 4 is F: size 685628194816 used 58310983680
    [1]091221.014842.4141: Status: Start volume F: using \\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy15
    [1]091221.014843.8872: Status: Volume (shadow) size 685628190720 used 61412753408
    [1]091221.014844.1382: Status: File records: 2013 sent, 188 changed, 17 metadata, 2072 received (bytes sent=32768 received=39332)
    [1]091221.014844.1422: Status: Prepare fixups: total restore size 345959825408
    [1]091221.014845.3853: Status: Determine changed: scan 216295 size 885944320
    [1]091221.014900.9352: Status: Determine changed: scanned 216295 of 5332875 total with 4 fixups (bytes sent=4519185 received=0)
    [1]091221.015341.9703: Status: Server phase Reorganize1 complete
    [1]091221.015341.9703: Status: Send changed: 129270 requested of 216295 total
    [1]091221.015406.3696: Status: Send changed: 129270 sent with 2 fixups (bytes sent=530392405 received=595)
    [1]091221.015406.3696: Status: AutoExclusion RecycleBin 36423143424
    [1]091221.015406.3696: Status: AutoExclusion ShadowVolumes 3145728000
    [1]091221.015406.3696: Status: Total size of excluded files 39568871424
    [1]091221.015406.3696: Status: Directory exclusion RecycleBin size 36423143424 for \$RECYCLE.BIN
    [1]091221.015406.3696: Status: File exclusion ShadowVolumes size 3145728000 for \System Volume Information\15{3808876b-c176-4e48-b7ae-04046e6cc752}
    [1]091221.015406.7647: Status: Server phase Reorganize2 complete
    [1]091221.015407.0127: Status: Completed 4 volumes
    [1]091221.015407.0987: Status: Bytes sent=1851212370, bytes received=5235071, 0 reconnects

    I have no idea if this means anything, but it's all I can tell for now. Apparently the process dies on the last hard drive. For now, I may exclude that final hard drive (which is really just junk storage), to avoid having to hard boot my computer every morning.
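
    The comparison above can also be done mechanically. The sketch below parses lines in the format shown, blanks out the numbers so the same step matches across runs, and reports which step a healthy log contains right after the crashed log's last line. The reading of the timestamp as YYMMDD.HHMMSS is an assumption, and the function names are hypothetical, not part of WHS.

```python
import re
from datetime import datetime

# Matches lines such as "[1]091228.081740.0879: Status: Server phase Reorganize1 complete"
LINE_RE = re.compile(r"^\[(\d+)\](\d{6})\.(\d{6})\.(\d+): (\w+): (.*)$")


def parse_line(line):
    """Split one log line into (timestamp, level, message); None if it does not match."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    _thread, date, clock, _fraction, level, message = m.groups()
    # Assumption: the stamp is YYMMDD.HHMMSS; the trailing fraction is ignored.
    stamp = datetime.strptime(date + clock, "%y%m%d%H%M%S")
    return stamp, level, message


def step_key(message):
    """Replace standalone numbers with N so a step compares equal across runs."""
    return re.sub(r"\b\d+\b", "N", message)


def steps(lines):
    """Return the normalized step for every line that parses."""
    parsed = (parse_line(line) for line in lines)
    return [step_key(p[2]) for p in parsed if p]


def first_missing_step(crashed_lines, healthy_lines):
    """Return the step that follows, in the healthy log, the crashed log's last step."""
    crashed, healthy = steps(crashed_lines), steps(healthy_lines)
    if not crashed:
        return None
    last = crashed[-1]
    # Search from the end so the final volume's occurrence is the one compared.
    for i in range(len(healthy) - 2, -1, -1):
        if healthy[i] == last:
            return healthy[i + 1]
    return None


# Excerpts copied from the two logs quoted in this post.
CRASHED = [
    "[1]091228.081740.0879: Status: Server phase Reorganize1 complete",
    "[1]091228.081740.0879: Status: Send changed: 668242 requested of 1612286 total",
]
HEALTHY = [
    "[1]091221.015341.9703: Status: Server phase Reorganize1 complete",
    "[1]091221.015341.9703: Status: Send changed: 129270 requested of 216295 total",
    "[1]091221.015406.3696: Status: Send changed: 129270 sent with 2 fixups (bytes sent=530392405 received=595)",
]

if __name__ == "__main__":
    print(first_missing_step(CRASHED, HEALTHY))
```

    On these excerpts it points at the 'Send changed: ... sent' line, i.e. the step the crashed run never got to log.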
    Tuesday, December 29, 2009 1:52 AM
  • Excluding the drive is a good idea. If this proves that the last hard drive is the problem, try replacing the drive's cable.
    Tuesday, December 29, 2009 2:01 AM
  • Mark, do you know of any way I can get this info reported to Microsoft? Usually ACR would take care of the problem, but because it's a hard boot with no log generated, I don't believe that this problem is making its way back to the WHS team. I would hate for this to be an invisible problem, because it is (IMHO) very serious and possibly more common than anyone knows.
    Tuesday, December 29, 2009 2:03 AM
  • The usual method is to file a bug report on Microsoft Connect (http://connect.microsoft.com).
    Tuesday, December 29, 2009 2:08 AM
  • Thank you, I've done so!
    Tuesday, December 29, 2009 2:22 AM
  • Here's the connect issue: https://connect.microsoft.com/WindowsHomeServer/feedback/ViewFeedback.aspx?FeedbackID=522373

    If I remove F: from my list of drives to back up, the software simply locks up on E:.
    If I exclude E:, it fails on D:.

    So apparently it's got nothing to do with the drive itself, but rather, something about finishing up the backup in general that causes the failure. If anyone else has seen this, feel free to vote on the connect bug I've linked to above.
    Saturday, January 2, 2010 1:50 AM