Drive failure troubleshooting steps?

  • Question

  • After some 9 months of enjoying WHS running 24/7, I saw my first HD failure message last night.

    The drive involved is a new WD green 1TB drive, attached via a single enclosure eSATA box.  It was attached to a plain vanilla, home-built WHS system.  (Only add-in is Toolkit v1.0, left over from the PP1 beta days.)  Unfortunately, the new HD was installed less than one day following the PP2 upgrade, complicating problem analysis. 

    The end result of the error(s) is that I lost all the backups for my main client PC.  WHS said it couldn't fix/repair/recover this data.  (If I had non-duped shares on that drive, then those would probably be gone as well.)

    The drive had been installed for ~4 hours before WHS reported the failure, coming right after midnight, when WHS apparently initiates some of its maintenance processes.  Since no files had been added to the WHS following the HD install, I presume it was still an empty drive when the failure was reported - other than a couple of WHS DE control files.

    I just now looked at the Toolkit-provided WHS System Event log.  It's a sea of red around the time that the new drive was powered up.  The event codes are 5, 11, 9, and 12.  According to the MS KB, possible causes are controller, cable, and power.  What worries me is that there was no red WHS tray icon warning during that time, although the icon remained gray for a while after powering up - possibly this is when the warning should have appeared?  For the next three hours, everything appeared normal in the log and in the console.

    The errors which immediately preceded the WHS HD failure message were: 5 (parity error), 9 (controller error), and finally 12 (PnP), stating that the drive "disappeared..."

    Since this drive was in an external eSATA enclosure, I guess this could also suggest a faulty enclosure (cable, circuitry, power). 

    My question here is:  what's the best method to bench test this HD & eSATA unit without attaching it to the production WHS and risking another loss of data?
     - Can I simply add it to the server as a non-pool drive and run chkdsk-type routines?
     - If so, can I hot swap it so I don't have to power down the server?
     - Can I run some "disk-bang" type routines while having it attached via USB to one of my clients?

    (It is sitting here spinning right now attached via USB to my client PC.)

    TIA,
    cliff

    Thursday, March 26, 2009 6:36 PM

Answers

  • cliff r said:

    > My question here is:  what's the best method to bench test this HD & eSATA unit without attaching to the production WHS and risking another loss of data?

    Pull it from the server.

    > Can I simply add it to the server as a non-pool drive and run chkdsk-type routines?

    Can you?  Yes.  However, you should connect it to a client PC instead and try it there.

    > If so, can I hot swap it so I don't have to power down the server?

    If it's supported by your hardware, yes.  However, I wouldn't do that even if it were supported.

    > Can I run some "disk-bang" type routines while having it attached via USB to one of my clients?
    > (It is sitting here spinning right now attached via USB to my client PC.)

    You can pull it from the server and run chkdsk /r on it (or whatever tools come with it from WD).

    To be honest, if it were me, it would already be on its way back to the store. :)
    • Marked as answer by cliff r Friday, March 27, 2009 12:58 AM
    Thursday, March 26, 2009 11:44 PM
    Moderator
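
For anyone who wants to script the chkdsk /r suggestion above on the client PC, here is a minimal sketch.  It assumes Python is installed on the Windows client and that the suspect drive shows up under a drive letter (E is used below purely as a placeholder).  The helper names are made up for illustration; only chkdsk and its /r switch come from the thread.

```python
from __future__ import annotations

import re
import subprocess


def chkdsk_command(drive_letter: str, scan_surface: bool = True) -> list[str]:
    """Build the argument list for running chkdsk on one volume.

    Without switches chkdsk only reports problems.  /r also locates bad
    sectors and recovers readable data, which is the slow part on a 1TB disk.
    """
    if not re.fullmatch(r"[A-Za-z]", drive_letter):
        raise ValueError(f"expected a single drive letter, got {drive_letter!r}")
    args = ["chkdsk", f"{drive_letter.upper()}:"]
    if scan_surface:
        args.append("/r")
    return args


def run_chkdsk(drive_letter: str, scan_surface: bool = True) -> int:
    """Run chkdsk (Windows only, from an elevated prompt); return its exit code."""
    completed = subprocess.run(chkdsk_command(drive_letter, scan_surface))
    return completed.returncode
```

Calling run_chkdsk("E") on the client would start the scan; chkdsk_command("E", scan_surface=False) gives the quicker report-only form if you just want a first look.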

All replies

  •  > To be honest, if it were me, it would already be on its way back to the store. :)

    Thanks for the info.   I agree it should probably go back.  I'm not entirely sure, however, that it was just one problem.  The drive was in a fanless enclosure, and the heat or the enclosure electronics might be suspect as well.

    I'm running a chkdsk now with the drive still enclosed, and attached to the client via USB.  Taking a loooongggg time!

    My overriding concern is that the WHS tray icon was green and the console storage was reported as healthy, right up to midnight when the server maintenance processes started.  There was a lot of red showing in the event log 3 hrs before that time, well before the system displayed the error status out in the open.

    Next time I install a new drive, the Toolkit-reported event log is going to remain open for constant observation!  I'm not familiar enough with WHS to know of a more practical method to monitor drive health.
     
    cliff

    Friday, March 27, 2009 12:57 AM
  • cliff r said:

    > My overriding concern is that the WHS tray icon was green and the console storage was reported as healthy, right up to midnight when the server maintenance processes started.  There was a lot of red showing in the event log 3 hrs before that time and before the system displayed the error status out in the open.

    That's normal.  There are no alerts in the WHS Console until chkdsk (which runs every night at midnight) fails 4 consecutive times.
    Saturday, March 28, 2009 12:25 AM
    Moderator
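
Since the console stays quiet while the event log is already filling up, one alternative to keeping the log open by hand is to scan an export of it.  The following is only a rough sketch: it assumes the System log has been saved as CSV with a header row, and the column names TimeGenerated, EventID and Source are hypothetical, so adjust them to whatever your export actually contains.  The event IDs are simply the ones reported earlier in this thread.

```python
import csv
import io

# Event IDs mentioned in this thread (5, 9, 11, 12).  This is not an
# authoritative list of disk-related events, just the ones seen here.
SUSPECT_EVENT_IDS = {5, 9, 11, 12}


def find_disk_errors(csv_text, suspect_ids=SUSPECT_EVENT_IDS):
    """Return (time, event_id, source) tuples for rows with a suspect EventID.

    Assumes a header row with the hypothetical column names
    TimeGenerated, EventID and Source.
    """
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            event_id = int(row["EventID"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows that lack a numeric event ID
        if event_id in suspect_ids:
            hits.append((row.get("TimeGenerated", ""), event_id, row.get("Source", "")))
    return hits
```

Reading the exported file and passing its text to find_disk_errors shortly after attaching a new drive should surface this kind of error hours before the nightly checks get a chance to raise an alert.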
  • This appears not to be a drive problem, but an incompatibility between the drive, the eSATA port (the BIOS), and the OS.

    The drive works OK in the external enclosure when connected via a USB port, and when connected as an internal SATA device.  When connected as an eSATA device, it fails miserably - event log a sea of red.

    Dusting off the MB manual shows that the BIOS lets the eSATA port run in either IDE or SATA mode, and if SATA is desired, the external port has to be set to it explicitly.  By default, it is IDE.  There is also an AHCI option, but I'm not sure how it might affect the existing internal drives, so I will leave it alone for now.  (AHCI would be necessary to enable eSATA hot plug and native command queuing.)

    Thanks for your help!

      
    Saturday, March 28, 2009 12:02 PM
  • While the incompatibility is a possibility, it's also possible that your eSATA enclosure is having issues (or causing drive problems since there is no fan).  Have you tried connecting that enclosure to another PC to see whether the same problems occur?
    Saturday, March 28, 2009 12:42 PM
    Moderator
  •  > While the incompatibility is a possibility, it's also possible that your eSATA enclosure is having issues (or causing drive problems since there is no fan).  Have you tried connecting that enclosure to another PC to see whether the same problems occur?

    I don't have another PC with an eSATA port, but I did use this drive and the same enclosure with a USB cable, both on the server and on a client PC. It worked well and threw no errors.  I guess this means heat is not the problem, although I don't know whether this type of "green" drive spins faster (and runs hotter) when connected via eSATA.

    When the drive was connected as an eSATA device, the errors began almost immediately after the server booted.

    My next step:  I'm going to install the drive as an internal SATA device in another client PC and let it cook awhile; see how it performs.  If OK, then I think I'll change the BIOS for the eSATA port on the server.  Hopefully that setting won't affect the use of the internal connectors! 

    Thanks again for your help!
    Saturday, March 28, 2009 1:43 PM