Hard disc is suspect

Question
Hi. Before you answer, please bear in mind that I have written several utilities for WHS, am a programmer, know more about fsutil than I should, and have been using WHS since the first beta. I am looking for advice from either Microsoft or practical experience only. Sorry if I sound like a knob, but I do actually know what I'm doing.
I have a WHS with 8 discs, one of which is suspect. eventvwr shows that the drive mounted as N regularly can't write to the MFT. I humbly think I know all about that problem, and it indicates that the disc is in trouble. In fact, when I run SpinRite on that disc, SpinRite itself fails at level 2 when it hits the suspect blocks! A run at level 4 actually succeeds, surprisingly enough. GRC has made some suggestions, but they have not progressed the issue.
Another symptom to note is that the server will report a 'network at risk' issue, showing file conflicts. (I find this hilarious in that this is merely a risk warning, whereas a client without AV is a critical issue!!!) After a reboot the warning will go away for around 20 hours or so.
Doing a chkdsk on the mount produces unpredictable results. In other words, if chkdsk and spinrite can't sort it, then this disc is fully hosed.
Obviously I want to remove the disc. However, the standard disc removal process makes the server reboot part way through; it then recovers and leaves the disc in the storage pool (that's actually quite well done, Microsoft - it doesn't leave the whole system hosed).
As part of my strategy I have turned on folder duplication for all shares whose data I am concerned about losing. The duplication has apparently succeeded, given the results of perfmon on the physical discs. There are two shares where I haven't done that and am happy to lose the data.
I haven't been able to find an orderly way to remove this disc without having error messages forever. I think that there should be a way to remove this disc with or without the 'consent' of the WHS interface.
I'd be grateful for any advice on where to proceed from here.
Andrew
Thursday, March 11, 2010 8:28 AM
Answers
A practical solution (and one that's been posted fairly frequently here):
You shut down your server, physically remove the disk, then restart and remove the (now "missing") disk using the console. You will be warned about the possibility of losing files from your shares (if any unduplicated files are on that disk) and backups (if any components of the backup database are on that disk).
I'm not on the WHS team, I just post a lot. :)
- Proposed as answer by kariya21 (Moderator), Friday, March 12, 2010 1:18 AM
- Marked as answer by andrewcalvin, Saturday, March 13, 2010 5:24 AM
Thursday, March 11, 2010 12:02 PM (Moderator)
All replies
Thanks for your time Ken. I was afraid this was the only answer.
It seems that writing a routine to 'cleanse' a disc (i.e. remove as much data as can be done within a reasonable error level) would be a relatively straightforward activity. It would go something like this:
- request a disc removal
- start moving files
- removal process encounters bad blocks
- software marks the disc as 'bad'
- software no longer writes new files to the disc
- software attempts to migrate all files off it
- once the migration reaches a certain error threshold it returns a message saying 'I've done all I can'
- disc is marked ready for removal
That would provide a much better result in situations where the bad blocks are relatively few and affect perhaps just one or a few files. In the current situation ALL files on the disc that aren't duplicated will be hosed, even if only one file is actually affected.
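As a rough illustration of the 'migrate what you can, stop at an error threshold' idea, here is a minimal Python sketch. It is not how WHS or Drive Extender works internally; salvage_disc, SalvageReport and max_errors are hypothetical names made up for this example, and it simply copies files between two ordinary folders.

```python
import os
import shutil
from dataclasses import dataclass, field


@dataclass
class SalvageReport:
    copied: list = field(default_factory=list)   # files migrated successfully
    failed: list = field(default_factory=list)   # files that hit a read/write error
    gave_up: bool = False                        # True if the error threshold was reached


def salvage_disc(src_root, dst_root, max_errors=50, copy=shutil.copy2):
    """Copy every readable file under src_root to dst_root.

    Hypothetical helper: a file that raises OSError (e.g. a bad block)
    is recorded and skipped instead of aborting the whole run. After
    max_errors failures the routine stops: 'I've done all I can'.
    """
    report = SalvageReport()
    for folder, _dirs, files in os.walk(src_root):
        for name in sorted(files):
            src = os.path.join(folder, name)
            dst = os.path.join(dst_root, os.path.relpath(src, src_root))
            try:
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                copy(src, dst)
                report.copied.append(src)
            except OSError:
                report.failed.append(src)
                if len(report.failed) >= max_errors:
                    report.gave_up = True
                    return report
    return report
```

The point of the sketch is only that one unreadable file costs one entry in report.failed rather than everything on the disc. The copy parameter is there so a failing disc can be simulated when trying it out.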
Thanks and keep up the great work.
Saturday, March 13, 2010 5:20 AM
The issue is more difficult to deal with than it seems at first, I'm afraid. If the disk has failed entirely, you can expect to write off anything on it. If it hasn't, your workaround is to do as I suggested, then just manually recover what you can.
It's also possible that the team made a conscious decision to do exactly what they do, on the theory that a failing disk should be dropped completely, protecting the disk as much as possible so the end user can implement whatever data recovery efforts seem appropriate. But I'd like to see something a little more robust in this area myself, thus......
Why look, a product suggestion on Connect!
It seems that writing a routine to 'cleanse' a disc (i.e. remove as much data as can be done within a reasonable error level) would be a relatively straightforward activity.
...
I'm not on the WHS team, I just post a lot. :)
Saturday, March 13, 2010 5:37 AM (Moderator)
Sorry - I should have added one more point...
After the forced disc removal process the user is left with the server in a critical or at risk state (the actual state seems to be unpredictable), with no option to ignore the events. Again, this is an area where the software team could greatly improve the user interaction.
After around 10 minutes the server status sometimes automatically returns to "healthy". Again this is unpredictable, and a less than ideal user interaction. [edit - I've done this before, dammit...]
Saturday, March 13, 2010 5:41 AM
That's a good point Ken, although I should say that amongst the large variety of disc failure scenarios, there are several where the failure doesn't affect the whole disc. I have, in 30 years (ok, I'm 46), never met an entire disc that has failed - I've only ever met bad sectors and controller failures.
That's where I think the 'more robust removal scenario' you mention could kick in. In my case I think there are a few thousand suspect blocks, which, even on a 500 GB drive, are relatively few, and probably only affect 1-3 videos. Unfortunately the disc removal process wiped out several hundred.
As I said, that doesn't bother me in this case, but there are plenty of other use cases where this would be a big deal unnecessarily. Why do things "OK" when you could do things "Great"? The Windows APIs exist to do this kind of testing and recovery, and a PoC could be completed within a week.
Warning: Rant follows: I regularly preach to my project teams about how a few hundred hours of programming can save thousands or millions of hours of angst. An acceptable process can be turned into a good process for less than $100k. Why not do it? I deeply believe in programmers working to protect users from their own activities. There are a few software evangelists out there who think along these lines. This results in software that doesn't punish users. Let's say you have a date input field. You could bump a user who doesn't enter what you think is a legal date. But maybe that user is not American, so uses dd/mm/yyyy, and 18/4/63 makes sense to him? Maybe that user enters 'tomorrow'? We see software such as Remember the Milk starting to do these things. Rant over.
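To make the date field example concrete, here is a small Python sketch of a forgiving parser. The function name parse_lenient_date and the day_first switch are invented for illustration, and the handful of formats it accepts is an assumption, nowhere near a complete solution.

```python
from datetime import date, datetime, timedelta


def parse_lenient_date(text, today=None, day_first=True):
    """Try hard to make sense of what the user typed; return None if we can't.

    Hypothetical helper. Accepts 'today', 'tomorrow', 'yesterday', and
    slash-separated dates in either dd/mm or mm/dd order.
    """
    today = today or date.today()
    word = text.strip().lower()

    relative = {"today": 0, "tomorrow": 1, "yesterday": -1}
    if word in relative:
        return today + timedelta(days=relative[word])

    day_month = ["%d/%m/%Y", "%d/%m/%y"]
    month_day = ["%m/%d/%Y", "%m/%d/%y"]
    # Try the user's preferred order first, then fall back to the other
    # one rather than rejecting the input outright.
    formats = day_month + month_day if day_first else month_day + day_month
    for fmt in formats:
        try:
            # Note: strptime decides the century for two-digit years
            # (%y); a real form would need its own rule for that.
            return datetime.strptime(word, fmt).date()
        except ValueError:
            continue
    return None
```

Even with day_first=False, an input like 18/4/63 still parses, because month 18 is impossible and the parser falls back to dd/mm instead of bumping the user.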
WHS OEM patched - 1.6 TB
Saturday, March 13, 2010 6:09 AM