New Add-In Planned: WHS DupFileManager

  • General discussion

  • Hi all...

     

    With the beta release of PP1, I have finally decided to go ahead with an add-in idea that I had over a year ago.

     

    I am working on an add-in that will be an active duplicate file manager: it will look for file dups that live in any shares (or a user-defined subset of shares/directories) on the WHS and will move the duplicates to a "Duplicate Share".

     

    I have the .Net code running to traverse a set of directories/shares and generate a database of duplicates based on cryptographic hashes of each file.
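    For the curious, the engine boils down to walking the selected directories, hashing each file, and bucketing files by hash. The actual add-in is .NET, but the idea can be sketched in a few lines of Python (the function names here are mine, for illustration only):

```python
import hashlib
import os
from collections import defaultdict

def file_hashes(path, chunk_size=1 << 20):
    """Compute SHA-1 and MD5 of a file in one pass over its contents."""
    sha1, md5 = hashlib.sha1(), hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            sha1.update(chunk)
            md5.update(chunk)
    return sha1.hexdigest(), md5.hexdigest()

def find_duplicates(roots, require_md5=False):
    """Group files under the given roots by hash; keep groups with >1 member."""
    groups = defaultdict(list)
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                full = os.path.join(dirpath, name)
                sha1, md5 = file_hashes(full)
                # Paranoid mode: require BOTH hashes to match, not just SHA-1
                key = (sha1, md5) if require_md5 else sha1
                groups[key].append(full)
    return {k: v for k, v in groups.items() if len(v) > 1}
```

    The real add-in stores the hashes in a database rather than in memory, but the grouping logic is the same.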

     

    I hope to have a basic version running and available for people to beta test in the next week to 10 days, but I'd like some feedback on features.

     

    Here is what I plan:

     

    • The user will be provided with a multi-select tree view of the shares defined on the server and will be allowed to select ALL shares/subdirectories or any subset that they desire. Only the selected shares will be examined for duplicates.
    • The DupFileManager will calculate SHA1 and MD5 hashes of each file found in the selected set of directories/subdirectories and will create a list of duplicates.
    • The default list of duplicates will be based solely on SHA1 hashes of the files, but if anyone is paranoid about false positives, they can also require that the MD5 hashes match as well in order for the file to be considered a true duplicate.
    • The setup of the program will allow the user to define a DuplicateFile share. This share is where all duplicates found will be moved. The program will NEVER delete duplicates outright. After a lot of consideration about how to handle the duplicates, I have decided that I don't want to release a destructive tool. DupFileManager will NOT delete anything - that is up to the user to do. DupFileManager will only move files to the "quarantine" share so that the duplicates can be reviewed and deleted by the user themselves.
    • Within the DuplicateFile share, the path to the original file will be maintained. This will ease the browsing of the DupFile share and allow the user to see where the files came from. It will also support a future feature (not to be included in the initial release, but coming "real soon"): restoring a file through the console pane to its original location.
    • The DuplicateFile share can optionally be set to have duplication turned on. I had planned to have duplication on by default, but I'm not sure what the default should be. If a server is low on disk space, and duplicate files are being moved from a non-duplicated share to the DuplicateFile share, then we would be consuming more disk space - possibly for no reason (these are duplicates after all). I may decide to create two duplicate file shares - one that is duplication on and one that has it off.
    • The user will be allowed to define "home" or "base" shares or directories where a master copy of a file should live (all other duplicates would be moved to the duplicate share) - the one copy within the "home" share will always stay put.
    • If duplicates are found within a "home" share, or if duplicates are found where all copies reside outside of "home" shares, then the DupFileManager will choose one copy to remain in place, and move all other copies out to the DuplicateFiles share.
    • Once a directory that is outside of a "home" share has had a master file identified in it, then that directory will effectively be used as another "home" share. (i.e. if further duplicates are found that have a copy in that same directory, the copy that resides in that directory will be left in place, and the other copies moved out to the DuplicateFile share). In this way, the number of directories that end up having the "master" copies of the files will be kept to the absolute minimum.
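    The "home" share and path-preservation rules above can be sketched like this (again Python for illustration; the shipping add-in is .NET, and the names are hypothetical):

```python
import os
import shutil

def quarantine(dup_group, home_shares, dup_share):
    """Pick one master copy from a duplicate group and move the rest to the
    DuplicateFile share, preserving each file's original path under it."""
    # Prefer a copy inside a "home" share; otherwise keep the first copy found
    in_home = [p for p in dup_group
               if any(p.startswith(h + os.sep) for h in home_shares)]
    master = in_home[0] if in_home else dup_group[0]
    for path in dup_group:
        if path == master:
            continue  # the master stays put
        # Re-create the original directory structure inside the dup share
        rel = os.path.splitdrive(path)[1].lstrip(os.sep)
        dest = os.path.join(dup_share, rel)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.move(path, dest)
    return master
```

    A caller implementing the last bullet would simply add each returned master's directory to `home_shares` before processing the next group.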

    As I said, I have the guts of the duplicate identification engine running: it can calculate hashes, find duplicates, and move duplicate files out to a pre-defined path. What I am down to now is looking for feedback from potential users on the features they would like to see.

     

    I need this add-in for myself as I have many, many duplicates of digital photos on the server and I need to clean them up, but I'd like to give something back to the WHS community as well.

     

    Please give me your suggestions.

     

    Thanks,

     

    Ted

     

    Tuesday, June 17, 2008 3:18 PM

All replies

  •  

    Hi Ted

     

    It sounds like a great idea, but you might want to look at these existing add-ons.

     

    https://brentf.homeserver.com/blog/technology/windows-home-server/dupecleaner-v0-0-0-1-beta/

     

    and

     

    http://akiba.geocities.jp/duplicationinfo/

     

    The first one does something similar already.

     

    Simon

    Tuesday, June 17, 2008 5:54 PM
  • Yes, I'm aware of both of those add-ins.

     

    The second one is about showing which physical disks the WHS managed duplicated files reside on. That's not what I'm going for at all.

     

    The first one is the same basic idea as mine, but it was a quick and dirty implementation by Brent at the request of someone on these forums. It does a fine job at finding duplicates, but the UI for managing what to do with them is minimal - this is the same problem I've found with the many duplicate file finders that already exist to run on Windows (outside of a WHS implementation).

     

    I'm looking to create something that will deal with the bulk of the duplicated items without needing user intervention. That's one of the reasons I don't intend my add-in to do ANY deleting of files, only moving. Even for myself, I am hesitant to allow the add-in to kill anything automatically.

     

    Although, I do have an idea for an auto-delete as a second pass - essentially a function to allow the add-in to scan the DuplicateFile share and, if the "master" copy still exists in the other shares, THEN allow the quarantined copy to be deleted.
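    Roughly, that second pass would look something like this (an illustrative Python sketch; `master_of` stands in for the add-in's database, which would map each quarantined copy back to its master):

```python
import hashlib
import os

def _sha1(path):
    """SHA-1 of a file's contents, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):
            h.update(chunk)
    return h.hexdigest()

def purge_quarantine(dup_share, master_of):
    """Delete a quarantined copy only if its recorded master still exists
    and still has the same content; otherwise leave the copy alone."""
    deleted = []
    for dirpath, _dirs, files in os.walk(dup_share):
        for name in files:
            copy = os.path.join(dirpath, name)
            master = master_of.get(copy)
            if master and os.path.exists(master) and _sha1(master) == _sha1(copy):
                os.remove(copy)
                deleted.append(copy)
    return deleted
```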

     

    Thanks for your input.

     

    Ted

     

     

    Tuesday, June 17, 2008 7:12 PM
  • Ted

     

    I use Brent's add-in and it is okay. Yours sounds much more powerful and a natural extension of it. Good luck with your development, and if you need a tester drop me an email/PM and I will gladly help.

     

    Simon

    Tuesday, June 17, 2008 9:25 PM
  •  

    Not sure if you are covering this with your first bullet point, or if it is even possible, but the problem I have with Brent's add-in is that I cannot see the second file (the duplicate) fully. I organize files with many folders and subfolders, and the location of the first file ends up so long that I only see part of the duplicate's path, which is usually close to the first file's.

    Wednesday, June 18, 2008 3:32 PM
    I'm not sure I understand all that you intend to do or how, but I'd like to participate and see if I have suggestions that make sense. My reading comprehension seems a bit off this morning, sorry.

     

    Would it be possible to find the duplicates and then go to an interactive mode where I take the time to decide which of the duplicated files gets moved to the 'duplicates' share? One of the duplicates may be in a place that makes more sense than another. Maybe even three options: move copy A, move copy B, or leave both. I can see cases where there may be duplication for a reason, and it makes more sense to leave 11 Another Harry's Bar.mp3 in the music share and move the one in \\server\user\dad to the 'duplicates' share.

     

     

    Wednesday, June 18, 2008 3:55 PM
  •  Jeshimon wrote:

    I'm not sure I understand all that you intend to do or how, but I'd like to participate and see if I have suggestions that make sense. My reading comprehension seems a bit off this morning, sorry.

     

    Would it be possible to find the duplicates and then go to an interactive mode where I take the time to decide which of the duplicated files gets moved to the 'duplicates' share? One of the duplicates may be in a place that makes more sense than another. Maybe even three options: move copy A, move copy B, or leave both. I can see cases where there may be duplication for a reason, and it makes more sense to leave 11 Another Harry's Bar.mp3 in the music share and move the one in \\server\user\dad to the 'duplicates' share.

     

     



    I'd be glad to have you participate and have your input. That's why I started this thread: I believe this is a valuable add-in, but I know that I have a kind of unusual perspective on this issue.

    My ideal add-in will NOT be used in interactive mode at all. It will scan the shares (on a schedule - maybe daily, maybe weekly) and will move any duplicates it finds out of my designated "home" shares.

    The main impetus for this is management of digital photos.

    My wife and I both use our digital cameras, and we often forget to wipe the card after uploading a batch of photos to the WHS. Because of this, many photos are duplicated the next time we dump from the card, and the next time, etc. (There are downsides to 4GB SD cards being so cheap - we can shoot for months at our normal usage and never HAVE to clear the card.)

    I know that my photos are all in the \\server\photos share. Any duplicates that are found in there can be safely moved out to the duplicate share. Also, if any duplicates of the photos are found in other shares, I want to move them out of the other shares - and leave the one in the Photo share alone.

    The first beta version is likely to be non-interactive - and non-destructive - and will just move photos out to the duplicate share automatically. I think interactivity will be a key feature, though, and I would expect it to be in the second beta.

    I like the suggestion of being able to choose what to do with each file, or whether to leave both alone. I'll definitely include that in the interactive comparison feature.

    I think along with the "move neither" option I will have to include a persistent store for this "override", so that you don't have to keep making the same choice on every scan. This will change the database back-end for the program a little bit, so that feature might take a little longer.
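    The override store doesn't need to be fancy. A minimal sketch of the idea (Python for illustration; the real add-in would fold this into its existing database, and the class and file names here are made up):

```python
import json
import os

class OverrideStore:
    """Persist 'leave both alone' decisions so the same duplicate pair
    isn't flagged again on the next scan."""

    def __init__(self, path):
        self.path = path
        self.pairs = set()
        if os.path.exists(path):
            with open(path) as f:
                self.pairs = {tuple(p) for p in json.load(f)}

    def _key(self, content_hash, path_a, path_b):
        # Sort the two locations so the key is order-insensitive
        return (content_hash,) + tuple(sorted((path_a, path_b)))

    def ignore(self, content_hash, path_a, path_b):
        self.pairs.add(self._key(content_hash, path_a, path_b))
        with open(self.path, "w") as f:
            json.dump([list(p) for p in self.pairs], f)

    def is_ignored(self, content_hash, path_a, path_b):
        return self._key(content_hash, path_a, path_b) in self.pairs
```

    The scanner would then skip any duplicate pair for which `is_ignored` returns true.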

    Thanks for the input. I'm hoping that the first beta version will be available next week.

    I'll post a link to my website when I am ready for people to try it.

    Ted

    Thursday, June 19, 2008 3:59 AM
  • Ted,

    First make it do what you want and I'll play along. If you want to go further, GREAT, but initially get what you want and I'll try to break what you have, to help find problems. At first just make it work on the 'photos' share; that way testing will be quick and easy...

    Thanks,

    Jim

     

    Thursday, June 19, 2008 6:17 AM
    There are definitely places where I want the dups to remain, so I would think instead of always moving them, you could create listings of them and then have the user go through and decide what to move or delete, and what not to.

     

    I would also say that a designated "never check here for dups" area would be good, or the ability to tag a directory so it is never checked for dups.

     

    Additionally, when the user is browsing through the found dups, something that shows the master location would be good, with a mechanism that allows the user to choose which of the two to move or delete.

     

    Summary information that shows the size of the dups (by folder, by type of file, and in total) would be nice.

     

    Having this extend to scanning the client machines attached to determine dups between machines and shares, or machine to machine (not system or program info) would also be great - especially if you could do it by analyzing the backups of those machines that are on the WHS.
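    The summary suggestion above is cheap to compute once the duplicate groups exist, since every copy beyond the first in a group is "wasted" space. A sketch (Python for illustration; names are hypothetical):

```python
import os
from collections import Counter

def dup_summary(dup_groups):
    """Total wasted bytes plus breakdowns by folder and by file extension,
    counting every copy beyond the first in each duplicate group."""
    total = 0
    by_folder = Counter()
    by_ext = Counter()
    for paths in dup_groups.values():
        for extra in paths[1:]:  # first copy is the keeper
            size = os.path.getsize(extra)
            total += size
            by_folder[os.path.dirname(extra)] += size
            by_ext[os.path.splitext(extra)[1].lower()] += size
    return total, by_folder, by_ext
```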

     

     

     

     

    Thursday, June 19, 2008 12:18 PM
  • It sounds like a useful addition. Maybe an option to post the results to a text file, so the Owner could decide which is the more appropriate location.

     

    Colin

     

    Thursday, June 19, 2008 4:54 PM
  •  

    What happened to this add-in? Did it never come out?
    Friday, July 25, 2008 12:36 PM