PC Backup: The backup system seems like magic. How does it work?
I have 4 PCs in my house. If you sum up the amount of used disk space on all of them it comes to about 220GB.
They are all being backed up by Windows Home Server. When I look at the Home Server Console the Server Storage tab says that only 98GB of disk space is being used. How can that be?!?!
Answers
Unlike most backup products that operate at the file level, the Windows Home Server computer backup solution works on "clusters". Clusters are the lower-level constructs of the file system. They are usually 4 KB in size on NTFS disks. The "magic" you are seeing is a result of the fact that Windows Home Server makes sure that any particular cluster is stored only once on the server, even if that cluster is found on multiple disks and within multiple files. This is known as "single instance storage" in geeky circles.
Here's some more detail on how this works:
- The server side of the solution is a database (not some off the shelf database, but one developed specifically for this application). The "records" in the database are clusters and hashes of those clusters (a hash is a number that uniquely identifies a cluster based on its contents). The database also contains information on the structure of a volume (NTFS file system information). If a cluster on the C: drive of Mom's computer has the first 4096 characters of "War and Peace" in it, and a cluster on the E: drive of Joey's computer has those same 4096 characters in it then their hashes will be the same.
- When a computer is backed up to the server, the code on the client computer figures out what clusters have changed since the last backup.
- The software then calculates a hash for each of these blocks and sends just the hashes to the server.
- The server looks in its database of clusters to see if any have hashes that match those it just received. If a hash matches, then that cluster is already stored on the server.
- If the clusters are NOT stored on the server already, then the computer sends them to the server and the server adds them to the database.
- All file system information is transferred and stored on the server such that a volume (from any machine) at any backup point (time) can be reconstituted from the database.
And this is how 220GB of data spread out across 4 computers can be stored in 98GB of space on your home server.
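The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual Windows Home Server code: the `BackupServer` class, its method names, and the choice of SHA-256 are all hypothetical (the thread does not say which hash function is used). For brevity the sketch hashes every cluster on each backup rather than only the ones that changed since the last backup.

```python
import hashlib

CLUSTER_SIZE = 4096  # typical NTFS cluster size mentioned above


def cluster_hash(cluster: bytes) -> str:
    # SHA-256 is an assumption; the real hash function is not stated.
    return hashlib.sha256(cluster).hexdigest()


class BackupServer:
    """Hypothetical server that stores each distinct cluster only once."""

    def __init__(self):
        self.clusters = {}  # hash -> cluster bytes
        self.volumes = {}   # (machine, volume, backup id) -> ordered hashes

    def missing(self, hashes):
        # Tell the client which hashes the server has not seen yet.
        return {h for h in hashes if h not in self.clusters}

    def store(self, cluster: bytes):
        self.clusters[cluster_hash(cluster)] = cluster

    def record_volume(self, key, hashes):
        # Stand-in for the file system structure kept per backup point.
        self.volumes[key] = list(hashes)

    def restore(self, key) -> bytes:
        return b"".join(self.clusters[h] for h in self.volumes[key])


def backup(server, key, data: bytes) -> int:
    """Client side: send hashes first, then only the clusters the server lacks.

    Returns the number of clusters actually transferred.
    """
    clusters = [data[i:i + CLUSTER_SIZE] for i in range(0, len(data), CLUSTER_SIZE)]
    hashes = [cluster_hash(c) for c in clusters]
    needed = server.missing(hashes)
    sent = 0
    for cluster, h in zip(clusters, hashes):
        if h in needed:
            server.store(cluster)
            needed.discard(h)  # do not resend a duplicate within one backup
            sent += 1
    server.record_volume(key, hashes)
    return sent
```

If Mom's C: drive and Joey's E: drive each hold two clusters and one of them is identical on both machines, backing up both volumes with this sketch transfers three clusters in total, and either volume can still be rebuilt in full with `restore`.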
All Replies
- Now that is VERY cool.
Within the computer, are the 2 hard drives mirrored?
How does the system handle drive failure? I was just about to ask this same question. Now I know and somewhat understand!!
belto...
This seems to represent a departure from the limited Veritas backup that was integrated in the NT days (the original developer name escapes me at the moment because it was so long ago). I have been hoping to see a new backup algorithm and glad to see it is in Home Server! Now I really want it!
Greg
To gain efficiencies across computers on the home network must each drive have the same cluster size?
If the same file fits wholly within a cluster on both FAT32 and NTFS will it produce the same hash on either file system type?
- How "related" are the SIS in Exchange and the SIS in WHS in terms of architecture, or are they related in name only?
A hash, by definition, is not unique to the data it represents. If you really only send the hash to the server for evaluation, then there is a chance this will lead to lost data. How do you deal with hash collisions?
- Great Explanation, and this makes a lot of sense in a multi-PC scenario.
Currently I just have one laptop being backed up to WHS, with 31.5 gigs of used space on the drive. WHS is only showing 15.5 gigs of space being used for backup via the GUI, and looking in the /Folders/GUID/ directory shows about the same amount.
Is this just pure (standard) compression, or is there some other WHS magic happening here?
cek wrote: The server looks in its database of clusters to see if any have hashes that match those it just received. If a hash matches then that cluster is already stored on the server.
Dear WHS Team, how would you respond to this? How do you guarantee that the data in the cluster matches just because the hash matches?
This property is a consequence of hash functions being deterministic. On the other hand, a hash function is not injective, i.e. the equality of two hash values ideally strongly suggests, but does not guarantee, the equality of the two inputs.
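To put a rough number on the collision concern, here is a small sketch using the standard birthday-bound approximation. The hash widths tried below are assumptions for illustration only, since the thread does not say which hash Windows Home Server uses, and `collision_probability` is a hypothetical helper name.

```python
import math


def collision_probability(num_clusters: int, hash_bits: int) -> float:
    """Approximate chance that at least two distinct clusters share a hash.

    Birthday-bound approximation: p ~= 1 - exp(-n * (n - 1) / 2**(bits + 1)),
    assuming the hash behaves like a uniformly random function.
    """
    exponent = -num_clusters * (num_clusters - 1) / 2 ** (hash_bits + 1)
    return -math.expm1(exponent)


# 220 GB of 4 KB clusters, treating 1 GB as 2**30 bytes (an assumption).
n = 220 * 2**30 // 4096

for bits in (32, 64, 128, 160):  # hypothetical hash widths
    print(bits, collision_probability(n, bits))
```

Under these assumptions a 32-bit hash would collide almost certainly at this scale, while a 128-bit or wider hash makes an accidental collision extremely unlikely, though never impossible.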
Just a question out of curiosity:
I don't know if this still happens (I do not pay attention to it, but it is still possible), but what happens if you format a disk on one PC with a cluster size of 4096 and a disk on another PC with a cluster size of 2048?
Best regards,
Chris
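The cluster-size question can be illustrated with a small sketch. It only shows what a naive scheme that hashes whole clusters would see when the same bytes are split at 2048 versus 4096; it says nothing about how Windows Home Server actually handles mixed cluster sizes. The `chunk_hashes` helper and the use of SHA-256 are assumptions.

```python
import hashlib


def chunk_hashes(data: bytes, cluster_size: int) -> set:
    """Hash each cluster-sized piece of data (SHA-256 is an assumption)."""
    return {
        hashlib.sha256(data[i:i + cluster_size]).hexdigest()
        for i in range(0, len(data), cluster_size)
    }


# 16384 bytes of non-repeating sample data: 4096 distinct 4-byte counters.
data = b"".join(i.to_bytes(4, "big") for i in range(4096))

hashes_2048 = chunk_hashes(data, 2048)
hashes_4096 = chunk_hashes(data, 4096)

# Identical data, but the two sets of hashes have nothing in common.
print(len(hashes_2048), len(hashes_4096), len(hashes_2048 & hashes_4096))
```

In this sketch the identical data yields eight hashes at one cluster size and four at the other, with none in common, so a scheme that compares whole-cluster hashes alone would not recognise the duplication across the two disks.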
