I have been testing Windows 2012 Data Deduplication for quite a long time now (since last October, when I first wrote about the topic). What I felt was missing most was real-world information about what one can expect from the deduplication process on well-known file types.
Although Microsoft provides a tool, ddpeval.exe, that gives a generic preview of the results a system administrator can expect from this new mechanism, I think that sharing real-world statistics per file type can greatly help in understanding its usefulness.
So I decided to dedicate one disk to each type of file and enable deduplication at the file system level.
I had four 50 GB SATA disks on hand, which I formatted with NTFS (remember that the new ReFS in Windows 2012 doesn't support dedupe for the moment).
I ran three series of tests.
In the first series of tests, I copied one single file onto each disk and made enough copies of it to fill 10 gigabytes of space. The file extensions I chose for this test are common ones that everyone has among their data: .avi for a movie file, .mp3 for an audio file, .doc for a Microsoft Word file and .iso for a disk image. These extensions cover the types of files you find in media libraries, document libraries and software libraries, so having an idea of what Windows 2012 deduplication can do for them is important.
On my first disk, F:, I copied the .avi file, whose size is 700 MB. I then made 15 copies of it in order to use 10 GB.
On the second disk, H:, I put the .mp3 file, whose size is 4.50 MB. I made 2291 copies of it in order to use 10 GB.
On the third disk, L:, I stored a 1.75 MB Microsoft Word document, then made 5277 copies to take up 10 GB.
On the last disk, M:, I copied a fairly big 3.09 GB ISO image, which I copied three times to take up a little less than 10 GB.
On these four disks I enabled Data Deduplication and waited for it to run. A few days later I came back to see the results, and they were as good as I expected: since those hundreds and thousands of files are just replicas of the same original file, the probability of finding identical blocks is 100%.
OK, let's let the numbers talk. Here are the results I got:
Get-DedupStatus

FreeSpace SavedSpace OptimizedFiles Volume
--------- ---------- -------------- ------
49.08 GB  9.57 GB    15             F: .avi
49.76 GB  10.07 GB   2291           H: .mp3
49.73 GB  9.02 GB    5277           L: .doc
44.11 GB  6.73 GB    3              M: .iso

Get-DedupVolume

Enabled SavedSpace SavingsRate Volume
------- ---------- ----------- ------
True    9.57 GB    91 %        F: .avi
True    10.07 GB   97 %        H: .mp3
True    9.02 GB    97 %        L: .doc
True    6.73 GB    53 %        M: .iso

Get-DedupMetadata

.AVI extension
Volume               : F:
DataChunkCount       : 9411
DataChunkAverageSize : 76.2 KB
TotalChunkStoreSize  : 710.5 MB

.MP3 extension
Volume               : H:
DataChunkCount       : 59
DataChunkAverageSize : 78.2 KB
TotalChunkStoreSize  : 17.61 MB

.DOC extension
Volume               : L:
DataChunkCount       : 27
DataChunkAverageSize : 66.15 KB
TotalChunkStoreSize  : 12.87 MB

.ISO extension
Volume               : M:
DataChunkCount       : 36471
DataChunkAverageSize : 73.28 KB
TotalChunkStoreSize  : 2.56 GB
As you can see, I got very good deduplication results for documents and music.
Here's a table that makes it easy to compare the theoretical size of all the files without deduplication (5th column) with the size on disk (6th column). The last column shows the overhead of the deduplication mechanism, calculated as [Real used drive space MB] - [Size on disk MB].
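The SavingsRate column can be sanity-checked against the other figures in the output. The sketch below rests on two assumptions that the cmdlet output does not state: that the rate is SavedSpace divided by SavedSpace plus the space still in use, truncated to a whole percent, and that each volume holds exactly 50 GB. The function name is made up for illustration.

```python
# Hypothetical check: SavingsRate ~= SavedSpace / (SavedSpace + UsedSpace).
# Assumptions (not confirmed by the source): 50 GB usable capacity per volume,
# and a rate that is truncated to a whole percent.
CAPACITY_GB = 50.0

# (volume, free space in GB, saved space in GB) copied from the first series
first_series = [
    ("F: .avi", 49.08, 9.57),
    ("H: .mp3", 49.76, 10.07),
    ("L: .doc", 49.73, 9.02),
    ("M: .iso", 44.11, 6.73),
]


def estimated_savings_rate(free_gb, saved_gb, capacity_gb=CAPACITY_GB):
    """Return the estimated savings rate as a whole percentage."""
    used_gb = capacity_gb - free_gb
    return int(100 * saved_gb / (saved_gb + used_gb))


for volume, free_gb, saved_gb in first_series:
    print(volume, estimated_savings_rate(free_gb, saved_gb), "%")
```

Under those assumptions the estimates come out at 91, 97, 97 and 53, which matches the reported column. With a slightly different usable capacity the numbers would shift, so treat this as a plausibility check rather than the documented formula.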
| Data type | File extension | Original file size MB | Number of copies | Theoretical total size MB | Size on disk MB | Real used drive space MB | Dedup overhead MB |
|---|---|---|---|---|---|---|---|
| Music | .mp3 | 4.5 | 2 291 | 10 322.1 | 8.9 | 245.0 | 236.1 |
| Video | .avi | 700.3 | 15 | 10 504.8 | 0.1 | 937.0 | 936.9 |
| WinWord | .doc | 1.8 | 5 277 | 9 237.3 | 20.6 | 274.0 | 253.4 |
| Disc Image | .iso | 3 167.1 | 3 | 9 501.2 | 3 167.1 | 6 021.1 | 2 854.0 |
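The overhead column can be recomputed from the other columns. This is a minimal sketch using the figures from the table; the dictionary keys and function names are illustrative, and the "effective saving" figure is my own derived ratio (real used space against theoretical size), not something reported by Windows.

```python
# Figures in MB, copied from the table above. Key names are illustrative.
rows = {
    ".mp3": {"theoretical": 10322.1, "size_on_disk": 8.9, "real_used": 245.0},
    ".avi": {"theoretical": 10504.8, "size_on_disk": 0.1, "real_used": 937.0},
    ".doc": {"theoretical": 9237.3, "size_on_disk": 20.6, "real_used": 274.0},
    ".iso": {"theoretical": 9501.2, "size_on_disk": 3167.1, "real_used": 6021.1},
}


def dedup_overhead_mb(row):
    """Overhead as defined in the text: real used space minus size on disk."""
    return round(row["real_used"] - row["size_on_disk"], 1)


def effective_saving_pct(row):
    """Derived ratio: share of the theoretical size that is not actually used."""
    return round(100 * (1 - row["real_used"] / row["theoretical"]), 1)


for ext, row in rows.items():
    print(ext, dedup_overhead_mb(row), "MB overhead,",
          effective_saving_pct(row), "% effective saving")
```

The overhead values reproduce the last column of the table (236.1, 936.9, 253.4 and 2854.0 MB).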
Here are the same results as reported by the Server Manager interface:

In the second series of tests, I used the same volumes as above (after reformatting them) to store many different files of the same type on each of them. Here's what I did.
On volume F: I stored 52 different .avi test files that take 20.6 GB of space. Their sizes vary between 100 MB and 1.5 GB.
On volume H: I put 3714 different .mp3 test files. Their total size is 17.6 GB.
On volume L: I put 773 Word documents. Total size: 745 MB.
On volume M: I copied 9 .iso images taking 8.87 GB of disk space.
Once all the files were in their respective partitions, I enabled data deduplication and waited for all the optimization, scrubbing and garbage collection jobs to finish.
When I came back a few days later to see the situation, here's what I got:
Get-DedupStatus

FreeSpace SavedSpace OptimizedFiles Volume
--------- ---------- -------------- ------
29.11 GB  156.48 MB  52             F: .avi
32.72 GB  772.52 MB  3714           H: .mp3
49.53 GB  495.28 MB  598            L: .doc
42.84 GB  2 GB       9              M: .iso

Get-DedupVolume

Enabled SavedSpace SavingsRate Volume
------- ---------- ----------- ------
True    156.48 MB  0 %         F: .avi
True    772.52 MB  4 %         H: .mp3
True    495.28 MB  50 %        L: .doc
True    2 GB       21 %        M: .iso
The gain on the document library is simply huge, reaching 50%. On the other hand, as you might expect, the optimization gain on .avi files is close to zero, and .mp3 files do little better at 4%. That's because the applications that write these kinds of files already eliminate redundant information, so identical blocks are highly unlikely. In theory I should get no better results with pictures (.jpg, .jpeg) and other kinds of compressed music files. Nonetheless, having Windows 2012 deduplicate your media or picture library lets you keep duplicate pictures, films or other files on the same volume without necessarily wasting more space.
Let's imagine, for instance, that you have a set of personal photos and that you want to copy some of them to a folder on the same partition and share them through some kind of web service. The amount of used space would stay the same, because the deduplication engine would see that some blocks are duplicated and replace them with pointers.
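To make the "replace duplicate blocks with a pointer" idea concrete, here is a toy sketch of a chunk store. It is not how Windows implements deduplication: the fixed 64 KiB chunk size, the SHA-256 hashing and the `ChunkStore` class are all assumptions chosen for illustration (the Get-DedupMetadata output earlier only tells us that chunks averaged roughly 66 to 78 KB).

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # assumed fixed chunk size, for illustration only


class ChunkStore:
    """Toy content-addressed store: each unique chunk is kept only once."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes
        self.files = {}   # file name -> list of digests (the "pointers")

    def add_file(self, name, data):
        digests = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # store only if unseen
            digests.append(digest)
        self.files[name] = digests

    def read_file(self, name):
        # Rebuild the file by following its pointers into the chunk store.
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())


# Sample data: four distinct chunks standing in for a photo (hypothetical paths).
photo = b"".join(bytes([i]) * CHUNK_SIZE for i in range(4))

store = ChunkStore()
store.add_file("photos/holiday.jpg", photo)
size_before_copy = store.stored_bytes()

store.add_file("shared/holiday.jpg", photo)  # second copy on the same volume
size_after_copy = store.stored_bytes()

print(size_before_copy == size_after_copy)
```

The second copy only adds a list of digests, so the amount of stored chunk data does not grow, which is the behaviour the photo example describes.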
In the third series of tests, this last statement is exactly what I aim to demonstrate.
On volume H:, where the mp3 files are stored, I created a subfolder named 'copy of music library' and copied 1000 mp3 files from the root folder into it.
Unsurprisingly, the disk space used barely increased. We went from:
FreeSpace SavedSpace OptimizedFiles InPolicyFiles
--------- ---------- -------------- -------------
32.72 GB 772.52 MB 3714 3714
to:
FreeSpace SavedSpace OptimizedFiles InPolicyFiles
--------- ---------- -------------- -------------
32.64 GB 5.39 GB 4714 4714
As you can see, the number of files increased by 1000, but the free space barely moved (from 32.72 GB to 32.64 GB), while SavedSpace grew from 772.52 MB to 5.39 GB.
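The before and after figures can be turned into a quick back-of-the-envelope calculation. This sketch only uses the numbers from the two outputs above; the variable names are mine, and it assumes the cmdlet reports binary units (1 GB = 1024 MB).

```python
# Values copied from the two Get-DedupStatus outputs above.
free_before_gb, free_after_gb = 32.72, 32.64
saved_before_mb, saved_after_gb = 772.52, 5.39
copied_files = 4714 - 3714

MB_PER_GB = 1024  # assumption: sizes are reported in binary units

free_space_lost_mb = (free_before_gb - free_after_gb) * MB_PER_GB
extra_saved_mb = saved_after_gb * MB_PER_GB - saved_before_mb
avg_saved_per_copy_mb = extra_saved_mb / copied_files

print(f"Free space lost:  {free_space_lost_mb:.0f} MB")
print(f"Extra SavedSpace: {extra_saved_mb:.0f} MB")
print(f"Per copied file:  {avg_saved_per_copy_mb:.2f} MB")
```

That works out to roughly 82 MB of free space lost against roughly 4.6 GB of additional savings, or about 4.75 MB saved per copied file. This is close to the average file size in the library (17.6 GB across 3714 files, about 4.85 MB each), which is what you would expect if each copy is deduplicated almost entirely.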
So, in the end, my opinion is that block-level deduplication is a nice improvement in the world of storage management, both for home and professional use. Windows 2012 does a really good background job of seeking out duplicate blocks, and I have encountered no errors at all so far. I have been running this for at least three months now and I am definitely happy with it.
Feel free to contribute to this post by sharing your deduplication experience. I think an interesting debate can be had on this subject if many people pop in and share their thoughts.
Thanks for the useful dedupe commands... I've setup dedupe on a 2012 server, but it refuses to actually dedupe anything on this 150TB volume. Ever hear of any size limits on a dedupe volume?
I haven't heard of any limit for the moment. What do you mean by 'it refuses to dedupe'? Is there any event in the logs? Can you check whether the fsdmhost.exe process is running? Maybe it's taking a while due to the amount of data to analyze.
Also try ddpeval.exe and see what it reports.
Regards
Carlo
It just sits at a 0 rate and savings. fsdmhost.exe is not running currently, and ddpeval fails with:
"ERROR: Evaluation not supported on system, boot or Data Deduplication enabled volumes", regardless of whether dedupe is on or off.
There is a VSS warning in the dedupe logs, which i am looking into.
Funny, it had no problem with a 60TB volume i had created prior to the bigger one...
Well, you could try running the scheduled task named 'BackgroundOptimization' under Task Scheduler Library / Microsoft / Windows / Deduplication and check whether fsdmhost appears in your Resource Monitor.
You could also issue the Start-DedupJob cmdlet.
What VSS warning do you get?
Log Name: Microsoft-Windows-Deduplication/Operational
Source: Microsoft-Windows-Deduplication
Date: 2/5/2013 9:06:49 AM
Event ID: 4110
Task Category: None
Level: Warning
Keywords:
User: SYSTEM
Computer: dc-mgmt-02.vdc.local
Description:
Data Deduplication was unable to create or access the shadow copy for volumes mounted at "K:" ("0x80042306"). Possible causes include an improper Shadow Copy configuration, insufficient disk space, or extreme memory, I/O or CPU load of the system. To find out more information about the root cause for this error please consult the Application/System event log for other Deduplication service, VSS or VOLSNAP errors related with these volumes. Also, you might want to make sure that you can create shadow copies on these volumes by using the VSSADMIN command like this: VSSADMIN CREATE SHADOW /For=C:
Operation:
Creating shadow copy set.
Creating shadow copy set.
Running the deduplication job.
Context:
Volume name: K: (\\?\Volume{f4bdc5cf-7e4e-4ccc-895f-41dfa48a5ae8}\)
Event Xml:
4110
0
3
0
0
0x8000000000000000
697
Microsoft-Windows-Deduplication/Operational
dc-mgmt-02.vdc.local
K:
0x80042306
Operation:
Creating shadow copy set.
Creating shadow copy set.
Running the deduplication job.
Context:
Volume name: K: (\\?\Volume{f4bdc5cf-7e4e-4ccc-895f-41dfa48a5ae8}\)
Code: SCANENGC.00002402; Call: SCANENGC.00002312; CMD: C:\Windows\SYSTEM32\FSDMHOST.EXE {080aa921-fa0a-437a-b2bb-990fd347f01d} 4fe7495b-51e3-4f6f-95d8-4d2a56229a60 348eab9b-ec9e-43ea-a335-61d2e11faf71 OptimizationJob ; User: Name: NT AUTHORITY\SYSTEM, SID:S-1-5-18
Further investigation reveals that VSS cannot be enabled, which appears to be a pre-requisite for dedupe to work properly.
Hi Josh,
I imagine you tried the suggested VSSADMIN CREATE SHADOW /For=K: and it failed, right?
If you issue the very same command on your 60TB volume, what do you get? It would be interesting to understand whether there is something wrong with your file system or whether you are lacking disk space for fsdmhost to run.
As a side question, how long did it take to dedupe your 60TB volume? And how much space were you able to save (also considering the file types)?
I found this:
http://technet.microsoft.com/en-us/library/cc755419%28v=ws.10%29.aspx
Maybe with a little math you can figure out what component is limiting VSS. It could be related to paged pool or non-paged pool exhaustion... which is what I would try to rule out at first.
How much RAM do you have on that server?
I did try creating a shadow via the GUI, which failed. I currently have 32GB of RAM on this server, more than enough to handle the dedupe and VSS requirements. Thanks for that article, it may provide some good clues.
Dedupe on the 60TB was very quick when it was just a bunch of duplicate ISOs.
After much trial and error, and zero documentation on this issue from microsoft... The answer is this:
1. Dedupe on Windows 2012 requires VSS. If VSS fails, so will deduplication.
2. VSS will fail on any single volume larger than 64TB.
3. Therefore, dedupe is limited to volumes of at most 64TB.
@Josh
Thanks very much for sharing this information! I was aware of the dependence on the VSS writer, but I am surprised that deduplication is limited to 64TB and that Microsoft didn't tell us! Maybe they didn't bother testing their configuration maximums and were waiting for someone else to do it on their behalf...
Thanks, and keep us updated if you discover something else!
Carlo
Interesting topic! This blog is top-notch!
Thomas M.
Hello, I'm curious about read performance on the de-duplicated volume. I'm familiar with DataDomain appliances, which are great with write performance but terrible at random reads. Did you happen to do any performance testing on the de-duplicated volume?
For the use case I'm considering, which is a backup repository for VM images, random read i/o is fairly important. We use Veeam Backup, and we want to use their Surebackup feature; this spins up the VM directly from the backup image in an isolated network. Our experience with DataDomain and this feature hasn't been very good, so we're looking into alternatives...
@Loren
I haven't tested read performance yet, but in theory there should be a 5-10% overhead (some say 3%) on seldom-read files, while files in cache see a performance improvement (the dedupe engine has a sort of caching system).
Honestly, this performance hit is transparent to the end user.
Please let me know how it goes for you!
Interesting. On the DataDomain, random read performance takes a hit on the order of 70%. I wouldn't have expected it to be that much different for another de-duplicating file system.
Hi Gordon,
Such good read performance in the Windows dedupe engine is due to the algorithm behind the Master File Table, which is a B-tree.
In a few words, your Windows disk has a list of files and folders organized hierarchically, so that a search takes only a few steps (say, three) to find a given filename. From there a first data chunk is read, and for the rest of the data a pointer tells you where it is on the disk.
Deduplication adds a new type of pointer (a reparse point specific to deduplication) which points one or more files to the same sectors on disk.
So you see, there were pointers before and there are pointers now, and perf stays roughly the same.
For more, check this: http://www.happysysadm.com/2012/10/data-deduplication-in-windows-server.html or on Wikipedia you'll find a good explanation of how a B-tree works.
Hope this helps!
Carlo