Wednesday, January 16, 2013

Real-world data deduplication savings under Windows 2012

I have been testing Windows 2012 Data Deduplication for quite a long time now (starting last October, when I first wrote on the topic). What I was missing most was real-world information about what one can expect from the deduplication process on well-known file types.

Though Microsoft provides a tool that gives a generic preview of the results a system administrator can expect from this new mechanism (the tool I am talking about is ddpeval.exe), I think that sharing real-world statistics on a per-file-type basis can greatly help in understanding its usefulness.
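
If you have never used it, ddpeval.exe simply takes the path to analyse as its argument (the path below is of course just an example); it walks the folder tree and prints an estimate of the savings you would get by enabling deduplication on it:

ddpeval.exe E:\Shares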

So I decided to use a bunch of disks, each dedicated to a single type of file, and to enable deduplication at the volume level.

I had four 50 GB SATA disks in stock, which I formatted with NTFS (remember that Windows 2012's new ReFS doesn't support deduplication for the moment).

I ran three series of tests.

In the first series of tests, I copied a single file onto each disk and made enough copies of it to fill 10 gigabytes of space. The file extensions I chose for this test are quite common ones everyone has in their data: .avi for a movie, .mp3 for an audio file, .doc for a Microsoft Word document and .iso for a disk image. These extensions cover the kinds of files you find in media libraries, document libraries and software libraries, so having an idea of what Windows 2012 deduplication can do for them is important.

On my first disk, named F:, I copied the .avi file, whose size is 700 MB. I then made 15 copies of it in order to use about 10 GB.

On the second disk, H:, I put the .mp3 file, whose size is 4.50 MB. I made 2291 copies of it in order to use 10 GB.

On the third disk, L:, I stored a 1.75 MB Microsoft Word document, then made 5277 copies of it to take a little more than 9 GB.

On the last disk, M:, I put a pretty big 3.09 GB ISO image, which I copied three times to take a little less than 10 GB.
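
For the record, producing all those copies takes nothing more than a small PowerShell loop; the source paths and target names below are purely illustrative:

1..15   | ForEach-Object { Copy-Item 'C:\samples\movie.avi' -Destination ("F:\movie_{0}.avi" -f $_) }
1..2291 | ForEach-Object { Copy-Item 'C:\samples\track.mp3' -Destination ("H:\track_{0}.mp3" -f $_) }

The same one-liner, with the counter and paths adapted, fills the L: and M: volumes too.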

On these four disks I enabled Data Deduplication and waited for it to run. A few days later I came back to check the results, and they were as good as I expected: those hundreds and thousands of files are just replicas of the same original file, so the probability of finding identical blocks is 100%.
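
For those who want to reproduce the test, enabling the feature and forcing an immediate optimization pass looks more or less like this (remember that by default freshly written files are skipped for a few days, hence the MinimumFileAgeDays change):

Import-Module Deduplication
Enable-DedupVolume -Volume F:
Set-DedupVolume -Volume F: -MinimumFileAgeDays 0
Start-DedupJob -Volume F: -Type Optimization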

OK, let's make the numbers talk. Here are the results I got:
Get-DedupStatus

FreeSpace    SavedSpace   OptimizedFiles   Volume
---------    ----------   --------------   ------
49.08 GB     9.57 GB      15               F: .avi
49.76 GB     10.07 GB     2291             H: .mp3
49.73 GB     9.02 GB      5277             L: .doc
44.11 GB     6.73 GB      3                M: .iso

Get-DedupVolume

Enabled      SavedSpace   SavingsRate      Volume
------       ----------   -----------       ------
True         9.57 GB      91 %             F: .avi
True         10.07 GB     97 %             H: .mp3
True         9.02 GB      97 %             L: .doc
True         6.73 GB      53 %             M: .iso
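
A quick note on how the SavingsRate figure seems to be computed: it looks like the ratio between the saved space and the sum of the saved space plus the space still in use on the volume. Taking F: as an example (a 50 GB volume with 49.08 GB free, hence about 0.92 GB used):

9.57 / (9.57 + 0.92) = 0.91  ->  91 %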

Get-DedupMetadata

.AVI extension
Volume                         : F:
DataChunkCount                 : 9411
DataChunkAverageSize           : 76.2 KB
TotalChunkStoreSize            : 710.5 MB

.MP3 extension
Volume                         : H:
DataChunkCount                 : 59
DataChunkAverageSize           : 78.2 KB
TotalChunkStoreSize            : 17.61 MB

.DOC extension
Volume                         : L:
DataChunkCount                 : 27
DataChunkAverageSize           : 66.15 KB
TotalChunkStoreSize            : 12.87 MB

.ISO extension
Volume                         : M:
DataChunkCount                 : 36471
DataChunkAverageSize           : 73.28 KB
TotalChunkStoreSize            : 2.56 GB
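
As a sanity check, these chunk figures are consistent with one another: the number of chunks multiplied by the average chunk size gives back, approximately, the size of the chunk store. For the ISO volume:

36471 chunks x 73.28 KB ≈ 2.55 GB ≈ TotalChunkStoreSize (2.56 GB)

In other words, the chunk store ends up holding roughly one copy of the unique data, or even a bit less, since chunks get compressed too.
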
As you can see, I got very good deduplication results for documents and music.

Here's a table where the difference between the theoretical size of all the files without deduplication (5th column) and the real size on disk (6th column) can easily be compared. The last column shows the overhead of the deduplication mechanism, calculated as [Real used space MB] - [Size on disk MB].

Data type    Ext.    Original size MB   Copies   Theoretical size MB   Size on disk MB   Real used space MB   Overhead MB
---------    ----    ----------------   ------   -------------------   ---------------   ------------------   -----------
Music        .mp3    4.5                2291     10322.1               8.9               245.0                236.1
Video        .avi    700.3              15       10504.8               0.1               937.0                936.9
WinWord      .doc    1.8                5277     9237.3                20.6              274.0                253.4
Disc image   .iso    3167.1             3        9501.2                3167.1            6021.1               2854.0
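
If you want to gather the same kind of figures on your own volumes, the 'theoretical' size is just the sum of the logical file lengths, while the really used space comes from the volume counters; something like this does the trick (F: is just an example):

# Logical size of all the files on the volume, in MB
(Get-ChildItem F:\ -Recurse -File | Measure-Object -Property Length -Sum).Sum / 1MB

# Space really consumed on the volume, in MB
$v = Get-Volume -DriveLetter F
($v.Size - $v.SizeRemaining) / 1MB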

Here are the same results as they are reported by the Server Manager interface:


In the second series of tests, I reused the same volumes as above (after reformatting them) to store many different files of the same type on each of them. Here's what I did.

On volume F: I stored 52 different .avi test files that take 20.6 GB of space. Their sizes vary between 100 MB and 1.5 GB.

On volume H: I put 3714 different .mp3 test files. Their total size is 17.6 GB.

On volume L: I put 773 Word documents. Total size: 745 MB.

On volume M: I copied 9 .iso images taking 8.87 GB of disk space.

Once all the files were in their respective partitions, I enabled Data Deduplication and waited for all the optimization, scrubbing and garbage collection jobs to finish.
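
By the way, you don't have to wait blindly: the running jobs and their progress can be checked at any time with the dedup cmdlets, roughly like this:

Get-DedupJob                  # shows queued and running jobs with their progress
Get-DedupStatus | Format-List # per-volume savings once the jobs have completed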

When I came back a few days later to see the situation, here's what I got:
Get-DedupStatus

FreeSpace    SavedSpace   OptimizedFiles   Volume
---------    ----------   --------------   ------
29.11 GB     156.48 MB    52               F: .avi
32.72 GB     772.52 MB    3714             H: .mp3
49.53 GB     495.28 MB    598              L: .doc
42.84 GB     2 GB         9                M: .iso


Get-DedupVolume

Enabled      SavedSpace   SavingsRate      Volume
-------      ----------   -----------      ------
True         156.48 MB    0 %              F: .avi
True         772.52 MB    4 %              H: .mp3
True         495.28 MB    50 %             L: .doc
True         2 GB         21 %             M: .iso

The gain on the document library is simply huge, reaching 50%.

On the other hand, as you might expect, the optimization gain on .avi files is close to zero. Same for .mp3 files. That's because the applications that write these kinds of files already eliminate redundant information, so identical blocks are highly unlikely. In theory I should get equally poor results with pictures (.jpg, .jpeg) and other kinds of compressed audio files. Nonetheless, having Windows 2012 deduplicate your media or picture library will allow you to keep duplicate pictures, films or other kinds of files on the same volume without necessarily wasting more space.

Let's imagine, for instance, that you have a set of personal photos and you want to copy some of them to a folder on the same partition in order to share them through some kind of web service. The amount of used space would stay the same because the deduplication engine would see that some blocks are replicated and replace them with pointers.

This last statement is exactly what I aim to demonstrate in the third series of tests.

On volume H:, where the .mp3 files are stored, I created a subfolder named 'copy of music library' and copied 1000 .mp3 files from the root folder into it.
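
The copy itself can be done with a one-liner; this is roughly how I would script it (picking the first 1000 files is arbitrary):

New-Item 'H:\copy of music library' -ItemType Directory
Get-ChildItem H:\ -Filter *.mp3 | Select-Object -First 1000 | Copy-Item -Destination 'H:\copy of music library'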

Unsurprisingly, the disk space used barely increased. We went from:
FreeSpace    SavedSpace   OptimizedFiles  InPolicyFiles
---------    ----------   --------------  -------------
32.72 GB     772.52 MB    3714            3714         
to:
FreeSpace    SavedSpace   OptimizedFiles  InPolicyFiles
---------    ----------   --------------  -------------
32.64 GB     5.39 GB      4714            4714         

As you can see, the number of optimized files increased by 1000, but the free space stayed practically the same.

So, in the end, my opinion is that block-level deduplication is a nice improvement in the world of storage management, both for home and professional use. Windows 2012 does a really good background job of seeking out duplicate blocks, and I have encountered no errors at all so far. I have been running this for at least three months now and I am definitely happy with it.

Feel free to contribute to this post by sharing your deduplication experience. I think an interesting debate can be had on this subject if many people pop in and share their thoughts.

27 comments:

  1. Thanks for the useful dedupe commands... I've set up dedupe on a 2012 server, but it refuses to actually dedupe anything on this 150 TB volume. Ever heard of any size limits on a dedupe volume?

    Replies
    1. I haven't heard of any limit for the moment. What do you mean by 'it refuses to dedupe'? Is there any event in the logs? Can you check and see if the fsdmhost.exe process is running? Maybe it's taking a while due to the amount of data to analyze.
      Also try ddpeval.exe and see what it reports.
      Regards
      Carlo

    2. It just sits at a 0 % rate and no savings. FSDMHost.exe is not currently running, and ddpeval fails with:
      "ERROR: Evaluation not supported on system, boot or Data Deduplication enabled volumes", regardless of whether dedupe is on or off.

      There is a VSS warning in the dedupe logs, which I am looking into.

      Funny, it had no problem with a 60 TB volume I had created prior to the bigger one...

    3. Well, you could try to run the scheduled task named 'BackgroundOptimization' under the Task Scheduler library \ Microsoft \ Windows \ Deduplication and check whether fsdmhost.exe appears in your Resource Monitor.

      Also, you could issue the Start-DedupJob cmdlet.
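
      For example, something along these lines (the drive letter is just an example):

      Start-DedupJob -Volume E: -Type Optimization
      Get-DedupJob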

      What VSS warning do you get?

    4. Log Name: Microsoft-Windows-Deduplication/Operational
      Source: Microsoft-Windows-Deduplication
      Date: 2/5/2013 9:06:49 AM
      Event ID: 4110
      Task Category: None
      Level: Warning
      Keywords:
      User: SYSTEM
      Computer: dc-mgmt-02.vdc.local
      Description:
      Data Deduplication was unable to create or access the shadow copy for volumes mounted at "K:" ("0x80042306"). Possible causes include an improper Shadow Copy configuration, insufficient disk space, or extreme memory, I/O or CPU load of the system. To find out more information about the root cause for this error please consult the Application/System event log for other Deduplication service, VSS or VOLSNAP errors related with these volumes. Also, you might want to make sure that you can create shadow copies on these volumes by using the VSSADMIN command like this: VSSADMIN CREATE SHADOW /For=C:


      Operation:
      Creating shadow copy set.
      Creating shadow copy set.
      Running the deduplication job.

      Context:
      Volume name: K: (\\?\Volume{f4bdc5cf-7e4e-4ccc-895f-41dfa48a5ae8}\)
      Code: SCANENGC.00002402; Call: SCANENGC.00002312; CMD: C:\Windows\SYSTEM32\FSDMHOST.EXE {080aa921-fa0a-437a-b2bb-990fd347f01d} 4fe7495b-51e3-4f6f-95d8-4d2a56229a60 348eab9b-ec9e-43ea-a335-61d2e11faf71 OptimizationJob ; User: Name: NT AUTHORITY\SYSTEM, SID:S-1-5-18

  2. Further investigation reveals that VSS cannot be enabled, which appears to be a pre-requisite for dedupe to work properly.

    Replies
    1. Hi Josh,
      I imagine you tried to issue the suggested VSSADMIN CREATE SHADOW /For=K: and it failed, right?
      If you issue the very same command on your 60 TB volume, what do you get? It would be interesting to understand whether there is something wrong with your filesystem or whether you are lacking the disk space for fsdmhost to run.

      As a side question, how long did it take to dedupe your 60 TB volume? And how much space were you able to save (also considering the file types)?

    2. I found this:
      http://technet.microsoft.com/en-us/library/cc755419%28v=ws.10%29.aspx
      Maybe with a little math you can figure out what component is limiting VSS. It could be related to paged pool or non-paged pool exhaustion... which is what I would try to rule out at first.
      How much RAM do you have on that server?

    3. I did try creating a shadow via the GUI, which failed. I currently have 32 GB of RAM on this server. More than enough to handle the dedupe and VSS requirements. Thanks for that article, it may provide some good clues.

      Dedupe on the 60 TB volume was very quick when it was just a bunch of duplicate ISOs.

  3. After much trial and error, and zero documentation on this issue from Microsoft... the answer is this:

    1. Dedupe on Windows 2012 requires VSS. If VSS fails, so will deduplication.

    2. VSS will fail on any single volume larger than 64 TB.

    3. Therefore, dedupe is limited to volumes of at most 64 TB.

    Replies
    1. @Josh

      Thanks very much for sharing this information!! I was aware of the dependence on the VSS writer, but I am surprised that deduplication is limited to 64 TB and that Microsoft didn't tell us! Maybe they didn't bother testing their configuration maximums... waiting for someone else to do it on their behalf...

      Thanks, and keep us updated if you discover something else!

      Carlo

  4. Interesting topic! This blog is top-notch!
    Thomas M.

  5. Hello, I'm curious about read performance on the de-duplicated volume. I'm familiar with DataDomain appliances, which are great with write performance but terrible at random reads. Did you happen to do any performance testing on the de-duplicated volume?

    For the use case I'm considering, which is a backup repository for VM images, random read i/o is fairly important. We use Veeam Backup, and we want to use their Surebackup feature; this spins up the VM directly from the backup image in an isolated network. Our experience with DataDomain and this feature hasn't been very good, so we're looking into alternatives...

    Replies
    1. @Loren

      For the moment I haven't tested read performance, but in theory there should be a 5-10% (some say 3%) overhead on seldom-read files, while files in the cache see a performance improvement (the dedupe engine has a sort of caching system).

      Honestly this performance hit is transparent to the end-user.

      Please let me know how it goes for you!

    2. Interesting. On the DataDomain, random read performance takes a hit on the order of 70%. I wouldn't have expected it to be that much different for another de-duplicating file system.

    3. Hi Gordon,

      Such good read performance in the Windows dedupe engine is due to the algorithm behind the Master File Table, which is a B-tree.

      In a few words, your Windows disk has a list of files and folders organized in a hierarchical way, so that a search takes only a few steps (say, three) to find a given filename; from there a first data chunk is read and, for the rest of the data, pointers tell you where it lives on the disk.

      Deduplication adds a new type of pointer (a reparse point specific to deduplication) which sends one or more files to the same sectors on disk.

      So you see, there were pointers before and there are pointers now, and performance stays roughly the same.
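
      By the way, you can actually see the deduplication reparse point on an optimized file from PowerShell; a quick check could look like this (the path is just an example, and the exact attribute list may vary):

      (Get-Item 'F:\movie_1.avi').Attributes
      # Archive, SparseFile, ReparsePoint  <- ReparsePoint means the file has been optimized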

      For more, check this: http://www.happysysadm.com/2012/10/data-deduplication-in-windows-server.html, or on Wikipedia you'll find a good explanation of how a B-tree works.

      Hope this helps!
      Carlo

  6. Fantastic article. Do you know how it would behave with millions of small files?

    Replies
    1. Thanks!
      Well, it depends on the size of the files and on their MIME type. Can you be more specific?

  7. Any tests with large files? More specifically files that are larger than 3 TB? I'm trying to implement dedupe with our Veeam backups but dedupe doesn't seem to work properly.

    Replies
    1. None larger than those I mentioned in the post. What kind of unexpected behavior are you getting? Which error message? Does dedupe work for standard-sized files?

    2. Deduplication does not work on files larger than 1TB.
      http://msdn.microsoft.com/en-us/library/hh769303(v=vs.85).aspx

  8. Hi Carlo,

    This is an excellent article indeed and I'm glad I bumped into it, thanks a bunch!

    Have you had a chance to test the following scenarios yet:

    1) Backup a deduped Server 2012 R2 volume using conventional backup software like Backup Exec;

    2) Full [volume] and selective [individual files] restore to the original source, and alternate dedupe-enabled Server 2012 R2 target;

    3) Full [volume] and selective [individual files] restore to an alternate, dedupe-unaware, legacy OS target volume e.g. Server 2008 R2; Server 2003 R2, all NTFS formatted?

    Thanks.
    Miro

    Replies
    1. Hi Miro,

      No, I didn't test Backup Exec or any other backup software, but I think it shouldn't matter since the deduplication mechanism is transparent.

      I tried instead to restore single files to an alternate dedupe-enabled server and it works flawlessly.

      Legacy OSes won't be able to access deduped files larger than 64 KB.

      Hope this helps. Keep me updated on your tests if you wish.
      Carlo

  9. Thanks for the interesting article. Have you by any chance had a look at the 2012R2 deduplication feature? I've read reports that it has improved quite a bit, especially in the amount of data it can process in 24h (so it is now able to handle larger files).

    Hopefully stability wise as well - I also read of some people getting metadata corruption on the deduped volumes after a while.

    Replies
    1. Hi,

      Yes, I have upgraded some of my 2012 systems to R2 and had no problems at all. I did not notice any particular improvement or regression, but I didn't go too deep in my tests.

      No metadata corruption on my side. Can you post the links that talk about this issue so I can have a look, please?

      Carlo

  10. Hi,

    Nice article. I've had similar numbers in my tests. The only thing that confuses me is the folder size that I see in the folder properties. For example, a folder with 22 MKV files shows Size 6.87 GB and Size on disk 0 bytes?! On that disk there are more than 3000 AVI and MKV files. I really don't expect any savings on these types of files, so why is Windows reporting numbers like this? I can't find anything about this anywhere.

    Regards

  11. Since dedupe uses reparse points, what you're seeing is that all of the "files" in that folder are just pointers into the chunk store. In your case, there might be no savings at all from deduplication or compression, but the chunks are still in the store, so all the files are reparse points and size on disk shows zero. In general, even when size on disk is not zero, it's not a useful measure of how much space is used by the files, nor is it a useful measure of how much space was saved by dedupe.

    With dedupe, you can't really determine how much disk space is "used" by a particular set of files if those files are in policy, since the chunks that make up that file will (hopefully!) belong to lots of files. But you can determine how much space you would reclaim if you deleted the files and then ran a cleanup job. To do that, use the Measure-DedupFileMetadata cmdlet in powershell: http://technet.microsoft.com/en-us/library/jj659278(v=wps.620).aspx
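
    A minimal invocation looks like this, with the path pointing to whatever folder you are curious about:

    Measure-DedupFileMetadata -Path 'D:\Shares\Movies'

    The figures it returns estimate how much space would actually be reclaimed if those files were deleted and a garbage collection job ran afterwards.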

    Hope this helps!

