4

We need to regularly transfer large (60GB) Hyper-V virtual machine images around our WAN (UK to USA) over 50Mbps leased lines. We also use DFS-R between the sites. Historically, I've used 7-zip to zip up the virtual machine (split into smaller 100MB chunks) and then dropped the files into a DFS-R transfer folder. When the backlog clears, we unzip at the other end.

I wonder if I'm wasting my time and might as well drop the entire VM (VMDX files mainly) in the transfer folder and let DFS-R compress it during the transfer.

So the question is - how efficient is the DFS-R compression algorithm compared to 7-zip's native 7z format? 7-zip packs the image down to about 20GB so a 70% saving.

I get the feeling that the extra time to pack and unpack outweighs any possible higher compression ratio in the 7-zip algorithm. That said, transferring 100MB chunks feels "better" than one big 60GB VMDX file.

Rob Nicholson
  • 1,707
  • 8
  • 29
  • 56

2 Answers

6

DFS-R uses something called Remote Differential Compression (RDC).

Instead of comparing and transferring an entire file, the algorithm compares the signatures of sequential chunks of data between the source and the target replica. This way, only the differing chunks of data need to be transferred across the wire in order to "reconstruct" the file at the target location.
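
A rough sketch of the chunk-signature idea (not Microsoft's actual implementation; real RDC uses variable-size chunks and recursive signatures, and the fixed 64 KB chunks, MD5 signatures and position-by-position comparison below are purely illustrative):

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # illustrative only; real RDC uses variable-size chunks


def signatures(path):
    """Return the per-chunk signatures of a file (what the target reports back)."""
    sigs = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            sigs.append(hashlib.md5(chunk).hexdigest())
    return sigs


def chunks_to_send(source_path, target_sigs):
    """Yield (index, data) for each source chunk whose signature differs
    from the chunk already sitting on the target replica."""
    with open(source_path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            sig = hashlib.md5(chunk).hexdigest()
            if index >= len(target_sigs) or target_sigs[index] != sig:
                yield index, chunk  # only this chunk crosses the wire
            index += 1


# Hypothetical usage (file names are made up): the target reports its
# signatures and the source replies with just the differing chunks.
# delta = list(chunks_to_send("vm.vmdx", signatures("old-vm.vmdx")))
```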

As such, RDC is not really comparable to the compression algorithms used in 7-zip. Although they use similar techniques (building signature dictionaries over ranges of data), the 7-zip algorithm is designed to rearrange bytes into a lossless container format where all data is "squeezed" together, whereas RDC's purpose is to identify differences between similar files or file versions, in order to minimize the volume of data transferred to keep the replicas in sync.

If you already have similar VMDX files at the target location, there's no need to split the file into 100MB chunks. Just be sure to always use the same compression algorithm(s) when zipping the images.

This behavior (comparing similar files, not distinct versions of the same file, and extracting chunks) is known as "cross-file RDC". The publicly available documentation is pretty sparse, but the AskDS blog team has a short but pretty good clarification in this Q&A post.

Mathias R. Jessen
  • 25,161
  • 4
  • 63
  • 95
  • I was aware of RDC but assumed that only applies to transferring the same files again. When you say "similar VMDX files at the target location", do you mean anywhere in the file system? So if we had the previous VMDX file sat in another folder and transferred a slightly updated version (these are XenApp gold images so each VMDX is very similar), then RDC would kick in? Or does RDC only apply if you create an updated copy of the file with exactly the same path name? – Rob Nicholson Aug 29 '13 at 10:58
  • Later - I assume it's the latter, otherwise DFS-R would potentially have to match millions of other files, which would have a performance overhead. So RDC only kicks in when an existing file is updated at one end (in the same location). So when using 7-zip, which generates new files (different file names and folder), RDC does not help – Rob Nicholson Aug 29 '13 at 11:05
  • No, I literally mean similar files, not newer versions of the same file. This behavior is known as cross-file RDC and was introduced in Windows 2008 R2, but restricted to connections between 2 servers where at least one was an Enterprise or Datacenter Edition server. In Windows Server 2012, the edition check has been removed, and cross-file replication works between 2 standard edition servers as well :-) – Mathias R. Jessen Aug 29 '13 at 13:47
3

As Mathias already noted, DFS-R employs the "remote differential compression" algorithm, similar to rsync's, to only transmit the changed / appended portions of a file already present on the remote side. Additionally, the data is compressed before transfer using the XPRESS compression algorithm (Reference: Technet blog), and has been since the very first appearance of DFS-R in Server 2003 R2. I could not find any details on the actual variant of XPRESS used, but since the compression has to happen on-the-fly, it might be using LZNT1 (basically LZ77 with reduced complexity), as this is what NTFS uses for the very same purpose.
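
Neither XPRESS nor LZNT1 is exposed in common scripting runtimes, but the principle of compressing each chunk cheaply and on the fly before it goes over the wire can be sketched with zlib as a stand-in (an assumption for illustration only, not the actual DFS-R wire format):

```python
import zlib


def compress_chunk(chunk: bytes, level: int = 1) -> bytes:
    """Compress one chunk just before transmission.

    zlib stands in for XPRESS/LZNT1 here (neither is in the standard library);
    a low compression level mimics the fast, on-the-fly trade-off DFS-R makes.
    """
    return zlib.compress(chunk, level)


def decompress_chunk(blob: bytes) -> bytes:
    """Undo the per-chunk compression on the receiving replica."""
    return zlib.decompress(blob)


# Rough ratio check on one 64 KB chunk of an image (the path is made up):
# with open("vm.vmdx", "rb") as f:
#     chunk = f.read(64 * 1024)
# print(len(compress_chunk(chunk)) / len(chunk))
```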

If you want to monitor the compression ratios, consider enabling DFS-R debug logging and evaluating the log files.

The compression ratio for any of the XPRESS algorithms is likely to be lower (probably even by a factor as large as 2) than what you get with 7-zip, whose algorithms are optimized for file size reduction, not CPU usage reduction. But then again, using RDC, which allows for transmitting only the changed portions of the file, you are likely to get significantly less data over the wire than your 20 GB archive.
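
To put rough numbers on that (purely illustrative, and assuming the factor-of-2 guess holds): 7-zip takes the 60 GB image down to roughly 20 GB, about 3:1, so an XPRESS-style pass over a brand-new file might only manage something in the region of 40 GB on the wire. An RDC delta against an already-present, similar image, on the other hand, could come to a small fraction of either figure.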

Pre-creating a 7zip archive to be transferred with RDC might seem like a good idea to get the best of both worlds - only transmit changes, but with a higher compression ratio for the changed portions - but it isn't. Compression would mangle the entire file, and even a single byte changed at the beginning of the data stream would cause the compressed file to look entirely different than before. There are compression algorithm modifications to mitigate this problem, but 7zip does not seem to implement them so far.
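
You can reproduce the effect with any LZMA-family compressor; here is a small sketch using Python's lzma module as a stand-in for the 7z container (the data and the offset are made up purely for the demonstration):

```python
import lzma

# Highly repetitive data stands in for a VM image, which also compresses well.
original = bytes(range(256)) * 16384           # 4 MB of repeated pattern
modified = bytearray(original)
modified[10] ^= 0xFF                           # flip a single byte near the start

a = lzma.compress(original)
b = lzma.compress(bytes(modified))

# Count positions at which the two compressed streams still agree byte-for-byte.
same = sum(x == y for x, y in zip(a, b))
print(f"identical bytes at the same offset: {same} of {min(len(a), len(b))}")
# Past the container header almost nothing lines up, which is why RDC would see
# the whole archive as changed and re-send nearly all of it.
```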

All in all, you are likely to save significantly on bytes transmitted over the wire when using DFS-R to transfer file modifications. But it is rather unlikely that you are going to save any time, and you are inducing significant I/O and CPU load on both the source and the destination, as both copies of the file need to be read and checksummed before the actual transmission can start.

Edit: if you have new files, RDC indeed would be of little help - there is no counterpart to rsync's --fuzzy parameter which would look for similar files at the destination and take them as a baseline for differential transfers. If you know you have a similar file (e.g. a baseline image of the transferred VM HD), you could pre-seed the destination directory with this one, though.

the-wabbit
  • 40,737
  • 13
  • 111
  • 174
  • 1
    I concur that 7-zip (or any archive) would totally mess up RDC as the compressed byte stream totally changes even if you changed just one byte. I came across this http://tinyurl.com/p6j3sj4 which talks about compression (not RDC) of the file and how, possibly, DFS-R will be trying to re-compress our 7z files, which is an overhead we don't need. We can add 7z to the exclusion list – Rob Nicholson Aug 29 '13 at 11:06
  • But also, I doubt RDC is coming into play in our scenario as these are brand new files with new names & locations so there is no previous file for RDC to work on. Therefore, I assume that the compression algorithm mentioned above works on the entire file and transfers the packed file, unpacking at the other end - RDC is a side-note – Rob Nicholson Aug 29 '13 at 11:10
  • @RobNicholson if you have new files, RDC indeed would be of little help - there is no counterpart to rsync's `--fuzzy` parameter which would look for similar files at the destination and take them as a baseline for differential transfers. If you *know* you have a similar file (e.g. a baseline image of the transferred VM HD), you could pre-seed the destination directory with this one, though. Re: compression: as I wrote, there *are* compression implementations which could be expected to work well with RDC, most notably `gzip --rsyncable`. – the-wabbit Aug 29 '13 at 11:59
  • So this now comes down to "how good is the DFS-R compression compared to 7-zip" where good is both compression ratio as well as efficiency of the algorithm – Rob Nicholson Aug 30 '13 at 11:41
  • Another consideration I guess is robustness of the DFS-R transfer algorithm. Consider replicating a single 60GB file. In the old days, one split this into (say) 100MB smaller chunks when using FTP so that if the link failed and FTP resume wasn't supported, you only had to resume one 100MB file and not the entire 60GB file. So another consideration is what DFS-R does if the link temporarily fails (as does happen over a WAN). Does it start to transfer the entire file again or can it resume? – Rob Nicholson Aug 30 '13 at 11:43
  • @RobNicholson Regarding the compression ratio the easiest way would probably be to simply place an image file in a DFS-R replicated structure and try to find the corresponding file in the DFS-R Staging directory - this is the temporary location where checksums and the pre-compressed file would go before transmission. If the link temporarily fails, the remote side should be keeping the already transferred chunks and proceed with the transfer as DFS-R replication is re-established. – the-wabbit Aug 30 '13 at 12:40