4

My company regularly needs to send about 2TB of data from the US to the UK (the size of the compressed delta is 2TB). Even though each side has good internet connectivity, sending the files directly is too slow and unreliable: at 1MB/s, the transfer would take more than 20 days (2TB ÷ 1MB/s ≈ 2,000,000 seconds, roughly 23 days), assuming it even completes without error.

As a result, the best solution found so far is to "ship the brick", i.e. to send a hard drive by regular mail.

I was wondering if there exists any sort of service that offers better network connectivity across continents? I considered going through AWS S3, but their outbound transfer prices are quite expensive...

Note: the problem is not the software. We use rsync already. It works well and is robust. The problem is the speed and reliability of over-the-Atlantic internet connections. As an answerer said, a dedicated link is not in our budget. What I'm looking for is a cost-effective solution that would be a little more practical than shipping a disk.

static_rtti
  • Tried to clarify my question a bit. Sorry it wasn't clear in the first place. – static_rtti Feb 04 '20 at 14:33
  • What are your OS options? If the entire file changes over, ship the disk. If only small parts change, use a file system that supports differential snapshots and transmit only those. On Windows, Volume Shadow Copy might be usable. On Linux, ZFS. – Andrew Henle Feb 04 '20 at 15:00
  • I think you're running into a common problem with anything using TCP over large distances. I like this explanation of the differences: https://www.keycdn.com/support/udp-file-transfer. I don't have an off-the-shelf solution for you, but you might be looking for something like BitTorrent, which can maximize bandwidth while not being sensitive to round-trip latencies. Or perhaps you can tune the endpoints to favor bandwidth: https://en.wikipedia.org/wiki/TCP_tuning. – Yolo Perdiem Feb 04 '20 at 20:24
  • In certain situations, "shipping the brick" *is* the most practical solution. Even Amazon recognized this with their "AWS Snowmobile" which is just a truck-sized brick. :) – JustinB Feb 05 '20 at 05:55
  • @JustinB to be fair to amazon, the snowmobile is for 100PB. I find it hard to believe that snail mail is still the best option for 2TB :) – static_rtti Feb 05 '20 at 09:25
  • 1
    Even when using rsync properly, it still has to communicate a non-trivial amount with the remote system to find the files/chunks that need updating. It might be worth trying to generate the rsync update against a local server that matches the remote server, to get rid of that part of the work; a sketch of this batch approach follows the comments. https://russt.me/2018/07/creating-and-applying-diffs-with-rsync/ – JustinB Feb 19 '20 at 02:40
  • 1
    @static_rtti thinking a bit outside the box: bittorrent ... it already has the ability to break very large files into smaller chunks, validate checksums on each, and keep re-trying as needed to get the job done. As an added bonus, if your office has better bandwidth than you are getting trans-Atlantic (which I assume it does), you could set up some reflector clients near your sending office which would get parts of the 2TB file from your office faster, then provide multiple trans-Atlantic sources for the receiver to use in parallel. – JustinB Feb 24 '20 at 18:01
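
A minimal sketch of the batch approach mentioned above, assuming rsync's `--only-write-batch`/`--read-batch` options and hypothetical paths:

    # Compute the delta against a local mirror that matches the remote
    # server; --only-write-batch records the changes in a batch file
    # without updating the mirror itself.
    rsync -a --only-write-batch=/tmp/delta.batch /data/current/ /data/remote-mirror/

    # Transfer /tmp/delta.batch however is convenient, then apply it on
    # the remote side; --read-batch takes only the destination path.
    rsync -a --read-batch=/tmp/delta.batch /data/current/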

3 Answers

1

Well, your question lacks certain important information:

  • Can the file(s) change between adjacent attempts at transferring them?
  • What platform do the server and the client run on?

Still, a couple of options:

  • Plain old HTTP supports download resumption via the Range field in the request header.

    So if you have a server supporting that (actually any production software such as nginx, Apache, lighttpd, and gazillions of others does), receiving the whole file would amount to running something like this on the client:

    while true; do
        # -c resumes a partial download via an HTTP Range request;
        # -nd keeps wget from recreating the server's directory tree.
        wget -nd -c http://server:port/path/to/the/file && break
        sleep 5  # brief pause so a flaky link isn't hammered in a tight loop
    done
    
  • Advanced software such as rsync supports resuming file transfers using techniques which can synchronize two directory hierarchies even when files are updated between adjacent synchronization sessions.

    I'm not sure, but on Windows™, robocopy should be able to serve as a poor man's rsync: it's not that good at supporting updates on the source side, but IIRC it's able to resume transfers.

  • There exist other "do-it-no-matter-what" synchronization tools, such as Syncthing.

Note that HTTP and robocopy expect you to have regular network connectivity between the server and the client; if it's provided by a VPN, you might need to look at tuning its performance.

rsync is able to use SSH to spawn and talk to the remote rsync instance; you might need to tweak that SSH call to make it use the fastest available cipher, turn off compression, etc.
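
A minimal sketch of such a tweak, assuming OpenSSH offers the aes128-gcm@openssh.com cipher (host name and paths are placeholders):

    # --partial keeps half-transferred files so an interrupted run can
    # resume them; -e replaces the remote shell with a tuned SSH call.
    rsync -av --partial \
        -e "ssh -c aes128-gcm@openssh.com -o Compression=no" \
        /data/delta/ user@uk-host:/data/delta/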

kostix
0

If you do not have the full environment to set up a web server and such, you can always use: https://github.com/warner/magic-wormhole

It has restart/retry capabilities and can be run on a number of platforms.
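
For example, a typical session looks roughly like this (the file name is a placeholder):

    # Sender: prints a short one-time code to relay to the receiver.
    wormhole send ./delta.tar.zst

    # Receiver: prompts for that code, then downloads the file.
    wormhole receive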

Another tool I've used to transfer from London to New York: https://github.com/fast-data-transfer/fdt
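
FDT is a Java tool; going from my recollection of its README, a basic push looks something like this (host and paths are placeholders):

    # Receiving side: start FDT in server mode.
    java -jar fdt.jar

    # Sending side: -c selects the server, -d the remote destination dir.
    java -jar fdt.jar -c receiver.example.com -d /data/incoming ./delta.tar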

Tag
0

If it's always 2TB of completely new data, I would say: look into a real 1Gbit/s commercial fiber link. However, it might not be in your budget (no offense intended)...

That said, if only a "small" amount of the 2TB changes from day to day, I would only transfer the data that differs from the full 2TB.

rsync would do a great job for that. There are plenty of tutorials on the internet on how to use rsync to accomplish that.
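
For instance, a minimal sketch (host name and paths are placeholders):

    # -a skips files that are unchanged by size/mtime and preserves
    # metadata; --partial keeps interrupted files so reruns can resume.
    rsync -av --partial /data/export/ user@uk-host:/data/import/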

That said, there is also the question: is your data confidential?

If so, I would initiate an SSH tunnel before the rsync, or run rsync with SSH options directly.
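
A sketch of the tunnel variant, assuming an rsync daemon with a hypothetical module named "data" runs on the remote side:

    # Forward local port 8873 to the remote rsync daemon (port 873);
    # -N opens the tunnel without running a remote command.
    ssh -N -L 8873:localhost:873 user@uk-host &

    # Sync through the encrypted tunnel using the rsync:// protocol.
    rsync -av --partial rsync://localhost:8873/data/ /local/data/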

The other way would be to use a web server such as Apache with a TLS cert.

If the internet is not an option, I would say: a bunch of 3TB hard drives and a UPS/FedEx workflow to send and receive the disks on a schedule... It might be less expensive than an internet connection just for that...

Never underestimate the bandwidth of a van (or plane) full of tapes!

yield
  • *rsync would do a great job for that* Not likely. `rsync` has to **discover** differences, and that requires transmitting the entire file contents so comparison can be done. – Andrew Henle Feb 04 '20 at 14:57
  • It would list files and do a checksum or something similar; it does not transfer the data... Which, IMO, is better than transferring 2TB... – yield Feb 04 '20 at 15:39
  • *It would list files and do a checksum or something similar, it does not transfer the data...* Oh? Once `rsync` scans **both** files and computes a checksum for each file and finds a difference there, what does `rsync` do then? It **must** transfer the **entire** 2 TB across to compare the data and find the differences so it can update the target file. "Compute checksum of 2TB. Compute checksum of another 2 TB. Compare the two. Ooops, they differ, now I gotta send the entire 2 TB across the link to find out where they differ." There's no other way to do it. `rsync` can't scale. – Andrew Henle Feb 04 '20 at 15:50
  • 1
    (cont) Having to **discover** differences is a huge problem when there are terabytes of data to deal with. Then syncing that data across an ocean is also a problem. When you have to sync that much data, the best answers start with **knowing** the differences in advance. And that rules out `rsync`. https://www.google.com/search?q=rsync+takes+too+long – Andrew Henle Feb 04 '20 at 15:52
  • Oh well.... UPS... – yield Feb 04 '20 at 15:59
  • @AndrewHenle rsync divides files into blocks and checksums each block separately - that's how it manages to transfer "small" changes efficiently. – Harald Feb 04 '20 at 19:26
  • @Harald It still has to **discover** the changes. There's no way that can beat knowing the changes in advance. Given the cross-ocean latency and literally terabytes of data to search through for changes in this case, what happens if the rate of changes is greater than the rate `rsync` can discover and transmit them? Again: `rsync` can't scale. – Andrew Henle Feb 04 '20 at 19:41
  • @AndrewHenle I'm not going to argue with you here, especially since rsync won't help OP's problem. But if you're curious, you should read the documentation: https://rsync.samba.org/tech_report/ – Harald Feb 06 '20 at 21:49