9

I have a few large files that I need to copy from one Linux machine to about 20 other Linux machines, all on the same LAN, as quickly as is feasible. What tools/methods would be best for copying these files, noting that this is not going to be a one-time copy? These machines will never be connected to the Internet, and security is not an issue.

Update:

The reason I'm asking is that (as I understand it) we are currently using scp in serial to copy the files to each of the machines, and I have been informed that this is "too slow" and a faster alternative is being sought. According to what I have been told, attempting to parallelize the scp calls simply slows things down further due to hard drive seeks.
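
To be concrete, the current process is roughly the following serial push (the hostnames and paths here are placeholders, not our real ones):

    # Simplified version of what we do today: copy to each target in turn.
    # node01..node20 and the paths are made-up names.
    for host in node01 node02 node03; do    # ...and so on through node20
        scp /srv/bigfiles/*.img "$host":/data/bigfiles/
    done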

Jonathan Callen

5 Answers

26

BitTorrent. It's how Twitter deploys some things internally.

http://engineering.twitter.com/2010/07/murder-fast-datacenter-code-deploys.html (web archive link)
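
As a rough sketch of what this can look like on a LAN with no Internet access (the tracker, addresses, tools, and paths below are my assumptions, not something from the OP's setup): run a small tracker such as opentracker on the source box, create a torrent that announces to it, seed from the source, and let every client download and re-seed to its neighbours.

    # On the source machine, assuming a tracker is already listening on the
    # hypothetical address 10.0.0.1:6969:
    mktorrent -a http://10.0.0.1:6969/announce -o bigfiles.torrent /srv/bigfiles
    aria2c --seed-ratio=0.0 -d /srv bigfiles.torrent    # hash-check the existing data and keep seeding

    # On each of the ~20 clients: download, then seed back to the other peers.
    aria2c --seed-ratio=1.0 -d /data bigfiles.torrent

Twitter's murder tool wraps essentially this workflow; the commands above are just one hand-rolled approximation of it.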

Mikolasan
mfinni
  • Now that is clever :) – Scott Nov 18 '11 at 15:13
  • I'll try this and see how well it works in our environment – Jonathan Callen Nov 18 '11 at 15:20
  • 3
    In addition to my answer (which I do think will do a good job, if you can implement it), the below answer for NFS is a very good one. A good NFS server should cache files so you won't keep hitting disk. Also, with that, don't copy the files *from* the server *to* the clients. Initiate it from the client and let the NFS server's cache help out. – mfinni Nov 18 '11 at 15:24
  • 1
    Be sure to try it in a non-production environment; in the presentation they say (iirc) it made some switches suffer a lot during the first deployments because of the number of packets exchanged. – Shadok Nov 18 '11 at 16:55
  • I would love to favourite this answer –  Nov 18 '11 at 17:31
  • That's not going to be any faster since it still has to send all of the data 20 times. You want a multicast solution so the data is only sent once. – psusi Nov 18 '11 at 18:09
  • 1
    @psusi Why do you say it has to send all of the data 20 times? Once the other peers have part of the file, they can start sending the parts they have to the other peers themselves. – Jonathan Callen Nov 18 '11 at 18:11
  • @JonathanCallen, whether it is the original server or one of the peers doesn't matter; the data still has to be sent over the lan once for every client. In fact, bit torrent sometimes sends the same chunk twice by accident. I suppose though, that if your switch has a multi gigabit backplane then multiple machines sending to each other could end up going faster than the single server sending 20 times, but still the best solution is to multicast the data just once. – psusi Nov 18 '11 at 18:14
  • 2
    The problem for the OP is not the LAN, it's the disk on the central server. – mfinni Nov 18 '11 at 18:16
  • 1
    @pSusi - multicast would certainly be another valid answer. Post that as an answer, not as a knock on my answer. – mfinni Nov 18 '11 at 18:18
  • @mfinni, what makes you think that? Any decent server hd should have no problem keeping up with a gigabit lan. A good single disk can handle 100 MB/s these days, let alone any kind of raid. – psusi Nov 18 '11 at 18:20
  • 1
    From the OP : "According to what I have been told, attempting to parallelize the scp calls simply slows it down further due to hard drive seeks." That's what makes me think that. – mfinni Nov 18 '11 at 18:35
  • 1
    Facebook deploy with BitTorrent too - http://torrentfreak.com/facebook-uses-bittorrent-and-they-love-it-100625/ – Daniel Lo Nigro Nov 19 '11 at 06:13
12

How about UFTP? It uses multicast to deliver files over UDP to multiple clients at once. It's not for everyone and I'm no expert on it, but it sounds like it does what you want.
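
As a very rough sketch, and only that, the basic shape of a UFTP transfer is a receiver daemon on every target plus one sender command on the source (the paths are made up, and I haven't verified the defaults here, so check the uftp/uftpd man pages before relying on it):

    # On each of the 20 receivers: start the UFTP client daemon.
    uftpd

    # On the sender: multicast the files to all listening daemons in one pass.
    uftp /srv/bigfiles/*.img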

dbush
Chopper3
  • 1
    Disclaimer: This will require equipment that supports multicast. – user606723 Nov 18 '11 at 16:33
  • I was rather hoping this'd be on the same vlan - reducing the impact of this use. – Chopper3 Nov 18 '11 at 16:56
  • @user606723: Doesn't everything modern? Maybe some consumer junk does not, but I haven't run into anything with broken multicast in a while. Too much uses it these days. I think Windows Active Directory even uses multicast. – Zan Lynx Nov 18 '11 at 19:08
  • I actually have no experience with this @ZanLynx. I know that many offices/computer labs use consumer/unmanaged switches at the last hop. How will these switches behave with multicast? – user606723 Nov 18 '11 at 19:42
3

Have you tried to copy this data with rsync? If you have a 1 Gbit LAN or faster, copying over 4*20 GB should not be a problem.

How often will this copy occur? Does it matter if it takes a couple of minutes to finish?
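
Something along these lines, for example (hostnames and paths are placeholders); the win over plain scp for a recurring copy is that rsync skips files that haven't changed, and on a fast local network --whole-file avoids spending disk and CPU time on the delta-transfer algorithm:

    # Placeholder hostnames/paths; pushes to the targets one after another.
    for host in node01 node02 node03; do    # ...through node20
        rsync -a --whole-file --progress /srv/bigfiles/ "$host":/data/bigfiles/
    done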

Janne Pikkarainen
3

scp-tsunami is the way!

https://code.google.com/p/scp-tsunami/

It's commonly used to distribute disk images on virtualization clusters; its performance is close to BitTorrent's, but it's simpler to use day to day.

Giovanni Toraldo
2

Setting up an NFS share and having each machine pull from this shared repo of large files would likely be the fastest method (NFS is very quick and has little overhead).

You could add an additional NIC or two to the source server and bond them together to give you better throughput.

Implementation could be a simple cron job on each target server that blindly fetches from the share every hour/day/whatever. You could also set up a daemon to poll for new files, or script a control session that SSHes (with key pairs) into each target box and instructs it to fetch the file when you execute your script.
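
A hedged sketch of that setup (export path, subnet, mount point, and schedule are all assumptions for illustration, and the NIC bonding part is left out):

    # On the source server: export the directory read-only to the LAN.
    # Line added to /etc/exports:
    #   /srv/bigfiles  192.168.1.0/24(ro,async,no_subtree_check)
    exportfs -ra    # re-export everything listed in /etc/exports

    # On each target: mount the share, then let cron pull new files so the
    # clients initiate the copy and the server's page cache can help.
    mount -t nfs fileserver:/srv/bigfiles /mnt/bigfiles
    # crontab entry to copy anything new once an hour:
    #   0 * * * *  rsync -a /mnt/bigfiles/ /data/bigfiles/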

gravyface
  • 1
    I believe my predecessor attempted to use NFS for this and found that (at the time), the RAM cache wasn't large enough for the entire transfer, which was causing the load on the hard drive to become the limiting factor instead of the network speed. – Jonathan Callen Nov 18 '11 at 15:39