9

At the moment I'm running rsync on 2.2 million files totaling 250 GB, and it takes ages: 700K files in 6 hours.

Does anyone know of an rsync-like tool that can do this with multiple threads so it goes faster?

Tom van Ommen

4 Answers

7

I doubt CPU is the limiting factor here. You're most likely limited by both network bandwidth for the transfer and disk I/O, especially the latency of all those stat calls.

Can you break down the filesystem hierarchy into smaller chunks to process in parallel?
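As a rough sketch of that chunking idea: assuming the tree has a fair number of top-level subdirectories, you could count the files under each one and greedily hand each subdirectory to whichever chunk is currently lightest. The names `count_files` and `split_into_chunks` are made up for illustration, and files sitting directly in the root are not covered.

```python
import os
from pathlib import Path


def count_files(directory):
    """Count non-directory entries under directory (walks the whole subtree)."""
    return sum(len(files) for _, _, files in os.walk(directory))


def split_into_chunks(root, n_chunks):
    """Greedily assign each top-level subdirectory of root to the lightest chunk.

    Returns (chunks, totals): chunks is a list of lists of subdirectory names,
    totals holds the file count assigned to each chunk.
    Files directly inside root are ignored in this sketch.
    """
    subdirs = [p for p in Path(root).iterdir() if p.is_dir()]
    # Largest subdirectories first, so the small ones can even things out.
    sized = sorted(((count_files(p), p.name) for p in subdirs), reverse=True)
    chunks = [[] for _ in range(n_chunks)]
    totals = [0] * n_chunks
    for size, name in sized:
        lightest = totals.index(min(totals))
        chunks[lightest].append(name)
        totals[lightest] += size
    return chunks, totals
```

Note that the counting pass itself walks the whole tree once, so on 2.2 million files it is not free; splitting on directory names alone may be good enough if the subdirectories are of similar size.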

What are the source files, and what's writing or modifying them? Would it be possible to send changes as they happen at the application level?

JimB
  • Syncing Zarafa attachment files, all gzipped by default. I could run multiple instances, but that's less efficient than 10 threads. The network is 1 Gbit to 1 Gbit between different datacenters, but that shouldn't be an issue. I've got 24 SAS disks on the source side and intelligent storage with SSD on the destination. – Tom van Ommen Jun 10 '11 at 14:20
  • @Tom van Ommen - why do you think you're CPU limited? How are multiple processes less efficient than threads if you really are CPU limited? – JimB Jun 10 '11 at 14:31
  • No CPU limit, I said ;) It was just a guess that 10 separate processes are less efficient than 10 threads. – Tom van Ommen Jun 10 '11 at 14:33
  • The OP said nothing about CPU limitations. If he is I/O-bound, that is a very valid reason for multiple threads. – Mike Pennington Jun 10 '11 at 14:34
  • @Tom van Ommen, 10 processes do have more overhead than 10 threads; however, locking data structures between threads is a coding nightmare. It's often much more efficient (for the coder's time) to just spawn multiple processes and be done with it. – Mike Pennington Jun 10 '11 at 14:36
  • @Tom van Ommen - processes vs. threads won't make much of a difference in this case. If you're not using ssh, how are the files being transferred (rsync over TCP, NFS, iSCSI, rsh)? – JimB Jun 10 '11 at 14:39
  • @Guacamole - multiple threads could help in some situations, but if his link is saturated, he's not going to push any more through no matter how many threads he has. Rsync does use threads for concurrency, and isn't internally blocking on I/O. – JimB Jun 10 '11 at 14:40
  • @JimB, like I said before, no transport, just rsync over TCP. – Tom van Ommen Jun 10 '11 at 14:43
  • Just spawned 10 processes, let's see how that goes ;) Thanks for the tips so far, everyone! – Tom van Ommen Jun 10 '11 at 14:43
  • @Tom van Ommen - Not that I'm doubting you (many people don't realize this part), but are you really running an rsync daemon on the other end? Specifying the target as `hostname:remote/path` uses ssh, which can't saturate a 1 Gbit link. Make sure you're using a double colon `::`, or `rsync://` – JimB Jun 10 '11 at 14:48
  • @JimB, the I/O delays come from TCP transfers; I don't recall seeing anything about a saturated link... let's not go inventing a new question – Mike Pennington Jun 10 '11 at 15:10
  • @Guacamole - All I'm pointing out is that if he's using ssh as a transport, his throughput is limited by ssh itself (specifically the static receive window, unless he's using the HPN ssh patches). – JimB Jun 10 '11 at 15:28
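To make the colon distinction from the comments concrete, here is a small sketch that guesses which transport rsync would pick for a given target string. The function name `rsync_transport` and the host `backup` are invented for the example, and the check is simplified: it ignores IPv6 literals and the case where a remote shell is forced on top of the `::` syntax.

```python
def rsync_transport(target):
    """Guess the transport rsync would use for a source/destination string.

    Simplified rules, following the rsync man page synopsis:
      rsync://host/module/path  -> rsync daemon
      host::module/path         -> rsync daemon
      host:path                 -> remote shell (ssh by default)
      anything else             -> local path
    """
    if target.startswith("rsync://"):
        return "daemon"
    # Only a colon before the first slash marks a remote target.
    head = target.split("/", 1)[0]
    if "::" in head:
        return "daemon"
    if ":" in head:
        return "remote-shell"
    return "local"
```

For example, `rsync_transport("backup:remote/path")` reports the remote-shell case that JimB warns about, while `rsync_transport("backup::attachments/")` reports the daemon.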
1

If the disk subsystem of the receiving server is an array with multiple disks, running multiple rsync processes can improve performance. I am running 3 rsync processes to copy files to an NFS server (RAID6 with 6 disks per raid group) to saturate Gigabit Ethernet.

This guy reports on a basic Python harness that spawns multiple rsync processes: http://www.reliam.com/company/featured_geek
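I haven't reproduced the linked harness here, but a minimal sketch of the same idea might look like the following. It assumes one rsync process per subdirectory, a fixed number of workers, and plain `-a` as the only flag; the function names and the example paths in the comments are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def rsync_command(src_root, dest, subdir):
    """Build an rsync invocation for one subdirectory (the flags are an assumption).

    No trailing slash on the source, so rsync recreates subdir under dest.
    """
    return ["rsync", "-a", f"{src_root}/{subdir}", dest]


def run_parallel(commands, workers=4):
    """Run each command as its own process, at most `workers` at a time.

    Returns the exit codes in the same order as the commands.
    """
    def run(cmd):
        return subprocess.run(cmd).returncode

    # Threads only wait on the child processes; the copying happens in rsync.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, commands))


# Example use (hypothetical paths and daemon module):
#   cmds = [rsync_command("/data", "backup::attachments/", d) for d in ("a", "b", "c")]
#   exit_codes = run_parallel(cmds, workers=3)
```

Any non-zero exit code in the result means that chunk needs another pass; since rsync only transfers what is missing or changed, simply rerunning the failed commands is usually enough.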

sinysee
1

I've read many questions similar to this. I think the only real answer is to break up the copy/move manually. IOPS will be the issue here. If it makes you feel any better, I'm in the process of moving ~200 million files consuming well over 100 TB of disk space.

Wayne
0

You may want to check out the multithreaded cp clone for Linux (open source): http://static.usenix.org/event/lisa10/tech/slides/kolano.pdf

maxim
  • Whilst this may theoretically answer the question, [it would be preferable](http://meta.stackexchange.com/q/8259) to include the essential parts of the answer here, and provide the link for reference. – Scott Pack Oct 19 '12 at 22:48