25

I'm wondering if it's possible for rsync to copy one directory to multiple remote destinations all in one go, or even in parallel (not necessary, but it would be useful).

Normally, something like the following would work just fine:

$ rsync -Pav /junk user@host1:/backup
$ rsync -Pav /junk user@host2:/backup
$ rsync -Pav /junk user@host3:/backup

And if that's the only option, I'll use that. However, /junk is located on a slow drive with quite a few files, and rebuilding the file list of some ~12,000 files each time is agonizingly slow (~5 minutes) compared to the actual transfer/updating. Is it possible to do something like this, to accomplish the same thing:

$ rsync -Pav /junk user@host1:/backup user@host2:/backup user@host3:/backup 
Dave M
Jessie

10 Answers

14

Here is the information from the man page for rsync about batch mode.

BATCH MODE

Batch mode can be used to apply the same set of updates to many identical systems. Suppose one has a tree which is replicated on a number of hosts. Now suppose some changes have been made to this source tree and those changes need to be propagated to the other hosts. In order to do this using batch mode, rsync is run with the write-batch option to apply the changes made to the source tree to one of the destination trees. The write-batch option causes the rsync client to store in a "batch file" all the information needed to repeat this operation against other, identical destination trees.

Generating the batch file once saves having to perform the file status, checksum, and data block generation more than once when updating multiple destination trees. Multicast transport protocols can be used to transfer the batch update files in parallel to many hosts at once, instead of sending the same data to every host individually.

To apply the recorded changes to another destination tree, run rsync with the read-batch option, specifying the name of the same batch file, and the destination tree. Rsync updates the destination tree using the information stored in the batch file.

For your convenience, a script file is also created when the write-batch option is used: it will be named the same as the batch file with ".sh" appended. This script file contains a command-line suitable for updating a destination tree using the associated batch file. It can be executed using a Bourne (or Bourne-like) shell, optionally passing in an alternate destination tree pathname which is then used instead of the original destination path. This is useful when the destination tree path on the current host differs from the one used to create the batch file.

   Examples:

          $ rsync --write-batch=foo -a host:/source/dir/ /adest/dir/
          $ scp foo* remote:
          $ ssh remote ./foo.sh /bdest/dir/

          $ rsync --write-batch=foo -a /source/dir/ /adest/dir/
          $ ssh remote rsync --read-batch=- -a /bdest/dir/ <foo

In these examples, rsync is used to update /adest/dir/ from /source/dir/ and the information to repeat this operation is stored in "foo" and "foo.sh". The host "remote" is then updated with the batched data going into the directory /bdest/dir. The differences between the two examples reveal some of the flexibility you have in how you deal with batches:

  • The first example shows that the initial copy doesn’t have to be local -- you can push or pull data to/from a remote host using either the remote-shell syntax or rsync daemon syntax, as desired.

  • The first example uses the created "foo.sh" file to get the right rsync options when running the read-batch command on the remote host.

  • The second example reads the batch data via standard input so that the batch file doesn’t need to be copied to the remote machine first. This example avoids the foo.sh script because it needed to use a modified --read-batch option, but you could edit the script file if you wished to make use of it (just be sure that no other option is trying to use standard input, such as the "--exclude-from=-" option).

    Caveats:

    The read-batch option expects the destination tree that it is updating to be identical to the destination tree that was used to create the batch update fileset. When a difference between the destination trees is encountered the update might be discarded with a warning (if the file appears to be up-to-date already) or the file-update may be attempted and then, if the file fails to verify, the update discarded with an error. This means that it should be safe to re-run a read-batch operation if the command got interrupted. If you wish to force the batched-update to always be attempted regardless of the file’s size and date, use the -I option (when reading the batch). If an error occurs, the destination tree will probably be in a partially updated state. In that case, rsync can be used in its regular (non-batch) mode of operation to fix up the destination tree.

    The rsync version used on all destinations must be at least as new as the one used to generate the batch file. Rsync will die with an error if the protocol version in the batch file is too new for the batch-reading rsync to handle. See also the --protocol option for a way to have the creating rsync generate a batch file that an older rsync can understand. (Note that batch files changed format in version 2.6.3, so mixing versions older than that with newer versions will not work.)

    When reading a batch file, rsync will force the value of certain options to match the data in the batch file if you didn’t set them to the same as the batch-writing command. Other options can (and should) be changed. For instance --write-batch changes to --read-batch, --files-from is dropped, and the --filter/--include/--exclude options are not needed unless one of the --delete options is specified.

    The code that creates the BATCH.sh file transforms any filter/include/exclude options into a single list that is appended as a "here" document to the shell script file. An advanced user can use this to modify the exclude list if a change in what gets deleted by --delete is desired. A normal user can ignore this detail and just use the shell script as an easy way to run the appropriate --read-batch command for the batched data.

    The original batch mode in rsync was based on "rsync+", but the latest version uses a new implementation.

I would imagine you could try

rsync --write-batch=foo -Pav /junk user@host1:/backup
scp foo foo.sh user@host2:
ssh user@host2 ./foo.sh /backup
scp foo foo.sh user@host3:
ssh user@host3 ./foo.sh /backup

(--read-batch, which foo.sh runs, only accepts a local destination, so the batch file and its script have to be copied to each host and executed there, as in the first man-page example above.)
Flimm
Chloe
  • The suggested command does not work: `remote destination is not allowed with --read-batch` – kynan Mar 21 '17 at 17:15
  • Show the complete command. `-` for a file name means to read from standard input, and STDIN is also being read from `foo` in the example, a local file. – Chloe Mar 22 '17 at 02:08
  • This appears to be the maximally correct solution for what I was trying to do, although my use case for this has long since evaporated into the aether. :D – Jessie May 01 '17 at 02:22
5

You could try using unison. It should be much faster at building the file list because it keeps a cache of the files.
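
For example (a sketch only: it assumes unison is installed on both the local machine and each host, and reuses the hosts/paths from the question; one run per destination):

$ unison /junk ssh://user@host1//backup -batch
$ unison /junk ssh://user@host2//backup -batch
$ unison /junk ssh://user@host3//backup -batch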

Jason Axelson
  • Note: Unison doesn't keep a 'cache' of the files. It only keeps a database of the file names, timestamps, checksums. It still does a scan of the file system and creates a checksum to compare to the remote. Unison's only advantage is two-way sync. I recommend Unison, but it won't help here. – Chloe Feb 28 '13 at 01:40
5

Rsync's batch mode supports multicast. If this is possible on your network, it might be worth looking into that.
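
The man page leaves the multicast transport entirely up to you. One sketch uses udpcast (an assumption: udp-sender/udp-receiver and their --file flag belong to udpcast, which must be installed separately; rsync itself does none of this), reusing the batch file "foo" from the accepted answer:

# on the sending host, after creating the batch with --write-batch=foo
$ udp-sender --file foo
# on every receiving host at once (they all join the same multicast stream),
# then apply the batch locally
$ udp-receiver --file foo
$ rsync --read-batch=foo -av /backup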

0x90
codecrank
  • Just pointing out that "supports" is too strong a word. `rsync` itself doesn't do any multicast. The man page says "Multicast transport protocols can be used to transfer the batch update files in parallel to many hosts at once...", so if you already have a multicast solution in place you're golden. It might be more accurate to say rsync _suggests_ multicast, but it's up to the user to set it up and point rsync to it. – pyansharp Mar 28 '22 at 14:10
2

Not a direct answer, but if you use rsync version 3+, it will start transferring before it finishes generating the entire file list.

Another option, still not very efficient, would be to run them as jobs so a few run at the same time.
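
For example (using the hosts from the question; each rsync still builds its own file list, they just overlap in time):

$ rsync -Pav /junk user@host1:/backup &
$ rsync -Pav /junk user@host2:/backup &
$ rsync -Pav /junk user@host3:/backup &
$ wait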

Also, I just thought of this strangeness if you don't mind using tar:

tar cf - . | tee >(ssh localhost 'cat > test1.tar') >(ssh localhost 'cat > test2.tar') >/dev/null

Where each localhost would be a different server, of course (assumes key-based login). Never used the above before, though.

Kyle Brandt
  • Hmm! Strangely enough, cwrsync (rsync 3.0.7) seems not to do that. I'll have to look into why that is, though, as that would be a big help in cutting down these enormous runtimes. Thanks! – Jessie Apr 29 '10 at 18:28
  • That version on both sides? – Kyle Brandt Apr 29 '10 at 18:29
  • No, actually; the local machine is cwrsync 3.0.7 and the remote host (well, the one I'm working with now) is rsync 3.0.3 on Debian Lenny. Doesn't seem like that'd be too big a version difference for it to misbehave, but I dunno.. I'll look into upgrading the Debian side. – Jessie Apr 29 '10 at 18:33
  • What an odd little one-liner. That'd probably work, though, if I wasn't leveraging the fact that rsync need not reduplicate a few gigs of data over several slow links when, at most, only a few hundred kb of it has changed. Also, getting both ends to (cw)rsync 3.0.7 still did file-list building and transferring serially. Not too concerned about that, though. – Jessie Apr 29 '10 at 18:56
  • Isn't "tar cf - ." the same as "tar c ." ? – Johan Boulé Jul 28 '15 at 15:20
2

How about changing filesystems?

Some time ago, I switched a multi-terabyte FS from ext3 to XFS. The time to scan the directories (with around 600,000 files the last time I checked) went from 15-17 minutes to less than 30 seconds!

Javier
1

How about running the rsync jobs from host1, host2, and host3? Or, run a job to copy to host1, and then run it on host2 and host3 to get it from host1.
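
A sketch of the second idea (rsync can't copy between two remote hosts directly from your machine, so the follow-up copies run on host1 over ssh; this assumes host1 can reach the other hosts, and reuses the paths from the question):

$ rsync -Pav /junk user@host1:/backup
$ ssh user@host1 'rsync -av /backup/ user@host2:/backup'
$ ssh user@host1 'rsync -av /backup/ user@host3:/backup'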

mfinni
1

In looking for this answer myself, I think you'd need to make a batch using rsync first and then send it to all of them. That way the file list only needs to be crunched once, and then you can background all three rsyncs to run them in parallel.
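
Something like this, perhaps (a sketch combining --write-batch with backgrounded jobs; the hosts and paths are the ones from the question):

rsync --write-batch=foo -Pav /junk user@host1:/backup
for h in host2 host3; do
    (scp foo foo.sh "user@$h:" && ssh "user@$h" ./foo.sh /backup) &
done
wait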

Morgan
1

Another possible solution is just running as many rsync processes in parallel as you have hosts, i.e., forking one per host.
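
For example, with xargs spawning one rsync per host (a sketch; -P sets the number of parallel processes, and the hosts/paths are the ones from the question):

printf '%s\n' host1 host2 host3 | xargs -P 3 -I{} rsync -Pav /junk 'user@{}:/backup'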

0

You could write a shell function to rsync to each of the remote servers. Here is an implementation for fish shell:

function rsync2
    # first argument: comma-separated list of servers
    set servers (string split "," $argv[1])
    echo -e (string join "\t" $servers)
    # remaining arguments: the rsync arguments, with SERVER as a placeholder
    set cmds $argv[2..-1]
    for server in $servers
        echo $server
        # rebuild the argument list with SERVER replaced by this host
        set -e real_cmds
        for cmd in $cmds
            set -a real_cmds (string replace "SERVER" $server -- $cmd)
        end
        rsync $real_cmds
    end
end

Then you can do:

rsync2 remote_addr1,remote_addr2 -avz --exclude='*/' --include='*.jpg' --info=progress2 SERVER:image_folder .
Shaohua Li
0

A better solution would be creating a repository with git and just pushing it to the 3 hosts. It would be faster, you wouldn't need the file-list step, and it consumes fewer resources.
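
In outline, something like this (a sketch only; it assumes git on all machines and a bare repository on each host to push into, and note that a push updates a repository, not a plain directory tree; see the comments below for the metadata caveats):

# one-time setup
$ cd /junk && git init
# repeat these two for host2 and host3:
$ ssh user@host1 git init --bare /backup.git
$ git remote add host1 user@host1:/backup.git

# each sync: commit the current state and push it everywhere
$ git add -A && git commit -m "snapshot"
$ for r in host1 host2 host3; do git push "$r" master; done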

Good luck,
João Miguel Neves

jneves
  • git does not preserve modification times nor permissions (except for the execute bit) and would require storing a second copy of the data as git objects in `.git/`, although pushes to the remotes, which would already have most of the data, would be faster. git is not a replacement for rsync. – Dan D. Dec 17 '11 at 16:48
  • Plus, git is publicly view-able, unless you pay. – Chloe Feb 28 '13 at 01:38
  • Besides the technical issues, git is not for normal users; normal users just need to explore files via http or a file share. – LiuYan 刘研 Feb 28 '13 at 02:43
  • @Chloe, you mistake git for GitHub. Git itself is a free, open-source, distributed version control system, and anyone can host a git repository by any means, including `http`, `nfs` and `afp`. GitHub is a website that takes care of creating and maintaining git repos for you, and makes them public (unless you pay). – toriningen Aug 30 '14 at 12:54
  • @Chloe GitHub is publicly viewable, but BitBucket provides private repos. – sws Jan 18 '15 at 21:12
  • Also, Git does not keep track of empty directories. – Flimm May 06 '15 at 10:05