Syncing a large number of files across multiple machines in a scalable way

Asked Nov 11 '20 at 07:08

Active Nov 12 '20 at 10:13

Viewed 213 times

I'm looking for a way to sync a large number of machines (hundreds) with a remote repository.

The repository consists of small files (around 20 KB each), but the total comes to a few GB and continues to grow over time.

The goal is for changes in the remote repository to propagate to all the machines as fast as possible (within 2 seconds at most).

Tools such as S3 sync or Rclone provide exactly this functionality, but they carry a major disadvantage:

The sync command will need to enumerate all of the files in the bucket to determine whether a local file already exists in the bucket and if it is the same as the local file. The more documents you have in the bucket, the longer it's going to take. This means that once the bucket gets big even a small change will cost a lot of time.

I wonder if there is a way (a tool or a method) to sync only the modified files, without having to go through all of the files. You can imagine comparing the metadata at the source and at the remote, determining what the differences are, and acting accordingly.
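To make the idea concrete, here is a rough sketch of the kind of metadata comparison I have in mind. It assumes each side can produce a small manifest mapping every path to its size and modification time. The names `FileMeta`, `Manifest`, and `diff_manifests` are made up for illustration and are not part of S3 sync, Rclone, or any other tool. It still walks both manifests in memory, but it never has to list or stat the files themselves.

```python
from typing import Dict, List, NamedTuple, Tuple


class FileMeta(NamedTuple):
    """Hypothetical per-file metadata record (not from any real tool)."""
    size: int      # file size in bytes
    mtime: float   # last-modified time, seconds since the epoch


# A manifest maps a relative path to that file's metadata.
Manifest = Dict[str, FileMeta]


def diff_manifests(local: Manifest, remote: Manifest) -> Tuple[List[str], List[str]]:
    """Return (paths to download, paths to delete locally).

    A path is downloaded when it is missing locally or its metadata
    differs from the remote entry; it is deleted when it no longer
    exists in the remote manifest.
    """
    to_download = sorted(
        path for path, meta in remote.items() if local.get(path) != meta
    )
    to_delete = sorted(path for path in local if path not in remote)
    return to_download, to_delete


# Example: b.txt changed remotely, c.txt is new, old.txt was removed.
local_manifest = {
    "a.txt": FileMeta(20000, 1.0),
    "b.txt": FileMeta(100, 2.0),
    "old.txt": FileMeta(5, 3.0),
}
remote_manifest = {
    "a.txt": FileMeta(20000, 1.0),
    "b.txt": FileMeta(150, 4.0),
    "c.txt": FileMeta(7, 5.0),
}

download, delete = diff_manifests(local_manifest, remote_manifest)
print(download)  # ['b.txt', 'c.txt']
print(delete)    # ['old.txt']
```

With something like this, each machine would only need to fetch the remote manifest (or just the part of it that changed) and then transfer the handful of files that differ, rather than enumerating the whole bucket on every sync.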

How would you go about it?

user12396421

0 Answers