I have a job that clones a repo and then uses aws s3 sync to copy the files over to an S3 bucket. I'd like to sync only changed files. Because the repo is freshly cloned on every run, the files always have a new timestamp, so s3 sync always uploads them. I thought about using "--size-only", but my understanding is that this can miss files that have legitimately changed. What's the best way to go about this?

sebastian

1 Answer

Nothing out of the box will sync only changed files when the mtime cannot be counted on. As you point out, the "--size-only" flag makes aws s3 sync skip any file whose contents changed but whose size stayed the same. To my mind there are two basic paths; which one you use will depend on your exact needs.

Take advantage of Git

First off, you could use the fact that the files are stored in git to help restore the modification time. Git itself does not store this metadata; the maintainers have a philosophy that doing so is a bad idea. I won't argue for or against this, but there are two basic ways around it:

You could store this metadata in git. There are multiple approaches to doing this; one is metastore, a tool installed alongside git that saves the metadata and applies it later. This does require every user of your git repo to install the tool, which may or may not be acceptable.

Another option is to attempt to recreate the mtime from metadata that's already in git. For instance, git-restore-mtime does this by using the timestamp of the most recent commit that modified the file. This would require running an external tool before running the sync command, but it shouldn't require any other workflow changes.

Using either of these options would allow a basic aws s3 sync command to work, since the timestamps would be consistent from one run to the next.
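
As a rough sketch of the second option (not how git-restore-mtime itself is implemented), the following Python sets each tracked file's mtime to the committer time of the last commit that touched it. It assumes git is on the PATH and that the clone has full history; with a shallow clone every file would likely map to the same commit. The function names are made up for illustration, and running git log once per file is slow on large repos.

```python
import os
import subprocess


def last_commit_times(repo_dir):
    """Map each tracked path to the Unix time of the last commit touching it."""
    listing = subprocess.run(
        ["git", "ls-files", "-z"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    times = {}
    for path in filter(None, listing.split("\0")):
        # %ct is the committer date as a Unix timestamp.
        out = subprocess.run(
            ["git", "log", "-1", "--format=%ct", "--", path],
            cwd=repo_dir, check=True, capture_output=True, text=True,
        ).stdout.strip()
        if out:
            times[path] = int(out)
    return times


def apply_mtimes(repo_dir, times):
    """Set atime and mtime of each listed file to its recorded timestamp."""
    for path, ts in times.items():
        os.utime(os.path.join(repo_dir, path), (ts, ts))


# Hypothetical usage, run after the clone and before `aws s3 sync`:
#   apply_mtimes("checkout", last_commit_times("checkout"))
```

Because the timestamps then depend only on commit history, an unchanged file gets the same mtime on every clone.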

Do your own thing

Fundamentally, you want to upload files that have changed. aws s3 sync uses file size and modification timestamps to detect changes, but if you wanted to, you could write a script or program that enumerates all the files you want to upload and uploads them along with a small bit of extra metadata, such as a sha256 hash. Then on future runs, you can enumerate the objects in S3 using list-objects and call head-object on each one in turn to read the metadata and see whether the hash has changed.
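
Here is a minimal sketch of that idea in Python. It assumes `s3` is a boto3 S3 client (for example `boto3.client("s3")`) passed in by the caller, that object keys mirror the relative file paths, and that the hash lives under a user metadata key named `sha256`; the key name and the function names are my own choices, not anything aws s3 sync knows about.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB chunks so large files are not read into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()


def local_hashes(root):
    """Map relative POSIX paths under root to their sha256, skipping .git."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in root.rglob("*")
        if p.is_file() and ".git" not in p.relative_to(root).parts
    }


def remote_hashes(s3, bucket):
    """Map each object key to the sha256 stored in its metadata (or None)."""
    hashes = {}
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            # One head_object call per object: simple, but slow for big buckets.
            head = s3.head_object(Bucket=bucket, Key=obj["Key"])
            hashes[obj["Key"]] = head["Metadata"].get("sha256")
    return hashes


def sync_changed(s3, bucket, root):
    """Upload files whose hash differs from the remote one; return their keys."""
    local = local_hashes(root)
    remote = remote_hashes(s3, bucket)
    changed = [key for key, digest in local.items() if remote.get(key) != digest]
    for key in changed:
        s3.upload_file(
            str(Path(root) / key), bucket, key,
            ExtraArgs={"Metadata": {"sha256": local[key]}},
        )
    return changed
```

This sketch only uploads; deleting objects that no longer exist locally would need a separate pass over the remote keys.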

Alternatively, you could use the "etag" of each object in S3, as that is returned in the list-objects call. As I understand it, the etag formula isn't documented and is subject to change. That said, it is known; you can find implementations of it here on Stack Overflow and elsewhere. You could calculate the etag for your local files, then see whether the remote files differ and need to be updated. That would save you from having to call head-object on each object as you check for changes.
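
For illustration, here is a sketch of the commonly circulated etag calculation in Python. It rests on several assumptions: a single-part upload has an etag equal to the hex MD5 of the content, a multipart upload has the MD5 of the concatenated per-part MD5 digests followed by "-" and the part count, and the part size matches whatever the uploader used (8 MiB is my assumption, based on the AWS CLI default). Objects encrypted with SSE-KMS, or files sitting exactly on the multipart threshold, may not match.

```python
import hashlib


def local_etag(path, part_size=8 * 1024 * 1024):
    """Compute the etag S3 would likely report for this file.

    Assumes the upload used fixed-size parts of part_size bytes and that a
    file fitting in one part was uploaded without multipart.
    """
    part_digests = []
    with open(path, "rb") as f:
        while chunk := f.read(part_size):
            part_digests.append(hashlib.md5(chunk).digest())
    if not part_digests:
        # Empty file: plain MD5 of zero bytes.
        return hashlib.md5(b"").hexdigest()
    if len(part_digests) == 1:
        return part_digests[0].hex()
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

Note that the ETag value in a list-objects response is wrapped in double quotes, so compare against something like `obj["ETag"].strip('"')`.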

Anon Coward
    This tool seems like it may be viable. It does md5 comparison, can delete objects in the destination that are not found in the source, and does so recursively - https://s3tools.org/usage – sebastian May 06 '21 at 17:39