
I would like to compare the file contents of two S3-compatible buckets and identify files that are missing or that differ.

Should I use checksums to do it instead?

meitale
  • We'd love to help, but unfortunately it is not clear from your question what you are trying to do. Feel free to Edit your question and add additional information. For tips on asking a question, please see: [How do I ask a good question?](http://stackoverflow.com/help/how-to-ask) – John Rotenstein Apr 01 '18 at 21:48
  • I was thinking of comparing the file contents of two S3-compatible buckets and returning files that are missing or that differ. Should I use checksums to do it instead? – meitale Apr 04 '18 at 08:36

1 Answer


It appears that your requirement is to compare the contents of two Amazon S3 buckets and identify files that are missing or differ between the buckets.

To do this, you could use the following object attributes (a short boto3 sketch follows the list):

  • Object name (key): Comparing the lists of keys will, of course, find missing files.
  • Object size: A different size indicates different contents, and the size is returned with every bucket listing.
  • eTag: For objects uploaded in a single operation without KMS encryption, the eTag is an MD5 checksum of the object's contents, so if the same key has a different eTag, the contents differ. (For multipart or KMS-encrypted uploads the eTag is not an MD5, so a mismatch does not necessarily mean different contents; see the last comment below.)
  • Creation date: This is not actually a reliable way to identify differences, but it can be used with other metadata to decide whether you want to update a file. For example, if two files differ and the object in the destination bucket has a newer date than the object in the source bucket, you probably don't need to copy the file across. But if the source file was modified after the destination file, it is likely a candidate for re-copying.
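
As a rough illustration, a boto3 sketch along these lines could list both buckets and compare keys, sizes and eTags. The bucket names are placeholders, and for a non-AWS S3-compatible service you would also pass an `endpoint_url` when creating the client:

```python
import boto3

def list_objects(bucket, s3):
    """Return {key: (size, etag)} for every object in the bucket."""
    objects = {}
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get('Contents', []):
            objects[obj['Key']] = (obj['Size'], obj['ETag'])
    return objects

# For a non-AWS, S3-compatible endpoint, add endpoint_url='https://...' here.
s3 = boto3.client('s3')

source = list_objects('source-bucket', s3)            # placeholder bucket names
destination = list_objects('destination-bucket', s3)

missing = sorted(key for key in source if key not in destination)
different = sorted(
    key for key, meta in source.items()
    if key in destination and destination[key] != meta
)

print('Missing from destination:', missing)
print('Different size or eTag:', different)
```

This only reports which objects are missing or different; copying anything across would be a separate step.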

Instead of doing all the above logic yourself, you can also use the AWS Command-Line Interface (CLI). It has an `aws s3 sync` command that compares files between the source and destination, and then copies files that are modified or missing.
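
If the sync behaviour fits, its `--dryrun` option reports which objects would be copied without copying anything. As a minimal sketch of driving it from Python, assuming the `aws` executable is installed and credentials are configured (bucket names are again placeholders):

```python
import subprocess

# Ask the AWS CLI what it *would* copy, without actually copying anything.
result = subprocess.run(
    ['aws', 's3', 'sync',
     's3://source-bucket', 's3://destination-bucket',  # placeholder names
     '--dryrun'],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```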

John Rotenstein
  • Can I do it using boto3 with Python? In all the scripts I see online, everyone is using boto and not boto3. – meitale Apr 10 '18 at 09:32
  • boto3 is the current and preferred version to use. – John Rotenstein Apr 10 '18 at 21:28
  • I was reading about the AWS CLI and I am not sure how to use it with Python. I was going over all the options on Google and it looks like no one has compared S3 buckets with Python. Can you give me an example to run? – meitale Apr 11 '18 at 13:03
  • If you are writing Python code, then you should use the boto3 SDK. The AWS CLI is for use from the command-line, such as manual commands or shell scripts. (In fact, the CLI is written in Python and uses boto3 itself!) So, check whether the `aws s3 sync` command suits your needs. If not, you will likely have to write your own code to accomplish the desired task. – John Rotenstein Apr 11 '18 at 21:27
  • eTag is not always an MD5 checksum so different value for this does not necessarily mean different file contents. https://stackoverflow.com/questions/53882724/aws-s3-etag-not-matching-md5-after-kms-encryption – Bushrod Aug 26 '20 at 19:40
  • I inferred the user does not intend to copy contents. – nf071590 Dec 02 '20 at 16:14