7

How to check if a local file is the same as a file stored in S3 without downloading it? I want to avoid downloading large files again and again. S3 objects have ETags, but they are difficult to compute if the file was uploaded in parts, and the solution from this question doesn't seem to work. Is there some easier way to avoid unnecessary downloads?

DikobrAz

3 Answers

6

I would just compare the last-modified times and download only if they differ. Additionally, you can compare the sizes before downloading. Given a bucket, a key and a local file fname:

import boto3
import os.path

def isModified(bucket, key, fname):
  s3 = boto3.resource('s3')
  obj = s3.Object(bucket, key)
  # obj.last_modified is a timezone-aware datetime; compare epoch seconds.
  # (timestamp() is used instead of strftime('%s'), which is not portable.)
  return int(obj.last_modified.timestamp()) != int(os.path.getmtime(fname))
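
If you also want the size check the answer mentions, here is a minimal sketch extending the snippet above (needs_download is an illustrative name, not part of the original answer):

import boto3
import os.path

def needs_download(bucket, key, fname):
    # True when the local copy is missing, has a different size,
    # or is older than the S3 object.
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket, key)
    if not os.path.isfile(fname):
        return True
    if obj.content_length != os.path.getsize(fname):
        return True
    return obj.last_modified.timestamp() > os.path.getmtime(fname)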
helloV
  • Agreed. Filename + size + modified-time is normally sufficient. If you need to be 100% sure that things haven't been changed, then use ETag. – John Rotenstein Jun 14 '17 at 04:09
  • Is there any way to download the file from S3 while preserving the modified date? Otherwise, using this will never work, because every time you download the file the local copy will have a newer creation & modified date than the remote counterpart. (at least on macOS) – Pol Alvarez Vecino Apr 20 '21 at 08:00
  • For the record, I realized I can just change the modified date of the downloaded object and set it to its remote counterpart, so this solution will work like a charm. – Pol Alvarez Vecino Apr 20 '21 at 08:15
  • How do you 'change the modified date of the downloaded object' ? – YoavEtzioni Sep 01 '21 at 13:02
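
To address the last few comments (resetting the local file's modified date so the comparison keeps working after each download), here is a minimal sketch of that idea; the function name is illustrative:

import boto3
import os

def download_preserving_mtime(bucket, key, fname):
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket, key)
    obj.download_file(fname)
    # Set the local copy's atime and mtime to the remote last-modified time,
    # so isModified() above stays False until the S3 object actually changes.
    remote_ts = obj.last_modified.timestamp()
    os.utime(fname, (remote_ts, remote_ts))
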
2

Can you use a small local database, e.g. a text file?

  • Download an S3 object once. Note its ETag.
  • Compute whatever signature you want.
  • Put the (ETag, signature) pair into the 'database'.

Next time, before you proceed with downloading, look up the ETag in the 'database'. If it's there, compute the signature of your existing file and compare it with the signature corresponding to that ETag. If they match, the remote file is the same as the one you have.

There's a possibility that the same file will be re-uploaded with different chunking, thus changing the ETag. Unless this is very probable, you can just ignore the false negative and re-download the file in that rare case.
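
A minimal sketch of that idea, using a JSON text file as the 'database' and SHA-256 as the local signature (the file name, hash choice and function names are illustrative; the remote ETag is fetched with a HEAD request via the boto3 Object resource):

import boto3
import hashlib
import json
import os

DB_PATH = 'etag_db.json'  # the small local 'database' (a plain text file)

def _load_db():
    if os.path.exists(DB_PATH):
        with open(DB_PATH) as f:
            return json.load(f)
    return {}

def local_signature(fname):
    # Whatever signature you want; SHA-256 of the file contents is used here.
    h = hashlib.sha256()
    with open(fname, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

def is_up_to_date(bucket, key, fname):
    # Look up the remote ETag (HEAD request, no download) in the 'database'
    # and compare the stored signature with the local file's.
    etag = boto3.resource('s3').Object(bucket, key).e_tag
    known = _load_db().get(etag)
    return known is not None and os.path.isfile(fname) and local_signature(fname) == known

def remember(bucket, key, fname):
    # Call right after a download to record the (ETag, signature) pair.
    db = _load_db()
    db[boto3.resource('s3').Object(bucket, key).e_tag] = local_signature(fname)
    with open(DB_PATH, 'w') as f:
        json.dump(db, f)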

9000
  • The "database" could well be an S3 object tag. That way you don't need an extra resource, and you don't have to re-compute the signature if the object path/key changes. – Chris Johnson Jun 13 '17 at 22:01
  • I guess it will work, or I can just compute signature and attach it as metadata to S3 object. Just seems that this is very standard operation and there should be some way to do it without writing your own solution. Also I'm wondering how `aws s3 sync` console command works. – DikobrAz Jun 13 '17 at 22:03
  • You could also use [What is the algorithm to compute the Amazon-S3 Etag for a file larger than 5GB?](https://stackoverflow.com/questions/12186993/what-is-the-algorithm-to-compute-the-amazon-s3-etag-for-a-file-larger-than-5gb) to calculate the Etag yourself, but storing it in a database would avoid having to do it repeatedly. – John Rotenstein Jun 14 '17 at 04:08
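
If you prefer Chris Johnson's suggestion of keeping the signature on the object itself rather than in a local file, a hedged sketch with the boto3 client (the tag key 'sha256' is just an example; note that put_object_tagging replaces the object's whole tag set):

import boto3

s3 = boto3.client('s3')

def put_signature_tag(bucket, key, signature):
    # Store the locally computed signature as an object tag.
    s3.put_object_tagging(
        Bucket=bucket, Key=key,
        Tagging={'TagSet': [{'Key': 'sha256', 'Value': signature}]})

def get_signature_tag(bucket, key):
    # Read it back without downloading the object.
    tags = s3.get_object_tagging(Bucket=bucket, Key=key)['TagSet']
    return next((t['Value'] for t in tags if t['Key'] == 'sha256'), None)
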
0

If you don't need the information immediately, you can generate an S3 Storage Inventory report and then import it into your database for future use.
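
A hedged sketch of reading one of the inventory's gzipped CSV reports afterwards, assuming the inventory was configured in CSV format with the optional Size, LastModifiedDate and ETag fields; the actual column order is listed in the report's manifest.json, so verify it before relying on these indexes:

import csv
import gzip

def load_inventory_etags(report_path):
    # Map object key -> ETag from a single inventory report file.
    etags = {}
    with gzip.open(report_path, 'rt', newline='') as f:
        for row in csv.reader(f):
            bucket, key, size, last_modified, etag = row[:5]
            etags[key] = etag
    return etags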

Compute the local file's ETag as shown here for a normal file and a huge multipart file.
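
In case those links rot, a hedged sketch of the usual local ETag computation: for a single-part upload the ETag is just the file's MD5, and for a multipart upload it is the MD5 of the concatenated binary part digests plus a "-<part count>" suffix. The 8 MB default below matches the AWS CLI's default multipart chunk size; adjust it to whatever part size was actually used for the upload:

import hashlib

def local_etag(fname, part_size=8 * 1024 * 1024):
    part_md5s = []
    with open(fname, 'rb') as f:
        for chunk in iter(lambda: f.read(part_size), b''):
            part_md5s.append(hashlib.md5(chunk))
    if len(part_md5s) <= 1:
        # Small file uploaded in a single PUT: ETag is the plain MD5.
        return part_md5s[0].hexdigest() if part_md5s else hashlib.md5().hexdigest()
    # Multipart upload: MD5 of the concatenated part digests, plus the part count.
    combined = hashlib.md5(b''.join(m.digest() for m in part_md5s))
    return combined.hexdigest() + '-' + str(len(part_md5s))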

mootmoot