
I have a zip file in S3, say something.zip, which contains a folder named something with some contents inside it. I'm using boto3 with Python 3.6 to download the archive, unzip it, and use the contents. At some later point, when I need to use the files in the something folder again, I want to verify that they have not been tampered with in any way. I don't want to download the whole file again and unzip it. So I thought of zipping the something folder back into something.zip and then calculating the S3 ETag of that archive. I already use the function below to verify uploads: I calculate the ETag before uploading and compare it with the ETag returned by list_objects on the boto3 client. That works perfectly, and I am able to verify the uploads.

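import hashlib


def calculate_s3_etag(file_path, chunk_size=8 * 1024 * 1024):
    # chunk_size has to match the part size that was used for the upload.
    md5s = []
    with open(file_path, 'rb') as fp:
        while True:
            data = fp.read(chunk_size)
            if not data:
                break
            md5s.append(hashlib.md5(data))

    # A single chunk is treated as a plain (non-multipart) upload:
    # the ETag is just the MD5 of the whole file.
    if len(md5s) == 1:
        return '"{}"'.format(md5s[0].hexdigest())

    # Multipart upload: MD5 of the concatenated part digests,
    # followed by a dash and the number of parts.
    digests = b''.join(m.digest() for m in md5s)
    digests_md5 = hashlib.md5(digests)
    return '"{}-{}"'.format(digests_md5.hexdigest(), len(md5s))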

But when I do the same for the archive I create myself by zipping the something folder into something.zip, the ETag does not match and I am not able to verify the folder. Am I doing it wrong? I looked through some discussion threads but could not find this specific use case anywhere. As far as I understand, if I had calculated the ETag on the originally downloaded file itself, it would have matched, right?
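
One likely reason for the mismatch, offered as an assumption rather than a confirmed diagnosis of this setup: a zip archive stores per-file metadata such as modification timestamps, so re-zipping identical contents generally does not reproduce the original bytes, and any hash of the archive (including the ETag) changes. The sketch below builds two in-memory archives that differ only in the stored timestamp; `zip_bytes` and `member_bytes` are hypothetical helper names.

```python
import hashlib
import io
import zipfile


def zip_bytes(name, data, date_time):
    """Build an in-memory zip with one member stamped with date_time."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo(name, date_time=date_time)
        zf.writestr(info, data)
    return buf.getvalue()


def member_bytes(archive, name):
    """Read one member back out of an in-memory zip."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(name)


payload = b"same contents" * 1000
name = "something/data.txt"

# Same file name and same bytes, but different modification times.
first = zip_bytes(name, payload, (2020, 8, 7, 13, 52, 0))
second = zip_bytes(name, payload, (2020, 8, 7, 14, 39, 0))

same_contents = member_bytes(first, name) == member_bytes(second, name)
same_archive = hashlib.md5(first).digest() == hashlib.md5(second).digest()
```

Here `same_contents` is true while `same_archive` is false, which is why hashing the extracted files directly tends to be more robust than hashing a re-created archive.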

Is there any way to do this? Or is there a better way to achieve my objective? I just need to check whether the contents of the folder are still in the same state as when I downloaded them from S3.

Note: My files are anywhere from about 10 MB to 800 MB, so I don't think the 5 GB problem will affect me, but I don't have much experience with S3.

krxat
  • ...from reading this https://aws.amazon.com/premiumsupport/knowledge-center/data-integrity-s3/ I do get the feeling ETag should not be used for this purpose: it can change (e.g. bucket encryption) and is not necessarily an MD5 sum... – mrxra Aug 07 '20 at 13:52
  • But the fact that it works when I calculate while uploading is kind of confusing. Is there a better way check the integrity of the folder? – krxat Aug 07 '20 at 14:39
  • ...to the best of my knowledge, the aws sdk validates data integrity during transfer already. for your particular need to _avoid_ data transfer if the file has not changed, there seem to be others using some etag hackery more or less successfully: such as this one https://stackoverflow.com/questions/12186993/what-is-the-algorithm-to-compute-the-amazon-s3-etag-for-a-file-larger-than-5gb (where I assume your code originally comes from). if ETag works, use it, don't be surprised if it breaks. As an alternative you could use a custom metadata attribute to store _your_ md5sum. – mrxra Aug 07 '20 at 15:31
  • ...you could also try to rely on a combination of HTTP headers such as last-modified and content-length that you can retrieve via `head_object`...it's not a hash though but might be sufficient (and probably more stable) for your use-case – mrxra Aug 07 '20 at 15:55
  • The object in the S3 won't change for sure. The contents of the local file might. Does that work in this usecase then? – krxat Aug 07 '20 at 18:39
  • ...in that case I'd indeed use a custom metadata attribute: calculate the md5 sum and store it together with the object on s3. and each time before you upload/download you get the metadata first and compare it with the md5 of your local version – mrxra Aug 07 '20 at 19:55
  • Oh, can I store any kind of value in the metadata? – krxat Aug 07 '20 at 19:56
  • not sure what you mean by _any_, but I guess you can pretty much store any kind of key/value pair...such as an md5: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/add-object-metadata.html – mrxra Aug 08 '20 at 06:59
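
The custom-metadata idea from the comments could be sketched as follows. This is a minimal sketch under assumptions, not a tested recipe for this setup: `folder_digest` is a hypothetical helper that hashes the relative paths and bytes of every file in the folder in a fixed order, so the result does not depend on zip timestamps or compression settings.

```python
import hashlib
import os


def folder_digest(folder_path, chunk_size=8 * 1024 * 1024):
    """Return one MD5 hex digest over all files under folder_path.

    Hypothetical helper: relative paths and file bytes are fed into the
    hash in sorted order so the result is stable across runs.
    """
    digest = hashlib.md5()
    for root, dirs, files in os.walk(folder_path):
        dirs.sort()  # make the traversal order deterministic
        for file_name in sorted(files):
            full_path = os.path.join(root, file_name)
            rel_path = os.path.relpath(full_path, folder_path)
            # Normalise separators so the digest is the same on every OS.
            digest.update(rel_path.replace(os.sep, "/").encode("utf-8"))
            digest.update(b"\0")
            with open(full_path, "rb") as fp:
                for chunk in iter(lambda: fp.read(chunk_size), b""):
                    digest.update(chunk)
            digest.update(b"\0")
    return digest.hexdigest()
```

The digest could be computed once from the unzipped folder and stored with the object, for example through the `Metadata` argument of boto3's `put_object` at upload time, then read back later from the `Metadata` field of a `head_object` response and compared with a fresh `folder_digest` of the local folder. The metadata key name is up to you. If deliberate tampering is the concern rather than accidental change, `hashlib.sha256` can be swapped in for `hashlib.md5`.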
