I have a zip file in S3, say something.zip, which contains a folder, say something, with some contents inside it. Using boto3 on Python 3.6, I download it, unzip it, and use the extracted files. Later, when I need to use the files in the something folder, I need to verify that they have not been tampered with in any way. I don't want to download the whole file again and unzip it, so my idea was to zip the something folder back into something.zip and then calculate the S3 ETag. I already use the function below to verify uploads: I calculate the ETag before uploading and compare it with the ETag returned by the list_objects call in the boto3 client, and that works perfectly, so I am able to verify my uploads.
import hashlib

def calculate_s3_etag(file_path, chunk_size=8 * 1024 * 1024):
    # Hash the file in chunk_size pieces, mirroring S3's multipart ETag scheme.
    md5s = []
    with open(file_path, 'rb') as fp:
        while True:
            data = fp.read(chunk_size)
            if not data:
                break
            md5s.append(hashlib.md5(data))

    # Single-part upload: the ETag is just the MD5 of the whole file.
    if len(md5s) == 1:
        return '"{}"'.format(md5s[0].hexdigest())

    # Multipart upload: MD5 of the concatenated part digests, plus part count.
    digests = b''.join(m.digest() for m in md5s)
    digests_md5 = hashlib.md5(digests)
    return '"{}-{}"'.format(digests_md5.hexdigest(), len(md5s))
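For reference, here is a self-contained sanity check of the function's two branches on a throwaway temp file, with a tiny chunk size just for the demo (real S3 parts are of course much larger):

```python
import hashlib
import os
import tempfile

def calculate_s3_etag(file_path, chunk_size=8 * 1024 * 1024):
    # Same scheme as above: per-chunk MD5s, then MD5 of the joined digests.
    md5s = []
    with open(file_path, 'rb') as fp:
        while True:
            data = fp.read(chunk_size)
            if not data:
                break
            md5s.append(hashlib.md5(data))
    if len(md5s) == 1:
        return '"{}"'.format(md5s[0].hexdigest())
    digests = b''.join(m.digest() for m in md5s)
    return '"{}-{}"'.format(hashlib.md5(digests).hexdigest(), len(md5s))

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'x' * 10)

etag_multi = calculate_s3_etag(tmp.name, chunk_size=4)    # parts of 4 + 4 + 2 bytes
etag_single = calculate_s3_etag(tmp.name, chunk_size=64)  # whole file in one part
os.remove(tmp.name)
print(etag_multi, etag_single)
```

The multipart result carries a "-3" suffix (three parts), while the single-part result is a plain quoted MD5 with no suffix.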
But when I do the same for the zip file I created by re-zipping my something folder into something.zip, the ETags don't match and I am not able to verify the folder. Am I doing something wrong? I went through some discussion threads but could not find this specific use case anywhere. As far as I understand, if I had calculated the ETag on the originally downloaded file itself, it would have worked, right?
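My suspicion is that the recreated archive's bytes simply differ from the original's: a zip's byte stream depends on the tool and settings used (compression method and level, file timestamps, entry order), so even identical folder contents can yield a different archive and hence a different ETag. A quick in-memory illustration of how much the settings alone matter, using only zipfile:

```python
import hashlib
import io
import zipfile

def zip_bytes(payload, compression):
    # Build an in-memory zip holding one file, with the given compression method.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w', compression) as zf:
        zf.writestr('something/data.txt', payload)
    return buf.getvalue()

payload = b'hello world ' * 1000  # compressible sample data

# Same logical contents, different archive bytes -> different MD5s/ETags.
stored = zip_bytes(payload, zipfile.ZIP_STORED)
deflated = zip_bytes(payload, zipfile.ZIP_DEFLATED)
md5_stored = hashlib.md5(stored).hexdigest()
md5_deflated = hashlib.md5(deflated).hexdigest()
print(md5_stored == md5_deflated)
```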
Is there any way to do this, or is there a better way to achieve my objective? I just need to check whether the contents of the folder are in the same state as when I downloaded them from S3.
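If re-creating the zip turns out to be a dead end, would something like hashing the extracted tree directly work instead? A sketch of what I have in mind (an MD5 over sorted relative paths plus file contents, so walk order cannot change the result; the digest would be recorded once after extraction and re-checked later):

```python
import hashlib
import os

def folder_digest(root):
    # Deterministic digest of a directory tree: hash each file's relative
    # path and contents, visiting entries in sorted order for reproducibility.
    h = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix traversal order of subdirectories
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            h.update(os.path.relpath(full, root).encode('utf-8'))
            with open(full, 'rb') as fp:
                for chunk in iter(lambda: fp.read(8192), b''):
                    h.update(chunk)
    return h.hexdigest()

# Demo on a throwaway directory.
import tempfile
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'something'))
target = os.path.join(root, 'something', 'a.txt')
with open(target, 'wb') as f:
    f.write(b'hello')
d1 = folder_digest(root)
d2 = folder_digest(root)
print(d1 == d2)  # same tree, same digest
```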
Note: my file sizes are anywhere from 10 MB to 800 MB, so I don't think the 5 GB limit will affect me, but I don't have much experience with S3.