
Referred Posts: Amazon S3 & Checksum, How to encode md5 sum into base64 in BASH

I have to download a tar file from an S3 bucket to which I have limited access (mostly download-only permissions).

After downloading, I have to check the MD5 checksum of the downloaded file against the MD5 checksum stored as metadata on the object in S3.

I currently use an S3 file browser to manually note the "x-amz-meta-md5" value in the content headers and validate it against the computed MD5 of the downloaded file.

I would like to know whether there is a programmatic way, using boto, to read the MD5 hash value that is stored as metadata on an S3 object.

from boto.s3.connection import S3Connection

conn = S3Connection(access_key, secret_key)
bucket=conn.get_bucket("test-bucket")
rs_keys = bucket.get_all_keys()
for key_val in rs_keys:
    print key_val, key_val.**HOW_TO_GET_MD5_FROM_METADATA(?)**

Please correct me if my understanding is wrong. I am looking for a way to capture the header data programmatically.

user1652054

3 Answers


When boto downloads a file using any of the get_contents_to_* methods, it computes the MD5 checksum of the bytes it downloads and makes that available as the md5 attribute of the Key object. In addition, S3 sends an ETag header in the response that represents the server's idea of what the MD5 checksum is. This is available as the etag attribute of the Key object. So, after downloading a file you could just compare the value of those two attributes to see if they match.

If you want to find out what S3 thinks the MD5 is without actually downloading the file (as shown in your example) you could just do this:

for key_val in rs_keys:
    print key_val, key_val.etag
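
The comparison described above can be sketched with the standard library alone. This is a minimal illustration, not boto's own logic: `md5_of_file` and `etag_matches` are hypothetical helper names, and the sketch assumes the ETag is only usable when it is a plain 32-character hex string (the surrounding double quotes are stripped first).

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a local file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def etag_matches(etag, local_md5):
    """Compare an ETag header value with a local hex MD5.

    Returns None when the ETag is not a plain 32-character hex string
    (for example a multipart ETag ending in "-<parts>"), because then
    it cannot be treated as an MD5 at all.
    """
    cleaned = etag.strip('"').lower()
    is_hex = len(cleaned) == 32 and all(c in "0123456789abcdef" for c in cleaned)
    if not is_hex:
        return None
    return cleaned == local_md5.lower()
```

A caller would pass the `etag` attribute of the key and the result of `md5_of_file` for the downloaded file, and treat a `None` result as "cannot verify this way" rather than as a mismatch.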
garnaat
  • 2
    Thanks for the suggestion. The ETag value does not seem to match the computed MD5 checksum. I also saw in the referred posts that the ETag is not a reliable MD5 value. "x-amz-meta-md5" is the key in my S3 file browser that gives me the MD5 value, but this key is not available in the metadata or content headers that I can obtain programmatically. – user1652054 Jun 03 '13 at 04:05
  • 3
    The ``etag`` attribute will be of the form ``"797cc49509a9df16481fac4fae144e0a"`` while the ``md5`` attribute will be ``797cc49509a9df16481fac4fae144e0a``. Note the enclosing double-quotes in the ``etag``. You need to take that into account when comparing the values. The ``x-amz-meta-md5`` key is not a standard S3 metadata value but a custom one. Perhaps that has been added by the S3 File browser? – garnaat Jun 03 '13 at 13:03
  • 5
    One other comment. I reviewed the boto source code and confirmed that boto automatically checks the value of the ``etag`` header with the computed ``md5`` when downloading a file. It will raise ``S3DataError`` exception if they do not match. – garnaat Jun 03 '13 at 13:30
  • 2
    We had a case where the download succeeded, yet the downloaded file was corrupt. I assume you are referring to the following code in boto, in boto/s3/resumable_download_handler.py: `self.etag_value_for_current_download = f.readline().rstrip('\n')`, followed by the comment "We used to match an MD5-based regex to ensure that the etag read correctly. Since ETags need not be MD5s, we now do a simple length sanity check instead." Please confirm whether there is another file where the downloaded file is checked against the MD5 checksum. – user1652054 Jun 10 '13 at 08:37
  • 3
    ETag is **not** reliable for MD5 checksums! From [the S3 Documentation: "The ETag may or may not be an MD5 digest of the object data."](http://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html). See [this Stack Overflow answer](http://stackoverflow.com/a/19304527/38140) for more detail. – gotgenes Aug 19 '15 at 21:37
  • 1
    @garnaat Can you point to where the ETag is checked against the calculated MD5? I see that `_get_file_internal()` calculates the MD5 but never actually checks it (I'm not sure why). See here: https://github.com/boto/boto/blob/develop/boto/s3/key.py#L1555 – Ben Hoyt Feb 17 '16 at 13:10

It seems well established that the ETag is not the md5sum if the file was assembled after running a multi-part upload. I think in that case one's only recourse is to download the file and perform a checksum locally. If the result is correct, the S3 copy must be good. If the local checksum is wrong, the s3 copy may be bad, or the download might have failed. If you no longer have the original file or a record of its md5sum, I think you're out of luck. It would be great if the md5sum of the assembled file were available, or if there were a way to locally compute the expected etag of a file to be uploaded via multipart.
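
On the last point, a commonly observed convention (not a documented guarantee, and it may not hold for objects encrypted with some server-side encryption modes) is that a multipart ETag is the MD5 of the concatenated binary MD5 digests of the parts, followed by a dash and the part count. A rough sketch under that assumption follows; `multipart_etag` is a hypothetical helper, and `part_size` must be the part size the uploader actually used, which has to be known separately.

```python
import hashlib


def multipart_etag(path, part_size):
    """Estimate the ETag S3 reports for a multipart upload of `path`.

    Assumption: every part except the last is exactly `part_size` bytes,
    and the ETag is md5(concatenated binary part digests) + "-<part count>".
    """
    part_digests = []
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(part_size), b""):
            part_digests.append(hashlib.md5(block).digest())
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return "{}-{}".format(combined, len(part_digests))
```

If the value computed this way equals the ETag with its quotes stripped, that is reasonable evidence the stored object matches the local file; a mismatch may only mean that the assumed part size is wrong.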


The following approaches get the md5sum value using only boto3.resource('s3'). (There are many more.)

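  • Use the ETag, which matches the md5sum only when the object was not uploaded via multipart upload (as noted in the other answers, S3 does not guarantee that the ETag is an MD5).
import boto3

s3_resource = boto3.resource('s3')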
head_response = s3_resource.meta.client.head_object(Bucket=bucket_name, Key=object_key)
object_ETag = head_response['ETag'][1:-1]

or

s3_resource = boto3.resource('s3')
s3_object   = s3_resource.Object(bucket_name, object_key)
object_ETag = s3_object.e_tag.strip('"')
  • Read the object and apply md5sum
import hashlib
import boto3

s3_resource = boto3.resource('s3')
s3_object   = s3_resource.Object(bucket_name, object_key)
# content_length is the object's size in bytes
read_body   = s3_object.get()['Body'].read(s3_object.content_length)
temp_hash   = hashlib.md5()
temp_hash.update(read_body)
s3_md5sum = temp_hash.hexdigest()

or

s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object(bucket_name, object_key).get()
temp_hash = hashlib.md5()
temp_hash.update(s3_object['Body'].read())
s3_md5_hash = temp_hash.hexdigest()

or

s3_resource = boto3.resource('s3')
# named s3_object to avoid shadowing the built-in `object`
s3_object   = s3_resource.Object(bucket_name, object_key).get()
s3_md5_hash = hashlib.md5(s3_object['Body'].read()).hexdigest()

or

s3_resource = boto3.resource('s3')
s3_object   = s3_resource.Object(bucket_name, object_key).get()
s3_md5_hash = hashlib.md5()
# Read in 1 MiB chunks so the whole object is never held in memory
for byte_block in iter(lambda: s3_object['Body'].read(1024 * 1024), b''):
    s3_md5_hash.update(byte_block)
s3_md5_hash = s3_md5_hash.hexdigest()
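
Since the question is about the custom "x-amz-meta-md5" value rather than the ETag, one more option is to read the user-defined metadata directly. In boto3, `head_object` returns user metadata under the `Metadata` key with the `x-amz-meta-` prefix removed, so the value would appear as `md5` if it was stored under that name. The sketch below is only an illustration: `stored_md5` and `md5_matches` are hypothetical helpers, and because the question does not say whether the value is hex or base64, both encodings are accepted.

```python
import base64
import hashlib


def stored_md5(client, bucket_name, object_key, meta_name="md5"):
    """Return the user-defined MD5 metadata value, or None if absent.

    `client` is expected to behave like boto3.client("s3"); head_object
    returns user metadata under "Metadata" without the "x-amz-meta-" prefix.
    """
    response = client.head_object(Bucket=bucket_name, Key=object_key)
    return response.get("Metadata", {}).get(meta_name)


def md5_matches(stored_value, path, chunk_size=1024 * 1024):
    """Check a local file against a stored MD5 given as hex or base64."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    hex_form = digest.hexdigest()
    b64_form = base64.b64encode(digest.digest()).decode("ascii")
    candidate = stored_value.strip()
    return candidate.lower() == hex_form or candidate == b64_form
```

With boto3 installed, the client would be created with `boto3.client('s3')` and passed in together with the bucket name and object key; the caller should check for `None` from `stored_md5` before calling `md5_matches`, since the metadata entry only exists if the uploader set it.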
P_M