Swift, Amazon S3, eTag and MD5 hash for files > 5MB

Question

In my app, I download videos from the Amazon S3 cloud to the sandbox. In order to make sure that the downloaded files are not corrupt, I compare the eTag of the object (delivered by Amazon) with the MD5 hash of the downloaded object which resides in the local file system. For small videos (< 5MB) my algorithm works fine - eTag and MD5 hash are identical.

For bigger files, the two values no longer match. As far as I know, Amazon generates the eTag differently for files > 5MB. The eTag also has a trailing hyphen followed by a number (maybe it's the number of chunks?):

8c18c4ed68bc9db377cb2d3225c0ee31-4

On the Internet, I could not find a solution or code snippet that calculates the correct MD5 hash for bigger files.

To calculate the MD5 hash, I tried both

localData.md5().toHexString() // CryptoSwift

and

// Requires: import CommonCrypto
var md5: String? {
    var hash = [UInt8](repeating: 0, count: Int(CC_MD5_DIGEST_LENGTH))
    localData.withUnsafeBytes { (buffer: UnsafeRawBufferPointer) in
        // CC_MD5 expects a raw pointer to the bytes plus their count
        _ = CC_MD5(buffer.baseAddress, CC_LONG(buffer.count), &hash)
    }
    // Hex-encode the 16-byte digest
    return hash.map { String(format: "%02x", $0) }.joined()
}

Does anyone have an idea how to resolve this? Maybe I should focus on another approach, for example checking whether the downloaded video can be opened?

I have no experience with Amazon S3, but according to https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html, the eTag may be the MD5 hash or not. — Martin R, Oct 12 '18 at 10:32
Hi Martin, thank you! I know this document - I did upload all files via the web interface - thus I'm wondering why the eTags are generated differently by Amazon. — Ulrich Vormbrock, Oct 12 '18 at 10:42
The title of the referenced duplicate question refers to a file larger than 5GB but is in fact referring to any file uploaded using the multipart upload API -- which is what triggers the modified etag behavior. This can be used for smaller files because it allows upload parallelism at the developer's discretion. Multipart is mandatory for files over 5 GB, but optional for files of any size down to 5MB. (Files under 5MB can technically even be uploaded using multipart, but the number of parts cannot exceed 1, since each part except the last must be >= 5MB). — Michael - sqlbot, Oct 13 '18 at 11:23
Anecdotally, many libraries (and the console) that automatically select an upload algorithm seem to switch to multipart mode for files with minimum sizes somewhere roughly in the range of 20 to 100 MB, and use this for all larger files. — Michael - sqlbot, Oct 13 '18 at 11:26
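
For reference, the ETag of a multipart upload is commonly reported to be the MD5 of the concatenated binary MD5 digests of the individual parts, followed by a hyphen and the part count, which would make the -4 above a four-part upload. The following is a minimal sketch under that assumption, not a contract guaranteed by AWS. `multipartETag` and its `partSize` parameter are hypothetical names; the part size has to match whatever the uploading client used and cannot be recovered from the ETag alone. The sketch uses CryptoKit's `Insecure.MD5`, so it assumes iOS 13 or later.

```swift
import Foundation
import CryptoKit

/// Computes an S3-style multipart ETag for `data`, ASSUMING the object was
/// uploaded in fixed-size parts of `partSize` bytes (hypothetical parameter:
/// the real part size depends on the uploading client).
func multipartETag(for data: Data, partSize: Int) -> String {
    precondition(partSize > 0, "partSize must be positive")

    var concatenatedDigests = Data()
    var partCount = 0
    var offset = data.startIndex

    while offset < data.endIndex {
        // Clamp the end of the last part to the end of the data.
        let end = data.index(offset, offsetBy: partSize, limitedBy: data.endIndex) ?? data.endIndex
        // Append the raw (binary) MD5 of this part, not its hex string.
        concatenatedDigests.append(contentsOf: Insecure.MD5.hash(data: data[offset..<end]))
        partCount += 1
        offset = end
    }

    // MD5 over the concatenated binary digests, hex-encoded, plus "-<part count>".
    let finalDigest = Insecure.MD5.hash(data: concatenatedDigests)
    let hex = finalDigest.map { String(format: "%02x", $0) }.joined()
    return "\(hex)-\(partCount)"
}
```

Note that this version takes the whole file as `Data`, which may be too memory-hungry for large videos; reading the file part by part would avoid that. Since the result depends on a part size chosen by the uploader, a mismatch does not necessarily mean the download is corrupt.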

score 0 · Accepted Answer · answered Oct 12 '18 at 11:08

I think a more viable strategy would be to store a pre-calculated hash in your structured response (you most likely have a JSON, XML, <insert your favourite wire format here> that references the S3 URL, don't you?).

  {
    "url": "https://.../myfile.mpeg",
    "sha256": "9e7bf344f14a1fd2f98abbd736fa3c777ef6088e9b964858bbb524e88322a938"
  }

Relying on S3's ETag generation algorithm will break whenever they decide to change the implementation. Plus, CDNs usually handle ETags poorly, and ETags tend to differ from mirror to mirror (I worked at a company that rolled out a private CDN where that was the case). So if you decide to move away from S3, your logic may break as well.
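
A minimal sketch of the client-side check, assuming a manifest shaped like the JSON above and CryptoKit (iOS 13.4 or later for `read(upToCount:)`). The names `VideoManifest`, `sha256Hex(ofFileAt:chunkSize:)` and `isIntact(fileAt:manifest:)` are hypothetical, chosen only for illustration. The file is hashed in chunks so that a large video does not have to be loaded into memory at once.

```swift
import Foundation
import CryptoKit

/// Hypothetical manifest entry, mirroring the JSON shown above.
struct VideoManifest: Decodable {
    let url: URL
    let sha256: String
}

/// Streams the file through SHA-256 in fixed-size chunks and returns the
/// digest as a lowercase hex string.
func sha256Hex(ofFileAt fileURL: URL, chunkSize: Int = 1 << 20) throws -> String {
    let handle = try FileHandle(forReadingFrom: fileURL)
    defer { try? handle.close() }

    var hasher = SHA256()
    // read(upToCount:) returns nil or empty data at end of file.
    while let chunk = try handle.read(upToCount: chunkSize), !chunk.isEmpty {
        hasher.update(data: chunk)
    }
    return hasher.finalize().map { String(format: "%02x", $0) }.joined()
}

/// Compares the hash of the downloaded file with the one from the manifest.
/// The manifest value is lowercased so that upper-case hex also matches.
func isIntact(fileAt fileURL: URL, manifest: VideoManifest) throws -> Bool {
    try sha256Hex(ofFileAt: fileURL) == manifest.sha256.lowercased()
}
```

The manifest can be decoded with `JSONDecoder().decode(VideoManifest.self, from: jsonData)`, and `isIntact` can then be called once the download has been moved to its final location.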

ivanmoskalev


Thank you, ivanmoskalev! Your idea sounds brilliant. It's true that, before downloading the videos from S3, I first download a JSON file with further information about each video. So why not add an md5 or sha256 key? Generating such a key is easy: simply open the terminal and run the md5 or shasum command. Another big advantage: I no longer depend on the provider's eTag algorithm. – Ulrich Vormbrock Oct 12 '18 at 16:23
@UlrichVormbrock glad my answer helped! Please mark it as accepted if the issue is closed, so that the question stops showing up as unanswered. – ivanmoskalev Oct 15 '18 at 10:06
OK, I did it. I had overlooked the checkmark on the left-hand side of your answer. Besides, your approach works fine: I download the fingerprint from the JSON and store it in Core Data. Later I check it against the fingerprint of the downloaded file. – Ulrich Vormbrock Oct 15 '18 at 11:12
Glad it worked for you! – ivanmoskalev Oct 15 '18 at 12:34
