Reference the discussion in this link:

What is the algorithm to compute the Amazon-S3 Etag for a file larger than 5GB?

The steps to recreate the MD5 hash are to 1) concatenate the MD5 hashes for each upload part, 2) convert the concatenated hash into binary, 3) get the MD5 hash of the binary, then 4) add the hyphen and number of parts to the hash. That all sounds easy enough, but where I'm struggling is in step 3. To get the hash of the binary I need to convert the string into a byte array. To get the byte array I need to know what encoding format to use. That's the part I'm missing. Do I use ASCII, UTF8, Unicode, BigEndian, or something else?

I've tried using the four formats above and none have produced the correct hash. I just can't seem to figure this one out. The code I'm using is:

CompleteMultipartUploadResponse compResp = new CompleteMultipartUploadResponse();
CompleteMultipartUploadRequest compReq = new CompleteMultipartUploadRequest();
string requestETagHash = "";

compResp = client.CompleteMultipartUpload(compReq);
string compETag = compResp.ETag;                                            
foreach (PartETag s in compReq.PartETags)
{
    requestETagHash += s.ETag.Replace('\"', ' ').Trim().Split('-').First();
}

StringBuilder sb = new StringBuilder();
foreach (char c in requestETagHash)
{
    try
    {
         sb.AppendFormat(Convert.ToString(Convert.ToInt16(c.ToString(), 16), 2).PadLeft(4, '0'));
    }
    catch (Exception ex)
    {
        MessageBox.Show("Hash error:\n\n" + ex.Message);
    }
}
//What encoding is used in this line?
byte[] b = System.Text.Encoding.UTF8.GetBytes(sb.ToString());

byte[] data = md5Hash.ComputeHash(b, 0, b.Length);

StringBuilder sBuilder = new StringBuilder();
for (int i = 0; i < data.Length; i++)
{
    sBuilder.Append(data[i].ToString("x2"));
}

Any help in solving this would be appreciated.

user1750310
  • How are you uploading the actual data? It's not clear where text comes in here at all. – Jon Skeet May 17 '16 at 16:46
  • Note from the question you linked to: "Since MD5 checksums are hex representations of binary data, just make sure you take the MD5 of the decoded binary concatenation" – Jon Skeet May 17 '16 at 16:56
  • Basically it sounds like you're doing this too late - you should be computing each MD5 hash as a `byte[]`, then a) concatenating those `byte[]` hashes together (so you can hash the result again); b) converting each hash into hex for the etag. – Jon Skeet May 17 '16 at 16:57
  • Thanks, Jon. I was going to comment the code to make it more clear what is happening where, but can't seem to figure out how to do that. Regarding the note you quoted, that is what is tripping me up, and why I was converting the hash (which is hex) to binary. There's a piece in there that I'm not getting. – user1750310 May 17 '16 at 17:06
  • The hash doesn't start out as hex. You haven't shown the code that computes the hash of your data to start with. (You seem to be making the request right near the start, which is very odd to begin with... normally you'd do this before making the request, wouldn't you?) – Jon Skeet May 17 '16 at 17:08
  • I'm not computing a hash for each part prior to uploading as that doesn't seem to be necessary. I am collecting them in 'requestETagHash' once the upload is complete. From what I can figure out, that is where they should be converted to binary? – user1750310 May 17 '16 at 17:13
  • Well doing it before uploading is what I'd do - surely you want to check that the data that was received is the same as the data you had to start with. You shouldn't be converting anything to binary (although you *could* parse the hex back to bytes) - you should generate the MD5 hash (which is naturally binary to start with) yourself. – Jon Skeet May 17 '16 at 17:18
  • For single-part uploads I do compute the hash and verify it against the returned hash. My understanding of the multi-part uploads, though, is that won't work because the hash of the reassembled file that gets returned will be different. I've tried hashing the concatenated hash as well, but that isn't working either. The hash values I see when stepping through the code (VS2013) are in hex. If I'm not supposed to convert them to binary, then what is the meaning of "...make sure you take the MD5 of the decoded binary concatenation."? I'm stumped. – user1750310 May 17 '16 at 17:27
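
To illustrate what "the decoded binary concatenation" in the comments means, here is a minimal sketch of the alternative Jon mentions: parsing each part's hex ETag back into its 16 raw bytes, so that no text encoding is involved at all. The class and method names (`MultipartETag`, `HexToBytes`, `FromPartETags`) are hypothetical, and the sketch assumes every part ETag is a plain 32-character hex MD5, optionally wrapped in quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper; not part of the AWS SDK.
static class MultipartETag
{
    // Decodes a hex string into raw bytes: two hex digits become one byte.
    static byte[] HexToBytes(string hex)
    {
        var bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
        }
        return bytes;
    }

    // Builds "<md5 of concatenated part digests>-<part count>" from part ETags.
    public static string FromPartETags(IEnumerable<string> partETags)
    {
        // Strip whitespace and surrounding quotes, then decode each ETag to 16 bytes.
        List<byte[]> digests = partETags
            .Select(e => HexToBytes(e.Trim().Trim('"')))
            .ToList();

        // Concatenate the raw digests in part order.
        byte[] concatenated = digests.SelectMany(d => d).ToArray();

        using (var md5 = MD5.Create())
        {
            byte[] final = md5.ComputeHash(concatenated);
            string hex = BitConverter.ToString(final).Replace("-", "").ToLowerInvariant();
            return hex + "-" + digests.Count;
        }
    }
}
```

The point of the sketch is that the hex string is decoded, not encoded: `Encoding.UTF8.GetBytes` on the hex (or on a string of '0' and '1' characters) hashes the characters of the text rather than the bytes the text represents, which is why none of the encodings produced a match.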

1 Answer

Problem solved. Thank you, Jon! Your comment about my getting the hash late got me thinking about where to find the hash's byte array vs. the hex value I was using. I modified my code to get and concatenate the hash byte array immediately after uploading each file part. Then, after receiving the CompleteMultipartUploadResponse, I hash that concatenated array, and voila, I get the same hash as the ETag returned from S3 for the completed upload.
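
The modified code itself was not posted, so what follows is only a rough sketch of the approach described, not the original. It assumes each part's bytes are available in memory at upload time; the class `PartHashAccumulator` and its members are hypothetical names, and the actual upload calls are left out.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper: collects the raw MD5 digest of every uploaded part.
sealed class PartHashAccumulator : IDisposable
{
    private readonly MemoryStream _digests = new MemoryStream();
    private readonly MD5 _md5 = MD5.Create();
    private int _partCount;

    // Call once per part, in part order, with the exact bytes that were uploaded.
    public void AddPart(byte[] partData)
    {
        byte[] digest = _md5.ComputeHash(partData); // 16 raw bytes, not hex text
        _digests.Write(digest, 0, digest.Length);
        _partCount++;
    }

    // MD5 of the concatenated part digests, as lowercase hex, plus "-<part count>".
    public string ExpectedETag()
    {
        byte[] final = _md5.ComputeHash(_digests.ToArray());
        string hex = BitConverter.ToString(final).Replace("-", "").ToLowerInvariant();
        return hex + "-" + _partCount;
    }

    public void Dispose()
    {
        _md5.Dispose();
        _digests.Dispose();
    }
}
```

In use, `AddPart` would be called right after each part upload, and once the upload completes the result of `ExpectedETag()` would be compared with `compResp.ETag` after removing the surrounding quotes, for example with `compResp.ETag.Trim('"')`.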

user1750310
  • I'm still struggling to get the hash to match the ETag of the completed upload. Can you please share the code snippet you modified? – Rex Dec 02 '20 at 19:14