73

I've used Amazon S3 a little bit for backups for some time. Usually, after I upload a file, I check that the MD5 sum matches to ensure I've made a good backup. S3 has the "Etag" header, which used to give this sum.

However, when I uploaded a large file recently, the Etag no longer seems to be an MD5 sum. It has extra digits and a hyphen: "696df35ad1161afbeb6ea667e5dd5dab-2861". I've checked using the S3 management console and with Cyberduck.

I can't find any documentation about this change. Any pointers?

jjh
  • 1
    I think it has something to do with the file being >5Gb and therefore a multi-part upload. But I still can't find what the etag now means for large files. – jjh Jul 06 '11 at 04:48
  • 1
    Files > 16GB will be chunked into 5GB multiparts. – seanyboy Mar 06 '14 at 14:46

15 Answers

48

You will always get this style of ETag when uploading a multipart file. If you upload the whole file as a single upload, then you will get an ETag without the -{xxxx} suffix.

Bucket Explorer will show the unsuffixed ETag for a multipart file up to 5Gb.

AWS:

The ETag for an object created using the multipart upload api will contain one or more non-hexadecimal characters and/or will consist of less than 16 or more than 16 hexadecimal digits.

Reference: https://forums.aws.amazon.com/thread.jspa?messageID=203510#203510

Martin
Tej Kiran
36

Amazon S3 calculates the ETag with a different algorithm (not the usual MD5 sum) when you upload a file using multipart upload.

This algorithm is detailed here: http://permalink.gmane.org/gmane.comp.file-systems.s3.s3tools/583

"Calculate the MD5 hash for each uploaded part of the file, concatenate the hashes into a single binary string and calculate the MD5 hash of that result."

I developed a tool in bash to calculate it, s3md5: https://github.com/Teachnova/s3md5

For example, to calculate the ETag of a file foo.bin that has been uploaded using multipart with a chunk size of 15 MB, run:

# s3md5 15 foo.bin

Now you can check the integrity of a very big file (bigger than 5 GB), because you can calculate the ETag of the local file and compare it with the S3 ETag.
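To fetch the remote ETag for the comparison (not part of s3md5 itself; this assumes the aws cli is installed, the bucket and key names below are placeholders, and the value comes back wrapped in double quotes), a head-object call works:

aws s3api head-object --bucket my-bucket --key foo.bin --query ETag --output text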

  • 1
    How do you know what chunk size was used if someone else did it? – DavidGamba Jul 25 '14 at 14:58
  • I am not entirely sure this is correct but it worked to calculate the chunk size for my files: `chunks=$(echo $etag | cut -d'-' -f 2); filesize=$(du -b $file | cut -f 1); echo "($filesize / (1024 * 1024)) / $chunks" + 1 | bc`. So with that maybe your script can add an option like `s3md5 -etag $etag filename` so you don't need to provide the chunk size. – DavidGamba Jul 25 '14 at 15:45
  • 2
    @DavidG Scan the feasible range of chunk sizes. While not foolproof, I've found that Amazon Import/Export and the 'aws s3' command line tool use integer-GB chunks for files from a few to several hundred GB (perhaps always). If you take the part of the etag after the hyphen, it tells you the number of chunks. That usually leaves a small number of possible chunk sizes that could generate that number of chunks. You can iterate through them to see if it matches the etag you are given. I use Antonio's script for each iteration. – RaveTheTadpole Sep 02 '14 at 20:16
  • @RaveTheTadpole Thanks, that is what I am doing now, take a look at the commit I added to the script: https://github.com/Teachnova/s3md5/commit/994bca02c23fad5766f3bfb93a8ed7abfb06fce8 in line 173. – DavidGamba Sep 02 '14 at 20:34
  • -1. You should reconsider that tool of yours: it does not really add any functionality over s3cmd, you mix `$( )` and backtick subshells, do `RM_BIN='/bin/rm -rf'` and lots of other mistakes. I stopped reading, but it's definitely not a tool I would rely on. – 7heo.tk Apr 28 '15 at 13:33
  • 1
    @7heo-tk could you tell me which s3cmd subcommand you use to calculate Etag locally? – Antonio Espinosa May 08 '15 at 09:18
  • You can use the head or getAttributes calls to find the number of parts. From there you might be able to infer part size. But parts don't have to be a power of 2 in size, or even equal in size. – falsePockets Nov 22 '22 at 08:07
23

Also in Python...

#!/usr/bin/env python3
import binascii
import hashlib
import os

# Max size in bytes before uploading in parts. 
AWS_UPLOAD_MAX_SIZE = 20 * 1024 * 1024
# Size of parts when uploading in parts
# note: 2022-01-27 bitnami-minio container uses 5 mib
AWS_UPLOAD_PART_SIZE = int(os.environ.get('AWS_UPLOAD_PART_SIZE', 5 * 1024 * 1024))

def md5sum(sourcePath):
    '''
    Function: md5sum
    Purpose: Get the md5 hash of a file stored in S3
    Returns: Returns the md5 hash that will match the ETag in S3    
    '''

    filesize = os.path.getsize(sourcePath)
    hash = hashlib.md5()

    if filesize > AWS_UPLOAD_MAX_SIZE:

        block_count = 0
        md5bytes = b""
        with open(sourcePath, "rb") as f:
            block = f.read(AWS_UPLOAD_PART_SIZE)
            while block:
                hash = hashlib.md5()
                hash.update(block)
                block = f.read(AWS_UPLOAD_PART_SIZE)
                md5bytes += binascii.unhexlify(hash.hexdigest())
                block_count += 1

        hash = hashlib.md5()
        hash.update(md5bytes)
        hexdigest = hash.hexdigest() + "-" + str(block_count)

    else:
        with open(sourcePath, "rb") as f:
            block = f.read(AWS_UPLOAD_PART_SIZE)
            while block:
                hash.update(block)
                block = f.read(AWS_UPLOAD_PART_SIZE)
        hexdigest = hash.hexdigest()
    return hexdigest
Trevor Boyd Smith
Spedge
  • 3
    open mode should be "rb" instead of "r+b" so that read-only files can be processed. – Marc Rochkind Apr 02 '15 at 19:06
  • One occurrence still needs fixing. – Marc Rochkind Apr 03 '15 at 13:00
  • you are adding a string and integer. md5string + binascii.unhexlify(hash.hexdigest()) Unless you meant to add together all the integers? – Micah Sep 08 '15 at 21:56
  • 2
    Is there a reason why you do `md5string = md5string + binascii.unhexlify(hash.hexdigest())` vs `md5string = md5string + hash.digest()`? – Marc Jan 27 '16 at 20:33
  • 6
    Worked for me by changing `AWS_UPLOAD_PART_SIZE = 8 * 1024 * 1024` and adjusting for python 3 by changing empty strings to`b""` – DikobrAz Jun 13 '17 at 21:48
  • @DikobrAz i updated the code (before seeing your comment) and my updated code handles both of those issues (updated code is python3, and you can set the part size value via env var). – Trevor Boyd Smith Jan 27 '22 at 22:47
8

Here is an example in Go:

import (
    "crypto/md5"
    "fmt"
    "io/ioutil"
)

func GetEtag(path string, partSizeMb int) string {
    partSize := partSizeMb * 1024 * 1024
    content, _ := ioutil.ReadFile(path)
    size := len(content)
    contentToHash := content
    parts := 0

    if size > partSize {
        pos := 0
        contentToHash = make([]byte, 0)
        for size > pos {
            endpos := pos + partSize
            if endpos >= size {
                endpos = size
            }
            hash := md5.Sum(content[pos:endpos])
            contentToHash = append(contentToHash, hash[:]...)
            pos += partSize
            parts += 1
        }
    }

    hash := md5.Sum(contentToHash)
    etag := fmt.Sprintf("%x", hash)
    if parts > 0 {
        etag += fmt.Sprintf("-%d", parts)
    }
    return etag
}

This is just an example; you should handle errors and so on.

roeland
  • Thanks for the inspiration - I made a utility based on this and posted as a separate answer. 2 issues found for me: runs out of memory for big files, and assumes files – lambfrier Feb 04 '19 at 05:19
7

Here's a PowerShell function to calculate the Amazon ETag for a file:

$blocksize = (1024*1024*5)
$startblocks = (1024*1024*16)
function AmazonEtagHashForFile($filename) {
    $lines = 0
    [byte[]] $binHash = @()

    $md5 = [Security.Cryptography.HashAlgorithm]::Create("MD5")
    $reader = [System.IO.File]::Open($filename,"OPEN","READ")

    if ((Get-Item $filename).length -gt $startblocks) {
        $buf = new-object byte[] $blocksize
        while (($read_len = $reader.Read($buf,0,$buf.length)) -ne 0){
            $lines   += 1
            $binHash += $md5.ComputeHash($buf,0,$read_len)
        }
        $binHash=$md5.ComputeHash( $binHash )
    }
    else {
        $lines   = 1
        $binHash += $md5.ComputeHash($reader)
    }

    $reader.Close()

    $hash = [System.BitConverter]::ToString( $binHash )
    $hash = $hash.Replace("-","").ToLower()

    if ($lines -gt 1) {
        $hash = $hash + "-$lines"
    }

    return $hash
}
Steve Rukuts
seanyboy
5

If you use multipart uploads, the "etag" is not the MD5 sum of the data (see What is the algorithm to compute the Amazon-S3 Etag for a file larger than 5GB?). One can identify this case by the etag containing a dash, "-".

Now, the interesting question is how to get the actual MD5 sum of the data without downloading it. One easy way is to just "copy" the object onto itself; this requires no download:

s3cmd cp s3://bucket/key s3://bucket/key

This will cause S3 to recompute the MD5 sum and store it as "etag" of the just copied object. The "copy" command runs directly on S3, i.e., no object data is transferred to/from S3, so this requires little bandwidth! (Note: do not use s3cmd mv; this would delete your data.)

The underlying REST command is:

PUT /key HTTP/1.1
Host: bucket.s3.amazonaws.com
x-amz-copy-source: /bucket/key
x-amz-metadata-directive: COPY
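For reference, a rough aws cli equivalent of that REST call (this is not from the original answer; the bucket and key names are placeholders, and see the comments below for a case where the in-place copy did not refresh the ETag):

aws s3api copy-object --bucket bucket --key key --copy-source bucket/key --metadata-directive COPY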
hrr
  • Unless cp somehow executes on the S3 server, this seems to involve twice as much traffic as a download. Once downloaded, the checksum can be easily calculated on the file itself. Can you explain what's going on with your answer? – Marc Rochkind Apr 02 '15 at 14:48
  • 2
    Yes, this "copy" executes on the S3 server (I have updated the answer to mention this explicitly) -- this is why I find this very useful for computing MD5 sums. – hrr Apr 07 '15 at 15:02
  • This didn't work for me. The etag remained unchanged. – Synesso Jun 17 '15 at 06:03
  • @Synesso, this is surprising. Can you create an md5-etag if you copy to another object, i.e., `s3cmd cp s3://bucket/key s3://bucket/key2`? – hrr Jun 19 '15 at 18:21
4

Copying to S3 with aws s3 cp can use multipart uploads and the resulting ETag will not be an MD5, as others have written.

To upload files without multipart, use the lower-level put-object command.

aws s3api put-object --bucket bucketname --key remote/file --body local/file
Synesso
4

This AWS support page - How do I ensure data integrity of objects uploaded to or downloaded from Amazon S3? - describes a more reliable way to verify the integrity of your s3 backups.

First, determine the base64-encoded md5sum of the file you wish to upload:

$ md5_sum_base64="$( openssl md5 -binary my-file | base64 )"

Then use the s3api to upload the file:

$ aws s3api put-object --bucket my-bucket --key my-file --body my-file --content-md5 "$md5_sum_base64"

Note the use of the --content-md5 flag; the help for this flag states:

--content-md5  (string)  The  base64-encoded  128-bit MD5 digest of the part data.

This does not say much about why to use this flag, but we can find this information in the API documentation for put object:

To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error. Additionally, you can calculate the MD5 while putting an object to Amazon S3 and compare the returned ETag to the calculated MD5 value.

Using this flag causes S3 to verify that the file hash server-side matches the specified value. If the hashes match, S3 will return the ETag:

{
    "ETag": "\"599393a2c526c680119d84155d90f1e5\""
}

The ETag value will usually be the hexadecimal md5sum (see this question for some scenarios where this may not be the case).

If the hash does not match the one you specified, you get an error:

A client error (InvalidDigest) occurred when calling the PutObject operation: The Content-MD5 you specified was invalid.

In addition, you can also add the file's md5sum to the object metadata as an extra check:

$ aws s3api put-object --bucket my-bucket --key my-file --body my-file --content-md5 "$md5_sum_base64" --metadata md5chksum="$md5_sum_base64"

After upload you can issue the head-object command to check the values.

$ aws s3api head-object --bucket my-bucket --key my-file
{
    "AcceptRanges": "bytes",
    "ContentType": "binary/octet-stream",
    "LastModified": "Thu, 31 Mar 2016 16:37:18 GMT",
    "ContentLength": 605,
    "ETag": "\"599393a2c526c680119d84155d90f1e5\"",
    "Metadata": {    
        "md5chksum": "WZOTosUmxoARnYQVXZDx5Q=="    
    }    
}

Here is a bash script that uses content-md5, adds the metadata, and then verifies that the values returned by S3 match the local hashes:

#!/bin/bash

set -euf -o pipefail

# assumes you have aws cli, jq installed

# change these if required
tmp_dir="$HOME/tmp"
s3_dir="foo"
s3_bucket="stack-overflow-example"
aws_region="ap-southeast-2"
aws_profile="my-profile"

test_dir="$tmp_dir/s3-md5sum-test"
file_name="MailHog_linux_amd64"
test_file_url="https://github.com/mailhog/MailHog/releases/download/v1.0.0/MailHog_linux_amd64"
s3_key="$s3_dir/$file_name"
return_dir="$( pwd )"

cd "$tmp_dir" || exit
mkdir "$test_dir"
cd "$test_dir" || exit

wget "$test_file_url"

md5_sum_hex="$( md5sum $file_name | awk '{ print $1 }' )"
md5_sum_base64="$( openssl md5 -binary $file_name | base64 )"

echo "$file_name hex    = $md5_sum_hex"
echo "$file_name base64 = $md5_sum_base64"

echo "Uploading $file_name to s3://$s3_bucket/$s3_dir/$file_name"
aws \
--profile "$aws_profile" \
--region "$aws_region" \
s3api put-object \
--bucket "$s3_bucket" \
--key "$s3_key" \
--body "$file_name" \
--metadata md5chksum="$md5_sum_base64" \
--content-md5 "$md5_sum_base64"

echo "Verifying sums match"

s3_md5_sum_hex=$( aws --profile "$aws_profile"  --region "$aws_region" s3api head-object --bucket "$s3_bucket" --key "$s3_key" | jq -r '.ETag' | sed 's/"//'g )
s3_md5_sum_base64=$( aws --profile "$aws_profile"  --region "$aws_region" s3api head-object --bucket "$s3_bucket" --key "$s3_key" | jq -r '.Metadata.md5chksum' )

if [ "$md5_sum_hex" == "$s3_md5_sum_hex" ] && [ "$md5_sum_base64" == "$s3_md5_sum_base64" ]; then
    echo "checksums match"
else
    echo "something is wrong checksums do not match:"

    cat <<EOM | column -t -s ' '
$file_name file hex:    $md5_sum_hex    s3 hex:    $s3_md5_sum_hex
$file_name file base64: $md5_sum_base64 s3 base64: $s3_md5_sum_base64
EOM

fi

echo "Cleaning up"
cd "$return_dir"
rm -rf "$test_dir"
aws \
--profile "$aws_profile" \
--region "$aws_region" \
s3api delete-object \
--bucket "$s3_bucket" \
--key "$s3_key"
htaccess
2

Here is a C# version:

    string etag = HashOf("file.txt",8);

Source code:

    private string HashOf(string filename,int chunkSizeInMb)
    {
        string returnMD5 = string.Empty;
        int chunkSize = chunkSizeInMb * 1024 * 1024;

        using (var crypto = new MD5CryptoServiceProvider())
        {
            int hashLength = crypto.HashSize/8;

            using (var stream = File.OpenRead(filename))
            {
                if (stream.Length > chunkSize)
                {
                    int chunkCount = (int)Math.Ceiling((double)stream.Length/(double)chunkSize);

                    byte[] hash = new byte[chunkCount*hashLength];
                    Stream hashStream = new MemoryStream(hash);

                    long nByteLeftToRead = stream.Length;
                    while (nByteLeftToRead > 0)
                    {
                        int nByteCurrentRead = (int)Math.Min(nByteLeftToRead, chunkSize);
                        byte[] buffer = new byte[nByteCurrentRead];
                        nByteLeftToRead -= stream.Read(buffer, 0, nByteCurrentRead);

                        byte[] tmpHash = crypto.ComputeHash(buffer);

                        hashStream.Write(tmpHash, 0, hashLength);

                    }

                    returnMD5 = BitConverter.ToString(crypto.ComputeHash(hash)).Replace("-", string.Empty).ToLower()+"-"+ chunkCount;
                }
                else {
                    returnMD5 = BitConverter.ToString(crypto.ComputeHash(stream)).Replace("-", string.Empty).ToLower();

                }
                stream.Close();
            }
        }
        return returnMD5;
    }
2

To go one step beyond the OP's question: chances are these chunked ETags are making your life difficult when trying to compare them client-side.

If you are publishing your artifacts to S3 using the awscli commands (cp, sync, etc.), the default threshold at which multipart upload seems to be used is 10MB. Recent awscli releases allow you to configure this threshold, so you can disable multipart uploads and get an easy-to-use MD5 ETag:

aws configure set default.s3.multipart_threshold 64MB

Full documentation here: http://docs.aws.amazon.com/cli/latest/topic/s3-config.html
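The same threshold can also be set in ~/.aws/config instead of via aws configure (a sketch following the s3-config topic linked above; the default profile is assumed):

[default]
s3 =
    multipart_threshold = 64MB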

A consequence of this could be downgraded upload performance (I honestly did not notice). But the result is that all files smaller than your configured threshold will now have normal MD5 hash ETags, making them much easier to delta client side.

This does require a somewhat recent awscli install. My previous version (1.2.9) did not support this option, so I had to upgrade to 1.10.x.

I was able to set my threshold up to 1024MB successfully.

jdolan
2

Based on answers here, I wrote a Python implementation which correctly calculates both multi-part and single-part file ETags.

import hashlib


def calculate_s3_etag(file_path, chunk_size=8 * 1024 * 1024):
    md5s = []

    with open(file_path, 'rb') as fp:
        while True:
            data = fp.read(chunk_size)
            if not data:
                break
            md5s.append(hashlib.md5(data))

    if len(md5s) == 1:
        return '"{}"'.format(md5s[0].hexdigest())

    digests = b''.join(m.digest() for m in md5s)
    digests_md5 = hashlib.md5(digests)
    return '"{}-{}"'.format(digests_md5.hexdigest(), len(md5s))

The default chunk_size of 8 MB is what the official aws cli tool uses, and it does a multipart upload for 2+ chunks. It should work under both Python 2 and 3.
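A quick usage sketch (the bucket and key below are made up), comparing against what boto3 reports; note that both values include the surrounding double quotes, so they can be compared directly:

import boto3

s3 = boto3.client('s3')
# head_object returns the ETag wrapped in double quotes, which matches the
# quoted string that calculate_s3_etag() produces.
remote_etag = s3.head_object(Bucket='my-bucket', Key='big_file.bin')['ETag']
print(remote_etag == calculate_s3_etag('big_file.bin'))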

hyperknot
1

Improving on @Spedge's and @Rob's answers, here is a Python 3 md5 function that takes in a file-like object and does not rely on being able to get the file size with os.path.getsize.

import hashlib


# Function : md5sum
# Purpose : Get the md5 hash of a file stored in S3
# Returns : Returns the md5 hash that will match the ETag in S3
# https://github.com/boto/boto3/blob/0cc6042615fd44c6822bd5be5a4019d0901e5dd2/boto3/s3/transfer.py#L169
def md5sum(file_like,
           multipart_threshold=8 * 1024 * 1024,
           multipart_chunksize=8 * 1024 * 1024):
    md5hash = hashlib.md5()
    file_like.seek(0)
    filesize = 0
    block_count = 0
    md5string = b''
    for block in iter(lambda: file_like.read(multipart_chunksize), b''):
        md5hash = hashlib.md5()
        md5hash.update(block)
        md5string += md5hash.digest()
        filesize += len(block)
        block_count += 1

    if filesize > multipart_threshold:
        md5hash = hashlib.md5()
        md5hash.update(md5string)
        md5hash = md5hash.hexdigest() + "-" + str(block_count)
    else:
        md5hash = md5hash.hexdigest()

    file_like.seek(0)
    return md5hash
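A minimal usage sketch (the file name is made up); since the function only needs a file-like object, any readable binary stream will do:

with open('big_file.bin', 'rb') as f:
    print(md5sum(f))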
1

I built on r03's answer and have a standalone Go utility for this here: https://github.com/lambfrier/calc_s3_etag

Example usage:

$ dd if=/dev/zero bs=1M count=10 of=10M_file
$ calc_s3_etag 10M_file
669fdad9e309b552f1e9cf7b489c1f73-2
$ calc_s3_etag -chunksize=15 10M_file
9fbaeee0ccc66f9a8e3d3641dca37281-1
lambfrier
0

Of course, the multipart upload of files can be a common issue. In my case, I was serving static files through S3 and the ETag of a .js file was coming out different from the local file even though the content was the same.

It turns out that even though the content looked the same, the line endings were different. I fixed the line endings in my git repository, uploaded the changed files to S3, and it works fine now.

Gaurav Toshniwal
-2

The Python example works great, but when working with Bamboo, they set the part size to 5 MB, which is NON STANDARD!! (s3cmd uses 15 MB.) I also adjusted it to use 1024 to calculate bytes.

Revised to work for Bamboo artifact S3 repos.

import hashlib
import binascii
import os


# Max size in bytes before uploading in parts. 
AWS_UPLOAD_MAX_SIZE = 20 * 1024 * 1024
# Size of parts when uploading in parts
AWS_UPLOAD_PART_SIZE = 5 * 1024 * 1024

#
# Function : md5sum
# Purpose : Get the md5 hash of a file stored in S3
# Returns : Returns the md5 hash that will match the ETag in S3
def md5sum(sourcePath):

    filesize = os.path.getsize(sourcePath)
    hash = hashlib.md5()

    if filesize > AWS_UPLOAD_MAX_SIZE:

        block_count = 0
        md5string = ""
        with open(sourcePath, "rb") as f:
            for block in iter(lambda: f.read(AWS_UPLOAD_PART_SIZE), ""):
                hash = hashlib.md5()
                hash.update(block)
                md5string = md5string + binascii.unhexlify(hash.hexdigest())
                block_count += 1

        hash = hashlib.md5()
        hash.update(md5string)
        return hash.hexdigest() + "-" + str(block_count)

    else:
        with open(sourcePath, "rb") as f:
            for block in iter(lambda: f.read(AWS_UPLOAD_PART_SIZE), ""):
                hash.update(block)
        return hash.hexdigest()
Rob
  • 1
    So all you've done in this iteration is repost the above snippet, but change the AWS_UPLOAD_PART_SIZE variable? Couldn't you just have added a comment to that effect? Now we've got two sets of the code to keep up to date. – Spedge Oct 01 '15 at 13:09
  • Posted the whole thing for convenience. – Rob Oct 01 '15 at 18:14