113

If I have existing files on Amazon's S3, what's the easiest way to get their md5sum without having to download the files?

Switch
  • The ETag header is MD5, but not for multipart files. Here is more info on how you can use it: http://stackoverflow.com/questions/6591047/etag-definition-changed-in-amazon-s3/31086810#31086810 – roeland Jul 21 '15 at 19:35
  • Is there no way to calculate an MD5 on an S3 object without retrieving the entire object and calculating locally? Currently, none of the answers actually address this very simple question and instead focus purely on the ETag. Most answers proposing the usage of the ETag even admit it's not a suitable replacement for a calculated MD5. – bsplosion Mar 05 '20 at 16:34

14 Answers

52

AWS's documentation of ETag says:

The entity tag is a hash of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data. Whether or not it is depends on how the object was created and how it is encrypted as described below:

  • Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-S3 or plaintext, have ETags that are an MD5 digest of their object data.
  • Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-C or SSE-KMS, have ETags that are not an MD5 digest of their object data.
  • If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption.

Reference: http://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html
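
In other words, for objects that fall under the first bullet, you can compare a locally computed MD5 against the ETag from a HEAD request, without downloading the object. A minimal boto3 sketch (bucket, key, and file names are placeholders):

import hashlib

import boto3

s3 = boto3.client('s3')

# A HEAD request returns the ETag without transferring the object body.
etag = s3.head_object(Bucket='my-bucket', Key='my-key')['ETag'].strip('"')

md5 = hashlib.md5()
with open('my-file', 'rb') as f:  # local copy of the object
    for block in iter(lambda: f.read(1 << 20), b''):
        md5.update(block)

# Only a valid comparison for non-multipart, SSE-S3/plaintext objects.
print(etag == md5.hexdigest())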

Dennis
27

The ETag does not seem to be an MD5 for multipart uploads (as per Gael Fraiteur's comment). In these cases it contains a suffix: a minus sign followed by a number. However, even the part before the minus does not seem to be the MD5, even though it is the same length as an MD5. Possibly the suffix is the number of parts uploaded?
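
A quick way to check for this case is to split the ETag on the hyphen. A minimal Python sketch (the ETag value is illustrative):

etag = "d41d8cd98f00b204e9800998ecf8427e-12"  # illustrative value

if '-' in etag:
    digest, parts = etag.split('-')
    print("multipart upload, {} parts".format(int(parts)))  # suffix = part count
else:
    print("single-part upload; the ETag may be a plain MD5")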

Duncan Harris
  • This suffix seems to appear only when the file is large (greater than 5GB). By inspecting the few files I have that are large, it does appear that the suffix represents the number of parts uploaded. However, the first portion does not appear to have the same md5 hash as the original file. When calculating this hash Amazon must be folding in some extra data for each part. I would like to know the algorithm so that I can check some of my files. – broc.seib Aug 29 '12 at 19:19
  • Hash algorithm is described here: http://stackoverflow.com/questions/6591047/etag-definition-changed-in-amazon-s3 – Nakedible Sep 04 '14 at 22:02
  • @broc.seib I'm seeing a suffix for files much smaller, such as one that's 18.3MB. I wonder if it depends on what is used to upload the file; I'm using `aws s3 cp ...` – Mark Apr 05 '18 at 18:11
  • @Mark The answer posted here has more detail: https://stackoverflow.com/a/19896823/516910 – broc.seib Apr 05 '18 at 22:54
16

This is a very old question, but I had a hard time finding the information below, and this was one of the first places I could find it, so I wanted to detail it here in case it helps anyone.

The ETag is an MD5. But for multipart-uploaded files, the ETag is computed from the concatenation of the MD5s of each uploaded part. So you don't need to compute the MD5 on the server. Just get the ETag and that's all.

As @EmersonFarrugia said in this answer:

Say you uploaded a 14MB file and your part size is 5MB. Calculate 3 MD5 checksums corresponding to each part, i.e. the checksum of the first 5MB, the second 5MB, and the last 4MB. Then take the checksum of their concatenation. Since MD5 checksums are hex representations of binary data, just make sure you take the MD5 of the decoded binary concatenation, not of the ASCII or UTF-8 encoded concatenation. When that's done, add a hyphen and the number of parts to get the ETag.

So the only other things you need are the ETag and the upload part size. But the ETag has a -NumberOfParts suffix, so you can divide the object size by the number of parts to recover the part size. 5 MB is the minimum part size and the default value. The part size is normally a whole number of megabytes, so you can't get something like 7.25 MB per part. So it should be easy to get the part size information.
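
To make that concrete, here's a minimal Python sketch of that reasoning (the function name is mine): given the object size and the part count from the ETag suffix, it lists the whole-megabyte part sizes that would produce that many parts.

import math

def candidate_part_sizes_mb(object_size_bytes, num_parts, max_mb=5120):
    """Whole-MB part sizes consistent with the ETag's part-count suffix."""
    for mb in range(5, max_mb + 1):  # 5 MB is the S3 minimum part size
        part_size = mb * 1024 * 1024
        if math.ceil(object_size_bytes / part_size) == num_parts:
            yield mb

# e.g. a 14 MB object with ETag suffix "-3" -> 5 MB or 6 MB parts
print(list(candidate_part_sizes_mb(14 * 1024 * 1024, 3)))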

Here is a script to do this on OS X, with a Linux version in the comments: https://gist.github.com/emersonf/7413337

I'll leave both scripts here in case the page above is no longer accessible in the future:

Linux version:

#!/bin/bash
set -euo pipefail

if [ $# -ne 2 ]; then
    echo "Usage: $0 file partSizeInMb";
    exit 0;
fi

file=$1
if [ ! -f "$file" ]; then
    echo "Error: $file not found."
    exit 1;
fi
partSizeInMb=$2

# Work out how many parts the multipart upload used.
fileSizeInMb=$(du -m "$file" | cut -f 1)
parts=$((fileSizeInMb / partSizeInMb))
if [[ $((fileSizeInMb % partSizeInMb)) -gt 0 ]]; then
    parts=$((parts + 1));
fi

# Compute the MD5 of each part, collecting the hex digests in a temp file.
checksumFile=$(mktemp -t s3md5.XXXXXXXXXXXXX)
for (( part=0; part<parts; part++ ))
do
    skip=$((partSizeInMb * part))
    dd bs=1M count=$partSizeInMb skip=$skip if="$file" 2> /dev/null | md5sum >> "$checksumFile"
done

# The ETag is the MD5 of the concatenated binary digests, plus "-<parts>".
etag=$(echo $(xxd -r -p "$checksumFile" | md5sum)-$parts | sed 's/ --/-/')
echo -e "${1}\t${etag}"
rm "$checksumFile"

OSX version:

#!/bin/bash

if [ $# -ne 2 ]; then
    echo "Usage: $0 file partSizeInMb";
    exit 0;
fi

file=$1
if [ ! -f "$file" ]; then
    echo "Error: $file not found."
    exit 1;
fi
partSizeInMb=$2

# Work out how many parts the multipart upload used.
fileSizeInMb=$(du -m "$file" | cut -f 1)
parts=$((fileSizeInMb / partSizeInMb))
if [[ $((fileSizeInMb % partSizeInMb)) -gt 0 ]]; then
    parts=$((parts + 1));
fi

# Compute the MD5 of each part (BSD md5 and lowercase 1m block size for BSD dd).
checksumFile=$(mktemp -t s3md5)
for (( part=0; part<parts; part++ ))
do
    skip=$((partSizeInMb * part))
    dd bs=1m count=$partSizeInMb skip=$skip if="$file" 2>/dev/null | md5 >> "$checksumFile"
done

# The ETag is the MD5 of the concatenated binary digests, plus "-<parts>".
echo $(xxd -r -p "$checksumFile" | md5)-$parts
rm "$checksumFile"
Nelson Teixeira
12

Below is what worked for me to compare a local file's checksum with the S3 ETag, using Python:

import hashlib

import boto3


def md5_checksum(filename):
    # Plain MD5 of the whole file (matches the ETag of single-part uploads).
    m = hashlib.md5()
    with open(filename, 'rb') as f:
        for data in iter(lambda: f.read(1024 * 1024), b''):
            m.update(data)
    return m.hexdigest()


def etag_checksum(filename, chunk_size=8 * 1024 * 1024):
    # Multipart-style ETag: MD5 of the concatenated per-part MD5 digests,
    # with the number of parts appended after a hyphen. chunk_size must
    # match the part size used at upload (AWS CLI default: 8 MB).
    md5s = []
    with open(filename, 'rb') as f:
        for data in iter(lambda: f.read(chunk_size), b''):
            md5s.append(hashlib.md5(data).digest())
    m = hashlib.md5(b"".join(md5s))
    return '{}-{}'.format(m.hexdigest(), len(md5s))


def etag_compare(filename, etag):
    et = etag[1:-1]  # strip surrounding quotes
    if '-' in et and et == etag_checksum(filename):
        return True
    if '-' not in et and et == md5_checksum(filename):
        return True
    return False


def main():
    # s3_accesskey, s3_secret, bucket_name, your_key and filename
    # are placeholders to fill in.
    session = boto3.Session(
        aws_access_key_id=s3_accesskey,
        aws_secret_access_key=s3_secret
    )
    s3 = session.client('s3')
    obj_dict = s3.get_object(Bucket=bucket_name, Key=your_key)
    etag = obj_dict['ETag']
    return etag_compare(filename, etag)
li Anna
6

As of 2022-02-25, S3 features a new Checksum Retrieval function GetObjectAttributes:

New – Additional Checksum Algorithms for Amazon S3 | AWS News Blog

Checksum Retrieval – The new GetObjectAttributes function returns the checksum for the object and (if applicable) for each part.

This function supports SHA-1, SHA-256, CRC-32, and CRC-32C for checking the integrity of the transmission.
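
For example, a minimal boto3 sketch (bucket and key are placeholders; the Checksum field is only present if the object was uploaded with one of these checksum algorithms):

import boto3

s3 = boto3.client('s3')

# Retrieves the stored checksum (and per-part info) without downloading the object.
attrs = s3.get_object_attributes(
    Bucket='my-bucket',
    Key='my-key',
    ObjectAttributes=['Checksum', 'ObjectParts'],
)
print(attrs.get('Checksum', {}).get('ChecksumSHA256'))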

Update: It appears that while this GetObjectAttributes approach works in many cases, there are circumstances like console uploads where the checksums are calculated based on 16 MB chunks. See e.g. Checking object integrity:

When you perform some operations using the AWS Management Console, Amazon S3 uses a multipart upload if the object is greater than 16 MB in size. In this case, the checksum is not a direct checksum of the full object, but rather a calculation based on the checksum values of each individual part.

For example, consider an object 100 MB in size that you uploaded as a single-part direct upload using the REST API. The checksum in this case is a checksum of the entire object. If you later use the console to rename that object, copy it, change the storage class, or edit the metadata, Amazon S3 uses the multipart upload functionality to update the object. As a result, Amazon S3 creates a new checksum value for the object that is calculated based on the checksum values of the individual parts.

It appears that MD5 is actually not an option for the new features, so this may not resolve your original question, but MD5 is deprecated for lots of reasons, and if use of an alternate checksum works for you, this may be what you're looking for.

nealmcb
  • Is there any AWS documentation where they announced MD5 deprecation? – Rauf Aghayev Mar 14 '23 at 22:14
  • @RaufAghayev The fact that Amazon doesn't include MD5 in the set of checksums returned by `GetObjectAttributes` suggests that they recognize issues with it. Many such issues with MD5 are documented, see e.g. https://en.wikipedia.org/wiki/MD5 and it is officially deprecated for many purposes, e.g. https://www.rfc-editor.org/rfc/rfc9155.pdf But there may be less security-conscious applications where the risks of using it are low. – nealmcb Mar 29 '23 at 18:15
  • This still doesn't work for files larger than 16MB (multipart uploads) according to the docs - https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html#large-object-checksums – orogers Jun 14 '23 at 21:53
  • Thanks, @orogers - looks like there can indeed be chunking complications. Though from my reading of the docs, that is only a problem for console uploads. I've updated my answer. – nealmcb Jun 21 '23 at 14:30
5

For anyone who has spent time searching around to find out why the MD5 is not the same as the ETag in S3:

The ETag is calculated against chunks of data: S3 takes the MD5 of each uploaded chunk, concatenates those digests, takes the MD5 of that concatenation, and appends the number of chunks at the end.

Here is a C# version that generates the hash:

    string etag = HashOf("file.txt",8);

Source code:

    // Requires: System, System.IO, System.Security.Cryptography
    private string HashOf(string filename, int chunkSizeInMb)
    {
        string returnMD5 = string.Empty;
        int chunkSize = chunkSizeInMb * 1024 * 1024;

        using (var crypto = new MD5CryptoServiceProvider())
        {
            int hashLength = crypto.HashSize / 8;

            using (var stream = File.OpenRead(filename))
            {
                if (stream.Length > chunkSize)
                {
                    // Multipart case: hash each chunk, then hash the
                    // concatenated digests and append "-<chunkCount>".
                    int chunkCount = (int)Math.Ceiling((double)stream.Length / (double)chunkSize);

                    byte[] hash = new byte[chunkCount * hashLength];
                    Stream hashStream = new MemoryStream(hash);

                    long nByteLeftToRead = stream.Length;
                    while (nByteLeftToRead > 0)
                    {
                        int nByteCurrentRead = (int)Math.Min(nByteLeftToRead, chunkSize);
                        byte[] buffer = new byte[nByteCurrentRead];
                        nByteLeftToRead -= stream.Read(buffer, 0, nByteCurrentRead);

                        byte[] tmpHash = crypto.ComputeHash(buffer);
                        hashStream.Write(tmpHash, 0, hashLength);
                    }

                    returnMD5 = BitConverter.ToString(crypto.ComputeHash(hash)).Replace("-", string.Empty).ToLower() + "-" + chunkCount;
                }
                else
                {
                    // Single-part case: the ETag is just the plain MD5.
                    returnMD5 = BitConverter.ToString(crypto.ComputeHash(stream)).Replace("-", string.Empty).ToLower();
                }
            }
        }
        return returnMD5;
    }
  • This code works for me for small files. Large files give me a different hash. – Tono Nam Aug 26 '19 at 17:01
  • How large is the file? – Pitipong Guntawong Aug 28 '19 at 00:53
  • How do you get the chunk size of an S3 multipart object key? – Daniel Aug 29 '19 at 03:18
  • It depends on the upload software. You can set the chunk size when you upload via the AWS CLI (the default is 8MB). ref: https://docs.aws.amazon.com/cli/latest/topic/s3-config.html – Pitipong Guntawong Aug 30 '19 at 04:26
  • Just my two cents worth on this subject... I needed to validate a file that I had downloaded with the S3 SDK transfer utility... If the file existed already, I used var kSize = ETag.Split('-'); var tSize = double.TryParse(kSize.LastOrDefault(), out var idd) ? idd : 5; var tq = (int)Math.Floor(AmazonS3Object.Size / Math.Floor(tSize) / (1024*1024)); to calculate the chunksize value. – Adrian Hum Jan 10 '23 at 09:11
3

The easiest way would be to set the checksum yourself as metadata before you upload these files to your bucket:

// "foobar" is a placeholder; Content-MD5 expects the Base64-encoded MD5 digest of the file.
ObjectMetadata md = new ObjectMetadata();
md.setContentMD5("foobar");
PutObjectRequest req = new PutObjectRequest(BUCKET, KEY, new File("/path/to/file")).withMetadata(md);
tm.upload(req).waitForUploadResult();  // tm is a TransferManager instance

Now you can access this metadata without downloading the file:

ObjectMetadata md2 = s3Client.getObjectMetadata(BUCKET, KEY);
System.out.println(md2.getContentMD5());

source : https://github.com/aws/aws-sdk-java/issues/1711

Tristan
1

I found that s3cmd has a --list-md5 option that can be used with the ls command, e.g.

s3cmd ls --list-md5 s3://bucket_of_mine/

Hope this helps.

Jean-François Fabre
ahnkle
  • This is handy, but as mentioned in several other answers, on some files this will not be the actual MD5 sum, but some other type of hash. – Ian Greenleaf Young Sep 12 '19 at 02:38
  • I've checked the s3cmd source code and it stores the md5 in metadata while uploading, so this command will only print the md5 for objects uploaded with s3cmd or objects uploaded in a single chunk. – ZAB Sep 16 '19 at 15:28
1

MD5 is a deprecated algorithm and is not among the additional checksum algorithms S3 computes server-side, but you can get the SHA-256 checksum, provided you upload the file with the --checksum-algorithm flag, like this:

aws s3api put-object --bucket picostat --key nasdaq.csv --body nasdaq.csv --checksum-algorithm SHA256

That will return output like this:

{
    "ETag": "\"25f798aae1c15d44a556366b24d85b6d\"",
    "ChecksumSHA256": "TEqQVO6ZsOR9FEDv3ofP8KDKbtR02P6foLKEQYFd+MI=",
    "ServerSideEncryption": "AES256"
}

Then run the following on the original file to produce the Base64-encoded SHA-256 digest for comparison:

shasum -a 256 nasdaq.csv | cut -f1 -d\ | xxd -r -p | base64 

Replace the references to the CSV file with your own and make the bucket name your own.

Whenever you want to retrieve the checksum, you can run:

aws s3api get-object-attributes --bucket picostat --key nasdaq.csv --object-attributes "Checksum"
pmagunia
  • Where is it written that AWS does not support MD5 anymore? – Rauf Aghayev Mar 13 '23 at 12:52
  • As I mentioned above, MD5 is not supported server-side, but SHA256 is (S3 will not compute the MD5 hash for you, but you can send the MD5 hash to S3 when the object is created if you compute it yourself). – pmagunia Mar 13 '23 at 14:10
  • Thanks for answering, but where is it documented on the AWS side? Thanks again for replying. – Rauf Aghayev Mar 13 '23 at 16:01
  • You can see the supported checksums here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html – pmagunia Mar 15 '23 at 01:00
  • Aren't they referring only to the additional checksums? To my understanding they support MD5 via header, but those are the additional ones where they do not use MD5. – Rauf Aghayev Mar 15 '23 at 11:47
  • So, I double-checked with a solutions architect from AWS. Apparently, MD5 is still supported. – Rauf Aghayev Mar 17 '23 at 08:03
0

I have used the following approach with success. I present here a Python fragment with notes.

Let's suppose we want the MD5 checksum for an object stored in S3, and that the object was uploaded using the multipart upload process. The ETag value stored with the object in S3 is then not the MD5 checksum we want. The following Python commands can be used to stream the object's bytes, without saving the object to a local file, to compute the desired MD5 checksum. Please note this approach assumes a connection to the S3 account containing the object has been established, and that the boto3 and hashlib modules have been imported:

# assumed already done, per the note above:
#   import boto3, hashlib
#   s3 = boto3.resource('s3')

#
# specify the S3 object...
#
bucket_name = "raw-data"
object_key = "/date/study-name/sample-name/file-name"
s3_object = s3.Object(bucket_name, object_key)

#
# compute the MD5 checksum for the specified object...
# (this reads the entire object body over the network, into memory)
#
s3_object_md5 = hashlib.md5(s3_object.get()['Body'].read()).hexdigest()

This approach works for all objects stored in S3 (i.e., objects uploaded with or without the multipart upload process), although it does transfer the full object body in order to hash it.

-1

I have cross-checked jets3t and the management console against uploaded files' md5sums, and the ETag seems to be equal to the MD5 sum. You can just view the properties of the file in the AWS management console:

https://console.aws.amazon.com/s3/home

b10y
  • ETag only equals MD5 in certain situations now. See https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html#large-object-checksums – orogers Jun 14 '23 at 21:57
-2

This works for me. In PHP, you can compare the checksum between a local file and the Amazon file using this:

    // Uses the legacy AWS SDK for PHP (the AmazonS3 class);
    // amazon_bucket is assumed to be a constant defined elsewhere.

    // get local file md5
    $checksum_local_file = md5_file ( '/home/file' );

    // compare checksum between localfile and s3file    
    public function compareChecksumFile($file_s3, $checksum_local_file) {

        $Connection = new AmazonS3 ();
        $bucket = amazon_bucket;
        $header = $Connection->get_object_headers( $bucket, $file_s3 );

        // get header
        if (empty ( $header ) || ! is_object ( $header )) {
            throw new RuntimeException('checksum error');
        }
        $head = $header->header;
        if (empty ( $head ) || !is_array($head)) {
            throw new RuntimeException('checksum error');
        }
        // get etag (md5 amazon)
        $etag = $head['etag'];
        if (empty ( $etag )) {
            throw new RuntimeException('checksum error');
        }
        // remove quotes
        $checksumS3 = str_replace('"', '', $etag);

        // compare md5
        if ($checksum_local_file === $checksumS3) {
            return TRUE;
        } else {
            return FALSE;
        }
    }

ROMANIA_engineer
  • This should not work for multipart uploads, as mentioned above if the `etag` is not an `md5` of the entire file but rather a `md5` of the `md5`s of the chunks then you are comparing different things. Consider this example in laravel comparing S3 object etag and local file's `md5`: `dump(trim(Storage::disk('s3')->getMetadata($path)['etag'], '"'), md5_file(Storage::disk('local')->path($path)))` returns: `["7243d808aaca5466cee4ebef3ed6cbdf-4", "3680d3d6da5d90ccf2d758b90682f64c",]` – reppair Mar 08 '21 at 12:20
-2

Here's the code to get the S3 ETag for an object in PowerShell, converted from the C# above.

function Get-ETag {
  [CmdletBinding()]
  param(
    [Parameter(Mandatory=$true)]
    [string]$Path,
    [Parameter(Mandatory=$true)]
    [int]$ChunkSizeInMb
  )

  $returnMD5 = [string]::Empty
  [int]$chunkSize = $ChunkSizeInMb * [Math]::Pow(2, 20)

  $crypto = New-Object System.Security.Cryptography.MD5CryptoServiceProvider
  [int]$hashLength = $crypto.HashSize / 8

  $stream = [System.IO.File]::OpenRead($Path)

  if($stream.Length -gt $chunkSize) {
    $chunkCount = [int][Math]::Ceiling([double]$stream.Length / [double]$chunkSize)
    [byte[]]$hash = New-Object byte[]($chunkCount * $hashLength)
    $hashStream = New-Object System.IO.MemoryStream(,$hash)
    [long]$numBytesLeftToRead = $stream.Length
    while($numBytesLeftToRead -gt 0) {
      $numBytesCurrentRead = [int][Math]::Min($numBytesLeftToRead, $chunkSize)
      $buffer = New-Object byte[] $numBytesCurrentRead
      $numBytesLeftToRead -= $stream.Read($buffer, 0, $numBytesCurrentRead)
      $tmpHash = $crypto.ComputeHash($buffer)
      $hashStream.Write($tmpHash, 0, $hashLength)
    }
    $returnMD5 = [System.BitConverter]::ToString($crypto.ComputeHash($hash)).Replace("-", "").ToLower() + "-" + $chunkCount
  }
  else {
    $returnMD5 = [System.BitConverter]::ToString($crypto.ComputeHash($stream)).Replace("-", "").ToLower()
  }

  $stream.Close()  
  $returnMD5
}
-3

Here is the code to get the MD5 hash (as of 2017):

import java.security.MessageDigest;

import org.apache.commons.codec.binary.Base64;

public class GenerateMD5 {
    public static void main(String args[]) throws Exception {
        String s = "<CORSConfiguration> <CORSRule> <AllowedOrigin>http://www.example.com</AllowedOrigin> <AllowedMethod>PUT</AllowedMethod> <AllowedMethod>POST</AllowedMethod> <AllowedMethod>DELETE</AllowedMethod> <AllowedHeader>*</AllowedHeader> <MaxAgeSeconds>3000</MaxAgeSeconds> </CORSRule> <CORSRule> <AllowedOrigin>*</AllowedOrigin> <AllowedMethod>GET</AllowedMethod> <AllowedHeader>*</AllowedHeader> <MaxAgeSeconds>3000</MaxAgeSeconds> </CORSRule> </CORSConfiguration>";

        MessageDigest md = MessageDigest.getInstance("MD5");
        md.update(s.getBytes());
        byte[] digest = md.digest();

        /* The hex version -- this is NOT what S3 expects:
        StringBuffer sb = new StringBuffer();
        for (byte b : digest) {
            sb.append(String.format("%02x", b & 0xff));
        }
        System.out.println(sb.toString());
        */

        // S3 expects the Base64-encoded digest.
        byte[] bytes = Base64.encodeBase64(digest);
        String finalString = new String(bytes);
        System.out.println(finalString);
    }
}

The commented-out code is where most people get it wrong: converting the digest to hex, when the Content-MD5 value S3 expects is the Base64-encoded binary digest.