I copied a large number of gzip files from Google Cloud Storage to AWS's S3 using s3DistCp (as this AWS article describes). When I try to compare the files' checksums, they differ (md5/sha-1/sha-256 all have the same issue).
If I compare the sizes (in bytes) or the decompressed contents of a few files (via diff or another checksum), they match. (In this case, I'm comparing files pulled down directly from Google via gsutil vs. my distcp'd files pulled down from S3.)
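For example, this is roughly the payload comparison I mean (a quick sketch on one pair of files; the decompressed streams hash identically even though the .gz files don't):

# hash the decompressed payload of each copy
gunzip -c file1-gs-direct.gz | shasum -a 1
gunzip -c file1-via-s3.gz | shasum -a 1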
Using file, I do see a difference between the two:
file1-gs-direct.gz: gzip compressed data, original size modulo 2^32 91571
file1-via-s3.gz: gzip compressed data, from FAT filesystem (MS-DOS, OS/2, NT), original size modulo 2^32 91571
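To see where the raw bytes actually differ, I also poked at the gzip headers directly (a rough sketch using the same two files; cmp -l prints 1-based byte offsets with octal values, and the gzip header is the first 10 bytes: magic, CM, FLG, 4-byte MTIME, XFL, OS, where OS 0x03 = Unix and 0x00 = FAT):

# list byte offsets where the two .gz files differ
cmp -l file1-gs-direct.gz file1-via-s3.gz | head

# dump the 10-byte gzip header of each copy
xxd -l 10 file1-gs-direct.gz
xxd -l 10 file1-via-s3.gz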
My Goal/Question:
My goal is to verify that my downloaded files match the original files' checksums, but I don't want to have to re-download the originals or analyze them directly on Google's side. Is there something I can do to my S3-stored files to reproduce the original checksums?
Things I've tried:
Re-gzipping with different compression levels: While I wouldn't expect s3DistCp to have changed the original file's compression, here's my attempt at recompressing:
# SHA-1 of the file pulled straight from Google
target_sha=$(shasum -a 1 file1-gs-direct.gz | awk '{print $1}')
# decompress the S3 copy and recompress at each level (-n omits the name/timestamp from the header)
for i in {1..9}; do
  cur_sha=$(gunzip -c file1-via-s3.gz | gzip -n -"$i" | shasum -a 1 | awk '{print $1}')
  echo "$i. $target_sha == $cur_sha ? $([[ $target_sha == $cur_sha ]] && echo 'Yes' || echo 'No')"
done
1. abcd...1234 == dcba...4321 ? No
2. ... ? No
...
9. ... ? No