Checking MD5 of files present in an S3 bucket and loading files not already present

Question

I have a requirement to load data from our on-prem servers to S3 buckets. A Python script will be scheduled to run every morning to load any new files that arrive on the on-prem servers.

However, loaded files are not removed from our on-prem servers, so I need to load only the files that have not already been loaded to the S3 buckets.

The folder structure on the on-prem servers and in the S3 buckets needs to match exactly, as shown below:

MainFolder/
├── SubFolderOne/
│   ├── File1
│   ├── File2
│   ├── File3
│   └── File4
├── SubFolderTwo/
│   ├── File1
│   └── File2
└── SubFolderThree/
    ├── File1
    ├── File2
    ├── File3
    └── File4

where MainFolder is the folder that needs to be monitored. A folder with the same name exists in our S3 bucket. Everything under MainFolder on the on-prem servers and in the S3 bucket needs to be exactly the same.

I tried using ETag values to compare files, but the ETag value and the MD5 hash are not the same for exactly the same file.

Is there any way to implement this requirement?

Can you clarify why the md5 is required? So files with the same name can change and should be re-uploaded then? Note: the ETag is, if it carries a md5, also base64 encoded. — leberknecht, May 12 '21 at 14:44
The ETag algorithm isn't documented, though it is somewhat known. For small files it's simply md5, but for larger files, [it's more complex](https://stackoverflow.com/questions/12186993/what-is-the-algorithm-to-compute-the-amazon-s3-etag-for-a-file-larger-than-5gb). Could you be running into larger files? — Anon Coward, May 12 '21 at 16:15

score 0 · Answer 1 · answered May 12 '21 at 15:42

Not sure if this helps, but here is an it-works-for-me example:

% echo "hello world" > test.txt
% md5sum test.txt 
6f5902ac237024bdd0c176cb93063dc4  test.txt

% aws s3 cp test.txt s3://<bucket-name>/test.txt
upload: ./test.txt to s3://<bucket-name>/test.txt
% aws s3api head-object --bucket <bucket-name>  --key test.txt --query ETag --output text
"6f5902ac237024bdd0c176cb93063dc4"

Can you give more information about how you compute the MD5 on the Python/bash side and how you query the ETag? Maybe a newline is being added somewhere, or something like that.
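For the Python side, here is a rough sketch of how the comparison could look. It is not a definitive implementation. It assumes that single-part uploads get an ETag equal to the hex MD5 of the file (as in the CLI example above), and that multipart uploads get the commonly observed "MD5 of the part MD5s, plus `-<part count>`" form mentioned in the comment about larger files. The 8 MiB `PART_SIZE` is an assumption (it has to match whatever part size the uploader actually used), and the function names `md5_hex`, `multipart_etag`, `etag_matches`, `files_to_upload` and `list_remote_etags` are made up for this illustration. Objects encrypted with SSE-KMS or SSE-C do not carry an MD5 in their ETag, so this approach would not work for those.

```python
import hashlib
import os

# Assumption: part size used for multipart uploads; must match the uploader's setting.
PART_SIZE = 8 * 1024 * 1024


def md5_hex(path, chunk_size=1024 * 1024):
    """Hex MD5 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def multipart_etag(path, part_size=PART_SIZE):
    """Multipart-style ETag: MD5 of the concatenated binary part digests, then '-<part count>'."""
    part_digests = []
    with open(path, "rb") as handle:
        for part in iter(lambda: handle.read(part_size), b""):
            part_digests.append(hashlib.md5(part).digest())
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"


def etag_matches(path, etag, part_size=PART_SIZE):
    """Compare a local file against an ETag string as returned by S3."""
    etag = etag.strip().strip('"')  # S3 returns the ETag wrapped in double quotes
    if "-" in etag:  # a '-<n>' suffix indicates a multipart upload
        return etag == multipart_etag(path, part_size)
    return etag == md5_hex(path)


def files_to_upload(root, remote_etags, part_size=PART_SIZE):
    """Return sorted (local_path, key) pairs that are missing remotely or differ in content.

    Keys start with the name of `root` itself, e.g. 'MainFolder/SubFolderOne/File1'.
    """
    parent = os.path.dirname(os.path.abspath(root))
    pending = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            local_path = os.path.join(folder, name)
            relative = os.path.relpath(os.path.abspath(local_path), parent)
            key = relative.replace(os.sep, "/")
            etag = remote_etags.get(key)
            if etag is None or not etag_matches(local_path, etag, part_size):
                pending.append((local_path, key))
    return sorted(pending)


def list_remote_etags(bucket, prefix):
    """Map every key under `prefix` to its ETag, using boto3's list_objects_v2 paginator."""
    import boto3  # third-party; imported here so the helpers above work without it

    s3 = boto3.client("s3")
    etags = {}
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            etags[obj["Key"]] = obj["ETag"]
    return etags
```

The morning job could then call `files_to_upload("/path/to/MainFolder", list_remote_etags("<bucket-name>", "MainFolder/"))` and pass each returned pair to `boto3.client("s3").upload_file(local_path, "<bucket-name>", key)`. If only new files matter and existing files never change, checking whether the key exists would be enough and the ETag comparison could be dropped.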
