I have a requirement to load data from our on-prem servers into an S3 bucket. A Python script will be scheduled to run every morning to upload any new files that have arrived on the on-prem servers.
However, files are not removed from the on-prem servers after upload, so I need to upload only the files that have not already been loaded into the S3 bucket.
The folder structure on the on-prem servers and in the S3 bucket needs to match exactly, as shown below:
MainFolder/
├── SubFolderOne/
│   ├── File1
│   ├── File2
│   ├── File3
│   └── File4
├── SubFolderTwo/
│   ├── File1
│   └── File2
└── SubFolderThree/
    ├── File1
    ├── File2
    ├── File3
    └── File4
where MainFolder is the folder that needs to be monitored. A folder with the same name already exists in our S3 bucket, and everything under MainFolder on the on-prem servers and in the S3 bucket needs to stay identical.
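For context, this is roughly the sync logic I have in mind: walk MainFolder, map each file's relative path to an S3 key, and upload only the keys that are missing from the bucket. This is a sketch using boto3; the bucket name, prefix, and local path are placeholders I would pass in:

```python
import os


def local_files(root, prefix):
    """Yield (local_path, s3_key) for every file under root,
    mirroring the relative path as the object key in S3."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            yield path, prefix + rel


def sync_new_files(bucket, prefix, root):
    """Upload only files whose key does not already exist in the bucket."""
    import boto3  # imported here so local_files() is usable without AWS credentials
    s3 = boto3.client("s3")
    existing = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            existing.add(obj["Key"])
    for path, key in local_files(root, prefix):
        if key not in existing:
            s3.upload_file(path, bucket, key)
```

For example, `sync_new_files("my-bucket", "MainFolder/", "/data/MainFolder")` (placeholder names) would upload `/data/MainFolder/SubFolderOne/File1` as `MainFolder/SubFolderOne/File1`. Comparing by key alone works only if files never change after arriving, which is why I looked at ETags.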
I tried using S3 ETag values to compare files, but the ETag and the local MD5 hash are not the same even for an identical file (this happens, for example, when the object was uploaded as a multipart upload).
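To illustrate the mismatch I'm seeing: for multipart uploads, S3's ETag is not the MD5 of the whole file but the MD5 of the concatenated binary MD5 digests of each part, with a `-<part count>` suffix. A sketch that reproduces this, assuming you know the part size the uploader used (boto3's default is 8 MB, but that is configurable):

```python
import hashlib


def multipart_etag(data: bytes, part_size: int) -> str:
    """Compute the ETag S3 assigns: plain MD5 for a single-part upload,
    otherwise MD5 of the concatenated per-part binary MD5s plus '-<parts>'."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    if len(parts) <= 1:
        return hashlib.md5(data).hexdigest()  # single-part: ETag == MD5
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"
```

So a naive `etag == md5(file)` check fails as soon as a file crosses the multipart threshold, unless the comparison recomputes the ETag with the matching part size.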
Is there a reliable way to implement this requirement?