I have an S3 bucket where objects are generated from Salesforce on a daily basis. I want to copy those objects from the S3 bucket to a local Linux server. An application will run on that Linux server which will reference those objects to generate a new file.

I cannot use S3 sync, as there will be hundreds of thousands of objects residing in the S3 bucket. Since these objects are generated daily, a full sync would add a substantial cost. I only want newly created objects to be copied to the local server.

I am considering using S3FS or JuiceFS to mount the S3 bucket locally, but I have heard that mounting S3 on a local server is not a reliable solution.

Is there a reliable and secure way to copy only the new objects to the local server? Also, is it reliable to mount the S3 bucket on the local server using S3FS or JuiceFS?

Thank you very much in advance.

1 Answer

You could actually use Hadoop's distcp command with the -update option; it will not download files which already exist locally and have the same length (there is no checksum comparison between S3 and other stores, so equal length is interpreted as unchanged). This can be run locally from the command line; there is no need for a cluster.

hadoop distcp -update -numListstatusThreads 40 s3a://mybucket/path file:///tmp/downloads

The -numListstatusThreads option parallelises directory scanning, which sounds like it will matter here, as S3 LIST calls return at most 1,000 keys per page and take time and money.
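
If your Salesforce export writes objects under a date-based prefix (that is an assumption; adjust it to whatever key layout you actually have), you can cut the listing cost further by pointing distcp at just that day's prefix, for example:

hadoop distcp -update -numListstatusThreads 40 s3a://mybucket/path/2023-03-20 file:///tmp/downloads/2023-03-20

Each run then only LISTs the keys created that day instead of scanning the whole bucket.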

See https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html
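
Since the objects arrive daily, one way to automate the copy is a cron entry that runs the same command once a day. This is only a sketch; the schedule, Hadoop install path, and log file are placeholders you would need to adapt to your setup:

# run the incremental copy at 02:00 every day (minute hour dom month dow command)
0 2 * * * /opt/hadoop/bin/hadoop distcp -update -numListstatusThreads 40 s3a://mybucket/path file:///tmp/downloads >> /var/log/distcp-salesforce.log 2>&1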

stevel
  • The `-numThreads` parameter no longer exists in today's HDFS (3.3.4); it is now called `-numListstatusThreads`, and it already defaults to 40. – DieterDP Mar 20 '23 at 10:13
  • @DieterDP Wow, I've been getting it wrong. Checked the source, and yep, you are right. Edited my entry. – stevel Mar 20 '23 at 13:38