
I have a large CSV file with the following details:

total records: 20 million
total columns: 45
total file size: 8 GB

I am trying to process this CSV file using Apache Spark (a distributed computing engine) on AWS EMR. I am partitioning the data on one of its columns, which has a Timestamp datatype.
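Roughly, the write looks like the following sketch (the column name `event_ts` and the paths are placeholders, not the real ones from my job):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-orc").getOrCreate()

# read the 8 GB CSV from HDFS
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/input.csv"))

# partitioning on a high-cardinality timestamp column creates one folder
# per distinct value -- close to 1.2 million folders in this case
(df.write
   .partitionBy("event_ts")
   .orc("hdfs:///data/output"))
```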

Spark ends up creating close to 1.2 million partition folders, and under each folder there is an output .orc file in the 0 to 5 KB range. All of these folders/files are written by Spark to HDFS on the EMR cluster.

We need to copy this large number of small files from HDFS on EMR to an S3 bucket. I've used s3-dist-cp, and it copies them successfully in close to 3-4 minutes.
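The copy step is roughly this, run on the EMR master node (bucket and paths are placeholders):

```sh
s3-dist-cp \
  --src hdfs:///data/output \
  --dest s3://my-bucket/output
```

I'm aware s3-dist-cp also has `--groupBy` and `--targetSize` options to aggregate files during the copy, but as far as I understand they do a byte-level concatenation, which is fine for plain text or gzip but would corrupt columnar files like ORC, so I haven't used them here.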

Is it best practice to copy such a large number of small files using the s3-dist-cp command-line utility, or is there an alternative approach?

  • `s3-dist-cp` is the best option to copy the files. However, you should consider partitioning by date or some other column, rather than by the raw timestamp, so you have a manageable number of partitions (see the sketch after these comments). – Vamsi Prabhala May 31 '20 at 13:36
  • `s3-dist-cp` was built for this, right? I would expect it is a good enough option then. Or are you trying to come up with the elusive "ideal" option? – Saša Zejnilović May 31 '20 at 13:50
  • Hi @Vamsi, true, we should try to avoid partitioning by timestamp, but suppose we have a large number of small files in HDFS anyway and need to copy all of them to an S3 bucket: is s3-dist-cp the ideal way to go? – TheCodeCache Jun 01 '20 at 06:14
  • Hi @SašaZejnilović, I think s3-dist-cp was built for copying a small number of larger files, though I am not 100% sure about that. That's my doubt: should we use s3-dist-cp to copy a large number of smaller files from HDFS to S3, or is there some other approach? – TheCodeCache Jun 01 '20 at 06:15
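
A sketch of what Vamsi Prabhala's suggestion above could look like: partition on a date column derived from the timestamp, which keeps the folder count manageable. Column and path names are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("csv-to-orc-by-date").getOrCreate()

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/input.csv"))

# derive a coarse date column from the raw timestamp
df_by_date = df.withColumn("event_date", F.to_date("event_ts"))

# repartition on the partition column so each date folder ends up with a
# small number of larger ORC files instead of many tiny ones
(df_by_date
   .repartition("event_date")
   .write
   .partitionBy("event_date")
   .orc("hdfs:///data/output_by_date"))
```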

0 Answers