I have a large CSV file with the following details:
Total records: 20 million
Total columns: 45
Total file size: 8 GB
I am processing this CSV file using Apache Spark (a distributed computing engine) on AWS EMR, and I am partitioning the output based on one of its columns, which is of Timestamp datatype.
Spark ends up creating close to 1.2 million partition folders, and under each folder there is an output .orc file in the 0 to 5 KB size range. All of these folders/files are written by Spark to HDFS on the EMR cluster.
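For reference, the write looks roughly like the sketch below (the column name `event_ts` and the paths are placeholders, not my real names). Since `partitionBy` creates one folder per distinct value of the partition column, partitioning on the raw timestamp is what produces the ~1.2 million small folders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-orc").getOrCreate()

# Read the 8 GB CSV (placeholder path; schema inference used for brevity)
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/input/large_file.csv"))

# Partition the output by the timestamp column ("event_ts" is a placeholder).
# Every distinct value of the partition column becomes its own folder, which
# is how ~1.2 million small folders/files end up on HDFS.
(df.write
   .mode("overwrite")
   .partitionBy("event_ts")
   .orc("hdfs:///data/output/orc"))
```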
We need to copy this large number of small files from HDFS on EMR to an S3 bucket. I've used s3-dist-cp, and it copies them successfully in about 3-4 minutes.
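The copy step is basically the following (bucket and paths are placeholders):

```
s3-dist-cp --src hdfs:///data/output/orc --dest s3://my-bucket/orc-output
```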
Is the s3-dist-cp utility the best practice for copying such a large number of small files, or is there an alternative approach?