
I am using S3DistCp on an EMR cluster to aggregate around 200K small files (3.4 GB in total) from an S3 bucket to another path in the same bucket. It works, but it is extremely slow (around 600 MB transferred after more than 20 minutes).

Here is my EMR configuration:

1 master node: m5.xlarge
3 core nodes: m5.xlarge
release label: emr-5.29.0

The command:

s3-dist-cp --s3Endpoint=s3-eu-central-1.amazonaws.com --src=s3://my-bucket/input/ --dest=s3://my-bucket/output/ --groupBy=.*input/(entry).*(.json.gz) --targetSize=128
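As I understand the `--groupBy` semantics, s3-dist-cp combines files whose keys match the pattern and names each output file by concatenating the regex capture groups. A minimal sketch (with hypothetical keys) shows that this particular pattern funnels every matching key into the single group name `entry.json.gz`; note also that the unescaped dots in `(.json.gz)` match any character, not just a literal dot:

```python
import re

# The --groupBy pattern from the command above.
pattern = re.compile(r".*input/(entry).*(.json.gz)")

# Hypothetical object keys, for illustration only.
keys = [
    "my-bucket/input/entry-2020-01-01-part-0001.json.gz",
    "my-bucket/input/entry-2020-01-01-part-0002.json.gz",
    "my-bucket/input/entry-2020-02-15-part-0099.json.gz",
]

# The output group name is the concatenation of the capture groups,
# so every key above lands in the same group: "entry.json.gz".
for key in keys:
    m = pattern.match(key)
    print(key, "->", "".join(m.groups()))
```

If that single group is what you intend (one logical output split into ~128 MiB parts via `--targetSize`), the pattern is fine; otherwise a pattern with a more distinguishing capture group (e.g. a date component) would spread the work across more output groups.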

Am I missing something? I have read that S3DistCp can transfer a lot of files in a blink, but I can't figure out how. Both the EMR cluster and the bucket are in the same region, by the way.

Thank you.

Fabien Roussel

1 Answer


Here are the recommendations:

  1. Use R-type instances. They provide more memory than M-type instances.
  2. Use coalesce to merge the files at the source, since you have many small files.
  3. Check the number of mapper tasks. The more tasks there are, the lower the performance.
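The "coalesce" suggestion (point 2) amounts to compacting many small files into fewer large ones before, or instead of, the copy. For gzip inputs specifically this is cheap, because a gzip stream may contain multiple members: concatenating `.gz` files byte-for-byte yields a valid `.gz` file, which is essentially what s3-dist-cp's `--groupBy` exploits. A local, Spark-free sketch (hypothetical file names; not the EMR implementation):

```python
import gzip
from pathlib import Path
from tempfile import TemporaryDirectory

TARGET_SIZE = 128 * 1024 * 1024  # mirrors --targetSize=128 (~128 MiB per output)

def coalesce_gzip(small_files, out_dir, target_size=TARGET_SIZE):
    """Concatenate many small .json.gz files into fewer large ones.

    Valid because the concatenation of gzip members is itself a
    well-formed gzip stream.
    """
    out_dir = Path(out_dir)
    outputs, current, current_size = [], None, 0
    for path in small_files:
        data = Path(path).read_bytes()
        # Start a new output file when none is open or the target size is hit.
        if current is None or current_size >= target_size:
            current = out_dir / f"part-{len(outputs):05d}.json.gz"
            outputs.append(current)
            current_size = 0
        with open(current, "ab") as f:
            f.write(data)
        current_size += len(data)
    return outputs

# Demo with tiny hypothetical inputs: all three fit in one output,
# and decompressing it yields the three records in order.
with TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    srcs = []
    for i in range(3):
        p = tmp / f"entry-{i}.json.gz"
        p.write_bytes(gzip.compress(('{"id": %d}\n' % i).encode()))
        srcs.append(p)
    merged = coalesce_gzip(srcs, tmp, target_size=10**9)
    print(gzip.decompress(merged[0].read_bytes()).decode())
```

The same idea in Spark would be `df.coalesce(n).write...` on the source data, trading 200K tiny objects for a handful of large ones up front.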
BigData-Guru