
I am using S3DistCp on an EMR cluster to aggregate around 200K small files (3.4 GB in total) from an S3 bucket to another path in the same bucket. It works, but it is extremely slow (around 600 MB transferred after more than 20 minutes).

Here is my EMR configuration:

1 master node: m5.xlarge
3 core nodes: m5.xlarge
release label: emr-5.29.0

The command:

s3-dist-cp --s3Endpoint=s3-eu-central-1.amazonaws.com --src=s3://my-bucket/input/ --dest=s3://my-bucket/output/ --groupBy=.*input/(entry).*(.json.gz) --targetSize=128
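As I understand the `--groupBy` semantics, s3-dist-cp combines files whose keys match the pattern and names each output file by concatenating the regex capture groups. A minimal sketch (with hypothetical keys) shows that this particular pattern funnels every matching key into the single group name `entry.json.gz`; note also that the unescaped dots in `(.json.gz)` match any character, not just a literal dot:

```python
import re

# The --groupBy pattern from the command above.
pattern = re.compile(r".*input/(entry).*(.json.gz)")

# Hypothetical object keys, for illustration only.
keys = [
    "my-bucket/input/entry-2020-01-01-part-0001.json.gz",
    "my-bucket/input/entry-2020-01-01-part-0002.json.gz",
    "my-bucket/input/entry-2020-02-15-part-0099.json.gz",
]

# The output group name is the concatenation of the capture groups,
# so every key above lands in the same group: "entry.json.gz".
for key in keys:
    m = pattern.match(key)
    print(key, "->", "".join(m.groups()))
```

If that single group is what you intend (one logical output split into ~128 MiB parts via `--targetSize`), the pattern is fine; otherwise a pattern with a more distinguishing capture group (e.g. a date component) would spread the work across more output groups.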

Am I missing something? I have read that S3DistCp can transfer a lot of files in a blink, but I can't figure out how. Both the EMR cluster and the bucket are in the same region, by the way.

Thank you.

Fabien Roussel

1 Answer


Here are the recommendations:

  1. Use R-type instances. They provide more memory than M-type instances.
  2. Use coalesce to merge the files at the source, since you have many small files.
  3. Check the number of mapper tasks. The more tasks there are, the lower the performance.
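The "coalesce" suggestion (point 2) amounts to compacting many small files into fewer large ones before, or instead of, the copy. For gzip inputs specifically this is cheap, because a gzip stream may contain multiple members: concatenating `.gz` files byte-for-byte yields a valid `.gz` file, which is essentially what s3-dist-cp's `--groupBy` exploits. A local, Spark-free sketch (hypothetical file names; not the EMR implementation):

```python
import gzip
from pathlib import Path
from tempfile import TemporaryDirectory

TARGET_SIZE = 128 * 1024 * 1024  # mirrors --targetSize=128 (~128 MiB per output)

def coalesce_gzip(small_files, out_dir, target_size=TARGET_SIZE):
    """Concatenate many small .json.gz files into fewer large ones.

    Valid because the concatenation of gzip members is itself a
    well-formed gzip stream.
    """
    out_dir = Path(out_dir)
    outputs, current, current_size = [], None, 0
    for path in small_files:
        data = Path(path).read_bytes()
        # Start a new output file when none is open or the target size is hit.
        if current is None or current_size >= target_size:
            current = out_dir / f"part-{len(outputs):05d}.json.gz"
            outputs.append(current)
            current_size = 0
        with open(current, "ab") as f:
            f.write(data)
        current_size += len(data)
    return outputs

# Demo with tiny hypothetical inputs: all three fit in one output,
# and decompressing it yields the three records in order.
with TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    srcs = []
    for i in range(3):
        p = tmp / f"entry-{i}.json.gz"
        p.write_bytes(gzip.compress(('{"id": %d}\n' % i).encode()))
        srcs.append(p)
    merged = coalesce_gzip(srcs, tmp, target_size=10**9)
    print(gzip.decompress(merged[0].read_bytes()).decode())
```

The same idea in Spark would be `df.coalesce(n).write...` on the source data, trading 200K tiny objects for a handful of large ones up front.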
BigData-Guru