How to merge the small files on S3 generated by EMR with thousands of reducers

Question

My Cascalog EMR job generated thousands of small files in S3 buckets. It generates as many files as the number of reducers I used. Dumping all these tiny files takes minutes. Is there a way to concatenate them on S3 so that I can dump them quickly?

Thanks

Kang

score 0 · Answer 1 · answered Apr 24 '13 at 05:40

There are a few solutions to this problem. Here is the one I use, the Consolidator class from dfs-datastores:

https://github.com/nathanmarz/dfs-datastores/blob/develop/dfs-datastores/src/main/java/com/backtype/hadoop/Consolidator.java
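
To illustrate the general idea behind consolidating small files, here is a minimal Python sketch of the planning step: grouping many small part files into batches of roughly a target size, so that each batch can then be merged into one larger file. This is not the Consolidator's actual code or algorithm; the function name `plan_merge_batches`, the `(key, size)` input shape, and the `target_bytes` parameter are all hypothetical names chosen for this example.

```python
from typing import Iterable, List, Tuple


def plan_merge_batches(
    files: Iterable[Tuple[str, int]], target_bytes: int
) -> List[List[str]]:
    """Group (key, size_in_bytes) pairs into batches for merging.

    Files are visited in sorted key order. A batch is closed as soon as
    adding the next file would push it past target_bytes. A single file
    larger than target_bytes ends up in a batch of its own.
    (Hypothetical helper for illustration, not part of dfs-datastores.)
    """
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")

    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0

    for key, size in sorted(files):
        # Close the current batch if this file would overflow it.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current = []
            current_size = 0
        current.append(key)
        current_size += size

    # Flush the last, possibly partial, batch.
    if current:
        batches.append(current)
    return batches
```

With four part files of 40, 40, 40 and 200 bytes and a target of 100 bytes, the first two files share a batch and the other two each get their own. Each batch would then be concatenated by whatever merge step you use, which is outside the scope of this sketch.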

hiroprotagonist
