My Cascalog EMR job generated thousands of small files in an S3 bucket. It writes one file per reducer, so the number of files equals the number of reducers I used. Downloading all these tiny files takes minutes. Is there a way to concatenate them on S3 so that I can download them quickly?

Thanks

Kang

rninja

1 Answer

There are a few solutions to this problem. Here is the one I use, the Consolidator class from dfs-datastores:

https://github.com/nathanmarz/dfs-datastores/blob/develop/dfs-datastores/src/main/java/com/backtype/hadoop/Consolidator.java
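To illustrate the general idea of consolidation (many small part files merged into one larger file), here is a minimal Python sketch. It is not the Consolidator implementation and it does not talk to S3: it assumes the part files are already in a local directory and follow the usual Hadoop `part-*` naming. The function name `consolidate_parts` and its parameters are hypothetical, chosen only for this example. Plain byte concatenation is also only safe for formats such as line-oriented text, not for files that carry their own headers.

```python
import shutil
from pathlib import Path


def consolidate_parts(src_dir, dest_file, pattern="part-*"):
    """Concatenate every file in src_dir matching pattern into dest_file.

    Hypothetical helper for illustration only. Parts are merged in
    sorted name order. Returns the number of part files merged.
    """
    # List the parts before opening the destination, so a destination
    # created inside src_dir is not picked up as an input.
    parts = sorted(p for p in Path(src_dir).glob(pattern) if p.is_file())
    with open(dest_file, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks rather than loading each file into memory.
                shutil.copyfileobj(src, out)
    return len(parts)
```

For example, `consolidate_parts("output", "merged.txt")` would merge `output/part-00000`, `output/part-00001`, and so on into `merged.txt`. To actually avoid the slow download, a step like this would have to run where the data lives, for instance as an extra step on the cluster, before copying the result out.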

hiroprotagonist