Currently I use AWS EMR as the cluster and Cascading as the library.
The input data is stored in AWS S3, in a single directory. The directory contains many plain-text, uncompressed files, each about 100 MB, and there can easily be 100 of them per day. The filename of each file contains a date. At the end of the day, I process all files produced on that date.
Currently my Hadoop application works like this:
- Use the S3 folder as the input tap, via `GlobHfs`.
- The `GlobHfs` has a custom filter that looks at each filename and only accepts today's files.
- Process only the filtered files and set the output tap to S3 (a rough sketch of this setup follows the list).
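For reference, this is roughly how the taps are wired up. It is a minimal sketch, assuming Cascading 2.x package names and a hypothetical bucket layout; my actual code applies a custom filename filter rather than baking the date into the glob pattern, but the idea is the same:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.GlobHfs;
import cascading.tap.hadoop.Hfs;

public class DailyTaps
{
  public static void main( String[] args )
  {
    // Hypothetical bucket names and date format; the real files just need
    // the date to appear somewhere in the filename.
    String today = new SimpleDateFormat( "yyyy-MM-dd" ).format( new Date() );

    // Source tap: glob only the files whose name contains today's date.
    Tap source = new GlobHfs( new TextLine(), "s3://my-bucket/input/*" + today + "*" );

    // Sink tap: write the day's results under a dated prefix on S3.
    Tap sink = new Hfs( new TextLine(), "s3://my-bucket/output/" + today, SinkMode.REPLACE );

    // ... build the pipe assembly and run it with a HadoopFlowConnector ...
  }
}
```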
My questions:
- Should I use compression? If so, which codec? I read that .gz compression is not splittable, so only one mapper can process each file; in my case, where the folder contains many files, is that still relevant? Should I compress each file with LZO instead? (A sketch of the compression setup I have in mind is below.)
- Should I store bigger files, or is the current format (many smaller files) good enough?
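To make the compression question concrete, this is roughly how I would switch on output compression if I went that route. It is a minimal sketch, assuming Cascading 2.x and the mapred-era Hadoop property names; `LzopCodec` comes from the separate hadoop-lzo library and is only one possible codec choice:

```java
import java.util.Properties;

import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.property.AppProps;

public class CompressedFlow
{
  public static void main( String[] args )
  {
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, CompressedFlow.class );

    // Standard Hadoop (mapred-era) job properties for compressing the job output.
    // GzipCodec compresses well but produces non-splittable .gz parts;
    // LzopCodec (hadoop-lzo) is splittable once the files are indexed.
    properties.setProperty( "mapred.output.compress", "true" );
    properties.setProperty( "mapred.output.compression.codec",
        "com.hadoop.compression.lzo.LzopCodec" );

    FlowConnector flowConnector = new HadoopFlowConnector( properties );

    // ... connect the source, pipe assembly, and sink as usual ...
  }
}
```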