I have a dataset made up of lots of small files (30-40 MB each on average). I want to run analytics on them with MapReduce, but every job's mapper reads the files from scratch, which puts a heavy load on I/O (read overhead, etc.).
I would like to know whether it is possible to run the mapper once and have it emit several different outputs for different reducers. From what I have read, multiple reducers per job are not possible; the only option seems to be job chaining. However, I want to run these jobs in parallel, not sequentially, since they all take the same dataset as input and run different analytics. In summary, what I want looks like this:
          / Reducer = Analytics1
Mapper ---- Reducer = Analytics2
          \ Reducer = Analytics3 ...
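To make the diagram concrete, the closest single-job workaround I could imagine is a rough sketch like the one below: the mapper tags every record once per analytics task, and a single reducer class dispatches on the tag, writing each task's result to its own file via MultipleOutputs. (FanOutMapper, FanOutReducer, and the tag names are placeholders I made up; the actual per-tag analytics logic is left out.)

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Mapper: reads each record once and emits one tagged copy per
// analytics task, so a single pass over the small files feeds all
// of the "logical" reducers.
class FanOutMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final String[] TAGS = {"analytics1", "analytics2", "analytics3"};

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        for (String tag : TAGS) {
            // Prefix the key with the tag; the shuffle then groups each
            // copy with the other records of the same analytics task.
            // (Real code would derive a proper grouping key per task
            // instead of reusing the byte offset.)
            context.write(new Text(tag + "\t" + offset.get()), value);
        }
    }
}

// Reducer: dispatches on the tag and writes each task's result to
// its own named output (analytics1-r-00000, analytics2-r-00000, ...).
class FanOutReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String tag = key.toString().split("\t", 2)[0];
        for (Text value : values) {
            // Placeholder: real code would run the analytics for `tag`
            // here instead of passing the records through unchanged.
            out.write(tag, key, value);
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        out.close();
    }
}
```

The driver would also need to register each tag as a named output, e.g. MultipleOutputs.addNamedOutput(job, "analytics1", TextOutputFormat.class, Text.class, Text.class). But I'm not sure whether this is the right approach, or whether a single reducer dispatching on tags would become a bottleneck.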
Is this possible, or is there a workaround you would suggest? Any ideas are welcome. Reading these small files over and over creates a huge overhead and hurts the performance of my analysis.
Thanks in advance!
Edit: I forgot to mention that I'm using Hadoop v2.1.0-beta with YARN.