Processing HUGE number of small files independently

Question

The task is to process HUGE (around 10,000,000) number of small files (each around 1MB) independently (i.e. the result of processing file F1, is independent of the result of processing F2).

Someone suggested Map-Reduce (on Amazon-EMR Hadoop) for my task. However, I have serious doubts about MR.

The reason is that processing files in my case, are independent. As far as I understand MR, it works best when the output is dependent on many individual files (for example counting the frequency of each word, given many documents, since a word might be included in any document in the input file). But in my case, I just need a lot of independent CPUs/Cores.

I was wondering if you have any advice on this.

Side Notes: There is another issue which is that MR works best for "huge files rather than huge number of small size". Although there seems to be solutions for that. So I am ignoring it for now.

score 1 · Accepted Answer · answered May 11 '15 at 07:04

It is possible to use map reduce for your needs. In MapReduce, there are two phases Map and Reduce, however, the reduce phase is not a must, just for your situation, you could write a map-only MapReduce job, and all the calculations on a single file should be put into a customised Map function.

However, I haven't process such huge num of files in a single job, no idea on its efficiency. Try it yourself, and share with us :)

score 1 · Answer 2 · answered Jun 26 '15 at 09:30

This is quite easy to do. In such cases - the data for MR job is typically the list of files (and not the files themselves). So the size of the data submitted to Hadoop is the size of 10M file names - which is order of a couple of gigs max.

One uses MR to split up the list of files into smaller fragments (how many can be controlled by various options). Then each mapper gets a list of files. It can process one file at a time and generate the output.

(fwiw - I would suggest Qubole (where I am a founder) instead of EMR cause it would save you a ton of money with auto-scaling and spot integration).

Processing HUGE number of small files independently

2 Answers2