
I have used Pig and Hive before, but I am new to Hadoop MapReduce. I need to write an application that takes multiple small files as input (say, 10). They have different file structures, so I want to process them in parallel on separate nodes so that they can be processed quickly. I know that Hadoop's strong point is processing large data sets, but these input files, though small, require a lot of processing, so I was hoping to leverage Hadoop's parallel computing prowess. Is this possible?

aa8y
  • How small are these files, and what kind of processing are you going to perform? – Tariq Dec 28 '12 at 15:14
  • The files are pretty small, from 1-20 KB, and we have to perform a lot of different checks to ensure that each file is in the correct format and is not corrupt. – aa8y Dec 28 '12 at 16:22
  • Can this be achieved by partitioning (maybe based on the filename)? Please see this question of mine, in which I am encountering a problem while partitioning the data: http://stackoverflow.com/questions/14193646/unable-to-set-partitoner-to-the-jobconf-object – aa8y Jan 07 '13 at 10:15
  • possible duplicate of [Getting Filename/FileData as key/value input for Map when running a Hadoop MapReduce Job](http://stackoverflow.com/questions/14212453/getting-filename-filedata-as-key-value-input-for-map-when-running-a-hadoop-mapre) – Charles Dec 30 '13 at 22:03

2 Answers


It is possible, but you're probably not going to get much value. You have these forces working against you:

Confused input

You'll need to write a mapper that can handle all of the different input formats (either by detecting the input format, or by using the name of the input file to decide which format to expect).
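
A minimal sketch of what such a mapper might look like, assuming the format can be inferred from the file name; the class name, extensions, and parsing stubs are hypothetical:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class MultiFormatMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Ask the framework which file this split came from and pick a parser accordingly.
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();

        if (fileName.endsWith(".csv")) {
            // parse 'line' as CSV (hypothetical format)
        } else if (fileName.endsWith(".tsv")) {
            // parse 'line' as tab-separated values (hypothetical format)
        } else {
            // fall back to a generic parser or flag the record as unrecognized
        }

        // Emit the file name as the key so all records from one file meet in the same reducer.
        context.write(new Text(fileName), line);
    }
}
```

(The cast to FileSplit assumes a FileInputFormat-based input format; other input formats expose the source file name differently.)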

Multiple outputs

You need either to use Hadoop's slightly tricky multiple-output-file handling functionality or to write your output as a side effect of the reducer (or of the mapper, if you can be sure that each file will go to a different node).
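
The "slightly tricky" functionality referred to here is presumably the MultipleOutputs class; below is a rough sketch of a reducer that writes each input file's records to its own output file, assuming the mapper emits the source file name as the key (as in the mapper sketch above):

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class PerFileReducer extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text fileName, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        for (Text record : records) {
            // Route each input file's records to an output named after that file,
            // rather than to the single default part-r-xxxxx file.
            mos.write(fileName, record, fileName.toString());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
```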

High cost of initialization

Every Hadoop MapReduce job comes with a hefty start-up cost: about 30 seconds on a small cluster, and much more on a larger one. This alone will probably cost you more time than you could ever hope to gain from parallelism.

Jeffrey Theobald
  • Yeah, I was considering the same solution, and I did know about the slow initialization. We have an 80-node cluster, so I guess the initialization time will be ~30 seconds. We used to perform the same operations using Ab Initio (which is supposed to be very fast), and it took about 7-8 minutes, so I was hoping it could take a similar amount of time, if not less, on Hadoop as well. – aa8y Dec 28 '12 at 16:19

In brief: give NLineInputFormat a try.

There is no problem with copying all your input files to all nodes (you can put them in the distributed cache if you like). What you really want to distribute is the check processing.

With Hadoop you can create a (single!) input control file in the format (filename,check2run) or (filename,format,check2run) and use NLineInputFormat to feed a specified number of checks to your nodes (mapreduce.input.lineinputformat.linespermap controls the number of lines fed to each mapper).
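
A rough sketch of a driver set up this way, assuming the Hadoop 2.x Job API; the paths, job name, and CheckMapper class are hypothetical. It also shows the small data files being pushed to every node through the distributed cache, as mentioned above:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CheckDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "file-checks");
        job.setJarByClass(CheckDriver.class);

        // Ship the small data files to every node via the distributed cache.
        job.addCacheFile(new URI("hdfs:///data/file01.dat"));
        job.addCacheFile(new URI("hdfs:///data/file02.dat"));

        // The job's real input is the control file: one "filename,check2run" pair per line.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path("hdfs:///control/checks.txt"));
        // One control line per map task, i.e. one (file, check) pair per mapper;
        // this sets mapreduce.input.lineinputformat.linespermap.
        NLineInputFormat.setNumLinesPerSplit(job, 1);

        job.setMapperClass(CheckMapper.class); // hypothetical mapper that runs one check
        job.setNumReduceTasks(0);              // map-only: each mapper just reports its result
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///output/check-results"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```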

Note: the Hadoop input format determines how splits are calculated; NLineInputFormat (unlike TextInputFormat) does not care about block boundaries.

Depending on the nature of your checks, you may be able to compute a linespermap value that covers all files/checks in one wave of mappers (or you may be unable to use this approach at all :) ).

Yevgen Yampolskiy
  • Forgive my ignorance, but can you please explain all this in layman's terms? :( – aa8y Dec 30 '12 at 17:32
  • The first question you should ask yourself: is it possible to parallelize my problem? Depending on your problem, you can try to parallelize by: a) breaking the file into parts and applying your calculations to the parts of the file (this is what you frequently do with Hadoop); or b) if you must perform multiple calculations on a file, running the several computations on the file separately. If you can parallelize, then the next question is how you do it. In my answer I sketched how case b) could be approached in Hadoop. – Yevgen Yampolskiy Dec 31 '12 at 02:44