
I have a lot of zip files that need to be processed by a C++ library, so I wrote my Hadoop Streaming program in C++. The program reads a zip file, unzips it, and processes the extracted data. My problems are:

  1. My mapper can't get the content of exactly one file. It usually gets something like 2.4 files or 3.2 files: Hadoop sends several files to my mapper, but at least one of them is partial, and a zip file can't be processed from a partial fragment. Can I get exactly one whole file per map? I don't want to use a file list as input and read the files from my program, because I want to keep the advantage of data locality.

  2. I can accept the contents of multiple zip files per map as long as Hadoop doesn't split any of them. I mean exactly 1, 2, or 3 whole files, not something like 2.3 files. Actually that would be even better, because my program needs to load about 800 MB of data for processing the unzipped files. Can we do this?

Bhavik Ambani
avhacker

2 Answers


You can find the solution here:

http://wiki.apache.org/hadoop/FAQ#How_do_I_get_each_of_a_job.27s_maps_to_work_on_one_complete_input-file_and_not_allow_the_framework_to_split-up_the_files.3F

The easiest approach I would suggest is to set mapred.min.split.size to a value larger than your biggest input file, so that your files do not get split.
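With Hadoop Streaming you can pass this as a `-D` option on the command line. A sketch of such an invocation (the paths, the mapper name, and the 1 GB value are placeholders, not from the question):

```shell
# Hypothetical streaming invocation: set the minimum split size (in bytes)
# larger than the biggest zip so no file is split across map tasks.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -D mapred.min.split.size=1073741824 \
  -input /user/me/zips \
  -output /user/me/out \
  -mapper ./zip_mapper \
  -file zip_mapper
```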

If this does not work then you would need to implement an InputFormat, which is not very difficult to do; you can find the steps at: http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat
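As a sketch, a non-splittable InputFormat for the old `mapred` API (the one streaming uses) only needs to override `isSplitable`. The class name is my own choice, and this assumes the Hadoop jars are on the compile classpath:

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical input format: behaves exactly like TextInputFormat except
// that isSplitable() returns false, so every map task receives only whole
// files, never a fragment of one.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}
```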

Amar
  • If I set the split size to a large value, I guess my mapper will get something like 100 files plus part of the 101st file; the 101st file is still split. As far as I know, I can't implement an InputFormat for Hadoop Streaming. Am I right? – avhacker Dec 25 '12 at 15:56
  • You can then use the below for your purpose: https://gist.github.com/808035. It is a custom InputFormat which sets isSplitable to false. Also, from http://hadoop.apache.org/docs/r0.20.2/streaming.html#How+do+I+provide+my+own+input%2Foutput+format+with+streaming%3F: at least as late as version 0.14, Hadoop does not support multiple jar files, so when specifying your own custom classes you will have to pack them along with the streaming jar and use the custom jar instead of the default Hadoop streaming jar. – Varun Shingal Dec 26 '12 at 12:43
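The repacking step described in that comment might look like this (file and class names are placeholders carried over from the earlier sketch; this assumes a machine with the Hadoop jars installed):

```shell
# Hypothetical: compile the custom format against the Hadoop classpath,
# add the class to a copy of the streaming jar, and run with -inputformat.
javac -classpath "$(hadoop classpath)" NonSplittableTextInputFormat.java
cp $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar custom-streaming.jar
jar uf custom-streaming.jar NonSplittableTextInputFormat.class
hadoop jar custom-streaming.jar \
  -inputformat NonSplittableTextInputFormat \
  -input /user/me/zips -output /user/me/out \
  -mapper ./zip_mapper -file zip_mapper
```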

Rather than depending on the minimum split size, I would suggest an easier way: gzip your files. Hadoop does not split gzip-compressed files, so each map gets whole files.

The gzip tool is available here:

http://www.gzip.org/

If you are on Linux, you can compress the extracted data recursively with

gzip -r /path/to/data

Now pass this gzipped data as the input to your Hadoop streaming job.
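To illustrate what `gzip -r` does (the temporary path is just for this demo): every regular file under the tree is replaced in place by its compressed `.gz` counterpart.

```shell
# Create a small demo tree, compress it recursively, and inspect the result.
mkdir -p /tmp/gzdemo
printf 'hello\n' > /tmp/gzdemo/a.txt
gzip -rf /tmp/gzdemo             # a.txt is replaced by a.txt.gz
ls /tmp/gzdemo                   # shows: a.txt.gz
gunzip -c /tmp/gzdemo/a.txt.gz   # prints the original contents
```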