I have a file in which every set of four lines represents a record.

E.g., the first four lines represent record 1, the next four represent record 2, and so on.

How can I ensure that the Mapper receives these four lines at a time?

Also, I want file splitting in Hadoop to happen at a record boundary (the line number should be a multiple of four), so that records don't span multiple splits.

How can this be done?

Gitmo

2 Answers

A few approaches, some dirtier than others:


The right way

You may have to define your own RecordReader, InputSplit, and InputFormat. Depending on exactly what you are trying to do, you may be able to reuse some of the existing implementations of those three. You will likely have to write your own RecordReader to define the key/value pair, and you will likely have to write your own InputSplit to help define the boundary.
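
For illustration, here is a minimal sketch of that approach with the newer org.apache.hadoop.mapreduce API. FourLineInputFormat and FourLineRecordReader are made-up names, and rather than writing a custom InputSplit it takes the shortcut of marking files non-splittable (see the last approach below), so it only shows the RecordReader half of the work:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class FourLineInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Shortcut: never split a file, so a four-line record cannot
        // straddle two splits. A real custom InputSplit would instead
        // cut splits at offsets that land on record boundaries.
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new FourLineRecordReader();
    }

    // Wraps the stock LineRecordReader and glues every four lines into one
    // value; the key is the byte offset of the record's first line.
    public static class FourLineRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader lines = new LineRecordReader();
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException {
            lines.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            StringBuilder record = new StringBuilder();
            for (int i = 0; i < 4; i++) {
                if (!lines.nextKeyValue()) {
                    return false; // EOF; a trailing partial record is dropped
                }
                if (i == 0) {
                    key.set(lines.getCurrentKey().get());
                } else {
                    record.append(';');
                }
                record.append(lines.getCurrentValue().toString());
            }
            value.set(record.toString());
            return true;
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() throws IOException { return lines.getProgress(); }
        @Override public void close() throws IOException { lines.close(); }
    }
}

Each call to nextKeyValue() then hands the mapper one complete four-line record.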


Another right way, which may not be possible

The above task is quite daunting. Do you have any control over your data set? Can you preprocess it in some way (either while it is coming in or at rest)? If so, you should strongly consider trying to transform your dataset into something that is easier to read out of the box in Hadoop.

Something like:

ALine1
ALine2            ALine1;ALine2;ALine3;ALine4
ALine3
ALine4        ->
BLine1
BLine2            BLine1;BLine2;BLine3;BLine4
BLine3
BLine4
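
That transformation is a single pass over the data; as a rough sketch, here it is as a tiny standalone Java program (JoinFourLines is a made-up name, and per the comments below a few lines of Perl or Python would do just as well):

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Reads stdin and writes every four input lines as one ';'-joined line.
public class JoinFourLines {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        StringBuilder record = new StringBuilder();
        int n = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (n > 0) {
                record.append(';');
            }
            record.append(line);
            if (++n == 4) {
                System.out.println(record);
                record.setLength(0);
                n = 0;
            }
        }
        // A trailing partial record (line count not a multiple of 4) is dropped.
    }
}

Run it over the raw file before loading into HDFS; each output line is then a self-contained record that the stock TextInputFormat handles fine.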

Down and Dirty

Do you have any control over the file sizes of your data? If you manually split your data on the block boundary, you can force Hadoop to not care about records spanning splits. For example, if your block size is 64MB, write your files out in 60MB chunks.

Without worrying about input splits, you could do something dirty: In your map function, add your new key/value pair into a list object. If the list object has 4 items in it, do processing, emit something, then clean out the list. Otherwise, don't emit anything and move on without doing anything.
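
In code, that dirty mapper might look like the following (old mapred API to match the 0.20 era; FourLineBufferMapper is a made-up name, and emitting the joined record stands in for whatever processing you actually need):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Buffers incoming lines; once four have accumulated, emits them as a
// single record and clears the buffer.
public class FourLineBufferMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {

    private final List<String> buffer = new ArrayList<String>();
    private long recordOffset; // byte offset of the record's first line

    public void map(LongWritable key, Text value,
                    OutputCollector<LongWritable, Text> output,
                    Reporter reporter) throws IOException {
        if (buffer.isEmpty()) {
            recordOffset = key.get();
        }
        buffer.add(value.toString());
        if (buffer.size() == 4) {
            StringBuilder record = new StringBuilder();
            for (int i = 0; i < buffer.size(); i++) {
                if (i > 0) {
                    record.append(';');
                }
                record.append(buffer.get(i));
            }
            output.collect(new LongWritable(recordOffset), new Text(record.toString()));
            buffer.clear();
        }
        // Fewer than four lines buffered: emit nothing and keep going.
    }
}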

The reason you have to manually split the data is that you are not guaranteed that an entire four-line record will be given to the same map task.

Donald Miner
  • Thanks for your reply, I was thinking of the second approach you suggested, but isn't that also riddled with the same problem? How do I read four lines at a time to append them together and create a single line? – Gitmo Nov 15 '11 at 22:05
  • You could write something in Perl or Python that could do the trick. That's what I had in mind. – Donald Miner Nov 15 '11 at 22:21
  • Use [SequenceFile](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html) with compression for better performance if pre-processing of the file is done. – Praveen Sripati Nov 16 '11 at 02:10

Another way (easy, but may not be efficient in some cases) is to override FileInputFormat#isSplitable() to return false. Then the input files are not split, and each is processed by one map.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
    // Returning false keeps each file in a single split, so one mapper
    // sees all of its lines and no four-line record can straddle a boundary.
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}
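
Wiring it into an old-API job driver is then one line in the job setup (MyJob here is a placeholder for your driver class):

JobConf conf = new JobConf(MyJob.class);
conf.setInputFormat(NonSplittableTextInputFormat.class);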

And, as orangeoctopus said:

In your map function, add your new key/value pair into a list object. If the list object has 4 items in it, do processing, emit something, then clean out the list. Otherwise, don't emit anything and move on without doing anything.

This has some overhead, for the following reasons:

  • The time to process the largest file drags out the job completion time.
  • A lot of data may be transferred between the data nodes.
  • The cluster is not properly utilized, since # of maps = # of files.

(The above code is from Hadoop: The Definitive Guide.)

Praveen Sripati
  • This idea sounds promising. How about using NLineInputFormat to specify the number of lines given to each mapper? That way it won't be dependent on the largest file. The problem is, I am using Hadoop 0.20, which doesn't have this implemented. Any thoughts? – Gitmo Nov 16 '11 at 15:31
  • In 0.20, NLineInputFormat is not implemented in the new API. You can try porting the new-API NLineInputFormat from a later release into 0.20. It shouldn't be that difficult, and you would also learn how to compile and build a Hadoop jar. – Praveen Sripati Nov 16 '11 at 17:00
  • Is this the way to set the number of splits to 1? – thd Apr 04 '13 at 22:00