
How can I provide each line of a file-split fed to the mapper together with every line of the same file?

Basically, what I want to do is:

for each line in file-split
{
    for each line in file {
        // process
    }
}

Can I do this using MapReduce in Java?

Nitin J

3 Answers


When a MapReduce job is triggered, it first checks the input file(s); for simplicity, assume there is only one big input file. If the file is larger than the block size, the job tracker splits it by block size, then starts one map task per split and passes each split to a mapper task for processing. So each mapper processes no more than one split. If the input file is smaller than the block size, the job tracker treats it as a single split of its own.

Suppose the block size is 64 MB and you have two files of 10 MB each; the job tracker will then generate two splits, because with FileInputFormat a split is either exactly one whole file (when the file size <= block size) or a part of one file (when the file size > block size).
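The arithmetic above can be sketched in plain Java, with no Hadoop dependencies. `SplitMath.numSplits` is a hypothetical helper written for illustration, not a Hadoop API; it mirrors how FileInputFormat splits each file independently:

```java
// Sketch of FileInputFormat-style split arithmetic (hypothetical helper,
// not part of the Hadoop API). Each file is split independently; a file
// no larger than the block size yields exactly one split.
public class SplitMath {
    static long numSplits(long blockSize, long... fileSizes) {
        long splits = 0;
        for (long size : fileSizes) {
            // ceil(size / blockSize), with a minimum of one split per file
            splits += Math.max(1, (size + blockSize - 1) / blockSize);
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // The example from the answer: 64 MB blocks, two 10 MB files -> 2 splits
        System.out.println(numSplits(64 * mb, 10 * mb, 10 * mb)); // 2
        // A single 150 MB file with 64 MB blocks -> 3 splits
        System.out.println(numSplits(64 * mb, 150 * mb)); // 3
    }
}
```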

Thus a mapper only ever processes a single split, and a split cannot span more than one file (true for FileInputFormat, the default; CombineFileInputFormat, by contrast, can pack multiple files into one split).

I guess you are using FileInputFormat. HTH!

You can refer to Hadoop: The Definitive Guide to understand the basics.

Tom Sebastian
  • Hey Tom, thanks for the suggestion, but what I want to do is: for each line in every file-split, I need to process it with every line in the whole file. Is there any way I can use two separate mappers to do this? Thanks in advance! – Nitin J Mar 04 '14 at 04:12

Here is how you can do it:

1) In Mapper.setup(), initialize a vector of strings (or spill to a file if your splits are too big; the split size is usually about the block size of the input in HDFS).

2) In Mapper.map(), read the lines and add them to the vector.

3) Now you have the whole split in the vector. Do your processing in Mapper.cleanup(): for example, you can iterate over the vector and write each line to the reducer as the key, with all lines of the split as the value.
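The three steps above can be sketched without Hadoop as a plain-Java stand-in. `SplitBuffer` and its methods are hypothetical names that only mirror the Mapper lifecycle (setup/map/cleanup); this is not the real Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the buffer-in-map, emit-in-cleanup pattern.
// SplitBuffer is a hypothetical stand-in for a Hadoop Mapper:
// setup() initializes the vector, map() is called once per input line,
// and cleanup() runs after the whole split has been consumed.
public class SplitBuffer {
    private List<String> lines;
    private final List<String[]> emitted = new ArrayList<>();

    void setup() {                       // step 1: initialize the vector
        lines = new ArrayList<>();
    }

    void map(String line) {              // step 2: just collect each line
        lines.add(line);
    }

    void cleanup() {                     // step 3: whole split is available
        for (String line : lines) {
            // emit (line, allLinesOfSplit) pairs, as the answer suggests
            emitted.add(new String[] { line, String.join("\n", lines) });
        }
    }

    List<String[]> result() { return emitted; }

    public static void main(String[] args) {
        SplitBuffer m = new SplitBuffer();
        m.setup();
        m.map("a");
        m.map("b");
        m.cleanup();
        for (String[] kv : m.result())
            System.out.println(kv[0] + " -> " + kv[1].replace("\n", ","));
        // prints: a -> a,b   then   b -> a,b
    }
}
```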

Evgeny Benediktov
  • Hey Evgeny, thanks for the suggestion, but this will give me all the lines of that particular file-split. What I want to do is: for each line in every file-split, I need to process it with every line in the whole file. Thanks in advance – Nitin J Mar 04 '14 at 04:08
  • Sorry, I didn't read the requirements well. A little strange IMHO, since a split is internal to MapReduce; what does your job have to do with it? Anyway, in that case I think you have to make one split per file. To do that, gzip your input files, or put them into HDFS with a block size > mapred.min.split.size. Pay attention that you set your block sizes to powers of 2. – Evgeny Benediktov Mar 04 '14 at 07:12
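A minimal sketch of the two options from the comment above, for a 2.x-era Hadoop CLI (property names vary by version: `dfs.blocksize` in 2.x, `dfs.block.size` in 1.x; paths and file names here are hypothetical):

```shell
# Option 1: gzip the inputs. Gzip is not splittable, so each file
# becomes exactly one split and one mapper.
gzip input1.txt input2.txt
hadoop fs -put input1.txt.gz input2.txt.gz /user/me/input/

# Option 2: upload with a per-file block size larger than the file,
# so each file fits in a single block (and hence a single split).
# Block sizes are customarily powers of 2, e.g. 256 MB:
hadoop fs -D dfs.blocksize=268435456 -put bigfile.txt /user/me/input/
```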

You can get all the lines of a file in the reducer task. If that solves your issue, please look:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class FileLineComparison {

        public static class Map extends Mapper<LongWritable, Text, Text, Text> {
            private final Text fileName = new Text();

            @Override
            public void map(LongWritable key, Text line, Context context)
                    throws IOException, InterruptedException {
                /*
                 * Get the file name from the split and emit it as the key,
                 * so that the reducer receives all lines of that file
                 * from one or more mappers.
                 */
                FileSplit fileSplit = (FileSplit) context.getInputSplit();
                fileName.set(fileSplit.getPath().getName());
                context.write(fileName, line);
            }
        }

        public static class Reduce extends Reducer<Text, Text, Text, Text> {

            @Override
            public void reduce(Text filename, Iterable<Text> allLinesOfSingleFile,
                    Context context) throws IOException, InterruptedException {
                for (Text val : allLinesOfSingleFile) {
                    /*
                     * You get each line of the file here. If you want to
                     * compare each line with the rest, loop again; but note
                     * that the values form a single-pass Iterable, so buffer
                     * them first. Do your processing here.
                     */
                }
                // Write to the output file, if required.
                context.write(filename, filename);
            }
        }
    }

Or, if you really need it in the mapper, read the file itself inside each mapper, since the file name and path can be obtained from the split. This is only recommended when the file size is small.
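On the "loop again" comment in the reducer: Hadoop's reduce-side Iterable can only be traversed once, so to compare every line against every other line you first have to copy the values into a collection. The pairwise pass can be sketched in plain Java (`comparePairs` is a hypothetical helper for illustration, not a Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the pairwise comparison hinted at in the reducer above.
// The reduce-side Iterable is single-pass (Hadoop reuses the value
// objects), so the lines must be buffered before the nested loop.
public class LineComparison {
    static List<String> comparePairs(Iterable<String> allLines) {
        List<String> buffered = new ArrayList<>();
        for (String line : allLines) {
            buffered.add(line);              // materialize the single-pass iterable
        }
        List<String> pairs = new ArrayList<>();
        for (String a : buffered) {
            for (String b : buffered) {      // "loop again" over the buffered copy
                pairs.add(a + " vs " + b);   // process each (line, line) pair
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("x", "y");
        System.out.println(comparePairs(lines)); // [x vs x, x vs y, y vs x, y vs y]
    }
}
```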

Tom Sebastian