New to Hadoop here. I am trying to read my HDFS file in chunks (for example, 100 lines at a time) and then run a regression on each chunk in the mapper using Apache Commons Math's OLSMultipleLinearRegression. To read multiple lines per map() call I am using the custom record reader shown here: http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/
My mapper is defined as:
public void map(LongWritable key, Text value, Context context)
        throws java.io.IOException, InterruptedException {
    String lines = value.toString();
    String[] lineArr = lines.split("\n");
    int lcount = lineArr.length;
    System.out.println(lcount); // prints out "1"
    context.write(new Text(Integer.toString(lcount)), new IntWritable(1));
}
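To illustrate what I expect: if value actually contained three newline-separated records, the split should report 3. A plain-Java check (no Hadoop involved; the sample rows are made up):

```java
public class SplitCheck {
    public static void main(String[] args) {
        // Simulated multi-line value, as I expect the record reader to deliver it
        String value = "1.0 0.1 0.2 0.3 0.4 0.5\n"
                     + "2.0 0.6 0.7 0.8 0.9 1.0\n"
                     + "3.0 1.1 1.2 1.3 1.4 1.5";
        String[] lineArr = value.split("\n");
        System.out.println(lineArr.length); // prints "3", not "1"
    }
}
```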
My question is: why does the System.out.println show lcount == 1? My file is delimited by "\n" and I have set NLINESTOPROCESS = 3 in the record reader. My input file is formatted as:
y x1 x2 x3 x4 x5
y x1 x2 x3 x4 x5
y x1 x2 x3 x4 x5
...
I cannot perform my multiple regression if I am only reading one line at a time, since the regression API takes multiple data points. Thank you for any help.
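For reference, the batching behaviour I am after (NLINESTOPROCESS lines handed to each map() call as one value) can be sketched in plain Java, independent of Hadoop. The class and method names here are my own, not from the linked tutorial:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LineBatcher {
    // Group the input lines into chunks of up to n lines each,
    // joining each chunk with "\n" -- the value a mapper would see.
    public static List<String> batch(List<String> lines, int n) {
        List<String> batches = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            List<String> chunk = lines.subList(i, Math.min(i + n, lines.size()));
            batches.add(String.join("\n", chunk));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("l1", "l2", "l3", "l4", "l5");
        List<String> batches = batch(lines, 3);
        System.out.println(batches.size());                    // prints "2"
        System.out.println(batches.get(0).split("\n").length); // prints "3"
    }
}
```

Each element of the returned list is exactly the multi-line Text value I expected my map() to receive, so lineArr.length in the mapper would equal the chunk size rather than 1.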