New to Hadoop. I am trying to read my HDFS file in chunks, for example 100 lines at a time, and then run a regression on the data using Apache Commons Math's OLSMultipleLinearRegression in the mapper. I am using the code shown here to read multiple lines: http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/

My mapper is defined as:

public void map(LongWritable key, Text value, Context context) throws java.io.IOException, InterruptedException
{
    // value is expected to hold NLINESTOPROCESS lines joined by "\n"
    String lines = value.toString();
    String[] lineArr = lines.split("\n");
    int lcount = lineArr.length;
    System.out.println(lcount); // prints out "1"
    context.write(new Text(Integer.toString(lcount)), new IntWritable(1));
}

My question is: why does System.out.println show lcount == 1? My file is delimited by "\n" and I have set NLINESTOPROCESS = 3 in the record reader. My input file is formatted as:

y x1 x2 x3 x4 x5
y x1 x2 x3 x4 x5
y x1 x2 x3 x4 x5
...

I cannot perform my multiple regression if I am only reading 1 line at a time, as the regression API takes multiple data points. Thank you for any help.

cs_newbie
  • In Hadoop, the data reaches the mapper line by line if you use TextInputFormat as your input format class. If you need the whole file, you should use WholeFileInputFormat. – USB Aug 23 '14 at 08:22

1 Answer

String.split() takes a regular expression as an argument. You have to double escape.

String []lineArr = lines.split("\\n");
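
For reference, a small standalone check suggests that in Java both `"\n"` and `"\\n"` split on newlines: the first hands the regex engine a literal newline character, the second hands it the regex escape `\n`. This is only a sketch (the class name `SplitCheck` and the sample strings are made up for illustration), but it is consistent with the comment below reporting that lcount is still 1 after the change, which points at the contents of `value` rather than at the split pattern.

```java
// Hypothetical standalone check; not part of the asker's job or the linked record reader.
class SplitCheck {
    // Number of pieces produced by splitting text with the given regex.
    static int pieces(String text, String regex) {
        return text.split(regex).length;
    }

    public static void main(String[] args) {
        String threeLines = "y x1 x2\ny x1 x2\ny x1 x2";
        // "\n" passes a literal newline character to the regex engine.
        System.out.println(pieces(threeLines, "\n"));
        // "\\n" passes the two characters \ and n, the regex escape for newline.
        System.out.println(pieces(threeLines, "\\n"));
        // A value that holds a single line yields one piece with either pattern.
        System.out.println(pieces("y x1 x2", "\\n"));
    }
}
```

If `value` really contained three lines, either pattern would report 3 here.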
Brian Roach
  • Hmm, lcount is still == 1. The problem is that my value.toString() only contains 1 line of input instead of 3. Could you please help? – cs_newbie Feb 03 '13 at 23:15
  • Then there's another bug in the code the guy posted other than the one I just pointed out :) I'd suggest looking at what `value` is in your `map()`, then going from there – Brian Roach Feb 03 '13 at 23:17
  • The value inside my map() contains one line, "y x1 x2 x3 x4 x5". I am very new to Hadoop and wonder if you could help point out which function in the recordReader I should start looking at? Thank you so much. http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/ – cs_newbie Feb 03 '13 at 23:21
  • Exactly ... so, the code from that website doesn't work as stated or ... you didn't set it to be your `InputFormatClass` in your job. Without debugging the guy's code or seeing yours I can't tell you which. I'm really not going to do the former ;) – Brian Roach Feb 03 '13 at 23:25
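
Whichever record reader ends up being used, one alternative that may be worth trying is to buffer lines inside the mapper and run the regression once N of them have arrived. Below is a minimal plain-Java sketch of that buffering plus the parsing of "y x1 x2 ..." lines, with no Hadoop dependency. The class `LineBatcher` and its methods are hypothetical names introduced here for illustration, and the sample data is made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Hadoop or of the linked record reader.
class LineBatcher {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();

    LineBatcher(int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
    }

    // Buffers one line; returns a full batch once batchSize lines are held, else null.
    List<String> add(String line) {
        buffer.add(line);
        return buffer.size() < batchSize ? null : flush();
    }

    // Returns whatever is buffered (possibly fewer than batchSize lines) and clears the buffer.
    List<String> flush() {
        List<String> batch = new ArrayList<>(buffer);
        buffer.clear();
        return batch;
    }

    // First whitespace-separated column of every line: the response y.
    static double[] responses(List<String> batch) {
        double[] y = new double[batch.size()];
        for (int i = 0; i < y.length; i++) {
            y[i] = Double.parseDouble(batch.get(i).trim().split("\\s+")[0]);
        }
        return y;
    }

    // Remaining columns of every line: the predictors x1..xk.
    static double[][] predictors(List<String> batch) {
        double[][] x = new double[batch.size()][];
        for (int i = 0; i < x.length; i++) {
            String[] cols = batch.get(i).trim().split("\\s+");
            x[i] = new double[cols.length - 1];
            for (int j = 1; j < cols.length; j++) {
                x[i][j - 1] = Double.parseDouble(cols[j]);
            }
        }
        return x;
    }

    public static void main(String[] args) {
        LineBatcher batcher = new LineBatcher(3);
        String[] lines = {"1 2 3", "4 5 6", "7 8 9", "10 11 12"}; // made-up sample rows
        for (String line : lines) {
            List<String> batch = batcher.add(line);
            if (batch != null) {
                double[] y = responses(batch);
                double[][] x = predictors(batch);
                System.out.println("batch of " + y.length + " rows, " + x[0].length + " predictors");
            }
        }
        System.out.println("left over: " + batcher.flush().size());
    }
}
```

In a mapper, `add()` would presumably be called from `map()` with `value.toString()`, and `flush()` from `cleanup()` to handle a trailing partial batch. The two arrays have the shape that `OLSMultipleLinearRegression.newSampleData(double[] y, double[][] x)` takes. Note that batches are formed per input split, so this is a sketch of the idea rather than a drop-in replacement for a custom record reader.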