New to Hadoop here. I am trying to read my HDFS file in chunks (for example, 100 lines at a time) and then run a regression on each chunk in the mapper using Apache Commons Math's OLSMultipleLinearRegression. To read multiple lines per map() call I am using the custom record reader shown here: http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/
My mapper is defined as:
public void map(LongWritable key, Text value, Context context)
        throws java.io.IOException, InterruptedException {
    String lines = value.toString();
    String[] lineArr = lines.split("\n");
    int lcount = lineArr.length;
    System.out.println(lcount); // prints out "1"
    context.write(new Text(Integer.toString(lcount)), new IntWritable(1));
}
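To illustrate what I expect: if value actually contained three newline-separated records, the split should report 3. A plain-Java check (no Hadoop involved; the sample rows are made up):

```java
public class SplitCheck {
    public static void main(String[] args) {
        // Simulated multi-line value, as I expect the record reader to deliver it
        String value = "1.0 0.1 0.2 0.3 0.4 0.5\n"
                     + "2.0 0.6 0.7 0.8 0.9 1.0\n"
                     + "3.0 1.1 1.2 1.3 1.4 1.5";
        String[] lineArr = value.split("\n");
        System.out.println(lineArr.length); // prints "3", not "1"
    }
}
```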
My question is: why does the System.out.println show lcount == 1? My file is delimited by "\n" and I have set NLINESTOPROCESS = 3 in the record reader. My input file is formatted as:
y x1 x2 x3 x4 x5
y x1 x2 x3 x4 x5
y x1 x2 x3 x4 x5
...
I cannot perform my multiple regression if I am only reading one line at a time, since the regression API takes multiple data points. Thank you for any help.
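For reference, the batching behaviour I am after (NLINESTOPROCESS lines handed to each map() call as one value) can be sketched in plain Java, independent of Hadoop. The class and method names here are my own, not from the linked tutorial:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LineBatcher {
    // Group the input lines into chunks of up to n lines each,
    // joining each chunk with "\n" -- the value a mapper would see.
    public static List<String> batch(List<String> lines, int n) {
        List<String> batches = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            List<String> chunk = lines.subList(i, Math.min(i + n, lines.size()));
            batches.add(String.join("\n", chunk));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("l1", "l2", "l3", "l4", "l5");
        List<String> batches = batch(lines, 3);
        System.out.println(batches.size());                    // prints "2"
        System.out.println(batches.get(0).split("\n").length); // prints "3"
    }
}
```

Each element of the returned list is exactly the multi-line Text value I expected my map() to receive, so lineArr.length in the mapper would equal the chunk size rather than 1.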