
I have a custom MyInputFormat that is supposed to handle the record boundary problem for multi-line input. But when I plug MyInputFormat into my UDF load function, as follows:

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.pig.LoadFunc;
public class EccUDFLogLoader extends LoadFunc {
    @Override
    public InputFormat getInputFormat() {
        System.out.println("I am in getInputFormat function");
        return new MyInputFormat();
    }
}
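
For context, the full loader has a few more overrides than shown above. Here is a minimal sketch of the same class with those filled in; the getNext body, which emits one Text line per tuple, is only illustrative:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class EccUDFLogLoader extends LoadFunc {
    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public InputFormat getInputFormat() {
        System.out.println("I am in getInputFormat function");
        return new MyInputFormat();
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // tell Hadoop where the input lives
        FileInputFormat.setInputPaths(job, new Path(location));
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        // Pig hands us the RecordReader our InputFormat created
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null; // no more records in this split
            }
            // illustrative: wrap each Text value in a single-field tuple
            Text line = (Text) reader.getCurrentValue();
            return tupleFactory.newTuple(line.toString());
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}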

import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapred.JobConf;
public class MyInputFormat extends TextInputFormat {
    public RecordReader createRecordReader(InputSplit inputSplit, JobConf jobConf) throws IOException {
        System.out.println("I am in createRecordReader");
        //MyRecordReader is supposed to handle the record boundary
        return new MyRecordReader((FileSplit)inputSplit, jobConf);
    }
}

For each mapper, it prints out `I am in getInputFormat function` but not `I am in createRecordReader`. Can anyone provide a hint on how to hook up my custom MyInputFormat to Pig's UDF loader? Many thanks.

I am using Pig on Amazon EMR.

  • try adding an `@Override` annotation on the `createRecordReader` method to ensure you have the correct signature – Chris White Dec 19 '12 at 00:02

1 Answer


Your signature doesn't match that of the parent class (you're missing the `Reporter` argument). Try this:

@Override
public RecordReader<LongWritable, Text> getRecordReader(
        InputSplit inputSplit, JobConf jobConf, Reporter reporter)
             throws IOException {
  System.out.println("I am in createRecordReader");
  //MyRecordReader is supposed to handle the record boundary
  return new MyRecordReader((FileSplit)inputSplit, jobConf);
}

EDIT: Sorry, I didn't spot this earlier. As you note, you need to use the new API signature instead:

@Override
public RecordReader<LongWritable, Text> createRecordReader(
        InputSplit split, TaskAttemptContext context) {
  System.out.println("I am in createRecordReader");
  // MyRecordReader is supposed to handle the record boundary;
  // there is no JobConf in the new API, so pass the Configuration instead
  return new MyRecordReader((FileSplit) split, context.getConfiguration());
}

And your MyRecordReader class needs to extend the new-API org.apache.hadoop.mapreduce.RecordReader class, not implement the old org.apache.hadoop.mapred.RecordReader interface.
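
For reference, here is a minimal sketch of what such a class has to implement. The (FileSplit, Configuration) constructor matches the call in createRecordReader above; the single-line reading in nextKeyValue is just a placeholder for your actual multi-line boundary logic:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

public class MyRecordReader extends RecordReader<LongWritable, Text> {
    private final FileSplit split;
    private final Configuration conf;
    private LineReader in;
    private long start, pos, end;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    public MyRecordReader(FileSplit split, Configuration conf) {
        this.split = split;
        this.conf = conf;
    }

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException {
        // open the file and seek to the start of this split
        start = split.getStart();
        end = start + split.getLength();
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream stream = fs.open(file);
        stream.seek(start);
        in = new LineReader(stream, conf);
        pos = start;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        // your multi-line record boundary logic belongs here;
        // this placeholder reads a single line per record
        if (pos >= end) {
            return false;
        }
        key.set(pos);
        int bytesRead = in.readLine(value);
        if (bytesRead == 0) {
            return false;
        }
        pos += bytesRead;
        return true;
    }

    @Override
    public LongWritable getCurrentKey() { return key; }

    @Override
    public Text getCurrentValue() { return value; }

    @Override
    public float getProgress() {
        return (end == start) ? 1.0f : (pos - start) / (float) (end - start);
    }

    @Override
    public void close() throws IOException {
        if (in != null) {
            in.close();
        }
    }
}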

  • if I put `@Override`, it gives me an error: `MyInputFormat.java:11: method does not override or implement a method from a supertype @Override.` – Simon Guo Dec 19 '12 at 01:48
  • The error is because your current method signature doesn't override a parent method. Add in the Reporter argument and you should be ok – Chris White Dec 19 '12 at 01:50
  • Pig's `getInputFormat` expects an `org.apache.hadoop.mapreduce.InputFormat`, so the `TextInputFormat` in `MyInputFormat` is `org.apache.hadoop.mapreduce.lib.input.TextInputFormat`, which doesn't have `getRecordReader`, only `createRecordReader`. That's why I use `createRecordReader`. And it still gives me an error. – Simon Guo Dec 19 '12 at 01:57
  • Or should I not extend `TextInputFormat`? If so, which one should I extend? – Simon Guo Dec 19 '12 at 02:05
  • You should extend `org.apache.hadoop.mapreduce.lib.input.TextInputFormat` – zjffdu Dec 19 '12 at 04:36
  • Thanks, I am able to reach my `RecordReader` now. Thanks Chris and zjffdu – Simon Guo Dec 19 '12 at 06:13