I need to parse a PDF file in a MapReduce program using Java. I am on a cluster running CDH 5.0.1. I have a custom InputFormat class extending FileInputFormat, in which I have overridden the getRecordReader method to return an instance of a custom RecordReader, and the isSplitable method to prevent the file from being split, as suggested in this SO answer.
Now the problem is that in the current CDH API, getRecordReader returns the interface org.apache.hadoop.mapred.RecordReader, while the class extended by the custom RecordReader in the above SO answer is the abstract class org.apache.hadoop.mapreduce.RecordReader.
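To show the mismatch concretely, this is roughly the shape the new (mapreduce) API expects — a sketch based on the standard Hadoop classes, not my actual code, using the same class names for comparison:
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
// New-API InputFormat: createRecordReader returns the abstract class
// org.apache.hadoop.mapreduce.RecordReader that the SO answer's reader extends.
public class PDFInputFormat extends FileInputFormat<Text, Text> {
    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        return new PDFRecordReader();
    }
    // New-API isSplitable takes a JobContext instead of a FileSystem.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}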
My custom InputFormat class, written against the old (mapred) API:
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
public class PDFInputFormat extends FileInputFormat<Text, Text> {
    // Old (mapred) API: getRecordReader returns the RecordReader interface.
    @Override
    public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf conf, Reporter reporter) throws IOException {
        return new PDFRecordReader();
    }
    // Keep PDFs whole so a single mapper reads the entire file.
    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }
}
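And this is the bare skeleton my PDFRecordReader would have to follow under the old mapred API (just the interface shape, assuming Text keys and values; the actual PDF parsing is omitted):
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordReader;
// Skeleton only: the old-API reader implements the RecordReader interface,
// so it cannot simply extend the mapreduce.RecordReader from the SO answer.
public class PDFRecordReader implements RecordReader<Text, Text> {
    @Override
    public boolean next(Text key, Text value) throws IOException {
        return false; // would emit one key/value per record parsed from the PDF
    }
    @Override
    public Text createKey() { return new Text(); }
    @Override
    public Text createValue() { return new Text(); }
    @Override
    public long getPos() throws IOException { return 0; }
    @Override
    public void close() throws IOException { }
    @Override
    public float getProgress() throws IOException { return 0.0f; }
}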
Appreciate any help or pointers as to what I am missing here.