I need to parse a PDF file in a MapReduce program using Java. I am on a cluster running CDH 5.0.1. I have a custom InputFormat class extending FileInputFormat, in which I have overridden the getRecordReader method to return an instance of a custom RecordReader, and the isSplitable method to prevent the file from being split, as suggested in this SO answer.
Now the problem is that in the current CDH API, getRecordReader returns the interface org.apache.hadoop.mapred.RecordReader, while the class extended by the custom RecordReader in the above SO answer is the abstract class org.apache.hadoop.mapreduce.RecordReader.
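To show the mismatch concretely, this is roughly the shape the new (mapreduce) API expects — a sketch based on the standard Hadoop classes, not my actual code, using the same class names for comparison:
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
// New-API InputFormat: createRecordReader returns the abstract class
// org.apache.hadoop.mapreduce.RecordReader that the SO answer's reader extends.
public class PDFInputFormat extends FileInputFormat<Text, Text> {
    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        return new PDFRecordReader();
    }
    // New-API isSplitable takes a JobContext instead of a FileSystem.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}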
My custom InputFormat class, written against the old (mapred) API:
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
public class PDFInputFormat extends FileInputFormat<Text, Text> {
    // Old (mapred) API: getRecordReader returns the RecordReader interface.
    @Override
    public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf conf, Reporter reporter) throws IOException {
        return new PDFRecordReader();
    }
    // Keep PDFs whole so a single mapper reads the entire file.
    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }
}
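And this is the bare skeleton my PDFRecordReader would have to follow under the old mapred API (just the interface shape, assuming Text keys and values; the actual PDF parsing is omitted):
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordReader;
// Skeleton only: the old-API reader implements the RecordReader interface,
// so it cannot simply extend the mapreduce.RecordReader from the SO answer.
public class PDFRecordReader implements RecordReader<Text, Text> {
    @Override
    public boolean next(Text key, Text value) throws IOException {
        return false; // would emit one key/value per record parsed from the PDF
    }
    @Override
    public Text createKey() { return new Text(); }
    @Override
    public Text createValue() { return new Text(); }
    @Override
    public long getPos() throws IOException { return 0; }
    @Override
    public void close() throws IOException { }
    @Override
    public float getProgress() throws IOException { return 0.0f; }
}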
Appreciate any help or pointers as to what I am missing here.