I want to parse PDF files in my Hadoop 2.2.0 program. I found this, followed what it says, and so far I have these three classes:
`PDFWordCount`: the main class containing the map and reduce functions. It is just like the native Hadoop WordCount sample, except that instead of `TextInputFormat` I use my `PDFInputFormat` class; the driver setup is sketched at the end of this question.

`PDFRecordReader extends RecordReader<LongWritable, Text>`: this is where the main work happens. In particular, here is my `initialize` method, shown in full for illustration:
```java
public void initialize(InputSplit genericSplit, TaskAttemptContext context)
        throws IOException, InterruptedException {
    System.out.println("initialize");
    System.out.println(genericSplit.toString());
    FileSplit split = (FileSplit) genericSplit;
    System.out.println("filesplit convertion has been done");
    final Path file = split.getPath();
    Configuration conf = context.getConfiguration();
    conf.getInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);
    FileSystem fs = file.getFileSystem(conf);
    System.out.println("fs has been opened");
    start = split.getStart();
    end = start + split.getLength();
    System.out.println("going to open split");
    FSDataInputStream filein = fs.open(split.getPath());
    System.out.println("going to load pdf");
    PDDocument pd = PDDocument.load(filein);
    System.out.println("pdf has been loaded");
    PDFTextStripper stripper = new PDFTextStripper();
    in = new LineReader(new ByteArrayInputStream(
            stripper.getText(pd).getBytes("UTF-8")));
    start = 0;
    this.pos = start;
    System.out.println("init has finished");
}
```
(You can see my `System.out.println`s for debugging.) This method fails at converting `genericSplit` to `FileSplit`. The last thing I see in the console is `hdfs://localhost:9000/in:0+9396432`, which is `genericSplit.toString()`.
`PDFInputFormat extends FileInputFormat<LongWritable, Text>`: this just creates a `new PDFRecordReader` in its `createRecordReader` method; a minimal sketch of it follows.
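For reference, the whole class is essentially the following; treat this as a minimal sketch (imports included for completeness) rather than a line-for-line copy of my file:

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class PDFInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // All the real work happens in the custom record reader.
        return new PDFRecordReader();
    }
}
```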
I want to know what my mistake is. Do I need extra classes or something?
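For completeness, here is how the job is wired up in `PDFWordCount`. `TokenizerMapper` and `IntSumReducer` stand in for the mapper and reducer from the stock WordCount sample, so this is a sketch of my driver rather than its exact code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PDFWordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "pdf word count");
        job.setJarByClass(PDFWordCount.class);
        job.setMapperClass(TokenizerMapper.class);   // mapper from the WordCount sample
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);    // reducer from the WordCount sample
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // The only change from the stock sample: my PDF-aware input format
        // instead of the default TextInputFormat.
        job.setInputFormatClass(PDFInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```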