I have to parse PDF files that are stored in HDFS in a MapReduce program in Hadoop. So I get the PDF file from HDFS as input splits, and it has to be parsed and sent to the Mapper class. For implementing this InputFormat I had gone through this link . How can these input splits be parsed and converted into text format?
- This answer may be part of what you're looking for: http://stackoverflow.com/a/9298965/698839 – Matt D Feb 24 '12 at 20:52
2 Answers
Processing PDF files in Hadoop can be done by extending the FileInputFormat class. Let the class extending it be WholeFileInputFormat. In the WholeFileInputFormat class you override the getRecordReader() method. Each PDF will then be received as an individual input split, and these splits can be parsed to extract the text. This link gives a clear example of how to extend FileInputFormat.
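For reference, here is a minimal sketch of what such a whole-file input format and its record reader could look like. It uses the newer org.apache.hadoop.mapreduce API, where the hook to override is createRecordReader() rather than the old API's getRecordReader(); the class names are illustrative, and the whole file's bytes are handed to the mapper as a single BytesWritable value.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Delivers each file as exactly one record: the whole file's bytes as the value.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // never split a PDF across mappers
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
    WholeFileRecordReader reader = new WholeFileRecordReader();
    reader.initialize(split, context);
    return reader;
  }

  // Reads the single split (i.e. the whole file) into one BytesWritable.
  public static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
    private FileSplit fileSplit;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      this.fileSplit = (FileSplit) split;
      this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (processed) {
        return false;
      }
      // Read the entire file from HDFS into memory as one value.
      byte[] contents = new byte[(int) fileSplit.getLength()];
      Path file = fileSplit.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    @Override
    public NullWritable getCurrentKey() { return NullWritable.get(); }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() { /* stream is closed in nextKeyValue() */ }
  }
}
```

The mapper then receives the raw PDF bytes and can hand them to whatever PDF library does the actual text extraction.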

It depends on your splits. I think (could be wrong) that you'll need each PDF as a whole in order to parse it. There are Java libraries to do this, and Google knows where they are.
Given that, you'll need to use an approach where you have the file as a whole when you're ready to parse it. Assuming you'd want to do that in the mapper, you'd need a reader that would hand whole files to the mapper. You could write your own reader to do this, or perhaps there's one already out there. You could possibly build a reader that scans the directory of PDFs and passes the name of each file as the key into the mapper and the contents as the value.
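A hedged sketch of such a mapper, assuming a custom whole-file reader like the one in the other answer but keyed by file name (as described here), and assuming Apache PDFBox 2.x (PDDocument / PDFTextStripper) for the actual text extraction; the class name is illustrative:

```java
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical mapper: expects one whole PDF per record, with the file name as
// the key and the raw PDF bytes as the value.
public class PdfTextMapper extends Mapper<Text, BytesWritable, Text, Text> {

  @Override
  protected void map(Text fileName, BytesWritable pdfBytes, Context context)
      throws IOException, InterruptedException {
    // getBytes() can return a padded buffer, so copy only getLength() bytes.
    byte[] contents = new byte[pdfBytes.getLength()];
    System.arraycopy(pdfBytes.getBytes(), 0, contents, 0, pdfBytes.getLength());

    // Extract the plain text with PDFBox and emit it keyed by file name.
    try (PDDocument document = PDDocument.load(contents)) {
      String text = new PDFTextStripper().getText(document);
      context.write(fileName, new Text(text));
    }
  }
}
```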

- Implementing WholeFileInputFormat instead of CombineFileInputFormat solves the problem. With WholeFileInputFormat, each PDF file is received as a single input split, and these splits can then be parsed in their entirety. – WR10 Feb 25 '12 at 09:56
- Also, when parsing the entire file as a single split, won't the size of the file being read be a bottleneck? Consider a single file a terabyte in size: it would have to be parsed entirely on one machine. How do we overcome this bottleneck? – WR10 Feb 27 '12 at 08:55
- Well, first find out if it's really the case that you need the whole PDF in order to parse it. If not, that fixes the issue. Assuming you can't break it up, then I think you have to pass file names as the splits, and read directly from HDFS in your mapper. – Don Branson Feb 27 '12 at 13:00
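To illustrate that last suggestion, here is a minimal sketch, assuming the job input is a plain text file listing one HDFS PDF path per line (read by the default TextInputFormat) and again assuming PDFBox 2.x for parsing; the class name is made up:

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical mapper for the "file names as input" approach: each input record
// is one HDFS path to a PDF, which the mapper opens and parses directly.
public class PdfPathMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    Path pdfPath = new Path(value.toString().trim());
    FileSystem fs = pdfPath.getFileSystem(context.getConfiguration());

    // Stream the PDF from HDFS, extract its text, and emit it keyed by file name.
    try (InputStream in = fs.open(pdfPath);
         PDDocument document = PDDocument.load(in)) {
      String text = new PDFTextStripper().getText(document);
      context.write(new Text(pdfPath.getName()), new Text(text));
    }
  }
}
```

Note that this trades data locality for simplicity: each mapper pulls its PDF over the network from HDFS rather than processing a local block.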