
I have an application that iterates over a directory of PDF files and searches for a string. I am using PDFBox to extract the text from the PDFs and the code is pretty straightforward. At first, searching through 13 files took a minute and a half to load the results, but I noticed that PDFBox was putting a lot of stuff in the log file. I changed the logging level and that helped a lot, but it is still taking over 30 seconds to load a page. Does anybody have any suggestions on how I can optimize the code, or another way to determine how many hits are in a document? I played around with Lucene, but it seems to only give you the number of hits in a directory, not the number of hits in a particular file.
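Before optimizing, it may help to measure where the 30 seconds actually go (parsing vs. text stripping vs. logging). A minimal timing helper, as a sketch (the class and method names here are illustrative, not part of the original code):

```java
public class ParseTimer {
    // Run a task and return how long it took, in milliseconds.
    public static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Wrap each stage of the pipeline separately, e.g.
        // long ms = timeMillis(() -> { /* parse one PDF here */ });
        long ms = timeMillis(() -> { /* work under test */ });
        System.out.println("stage took " + ms + " ms");
    }
}
```

Timing each file individually would show whether one large PDF dominates the total, or whether the cost is spread evenly across all 13.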

Here is my code to get the text out of a PDF.

public static String parsePDF(String filename) throws IOException
 {
    FileInputStream fi = new FileInputStream(new File(filename));
    try {
        PDFParser parser = new PDFParser(fi);
        parser.parse();
        COSDocument cd = parser.getDocument();
        PDFTextStripper stripper = new PDFTextStripper();
        String pdfText = stripper.getText(new PDDocument(cd));
        cd.close();
        return pdfText;
    } finally {
        // close the stream even if parsing fails
        fi.close();
    }
 }
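To get the number of hits in a particular file without a full search engine, one option is to count occurrences of the search string directly in the extracted text. A minimal sketch (the class and method names are illustrative, not from the original code):

```java
public class PdfSearch {
    // Count non-overlapping, case-insensitive occurrences of term in text.
    public static int countHits(String text, String term) {
        if (text == null || term == null || term.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase();
        String needle = term.toLowerCase();
        int count = 0;
        int idx = haystack.indexOf(needle);
        while (idx != -1) {
            count++;
            idx = haystack.indexOf(needle, idx + needle.length());
        }
        return count;
    }
}
```

Something like `countHits(parsePDF(filename), "myterm")` would give a per-file hit count; note that the dominant cost is likely the PDF text extraction itself, not the counting.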
user984701

1 Answer


Lucene would allow you to index each document separately.
Instead of using PDFBox directly, you can use Apache Tika to extract the text and feed it to Lucene. Tika uses PDFBox internally, but it provides an easy-to-use API as well as the ability to extract content from many document types seamlessly.
Once you have a Lucene document for each file in your directory, you can search against the complete index.
Lucene matches the search term and returns the number of results (files) whose content matches.
It is also possible to get the number of hits within each Lucene document/file using the Lucene API. This is called the term frequency, and it can be calculated for the document and field being searched.

Example from In a Lucene / Lucene.net search, how do I count the number of hits per document?

List<Integer> docIds = // doc ids for documents that matched the query,
                       // sorted in ascending order

int totalFreq = 0;
TermDocs termDocs = reader.termDocs();
termDocs.seek(new Term("my_field", "congress"));
for (int id : docIds) {
    // skipTo advances to the first doc >= id; confirm it landed on id
    if (termDocs.skipTo(id) && termDocs.doc() == id) {
        totalFreq += termDocs.freq();
    }
}
Jayendra
  • Oh, I didn't realize that this was an option. I will take a look. – user984701 Oct 23 '11 at 16:52
  • Well, it looks like I am back to square one. It seems that the Lucene family cannot index PDF directly and needs to extract the text first. They list a few options to extract the text, but one is PDFBox, which I am already using. I don't understand why it is so slow. I got the app to work with Tika, which shaved off only 2 seconds! http://wiki.apache.org/lucene-java/LuceneFAQ#head-c45f8b25d786f4e384936fa93ce1137a23b7e422 – user984701 Oct 24 '11 at 19:36