
I am using example source code from the Lucene 4.2.0 demo API: http://lucene.apache.org/core/4_2_0/demo/overview-summary.html

I run IndexFiles.java to create an index from a directory of rtf, pdf, doc, and docx files. I then run SearchFiles.java and notice several instances where my searches fail, i.e., a document containing the word I searched for is not returned.

I suspect this is because Lucene 4.2.0 cannot correctly index non-.txt files without additional customization.

Question: Can the IndexFiles.java source code (Lucene 4.2.0) correctly index pdf, doc, and docx files as written at the provided link? Does anyone have examples or references on how to code that functionality?

Thank You

Brian

1 Answer


No, it can't. IndexFiles is a demo, an example for you to learn from, but not really designed for production use. If you take a look at the code, you'll see it just uses a FileInputStream (wrapped with an InputStreamReader, wrapped with a BufferedReader). Generally, Lucene doesn't handle parsing different file formats (except its own index files, of course). How to parse a file to provide meaningful content to Lucene is up to you to define.
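To see why this fails on rich documents, here is roughly the equivalent of what the demo does (the method name `readAsPlainText` is mine, not from the demo). It decodes the file's raw bytes as UTF-8 text, which is fine for .txt files but, for a PDF or DOCX, decodes the binary container rather than the document's actual words:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class PlainTextRead {
    // Equivalent of the demo's FileInputStream -> InputStreamReader ->
    // BufferedReader chain: the raw bytes are decoded as UTF-8 text.
    // For binary formats (pdf, doc, docx) this yields mostly garbage,
    // so the indexed tokens don't match the words you search for.
    static String readAsPlainText(File file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file),
                        StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }
}
```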

Apache Tika might be a good place to look for this functionality. Here is a simple example using Tika with Lucene.
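A minimal sketch of that combination, assuming Tika's `Tika` facade and its `parseToString` method for format auto-detection; the directory names `"docs"` and `"index"` are placeholders, and you would need lucene-core 4.2.0 plus the Tika jars on the classpath:

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.apache.tika.Tika;

public class TikaIndexer {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika(); // auto-detects pdf, doc, docx, rtf, ...
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("index")),
                new IndexWriterConfig(Version.LUCENE_42,
                        new StandardAnalyzer(Version.LUCENE_42)))) {
            for (File file : new File("docs").listFiles()) {
                // Tika replaces the demo's stream reader: it extracts the
                // document's text content regardless of the file format.
                String text = tika.parseToString(file);
                Document doc = new Document();
                doc.add(new StringField("path", file.getPath(), Field.Store.YES));
                doc.add(new TextField("contents", text, Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }
}
```

The key change from the demo is that the "contents" field is built from Tika's extracted text instead of a reader over the raw file bytes; the rest of the indexing code can stay essentially as it is.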

You might also consider using Solr.

femtoRgon
  • Thanks for your answer. Can you elaborate on what the issues are with the FileInputStream process you drew attention to? I simply need Lucene for a desktop application where I can create a searchable index by pointing it at a directory on a user's desktop. Also, I was a little confused when you said "except its own index files." Doesn't parsing occur before files are indexed? It seems Lucene only handles .txt files; all other formats must first have text extracted using something like Tika. I think of parsing as essentially tokenizing words in a document. Is text extraction parsing? – Brian May 20 '13 at 14:15
  • Lucene doesn't handle files at all, really. That demo handles plain text files, but core Lucene doesn't. InputStreamReader is a Java standard stream reader, and for your purposes it will only handle plain text. This works on the Unix philosophy: Lucene indexes content; Tika extracts content from rich documents. I've added links to a couple of examples using Tika, one with Lucene directly, the other using Solr (which you might want to consider as well). – femtoRgon May 20 '13 at 15:34
  • Thanks! These links are helpful. I will begin exploring Tika. How might I go about replacing the code that uses the stream reader? I imagine passing some structure of parsed content from Tika...? – Brian May 20 '13 at 16:49