Solr: store text layout from extracted PDF with Tika / extract request handler

Question

I'm using Solr 4 and the extract request handler to index PDF files, which works well. The text from the PDF is stored in the index in order to provide a text snippet with highlighting.

The problem is that the layout of the text is lost in Solr's stored fields. For example, if the PDF content is:

 left text                       right text
 2nd. line leftr text            text at the right side

...the content of the stored field looks like this:

 left text right text
 2nd. line leftr text text at the right side

On the other hand, if I first extract the PDF to text (using the Linux tool pdftotext) and then index the text file (instead of the PDF) using the extract request handler, the stored field keeps the layout. The text snippet (and the content of the stored field in Solr) then looks like this:

 left text                       right text
 2nd. line leftr text            text at the right side

My question: is there a way to keep the layout when indexing a PDF, not only a text file?

How are you calling Tika? Are you getting the XHTML and processing it, or asking Tika to flatten it straight to plain text? — Gagravarr, Dec 09 '12 at 22:36
@Gagravarr I'm using curl to send the file to the "extract" import handler. — The Bndr, Dec 10 '12 at 14:47

score 0 · Accepted Answer · answered Dec 08 '12 at 13:02

Apache Tika extracts all the text from the PDF and indexes the contents as plain text, so the layout is not preserved.
Instead of using the extract request handler with Tika, you can convert the PDF to text yourself and index that, so that you keep the layout and can still search over the text.
You can also check whether Tika's default PDF handling (which probably uses PDFBox) can be changed to use a different converter that preserves the text layout.
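
The convert-then-index route could be sketched roughly as below. This is only an illustration under assumptions, not a tested setup: it assumes `pdftotext` is on the PATH, that Solr's extract handler is reachable at `<solr_base>/update/extract`, and that it accepts the text as a raw `text/plain` request body. The function names, the `http://localhost:8983/solr` base URL, and the document id are hypothetical.

```python
import subprocess
import urllib.request
from urllib.parse import urlencode


def pdftotext_command(pdf_path, txt_path):
    """Build the pdftotext call; -layout asks it to keep the physical layout."""
    return ["pdftotext", "-layout", pdf_path, txt_path]


def extract_url(solr_base, doc_id):
    """Build the extract handler URL; literal.id sets the document id."""
    query = urlencode({"literal.id": doc_id, "commit": "true"})
    return f"{solr_base}/update/extract?{query}"


def index_pdf_with_layout(pdf_path, txt_path, solr_base, doc_id):
    """Convert the PDF to layout-preserving text, then post the text to Solr."""
    # Assumption: pdftotext is installed; check=True raises if it fails.
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    with open(txt_path, "rb") as handle:
        body = handle.read()
    # Assumption: the handler accepts a raw text/plain body at this URL.
    request = urllib.request.Request(
        extract_url(solr_base, doc_id),
        data=body,
        headers={"Content-Type": "text/plain"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A call might look like `index_pdf_with_layout("report.pdf", "report.txt", "http://localhost:8983/solr", "doc1")`. Committing on every document is convenient for a quick test but would likely be too slow for a bulk load.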

Jayendra

Converting the PDF to text is the solution I use right now. But converting before indexing is very slow, because there are more than a million docs. It takes 1-2 seconds per doc, which means 1-2 million seconds (more than 277 hours) just for the PDF-to-text conversion. So, your second hint: how do I use PDFBox? – The Bndr Dec 08 '12 at 21:08
After searching the web for a long time, I decided to keep going the "old" way. Nearly the same as you said: exporting to text, but using the Linux tool `pdftotext` with the option `-layout`, because PDFBox's text extraction does not seem to keep the layout structure. – The Bndr Dec 10 '12 at 14:46
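
The timing figure from the comments can be checked with a few lines, and the same sketch shows one way the conversion might be spread over several workers. Running conversions in parallel is an assumption added here, not something the thread tried. The estimate also assumes the time scales roughly linearly with the number of workers, which disk I/O may prevent. The function names and the worker count are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def estimated_hours(num_docs, seconds_per_doc, workers=1):
    """Total conversion time in hours, assuming perfect scaling over workers."""
    return num_docs * seconds_per_doc / workers / 3600


def convert_one(pdf_path):
    """Run pdftotext -layout on one PDF, writing a .txt file next to it."""
    txt_path = str(Path(pdf_path).with_suffix(".txt"))
    subprocess.run(["pdftotext", "-layout", str(pdf_path), txt_path], check=True)
    return txt_path


def convert_all(pdf_paths, workers=8):
    """Convert many PDFs concurrently; each thread waits on one pdftotext process."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_one, pdf_paths))
```

With these numbers, `estimated_hours(1_000_000, 1)` is about 277.8 hours and `estimated_hours(1_000_000, 2)` about 555.6 hours on a single worker. Under the idealized scaling assumption, eight workers at 2 seconds per doc would bring that to roughly 69.4 hours.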
