I'm using Solr 4 and the extract request handler to index PDF files, which works well. The text from the PDF is stored in the index in order to provide a text snippet with highlighting.
The problem is that the layout of the text is lost in Solr's stored fields. For example, if the PDF content is:
left text right text
2nd. line leftr text text at the right side
...the content of the stored field looks like this:
left text right text
2nd. line leftr text text at the right side
On the other hand, if I first extract the PDF to text (using the Linux tool pdftotext) and then index the text file (instead of the PDF) with the extract request handler, the stored field keeps the layout. So the text snippet (and the content of the stored field in Solr) looks like this:
left text right text
2nd. line leftr text text at the right side
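For reference, the two-step workaround described above could be scripted roughly like this. It is only a minimal sketch under some assumptions: that pdftotext is called with its `-layout` option, that the handler is mounted at the usual `/update/extract` path, and that the document id is passed as `literal.id`. The helper names, the Solr base URL and the id are made up for illustration.

```python
import subprocess
import urllib.parse
import urllib.request
from pathlib import Path


def pdftotext_command(pdf_path, txt_path):
    # Assumption: -layout is used so pdftotext keeps the physical page layout.
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def extract_url(solr_base, doc_id):
    # Assumption: the extract request handler is registered at /update/extract.
    params = urllib.parse.urlencode({"literal.id": doc_id, "commit": "true"})
    return f"{solr_base}/update/extract?{params}"


def index_pdf_via_text(pdf_path, solr_base, doc_id):
    # Step 1: convert the PDF to a plain text file next to it.
    txt_path = Path(pdf_path).with_suffix(".txt")
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)

    # Step 2: send the text file (not the PDF) to the extract request handler.
    request = urllib.request.Request(
        extract_url(solr_base, doc_id),
        data=txt_path.read_bytes(),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A call would look something like `index_pdf_via_text("report.pdf", "http://localhost:8983/solr", "report-1")`, where the file name, URL and id are placeholders for a local setup.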
My question: Is there a way to keep the layout when indexing a PDF directly, not only when indexing a text file?