I have a problem with PDF text extraction in Solr. Solr uses Apache Tika to extract the text of a PDF file, and Tika uses PDFBox for that. When I send my PDF file to Solr, the text is extracted successfully but comes out completely scrambled, something like this:

MonaPersNr.KSt.KUZKapaz.Sollstd.MonatJahrtsbericht

But when I extract the same PDF file directly with PDFBox on the command line, using the following command, I get a clean result.

java -jar pdfbox-app-1.6.0.jar ExtractText -console test.pdf

I don't know which Tika version, or more to the point which PDFBox version, Solr uses. I can't even find the library in the Solr war file. These are all the libs in the lib dir:

09.09.2011  09:06    <DIR>          .
09.09.2011  09:06    <DIR>          ..
09.09.2011  09:06         1.421.869 apache-solr-core-3.4.0.jar
07.09.2011  13:12            22.478 apache-solr-noggit-r1099557.jar
09.09.2011  09:06           281.626 apache-solr-solrj-3.4.0.jar
07.09.2011  13:12           188.671 commons-beanutils-1.7.0.jar
07.09.2011  13:12            58.160 commons-codec-1.4.jar
07.09.2011  13:12           575.389 commons-collections-3.2.1.jar
07.09.2011  13:12            27.361 commons-csv-1.0-SNAPSHOT-r966014.jar
07.09.2011  13:12            57.779 commons-fileupload-1.2.1.jar
07.09.2011  13:12           305.001 commons-httpclient-3.1.jar
07.09.2011  13:12           109.043 commons-io-1.4.jar
07.09.2011  13:12           257.923 commons-lang-2.4.jar
07.09.2011  13:12            28.804 geronimo-stax-api_1.0_spec-1.0.1.jar
07.09.2011  13:12           932.554 guava-r05.jar
07.09.2011  13:12            17.308 jcl-over-slf4j-1.6.1.jar
07.09.2011  13:12            12.359 log4j-over-slf4j-1.6.1.jar
09.09.2011  09:04           850.852 lucene-analyzers-3.4.0.jar
09.09.2011  09:02         1.398.580 lucene-core-3.4.0.jar
09.09.2011  09:04            61.997 lucene-grouping-3.4.0.jar
09.09.2011  09:04            83.615 lucene-highlighter-3.4.0.jar
09.09.2011  09:04            30.214 lucene-memory-3.4.0.jar
09.09.2011  09:04            69.797 lucene-misc-3.4.0.jar
09.09.2011  09:04            45.979 lucene-queries-3.4.0.jar
09.09.2011  09:04            57.912 lucene-spatial-3.4.0.jar
09.09.2011  09:04            62.164 lucene-spellchecker-3.4.0.jar
07.09.2011  13:12            25.496 slf4j-api-1.6.1.jar
07.09.2011  13:12             8.890 slf4j-jdk14-1.6.1.jar
07.09.2011  13:12           419.521 velocity-1.6.1.jar
07.09.2011  13:12           309.896 velocity-tools-2.0-beta3.jar
07.09.2011  13:12           520.969 wstx-asl-3.2.7.jar
              29 File(s)      8.242.207 bytes
               2 Dir(s), 21.805.932.544 bytes free

I would be really happy if somebody knows a solution for this.

javanna
itsme

1 Answer

Solr keeps the additional jars for Tika and its dependencies in a separate folder; they are not packaged as part of the Solr deployable.

For Solr 3.4:

If you have the Solr trunk checked out, the jars are in the solr/contrib/extraction/lib folder.

In Subversion, the jar found at that path is pdfbox-1.3.1.jar.

The Solr trunk has the latest pdfbox-1.6.0.jar.
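
To check which versions a given checkout or distribution actually ships, one option is to list the extraction-related jars in that folder. The following is only a minimal sketch and not part of Solr: the class name ExtractionJarLister and the default path are assumptions for illustration, and the file-name prefixes are simply the libraries mentioned in this thread.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: lists Tika/PDFBox-related jars in a lib folder.
final class ExtractionJarLister {

    // Prefixes of the libraries discussed in this thread (an assumption,
    // adjust to whatever your Solr version bundles).
    private static final String[] PREFIXES = {"tika-", "pdfbox-", "fontbox-", "jempbox-"};

    /** Returns the sorted names of all *.jar files in libDir that start with a known prefix. */
    static List<String> listExtractionJars(Path libDir) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : jars) {
                String name = jar.getFileName().toString();
                for (String prefix : PREFIXES) {
                    if (name.startsWith(prefix)) {
                        names.add(name);
                        break;
                    }
                }
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Default path is the folder named above, relative to the working directory.
        Path libDir = Paths.get(args.length > 0 ? args[0] : "solr/contrib/extraction/lib");
        for (String name : listExtractionJars(libDir)) {
            System.out.println(name);
        }
    }
}
```

Run it with the lib folder as the only argument; the version number is part of each jar's file name.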

Jayendra
  • OK, I replaced the pdfbox, fontbox and jempbox libs with the newest 1.6.0 jar files and I still get the same result. – itsme Nov 08 '11 at 10:31
  • OK, when I use the nightly build archive, text extraction works nicely. But I would prefer to use a stable build. – itsme Nov 08 '11 at 10:46
  • I've replaced the dist and contrib directories with the contents of the nightly build. Now PDF extraction works well. I hope everything else will still be stable too =) – itsme Nov 08 '11 at 10:50
  • Solr 3.4 doesn't have the latest PDFBox. Apache Tika works with the specific PDFBox jars it was built against, so upgrading the jars may not work as is. You may need to check the trunk, or wait for 3.5 and check whether it has the updated ones. – Jayendra Nov 08 '11 at 10:50
  • Yup. If you are working with the trunk build, you may want to check its stability. – Jayendra Nov 08 '11 at 10:52
  • But for some reason the command "java -jar pdfbox-app-1.6.0.jar ExtractText -console -sort test.pdf" extracts the text better than the request curl "http://localhost:8983/apache-solr-3.4.0/update/extract?extractOnly=true" -F "myfile=@test.pdf". Is there any way to set an equivalent of the -sort option in the HTTP request? – itsme Nov 08 '11 at 10:56
  • OK, I got the solution. In Tika it is hardcoded that the text is not sorted, for performance reasons. In my opinion this is stupid, because good text extraction is more important. But OK, not my choice. To fix this, you have to download the Apache Tika source code, change the expression setSortByPosition(false); to setSortByPosition(true); in tika-parsers\src\main\java\org\apache\tika\parser\pdf\PDF2XHTML.java, rebuild the whole project with "mvn clean install -Dmaven.test.failure.ignore=true", and replace the tika-parsers jar in the contrib lib of Solr. – itsme Nov 08 '11 at 12:11
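
To illustrate why the sort-by-position switch changes the result, here is a toy model in plain Java. It is not Tika or PDFBox code. The Fragment class, the coordinates, and the way the sample string from the question is split into pieces are all assumptions made up for illustration. The idea is only that a PDF may store text fragments in an order that differs from their layout on the page, and that sorting them by position restores the reading order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model of "sort by position"; all names and numbers are hypothetical.
final class SortByPositionDemo {

    /** A piece of text with the page coordinates of its start (y grows downward). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Made-up fragments, listed in the order a content stream might store them. */
    static List<Fragment> sampleFragments() {
        return Arrays.asList(
                new Fragment("Mona", 50, 20),       // first half of the title
                new Fragment("PersNr.", 50, 60),    // column headers follow
                new Fragment("KSt.", 120, 60),
                new Fragment("KUZ", 170, 60),
                new Fragment("Kapaz.", 220, 60),
                new Fragment("Sollstd.", 280, 60),
                new Fragment("Monat", 350, 60),
                new Fragment("Jahr", 410, 60),
                new Fragment("tsbericht", 80, 20)); // second half of the title, stored last
    }

    /**
     * Returns the fragment texts either in stored order or, if sortByPosition
     * is true, ordered top-to-bottom and then left-to-right.
     */
    static List<String> readingOrder(List<Fragment> fragments, boolean sortByPosition) {
        List<Fragment> ordered = new ArrayList<>(fragments);
        if (sortByPosition) {
            ordered.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                    .thenComparingDouble(f -> f.x));
        }
        List<String> texts = new ArrayList<>();
        for (Fragment f : ordered) {
            texts.add(f.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        List<Fragment> fragments = sampleFragments();
        System.out.println("stored order: " + String.join("", readingOrder(fragments, false)));
        System.out.println("sorted order: " + String.join("", readingOrder(fragments, true)));
    }
}
```

In the stored order the two halves of the title end up at opposite ends of the string; after sorting they are adjacent again. PDFBox exposes this behaviour through PDFTextStripper.setSortByPosition, which is the call changed in the comment above.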