Configure Apache Solr 3.6 with Tika 1.2

Question

I am using Solr 3.6 with Tika 1.2, but I can't upload PDF files. First I installed Solr and uploaded some *.xml files from exampledocs. I could search these files with this URL: http://localhost:8983/solr/select/?q=solr. In the next step I installed Tika to upload PDF and DOC files, but it doesn't work. The following content is in the "example/solr/conf/solrconfig.xml" file.

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <str name="tika.config">tika-data-config.xml</str>
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

And in the file "example/solr/conf/tika-data-config.xml" I have this content:

<dataConfig>
  <dataSource name="bin" type="BinFileDataSource" />
  <document>
    <entity name="f" dataSource="null" rootEntity="false" processor="FileListEntityProcessor" transformer="TemplateTransformer" baseDir="/home/ubuntu-user/Documents" fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)" onError="skip" recursive="true">
      <field column="fileAbsolutePath" name="path" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastmodified" />
      <entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor" url="${f.fileAbsolutePath}" format="text" onError="skip">
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
      </entity>
    </entity>
  </document>
</dataConfig>

If I run this command in the console

curl "http://localhost:8983/solr/update/extract?literal.id=doc2&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@test.pdf"

I get this output

<?xml version="1.0" encoding="UTF-8"?>
  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">183</int>
    </lst>
  </response>

But I can't search the content with Solr. If I browse to http://localhost:8983/solr/browse, I see a new entry but no content.

I also started the Solr and Tika servers:

java -jar start.jar
java -jar tika-server-1.2.jar

Can anyone help me?

score 1 · Answer 1 · answered Nov 13 '12 at 21:25

You need to add the jars (or paths) for apache-solr-dataimporthandler-3.6, apache-solr-dataimporthandler-extras-3.6 and apache-solr-cell-3.6 in the dist folder, as well as the corresponding files in the contrib folder.

Then you can extract PDFs from Solr without starting a Tika server.
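
As a rough sketch of the "paths" option, the jars can be referenced from solrconfig.xml with lib directives instead of being copied around. The relative paths below assume the stock example layout, where the core's instance directory is example/solr and dist and contrib sit in the Solr root; that layout is an assumption about your installation, so adjust the paths if yours differs.

```xml
<!-- Assumption: stock Solr 3.6 example layout; dir is resolved relative to the
     core's instance directory (example/solr), so ../../ points at the Solr root. -->

<!-- DataImportHandler and its extras jar (the extras jar provides TikaEntityProcessor) -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
<lib dir="../../contrib/dataimporthandler/lib/" regex=".*\.jar" />

<!-- Solr Cell (ExtractingRequestHandler) plus the bundled Tika jars and dependencies -->
<lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />
<lib dir="../../contrib/extraction/lib" regex=".*\.jar" />
```

Solr needs a restart after these lines are added so that the jars are picked up.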

score 0 · Answer 2 · answered Nov 14 '12 at 04:47

Check the ExtractingRequestHandler, which helps you index rich documents.
You don't need to start a separate Tika server, as Solr can use its bundled libraries to extract the content from rich documents.

The required jars (Solr Cell and the Tika jars with their dependencies) are probably already referenced in the configuration:

<lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" /> 
<lib dir="../../contrib/extraction/lib" regex=".*\.jar" />
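
For comparison, here is a minimal sketch of the handler definition with the tika.config line removed. As far as I know, tika.config expects a Tika configuration file, not a DataImportHandler data-config file, so pointing it at tika-data-config.xml is probably not what you want. Whether that is the cause of the missing content is an assumption on my part. All other parameters are taken unchanged from the question.

```xml
<!-- Sketch only: same handler as in the question, minus the tika.config entry,
     so Solr Cell falls back to its default Tika configuration. -->
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Map the extracted body text to the "text" field -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <!-- Unknown metadata fields get this prefix -->
    <str name="uprefix">ignored_</str>
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
```

Note that parameters passed on the request URL, such as fmap.content=attr_content, override these defaults, so the extracted text then lands in attr_content rather than text.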

score 0 · Answer 3 · answered Nov 14 '12 at 15:21

I have now reinstalled Solr, and I can search PDFs with this URL:

http://localhost:8983/solr/select/?q=attr_content:st*

Some PDFs are fine, but for some PDFs I get this output:

<arr name="attr_content"><str>                         ((stdin))      � ���������

The attr_creation_date and attr_meta fields are fine. The producer was Ghostscript (GPL Ghostscript 8.63).
