
I'm using Solr 6.2.1 and the ExtractingRequestHandler (already included in Solr 6.2.1) to index PDF and Word documents. All documents (PDF and Word) are indexed with their metadata (title, date, cp_revision, company, ...), but the content field is always empty.

According to the documentation I should have a non-empty content field: "Tika adds all the extracted text to the content field."

Does anybody know why the content field is empty? According to this post's answer it may be because I open my file in non-binary mode, but how do I open it in binary mode?

This is my solrconfig.xml file:

<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />

...

<requestHandler name="/update/extract"
              startup="lazy"
              class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="xpath">/xhtml:html/xhtml:body/descendant:node()</str>
    <str name="capture">content</str>
    <str name="fmap.meta">attr_meta_</str>
    <str name="uprefix">attr_</str>
    <str name="lowernames">true</str>
  </lst>
</requestHandler>
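For reference, curl can post the document's raw bytes so that no text-mode conversion happens before Tika sees the file. A minimal sketch (the core name `mycore`, document id `doc1`, and file name `mydoc.pdf` are placeholders, not taken from the question):

```shell
# Build the extract request; "mycore", "doc1" and "mydoc.pdf" are placeholders.
SOLR_URL="http://localhost:8983/solr/mycore/update/extract"
PARAMS="literal.id=doc1&commit=true"

# curl -F posts the file's raw bytes as multipart/form-data, so the
# document reaches Tika in binary mode regardless of its format.
# Echoed here as a dry run; drop the leading "echo" to actually send it.
echo curl "$SOLR_URL?$PARAMS" -F "myfile=@mydoc.pdf"
```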
Marine Msg

3 Answers


Try indexing with the files example in examples/files; it is designed to parse rich-text formats. If that works, you can figure out what goes wrong in your definition. I suspect the xpath parameter may be wrong and is returning empty content.

Alexandre Rafalovitch
  • I tested the files example and it doesn't work either. What should I change in the xpath parameter? This is the original solrconfig.xml file, I changed nothing. – Marine Msg Oct 21 '16 at 12:30
  • If the original files example pulls no content, then your PDF is most likely just images with no content. But it should work for MSWord? Otherwise, step away from the Solr and test it directly with separately-downloaded Tika (which is what Solr uses under the covers). – Alexandre Rafalovitch Oct 21 '16 at 14:26

I was using the solr:alpine Docker image and had the same problem. It turns out the "content" field was getting mapped to Solr's "text" field, which is indexed but not stored by default. See if "fmap.content=doc_content" in curl does the trick.
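A sketch of such a request, assuming a core named `mycore` and a file `mydoc.pdf` (both placeholders), and that `doc_content` exists in your schema as a stored field:

```shell
# "mycore", "doc1" and "mydoc.pdf" are placeholders; fmap.content renames
# Tika's "content" output to the stored field "doc_content" at index time.
SOLR_URL="http://localhost:8983/solr/mycore/update/extract"
PARAMS="literal.id=doc1&fmap.content=doc_content&commit=true"

# Dry run: echo the command first; remove "echo" to send the request.
echo curl "$SOLR_URL?$PARAMS" -F "myfile=@mydoc.pdf"
```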


I was having a similar problem and fixed it by setting the /update/extract request handler to this:

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">content</str>
    <str name="update.chain">uuid</str>
  </lst>
</requestHandler>

The key part is fmap.content, which maps the Tika-extracted contents to your "content" field. That field must be defined in your schema, probably with stored="true".
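If the field is missing, a schema entry along these lines would make it both indexed and stored (the field type `text_general` is an assumption here; substitute whatever text type your schema defines):

```xml
<!-- "text_general" is a common default type; use your schema's own text type. -->
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
```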

s1m3n