I'm using Solr 6.2.1 and ExtractingRequestHandler (already included in Solr 6.2.1) to index pdf and word documents. All documents (pdf and word) are indexed with metadata (title, date, cp_revision, compagny, ...) but the content field is always empty.
According to the documentation I should have a non-empty content field : "Tika adds all the extracted text to the content field."
Has anybody know why the content field is empty ? According to this post answer it's maybe because I open my file in a non-binary mode but how to do it in binary mode ?
This is my solrconfig.xml file :
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />
...
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="xpath">/xhtml:html/xhtml:body/descendant:node()</str>
<str name="capture">content</str>
<str name="fmap.meta">attr_meta_</str>
<str name="uprefix">attr_</str>
<str name="lowernames">true</str>
</lst>
</requestHandler>