solr pdf search highlighting issue

Question

Solr v6.5: I have two PDF files indexed in a Solr core. When I search for a keyword, it is found in the document; however, highlighting works for one document and not the other. For example, when I search for "panic", which appears in one of the documents, I get the search result with highlighting. But when I search for "epsilon", I get a result saying it was found, along with the document information, yet no highlighting is returned for this document. Here is what has been added/changed in managed_schema.xml:

    .
    .
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    .
    .
    <field name="_text_" type="text_general" multiValued="true" indexed="true" stored="true"/>
    <field name="content" type="text_general" multiValued="true" indexed="true" stored="true"/>
    .
    .
    <copyField source="content" dest="_text_"/>

And the relevant solrconfig.xml snippet is as follows:

    .
    .
    <requestHandler name="/update/extract"
                    startup="lazy"
                    class="solr.extraction.ExtractingRequestHandler">
      <lst name="defaults">
        <str name="lowernames">true</str>
        <str name="fmap.meta">ignored_</str>
        <str name="fmap.content">_text_</str>
      </lst>
    </requestHandler>
    .
    .

Please check how the word "epsilon" is indexed in Solr using a search query, and check whether it contains any spaces or capital letters. Also, which analyzers are you using for the "_text_" field (both "index" and "query")? Add this to the question too so that the issue can be found. — Riya, Apr 24 '17 at 12:01
@Riya I am pretty sure that the word I am searching for appears "as is" in the document as well. I have also edited the question to include the fieldType details. — user7913157, Apr 24 '17 at 12:22
Once again, searching and highlighting both work for one PDF (a small file). For the other PDF (comparatively larger), searching works but highlighting does not, if that helps. — user7913157, Apr 24 '17 at 12:29

score 0 · Answer 1 · answered Apr 25 '17 at 13:43

I added the

hl.maxAnalyzedChars=aLargeEnoughValue

parameter to the query, and it now returns highlighting for search terms that occur farther down the document. The default value for this parameter is 51200 characters, so text beyond that point in a field is not analyzed for highlighting.

Take-away: When large documents are indexed in Solr, a search can return a match while the highlighting for that document comes back empty. This happens when the search term occurs farther down the document than the highlighter analyzes. Increasing the value of hl.maxAnalyzedChars fixes it.
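
If the larger limit should apply to every request instead of being passed on each query, one option is to set it as a default on the search handler in solrconfig.xml. The fragment below is only a sketch and is not taken from the original setup: the handler name /select, the highlighted field _text_, and the value 1000000 are assumptions to adjust for your own core and document sizes.

```xml
<!-- Sketch only: handler name, hl.fl field, and the limit are assumed values -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- turn highlighting on by default -->
    <str name="hl">true</str>
    <!-- field to highlight; assumed to be the stored field holding the PDF text -->
    <str name="hl.fl">_text_</str>
    <!-- analyze up to this many characters per field (default is 51200) -->
    <str name="hl.maxAnalyzedChars">1000000</str>
  </lst>
</requestHandler>
```

Parameters passed in the request still override these defaults, and a larger limit means more highlighting work on large documents, so it is probably worth choosing a value close to the size of the biggest extracted text.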
