
I'm new to Solr. I'm trying to configure Solr 6.3 using Solarium, but I have run into a stemming issue. My collection of documents has words like "call", "calls", "called", "calling" and "serv", "serve", "serves", "served", "serving". I included "serv" to understand how the stemmer behaves with the stem it produces. When I query Solr from my Solarium PHP page, the result count indicates that all documents containing any form of the searched word are matched. However, the page doesn't show me all of those documents. For example:

For the query 'serv', it only shows the document with 'serv'.
For the query 'serve', it only shows the document with 'serve'.
For the query 'serves', it only shows the documents with 'serves' and 'serv'.
For the query 'served', it only shows the documents with 'served' and 'serv'.
For the query 'serving', it only shows the documents with 'serving' and 'serv'.

In the case of 'call':

call --> call
calls --> calls, call
called --> called, call
calling --> calling, call

So by the looks of it, the documents that contain the keyword itself or the actual stem show up with the term highlighted, but the rest of the documents do not show.

I would expect the stemmer to bring up all these documents with the different occurrences of the keyword, i.e. a search for "call" should bring up "call", "calling", "called" and "calls".

The relevant parts of my schema are as follows:

<field name="content" type="text_en" indexed="true" stored="true"/>
<field name="_text_" type="stemmed_text" multiValued="true" indexed="true" stored="false"/>
<dynamicField name="stemmed_*" type="stemmed_text" indexed="true" stored="false"/>
<copyField source="*" dest="_text_"/>

<fieldType name="stemmed_text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
  <tokenizer class="solr.ClassicTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.KeywordRepeatFilterFactory"/>
  <filter class="solr.HunspellStemFilterFactory" dictionary="en_GB.dic" affix="en_GB.aff" ignoreCase="true" strictAffixParsing="true" />
  <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.ClassicTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.KeywordRepeatFilterFactory"/>
  <filter class="solr.HunspellStemFilterFactory" dictionary="en_GB.dic" affix="en_GB.aff" ignoreCase="true" strictAffixParsing="true" />
  <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldType>

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index"> 
  <tokenizer class="solr.ClassicTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPossessiveFilterFactory"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.ClassicTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
  <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>   
  <filter class="solr.EnglishPossessiveFilterFactory"/>
</analyzer>
</fieldType>
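
To make the `stemmed_text` chain above concrete, here is a minimal, self-contained sketch of what lowercasing, `KeywordRepeatFilterFactory`, a stemmer and `RemoveDuplicatesTokenFilterFactory` do to a term. It does not call Solr. The function names `toyStem` and `analyze` are hypothetical, the stop filter is left out, and the suffix-stripping stemmer is only a stand-in for Hunspell, which works from the `en_GB` dictionary and affix files and may produce different stems.

```php
<?php
// Toy stand-in for HunspellStemFilterFactory (an assumption for illustration):
// strips a few English suffixes instead of consulting a dictionary.
function toyStem(string $token): string
{
    foreach (['ing', 'ed', 's'] as $suffix) {
        $stemLength = strlen($token) - strlen($suffix);
        if ($stemLength >= 3 && substr($token, -strlen($suffix)) === $suffix) {
            return substr($token, 0, $stemLength);
        }
    }
    return $token;
}

// Mimics LowerCase -> KeywordRepeat -> stemmer -> RemoveDuplicates.
// Returns one list of tokens per position in the input text.
function analyze(string $text): array
{
    $positions = [];
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        // KeywordRepeat: keep the original token and emit a copy to be stemmed.
        $tokens = [$word, toyStem($word)];
        // RemoveDuplicates: drop the copy when stemming changed nothing.
        $positions[] = array_values(array_unique($tokens));
    }
    return $positions;
}

foreach (['call', 'calls', 'called', 'calling'] as $term) {
    echo $term . ' --> ' . implode(', ', analyze($term)[0]) . PHP_EOL;
}
```

Under these assumptions, each inflected form yields two tokens at the same position (the original word and its stem), while "call" yields only itself.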

Part of my PHP page is as follows:

.....

    // get a select query instance
    $query = $client->createSelect();
    $query->setFields(array('id', 'subject', 'content'));
    // $query->setQuery('someWord');
    $query->setQuery($someWord);
    $query->setStart(0)->setRows($limit);
    // get highlighting component and apply settings
    $hl = $query->getHighlighting();
    $hl->setSnippets(15);
    $hl->setFields(array('content'));
    $hl->setSimplePrefix('<strong>');
    $hl->setSimplePostfix('</strong>');

.....

    foreach ($resultset as $document) {
        $subj = '';
        if (is_array($document->subject)) {
            $subj = implode(', ', $document->subject);
        }
        echo '<table style="margin-bottom:20px; text-align:left; border:none; width:500px">';
        $highlightedDoc = $highlighting->getResult($document->id);
        if ($highlightedDoc) {
            foreach ($highlightedDoc as $field => $highlight) {
                echo $subj;
                echo implode(' (...) ', $highlight) . '<br/>';
            }
        }
        echo '</table>';
    }
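
One way to check whether matched documents are simply being skipped by the display loop is to list every document and fall back to a placeholder when no snippet comes back. In the loop above, the subject and snippets are only echoed inside the highlighting `foreach`, so a document without snippets produces an empty table. A minimal sketch, assuming plain PHP arrays in place of the Solarium result objects (the function name `renderResults`, the array shapes and the sample data are all hypothetical):

```php
<?php
// Hypothetical helper. $documents is a list of ['id' => ..., 'subject' => [...]] arrays and
// $highlighting maps a document id to ['field' => [snippets]]. Both shapes are assumptions
// standing in for the Solarium result set and highlighting result.
function renderResults(array $documents, array $highlighting): array
{
    $lines = [];
    foreach ($documents as $document) {
        $subject = is_array($document['subject'] ?? null)
            ? implode(', ', $document['subject'])
            : (string) ($document['subject'] ?? '');
        $snippets = [];
        foreach ($highlighting[$document['id']] ?? [] as $fieldSnippets) {
            $snippets[] = implode(' (...) ', $fieldSnippets);
        }
        // List the document even when no snippet was returned for it.
        $lines[] = $subject . ': ' . ($snippets ? implode(' ', $snippets) : '[no highlight snippet]');
    }
    return $lines;
}

// Made-up sample data: two matched documents, only one of them with a snippet.
$documents = [
    ['id' => 'doc1', 'subject' => ['First']],
    ['id' => 'doc2', 'subject' => ['Second']],
];
$highlighting = [
    'doc1' => ['content' => ['a <strong>call</strong> was made']],
    'doc2' => [],
];
echo implode(PHP_EOL, renderResults($documents, $highlighting)) . PHP_EOL;
```

If the number of lines printed this way equals the result count, the documents are all being returned and only the highlighting snippets are missing for some of them.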

I use the solrconfig that comes with the Solr installation. I would greatly appreciate it if someone could tell me what I am doing wrong. Am I missing something from my schema, or is there some setting I have to configure in the solrconfig? As a last resort I am thinking of using solr.EdgeNGramFilterFactory, but I would like to avoid this. I am attaching a link to an image of my Solr analysis screen.

Thank you in advance.

[Image: Solr Analysis for the word "calling"]

[Image: Solr Admin Console Showing Highlighting]

entropy
  • What do you mean by "occurrences are counted in the resultset but not showing"? Are the documents included in the list of hits? Are they not in the list of documents, but in the result count? Or is highlighting the part that's missing (i.e. all variations aren't highlighted)? Have you tried running the indexed and queried term through the analysis page? – MatsLindh Nov 23 '16 at 14:09
  • For example, I may have 4 documents, each containing a different form of the verb "call" (call, calls, called and calling respectively). All the documents are in the result count but not in the result list. A search for "call" will count all 4 documents but will only list the document with the word "call". A search for "calling" will list the documents with the words "calling" and "call". What happens to the rest? Highlighting works fine for the documents that are brought up. At the bottom of my previous post I attached a link to an image of my Solr analysis panel. – entropy Nov 23 '16 at 19:48
  • In fact I noticed on the Solr admin console that it is the highlighting that is not working for all the terms. Any ideas why this is happening? – entropy Nov 23 '16 at 22:00
  • What are the symptoms of highlighting not working? Are the terms tagged with the same position when indexed? (The "keyword" part is not important.) – MatsLindh Nov 23 '16 at 22:23
  • @MatsLindh I have edited my original post, and at the end of it I added a link to an image of my Solr admin panel. I queried for "calling". As you will see, Solr finds all the documents with all the different forms of call (4 documents) but it highlights only two. Any ideas? – entropy Nov 23 '16 at 22:56
  • @MatsLindh Thank you for your time and effort. Eventually, I managed to solve the issue. I started with a clean installation of solr and everything worked like a charm! – entropy Nov 24 '16 at 18:18
