I am running Solr as the search engine for an intranet with just over 40,000 docs. I keep it very simple: a copyField directive copies the title and keywords fields into the content field, and I index only that field.

Until now we were using this config:

<analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory" />              
    <filter class="solr.SnowballPorterFilterFactory" language="German" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

That worked pretty well, but there were complaints that wildcards had to be added manually. So I added the NGramFilterFactory as the last filter in the analyzer:

<analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory" />              
    <filter class="solr.SnowballPorterFilterFactory" language="German" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="30" />
</analyzer>

The problem now is this: with the old config I found 7 docs for a certain keyword ('Sony'). Now there are only 2. I completely flushed the index and rebuilt it from scratch. When I take that line out again and reindex the docs, it works as expected again. That leads me to my questions:

  • Is the filter factory (NGramFilterFactory) the right thing for me, or should it be the tokenizer factory (NGramTokenizerFactory)? And if the tokenizer: can it run after the filters?
  • I am adding the docs as XML in batches of 75 docs and doing a commit at the very end. Should there be more commits?
  • There was another one that I have forgotten now .. grr

Thanks in advance!

harpax

2 Answers

Just a wild guess -

What's the size (number of words) of your content field?
Now that you have NGramFilterFactory in your filter chain with a minGramSize of 3, a lot of tokens are going to be generated, each at a new position.

The maxFieldLength setting in solrconfig.xml limits the number of tokens that are indexed for a field.
The default value is 10000, which is fairly high, but it can still be exceeded with large content and an n-gram filter in the filter chain.

<maxFieldLength>10000</maxFieldLength>

Try increasing this value to a high number, reindex, and check whether the matches are found.
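To get a feel for how quickly the n-gram filter can push a field past maxFieldLength, here is a rough Python sketch. It is only an estimate under assumptions of mine: it splits on whitespace, ignores the other filters in the chain (word delimiter, stop words, stemming), assumes words shorter than minGramSize produce no grams, and the function names are made up for illustration.

```python
def ngram_count(word_len, min_gram=3, max_gram=30):
    """Count the character n-grams of sizes min_gram..max_gram in one word.

    Assumption: a word shorter than min_gram yields no grams at all.
    """
    largest = min(max_gram, word_len)
    # A word of length L contains (L - n + 1) substrings of length n.
    return sum(word_len - n + 1 for n in range(min_gram, largest + 1))


def estimate_tokens(text, min_gram=3, max_gram=30):
    """Rough token estimate for a text after whitespace splitting only."""
    return sum(ngram_count(len(word), min_gram, max_gram) for word in text.split())


if __name__ == "__main__":
    # Hypothetical document: 1,000 words of 10 characters each.
    per_word = ngram_count(10)
    print("grams per 10-character word:", per_word)
    print("grams for 1,000 such words:", per_word * 1000)
    print("over the default limit of 10000:", per_word * 1000 > 10000)
```

With these assumptions a 10-character word alone yields 36 grams, so a document of around 300 such words would already pass the default limit, and matches for terms that appear later in the text could be cut off.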

Jayendra
  • A good 'wild guess'. Increasing the limit did the job. Do you know if there is a way to check how many tokens are in the index? – harpax Oct 14 '11 at 15:47
  • You can check the terms in the index. I'm not sure whether you can check them per document and field. Try the Luke tool; it may help you with this. – Jayendra Oct 14 '11 at 19:10

I would highly recommend using the Field Analysis Debugging tool. This is accessible via the Solr Admin site (click the [Analysis] link next to [Config]). This is a very powerful tool where you can see how a text value is broken down into words, and shows the resulting tokens after they pass through each filter in the chain.

With this tool you can take one of the documents that is not returned when you query for "Sony", paste its text into the index field, and enter sony in the query field to see how Solr applies your filters and then matches the query against that field. You can then change your schema back to the original without the NGramFilterFactory, see how the document was originally broken down and matched, and compare how the NGramFilterFactory has affected indexing and querying.

The smaller number of results could be caused by the minGramSize and maxGramSize values that you specified for the NGramFilterFactory. Please refer to the NGramFilterFactory documentation on the Solr Wiki for more details on how these affect indexing.
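As a small illustration of what minGramSize and maxGramSize do to a single token, the following Python sketch lists the grams for a lowercased term. It is not Solr's implementation: the function name is hypothetical, the emission order is only illustrative and may differ between Lucene versions, and it assumes that tokens shorter than minGramSize produce nothing.

```python
def char_ngrams(token, min_gram=3, max_gram=30):
    """Return all character n-grams of token with sizes min_gram..max_gram."""
    grams = []
    largest = min(max_gram, len(token))
    for size in range(min_gram, largest + 1):
        # Slide a window of the current size across the token.
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


if __name__ == "__main__":
    print(char_ngrams("sony"))
    print(char_ngrams("so"))
```

Under these assumptions, "sony" with minGramSize="3" is indexed as "son", "ony" and "sony", while a two-letter fragment such as "so" has no gram to match against.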

Paige Cook
  • I checked the results of that tool but couldn't find an error. Increasing the maxFieldLength as proposed by Jayendra Patil did the job. Thanks for your answer! – harpax Oct 14 '11 at 15:48