I am running Solr as the search engine for an intranet with just over 40,000 docs. I keep it very simple by using the copyField directive to copy the title and the keywords fields to the content field, and I index only that.
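For context, the copy setup in schema.xml looks roughly like the sketch below. The field names title, keywords and content are the real ones; the type names "string" and "text" and the stored/multiValued settings are placeholders I am assuming for illustration, not a verbatim copy of my schema.

```xml
<!-- schema.xml fragment (sketch); type names and attributes are assumed placeholders -->
<fields>
  <!-- source fields: kept for display, not searched directly -->
  <field name="title"    type="string" indexed="false" stored="true" />
  <field name="keywords" type="string" indexed="false" stored="true" multiValued="true" />
  <!-- the only indexed field; multiValued so it can receive several copies -->
  <field name="content"  type="text"   indexed="true"  stored="true" multiValued="true" />
</fields>

<!-- copy title and keywords into content at index time -->
<copyField source="title"    dest="content" />
<copyField source="keywords" dest="content" />
```

The analyzer shown in this post belongs to the field type used by content, so it is applied to the copied title and keywords text as well.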
Until now we were using this config:
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="German" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
That worked pretty well, but there were complaints that the wildcard had to be added manually to each query. So I added the NGramFilterFactory as the last line in the analyzer:
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="German" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="30" />
</analyzer>
The problem now is: with the old config I used to find 7 docs with a certain keyword ('Sony'). Now there are only 2. I completely flushed the index and rebuilt it from scratch. When I take that line out again and reindex the docs, it works as expected again. That leads me to the questions I have:
- Is the NGramFilterFactory the right thing for me, or should it be the n-gram tokenizer factory instead? And if the tokenizer: can it run after the filters?
- I am adding the docs as XML in batches of 75 docs and doing a single commit at the very end. Should I commit more often?
- There was another question that I have forgotten for now .. grr
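To make the second question concrete, each batch I post to the update handler looks roughly like the sketch below, and the commit goes out once as its own message after the last batch. The field values here are made-up placeholders, not real documents from the intranet.

```xml
<!-- message 1..n: one batch of documents (values are placeholders) -->
<add>
  <doc>
    <field name="title">Example title</field>
    <field name="keywords">Sony</field>
    <field name="content">Example body text</field>
  </doc>
  <!-- ... further <doc> elements, 75 per batch ... -->
</add>

<!-- final message, sent separately after the last batch -->
<commit/>
```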
Thanks in advance!