In Solr (3.3), is it possible to make a field searchable letter by letter through an EdgeNGramFilterFactory
and also have it respond to phrase queries?
For example, I'm looking for a field that, when it contains "contrat informatique", will be found if the user types any of:
- contrat
- informatique
- contr
- informa
- "contrat informatique"
- "contrat info"
Currently, I have something like this:
<fieldtype name="terms" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
  </analyzer>
</fieldtype>
...but it fails on phrase queries.
When I look at the schema analysis page in the Solr admin, I see that "contrat informatique" generates the following tokens:
[...] contr contra contrat in inf info infor inform [...]
So the query works with "contrat in" (consecutive tokens), but not with "contrat inf" (because these two tokens are not adjacent).
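To make the position problem concrete, here is a rough Python simulation of what I think is happening. It is not Solr code: the function names are made up for illustration, and it assumes that each gram is indexed at its own consecutive position and that a phrase query only matches terms at strictly adjacent positions, which is what the analysis output above suggests.

```python
def edge_ngrams(word, min_gram=2, max_gram=15):
    """Front edge n-grams of a word, shortest first."""
    upper = min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, upper + 1)]


def index_positions(text, min_gram=2, max_gram=15):
    """Assumption: every gram gets its own consecutive position."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(edge_ngrams(word, min_gram, max_gram))
    return list(enumerate(tokens))


def phrase_matches(indexed, phrase):
    """True if the phrase terms occur at strictly consecutive positions."""
    terms = phrase.lower().split()
    for start, token in indexed:
        if token != terms[0]:
            continue
        # Positions equal list indices here, so a slice gives the window.
        window = [tok for _, tok in indexed[start:start + len(terms)]]
        if window == terms:
            return True
    return False


indexed = index_positions("contrat informatique")
print([tok for _, tok in indexed])
print(phrase_matches(indexed, "contrat in"))    # "in" directly follows "contrat"
print(phrase_matches(indexed, "contrat inf"))   # "inf" is one position too far
```

Under these assumptions only the shortest gram of the second word can ever sit next to the full first word, which matches the behaviour I observe.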
I'm pretty sure some kind of stemming can work with phrase queries, but I cannot find the right tokenizer or filter to use before the EdgeNGramFilterFactory.