
In Solr (3.3), is it possible to make a field searchable letter by letter through an EdgeNGramFilterFactory and also have it respond to phrase queries?

For example, I'm looking for a field that, if it contains "contrat informatique", will be found when the user types:

  • contrat
  • informatique
  • contr
  • informa
  • "contrat informatique"
  • "contrat info"

Currently, I have something like this:

<fieldtype name="terms" class="solr.TextField">
    <analyzer type="index">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
    </analyzer>
    <analyzer type="query">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    </analyzer>
</fieldtype>

...but it fails on phrase queries.

When I look at the schema analyzer in the Solr admin, I see that "contrat informatique" generates the following tokens:

[...] contr contra contrat in inf info infor inform [...]

So the query works with "contrat in" (consecutive tokens), but not with "contrat inf" (because these two tokens are not adjacent).

I'm pretty sure some kind of stemming can work with phrase queries, but I cannot find the right tokenizer or filter to use before the EdgeNGramFilterFactory.
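To make the failure concrete, here is a small plain-Java sketch (no Lucene dependency) of how positions come out when every gram advances the position by one, as the analysis output suggests. The class and method names are hypothetical, and the phrase check is only a simplified stand-in for an exact phrase query with slop 0.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class EdgeGramPositionsDemo {

    // Index front edge n-grams, giving EVERY gram its own position
    // (the behaviour observed in the analysis screen).
    static Map<String, List<Integer>> index(String text, int minGram, int maxGram) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int pos = 0;
        for (String word : text.toLowerCase().split("\\s+")) {
            for (int n = minGram; n <= Math.min(maxGram, word.length()); n++) {
                positions.computeIfAbsent(word.substring(0, n), k -> new ArrayList<>()).add(pos++);
            }
        }
        return positions;
    }

    // Simplified exact phrase match (slop 0): term i must sit at start + i.
    static boolean phraseMatches(Map<String, List<Integer>> index, String... terms) {
        List<Integer> none = Collections.emptyList();
        for (int start : index.getOrDefault(terms[0], none)) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                ok = index.getOrDefault(terms[i], none).contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> idx = index("contrat informatique", 2, 15);
        System.out.println(phraseMatches(idx, "contrat", "in"));  // true
        System.out.println(phraseMatches(idx, "contrat", "inf")); // false
    }
}
```

In this model "contrat" lands at position 5, "in" at 6 and "inf" at 7, which is why only the first phrase matches.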

Triad sou.
Xavier Portebois

4 Answers


Exact phrase search does not work because the query slop is 0 by default. When you search for a phrase such as "Hello World", Solr looks for terms at consecutive positions, and EdgeNGramFilter gives every gram its own position. I wish EdgeNGramFilter had a parameter to control output positions; this looks like an old question.

By setting the qs parameter to a very high value (more than the maximum distance between n-grams) you can get phrases back. This only partially solves the problem: phrases are matched, but not exactly, and permutations will be found as well. So a search for "contrat informatique" would also match text like "...contrat abandoned. Informatique..."
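The trade-off can be sketched in plain Java (no Lucene dependency). This is only a rough model under stated assumptions: every gram takes its own position, only the first occurrence of each term is considered, and the slop check covers just two terms in order. `SlopDemo` and its methods are hypothetical names.

```java
class SlopDemo {

    // Position of the first gram equal to `term`, when every front edge gram
    // of every word takes its own position; -1 if the term is never produced.
    static int positionOf(String text, String term, int minGram, int maxGram) {
        int pos = 0;
        for (String word : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            for (int n = minGram; n <= Math.min(maxGram, word.length()); n++, pos++) {
                if (word.substring(0, n).equals(term)) {
                    return pos;
                }
            }
        }
        return -1;
    }

    // Simplified in-order sloppy match for two terms: the number of
    // positions between them must not exceed the slop.
    static boolean matches(String text, String first, String second, int slop) {
        int a = positionOf(text, first, 2, 15);
        int b = positionOf(text, second, 2, 15);
        return a >= 0 && b > a && (b - a - 1) <= slop;
    }

    public static void main(String[] args) {
        String wanted = "contrat informatique";
        String unwanted = "contrat abandoned. Informatique";
        System.out.println(matches(wanted, "contrat", "informatique", 0));    // false
        System.out.println(matches(wanted, "contrat", "informatique", 50));   // true
        System.out.println(matches(unwanted, "contrat", "informatique", 50)); // true
    }
}
```

With slop 0 even the intended document is missed, and once the slop is large enough to bridge the grams, the unrelated document matches too.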


To support exact phrase queries, I ended up using separate fields for the n-grams.

Steps required:

Define separate field types to index regular values and grams:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="ngrams" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Tell Solr to copy fields when indexing:

You can define a separate n-gram copy of each field:

<field name="contact_ngrams" type="ngrams" indexed="true" stored="false"/>
<field name="product_ngrams" type="ngrams" indexed="true" stored="false"/>
<copyField source="contact_text" dest="contact_ngrams"/>
<copyField source="product_text" dest="product_ngrams"/>

Or you can put all ngrams into one field:

<field name="heap_ngrams" type="ngrams" indexed="true" stored="false"/>
<copyField source="*_text" dest="heap_ngrams"/>

Note that you will not be able to boost the source fields separately in this case.

The last step is to specify the n-gram fields and boosts in the query. One way is to do this in your application. Another way is to specify "appends" params in solrconfig.xml:

   <lst name="appends">
     <str name="qf">heap_ngrams</str>
   </lst>
Community
Grimmo

Alas, I could not get PositionFilter to work the way Jayendra Patil suggested (PositionFilter turns any query into an OR boolean query), so I used a different approach.

Still using EdgeNGramFilter, I made every keyword the user types mandatory and disabled all phrases.

So if the user asks for "cont info", it is transformed into +cont +info. It's a bit more permissive than a true phrase would be, but it does what I want (and doesn't return results containing only one of the two terms).

The only drawback of this workaround is that terms can be permuted in the results (so a document with "informatique contrat" will also be found), but that's not a big deal.
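A minimal sketch of that manual rewrite, in plain Java. The class and method names are hypothetical, and it deliberately does nothing beyond dropping double quotes and prefixing each term with a plus sign; in particular it does not escape other query-syntax characters, which a real application would probably need to handle.

```java
import java.util.ArrayList;
import java.util.List;

class MandatoryTermsRewriter {

    // Drops double quotes, splits on whitespace and marks every term as required.
    static String rewrite(String userInput) {
        List<String> clauses = new ArrayList<>();
        for (String term : userInput.replace("\"", " ").trim().split("\\s+")) {
            if (!term.isEmpty()) {
                clauses.add("+" + term);
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(rewrite("\"cont info\"")); // +cont +info
        System.out.println(rewrite("contrat"));       // +contrat
    }
}
```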

Xavier Portebois
  • Hi, Xavier. Can you please explain how you transformed "cont info" into +cont +info? Is there an out-of-the-box utility class for this, or do you just detect the double quotes and transform the query manually? I'm trying to solve this: http://stackoverflow.com/questions/37033381/solr-search-field-best-practices – wattale May 05 '16 at 14:57
  • It was a manual operation: looking for double quotes and adding the plus signs. I didn't find anything that could automate this for me :-/ – Xavier Portebois May 06 '16 at 08:24
  • Thanks for the reply, Xavier. Even after digging through a lot of material, I couldn't find an out-of-the-box solution either. I thought I was reinventing the wheel by doing this manually, but I guess doing it manually is the only option available :| – wattale May 08 '16 at 12:20

I've made a fix to EdgeNGramFilter so that positions within a token are no longer incremented:

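// Imports for both classes. The package names are an assumption (Lucene/Solr 4.x,
// where factories still have init(Map)); adjust them to your version.
// Each public class goes in its own file.
import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class CustomEdgeNGramTokenFilterFactory extends TokenFilterFactory {
    private int maxGramSize = 0;
    private int minGramSize = 0;

    // Reads minGramSize/maxGramSize from the schema, falling back to Lucene's defaults.
    @Override
    public void init(Map<String, String> args) {
        super.init(args);
        String maxArg = args.get("maxGramSize");
        maxGramSize = (maxArg != null ? Integer.parseInt(maxArg)
                : EdgeNGramTokenFilter.DEFAULT_MAX_GRAM_SIZE);

        String minArg = args.get("minGramSize");
        minGramSize = (minArg != null ? Integer.parseInt(minArg)
                : EdgeNGramTokenFilter.DEFAULT_MIN_GRAM_SIZE);
    }

    @Override
    public CustomEdgeNGramTokenFilter create(TokenStream input) {
        return new CustomEdgeNGramTokenFilter(input, minGramSize, maxGramSize);
    }
}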
public class CustomEdgeNGramTokenFilter extends TokenFilter {
    private final int minGram;
    private final int maxGram;
    private char[] curTermBuffer;
    private int curTermLength;
    private int curGramSize;

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final PositionIncrementAttribute positionIncrementAttribute = addAttribute(PositionIncrementAttribute.class);

    /**
     * Creates EdgeNGramTokenFilter that can generate n-grams in the sizes of the given range
     *
     * @param input   {@link org.apache.lucene.analysis.TokenStream} holding the input to be tokenized
     * @param minGram the smallest n-gram to generate
     * @param maxGram the largest n-gram to generate
     */
    public CustomEdgeNGramTokenFilter(TokenStream input, int minGram, int maxGram) {
        super(input);

        if (minGram < 1) {
            throw new IllegalArgumentException("minGram must be greater than zero");
        }

        if (minGram > maxGram) {
            throw new IllegalArgumentException("minGram must not be greater than maxGram");
        }

        this.minGram = minGram;
        this.maxGram = maxGram;
    }

    @Override
    public final boolean incrementToken() throws IOException {
        while (true) {
            // Only the first gram of a token carries the token's own increment;
            // later grams get 0, so they stay at the same position.
            int positionIncrement = 0;
            if (curTermBuffer == null) {
                if (!input.incrementToken()) {
                    return false;
                }
                positionIncrement = positionIncrementAttribute.getPositionIncrement();
                curTermBuffer = termAtt.buffer().clone();
                curTermLength = termAtt.length();
                curGramSize = minGram;
            }
            // Emit the next gram while it fits both the size range and the term length.
            if (curGramSize <= maxGram && curGramSize <= curTermLength) {
                // grab curGramSize chars from the front
                int start = 0;
                int end = start + curGramSize;
                offsetAtt.setOffset(start, end);
                positionIncrementAttribute.setPositionIncrement(positionIncrement);
                termAtt.copyBuffer(curTermBuffer, start, curGramSize);
                curGramSize++;

                return true;
            }
            // No more grams for this token: move on to the next input token.
            curTermBuffer = null;
        }
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        curTermBuffer = null;
    }
}
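To illustrate why keeping all grams of a word at one position helps, here is a plain-Java model (no Lucene dependency) of the resulting index. It is a sketch under assumptions: `SamePositionGramsDemo` is a hypothetical name, the phrase check is a simplified exact match, and a word shorter than `minGram` still occupies a position here, which may differ from what the filter above does.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SamePositionGramsDemo {

    // One set of grams per word: every gram of a word shares that word's position.
    static List<Set<String>> index(String text, int minGram, int maxGram) {
        List<Set<String>> gramsAtPosition = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            Set<String> grams = new HashSet<>();
            for (int n = minGram; n <= Math.min(maxGram, word.length()); n++) {
                grams.add(word.substring(0, n));
            }
            gramsAtPosition.add(grams);
        }
        return gramsAtPosition;
    }

    // Simplified exact phrase match: term i must be found at position start + i.
    static boolean phraseMatches(List<Set<String>> index, String... terms) {
        for (int start = 0; start + terms.length <= index.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < terms.length && ok; i++) {
                ok = index.get(start + i).contains(terms[i]);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Set<String>> idx = index("contrat informatique", 2, 15);
        System.out.println(phraseMatches(idx, "contrat", "inf"));          // true
        System.out.println(phraseMatches(idx, "informatique", "contrat")); // false
    }
}
```

In this model prefix phrases such as "contrat inf" match, while reversed or interrupted word sequences do not.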

Here is what I was thinking:
For the n-grams to match as a phrase, the tokens generated for each word should share the same position.
I checked the edge n-gram filter: it increments the position for every token, and I didn't find any parameter to prevent that.
There is a position filter available, which keeps every token at the same position as the first one.
So with the following configuration all tokens are at the same position and the phrase query matches (tokens at the same position are matched as phrases).
I checked it with the analysis tool and the queries matched.

So you might want to try this hint:

<analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <charFilter class="solr.MappingCharFilterFactory" 
            mapping="mapping-ISOLatin1Accent.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
            generateNumberParts="1" catenateWords="1" catenateNumbers="1" 
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" 
            maxGramSize="15" side="front"/>
    <filter class="solr.PositionFilterFactory" />
</analyzer>
Daniel Rikowski
Jayendra
  • The idea is neat, but doesn't seem to work anyway :-/ Even if I got matches through the admin analysis tool, a real query returns nothing (probably because in the analysis tool, the way it highlights tokens doesn't bother with phrases). Also, [PositionFilter](http://tinyurl.com/solr-positionfilter) makes the query _boolean_ as said on the wiki, so "contrat informatique" or even "+contrat +informatique" returns documents with "contrat" but also without "informatique" as the default operator is a OR. A possible alternative would be to transform the query in +contrat +informatique, I think. – Xavier Portebois Oct 03 '11 at 08:19