
I'm using Solr to search documents in English and Korean. Korean search is working fine so far, but I need to extend English exact-phrase search so that it also matches partial words.

The Solr query I used:

content: "He go"

does not match He goes, He gone, He goal, etc.

I also tried the following, but neither worked:

content: "He go"*
content: "He go*"

Current field schema

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.CJKBigramFilterFactory" han="false" hiragana="false" katakana="false" hangul="true" outputUnigrams="true" />
    </analyzer>
</fieldType>

My input and expected output are:

Input: "He go" (with quotes)
Output: He goes, He gone, He goals (should match docs containing those words; a partial match is fine)

How can I achieve this? Any suggestions are highly appreciated.

2 Answers

If you want to search by parts of a word, you need to apply an n-gram tokenizer, for example: <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="10"/>

For example, with minGramSize="4" and maxGramSize="5":

In: "bicycle"

Out: "bicy", "bicyc", "icyc", "icycl", "cycl", "cycle", "ycle"
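To see which grams a given size range produces, here is a minimal Python sketch of character n-gram generation. It is only an illustration of the idea, not Solr's implementation; the function name `char_ngrams` is made up for this example. A size range of 4 to 5 reproduces the "bicycle" output listed above, while 3 to 10 yields more and shorter grams.

```python
def char_ngrams(text, min_size, max_size):
    """Return every substring of text whose length is in [min_size, max_size].

    Grams are ordered by start position, then by length.
    """
    grams = []
    for start in range(len(text)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


# Size range 4..5 for "bicycle"
print(char_ngrams("bicycle", 4, 5))

# Size range 3..10 also contains the 3-letter grams such as "bic" and "cle"
print(char_ngrams("bicycle", 3, 10))
```

The wider the size range, the more terms are indexed per value, so index size grows accordingly.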

That lets you search by part of a word. Apply NGramTokenizerFactory in both the index and query analyzers:

<fieldType name="custome_field_type" class="solr.TextField" positionIncrementGap="100" multiValued="false">
    <analyzer type="index">
      <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="10"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="10"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

With the field type above, the Analysis screen of the Solr admin tool shows the following:

[Screenshot: analysis in the Solr admin tool]

You can also try the query analyzer below instead. Which one fits depends on your requirements.

<analyzer type="query">
  <tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>

You can modify or add field types in your schema.xml and apply them to your field. Once that is done, restart the server and re-index the data. You can then use the Solr admin tool to verify that the field type matches your data as expected.

I used the field type below and ran the analysis in the Solr admin tool:

    <fieldType name="custome_field_type" class="solr.TextField" positionIncrementGap="100" multiValued="false">
        <analyzer type="index">
          <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="10"/>
          <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
   </fieldType>
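The following Python sketch roughly simulates how this field type matches, under simplifying assumptions: the indexed value is lowercased and split into character n-grams of 3 to 10 characters (spaces included), the query is lowercased and kept as a single token, and the stop filter is ignored. The helper names `char_ngrams` and `matches` are hypothetical and exist only for this illustration.

```python
def char_ngrams(text, min_size, max_size):
    """Return every substring of text whose length is in [min_size, max_size]."""
    return [
        text[start:start + size]
        for start in range(len(text))
        for size in range(min_size, max_size + 1)
        if start + size <= len(text)
    ]


def matches(doc_text, query, min_size=3, max_size=10):
    """True if the whole lowercased query equals one of the indexed grams."""
    indexed = set(char_ngrams(doc_text.lower(), min_size, max_size))
    return query.lower() in indexed


print(matches("He goes", "He go"))        # the gram "he go" is indexed
print(matches("He went", "He go"))        # no such gram
print(matches("She goes", "He go"))       # substring match inside "she goes"
print(matches("He goes to school", "He goes to school"))  # query longer than max_size
```

Two consequences are worth checking against your requirements: the match is a plain substring match, so "She goes" also matches "He go", and a query longer than maxGramSize (or shorter than minGramSize) cannot match any indexed gram.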

Here is the analysis from the Solr admin tool:

[Screenshot: Solr Analysis page]

Abhijit Bashetti

The Complex Phrase Query Parser supports inline wildcards in a phrase. In your case, adding inOrder=true to the parser's local parameters will give you the behavior you want, e.g. {!complexphrase inOrder=true}content:"he go*".
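As a rough sketch of how a client could build such a request, the Python snippet below assembles the `q` parameter using Solr's local-parameter syntax and URL-encodes it. The field name `content` comes from the question; the helper `complexphrase_params` is hypothetical, it does not escape quotes inside the phrase, and the phrase is written in lowercase on the assumption that the indexed terms are lowercased.

```python
from urllib.parse import urlencode


def complexphrase_params(field, phrase, in_order=True):
    """Build the q parameter for a complexphrase query on a single field."""
    local_params = "{!complexphrase inOrder=%s}" % ("true" if in_order else "false")
    return {"q": '%s%s:"%s"' % (local_params, field, phrase)}


params = complexphrase_params("content", "he go*")
print(params["q"])

# URL-encoded form, to be appended to the select handler URL of your core
query_string = urlencode(params)
print(query_string)
```

The encoded string can be appended to your core's select URL or passed as request parameters by whatever HTTP client you use.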

There are a few limitations that you should be aware of:

Performance is sensitive to the number of unique terms that are associated with a pattern. For instance, searching for "a*" will form a large OR clause (technically a SpanOr with many terms) for all of the terms in your index for the indicated field that start with the single letter 'a'. It may be prudent to restrict wildcards to at least two or preferably three letters as a prefix. Allowing very short prefixes may result in too many low-quality documents being returned.

Note that it also supports leading wildcards such as "*a", with corresponding performance implications. Applying ReversedWildcardFilterFactory in index-time analysis is usually a good idea.
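One way to act on the prefix-length advice is to check user input on the client before sending it to Solr. The sketch below is only an assumption about how such a check might look; `short_wildcard_terms` is a hypothetical helper and the threshold is yours to choose. It splits the phrase on whitespace and flags wildcard terms whose literal part is shorter than the minimum.

```python
import re


def short_wildcard_terms(phrase, min_prefix=3):
    """Return wildcard terms whose non-wildcard part is shorter than min_prefix."""
    offenders = []
    for term in phrase.split():
        if "*" in term or "?" in term:
            literal = re.sub(r"[*?]", "", term)
            if len(literal) < min_prefix:
                offenders.append(term)
    return offenders


print(short_wildcard_terms("he go*"))                # "go*" has only two literal letters
print(short_wildcard_terms("he goe*"))               # long enough with the default threshold
print(short_wildcard_terms("he go*", min_prefix=2))  # accepted with a threshold of two
```

Note that the phrase from the question, "he go*", has a two-letter prefix, so it passes only if you settle on a minimum of two letters.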

MatsLindh