
I am searching for a phrase in the Name field in Solr. I tried different configurations for Name: the string type, or a custom TextField.

  <fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true" />
  <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.TrimFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" replace="all" replacement="" pattern="([^a-z])"/>
    </analyzer>
  </fieldType>

I defined Name like this:

then tried like string:

I also tried different combinations of tokenizers and filters, without success.

This is what I want: I search for the phrase 'test split', and I have some entries with Name 'test', 'test 124', 'testblablabla' and 'test split 124'. What I found is that the 'test' entry is the first match in my example, and 'test split' ranks much, much lower although it has more matching letters. Why is that?

I am testing using the Solr admin interface, and my q (query) parameter is: Name:*test split*

EDIT 1:

I also tried to create a copyField called ExactName, which has this configuration:

  <fieldType name="exact" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

and I search like this:

Name:*test split* OR (ExactName:*test split*)^5.0 

Still 'test' comes much before 'test split' :(

Vlado Pandžić
  • Have you tried using the `pf`, `pf2` or `pf3` parameters for the (e)dismax handlers? Those are created to give boosts to matching sequences. Also remember that wild card searches will skip most parts of the analysis chain. – MatsLindh Sep 28 '17 at 13:23
  • What should I write there? My field name is 'Name'. I tried, but still nothing. – Vlado Pandžić Sep 28 '17 at 13:28
  • The ranking of the documents has to do with the Lucene scoring function, which was [`tf-idf`](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) before Solr 6 and [`BM25`](https://en.wikipedia.org/wiki/Okapi_BM25) after. [Here is a good post comparing the two.](http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/) – qbzenker Sep 28 '17 at 19:26

2 Answers


First of all, what do you want? Do you want to return only results for your phrase, or to boost phrase matches above other types of matches?

The edismax query parser (and its parameters) is probably your solution. You can play with the mm parameter (which configures the minimum number of clauses that must match) and the pf parameter (which boosts phrase matches). [1] [2]

If you just want the phrase to match, the query "test split" should do the trick. Don't use * wildcard queries; use proper analysis to split the tokens. Wildcard queries are very inefficient in general.
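
As a rough sketch of the above, the edismax parameters could be set as request handler defaults in solrconfig.xml. This assumes the tokenized ExactName field from the question is the one being searched (pf needs a tokenized field, so a string or KeywordTokenizer field will not help here); the boost of 5 and the mm value of 1 are only illustrative, not recommended settings.

```xml
<!-- Sketch only: field names, the boost and the mm value are assumptions. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Use the extended DisMax query parser -->
    <str name="defType">edismax</str>
    <!-- Fields the individual query terms are matched against -->
    <str name="qf">ExactName</str>
    <!-- Extra boost for documents where the whole query appears as a phrase -->
    <str name="pf">ExactName^5</str>
    <!-- Minimum number of optional clauses that must match -->
    <str name="mm">1</str>
  </lst>
</requestHandler>
```

With defaults like these, the query would simply be q=test split, without wildcards and without a field prefix. The same parameters can also be passed on the request itself instead of in the handler defaults.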

[1] https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html

[2] https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Thepf_PhraseFields_Parameter

  • +1 for "wildcard queries are very inefficient in general.". Comes only with tons of hands-on experience. – BB23850 May 24 '20 at 05:21

Your approach to solving this problem is actually correct. There are multiple ways to do this. It is possible to solve it at query time by boosting span queries, but it is more efficient to also do this at indexing time.

What is often done for name searching is indeed boosting phrases. You could add a filter to the exact fieldType. Check out shingles with the Shingle Filter, which has a default minShingleSize of 2. Shingles are token n-grams.

You could also create a fieldType without lowercasing by adding an extra copyField, again with the Shingle Filter.

Then boosting the fields is the next step. If you use the eDisMax query parser, you could use the qf parameter to boost the fields:
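
A minimal sketch of such a shingled fieldType might look like the following. The names exactShingle and NameShingle are hypothetical, and the attribute values shown are simply the Shingle Filter defaults written out explicitly.

```xml
<!-- Sketch only: type and field names are made up for illustration. -->
<fieldType name="exactShingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits word bigrams such as "test split" in addition to the single tokens -->
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2" maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
</fieldType>

<field name="NameShingle" type="exactShingle" indexed="true" stored="false"/>
<copyField source="Name" dest="NameShingle"/>
```

For a Name of 'test split 124' this analyzer would index the tokens test, split and 124 plus the shingles "test split" and "split 124". A case-sensitive variant would be the same type without the LowerCaseFilterFactory line.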

  • Case-sensitive (no lower-casing) + shingles has highest boost
  • Case-insensitive (with lower-casing) + shingles with 2nd highest boost
  • Standard field without boost.
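
To make the list above concrete: per-field boosts in eDisMax are normally written in the qf parameter (and in pf for whole-phrase matches). The sketch below assumes two hypothetical shingled copies of Name, called NameShingleCs (case-sensitive) and NameShingle (lower-cased); the boost values are arbitrary and would need tuning against real data.

```xml
<!-- Sketch only: NameShingleCs and NameShingle are hypothetical fields. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Highest boost: case-sensitive + shingles; then case-insensitive + shingles;
         the plain Name field gets no boost -->
    <str name="qf">NameShingleCs^10 NameShingle^5 Name</str>
    <!-- Same ordering applied to whole-phrase matches -->
    <str name="pf">NameShingleCs^10 NameShingle^5</str>
  </lst>
</requestHandler>
```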
drjz