I have one index "name_and_title_index" with two fields "name" and "title".

indextool gives me this information on the keywords I am interested in:

keyword ,docs    ,hits    ,offset
word7   ,56      ,57      ,519386707
word8   ,154     ,161     ,475390304
word2   ,2438    ,2597    ,14258546
word3   ,26599   ,29074   ,68018978
word5   ,475349  ,656569  ,191390685
word1   ,645079  ,881965  ,303666122
word6   ,1089457 ,1435180 ,350540391

indexed_documents - 10742342, total keywords - 1379888

It seems I do not understand rankers, since all of them return results in a different order than I expect.

I expect any result containing word7 to have a higher weight, since word7 occurs in only 56 docs out of 10.7M.
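
As a rough sketch of that expectation, the snippet below computes a textbook inverse document frequency, log(N / n), from the indextool figures above. The formula and the names `plain_idf`, `TOTAL_DOCS` and `DOC_COUNTS` are assumptions made for illustration; Sphinx may normalize or scale its IDF differently, so these values are not expected to match what the rankers report.

```python
import math

# Figures copied from the indextool output above.
TOTAL_DOCS = 10_742_342
DOC_COUNTS = {
    "word7": 56,
    "word8": 154,
    "word2": 2438,
    "word3": 26599,
    "word5": 475349,
    "word1": 645079,
    "word6": 1089457,
}


def plain_idf(total_docs, docs_with_term):
    """Textbook IDF, log(N / n). Assumption: not necessarily Sphinx's own formula."""
    return math.log(total_docs / docs_with_term)


# Compute the IDF per keyword and list the keywords from rarest to most common.
idf = {kw: plain_idf(TOTAL_DOCS, n) for kw, n in DOC_COUNTS.items()}
for kw, value in sorted(idf.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{kw}: {value:.3f}")
```

Under this textbook definition word7 gets the largest value and word6 the smallest, which is the ordering of keyword importance I had in mind.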

The SphinxQL is:

SELECT 
    ID, 
    WEIGHT(), 
    SNIPPET(name, 'word1 word2 word3 word4 word5 word6') AS _name, 
    SNIPPET(title, 'word7 word8 word9') AS _title 
FROM 
    name_and_title_index 
WHERE 
    MATCH('@name "word1 word2 word3 word4 word5 word6"/0.5 @title "word7 word8 word9"/0.5')

Different rankers give me the following results:

RANKER=PROXIMITY_BM25;

| 1 | 6546 | _ <b>word6</b> <b>word1</b> <b>word2</b> <b>word3</b>         | _ _ <b>word8</b> _ _ <b>word7</b>    |
| 4 | 6528 | _ _ _ _ _ _ _ _ <b>word2</b> <b>word3</b> <b>word4</b> _      | _ _ <b>word8</b> _ _ _ _ _ ...       |
| 2 | 4521 | <b>word5</b> <b>word6</b> _ _ _ _ _ _ <b>word1</b> _ _        | _ <b>word7</b> _ _ _ _ _ _ _ _ ...   |
| 3 | 4520 | <b>word5</b> _ <b>word1</b> _ _ _ _ _ <b>word6</b> _ _        | _ _ _ _ _ _ _ _ _ _ _ _ <b>word7</b> |
| 5 | 4519 | <b>word1</b> _ _ _ _ _ <b>word5</b> <b>word6</b> _ _ _ _      | _ _ _ _ _ _ <b>word8</b> _ _ _ _ _ _ |
| 6 | 2520 | <b>word5</b> _ _ _ _ _ ... _ _ _ _ <b>word6</b> _ _ _ _ _ ... | ... _ _ _ _ _ _ _ <b>word8</b> _ _   |


RANKER=BM25;

| 1 | 2546 | _ <b>word6</b> <b>word1</b> <b>word2</b> <b>word3</b>         | _ _ <b>word8</b> _ _ <b>word7</b>    |
| 4 | 2528 | _ _ _ _ _ _ _ _ <b>word2</b> <b>word3</b> <b>word4</b> _      | _ _ <b>word8</b> _ _ _ _ _ ...       |
| 2 | 2521 | <b>word5</b> <b>word6</b> _ _ _ _ _ _ <b>word1</b> _ _        | _ <b>word7</b> _ _ _ _ _ _ _ _ ...   |
| 3 | 2520 | <b>word5</b> _ <b>word1</b> _ _ _ _ _ <b>word6</b> _ _        | _ _ _ _ _ _ _ _ _ _ _ _ <b>word7</b> |
| 5 | 2520 | <b>word1</b> _ _ _ _ _ <b>word5</b> <b>word6</b> _ _ _ _      | _ _ _ _ _ _ <b>word8</b> _ _ _ _ _ _ |
| 6 | 2519 | <b>word5</b> _ _ _ _ _ ... _ _ _ _ <b>word6</b> _ _ _ _ _ ... | ... _ _ _ _ _ _ _ <b>word8</b> _ _   |



RANKER=SPH04;

| 4 | 16528 | _ _ _ _ _ _ _ _ <b>word2</b> <b>word3</b> <b>word4</b> _      | _ _ <b>word8</b> _ _ _ _ _ ...       |
| 1 | 14546 | _ <b>word6</b> <b>word1</b> <b>word2</b> <b>word3</b>         | _ _ <b>word8</b> _ _ <b>word7</b>    |
| 2 | 14521 | <b>word5</b> <b>word6</b> _ _ _ _ _ _ <b>word1</b> _ _        | _ <b>word7</b> _ _ _ _ _ _ _ _ ...   |
| 3 | 14520 | <b>word5</b> _ <b>word1</b> _ _ _ _ _ <b>word6</b> _ _        | _ _ _ _ _ _ _ _ _ _ _ _ <b>word7</b> |
| 5 | 14519 | <b>word1</b> _ _ _ _ _ <b>word5</b> <b>word6</b> _ _ _ _      | _ _ _ _ _ _ <b>word8</b> _ _ _ _ _ _ |
| 6 | 10520 | <b>word5</b> _ _ _ _ _ ... _ _ _ _ <b>word6</b> _ _ _ _ _ ... | ... _ _ _ _ _ _ _ <b>word8</b> _ _   |

Why is result 4 always ranked higher than results 2 and 3 (and, with SPH04, higher than result 1)?

Anton
  • Have you tried using the `packedfactors()` function to extract detailed information about the ranking of each document? That might help explain some of this. – barryhunter Oct 13 '16 at 10:31
  • For bm25 (since packedfactors() works only if an expression ranker is specified) I get: word0=(tf=0, idf=0.009163), word1=(tf=1, idf=0.011779), word2=(tf=1, idf=0.011624), word3=(tf=1, idf=0.014978), word5=(tf=0, idf=0.009976), word6=(tf=0, idf=0.010907), word7=(tf=0, idf=0.017064), word10=(tf=1, idf=0.015675). Note that the word# labels here do not correspond to the word# labels in the MATCH query. None of these IDF values is anywhere near what I would expect for a keyword that occurs in 56 docs out of 10.7M. – Anton Oct 13 '16 at 22:02
  • Have you read about the 'idf' option? It lets you change how IDF is computed: http://sphinxsearch.com/docs/current.html#sphinxql-select. For legacy reasons it might be using an unconventional algorithm by default. – barryhunter Oct 14 '16 at 14:02
