3

I'm using postgres FTS for prefix searching.
When a match occurs I want to favor an exact match over a prefix match.

A query of work:* should order work first and then workday. This query will give the same rank for both matches, so this won't work

SELECT ts_rank(to_tsvector('simple', 'work'), to_tsquery('simple', 'work:*'))
UNION ALL
SELECT ts_rank(to_tsvector('simple', 'workday'), to_tsquery('simple', 'work:*'));
-- result:
0.06079271
0.06079271

When I add an or clause for work ( trough | work ) , the rank for workday suddenly becomes 0. I'm not understanding why the or clause reduces the rank to 0, since there is still a matching term (work:*) in the query

SELECT ts_rank(to_tsvector('simple', 'work'), to_tsquery('simple', 'work:* | work'))
UNION ALL
SELECT ts_rank(to_tsvector('simple', 'workday'), to_tsquery('simple', ' work:* | work'));
-- result:
0.06079271
0 

When I switch the two or clauses positionally, the result is the same as not adding the clause at all:

SELECT ts_rank(to_tsvector('simple', 'work'), to_tsquery('simple', 'work | work:*'))
UNION ALL
SELECT ts_rank(to_tsvector('simple', 'workday'), to_tsquery('simple', 'work | work:*'));
-- result:
0.06079271
0.06079271

SQL fiddle: http://sqlfiddle.com/#!17/9eecb/89476

My goal is to better understand ts_rank and why the ranking gets impacted in such a dramatic way, potentially reporting any bugs too postgres.

I can not find any mention of the order of the query terms them self to have any impact on ranking.
https://www.postgresql.org/docs/13/textsearch-controls.html#TEXTSEARCH-RANKING

note: I am using ranking to order the results because my query can contain an arbitrary amount of keywords.

Willem D'Haeseleer
  • 19,661
  • 9
  • 66
  • 99

2 Answers2

-1

How about a query that explicitly ranks full matches before prefix matches?

SELECT doc
FROM tab
WHERE to_tsvector('simple', doc) @@ to_tsquery('simple', 'work:*')
ORDER BY NOT to_tsvector('simple', doc) @@ to_tsquery('simple', 'work'),
         ts_rank(to_tsvector('simple', doc), to_tsquery('simple', 'work:*'));

That relies on FALSE < TRUE.

Laurenz Albe
  • 209,280
  • 17
  • 206
  • 263
-1
to_tsquery('simple', 'work:*'), to_tsquery('simple', 'work:*') 

is the wrong way. Because :
select ('simple', 'work:*')::tsquery; will yield error. So:
select to_tsquery('simple', 'work:*'); == select to_tsquery( 'work:*');


why SELECT ts_rank(to_tsvector('workday') ,( 'work'::tsquery)); yield 0 Because workday and work is two different lexlexemes


Translate your question: why SELECT ts_rank(to_tsvector('workday') ,( 'work'::tsquery)); yield 0
select ts_rank(to_tsvector('workday'),to_tsquery('work:*')); yield 0.06079271
https://www.postgresql.org/docs/current/textsearch-controls.html. The order is important.

Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first. PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important is the part of the document where they occur.

However there is another function ts_rank_cd

select ts_rank_cd(to_tsvector('workday'),to_tsquery('work | work:*')); 

return 0.1

select ts_rank_cd(to_tsvector('workday'),to_tsquery('work:* | work'));

return 0.1

For now I don't know 0.1 mean. usefullink: https://linuxgazette.net/164/sephton.html
https://www.postgresql.org/message-id/flat/Pine.LNX.4.64.0709111118150.2767%40sn.sai.msu.ru#f5da1024abc8a2fc85814fd49a8bd71d
http://www.sai.msu.su/~megera/wiki/NewExtentsBasedRanking
https://www.postgresql.eu/events/pgconfeu2018/sessions/session/2116/slides/137/pgconf.eu-2018-fts.pdf
https://www.postgresql.org/message-id/CAAMbBLtckoNmFp6irVXsKcZufM6_DiQGN2vtHhWVgofESsE11g@mail.gmail.com


SELECT ts_rank(to_tsvector('simple', 'workday'), to_tsquery('simple', ' work:* | work'))

The Same as

SELECT ts_rank(to_tsvector('workday'), to_tsquery(' work:* | work'));

please read the manual. I also explained. If you want simple workday, you need use operator. 'simple & workday'

Order matters, then the result is inconsistent.
So You want consistent result use ts_rank_cd

jian
  • 4,119
  • 1
  • 17
  • 32
  • `to_tsquery('simple', 'work:*')` is valid use of ts_query and does not cause an error, it is not clear to me how this relates to the problem/solution. Could you clarify why `SELECT ts_rank(to_tsvector('simple', 'workday'), to_tsquery('simple', ' work:* | work'));` returns 0? – Willem D'Haeseleer Feb 27 '22 at 22:03
  • it' valid but it's same as `select to_tsquery( 'work:*');`, there is no this way, `to_tsquery('simple', 'work:*')` in the manual. – jian Feb 28 '22 at 03:49
  • The first parameter is the config_name. Please see this sql fiddle: http://sqlfiddle.com/#!17/9eecb/89810 See usage here: https://www.postgresql.org/docs/13/textsearch-tables.html. I would encourage you to correct/remove your answer – Willem D'Haeseleer Feb 28 '22 at 06:23
  • @WillemD'Haeseleer Thanks. I know what simple mean: simple dictionary: just lower case and check for stopword. But still the tsquery order matters, as per manual. here also have examples: https://stackoverflow.com/questions/36739215/postgresql-making-ts-rank-take-the-ts-vector-position-as-is-or-defining-a-cust – jian Feb 28 '22 at 06:45
  • @WillemD'Haeseleer correct me if i wrong. Full text search based on lexemes. I guess that means `select ts_rank(to_tsvector('simple', 'hello'), to_tsquery('simple', 'hell'));` will return 0. So `select ts_rank(to_tsvector('simple', 'workday'), to_tsquery('simple', 'work'));` will also return 0. – jian Feb 28 '22 at 06:53
  • Correct, this question specifically is about prefix queries though. So this `SELECT ts_rank(to_tsvector('simple', 'workday'), to_tsquery('simple', 'work:*'));` will return `0.06079271` but somehow this `SELECT ts_rank(to_tsvector('simple', 'workday'), to_tsquery('simple', ' work:* | work'));` returns 0, even though this query matches not less or more terms, why is that ? – Willem D'Haeseleer Feb 28 '22 at 14:48
  • I guess the next step is read the paper or email to this guy. http://www.sai.msu.su/~megera/ But like the manual said ts_rank the order matter. and `select ts_rank(to_tsvector('simple', 'workday'), to_tsquery('simple', 'work'));` will return 0. – jian Feb 28 '22 at 15:20
  • Can you clarify in which scenario I would use `simple & workday` ? simple is the dictionary config, how can it be in the actual query ? I really appreciate your answer, but it is super confusing. – Willem D'Haeseleer Mar 05 '22 at 09:59