
I want Solr to return a document only if all of the document's tokens are present in the query. For example, if a document has the text "foo bar", it should be retrieved only by queries such as "foo bar", "foo bar baz", or "bar foo baz", but not by "foo baz" or "baz bar".

This is similar to the Minimum Should Match (mm) parameter of the DisMax parser, but mm counts matches against the query's clauses, not the document's tokens.

I only need to retrieve a specific number of documents, so it's also okay if there's a way to at least boost such documents to the top and take the first N documents.
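To make the intended semantics concrete, here is a minimal Python sketch of the matching rule described above. It runs outside Solr, and the function name `document_matches` and the whitespace/lowercase tokenization are assumptions for illustration only; a real analyzer chain may tokenize differently.

```python
def document_matches(document_text: str, query_text: str) -> bool:
    """Return True when every token of the document also appears in the query."""
    # Assumption: tokens are lowercased, whitespace-separated words.
    doc_tokens = set(document_text.lower().split())
    query_tokens = set(query_text.lower().split())
    # The document matches when its token set is a subset of the query's.
    return doc_tokens <= query_tokens


# The examples from the question, checked against the document "foo bar".
for query in ["foo bar", "foo bar baz", "bar foo baz", "foo baz", "baz bar"]:
    print(query, "->", document_matches("foo bar", query))
```

The first three queries satisfy the rule and the last two do not, which is the behaviour I'm trying to get from Solr.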

Kavaliro
  • Depending on the length of the field, indexing the field value as a single token and then using the shinglefilter as a query filter is a common strategy. You might want to sort the tokens before sending them to Solr, since you only want to check for them being present, not being ordered. – MatsLindh May 11 '23 at 21:33
  • I think an exact match should serve the purpose in your case – Hafiz Muhammad Shafiq May 12 '23 at 02:34
  • @MatsLindh I think this would still miss those cases where the query, after sorting, doesn't have "foo" and "bar" next to each other. If any other token falls between them, the query would fail to match the document. The easiest way to solve this would be to generate all possible ordered token combinations of the query, but I was hoping to avoid that. – Kavaliro May 12 '23 at 06:02
  • Yeah, I missed that limitation. What is the length of the query? If you're trying to do an inverse search (where you store the searches themselves and then send documents to find the matching searches), that's called percolating - there's a suggested version here: https://stackoverflow.com/questions/30473406/does-solr-support-percolation (i.e. create a memory index with that single doc, then run all stored queries against the doc to see if they match). This scales surprisingly well - there are also a few options linked in that question. Elasticsearch has it built in as a "percolate query". – MatsLindh May 12 '23 at 09:19
  • @MatsLindh Yeah, thanks. Percolating is exactly what I want to do. There are around 100K stored searches with a maximum of 3 tokens each, and incoming queries are 4 tokens on average. Do you think this can scale well with that number of stored searches? Would a query finish in under 400ms? – Kavaliro May 12 '23 at 15:52
  • If there are only four tokens on average, I'd think doing what we mentioned about just generating the different combinations and sending them as a single boolean query would work well. When you get to seven or more tokens in a query it'll be worse, but my initial guess is that you could just search for the complete query string, extract the matching queries and then run through that subset to determine matches. If my memory serves me right (... which might be a tall order), there's also the possibility of using `field:value=1` (instead of ^ to boost) to assign a score to the match instead – MatsLindh May 12 '23 at 17:44
  • You could then look at the top results and consider all with a score of 3 to match. I don't have an index to test against where I am at the moment, but might have the option later if you don't see any results with that strategy. What's the worst-case length of the query string (i.e. the longest token sequence)? – MatsLindh May 12 '23 at 17:45
  • @MatsLindh The longer the query I can support, the better. Generating all token orderings will result in `N!` combinations, so 4 is probably as high as it can get. This was my plan B if there were no other good solutions. The ranking idea could work. If I understand correctly, you can assign a score for each token match, right? Do you recall how to do this? – Kavaliro May 15 '23 at 10:30
  • My memory was slightly off, but `field:value^=1` should create a constant score for that term. See https://solr.apache.org/guide/solr/latest/query-guide/standard-query-parser.html#constant-score-with So making a query with `foo^=1 bar^=1 baz^=1` and `df=field` should give you the matching queries by looking at those with score 3, I'd think? – MatsLindh May 15 '23 at 11:18
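The two strategies from the comments above could be sketched roughly as follows. This is only an illustration that builds query strings and post-filters results; it does not call Solr. The field names `text`, `sorted_tokens_s` and `token_count` are hypothetical: the sketch assumes each stored search is indexed with its number of distinct tokens, and (for the fallback) with its sorted tokens joined into a single string value. Tokens containing query-syntax characters would need escaping, which is omitted here.

```python
from itertools import combinations


def tokenize(query_text: str) -> list[str]:
    # Assumption: lowercased, whitespace-separated, de-duplicated, sorted tokens.
    return sorted(set(query_text.lower().split()))


def constant_score_params(query_text: str, field: str = "text") -> dict:
    """Build request parameters where each matching token contributes a score of 1."""
    return {
        "q": " ".join(f"{token}^=1" for token in tokenize(query_text)),
        "df": field,
        "q.op": "OR",
        # `token_count` is a hypothetical stored field, not something Solr adds.
        "fl": "id,token_count,score",
    }


def full_matches(docs: list[dict]) -> list[dict]:
    """Keep only documents whose score equals their number of distinct tokens."""
    return [doc for doc in docs if round(doc["score"]) == doc["token_count"]]


def subset_query(query_text: str, field: str = "sorted_tokens_s", max_len: int = 3) -> str:
    """Fallback: OR together every sorted token subset of up to `max_len` tokens."""
    tokens = tokenize(query_text)
    phrases = [
        " ".join(combo)
        for size in range(1, min(max_len, len(tokens)) + 1)
        for combo in combinations(tokens, size)
    ]
    return f"{field}:(" + " OR ".join(f'"{phrase}"' for phrase in phrases) + ")"


print(constant_score_params("foo bar baz")["q"])
print(subset_query("foo bar baz"))
```

Comparing the score with a per-document token count, rather than a fixed 3, is what lets one- and two-token stored searches qualify as well. If the stored values are sorted at index time as assumed, only subsets need to be generated rather than every ordering, so a 4-token query with a 3-token cap expands to 4 + 6 + 4 = 14 candidates.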

1 Answer


Simply add double quotes to your query and it will search exact terms only.

e.g. "foo bar" will return "foo bar" only, and not "foo something" or "bar something".

Dimanshu Parihar
  • This doesn't solve the problem I'm asking about. I need to match the documents that have "foo" and "bar" alone as well. So I need to get the documents that have ALL their tokens matched. – Kavaliro May 15 '23 at 10:33