
Assume that I have two documents with the following content.

{
  "title": "Windsor Farmhouse Wood Writing Desk Light Brown - Martin Furniture Furniture"
}

{
  "title": "Benjara 34 in. Rectangular Light Brown/White 1 Drawer Computer Desk, Light Brown & White"
}

The definition of the field is as follows.

field title type string {
  indexing: summary | index | attribute
  index: enable-bm25
}

In Vespa 8, how can I match only the first document, and not the second, when searching for the phrase "desk light"? In other words, I want to match only documents containing ... desk light ..., and not documents containing ... desk, light ....

I tried the following query, but in Vespa 8 it seems to behave like a weakAnd operation and matches both documents. It also matches documents that contain only ... desk ..., which is expected from weakAnd but is not what I want.

_desk_light=desk light
yql=select id, title, summaryfeatures from sources * where ([{"defaultIndex": "title"}](userInput(@_desk_light)));

I also tried adding the grammar: phrase annotation to userInput, but both documents still match.

_desk_light=desk light
yql=select id, title, summaryfeatures from sources * where ([{"defaultIndex": "title", "grammar": "phrase"}](userInput(@_desk_light)));

I would really appreciate any advice. Thanks!

user1802604

1 Answer


Using grammar: phrase is the right solution if you only want to match the exact phrase "desk light". In this case, however, you will still match both documents, because both contain that phrase: commas are ignored during tokenization, so "Desk, Light" is indexed as the adjacent tokens desk and light.
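
To illustrate why both titles match, here is a minimal Python sketch. It is not Vespa's actual linguistics pipeline; `tokenize` and `phrase_match` are hypothetical helpers that only assume what is stated above, namely that punctuation is dropped and that a phrase matches when its tokens are adjacent.

```python
import re

def tokenize(text):
    # Simplified stand-in for index-time tokenization (an assumption, not
    # Vespa's real tokenizer): lowercase and keep runs of letters/digits.
    # Punctuation such as commas leaves no trace in the token stream.
    return re.findall(r"[a-z0-9]+", text.lower())

def phrase_match(tokens, phrase):
    # True if the phrase tokens occur consecutively in the token list.
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

titles = [
    "Windsor Farmhouse Wood Writing Desk Light Brown - Martin Furniture Furniture",
    "Benjara 34 in. Rectangular Light Brown/White 1 Drawer Computer Desk, Light Brown & White",
]

query = tokenize("desk light")
results = [phrase_match(tokenize(t), query) for t in titles]
print(results)  # [True, True]
```

Once the comma is gone, "Desk, Light" and "Desk Light" produce the same token sequence, so a phrase query cannot tell them apart.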

Jon
  • Thank you for the quick response! It sounds reasonable that commas are ignored in token matching, but I'm still wondering whether there is any way to avoid matching the `...desk, light...` document? – user1802604 Oct 05 '22 at 16:01
  • There are ways, but none are easy without writing code or custom linguistic processing. You could use a tokenizer that preserves punctuation, like the tokenizers used for language models, but then you get all kinds of other issues as well. To me, this looks more like a normalization task during document processing. I believe the reason you don't want to match across the ',' is that the comma is part of a format used in the commerce title. – Jo Kristian Bergum Oct 05 '22 at 20:30
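
One possible reading of the normalization idea, sketched in Python under loud assumptions: before feeding, each comma in the title is replaced with a sentinel token so that a phrase can no longer span it. The sentinel `xsepx`, the helper names, and the simplified tokenizer are all hypothetical; this is not a Vespa API and may not be what the commenter had in mind.

```python
import re

SEPARATOR = "xsepx"  # hypothetical sentinel; pick a token that never occurs in real titles

def normalize_title(title):
    # Replace each comma (and surrounding whitespace) with the sentinel
    # so the words on either side are no longer adjacent tokens.
    return re.sub(r"\s*,\s*", f" {SEPARATOR} ", title)

def tokenize(text):
    # Simplified stand-in for index-time tokenization (an assumption).
    return re.findall(r"[a-z0-9]+", text.lower())

def phrase_match(tokens, phrase):
    # True if the phrase tokens occur consecutively in the token list.
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

titles = [
    "Windsor Farmhouse Wood Writing Desk Light Brown - Martin Furniture Furniture",
    "Benjara 34 in. Rectangular Light Brown/White 1 Drawer Computer Desk, Light Brown & White",
]

query = tokenize("desk light")
results = [phrase_match(tokenize(normalize_title(t)), query) for t in titles]
print(results)  # [True, False]
```

In a real setup the normalized text would probably go into a separate field used only for matching, so that the original title can still be returned in the summary unchanged.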