
I'm having trouble dealing with multivalue query items and fields in terms of element similarity. For example, suppose we have an array of strings like this:

field colors type array<string>
# That might have several items like: "blue", "black and purple", "green", "yellow", etc

And I wish to query with a list of items:

"blue" (weight 0.5), "black" (weight 1.0)

Is there a way to perform a weighted listwise similarity that might look like: weight * elementSimilarity(blue on colors) + weight * elementSimilarity(black on colors)?

I've tried multiple features, including nativeRank, but I get inconsistent results depending on the length of the query array as well as the field array. I also want to be able to deal with misspellings ("blu" should have a very high match with "blue"), which is why I prefer elementSimilarity. I think I've tried most of the rank features in Vespa, but I haven't found a better way to deal with this use case.

Any guidance would be much appreciated! Thanks!

Edit: Just to elaborate, perhaps the biggest restriction to me in Vespa is how arrays are handled in the query. I would very much like to do something like:

expression {
    foreach(terms,N,query(colors,N).weight*elementSimilarity(query(colors,N)),true,sum)
}
kaega

1 Answer


There are many ways to accomplish this, but which is best depends on whether you need free-text style matching (linguistic processing of the string, including tokenization and stemming). It also depends on whether this is just a ranking signal for documents that are already retrieved, or whether it is also used to retrieve documents.

If you don't need free-text style matching and can instead use exact matching without linguistic processing (e.g. using a fixed vocabulary), and this color ranking is just another ranking signal, you should consider using tensor ranking. Tensors are useful for ranking documents that are retrieved by the query operators; you cannot retrieve using a tensor (except for dense first-order tensors, using approximate nearest neighbor search). See the tensor guide: https://docs.vespa.ai/en/tensor-user-guide.html.
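To make the tensor option concrete, here is a minimal Python sketch of the arithmetic a sparse tensor dot product performs. It is not Vespa code: the function name `sparse_dot` and the plain-dict representation are assumptions made for illustration only. It uses the float weights from the question and shows that, with exact label matching, "black" does not match the element "black and purple".

```python
from typing import Dict


def sparse_dot(query: Dict[str, float], doc: Dict[str, float]) -> float:
    """Sum query_weight * doc_weight over labels present in both mappings.

    Hypothetical helper: models a sparse (mapped) tensor product followed
    by a sum, using exact label equality and no linguistic processing.
    """
    return sum(weight * doc[label] for label, weight in query.items() if label in doc)


# Float weights are fine here, unlike integer term weights.
query_colors = {"blue": 0.5, "black": 1.0}
doc_colors = {"blue": 1.0, "black and purple": 1.0, "green": 1.0, "yellow": 1.0}

# Only "blue" is an exact label match; "black" != "black and purple".
score = sparse_dot(query_colors, doc_colors)
print(score)  # 0.5
```

If partial matches such as "black" against "black and purple" matter, this is the point where the free-text approach becomes the better fit.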

If you need free-text style matching, there are also several approaches. In the example below I assume that you want text style matching and that a query term 'purple' should match the document with 'black and purple'. See the matching documentation: https://docs.vespa.ai/en/reference/schema-reference.html#match

If you define the field colors like this

field colors type weightedset<string> {
    indexing: summary | index
    match: text  # This is the default matching for string fields with 'index'
}

And feed a doc

"colors": {
   "blue":1,
   "black and purple":1, 
   "green": 1,
   "yellow": 1
}

You can retrieve and rank using the following query

{
"yql": "select * from sources * where colors contains ([{\"weight\":1}]\"purple\") or colors contains ([{\"weight\":2}]\"yellow\");",
"ranking.profile": "color-ranking"
}

See query language reference on term weights

There are multiple ways you can rank the retrieved documents, but the example below assumes you use color ranking as the only ranking signal.

rank-profile color-ranking {
    function colorMatch() {
        expression: nativeDotProduct(colors)
    }
    first-phase {
        expression: colorMatch()
    }
}

Here we use the nativeDotProduct rank feature, which in our example will return 3 (2*1 + 1*1): each matched query term weight multiplied by the weight of the element it matched. The term weight and document weight can only be integers; tensors allow floats.
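The arithmetic behind that score can be sketched in a few lines of Python. This is a simplified model, not Vespa's implementation: the function name `native_dot_product_sketch` is hypothetical, and lowercasing plus whitespace splitting stands in for real tokenization (no stemming or other linguistics).

```python
from typing import Dict


def native_dot_product_sketch(query_terms: Dict[str, int], field: Dict[str, int]) -> int:
    """Sum query_weight * element_weight for every element a term matches.

    Assumption: a term matches an element when it equals one of the
    element's whitespace-separated, lowercased tokens.
    """
    score = 0
    for term, query_weight in query_terms.items():
        for element, element_weight in field.items():
            if term in element.lower().split():
                score += query_weight * element_weight
    return score


# The document and query term weights from the example above.
colors = {"blue": 1, "black and purple": 1, "green": 1, "yellow": 1}
query = {"purple": 1, "yellow": 2}

# "purple" matches "black and purple" (1*1), "yellow" matches "yellow" (2*1).
print(native_dot_product_sketch(query, colors))  # 3
```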

The elementSimilarity rank feature is also a candidate. It allows more flexibility: you can override whether it uses max or sum, and how it combines the element weight and the query term weight.

If this is only a ranking signal, you can also use the rank query operator:

{
"yql": "select * from sources * where rank(foo contains \"bar\", colors contains ([{\"weight\":1}]\"purple\") or colors contains ([{\"weight\":2}]\"yellow\"));",
"ranking.profile": "color-ranking"
}

In the above query we retrieve documents where a field called 'foo' contains 'bar'. For those documents, the colors field is matched and ranking features are created (depending on which ones are used in the ranking profile).
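Because the YQL string sits inside a JSON body, every double quote in it has to be escaped, which is easy to get wrong by hand. A minimal sketch, assuming you build the request in Python: the helper names `weighted_contains` and `build_rank_query` are hypothetical, and the code does not escape quotes inside the terms themselves.

```python
import json
from typing import Dict


def weighted_contains(field: str, term: str, weight: int) -> str:
    """Render one weighted term in the annotation syntax used above."""
    return f'{field} contains ([{{"weight":{weight}}}]"{term}")'


def build_rank_query(filter_clause: str, field: str,
                     term_weights: Dict[str, int], profile: str) -> Dict[str, str]:
    """Build a request body whose first rank() argument decides recall.

    The weighted terms are OR-ed together as the second rank() argument,
    so they only contribute ranking features.
    """
    ranked = " or ".join(
        weighted_contains(field, term, weight) for term, weight in term_weights.items()
    )
    yql = f"select * from sources * where rank({filter_clause}, {ranked});"
    return {"yql": yql, "ranking.profile": profile}


body = build_rank_query('foo contains "bar"', "colors",
                        {"purple": 1, "yellow": 2}, "color-ranking")

# json.dumps escapes the inner double quotes when serializing the body.
print(json.dumps(body, indent=2))
```

Letting the JSON serializer do the escaping keeps the YQL readable in code and avoids unbalanced quotes in the request.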

Generally, the query expresses how to retrieve documents, and the ranking profile determines how you rank those retrieved. The rank query operator is a nice way to create matching (query-document interaction) ranking features without impacting recall.

There are also other, more efficient ways, including the wand query operator, if you want to retrieve efficiently using the dot product (inner product) between something in the query and something in the document. See https://docs.vespa.ai/en/using-wand-with-vespa.html

Jo Kristian Bergum
  • Thank you for the detailed response! The dot product is a great suggestion and I'll give that method a try. – kaega Mar 22 '21 at 21:18
  • Thanks. If you require text style matching with tokenized matching, I believe the above example will be a good alternative. I'd appreciate it if you could comment on whether this solves your use case. – Jo Kristian Bergum Mar 23 '21 at 09:35
  • The `nativeDotProduct` worked very well for my use case! There was the slight issue that the feature score would be non-normalized but nothing that couldn't be worked around by balancing other feature scores. – kaega Mar 24 '21 at 05:10
  • Great! You are correct that the score is the dot product, which is not normalized. But this is the case with many ranking features, including e.g. bm25. GBDT models are pretty good at handling features without any normalization. Would you mind accepting the answer @kaega? – Jo Kristian Bergum Mar 24 '21 at 09:16