
Consider a search system where a user submits a query and retrieves products ranked by some algorithm. Assume that these products are ordered according to their quality, i.e. p_0, p_1, p_2, and so on.

I would like to generate vector embeddings that mimic this ranking algorithm: the closest product vector to a query vector should ideally be p_0, the next closest should be p_1, and so on.

I have tried building word2vec embeddings for products by feeding products that appeared in the same search session to the model as sentences. Then I calculated a weighted average of the product vectors to obtain a query vector, weighting it toward the top result. Although the closest result is usually the best result for a given query, the subsequent neighbours include products that would never appear as a top result.
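Roughly, the setup looks like this (a minimal sketch with made-up session data and weights, using gensim):

```python
# Train word2vec on search sessions, treating each session's products as a "sentence".
from gensim.models import Word2Vec
import numpy as np

# Hypothetical sessions: lists of product IDs that co-occurred in one search session.
sessions = [
    ["p_0", "p_3", "p_7"],
    ["p_0", "p_1", "p_2"],
    ["p_1", "p_4", "p_9"],
]

model = Word2Vec(sentences=sessions, vector_size=64, window=5, min_count=1, epochs=20)

# Query vector as a weighted average of the products the ranker returns for the query,
# with more weight on the higher-ranked products (weights here are illustrative).
ranked_products = ["p_0", "p_1", "p_2"]
weights = np.array([0.6, 0.3, 0.1])
query_vec = np.average([model.wv[p] for p in ranked_products], axis=0, weights=weights)

# Nearest product vectors to the query vector, by cosine similarity.
print(model.wv.similar_by_vector(query_vec, topn=5))
```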

Is there a trick by which word2vec can learn the ranking algorithm, or are there other techniques I could try? I have looked into multidimensional scaling with non-metric distances, but it did not seem scalable for hundreds of thousands of products.

1 Answer


There's no one trick – just iteratively improving your representations, & training set, & ranking methods to better meet your goals.

Word2vec-based representations can often help, but are still fairly simple & centered on individual words – whose senses may vary based on context & position in ways that a simple weighted-average-of-tokens fails to capture.

You may want to represent 'products' by more than just a string-of-word-tokens – to include other properties, as well. These could be scalar values like prices or various other kinds of ratings/properties, or extra synthetic labels, such as the result of other salient groupings (whether hand-edited or learned).
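For instance, one simple way to fold such properties into a product representation is to concatenate the text-derived vector with normalized scalar features (the feature names and scales below are placeholders, not anything specific to your data):

```python
import numpy as np

def product_vector(word_vecs, price, rating, price_scale=1000.0, rating_scale=5.0):
    """Combine averaged word vectors with normalized scalar product features.

    word_vecs: list of word vectors for the product's title/description tokens.
    price, rating: illustrative scalar properties; the scales are arbitrary placeholders.
    """
    text_part = np.mean(word_vecs, axis=0)
    extra_part = np.array([price / price_scale, rating / rating_scale])
    return np.concatenate([text_part, extra_part])
```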

And even if just working with natural-language product descriptions – like product names, or descriptions, or reviews – there are other more-sophisticated text-representations that can be trained or used – such as sentence/document embeddings using deeper-networks than plain word2vec.
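For example, pretrained sentence encoders can embed whole product descriptions and queries into the same space (the model name below is just one common public checkpoint, and the texts are made up):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

product_texts = ["red running shoes, size 42", "wireless noise-cancelling headphones"]
query_text = "running shoes"

# Normalized embeddings so cosine similarity reduces to a dot product.
product_embs = model.encode(product_texts, normalize_embeddings=True)
query_emb = model.encode(query_text, normalize_embeddings=True)

scores = product_embs @ query_emb
print(scores)  # higher score = more similar to the query
```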

Most generically, if you have a bunch of quantitative representations of candidate results, and a query, and want to use some initial examples of "good" results to bootstrap more generalizable rules for scoring top results, you are attempting a "learning-to-rank" process:

https://en.wikipedia.org/wiki/Learning_to_rank
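In embedding form, the same idea can be sketched as training query and product vectors with a pairwise margin loss, so that for each query a higher-ranked product ends up closer than a lower-ranked one (a PyTorch sketch with assumed sizes; in practice the pairs would come from whatever ranking signal you trust):

```python
import torch
import torch.nn as nn

n_queries, n_products, dim = 100, 1000, 64
query_emb = nn.Embedding(n_queries, dim)
product_emb = nn.Embedding(n_products, dim)
opt = torch.optim.Adam(list(query_emb.parameters()) + list(product_emb.parameters()), lr=1e-3)
loss_fn = nn.TripletMarginLoss(margin=0.2)

# Each training example: (query_id, higher_ranked_product_id, lower_ranked_product_id).
# These IDs are placeholders; real pairs would be derived from the known rankings.
pairs = torch.tensor([[0, 3, 7], [0, 3, 9], [1, 2, 5]])

for _ in range(100):
    q = query_emb(pairs[:, 0])
    better = product_emb(pairs[:, 1])
    worse = product_emb(pairs[:, 2])
    loss = loss_fn(q, better, worse)  # pull "better" closer to q than "worse"
    opt.zero_grad()
    loss.backward()
    opt.step()
```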

To suggest more specific steps would require a more specific description of inputs/outputs/goals, & what's been tried, and how what's been tried has failed.

For example, are your queries always just textual product names? In such a case, maybe plain keyword search is the central technology required – with things like word-vector-modelling just a tweak for handling some tough cases, like expanding the results, or adding more contrast to the rankings, when results are too few or too many.
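A sketch of that kind of tweak, using the word2vec model only to expand a keyword query that returned too few results (the thresholds and the `wv` KeyedVectors argument are placeholders):

```python
def expand_query(query_terms, wv, current_results, min_results=10, topn=3, min_sim=0.6):
    """Expand a keyword query with similar terms from a gensim KeyedVectors model,
    but only when the plain keyword search returned too few results."""
    if len(current_results) >= min_results:
        return query_terms
    expanded = list(query_terms)
    for term in query_terms:
        if term in wv:
            expanded += [w for w, sim in wv.most_similar(term, topn=topn) if sim >= min_sim]
    return expanded
```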

Or, can you detect key gaps in the modeling related to exactly those cases where "results include some results that would [ideally] never appear as a top result"? If certain things like rare (poorly-modeled) words, or important qualities not yet captured in the model, seem to be to blame for such cases, that will guide the potential set of corrective changes.

gojomo
  • Thanks a lot for your answer. Sorry if I am not clear, I already have a learning-to-rank model. I can currently rank the products for a given query. I want my embeddings to approximate this LTR model. So, for example, when I look at the nearest neighbor of the query vector, I would like to see the top-ranked product. Then, the second-nearest should be the second ranked product and so on. – user1860037 Jan 31 '22 at 19:39
  • @user1860037 Did you reach any conclusion on how to approximate this LTR model based on the word embeddings? – Ganesh prasad Jun 12 '23 at 16:23