
I want to implement a system that caches the most popular queries and, given a new query, tries to find a similar query in the cache and return the same result. Since I want to keep it as general as possible (queries can be short texts, images or even audio tracks), I'm going with the Approximate Nearest Neighbor (ANN) approach, which is based on representing the query in a vector space.

My question is: what is the most efficient way to represent a query as a vector (which will be used as input to the ANN search)?
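
To make the setup concrete, here is a minimal sketch of the flow I have in mind. Brute-force cosine search stands in for a real ANN index, and `embed` is exactly the placeholder this question is about:

```python
import numpy as np

class QueryCache:
    """Toy cache: stores (query vector, result) pairs and answers a new
    query from the closest cached one if it is similar enough.
    Brute-force cosine search stands in for a real ANN index."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.vectors = []   # cached query vectors
        self.results = []   # cached results, aligned with self.vectors

    def add(self, vec, result):
        self.vectors.append(np.asarray(vec, dtype=np.float32))
        self.results.append(result)

    def lookup(self, vec):
        """Return the result of the most similar cached query, or None."""
        if not self.vectors:
            return None
        vec = np.asarray(vec, dtype=np.float32)
        mat = np.vstack(self.vectors)
        sims = mat @ vec / (np.linalg.norm(mat, axis=1) * np.linalg.norm(vec) + 1e-12)
        best = int(np.argmax(sims))
        return self.results[best] if sims[best] >= self.threshold else None

def answer(query, cache, embed, compute_result):
    """embed() maps a query (text/image/audio) to a vector -- the open question."""
    vec = embed(query)
    cached = cache.lookup(vec)
    if cached is not None:
        return cached               # cache hit: reuse the stored result
    result = compute_result(query)  # cache miss: compute and remember it
    cache.add(vec, result)
    return result
```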

justHelloWorld
  • What are the features you are considering? For example, if you consider only texts, you can use one-hot encoding. For more advanced experiments, you can use word embeddings. I believe you should design your features for texts, images and audio tracks separately. – Wasi Ahmad Nov 13 '16 at 17:57
  • @WasiAhmad Word embeddings seem good. My question is: how many dimensions do these vectors have? Can they be used for encoding text queries? What is the distance metric used for computing similarity between vectors (queries)? Most ANN techniques work for metric spaces of **at most** a few hundred dimensions, and if these vectors are bigger, different solutions are needed. – justHelloWorld Nov 13 '16 at 18:02
  • One-hot encoding vectors have a size equal to the vocabulary size. Word embeddings are much smaller than that. For computing text similarity, you have several options; a simple and widely accepted choice is cosine similarity. – Wasi Ahmad Nov 14 '16 at 00:17
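
A minimal sketch of the word-embedding route suggested in the comments, which could serve as the `embed` placeholder above: the query vector is the average of its word vectors (random stand-ins here instead of a pretrained model such as word2vec or GloVe), and similarity is cosine. One-hot encoding would instead give vectors of vocabulary size, typically far more than the few hundred dimensions ANN methods handle comfortably.

```python
import numpy as np

# Toy vocabulary with random 100-dimensional word vectors; in practice these
# would come from a pretrained embedding model (word2vec, GloVe, ...).
rng = np.random.default_rng(0)
vocab = {w: rng.standard_normal(100) for w in
         "cheap flights to london hotels in paris".split()}

def embed_text(query, dim=100):
    """Represent a text query as the average of its word vectors."""
    vecs = [vocab[w] for w in query.lower().split() if w in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

q1 = embed_text("cheap flights to london")
q2 = embed_text("flights to london")
print(q1.shape)        # (100,) -- at most a few hundred dimensions
print(cosine(q1, q2))  # high similarity: the two queries share most words
```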

0 Answers