
I am working on the problem of finding the nearest document in a list of documents. Each document is a word or a very short phrase (e.g. "jeans", "machine tool", or "biological tomatoes"). By nearest I mean semantically closest.

I have tried word2vec embeddings (from Mikolov's article), but the closest words are more contextually linked than semantically linked ("jeans" is linked to "shoes" rather than to "trousers", which I would have expected).

I have tried BERT encodings (https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/#32-understanding-the-output) using the last layers, but they face the same issue.

I have tried Elasticsearch, but it does not find semantic similarities. (The task needs to be solved in French, but solving it in English may be a good first step.)

— Toto

3 Answers


Note that different sets of word-vectors may vary in how well they capture your desired 'semantic' similarities. (In particular, training with a shorter window may emphasize similarity among words that are drop-in replacements for each other, as opposed to words merely used in similar domains, which larger window values may emphasize. See this answer for more details.)
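
For instance, the window can be set at training time in the Python gensim library. A minimal sketch, where the toy corpus stands in for your real training data:

```python
from gensim.models import Word2Vec

corpus = [
    ["he", "wore", "blue", "jeans", "and", "shoes"],
    ["she", "bought", "new", "trousers", "yesterday"],
]  # toy data; real training needs a much larger corpus

model = Word2Vec(
    corpus,
    vector_size=100,  # dimensionality of the word vectors
    window=2,         # shorter window: favors drop-in-replacement similarity
    min_count=1,      # keep rare words only because the toy corpus is tiny
    epochs=10,
)
print(model.wv.most_similar("jeans", topn=3))
```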

You may also want to take a look at "Word Mover's Distance" as a way to compare short texts that contain various mixes of somewhat-similar words. (It's fairly expensive, but should be practical on your short texts. It's available in the Python gensim library as wmdistance() on KeyedVectors instances.)
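
A minimal sketch of wmdistance(), assuming pretrained vectors fetched via gensim's downloader (the GloVe model name below is just one of the bundled options; recent gensim versions also need the POT package for WMD):

```python
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-100")  # any KeyedVectors instance works

doc1 = "machine tool".split()
doc2 = "milling machine".split()
print(kv.wmdistance(doc1, doc2))  # lower distance = more similar texts
```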

If you have training data where your specific multi-word phrases are used, in many natural-language-like subtly-varied contexts, you could consider combining all such phrases-of-interest into single tokens (like machine_tool or biological_tomatoes), and training your own domain-specific word-vectors.
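
A sketch of the token-combining idea; the phrase set and corpus below are placeholders for your own data (gensim's Phrases class can also detect such bigrams automatically):

```python
from gensim.models import Word2Vec

# Multi-word phrases of interest, to be merged into single tokens.
PHRASES = {("machine", "tool"), ("biological", "tomatoes")}

def merge_phrases(tokens):
    """Replace known bigrams with underscore-joined single tokens."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            out.append("_".join(pair))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

raw = [["the", "machine", "tool", "cuts", "steel"]]  # toy corpus
model = Word2Vec([merge_phrases(s) for s in raw], vector_size=100, min_count=1)
print("machine_tool" in model.wv)  # the phrase is now a single vocabulary item
```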

— gojomo

For computing similarity between short texts of two or three words, you can use word2vec and take the average vector of the text. For example, to represent the text "machine tool" as a single vector with word2vec, get the vector of "machine" and the vector of "tool", then combine them into one vector by averaging: add the two vectors and divide by 2 (the number of words). This gives you a vector representation for a text of more than one word (a sketch follows). You could also use something like doc2vec, which is built on top of word2vec and is designed to produce a vector for a sentence or paragraph.
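
A minimal sketch of the averaging approach, assuming pretrained word vectors (the GloVe model name below is just one downloadable option):

```python
import numpy as np
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-100")  # pretrained word vectors

def avg_vector(text):
    """Average the vectors of the in-vocabulary tokens of a short text."""
    vecs = [kv[w] for w in text.lower().split() if w in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(avg_vector("machine tool"), avg_vector("milling machine")))
```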

— Eyad Shokry

You might try a document embedding model that is built on top of word2vec, such as doc2vec (see the sketch below).
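
For example, gensim's Doc2Vec can be trained directly on your documents; the toy data below is illustrative, and real use needs far more text:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ["jeans", "machine tool", "biological tomatoes"]
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)
query = model.infer_vector("cotton trousers".split())
print(model.dv.most_similar([query], topn=3))  # nearest documents by tag
```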

However, note that word and document embeddings do not always capture the "desired similarity": they just learn a language model on your corpus, and they are heavily influenced by text size and word frequency.

How big is your corpus? If you need it just to perform some classification, it might be better to use vectors trained on a large dataset such as the Google News corpus.
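
For instance, rather than training them yourself, the pretrained Google News vectors can be loaded through gensim's downloader (roughly 1.6 GB, assuming that fits your setup):

```python
import gensim.downloader as api

kv = api.load("word2vec-google-news-300")  # pretrained on ~100B words
print(kv.most_similar("jeans", topn=5))
```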