24

Using the gensim.models.Word2Vec library, you can load a model and ask for the list of words most similar to a given word:

model = gensim.models.Word2Vec.load_word2vec_format(model_file, binary=True)
model.most_similar(positive=[WORD], topn=N)

I wonder if there is a way to give the system a model and a vector as input, and have it return the top similar words (the words whose vectors are closest to the given vector). Something like:

model.most_similar(positive=[VECTOR], topn=N)

I need this functionality for a bilingual setting, in which I have two models (English and German), as well as some English words for which I need to find the most similar German candidates. What I want to do is get the vector of each English word from the English model:

model_EN = gensim.models.Word2Vec.load_word2vec_format(model_file_EN, binary=True)
vector_w_en = model_EN[WORD_EN]

and then query the German model with these vectors:

model_DE = gensim.models.Word2Vec.load_word2vec_format(model_file_DE, binary=True)
model_DE.most_similar(positive=[vector_w_en], topn=N)

I have implemented this in C using the original distance function in the word2vec package, but now I need it in Python so that I can integrate it with my other scripts.

Do you know whether there is already a method in the gensim.models.Word2Vec library (or a similar library) that does this, or do I need to implement it myself?

  • Does `most_similar(..)` return the scores to you as well? I am picturing a custom function you write that calls `most_similar` for every word in the vector, adds the results from all of them to the same list, then sorts on score and returns. – nbryans Jun 14 '16 at 17:35
  • Thanks nbryans. If there is no existing method that does this, I will have to implement it as follows: for each word in the vocabulary, get its corresponding vector from the model, compute the cosine similarity between the input vector and that vector, and then return the top most similar words (roughly the sketch shown after these comments). But I thought such a method might already exist; it seems it does not. – amin Jun 15 '16 at 06:27
  • Hi amin, have you implemented your method? If so, I am interested to know why you use cosine similarity instead of Euclidean distance. – N. F. Dec 12 '18 at 20:10
  • Hi @MiNdFrEaK, I implemented it myself. Actually it was not a big deal, but back then I was just wondering whether it had already been implemented so that I could use it off the shelf. Unfortunately, a very short time after that I stopped working on that topic. I need to look into my code to find my implementation; I will let you know once I find it. – amin Dec 20 '18 at 16:06
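The brute-force lookup described in the comments above would look roughly like the following sketch (an illustration only, assuming an old-style gensim model loaded with load_word2vec_format; the names most_similar_to_vector and query_vector are made up here):

import numpy as np

def most_similar_to_vector(model, query_vector, topn=10):
    """Return the topn (word, cosine similarity) pairs closest to query_vector."""
    query = query_vector / np.linalg.norm(query_vector)
    sims = []
    for word in model.vocab:                      # old-style gensim: model.vocab is a dict of words
        vec = model[word]                         # model[word] returns the raw vector
        sims.append((word, float(np.dot(query, vec / np.linalg.norm(vec)))))
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return sims[:topn]

# e.g. most_similar_to_vector(model_DE, vector_w_en, topn=10)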

2 Answers

30

The method similar_by_vector returns the top-N most similar words by vector:

similar_by_vector(vector, topn=10, restrict_vocab=None)
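For example, with the models from the question, a minimal usage sketch (the variable names are taken from the question) might look like:

vector_w_en = model_EN[WORD_EN]                           # vector of the English word
print(model_DE.similar_by_vector(vector_w_en, topn=10))   # list of (word, similarity) pairs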
0

I don't think what you're trying to achieve can ever give an accurate answer, simply because the two models are trained separately. Although both the English and the German model will have similar distances between their respective word vectors, there is no guarantee that the word vector for 'House' will point in the same direction as the word vector for 'Haus'.

In simple terms: if you trained both models with a vector size of 3, and 'House' has the vector [0.5, 0.2, 0.9], there is no guarantee that 'Haus' will have the vector [0.5, 0.2, 0.9], or even something close to it.

In order to solve this, you could first translate the English word to German and then use the vector for that word to look for similar words in the German model.
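As a rough sketch of that suggestion (the en_to_de translation table here is hypothetical; any bilingual lexicon would do):

en_to_de = {"house": "Haus"}                                 # hypothetical word-level translation table
word_de = en_to_de["house"]                                  # translate first ...
print(model_DE.most_similar(positive=[word_de], topn=10))    # ... then search in the German model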

TL;DR: you can't just plug vectors from one language's model into another and expect accurate results.

  • Well, you are right if the embeddings are trained separately and live in two different spaces. What I was trying to do back then was to either train bilingual word embeddings jointly, or map the embeddings of the two languages into a shared space by means of a trainable mapping function and some vocabulary or parallel corpora (roughly the sketch below). – amin Dec 20 '18 at 16:09
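A minimal sketch of the kind of mapping mentioned in the comment above, in the spirit of a linear translation matrix learned from a small seed dictionary (the seed pairs below are purely illustrative; in practice you would use thousands of pairs):

import numpy as np

seed_pairs = [("house", "Haus"), ("car", "Auto"), ("water", "Wasser")]   # illustrative seed dictionary

X = np.array([model_EN[en] for en, de in seed_pairs])   # English vectors
Y = np.array([model_DE[de] for en, de in seed_pairs])   # German vectors

W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)          # least-squares fit of W so that X @ W ≈ Y

mapped = model_EN["house"].dot(W)                       # map an English vector into the German space
print(model_DE.similar_by_vector(mapped, topn=10))      # then search the German model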