18

In the word2vec model, there are two linear transforms: one takes a word from vocab space to a hidden layer (the "in" vector), and the other maps back to vocab space (the "out" vector). Usually this out vector is discarded after training. I'm wondering if there's an easy way of accessing the out vector in gensim (Python)? Equivalently, how can I access the out matrix?

Motivation: I would like to implement the ideas presented in this recent paper: A Dual Embedding Space Model for Document Ranking

Here are more details. From the reference above we have the following word2vec model:

[Figure: diagram of the word2vec model with input, hidden, and output layers]

Here, the input layer is of size $V$ (the vocabulary size), the hidden layer is of size $d$, and the output layer is of size $V$. The two matrices are $W_{IN}$ and $W_{OUT}$. Usually, the word2vec model keeps only the $W_{IN}$ matrix. This is what is returned when, after training a word2vec model in gensim, you get results like:

model['potato']=[-0.2,0.5,2,...]
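For concreteness, here is a toy sketch of the two lookups in plain Python. The numbers, the three-word vocabulary, and the helper names (`in_vector`, `out_vector`, `score`) are made up for illustration and are not gensim's API. It also assumes both matrices are stored with one row per word.

```python
# Toy illustration only: hand-written numbers, not trained and not gensim's API.
vocab = ["potato", "tomato", "car"]            # V = 3
index = {w: i for i, w in enumerate(vocab)}

# W_IN: one d-dimensional row per word (the "in" vectors), here d = 2.
W_IN = [[0.5, 1.0],
        [0.4, 0.9],
        [-1.0, 0.2]]
# W_OUT: assumed here to also hold one row per word (the "out" vectors).
W_OUT = [[0.3, 0.8],
         [0.6, 1.1],
         [-0.7, 0.1]]

def in_vector(word):
    """Row of W_IN for `word`; this is what model['potato'] corresponds to."""
    return W_IN[index[word]]

def out_vector(word):
    """Row of W_OUT for `word`; this is the vector that is usually discarded."""
    return W_OUT[index[word]]

def score(center, context):
    """Unnormalised score of `context` given `center`: in(center) . out(context)."""
    return sum(a * b for a, b in zip(in_vector(center), out_vector(context)))

print(in_vector("potato"), out_vector("potato"), score("potato", "tomato"))
```

The point of the sketch is only that the same word has two different vectors, one per matrix.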

How can I access or retain $W_{OUT}$? Training is likely quite computationally expensive, so I'm really hoping for some built-in method in gensim to do this; I'm afraid that if I code it from scratch, it would not perform well.

Alex R.

4 Answers

10

This might not be a proper answer (I can't comment yet), but since no one has pointed it out: take a look here. The creator seems to answer a similar question there. That's also the place where you have a better chance of getting a valid answer.

Digging around in the word2vec source code linked from his reply, you could change where syn1 is deleted to suit your needs. Just remember to delete it once you're done, since it is a memory hog.

themistoklik
  • 3
    Thanks! This looks like what I'm looking for. To paraphrase the answer, the input/out embeddings are: Input: model.syn0, Output: model.syn1, model.syn1neg – Alex R. Nov 13 '16 at 00:44
  • Hey @themistoklik, I am having the same problem, but I am not able to access syn1 in the newer version. If anyone can guide me, it would be really helpful. I am doing model.syn1[model.wv['word']]. The error I am getting is "Word2Vec object has no attribute 'syn1'". – Malvi Patel Feb 16 '22 at 06:01
2

To get the syn1 entries for any word, this might work:

model.syn1[model.wv.vocab['potato'].point]

where model is your trained word2vec model.
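The attribute names have moved between gensim releases, so here is a hedged sketch of a lookup that tries more than one layout. It assumes, to the best of my understanding, that gensim 4.x exposes `model.syn1neg` and `model.wv.key_to_index`, while gensim 3.x uses `model.trainables.syn1neg` and `model.wv.vocab[word].index`; please check this against your installed version. It also assumes the model was trained with negative sampling: as far as I know, `syn1` is only allocated for hierarchical softmax, and its rows correspond to tree nodes (hence `.point` above) rather than to words. The helper names `word_index` and `output_vector` are hypothetical.

```python
from types import SimpleNamespace

def word_index(wv, word):
    """Row index of `word`, trying the newer layout first, then the older one."""
    key_to_index = getattr(wv, "key_to_index", None)   # assumed gensim 4.x layout
    if key_to_index is not None:
        return key_to_index[word]
    return wv.vocab[word].index                        # assumed gensim 3.x layout

def output_vector(model, word):
    """Return the syn1neg row for `word`; raises if the model has no syn1neg."""
    syn1neg = getattr(model, "syn1neg", None)          # look on the model itself
    if syn1neg is None:                                # then on model.trainables
        syn1neg = getattr(getattr(model, "trainables", None), "syn1neg", None)
    if syn1neg is None:
        raise AttributeError(
            "no syn1neg found; was the model trained with negative sampling?")
    return syn1neg[word_index(model.wv, word)]

# Stand-in objects that only mimic the attribute layout,
# so the sketch runs without gensim installed.
new_style = SimpleNamespace(
    syn1neg=[[0.1, 0.2], [0.3, 0.4]],
    wv=SimpleNamespace(key_to_index={"potato": 0, "tomato": 1}),
)
old_style = SimpleNamespace(
    trainables=SimpleNamespace(syn1neg=[[0.5, 0.6], [0.7, 0.8]]),
    wv=SimpleNamespace(vocab={"potato": SimpleNamespace(index=1)}),
)
print(output_vector(new_style, "tomato"))
print(output_vector(old_style, "potato"))
```

With a real trained model you would pass the model itself in place of the stand-ins.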

Kim Jay
0

The code below saves and loads a model. It uses pickle internally, optionally mmap'ing the model's large internal NumPy matrices into virtual memory directly from disk files, for inter-process memory sharing.

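import gensim

model.save('/tmp/mymodel.model')  # model is an already-trained Word2Vec instance
new_model = gensim.models.Word2Vec.load('/tmp/mymodel.model')  # same path as passed to save()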

Some background information: Gensim is a free Python library designed to process raw, unstructured digital texts ("plain text"). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation and Random Projections, discover the semantic structure of documents by examining statistical co-occurrence patterns of the words within a corpus of training documents.

A good blog post describing its use, with sample code to kick-start a project

Installation reference here

Hope this helps!!!

Syed
  • 4
    Thanks for your answer. However this has absolutely nothing to do with my question. Specifically in word2vec there are two distinct word vectors ("in" and "out"), and word2vec keeps only one of them ("in"). I'm asking about the other. – Alex R. Nov 12 '16 at 17:53
0

In the word2vec.py file, you need to change the following function. It currently writes out the "in" vectors, whereas you want the "out" vectors. The "in" vectors are stored in the syn0 attribute, and the "out" vectors are stored in the syn1neg attribute (when the model is trained with negative sampling).

def save_word2vec_format(self, fname, fvocab=None, binary=False):
  ....
  ....
  row = self.syn1neg[vocab.index]
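As an alternative to patching the library file, the same output can be produced from outside it. Below is a minimal sketch of a writer for the plain-text word2vec layout (a header line with the word count and dimension, then one word and its values per line). The function name `write_word2vec_text` is hypothetical, and it assumes you pass the words in the same order as the rows of the matrix (for example the model's index-ordered word list together with syn1neg).

```python
import io

def write_word2vec_text(words, rows, stream):
    """Write one vector per word in the plain-text word2vec layout."""
    rows = [list(r) for r in rows]            # accept lists or array rows
    if len(words) != len(rows):
        raise ValueError("need exactly one row per word")
    dim = len(rows[0]) if rows else 0
    stream.write(f"{len(words)} {dim}\n")     # header: vocab size, dimension
    for word, row in zip(words, rows):
        values = " ".join(repr(float(x)) for x in row)
        stream.write(f"{word} {values}\n")

# Toy usage with made-up numbers standing in for rows of syn1neg.
buf = io.StringIO()
write_word2vec_text(["potato", "tomato"], [[0.5, 1.0], [0.25, -2.0]], buf)
print(buf.getvalue())
```

To write to disk, pass an open text file instead of the `io.StringIO` buffer.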
Darren Cook
Trideep Rath