How can I access output embedding(output vector) in gensim word2vec?

Question

I want to use output embedding of word2vec such as in this paper (Improving document ranking with dual word embeddings).

I know input vectors are in syn0, output vectors are in syn1 and syn1neg if negative sampling.

But when I calculated most_similar with output vector, I got same result in some ranges because of removing syn1 or syn1neg.

Here is what I got.

IN[1]: model = Word2Vec.load('test_model.model')

IN[2]: model.most_similar([model.syn1neg[0]])

OUT[2]: [('of', -0.04402521997690201),
('has', -0.16387106478214264),
('in', -0.16650712490081787),
('is', -0.18117375671863556),
('by', -0.2527652978897095),
('was', -0.254993200302124),
('from', -0.2659570872783661),
('the', -0.26878535747528076),
('on', -0.27521973848342896),
('his', -0.2930959463119507)]

but another syn1neg numpy vector is already similar output.

IN[3]: model.most_similar([model.syn1neg[50]])

OUT[3]: [('of', -0.07884830236434937),
('has', -0.16942456364631653),
('the', -0.1771494299173355),
('his', -0.2043554037809372),
('is', -0.23265135288238525),
('in', -0.24725285172462463),
('by', -0.27772971987724304),
('was', -0.2979024648666382),
('time', -0.3547973036766052),
('he', -0.36455872654914856)]

I want to get output numpy arrays(negative or not) with preserved during training.

Let me know how can I access pure syn1 or syn1neg, or code, or some word2vec module can get output embedding.

gojomo · Accepted Answer · 2017-03-06T21:58:03.867

With negative-sampling, syn1neg weights are per-word, and in the same order as syn0.

The mere fact that your two examples give similar results doesn't necessarily indicate anything is wrong. The words are by default sorted by frequency, so the early words (including those in position 0 and 50) are very-frequent words with very-generic cooccurrence-based meanings (that may all be close to each other).

Pick a medium-frequency word with a more distinct meaning, and you may get more meaningful results (if your corpus/settings/needs are sufficiently like those of the 'dual word embeddings' paper). For example, you might want to compare:

model.most_similar('cousin')

...with...

model.most_similar(positive=[model.syn1neg[model.vocab['cousin'].index])

However, in all cases the existing most_similar() method only looks for similar-vectors in syn0 – the 'IN' vectors of the paper's terminology. So I believe the above code would only really be computing what the paper might call 'OUT-IN' similarity: a list of which IN vectors are most similar to a given OUT vector. They actually seem to tout the reverse, 'IN-OUT' similarity, as something useful. (That'd be the OUT vectors most similar to a given IN vector.)

The latest versions of gensim introduce a KeyedVectors class for representing a set of word-vectors, keyed by string, separate from the specific Word2Vec model or other training method. You could potentially create an extra KeyedVectors instance that replaces the usual syn0 with syn1neg, to get lists of OUT vectors similar to a target vector (and thus calculate top-n 'IN-OUT' similarities or even 'OUT-OUT' similarities).

For example, this might work (I haven't tested it):

outv = KeyedVectors()
outv.vocab = model.wv.vocab  # same
outv.index2word = model.wv.index2word  # same
outv.syn0 = model.syn1neg  # different
inout_similars = outv.most_similar(positive=[model['cousin']])

syn1 only exists when using hierarchical-sampling, and it's less clear what an "output embedding" for an individual word would be there. (There are multiple output nodes corresponding to predicting any one word, and they all need to be closer to their proper respective 0/1 values to predict a single word. So unlike with `syn1neg, there's no one place to read a vector that means a single word's output. You might have to calculate/approximate some set of hidden->output weights that would drive those multiple output nodes to the right values.)

Your code works wonderfully!! Thank you very much for giving me the answer. — Suin SEO, Mar 06 '17 at 00:37

How can I access output embedding(output vector) in gensim word2vec?

1 Answers1

Linked