I have a Word2Vec model that was trained in Gensim. How can I use it in TensorFlow for word embeddings? I don't want to train the embeddings from scratch in TensorFlow. Can someone tell me how to do it with some example code?

– neel
1 Answer
Let's assume you have a dictionary vocab and an inv_dict list, where a word's index in the list matches its integer value in the dictionary:
vocab = {'hello': 0, 'world': 2, 'neural': 1, 'networks': 3}
inv_dict = ['hello', 'neural', 'world', 'networks']
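If you only have the dictionary, the inverse list can be derived from it; a trivial sketch of my own, not part of the original answer:

# Hypothetical helper: build inv_dict from vocab so the two always stay in sync.
inv_dict = [None] * len(vocab)
for word, idx in vocab.items():
    inv_dict[idx] = word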
Notice how each inv_dict index corresponds to a dictionary value. Now declare your embedding matrix and fill it with the pre-trained values:
import numpy as np
from gensim.models.keyedvectors import KeyedVectors

vocab_size = len(inv_dict)
emb_size = 300  # or whatever the size of your embeddings is

embeddings = np.zeros((vocab_size, emb_size), dtype=np.float32)

model = KeyedVectors.load_word2vec_format('embeddings_file', binary=True)
for k, v in vocab.items():
    embeddings[v] = model[k]  # copy the pre-trained vector into row v
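One caveat (my note, not part of the original answer): model[k] raises a KeyError for any word the pre-trained model has never seen. A minimal guard that leaves out-of-vocabulary words as zero vectors:

for k, v in vocab.items():
    if k in model:  # KeyedVectors supports `in`; skip OOV words, leaving zeros
        embeddings[v] = model[k]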
You've got your embeddings matrix. Good. Now let's assume you want to train on the sample x = ['hello', 'world']. But strings don't work for our neural net, so we need to integerize first:
x_train = []
for word in x:
    x_train.append(vocab[word])  # integerize
x_train = np.array(x_train)  # make into numpy array
For the sample above, x_train is now array([0, 2]). Now we are good to go with embedding our samples on the fly:
import tensorflow as tf

x_model = tf.placeholder(tf.int32, shape=[None, input_size])
with tf.device("/cpu:0"):
    embedded_x = tf.nn.embedding_lookup(embeddings, x_model)  # gathers rows of embeddings
Now embedded_x goes into your convolution or whatever. I am also assuming you are not retraining the embeddings, but simply using them. Hope that helps.
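If you ever do want to fine-tune the embeddings instead, one common pattern (a sketch on my part, not from the answer above) is to wrap the pre-trained matrix in a tf.Variable and toggle trainable:

# Frozen, as in the answer above: trainable=False keeps the optimizer away.
emb_var = tf.Variable(initial_value=embeddings, trainable=False, dtype=tf.float32)
# Flip trainable=True (and initialize variables) to fine-tune the vectors instead.
embedded_x = tf.nn.embedding_lookup(emb_var, x_model)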

– vega
- I'm pretty sure that the line `embeddings[v] = model[k]` should be replaced with `embeddings[v] = model.word_vec(k)` – bluesummers May 02 '17 at 08:26
- I also thought of this more manual approach (i.e. iterating over the whole vocabulary and looking words up one by one using `model.word_vec(k)`). But is there a way to make use of `tf.nn.embedding_lookup`, which it seems would be more efficient? One post using Tensorflow with GloVe https://guillaumegenthial.github.io/sequence-tagging-with-tensorflow.html essentially produced a custom GloVe file which can be used to perform direct index-to-embeddings lookup. I wonder if one can do something similar with Word2Vec (binary) files. – xji Jan 26 '18 at 20:33
- @JIXiang in practice you get all the words you want from Word2Vec and save them in a numpy array, pickle, or whatever. Loading word2vec from Gensim every time is very expensive. `tf.nn.embedding_lookup` requires a matrix, so you can't use `model.word_vec(k)` on the fly. And `tf` is more efficient. – vega Apr 13 '18 at 19:39
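To make that concrete, a minimal sketch of the save-once/load-fast workflow vega describes (the file name `embeddings.npy` is my own choice):

import numpy as np

# One-time step: after filling `embeddings` as in the answer above.
np.save('embeddings.npy', embeddings)

# Every later run: reload instantly, no gensim parsing required.
embeddings = np.load('embeddings.npy')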