Keras initialize large embeddings layer with pretrained embeddings

Question

I am trying to re-train a word2vec model in Keras 2 with the TensorFlow backend, using pretrained embeddings and a custom corpus.

This is how I initialize the embeddings layer with pretrained embeddings:

embedding = Embedding(vocab_size, embedding_dim,
                      input_length=1, name='embedding',
                      embeddings_initializer=lambda x: pretrained_embeddings)

where pretrained_embeddings is a big matrix of size vocab_size x embedding_dim

This works as long as pretrained_embeddings is not too big.

Unfortunately, in my case the matrix is too big: vocab_size=2270872 and embedding_dim=300.

Upon initializing the Embeddings layer I get the error:

Cannot create a tensor proto whose content is larger than 2GB.

The error comes from the function add_weight() in /opt/r/anaconda3/lib/python3.6/site-packages/keras/engine/base_layer.py, more specifically the following line:

weight = K.variable(initializer(shape),
                    dtype=dtype,
                    name=name,
                    constraint=constraint)

initializer is the lambda function from above, which returns the big matrix. shape is (2270872, 300) as already mentioned.
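A quick back-of-the-envelope check of why a matrix of this shape trips the limit. This is only a sketch: it assumes the weights are stored as float32 (4 bytes per value) and reads "2GB" as 2**31 bytes; neither detail is stated in the error message itself.

```python
# Shape taken from the question.
vocab_size = 2270872
embedding_dim = 300

# Assumption: values are float32, i.e. 4 bytes each.
bytes_per_float32 = 4

# Assumption: the "2GB" limit is interpreted as 2**31 bytes.
proto_limit_bytes = 2 ** 31

# Total size of the embedding matrix if it were serialized as one constant.
matrix_bytes = vocab_size * embedding_dim * bytes_per_float32

print(matrix_bytes)                      # 2725046400
print(matrix_bytes > proto_limit_bytes)  # True
```

Under these assumptions the matrix is roughly 2.7 billion bytes, so it cannot be embedded as a single constant in the graph, which is consistent with the error above.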

Is it possible to solve this issue without resorting to low-level TensorFlow programming? If I switch to Theano as the backend, the code runs fine, but I'd like to use TensorFlow for its better long-term prospects.

The only similar Stack Overflow question I found was this, which proposes placeholder variables, but I am not sure how to apply them at the Keras level.

Thanks a lot

Edit: I am more than willing to work around this issue at the level of the TensorFlow backend. I just don't know how to combine TensorFlow and Keras code in the same application in this case. Most examples use one or the other, not both.

For example, what use are the TensorFlow placeholder variables when initializing the Embedding layer in Keras will inevitably invoke the add_weight() function, which causes the issue?

Solution:

As hinted in @blue-phoenox's comment, I rewrote the code like this:

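from keras.layers import Embedding

# Create the layer without passing the big matrix to an initializer
embedding = Embedding(vocab_size, embedding_dim,
                      input_length=1,
                      name='embedding')
embedding.build(input_shape=(1,)) # the input_shape here has no effect in the build function
# Overwrite the default-initialized weights with the pretrained matrix
embedding.set_weights([pretrained_embeddings])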

That did it. Thanks again @blue-phoenox.

Probably this helps: https://stackoverflow.com/questions/35394103/initializing-tensorflow-variable-with-an-array-larger-than-2gb — ixeption, Nov 21 '18 at 21:39
Actually, this is the link that I also referred to at the end of the question. Unfortunately I do not know how to make use of it in my case. — Pavlin Mavrodiev, Nov 22 '18 at 08:34
What about just setting the weights instead of initializing? https://stackoverflow.com/questions/51819213/keras-function-api-setting-weight-manually-to-a-layer/51819438#51819438 — MBT, Nov 22 '18 at 12:49
@blue-phoenox Thanks. That did it. Can you post your reply as a separate answer so that I can select it as the best answer? — Pavlin Mavrodiev, Nov 22 '18 at 15:35

MBT · Accepted Answer · 2018-11-23T19:23:08.497

Instead of using the embeddings_initializer argument of the Embedding layer, you can load pre-trained weights for your embedding layer using the weights argument. This way you should be able to hand over pre-trained embeddings larger than 2GB.

Here is a short example:

from keras.layers import Embedding

embedding_layer = Embedding(vocab_size,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

Here embedding_matrix is just a regular NumPy matrix containing your weights.

For more examples you can also take a look here:
https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

Edit:

As @PavlinMavrodiev (see end of question) correctly pointed out, the weights argument is deprecated. He used the layer method set_weights to set the weights instead:

layer.set_weights(weights): sets the weights of the layer from a list of Numpy arrays (with the same shapes as the output of get_weights).

To retrieve the trained weights, get_weights can be used:

layer.get_weights(): returns the weights of the layer as a list of Numpy arrays.

Both are methods of the Keras Layer base class and can be used with all Keras layers, including the Embedding layer.
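A minimal sketch of the set_weights / get_weights round trip described above. It assumes TensorFlow 2.x with its bundled tf.keras rather than the standalone Keras 2 used in the question; the small sizes and the random matrix are placeholders for illustration, not the real pretrained data.

```python
import numpy as np
from tensorflow import keras

# Hypothetical small sizes; the question uses 2270872 x 300.
vocab_size, embedding_dim = 1000, 8

# Stand-in for the real pretrained matrix (random values, for illustration only).
pretrained_embeddings = np.random.rand(vocab_size, embedding_dim).astype("float32")

# Create the layer without handing the matrix to an initializer.
embedding = keras.layers.Embedding(vocab_size, embedding_dim, name="embedding")

# Build the layer so that its weight variable exists (default initialization).
embedding.build((None, 1))

# Overwrite the default values with the pretrained ones.
embedding.set_weights([pretrained_embeddings])

# Read the weights back as a list of NumPy arrays.
loaded = embedding.get_weights()[0]
print(loaded.shape)
```

The layer has to be built before set_weights is called, because the weight variable does not exist until then; building it explicitly, as in the question's edit, is one way to do that outside of a full model.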

I am accepting this as the solution to the question, although my implementation is slightly different. I've put the latter as an edit in the original question. The reason is that the `weights` argument to the `Embedding` layer appears to be outdated, although it works at the moment. It is not mentioned in the latest Keras 2 documentation. I believe I've implemented a more future-proof version. — Pavlin Mavrodiev, Nov 23 '18 at 17:42
How would you use the `set_weights` method when you're using the `tf.keras.layers.Embedding` layer inside a custom `class MyLayer(tf.keras.layers.Layer)`'s `__init__`? Where do you need to put it? I can't figure it out. — Frederik Bode, Mar 25 '21 at 21:37
