
I'm currently working with a Keras model which has an embedding layer as its first layer. In order to visualize the relationships and similarities between words, I need a function that returns the mapping of words to vectors for every element in the vocabulary (e.g. 'love' - [0.21, 0.56, ..., 0.65, 0.10]).

Is there any way to do it?
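For reference, the model I have in mind looks roughly like this (the layer sizes are just placeholders):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# placeholder sizes, just to show the structure: an embedding layer as the first layer
model = Sequential([
    Embedding(input_dim=10000, output_dim=128),
    LSTM(64),
    Dense(1, activation='sigmoid'),
])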


1 Answer


You can get the word embeddings by using the get_weights() method of the embedding layer (i.e. the weights of an embedding layer are essentially the embedding vectors):

# if you have access to the embedding layer explicitly
embeddings = embedding_layer.get_weights()[0]

# or access the embedding layer through the constructed model 
# first `0` refers to the position of embedding layer in the `model`
embeddings = model.layers[0].get_weights()[0]

# `embeddings` has a shape of (num_vocab, embedding_dim) 

# `word_to_index` is a mapping (i.e. dict) from words to their index, e.g. `love`: 69
words_embeddings = {w:embeddings[idx] for w, idx in word_to_index.items()}

# now you can use it like this for example
print(words_embeddings['love'])  # possible output: [0.21, 0.56, ..., 0.65, 0.10]
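
For completeness, here is a minimal end-to-end sketch of where `word_to_index` can come from; the toy corpus, the use of the Keras Tokenizer and the untrained model are only illustrative assumptions:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.text import Tokenizer

# toy corpus, only for illustration
texts = ['i love keras', 'keras makes embeddings easy']

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
word_to_index = tokenizer.word_index           # e.g. {'keras': 1, 'i': 2, ...}

vocab_size = len(word_to_index) + 1            # +1 because index 0 is reserved by the Tokenizer
embedding_dim = 8

model = Sequential([Embedding(input_dim=vocab_size, output_dim=embedding_dim)])
_ = model(np.array([[1, 2, 3]]))               # call the model once so the layer builds its weights

embeddings = model.layers[0].get_weights()[0]  # shape: (vocab_size, embedding_dim)
words_embeddings = {w: embeddings[idx] for w, idx in word_to_index.items()}
print(words_embeddings['love'])                # a random (untrained) vector of length embedding_dim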
  • with this line 'words_embeddings = {w: embeddings[idx] for w, idx in tokenizer.word_index}' I get the following exception: IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices. tokenizer.word_index returns a mapping from words to their index. – philszalay Jul 09 '18 at 22:01
  • @dermaschder I think you have forgotten to call `items()` on the dictionary, i.e. `tokenizer.word_index.items()`. – today Jul 09 '18 at 22:20
  • Do not forget to add a special token in case you pad your input for some use cases such as LSTMs. You can add an index for the padding as well, something like `word_to_index['__padding__'] = 0`. – Vaibhav Jan 16 '19 at 22:22
  • @today, are embeddings in Keras static or dynamic (sometimes called contextualized embeddings)? – A.B Jul 01 '21 at 21:25
  • @A.B If you are referring to contextualized vs. non-contextualized embeddings, then this is not at all related to Keras or the DL framework you are using. Instead it is related to the method or algorithm you are using. The embedding layer by itself is only a lookup table: given an integer index, it returns a vector corresponding to that index. It's you, as the designer of the method or architecture of the model, who decides to use it in a way that the model gives you contextualized or non-contextualized embeddings. – today Jul 02 '21 at 08:26
  • Thank you @today, the integer index part makes sense. The learned dense vector it returns, is it closer to a static embedding or to a contextualized/dynamic one? If I learn them with a downstream LSTM prediction task, will they become contextualized? – A.B Jul 02 '21 at 10:07
  • @A.B Contextualized embeddings are not achieved with just an embedding layer; it's the architecture of the model (besides the embedding layer) which produces contextualized embeddings. The values in the embedding layer are fixed (after training) and therefore, given two sentences like "the bank account" and "the bank of the river", the vector produced by the embedding layer for the word "bank" is exactly the same in both sentences. So you must add other layers, whether RNN or Transformer layers, on top in order to produce contextualized embeddings (as the output of those layers, not of the embedding layer). – today Jul 02 '21 at 10:45
  • Thank you very much @today. It makes sense and was really helpful :) – A.B Jul 02 '21 at 11:03
  • How can I import `emebdding_layer`? It says `emebdding_layer` is not defined. I couldn't find it in the documentation. I also think that there is a typo – Guilherme Parreira Jul 07 '21 at 11:50
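
Picking up two points from the comment thread above, namely the missing `.items()` call and reserving an index for padding, here is a small self-contained sketch; the stand-in embedding matrix and the `__padding__` token name are only illustrative:

import numpy as np

# stand-ins for a trained model's pieces: an embedding matrix and a word index
# of the kind produced by tokenizer.word_index after fit_on_texts()
embeddings = np.random.rand(4, 5)                      # (vocab_size, embedding_dim)
word_to_index = {'love': 1, 'keras': 2, 'vectors': 3}  # index 0 left free

# optionally reserve index 0 for the padding token used in padded inputs
word_to_index['__padding__'] = 0

# iterate with .items() to get (word, index) pairs; iterating the dict directly
# yields only the keys, not (word, index) pairs, which led to the IndexError above
words_embeddings = {w: embeddings[idx] for w, idx in word_to_index.items()}
print(words_embeddings['love'])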