
I have downloaded the pretrained GloVe matrix and used it in a Keras Embedding layer. However, I need sentence embeddings for another task.

I want to calculate the mean of all the word embeddings in each sentence.

What is the most efficient way to do that, given that there are about 25,000 sentences?

Also, I don't want to use a Lambda layer in Keras to compute the mean.

Mohammad Reza

1 Answer


The best way to do this is to use a GlobalAveragePooling1D layer. It receives the token embeddings produced by the Embedding layer, with shape (n_sentences, n_tokens, emb_dim), and averages over the token axis, so the result has shape (n_sentences, emb_dim).

Here is a code example:

import numpy as np
from tensorflow.keras.layers import Input, Embedding, GlobalAveragePooling1D
from tensorflow.keras.models import Model

embedding_dim = 128
vocab_size = 100
sentence_len = 20

embedding_matrix = np.random.uniform(-1, 1, (vocab_size, embedding_dim))
test_sentences = np.random.randint(0, vocab_size, (3, sentence_len))

inp = Input((sentence_len,))
embedder = Embedding(vocab_size, embedding_dim,
                     trainable=False, weights=[embedding_matrix])(inp)
avg = GlobalAveragePooling1D()(embedder)

model = Model(inp, avg)
model.summary()

model(test_sentences)  # the mean of all the word embeddings inside each sentence
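As a sanity check (a sketch with the same arbitrary shapes as above, not part of the original answer): since GlobalAveragePooling1D with no mask is just a mean over the token axis, the same result can be computed directly in NumPy by indexing the embedding matrix with the token IDs. This is also a cheap way to process ~25,000 sentences without building a model at all.

```python
import numpy as np

embedding_dim = 128
vocab_size = 100
sentence_len = 20

embedding_matrix = np.random.uniform(-1, 1, (vocab_size, embedding_dim))
test_sentences = np.random.randint(0, vocab_size, (3, sentence_len))

# Look up each token's vector: shape (n_sentences, n_tokens, emb_dim)
token_vectors = embedding_matrix[test_sentences]

# Average over the token axis: shape (n_sentences, emb_dim)
sentence_embeddings = token_vectors.mean(axis=1)

print(sentence_embeddings.shape)  # (3, 128)
```

Note that, like the pooling layer without a mask, this averages over padding tokens too; if the sentences are padded, the padding positions should be excluded from the mean.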
Marco Cerliani
  • I get this error: Layer model_5 was called with an input that isn't a symbolic tensor. Received type: when I'm trying to test the code – Mohammad Reza Jun 10 '20 at 11:48
  • 1
    I think you are using keras instead of tf.keras or an older version of tf. replace the final line with model.predict(test_sentences) https://colab.research.google.com/drive/1W8uLy49H_8UuD9DGZvtP7Md1f4ap3u6A?usp=sharing – Marco Cerliani Jun 10 '20 at 11:52