As we all know, BERT is very capable at producing word embeddings, probably better than word2vec and most other models.

I want to build a model on top of BERT word embeddings to generate synonyms or similar words, the same way we do in Gensim Word2Vec. In other words, I want to replicate Gensim's model.most_similar() method on BERT word embeddings.
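For reference, this is the Word2Vec behavior I mean (a minimal sketch; the vectors file name is just a placeholder):

from gensim.models import KeyedVectors

# load any pretrained word2vec vectors; 'vectors.bin' is a placeholder path
wv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)
print(wv.most_similar('car', topn=5))  # e.g. [('cars', 0.82), ('vehicle', 0.78), ...]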

I have researched this a lot, and it seems possible, but the problem is that I only get the embeddings as numbers; there seems to be no way to get the actual word back from them. Can anybody help me with this?

DevPy

1 Answer

  1. BERT uses tokens, which are not exactly the same as words, so a single word may not be just a single token.

  2. BERT generates embedding vectors for each token with respect to the other tokens in its context.

  3. You can select a pretrained BERT model, feed it a single word, take the output, and average the token embeddings, so you get a single vector for the word.

  4. Get a list of words and calculate a vector for each of them, as in the snippet below:

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()  # inference only, no training

word = "Hello"
inputs = tokenizer(word, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# pooler_output is the pooled [CLS] representation; to follow step 3 literally,
# average outputs.last_hidden_state over the token dimension instead
word_vect = outputs.pooler_output.numpy()
  5. Calculate the vector distances between the query word and your word list, so you can get similar words from the distances (a minimal sketch follows).
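Putting steps 4 and 5 together, a minimal sketch (the vocabulary list and the helper names word_vector and most_similar are my own, for illustration; it reuses the tokenizer and model from above):

import numpy as np

def word_vector(word):
    # embed a single word with BERT, as in the snippet above
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.pooler_output.squeeze().numpy()

def most_similar(query, vocabulary, topn=5):
    # rank vocabulary words by cosine similarity to the query word
    q = word_vector(query)
    scores = []
    for w in vocabulary:
        v = word_vector(w)
        sim = np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))
        scores.append((w, float(sim)))
    return sorted(scores, key=lambda s: s[1], reverse=True)[:topn]

vocab = ["car", "vehicle", "banana", "train", "apple"]  # your own word list
print(most_similar("automobile", vocab, topn=3))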
Birol Kuyumcu
  • Can you specify the exact way, or code, to do it? I didn't get it. – DevPy Oct 28 '21 at 07:45
  • I added code for turning a single word into a vector, but I think it is not a good idea, because "BERT generates embedding vectors for each token with respect to other tokens within the context." – Birol Kuyumcu Oct 29 '21 at 05:22
  • Oh okay, I got your point: I have to get a corpus with many words and calculate BERT embeddings for it, and to generate similar words I just have to calculate the given word's embedding and match it by cosine similarity against my corpus embeddings! – DevPy Oct 29 '21 at 05:37
  • That is a good approach too, but can I get the benefit of BERT's existing word embeddings, i.e. get similar-word embeddings from the BERT model and convert those embeddings back to words? Is it possible to convert BERT embeddings to words again? – DevPy Oct 29 '21 at 05:38
  • Yes, but "BERT generates embedding vectors for each token with respect to other tokens within the context" means the same word has a different embedding depending on the context, and you lose this (see the sketch after these comments). – Birol Kuyumcu Oct 29 '21 at 05:40
  • Okay, so does that mean word2vec is the only library available for the task I am looking for, or are there alternatives? – DevPy Oct 29 '21 at 05:47
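To illustrate the contextuality point from the comments above, a minimal sketch (the sentences and the helper name token_vector are my own; it reuses the tokenizer, model, and numpy import from the answer):

def token_vector(sentence, word):
    # return the BERT vector of the given word's token within the sentence
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index(word)  # assumes the word maps to a single subtoken
    return outputs.last_hidden_state[0, idx].numpy()

v1 = token_vector("she sat on the river bank", "bank")
v2 = token_vector("he opened an account at the bank", "bank")
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
# noticeably below 1.0: the same word gets different vectors in different contexts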