
Is it possible to generate word embeddings with Google's T5?

I'm assuming that this is possible, but I cannot find the code I would need to generate word embeddings on the relevant GitHub (https://github.com/google-research/text-to-text-transfer-transformer) or Hugging Face (https://huggingface.co/docs/transformers/model_doc/t5) pages.


1 Answer


Yes, that is possible. Just feed the token IDs to the model's word embedding layer:

from transformers import T5TokenizerFast, T5EncoderModel

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5EncoderModel.from_pretrained("t5-small")

# Tokenize without special tokens so that every token belongs to a word
i = tokenizer(
    "This is a meaningless test sentence to show how you can get word embeddings",
    return_tensors="pt",
    return_attention_mask=False,
    add_special_tokens=False,
)

# Look up the (non-contextual) embedding vector for each token id
o = model.encoder.embed_tokens(i.input_ids)

The output tensor has the following shape:

print(o.shape)
# torch.Size([1, 19, 512])
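
The same embedding matrix can also be reached through the generic get_input_embeddings() accessor; here is a small equivalent sketch (it should produce the same tensor as model.encoder.embed_tokens above):

# get_input_embeddings() returns the shared token embedding layer
embedding_layer = model.get_input_embeddings()
o_alt = embedding_layer(i.input_ids)
print(o_alt.shape)
# torch.Size([1, 19, 512])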

Each of the 19 vectors is the 512-dimensional representation of one token. Depending on your task, you can map them back to the individual words with word_ids():

i.word_ids()

Output:

[0, 1, 2, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 12, 12]
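
If you need one vector per word rather than per token, you can pool the token vectors that share a word id. Below is a minimal sketch that mean-pools over subword tokens (mean-pooling is just one choice; other strategies such as keeping only the first subword also work):

import torch

# Average all token embeddings that belong to the same word id;
# with the word ids shown above this yields one vector per word.
word_ids = i.word_ids()
word_embeddings = torch.stack([
    o[0, [j for j, w in enumerate(word_ids) if w == wid]].mean(dim=0)
    for wid in sorted(set(word_ids))
])
print(word_embeddings.shape)
# torch.Size([13, 512])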