
When I try to get word embeddings for a sentence using Bio_ClinicalBERT, an 8-word sentence gives me 11 token IDs (plus [CLS] and [SEP]) because "embeddings" is an out-of-vocabulary word/token that gets split into em, ##bed, ##ding, ##s.

I would like to know whether there are any aggregation strategies that make sense here, apart from taking the mean of these vectors.

import torch
from transformers import AutoTokenizer, AutoModel

# download and load model
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

sentences = ['This framework generates embeddings for each input sentence']


#Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')


#Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

print(encoded_input['input_ids'].shape)

Output: torch.Size([1, 13])

for token in encoded_input['input_ids'][0]:
    print(tokenizer.decode([token]))

Output:

[CLS]
this
framework
generates
em
##bed
##ding
##s
for
each
input
sentence
[SEP]

1 Answer


To my knowledge, mean aggregation is the most commonly used approach here, and in fact there is even scientific literature empirically showing that it works well: Generalizing Word Embeddings using Bag of Subwords by Zhao, Mudgal, and Liang. Their Formula 1 describes exactly what you are proposing.
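
For reference, here is a minimal sketch of that subword-mean aggregation, reusing model_output and encoded_input from your code above. It assumes the fast tokenizer is loaded (so that word_ids() is available); the variable names are my own:

# group subword vectors back into word vectors by averaging,
# using word_ids() to map each token position to its source word
token_embeddings = model_output.last_hidden_state[0]          # (num_tokens, hidden_size)
word_ids = encoded_input.word_ids(0)                          # None for [CLS]/[SEP]/padding

word_embeddings = []
for word_idx in sorted(set(i for i in word_ids if i is not None)):
    positions = [pos for pos, w in enumerate(word_ids) if w == word_idx]
    word_embeddings.append(token_embeddings[positions].mean(dim=0))

word_embeddings = torch.stack(word_embeddings)                # (num_words, hidden_size)
print(word_embeddings.shape)                                  # torch.Size([8, 768])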

The one alternative you could theoretically employ is something like a mean aggregate over the entire input, essentially making a "context prediction" over all words (potentially excluding "embeddings"), thereby emulating something similar to the [MASK]ing during pre-training of the transformer models. But this is just a suggestion from me, without any scientific evidence that it works (for better or worse).
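
If you want to try that, a rough sketch would be plain mean pooling over all token vectors of the input, again reusing the variables from your code; this is just my interpretation of the idea, not an established recipe:

# mean-pool every token embedding in the sequence into one vector,
# using the attention mask so padding positions are ignored
token_embeddings = model_output.last_hidden_state                      # (batch, seq_len, hidden)
mask = encoded_input["attention_mask"].unsqueeze(-1).float()           # (batch, seq_len, 1)
context_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(context_embedding.shape)                                         # torch.Size([1, 768])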
