
I'm new to the Hugging Face library and am trying to run a model for masked language modeling (the "fill-mask" task):

from transformers import BertTokenizer, BertForMaskedLM
import torch
from transformers import pipeline, AutoTokenizer, AutoModel

# Initialize MLM pipeline
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

print(len(tokenizer.vocab))
>>> 28996

But when I try to get the probabilities over the tokens, I get an error:

classifier = pipeline("fill-mask", model=model, tokenizer=tokenizer)
results = classifier("Paris is the [MASK] of France.")

>>>KeyError                                  Traceback (most recent call last)
<ipython-input-15-30c429f29424> in <module>()
      1 classifier = pipeline("fill-mask", model=model, tokenizer=tokenizer)
----> 2 results = classifier("Paris is the [MASK] of France.")

4 frames
/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py in __getitem__(self, k)
   2041         if isinstance(k, str):
   2042             inner_dict = {k: v for (k, v) in self.items()}
-> 2043             return inner_dict[k]
   2044         else:
   2045             return self.to_tuple()[k]

KeyError: 'logits'

I also tried the following from a different tutorial and got the same error:

mlm = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# Get mask token
mask = mlm.tokenizer.mask_token

# Get result for particular masked phrase
phrase = f'Paris is the {mask} of France.'
result = mlm(phrase, top_k=10000)

# Print result
print(result)
Penguin

1 Answer


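You are using the pipeline in the wrong way. AutoModel loads the bare encoder without the masked-language-modeling head, so its output contains no 'logits' entry, which is what raises the KeyError. The simplest fix is to pass only the model name to the model argument and let the pipeline load the right model class: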

model_name = "emilyalsentzer/Bio_ClinicalBERT"
classifier = pipeline("fill-mask", model=model_name, tokenizer=tokenizer)
results = classifier("Paris is the [MASK] of France.")
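To see why the original code fails, here is a minimal sketch that compares the outputs of a bare BERT encoder and a BERT model with a masked-LM head. It uses a tiny, randomly initialised config (the sizes are made up for illustration) so that nothing has to be downloaded; it is not the Bio_ClinicalBERT checkpoint itself.

```python
import torch
from transformers import BertConfig, BertModel, BertForMaskedLM

# Tiny random config, chosen only for illustration (assumed sizes, no download).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)

base = BertModel(config).eval()        # bare encoder, the kind of model AutoModel returns for BERT
mlm = BertForMaskedLM(config).eval()   # encoder plus masked-LM head

input_ids = torch.tensor([[1, 2, 3, 4]])  # dummy token ids
with torch.no_grad():
    base_out = base(input_ids=input_ids)
    mlm_out = mlm(input_ids=input_ids)

base_keys = list(base_out.keys())
mlm_keys = list(mlm_out.keys())
logits_shape = tuple(mlm_out.logits.shape)  # (batch, sequence length, vocab size)

print("bare encoder output keys:", base_keys)
print("masked-LM output keys:", mlm_keys)
print("logits shape:", logits_shape)
```

Only the second model returns logits, which is what the fill-mask pipeline looks up. If that is the cause here, loading the real checkpoint with AutoModelForMaskedLM.from_pretrained(...) instead of AutoModel.from_pretrained(...) should also let you keep passing a model object to pipeline.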

And if you want to see the results:

for i in range(len(results)):
  print(f"the {i}-th result={results[i]['token_str']} has score {results[i]['score']}")

which will be

the 0-th result=cause has score 0.1672661453485489
the 1-th result=site has score 0.14680784940719604
the 2-th result=source has score 0.12052636593580246
the 3-th result=area has score 0.07053395360708237
the 4-th result=sign has score 0.05601896718144417

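So I'm not sure the model you used is a good option for predicting the [MASK] here. Bio_ClinicalBERT was trained on clinical notes, which may be why its top predictions are words like "cause" and "site" rather than "capital".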
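As for the scores themselves, they are probabilities obtained by applying a softmax to the logits at the masked position and keeping the highest-ranked tokens. The sketch below reproduces that step on a hypothetical five-word vocabulary with made-up logits, so the numbers are illustrative only and do not come from any real model.

```python
import torch

# Hypothetical vocabulary and logits, invented for illustration.
vocab = ["capital", "city", "cause", "site", "source"]
logits = torch.tensor([
    [0.0, 0.0, 0.0, 0.0, 0.0],    # position 0
    [2.0, 1.0, 0.5, 0.0, -1.0],   # position 1: the masked position
    [0.0, 0.0, 0.0, 0.0, 0.0],    # position 2
])
mask_index = 1

# Softmax over the vocabulary at the masked position gives one probability per token.
probs = torch.softmax(logits[mask_index], dim=-1)

# Keep the k most probable tokens, as the top_k argument of the pipeline does.
top_scores, top_ids = probs.topk(3)
top_tokens = [vocab[i] for i in top_ids.tolist()]

for token, score in zip(top_tokens, top_scores.tolist()):
    print(f"{token}: {score:.4f}")
```

Setting top_k to the vocabulary size, as in the question's second snippet, would return the full distribution in ranked order.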

Doralisa