
Challenges when calculating perplexity: is my approach reasonable?

I am trying to find a pre-trained language model that will work best for my text. The text is pretty specific in its language and content, but there's no test data available (or budget to generate it), so I'm using perplexity as an intrinsic metric to allow me to compare different fine-tuned versions of BART.

I've had a good look online but couldn't find any discussion about some of the following issues:

  • BART is a bi-directional model, so when we talk about 'context' for calculating perplexity, the usual view that the context is all the words in a window up to the masked token seems incorrect. I am therefore planning to use a window centred on (rather than ending at) the masked token; see the sketch after this list. Does that seem correct, or does it ruin the metric in some way I'm not anticipating?
  • When I'm calculating perplexity with the larger sliding-window sizes suggested by Hugging Face, the probabilities that I'm multiplying together become so small that Python rounds them to zero, and perplexity therefore comes out as infinite. I've checked and none of the probabilities themselves are zero; it's just their product that becomes too small. I had planned to use 1024 tokens (the maximum the model can take), but will instead have to limit the window to ~350. Has anyone else run into this problem and found another solution that I'm not seeing?
  • The text I am interested in is one single very long document. I've worked around that for my summarisation, but I'm interested in seeing how well a model works for the text in general. It would take far more time than I have to calculate perplexity with a sliding window across the entire text, so my plan is to sample several shorter sections, calculate the perplexity of each, and then aggregate those scores. Any advice on the best way to do that: take an average, or pool all of the probabilities together and calculate perplexity across them despite their being discontinuous?
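
For concreteness, here is a rough sketch of the centred-window scoring I have in mind. The model name, the half-window size, and the use of BartForConditionalGeneration with a <mask> token to score one position at a time are just my working assumptions, not settled choices:

import torch
from transformers import BartTokenizer, BartForConditionalGeneration

model_name = "facebook/bart-base"  # placeholder; would be one of my fine-tuned checkpoints
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name).eval()

def centred_window_token_probs(text, half_window=175):
    """Probability of each token, predicted from a context window centred on it."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    probs = []
    for i in range(1, len(ids) - 1):              # skip <s> and </s>
        start = max(0, i - half_window)
        end = min(len(ids), i + half_window + 1)
        window = ids[start:end].unsqueeze(0)
        pos = i - start                           # position of the target inside the window
        masked = window.clone()
        masked[0, pos] = tokenizer.mask_token_id  # hide the token being scored
        with torch.no_grad():
            # labels are shifted right internally, so logits[pos] scores window[pos]
            logits = model(input_ids=masked, labels=window).logits
        token_probs = torch.softmax(logits[0, pos], dim=-1)
        probs.append(token_probs[window[0, pos]].item())
    return probs  # the per-token probabilities I then combine into perplexity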
Agnes

1 Answer


Built-in support for perplexity calculations in Hugging Face transformers is not very good. Instead, I recommend using the minicons library, which was built on top of Hugging Face transformers, and can handle all log-likelihood calculations for you under the hood.

from minicons import scorer

s2s_model = scorer.Seq2SeqScorer('facebook/bart-base', 'cuda')

stimuli = ["The keys to the cabinet are on the table.",
           "The keys to the cabinet is on the table."]

print(s2s_model.sequence_score(stimuli, source_format = 'blank'))
# [-10.298685073852539, -10.341218948364258]

You can replace facebook/bart-base with the path to your model.

Regarding how to aggregate the scores, I believe it's a good idea to take the average and standard deviation across the samples when comparing different models. Perplexity is a measure of how well the model predicts each subsequent word given the previous words, so it's largely about local continuity (how well the model captures the immediate context) rather than global continuity (how well the model understands the entire document as a whole).
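
As a rough sketch of what I mean (reusing the s2s_model from above; the sampled passages here are placeholders for sections drawn from your long text):

from statistics import mean, stdev

# hypothetical samples: shorter sections taken from the long document
samples = ["First sampled passage ...",
           "Second sampled passage ...",
           "Third sampled passage ..."]

scores = s2s_model.sequence_score(samples, source_format = 'blank')
print(mean(scores), stdev(scores))  # compare these two numbers across your fine-tuned models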

Ruan