Challenges when calculating perplexity: is my approach reasonable?
I am trying to find the pre-trained language model that will work best for my text. The text is quite specific in its language and content, but there's no test data available, nor any budget to generate it, so I'm using perplexity as an intrinsic metric to allow me to compare different fine-tuned versions of BART.
I've had a good look online but couldn't find any discussion of the following issues:
- BART's encoder is bidirectional, so when we talk about the 'context' used to calculate perplexity, the usual view that the context is all the tokens in a window up to the masked token seems wrong. I am therefore planning to use a window centred on (rather than ending at) the masked token, as in the first sketch below. Does that seem reasonable, or does it ruin the metric in some way I'm not anticipating?
- When I'm calculating perplexity for the larger sliding-window sizes suggested by HuggingFace, the probabilities I'm multiplying together become so small that Python rounds the product to zero, and perplexity comes out as infinity. I've checked, and none of the individual probabilities are zero; it's just their product that underflows (see the second sketch below). I had planned to use 1024 tokens, the maximum the model can take, but will instead cap the window at around 350 tokens. Has anyone else run into this problem and found another solution that I'm not seeing?
- The text I am interested in is one single, very long text. I've worked around that for my summarisation, but I'm interested in seeing how well a model works for the text in general. Calculating perplexity with a sliding window across the entire text would take far more time than I have, so my plan is to sample several shorter sections, calculate the perplexity of each, and then aggregate those scores. Any advice on the best way to do that: take an average of the per-section perplexities, or pool all the token probabilities and calculate a single perplexity across them, despite the sections being discontinuous? (The third sketch below shows the two options as I understand them.)
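
To make the first point concrete, here's a minimal sketch of the centred-window indexing I have in mind. It's just index arithmetic over a token list, not my actual scoring code, and `window_size`, `tokens` and `target_idx` are illustrative values:

```python
def centred_window(tokens, target_idx, window_size):
    """Return the slice of `tokens` centred on `target_idx`.

    The window is shifted rather than shrunk when the target sits near
    either end of the sequence, so it always contains `window_size`
    tokens (or the whole sequence if it is shorter than that).
    """
    half = window_size // 2
    start = max(0, target_idx - half)
    end = min(len(tokens), start + window_size)
    start = max(0, end - window_size)  # re-anchor if we hit the right edge
    return tokens[start:end], target_idx - start  # window + target's index within it


# Illustrative usage: the masked token ends up in the middle of the
# context rather than at its right-hand edge.
tokens = list(range(1000))          # stand-in for token ids
window, local_idx = centred_window(tokens, target_idx=600, window_size=350)
print(len(window), window[local_idx])  # 350 600
```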
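For the second point, here's a small illustration of the underflow itself, alongside the log-space accumulation I've seen suggested as a work-around (summing log-probabilities and exponentiating the average negative log-likelihood, rather than multiplying raw probabilities). The numbers are made up and this isn't my pipeline:

```python
import math

# Made-up per-token probabilities: none are zero, but there are a lot of them.
probs = [0.05] * 1024

# Multiplying raw probabilities underflows to 0.0, so perplexity blows up.
product = 1.0
for p in probs:
    product *= p
print(product)                         # 0.0  -> perplexity = inf

# Summing log-probabilities stays finite; perplexity is then
# exp of the average negative log-likelihood.
nll = -sum(math.log(p) for p in probs) / len(probs)
print(math.exp(nll))                   # 20.0 (= 1/0.05), no underflow
```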
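And for the last point, here's how I understand the two aggregation options, written out in terms of per-section token counts and total negative log-likelihoods (the section lengths and NLL values here are invented):

```python
import math

# Hypothetical per-section results: (number of tokens, total negative log-likelihood).
sections = [(300, 840.0), (350, 1050.0), (250, 690.0)]

# Option A: compute perplexity per section, then take a plain average of those perplexities.
per_section_ppl = [math.exp(nll / n) for n, nll in sections]
option_a = sum(per_section_ppl) / len(per_section_ppl)

# Option B: pool everything and compute one perplexity over all sampled tokens,
# which effectively weights each section by its length.
total_tokens = sum(n for n, _ in sections)
total_nll = sum(nll for _, nll in sections)
option_b = math.exp(total_nll / total_tokens)

print(option_a, option_b)  # the two numbers generally differ
```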