I have a specific generation problem involving a dataset built from a very small vocabulary. My use case would be much more straightforward if I could simply provide that vocabulary as a fixed set of tokens. I know that with the BertTokenizer, for example, I can provide a vocab.txt file and avoid any further tokenization of this basic vocabulary, and I'm wondering whether there's a way to get GPT-2 to do the same. The only thing I can think of right now is creating a hacked PreTrainedTokenizer subclass, but perhaps someone has a better idea?
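For concreteness, this is the kind of thing I mean on the BERT side (just a sketch; the vocab.txt path and settings are hypothetical):

```python
from transformers import BertTokenizer

# Hypothetical vocab.txt: one token per line, BERT's special tokens
# ([PAD], [UNK], [CLS], [SEP], [MASK]) followed by my small fixed vocabulary.
tokenizer = BertTokenizer(vocab_file="vocab.txt", do_lower_case=False)

# Words already in the vocabulary are kept whole; anything else falls back
# to WordPiece pieces or [UNK], so the model only ever sees my fixed tokens.
print(tokenizer.tokenize("tokens from my fixed vocabulary"))
```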
Any thoughts appreciated.
UPDATE: Okay, so it turns out I can just swap in BertTokenizer (or BertWordPieceTokenizer) in place of the GPT-2 tokenizer when creating the GPT2LMHeadModel. (Thanks HuggingFace for a well-designed, modular codebase!)
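In case it helps anyone else, here's roughly what that looks like (a sketch under my assumptions; the vocab.txt path and the config values are illustrative only):

```python
from transformers import BertTokenizer, GPT2Config, GPT2LMHeadModel

# Hypothetical vocab.txt holding the small fixed vocabulary plus special tokens.
tokenizer = BertTokenizer(vocab_file="vocab.txt", do_lower_case=False)

# GPT2LMHeadModel is tokenizer-agnostic: it only sees token ids, so the main
# requirement is that the embedding table matches the tokenizer's vocab size.
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    bos_token_id=tokenizer.cls_token_id,
    eos_token_id=tokenizer.sep_token_id,
)
model = GPT2LMHeadModel(config)

# Encode with the WordPiece tokenizer and feed the ids straight to GPT-2.
inputs = tokenizer("tokens from my fixed vocabulary", return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"], labels=inputs["input_ids"])
print(outputs.loss)
```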