I have a specific generation problem involving a dataset built from a very small vocabulary. My use case would be much more straightforward if I could simply provide that vocabulary as a fixed set of tokens. I know that with the BertTokenizer, for example, I can provide a vocab.txt file and avoid any further tokenization of this basic vocabulary, and I'm wondering whether there's a way to get GPT-2 to do the same. The only thing I can think of right now is creating a hacked PreTrainedTokenizer subclass, but perhaps someone has a better idea?
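For concreteness, this is the kind of thing I mean on the BERT side (just a sketch; the vocab.txt path and settings are hypothetical):

```python
from transformers import BertTokenizer

# Hypothetical vocab.txt: one token per line, BERT's special tokens
# ([PAD], [UNK], [CLS], [SEP], [MASK]) followed by my small fixed vocabulary.
tokenizer = BertTokenizer(vocab_file="vocab.txt", do_lower_case=False)

# Words already in the vocabulary are kept whole; anything else falls back
# to WordPiece pieces or [UNK], so the model only ever sees my fixed tokens.
print(tokenizer.tokenize("tokens from my fixed vocabulary"))
```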
Any thoughts appreciated.
UPDATE: Okay, so it turns out I can just swap in BertTokenizer (or BertWordPieceTokenizer) in place of the GPT-2 tokenizer when creating the GPT2LMHeadModel. (Thanks HuggingFace for a well-designed, modular codebase!)
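In case it helps anyone else, here's roughly what that looks like (a sketch under my assumptions; the vocab.txt path and the config values are illustrative only):

```python
from transformers import BertTokenizer, GPT2Config, GPT2LMHeadModel

# Hypothetical vocab.txt holding the small fixed vocabulary plus special tokens.
tokenizer = BertTokenizer(vocab_file="vocab.txt", do_lower_case=False)

# GPT2LMHeadModel is tokenizer-agnostic: it only sees token ids, so the main
# requirement is that the embedding table matches the tokenizer's vocab size.
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    bos_token_id=tokenizer.cls_token_id,
    eos_token_id=tokenizer.sep_token_id,
)
model = GPT2LMHeadModel(config)

# Encode with the WordPiece tokenizer and feed the ids straight to GPT-2.
inputs = tokenizer("tokens from my fixed vocabulary", return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"], labels=inputs["input_ids"])
print(outputs.loss)
```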