
When using a HuggingFace pre-trained model, I pass a tokenizer and a token indexer for my TextField in my DatasetReader, and I want to use the same tokenizer and indexer in my model. What is the appropriate way to do this in AllenNLP (through the config file?)? Here is my code; I think it is a bad solution. Please give me some suggestions.

In my DatasetReader:

    self._tokenizer = PretrainedTransformerTokenizer(
        "microsoft/DialoGPT-small",
        tokenizer_kwargs={"cls_token": "[CLS]", "sep_token": "[SEP]", "bos_token": "[BOS]"},
    )
    self._token_indexers = {
        "tokens": PretrainedTransformerIndexer(
            "microsoft/DialoGPT-small",
            tokenizer_kwargs={"cls_token": "[CLS]", "sep_token": "[SEP]", "bos_token": "[BOS]"},
        )
    }
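
At minimum, the duplicated arguments could be factored into shared constants so the tokenizer and the indexer can never drift apart. A small sketch (the constant names are mine):

    from allennlp.data.token_indexers import PretrainedTransformerIndexer
    from allennlp.data.tokenizers import PretrainedTransformerTokenizer

    MODEL_NAME = "microsoft/DialoGPT-small"
    TOKENIZER_KWARGS = {"cls_token": "[CLS]", "sep_token": "[SEP]", "bos_token": "[BOS]"}

    # Both objects are built from the same constants, so the special
    # tokens are added identically in the tokenizer and the indexer.
    self._tokenizer = PretrainedTransformerTokenizer(MODEL_NAME, tokenizer_kwargs=TOKENIZER_KWARGS)
    self._token_indexers = {
        "tokens": PretrainedTransformerIndexer(MODEL_NAME, tokenizer_kwargs=TOKENIZER_KWARGS)
    }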

In my Model:


    self.tokenizer = GPT2Tokenizer.from_pretrained("microsoft/DialoGPT-small")
    num_added_tokens = self.tokenizer.add_special_tokens(
        {"bos_token": "[BOS]", "sep_token": "[SEP]", "cls_token": "[CLS]"}
    )
    # len(self.tokenizer) is the new vocabulary size (not an embedding dimension).
    self.vocab_size = len(self.tokenizer)
    self.embedded_layer = self.encoder.resize_token_embeddings(self.vocab_size)
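
One workaround I can think of: instead of building a second GPT2Tokenizer and re-adding the special tokens (where the order of addition determines the ids), the model could reuse the reader's underlying HuggingFace tokenizer directly. A sketch, assuming the reader instance (here called `reader`) is in scope when the model is constructed:

    # PretrainedTransformerTokenizer keeps the wrapped HuggingFace
    # tokenizer in its `.tokenizer` attribute, so reusing it guarantees
    # that the model sees exactly the token ids the reader produced.
    hf_tokenizer = reader._tokenizer.tokenizer
    self.embedded_layer = self.encoder.resize_token_embeddings(len(hf_tokenizer))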

To spell the problem out: I have created two tokenizers, one for the DatasetReader and one for the Model, and both share the same base vocabulary and special tokens. But when I added the three special tokens in the same order in both places, they ended up with different indices, so I switched the order in the Model's code to get matching indices (stupid, but effective). Is there a way to pass the tokenizer or the vocab from the DatasetReader to the Model? What is the appropriate way to solve this problem in AllenNLP?
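
The closest thing to an idiomatic solution I have found so far is to give the model's `PretrainedTransformerEmbedder` the same `model_name` and `tokenizer_kwargs` as the reader; as far as I can tell from the source, the embedder then resizes the transformer's token embeddings to match the extended tokenizer by itself. In a config file a single jsonnet `local` could hold the shared values; in Python it would look roughly like this (a sketch, not a confirmed pattern):

    from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
    from allennlp.modules.token_embedders import PretrainedTransformerEmbedder

    # Same constants as in the reader (e.g. imported from a shared
    # module, or defined once as a jsonnet `local` in the config file).
    MODEL_NAME = "microsoft/DialoGPT-small"
    TOKENIZER_KWARGS = {"cls_token": "[CLS]", "sep_token": "[SEP]", "bos_token": "[BOS]"}

    embedder = BasicTextFieldEmbedder(
        {
            "tokens": PretrainedTransformerEmbedder(
                MODEL_NAME,
                tokenizer_kwargs=TOKENIZER_KWARGS,  # embedder resizes its embeddings to match
            )
        }
    )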

  • Do you actually need the tokenizer/token indexer in your model, or do you just need access to the tokenizer's vocabulary? Because you can access the tokenizer's vocab in `self.vocab` from the model. – petew Jul 09 '21 at 16:35
  • I have read the guidance on AllenNLP's vocabulary, and I think I need to use the pretrained model's vocab as my vocab, not one built from the instances. @petew – Jingyang Li Jul 12 '21 at 03:37
  • The vocab that's passed into the model's `__init__` method will contain the word pieces from the huggingface tokenizer. It reads out the huggingface tokenizer and puts the tokens there. – Dirk Groeneveld Jul 23 '21 at 21:12
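
Following up on the last comment: if the word pieces really are copied into `self.vocab`, then the special-token ids could be looked up from inside the model without a second tokenizer. A sketch, assuming the indexer's default `"tags"` namespace:

    # PretrainedTransformerIndexer copies the HuggingFace vocabulary into
    # the AllenNLP Vocabulary under its namespace ("tags" by default).
    cls_id = self.vocab.get_token_index("[CLS]", namespace="tags")
    bos_id = self.vocab.get_token_index("[BOS]", namespace="tags")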

0 Answers