When using a HuggingFace pre-trained model, I pass a tokenizer and a token indexer for my TextField in my DatasetReader, and I want to use the same tokenizer and indexer in my Model. What is the appropriate way to do this in AllenNLP (via the config file?)? Here is my current code; I think it is a bad solution. Please give me some suggestions.
In my DatasetReader:

```python
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

# The same special tokens are needed by both the tokenizer and the indexer.
special_tokens = {"cls_token": "[CLS]", "sep_token": "[SEP]", "bos_token": "[BOS]"}
self._tokenizer = PretrainedTransformerTokenizer(
    "microsoft/DialoGPT-small", tokenizer_kwargs=special_tokens
)
self._tokenindexer = {
    "tokens": PretrainedTransformerIndexer(
        "microsoft/DialoGPT-small", tokenizer_kwargs=special_tokens
    )
}
```
In my Model:

```python
from transformers import GPT2Tokenizer

# Re-create the tokenizer by hand and add the same special tokens.
# Note that they are listed in a different order than in the reader (see below).
self.tokenizer = GPT2Tokenizer.from_pretrained("microsoft/DialoGPT-small")
num_added_tokens = self.tokenizer.add_special_tokens({"bos_token": "[BOS]", "sep_token": "[SEP]", "cls_token": "[CLS]"})
self.emb_dim = len(self.tokenizer)  # actually the new vocabulary size
self.embeded_layer = self.encoder.resize_token_embeddings(self.emb_dim)
```
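To avoid keeping two hand-built tokenizers in sync, I was wondering whether the Model should simply reuse AllenNLP's own wrapper with the same kwargs, roughly like the sketch below (I am assuming the wrapper's `.tokenizer` attribute exposes the underlying HuggingFace tokenizer, which is what I see in my AllenNLP version):

```python
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

# Build the tokenizer exactly the way the reader does, so the added special
# tokens should receive the same ids on both sides.
allennlp_tokenizer = PretrainedTransformerTokenizer(
    "microsoft/DialoGPT-small",
    tokenizer_kwargs={"cls_token": "[CLS]", "sep_token": "[SEP]", "bos_token": "[BOS]"},
)
self.tokenizer = allennlp_tokenizer.tokenizer  # the wrapped HuggingFace tokenizer
self.embeded_layer = self.encoder.resize_token_embeddings(len(self.tokenizer))
```

But I am not sure this is the intended pattern either.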
As the code stands, I create two separate tokenizers, one for the DatasetReader and one for the Model, and both share the same base vocabulary and special tokens. However, even when I add the three special tokens in the same order on both sides, they end up with different indices, so I switched the order in the Model's code to make the indices match (stupid but effective). Is there a way to pass the tokenizer or the vocabulary from the DatasetReader to the Model? What is the appropriate way to solve this in AllenNLP?
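For the config-file route mentioned at the top, I imagine something along these lines, with the special-token kwargs shared through a Jsonnet local so the tokenizer and indexer stay in sync (the reader and model type names here are just placeholders for my own registered classes):

```jsonnet
local special_tokens = { cls_token: '[CLS]', sep_token: '[SEP]', bos_token: '[BOS]' };

{
  "dataset_reader": {
    "type": "my_dialogue_reader",  // placeholder for my registered reader
    "tokenizer": {
      "type": "pretrained_transformer",
      "model_name": "microsoft/DialoGPT-small",
      "tokenizer_kwargs": special_tokens,
    },
    "token_indexers": {
      "tokens": {
        "type": "pretrained_transformer",
        "model_name": "microsoft/DialoGPT-small",
        "tokenizer_kwargs": special_tokens,
      },
    },
  },
  "model": {
    "type": "my_dialogue_model",  // placeholder for my registered model
  },
  // ... train_data_path, trainer, etc.
}
```

Even with a config like this, though, I do not see a clean way for the Model to get hold of the tokenizer that the DatasetReader used, so any pointers would be appreciated.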