
Dear Stack Overflow community,

I have the following question: Is it possible to disable fast tokenization in an allennlp model?

I am trying to use the following model in my NLP pipeline, but I can't use fast tokenization because it causes issues when run from multiple threads. My initial thought was to simply replace the tokenizer, but this seemed to have no effect. I would greatly appreciate your help with this issue. Please tell me the obvious thing that I am missing.

from allennlp.predictors.sentence_tagger import SentenceTaggerPredictor

predictor = SentenceTaggerPredictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2021.03.10.tar.gz"
)

# attempt (not working as expected):

from transformers import RobertaTokenizer
predictor._dataset_reader._token_indexers["tokens"]._tokenizer = RobertaTokenizer.from_pretrained("roberta-base", use_fast=False)

# This still causes problems when used from multiple threads
# and still ends up calling transformers/tokenization_utils_fast.py.
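For completeness, the other direction I have been considering is overriding the archive's configuration so that the dataset reader builds a slow tokenizer from the start, instead of patching the predictor afterwards. This is only a sketch: the key path "dataset_reader.token_indexers.tokens.tokenizer_kwargs" is my guess at how this archive's config is laid out, and as far as I can tell tokenizer_kwargs is forwarded to AutoTokenizer.from_pretrained. Would that be the intended way to disable fast tokenization here?

import json
from allennlp.predictors.predictor import Predictor

# Sketch: override the archived config so the token indexer is constructed
# with use_fast=False. The dotted key path is an assumption about how the
# coref archive's config is structured.
overrides = json.dumps(
    {"dataset_reader.token_indexers.tokens.tokenizer_kwargs": {"use_fast": False}}
)

predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2021.03.10.tar.gz",
    overrides=overrides,
)

(If from_path in the installed allennlp version does not accept overrides, I assume the same overrides string could be passed to allennlp.models.archival.load_archive and the predictor built via Predictor.from_archive.)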

Simon P.
