
Does anyone know how to change the tokenizer used by AllenNLP's coreference resolution model? By default it uses spaCy, and I would like to use a whitespace tokenizer instead, so that the text is split only on whitespace and punctuation stays attached to the words.

This is what I have tried so far but it does not seem to work:

review = """Judging from previous posts this used to be a good place, but not any longer.
        We, there were four of us, arrived at noon - the place was empty - 
        and the staff acted like we were imposing on them and they were very rude. 
        They never brought us complimentary noodles, ignored repeated requests for sugar, and threw our dishes on the table.
        The food was lousy - too sweet or too salty and the portions tiny.
        After all that, they complained to me about the small tip.
        Avoid this place!"""

from allennlp.data.tokenizers.whitespace_tokenizer import WhitespaceTokenizer
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2020.02.27.tar.gz")
predictor._tokenizer = WhitespaceTokenizer()

pred = predictor.predict(document=review)

# expected output: 'Judging', 'from', 'previous', 'posts', 'this', 'used', 'to', 'be', 'a', 'good', 'place,', 'but', 'not', 'any', 'longer.'
print(pred['document'])

I found the AllenNLP documentation on tokenizers, but I don't know whether those tokenizers can be used with other models, such as the coreference resolution model.

rosamariar
  • The tokenizer in the dataset reader needs to be changed. The easiest way to do that is to override it when you load the Predictor. For example, `Predictor.from_path("...", overrides={"validation_dataset_reader.tokenizer": {"type": "whitespace"}})` – petew Oct 22 '21 at 22:13
  • Thank you! I ended up using the spaCy tokenizer for convenience, but it is good to know how to do this for future reference. – rosamariar Feb 10 '22 at 16:18
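
The override suggested in the first comment could be wired up roughly as follows. This is a minimal sketch, not a verified fix: the names `WHITESPACE_OVERRIDE`, `whitespace_tokens` and `load_with_whitespace_tokenizer` are hypothetical, the override key is copied from the comment, and whether the coreference dataset reader actually honours a `tokenizer` entry may depend on the installed AllenNLP version. The pure-Python helper only illustrates which tokens a plain whitespace split produces.

```python
from typing import Any, Dict, List

# Override copied from the comment above; it targets the tokenizer entry of
# the dataset reader configuration stored in the model archive.
WHITESPACE_OVERRIDE: Dict[str, Any] = {
    "validation_dataset_reader.tokenizer": {"type": "whitespace"}
}


def whitespace_tokens(text: str) -> List[str]:
    """Split on runs of whitespace only, so punctuation stays attached."""
    return text.split()


def load_with_whitespace_tokenizer(archive_path: str):
    """Hypothetical helper: load a predictor with the override applied.

    AllenNLP is imported lazily so the rest of this sketch runs without it.
    """
    from allennlp.predictors.predictor import Predictor

    return Predictor.from_path(archive_path, overrides=WHITESPACE_OVERRIDE)


if __name__ == "__main__":
    sample = "this used to be a good place, but not any longer."
    print(whitespace_tokens(sample))
```

Calling `load_with_whitespace_tokenizer` with the archive URL from the question and then `predict(document=review)` would show whether `pred['document']` keeps tokens such as 'place,' intact.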
