The Stanford NLP pipeline emits many warnings like the one below, which is particularly disruptive in a production setup:

WARN  Untokenizable: � (U+FFFD, decimal: 65533)

Is there a way to disable them?

Denis Kulagin

2 Answers

If you are working directly with a Tokenizer, the answer Denis Kulagin gives is good. If you are operating at the higher level of a StanfordCoreNLP pipeline, you can simply set the following property (or the equivalent command-line option):

tokenize.options = untokenizable=noneDelete

(to silently delete all unknown characters) or to silently keep them:

tokenize.options = untokenizable=noneKeep
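
As a rough illustration of setting this property from code, here is a minimal sketch that builds the java.util.Properties object a pipeline would receive. The class name QuietTokenizerProps, the helper build, and the annotator list "tokenize,ssplit" are assumptions made for this example, not part of the original answer. The pipeline construction itself is left as a comment so that the snippet runs without CoreNLP on the classpath.

```java
import java.util.Properties;

public class QuietTokenizerProps {
    // Hypothetical helper: builds pipeline properties for the given
    // untokenizable policy, e.g. "noneDelete" or "noneKeep".
    public static Properties build(String policy) {
        Properties props = new Properties();
        // Assumed annotator list for illustration; use whatever your pipeline needs.
        props.setProperty("annotators", "tokenize,ssplit");
        // The property from the answer above.
        props.setProperty("tokenize.options", "untokenizable=" + policy);
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("noneDelete");
        // With CoreNLP on the classpath, these properties would be passed to the pipeline:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        System.out.println(props.getProperty("tokenize.options"));
    }
}
```

Passing "noneKeep" instead of "noneDelete" to build gives the variant that silently keeps the unknown characters.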
Christopher Manning

When working with the tokenizer directly, you can do it this way:

// paragraphText is the input String to be tokenized
Reader reader = new StringReader(paragraphText);
DocumentPreprocessor documentPreprocessor = new DocumentPreprocessor(reader, DocumentPreprocessor.DocType.Plain);

// Silently delete untokenizable characters instead of logging a warning for each
TokenizerFactory<? extends HasWord> factory = PTBTokenizer.factory();
factory.setOptions("untokenizable=noneDelete");
documentPreprocessor.setTokenizerFactory(factory);

Source: https://github.com/stanfordnlp/CoreNLP/issues/103#issuecomment-157793500

Denis Kulagin