Stanford NLP: how to disable warnings?

Question

The Stanford NLP pipeline issues lots of warnings, which is particularly disruptive in a production setup:

WARN  Untokenizable: � (U+FFFD, decimal: 65533)

Is there a way to disable them?

score 1 · Answer 1 · answered Jul 29 '17 at 22:42

If you are working directly with a Tokenizer, the answer Denis Kulagin gives is good. If you are operating at the higher level of a StanfordCoreNLP pipeline, you can simply set this property (or pass the equivalent command-line option):

tokenize.options = untokenizable=noneDelete

(to silently delete all unknown characters) or to silently keep them:

tokenize.options = untokenizable=noneKeep
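
If you build the pipeline in code rather than from a properties file, the same setting goes into the java.util.Properties object you pass to the pipeline. Here is a minimal sketch using only the JDK; the class name UntokenizableOptions, the build method, and the annotator list are made up for illustration, so adjust them to your setup:

```java
import java.util.Properties;

// Sketch only: builds the Properties you would hand to a CoreNLP pipeline.
// The class and method names here are hypothetical, chosen for illustration.
final class UntokenizableOptions {

    // keep == true  -> untokenizable=noneKeep   (keep the characters, no warning)
    // keep == false -> untokenizable=noneDelete (drop the characters, no warning)
    static Properties build(boolean keep) {
        Properties props = new Properties();
        // Assumed annotator list; use whatever your pipeline actually needs.
        props.setProperty("annotators", "tokenize,ssplit");
        props.setProperty("tokenize.options",
                keep ? "untokenizable=noneKeep" : "untokenizable=noneDelete");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build(false);
        System.out.println(props.getProperty("tokenize.options"));
        // With CoreNLP on the classpath, the next step would be:
        // StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}
```

On the command line the equivalent should be a flag of the form -tokenize.options untokenizable=noneDelete, since CoreNLP accepts properties as command-line options.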

score 0 · Answer 2 · answered Jul 29 '17 at 09:09

You can do it this way:

// paragraphText is the String you want to tokenize
Reader reader = new StringReader(paragraphText);
DocumentPreprocessor documentPreprocessor = new DocumentPreprocessor(reader, DocumentPreprocessor.DocType.Plain);

// Silently delete untokenizable characters instead of logging a warning
TokenizerFactory<? extends HasWord> factory = PTBTokenizer.factory();
factory.setOptions("untokenizable=noneDelete");
documentPreprocessor.setTokenizerFactory(factory);

From here: https://github.com/stanfordnlp/CoreNLP/issues/103#issuecomment-157793500
