The Stanford NLP pipeline issues lots of warnings, which is particularly disturbing in a production setup:
WARN Untokenizable: � (U+FFFD, decimal: 65533)
Is there a way to disable them?
If you are working directly with a Tokenizer, the answer Denis Kulagin gives is good. If you are operating at the higher level of a StanfordCoreNLP pipeline, you can simply set the following property (or the equivalent command-line option):
tokenize.options = untokenizable=noneDelete
This silently deletes all untokenizable characters. To silently keep them instead, use:
tokenize.options = untokenizable=noneKeep
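If you configure the pipeline from Java rather than from a properties file, the same option can be set on a java.util.Properties object. The following is only a minimal sketch: the class name TokenizeOptionsExample, the method buildProps, and the annotator list "tokenize,ssplit" are illustrative assumptions, not part of the original answer. The pipeline construction is left as a comment so that the snippet compiles without CoreNLP on the classpath.

```java
import java.util.Properties;

// Hypothetical helper class, for illustration only.
class TokenizeOptionsExample {

    // Builds pipeline properties that silently drop untokenizable characters.
    static Properties buildProps() {
        Properties props = new Properties();
        // Assumed annotator list; use whatever your pipeline actually needs.
        props.setProperty("annotators", "tokenize,ssplit");
        // Same setting as the property line above; use noneKeep to keep the characters.
        props.setProperty("tokenize.options", "untokenizable=noneDelete");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProps();
        // With CoreNLP on the classpath, pass the properties to the pipeline:
        // StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        System.out.println(props.getProperty("tokenize.options"));
    }
}
```

The "none" prefix controls logging (no warnings), and the "Delete" or "Keep" suffix controls whether the characters are removed from the token stream or retained.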
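You can do it this way:
// Wrap the input text in a reader and treat it as plain text.
Reader reader = new StringReader(paragraphText);
DocumentPreprocessor documentPreprocessor = new DocumentPreprocessor(reader, DocumentPreprocessor.DocType.Plain);
// Create a PTB tokenizer factory that silently deletes untokenizable characters.
TokenizerFactory<? extends HasWord> factory = PTBTokenizer.factory();
factory.setOptions("untokenizable=noneDelete");
// Use this factory for all tokenization done by the preprocessor.
documentPreprocessor.setTokenizerFactory(factory);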
From here: https://github.com/stanfordnlp/CoreNLP/issues/103#issuecomment-157793500