Do tokenizers remove whitespace?

Question

Does Lucene's StandardTokenizer remove whitespace and blank lines? I've been reading the API (StandardTokenizer), but it isn't specified there. Maybe tokenizers do this by default; I don't know.

score 1 · Accepted Answer · answered May 23 '12 at 07:43


Yes. Lucene tokenizers extract indexable terms from documents, and those terms do not include whitespace. They do preserve each token's offsets in the original document, though.

This is documented in the docs for StandardTokenizer:

Splits words at punctuation characters, removing punctuation.

(Whitespace is punctuation.)
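To make the point about offsets concrete, here is a minimal sketch in plain Java. It is not Lucene's implementation and does not use the Lucene API; the class `TokenOffsetSketch`, the `Token` type, and the "split on anything that is not a letter or digit" rule are all assumptions made for illustration (the real StandardTokenizer follows more elaborate rules). It only shows the general idea: whitespace and blank lines never become tokens, yet each token still records where it sat in the original text. In Lucene itself, I believe the corresponding information is exposed through `OffsetAttribute`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tokenizer; NOT Lucene's StandardTokenizer.
class TokenOffsetSketch {

    /** A term plus its [start, end) character offsets in the original text. */
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + "[" + start + "," + end + ")";
        }
    }

    /**
     * Assumed rule for this sketch: a token is a maximal run of letters or
     * digits. Everything else (spaces, tabs, newlines, punctuation) is
     * skipped and never emitted.
     */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++; // separator: skip it, but keep counting positions
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            // Offsets refer to the original text, so skipped whitespace
            // still shows up as gaps between consecutive tokens.
            tokens.add(new Token(text.substring(start, i), start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Blank lines and leading spaces produce no tokens of their own.
        for (Token t : tokenize("first line\n\n\n  second, line")) {
            System.out.println(t);
        }
    }
}
```

Running `main` prints one line per term with its offsets; the blank lines appear only as jumps in the offset values, not as tokens.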

Fred Foo


Are blank lines considered whitespace too? – synack May 23 '12 at 07:50
@Kits89: yep, any whitespace is. – Fred Foo May 23 '12 at 08:29
