Difference between WhitespaceTokenizerFactory and StandardTokenizerFactory

Question

I am new to Solr. After reading Solr's wiki, I still don't understand the differences between WhitespaceTokenizerFactory and StandardTokenizerFactory. What is the real difference between them?

score 27 · Accepted Answer · answered Jun 25 '12 at 03:13


They differ in how they split the analyzed text into tokens.

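The StandardTokenizer does this based on the following (taken from the Lucene Javadoc for versions before 3.1; since Lucene 3.1 this behavior belongs to ClassicTokenizer, and StandardTokenizer instead follows the Unicode Text Segmentation rules of UAX #29):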

Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
Recognizes email addresses and internet hostnames as one token.

The WhitespaceTokenizer does this based on whitespace characters:

A WhitespaceTokenizer is a tokenizer that divides text at whitespace. Adjacent sequences of non-Whitespace characters form tokens.
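To make the contrast concrete, here is a minimal Python sketch. It does not call Lucene: `whitespace_tokenize` and `punctuation_tokenize` are hypothetical helper names, and the second one is only a rough stand-in for a punctuation-aware tokenizer, not the real StandardTokenizer grammar. Python's notion of whitespace also differs slightly from Java's.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of whitespace only; punctuation stays attached to tokens."""
    return text.split()


def punctuation_tokenize(text):
    """Rough illustration only (NOT Lucene's grammar): keep runs of ASCII
    letters and digits, dropping every other character as a separator."""
    return re.findall(r"[A-Za-z0-9]+", text)


sample = "Contact my-domain2.com, please!"

print(whitespace_tokenize(sample))
# ['Contact', 'my-domain2.com,', 'please!']

print(punctuation_tokenize(sample))
# ['Contact', 'my', 'domain2', 'com', 'please']
```

Note how the whitespace version keeps the trailing comma and exclamation mark inside the tokens, so a search for `please` would not match `please!` unless a later filter strips the punctuation.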

You should pick the tokenizer that best fits your application. In almost all cases you should use the same analyzer/tokenizer for indexing and searching, so that the tokens produced from a query match the tokens stored in the index.
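Why the index-time and query-time tokenizers should agree can be sketched with a toy inverted index. This is an illustration in plain Python under simplifying assumptions, not how Solr is implemented; `build_index`, `search`, and both tokenizer helpers are hypothetical names made up for this example.

```python
import re


def whitespace_tokenize(text):
    # Split on whitespace only.
    return text.split()


def punctuation_tokenize(text):
    # Rough stand-in for a punctuation-aware tokenizer (not Lucene's grammar).
    return re.findall(r"[A-Za-z0-9]+", text)


def build_index(docs, tokenize):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query, tokenize):
    """Return ids of documents matching any token of the query."""
    hits = set()
    for token in tokenize(query):
        hits |= index.get(token, set())
    return hits


docs = {1: "visit my-domain2.com today"}
index = build_index(docs, whitespace_tokenize)

# Mismatched tokenizers: the query is split into my / domain2 / com,
# but the index only contains the single token "my-domain2.com".
print(search(index, "my-domain2.com", punctuation_tokenize))  # set()

# Matching tokenizers: the query token equals the indexed token.
print(search(index, "my-domain2.com", whitespace_tokenize))   # {1}
```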

answered Jun 25 '12 at 03:13

csupnig


Thanks csupnig! When you say "use the same analyzer/tokenizer" for indexing and searching, you mean the analyzer needs to match the type of tokenizer being used, am I right? – trillions Jun 25 '12 at 03:23
2

Yes, both should tokenize the same way so that they produce matching tokens. There are only rare cases where you want different tokenizers in the query parser than the ones you used while indexing. – csupnig Jun 25 '12 at 07:08
5

**StandardTokenizer does not recognize email addresses and internet hostnames as one token**: `@` is among the set of token-splitting punctuation, as are hyphens and "dot/digit combinations", so email addresses are not preserved as single tokens, and an input like `my-domain2.com` is simply split into `my`, `domain2`, and `com`. – EricLavault Nov 23 '14 at 14:04
1

The following points are wrong for StandardTokenizer: 1) Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split. 2) Recognizes email addresses and internet hostnames as one token. – Vishnu Sharma Jul 02 '18 at 06:02
