We have a trouble ticket format consisting of digits separated by a dash, e.g. n-nnnnnnn.
The Tokenizers page at http://lucidworks.lucidimagination.com/display/solr/Tokenizers (in the sections on Standard Tokenizer and Classic Tokenizer) implies that, both before and after support for Unicode Standard Annex UAX#29 was added:
Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
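To make the difference concrete, here is a rough sketch of the behaviour that sentence describes, next to the splitting we are seeing in queries. This is not Lucene code and does not use the Lucene API: the class `TicketTokenSketch`, both methods, and the sample ticket value `1-2345678` are made up purely for illustration, and the rules are simplified to whitespace and hyphens only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not how Lucene implements its tokenizers.
class TicketTokenSketch {

    // Behaviour as the documentation sentence describes it:
    // split at hyphens unless the word contains a digit.
    static List<String> asDocumented(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            boolean hasDigit = word.chars().anyMatch(Character::isDigit);
            if (hasDigit) {
                tokens.add(word); // keep digits and hyphen(s) together
            } else {
                for (String part : word.split("-")) {
                    if (!part.isEmpty()) {
                        tokens.add(part);
                    }
                }
            }
        }
        return tokens;
    }

    // Behaviour we observe in queries: every hyphen splits the token.
    static List<String> asObserved(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("[\\s-]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String query = "ticket 1-2345678 follow-up";
        System.out.println(asDocumented(query)); // [ticket, 1-2345678, follow, up]
        System.out.println(asObserved(query));   // [ticket, 1, 2345678, follow, up]
    }
}
```

The first output is what I expected from the documentation; the second matches what happens to our ticket numbers.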
Our Solr installation uses only StandardTokenizerFactory, yet this trouble ticket format is being split at the dash in queries. I'm new to Solr/Lucene. I've downloaded the source for 3.6.1, and its comments imply the opposite of the documentation (unless a dashed number is still considered a number). I wasn't able to follow the lexer processing:
- Tokens produced are of the following types:
- <ALPHANUM>: A sequence of alphabetic and numeric characters
- <NUM>: A number
- <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
- <IDEOGRAPHIC>: A single CJKV ideographic character
- <HIRAGANA>: A single hiragana character