Solr 3.6.1 splitting word boundaries at a dash

Question

We have a trouble ticket format consisting of digits separated by a dash, i.e., n-nnnnnnn.

The link http://lucidworks.lucidimagination.com/display/solr/Tokenizers (in the sections on Standard Tokenizer and Classic Tokenizer) implies that the following holds both before and after support for Unicode Standard Annex UAX#29 was added:

Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.

Our Solr installation uses only StandardTokenizerFactory, yet this trouble ticket format is being split at the dash in queries. I'm new to Solr/Lucene. I've downloaded the code for 3.6.1, and the comments imply the opposite of the documentation (unless a dashed number is still considered a number). I wasn't able to follow the Lex processing:

Tokens produced are of the following types:
<ALPHANUM>: A sequence of alphabetic and numeric characters
<NUM>: A number
<SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
<IDEOGRAPHIC>: A single CJKV ideographic character
<HIRAGANA>: A single hiragana character

I can confirm that you need to use Classic Analyzer at least when dealing with the pattern /^\d{1,5}-\d$/. I wonder if the initial single digit in your input is the problem? — Mark Leighton Fisher, Nov 21 '12 at 22:38
I'm not actually using a pattern filter. Just the StandardTokenizerFactory. — user1840253, Nov 22 '12 at 19:32
Sorry for the confusion -- I meant the input pattern in your data. I've seen the same behavior with 3.x+ Standard Analyzer with LOINC numbers, which are 1-5 digits followed by a '-' and a single digit. — Mark Leighton Fisher, Nov 25 '12 at 00:47
As per http://stackoverflow.com/questions/13571542/challenge-with-hyphens-dashes-in-solr-lucene it does break on a hyphen. — user1840253, Nov 27 '12 at 16:09
Is this something that can be done using a regex? In that case you can use the PatternTokenizer with a regular expression to define wherever you need to split. — rounak, May 06 '13 at 11:13
Can you create a regex for your logic? In that case you can use Pattern tokenizer. — rounak, May 10 '13 at 09:16
I can't understand if you are trying just to get through understanding StandardTokenizerFactory or if you are trying to do a specific thing. Can you clarify what is your main objective? — Alexander Jardim, Jul 03 '13 at 10:39

score 1 · Answer 1 · answered Jul 18 '14 at 09:39

You need the Regular Expression Pattern Tokenizer. This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or to match patterns that should be extracted from the text as tokens.

See the Javadocs for java.util.regex.Pattern for more information on Java regular expression syntax.
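
To make the two interpretations concrete, here is a minimal sketch in plain java.util.regex. It is not Solr's tokenizer code; the class and method names (PatternTokenizerSketch, splitOn, extract) and the sample ticket number are made up for illustration. It only shows how a delimiter regex and a match regex would each treat an n-nnnnnnn ticket number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: mimics the two modes with java.util.regex,
// it does not call into Solr or Lucene.
public class PatternTokenizerSketch {

    // Delimiter mode: the regex describes what separates tokens.
    static List<String> splitOn(String regex, String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : Pattern.compile(regex).split(text)) {
            if (!piece.isEmpty()) { // skip the empty piece a leading delimiter produces
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Match mode: the regex describes the tokens themselves.
    static List<String> extract(String regex, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "see ticket 1-2345678 for details"; // made-up sample input

        // Splitting on whitespace only leaves the dash inside the token.
        System.out.println(splitOn("\\s+", text));
        // prints [see, ticket, 1-2345678, for, details]

        // Matching the ticket format pulls out just the ticket numbers.
        System.out.println(extract("\\d-\\d{7}", text));
        // prints [1-2345678]
    }
}
```

In the Solr schema, the equivalent would be a solr.PatternTokenizerFactory tokenizer with the same regex in its pattern attribute. As far as I know the group attribute selects between the two modes (-1 for splitting), but check that against the documentation for your version.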
