I have been trying to get my Solr schema (using Solr 1.3.0) to create terms that are tokenized on whitespace and punctuation. Here are some examples of what I would like to see happen:

terms given -> terms tokenized

foo-bar -> foo,bar
one2three4 -> one2three4
multiple words/and some-punctuation -> multiple,words,and,some,punctuation

I thought that this combination would work:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"/>
  </analyzer>
</fieldType>

The problem is that this also splits on letter-to-number transitions:

one2three4 -> one,2,three,4

I have tried various combinations of WordDelimiterFilterFactory settings, but none have proven useful. Is there a filter or tokenizer that can handle what I require?

claytron

1 Answer

How about:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnNumerics="0" />

That should prevent one2three4 from being split.
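
For reference, a fuller sketch of the field type with that option applied at both index and query time might look like the following. It assumes Solr 1.4 or later, since the comments below establish that `splitOnNumerics` is not available in 1.3.0. The query-time analyzer is an assumption that simply mirrors the index-time one from the question.

```xml
<!-- Sketch only: assumes Solr 1.4 or later, where splitOnNumerics is available -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Split on whitespace first -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Split on punctuation, but not on letter/digit transitions -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnNumerics="0"/>
  </analyzer>
  <!-- Assumption: query-time analysis mirrors index-time analysis -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnNumerics="0"/>
  </analyzer>
</fieldType>
```

Keeping the two analyzers identical means a query for one2three4 is tokenized the same way as the indexed term, so the two can match.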

Raoul Duke
  • That's what I thought, but `generateWordParts` still splits on numerics regardless of that setting. http://i.imgur.com/WpgCl.png – claytron Oct 08 '10 at 14:01
  • Do you have it correctly configured for query time? In your OP I only see an index-time analyzer defined. It works for me with Solr 1.4, so I guess it's either a bug in 1.3 or a configuration issue on your part. – Raoul Duke Oct 08 '10 at 14:09
  • It does the same for query time. I'm starting to think that it is a bug in 1.3.0 also. – claytron Oct 08 '10 at 14:12
  • It's certainly possible. I had to submit a few patches to fix bugs in 1.3 myself :-/ – Raoul Duke Oct 08 '10 at 14:15
  • That was it. Tested on 1.4.1 and it worked as expected. Thanks! – claytron Oct 08 '10 at 15:07
  • Turns out it wasn't a bug. The `splitOnNumerics` option wasn't added until the Solr 1.4 release. If the Solr wiki weren't read-only, I'd add a note that those options were introduced in version 1.4. – claytron Oct 12 '10 at 14:35