As an example, I have a text field that might contain the following string:

"d7199^^==^^81^^==^^A sentence or two!!"

I want to tokenize this data but have each token contain the first part of the string. So, I'd like the tokens to look like this for the example above:

"d7199^^==^^81^^==^^a"

"d7199^^==^^81^^==^^sentence"

"d7199^^==^^81^^==^^or"

"d7199^^==^^81^^==^^two"

How would I go about doing this?

Jason Palmer

1 Answer

You can implement your own custom Tokenizer and add it to the Solr classpath. Then reference it in your Solr schema.xml and solrconfig.xml.
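
To make this more concrete, here is a minimal plain-Java sketch of just the splitting logic such a tokenizer would need. It is not wired into Lucene or Solr: the class name `PrefixedTokenSketch` and the method `tokenize` are hypothetical, and it assumes the prefix ends at the last occurrence of `^^==^^` and that tokens should be lowercased and split on anything that is not a letter or digit. In a real Solr setup, this logic would have to sit inside a `Tokenizer` subclass (emitting one term per `incrementToken()` call) that is exposed through a `TokenizerFactory`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Plain-Java sketch of the splitting logic only; not a Lucene Tokenizer. */
final class PrefixedTokenSketch {
    // Separator taken from the question. Assumption: the prefix ends at its LAST occurrence.
    static final String SEPARATOR = "^^==^^";

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int cut = input.lastIndexOf(SEPARATOR);
        // Without a separator, treat the whole input as text with an empty prefix.
        String prefix = cut < 0 ? "" : input.substring(0, cut + SEPARATOR.length());
        String text = cut < 0 ? input : input.substring(cut + SEPARATOR.length());
        // Split on runs of characters that are neither letters nor digits.
        for (String word : text.split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                tokens.add(prefix + word.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints one prefixed token per line for the example string from the question.
        for (String token : tokenize("d7199^^==^^81^^==^^A sentence or two!!")) {
            System.out.println(token);
        }
    }
}
```

Keeping the prefix extraction separate from the word splitting means the separator rule can be changed later without touching the rest of the logic.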

Karl-Bjørnar Øie
  • After a bit of research, this was my most logical conclusion as well. If you can give me some good examples, the bounty'll be yers! – Jason Palmer Sep 01 '11 at 13:19
  • How do you know when the first part of the input reaches its end? – jpountz Sep 02 '11 at 15:50
  • I could either define a different separator or we could just have it end at the last token ^^==^^. Or something else if you have a better suggestion. 3 more days until the bounty expires :( – Jason Palmer Sep 05 '11 at 16:08
  • It seems obvious that one must subclass a Tokenizer, but HOW? – gyozo kudor May 25 '12 at 12:22
  • Start by extending TokenizerFactory (http://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/TokenizerFactory.html) or TokenFilterFactory (http://lucene.apache.org/solr/api-3_6_1/org/apache/solr/analysis/TokenFilterFactory.html). See also http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters – Karl-Bjørnar Øie Sep 03 '12 at 15:50