I want to do some basic Hebrew stemming.

All the examples of custom analyzers I could find merge other analyzers and filters but never do any string-level processing themselves.

For example, what do I have to do if I want to create an analyzer that, for each term in the stream it gets, emits either one or two terms by the following rules: if the incoming term begins with anything other than "a", it should be passed through as is. If the incoming term begins with "a", two terms should be emitted: the original term, and a second one without the leading "a" and with a lower boost.

So that if the document has "help away" it will return "help", "away", and "way^0.8".

What methods of the analyzer should I override to do this? (A pointer to a similar example would be very helpful.)

Thanks

Chet
epeleg

1 Answer

Here's one example: http://www.java2s.com/Open-Source/Java-Document/Search-Engine/lucene/org/apache/lucene/wordnet/SynonymTokenFilter.java.htm

From a brief scan of the code, it seems to emit additional tokens (synonyms) at the same position. It does that by overriding incrementToken(), which is what you'll have to do for your problem: maintain a stack of pending tokens and return them one by one.

If this example doesn't work, try to find one that explains how to implement a synonym filter with Lucene; it's almost identical to your problem. The Lucene in Action book has a good example of this, and the code is available here: http://www.manning.com/hatcher3/LIAsourcecode.zip, class SynonymFilter.
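
The stack-of-pending-tokens idea can be sketched in plain Java without any Lucene dependency. This is only a rough model of the control flow, not Lucene code: the class `LeadingAFilterSketch`, its `Token` type, and the `next()` and `analyze()` methods are all hypothetical names, and `next()` merely stands in for the role incrementToken() plays in a real filter. It also assumes a bare "a" should not produce an empty second term, and it does not model the lower boost at all.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Dependency-free model of a "one term in, one or two terms out" filter.
// Every name in this class is hypothetical; none of it is Lucene API.
final class LeadingAFilterSketch {

    // A term plus its position increment:
    // 1 = starts a new position, 0 = stacked on the previous term's position.
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    private final Iterator<String> input;
    // Extra tokens waiting to be emitted before more input is consumed.
    private final Deque<Token> pending = new ArrayDeque<>();

    LeadingAFilterSketch(Iterator<String> input) {
        this.input = input;
    }

    // Plays the role of incrementToken(): returns the next token,
    // or null once both the pending stack and the input are exhausted.
    Token next() {
        if (!pending.isEmpty()) {
            return pending.pop();
        }
        if (!input.hasNext()) {
            return null;
        }
        String term = input.next();
        // Assumption: a term that is only "a" is left alone,
        // so that no empty term is ever emitted.
        if (term.length() > 1 && term.charAt(0) == 'a') {
            pending.push(new Token(term.substring(1), 0));
        }
        return new Token(term, 1);
    }

    // Convenience helper: drains the filter and renders each token as text.
    static List<String> analyze(List<String> terms) {
        LeadingAFilterSketch filter = new LeadingAFilterSketch(terms.iterator());
        List<String> out = new ArrayList<>();
        for (Token t = filter.next(); t != null; t = filter.next()) {
            out.add(t.toString());
        }
        return out;
    }
}
```

Calling `LeadingAFilterSketch.analyze(Arrays.asList("help", "away"))` yields the original terms plus "way" stacked at the same position as "away". In an actual Lucene filter the same bookkeeping would live inside the overridden incrementToken(), with the term text and position increment set through the token stream's attributes rather than a custom `Token` class.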

milan
  • this looks very promising. I will probably take a day or two to make sure I get this right before I close this Q, but it does look like a really good basis. (I will just have to fill the stack with my own required values.) Any idea if I can cause them to be less significant in any way? – epeleg Jan 16 '12 at 11:59
  • yeah, play with the code. if you have the book, lucene in action, it's explained there in details, with code samples (you can get the code samples from the book btw, just google). – milan Jan 16 '12 at 12:03
  • to make them less significant, if they're going into the same field, i guess you'd have to use the payload mechanism, and implement your own scorer. – milan Jan 16 '12 at 12:13
  • `"i guess you'd have to use the payload mechanism, and implement your own scorer"` can you provide some more info on this ? – epeleg Jan 16 '12 at 12:45
  • why don't you post this as another question, it might be useful for others – milan Jan 16 '12 at 12:54
  • +1 and done. here: http://stackoverflow.com/questions/8880396/boosting-lucene-terms-when-building-the-index – epeleg Jan 16 '12 at 13:02