I am new to Lucene and do not have time to go through the entire documentation. We are using the Lucene highlighter to highlight matches. As far as I know, Lucene itself uses the JFlex engine for it. The current task requires adding support for a new language. According to the requirements, a word like ειναι should match είναι and vice versa. People typing a message usually omit accents, so a word with an accent must match the same word without it. So my question is whether we can specify character transformation rules such as U+038A->U+03B9 somewhere in Lucene or JFlex. Any help will be appreciated.
1 Answer
Not sure about character transformations, but you can do a couple of things:
- Apply an ISOLatin1AccentFilter (in your analyzer) so that accented words are treated as matches for non-accented searches: http://www.dotlucene.net/documentation/api/Lucene.Net.Analysis.ISOLatin1AccentFilter.html
- Use a Lucene fuzzy search: http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/queryparsersyntax.html#Fuzzy Searches
From what I have used, it is not a simple config setting. Solr might have something like that. Lucene is a bare library and usually gives you the flexibility to decide where your "business logic" lies: in searches, analyzers/filters or the index design itself.
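To make the accent-folding idea concrete, here is a minimal sketch using only the JDK, not a Lucene API. It assumes the rule you want is "decompose, drop combining marks, lowercase", which maps U+03AF to U+03B9 (so είναι becomes ειναι) and U+038A to U+03B9. The class name `AccentFolding` and the method `fold` are hypothetical. Note that ISOLatin1AccentFilter, as its name suggests, targets Latin-1 characters, so Greek will probably need something along these lines, for example inside a custom TokenFilter or as explicit rules in a char-mapping filter such as MappingCharFilter, if your Lucene version has it.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper: plain JDK code, not part of Lucene.
final class AccentFolding {

    // Decompose to NFD, drop combining marks (the Greek tonos becomes
    // U+0301 after decomposition), then lowercase.
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "einai" with a tonos on the first iota (U+03AF), as it might be indexed
        String indexed = fold("\u03B5\u03AF\u03BD\u03B1\u03B9");
        // the same word typed without the accent, as it might be searched
        String query = fold("\u03B5\u03B9\u03BD\u03B1\u03B9");
        System.out.println(indexed.equals(query)); // prints "true"
    }
}
```

The same folding has to run on both the indexed text and the query, otherwise the two forms will not meet. Be aware that it also makes an accented query match unaccented text, which may or may not be what you want.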

Bart Czernicki
- Thanks for your reply, but this is not exactly what I was looking for. It seems a better solution would be to define a new token type in the JFlex file and, as soon as a word is classified, apply the transformation rules. – Ihor M. Nov 26 '12 at 20:27
- Are you sure ISOLatin1AccentFilter doesn't help you? If you use it at both indexing and search time, you could find an accented word both by its accented and unaccented variations, which is what you wanted. (Though you would also find an unaccented word by searching an accented word. Is that the issue?) – Gili Nachum Dec 02 '12 at 22:23