Lucene.Net Underscores causing token split

Question

I've scripted an MS SQL Server database's tables, views, and stored procedures into a directory structure that I am then indexing with Lucene.Net. Most of my table, view, and procedure names contain underscores.

I use the StandardAnalyzer. If I query for a table named tIr_InvoiceBtnWtn01, for example, I receive hits for tIr and for InvoiceBtnWtn01, rather than for just tIr_InvoiceBtnWtn01.

I think the issue is the tokenizer is splitting on _ (underscore) since it is punctuation.

Is there a (simple) way to remove underscores from the punctuation list, or is there another analyzer that I should be using for SQL and programming languages?

I'm trying the StopAnalyzer and the WhitespaceAnalyzer now. So far it looks like the WhitespaceAnalyzer may be the way to go. – automatic Dec 01 '10 at 16:05

Xodarap · Accepted Answer · 2010-12-01T16:59:46.720

6

Yes, the StandardAnalyzer splits on underscore. WhitespaceAnalyzer does not. Note that you can use a PerFieldAnalyzerWrapper to assign a different analyzer to each field; you might want to keep some of the standard analyzer's functionality for everything except table/column names.
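
As a rough sketch of the per-field idea, assuming the Lucene.Net 2.9-era API and a hypothetical field called "name" that holds the table, view, and procedure names (the AnalyzerFactory class is also just an illustrative name):

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

static class AnalyzerFactory
{
    // Hypothetical helper: "name" is an assumed field name, not one from the question.
    public static Analyzer Build()
    {
        // StandardAnalyzer remains the default for every field not listed below.
        var wrapper = new PerFieldAnalyzerWrapper(
            new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29));

        // Whitespace-only splitting for the field that holds identifiers,
        // so underscores stay inside a single token.
        wrapper.AddAnalyzer("name", new WhitespaceAnalyzer());
        return wrapper;
    }
}
```

The same analyzer should be handed to both the IndexWriter and the QueryParser so that index-time and query-time tokenization agree.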

WhitespaceAnalyzer only does whitespace splitting, though. It won't lowercase your tokens, for example. So you might want to make your own analyzer that combines WhitespaceTokenizer and LowerCaseFilter, or look into LowerCaseTokenizer.

EDIT: Simple custom analyzer (in C#, but you can translate it to Java pretty easily):

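using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

// Chains together the standard tokenizer, standard filter, and lowercase filter.
// This shows the chaining pattern only; swap StandardTokenizer for
// WhitespaceTokenizer if underscores must stay inside a token.
class MyAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        // The tokenizer reads raw text and emits the initial tokens.
        StandardTokenizer baseTokenizer = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader);
        // Each filter wraps the previous stream and transforms its tokens.
        StandardFilter standardFilter = new StandardFilter(baseTokenizer);
        LowerCaseFilter lcFilter = new LowerCaseFilter(standardFilter);
        return lcFilter;
    }
}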
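
The example above uses StandardTokenizer to show the chaining pattern. A minimal sketch of the WhitespaceTokenizer plus LowerCaseFilter combination suggested earlier could look like this; the class name WhitespaceLowerCaseAnalyzer is hypothetical, and the signatures assume the same Lucene.Net 2.9-era API as the example above:

```csharp
using Lucene.Net.Analysis;

// Hypothetical analyzer: splits on whitespace only, then lowercases each token.
class WhitespaceLowerCaseAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        // WhitespaceTokenizer breaks only at whitespace, so underscores are kept.
        TokenStream result = new WhitespaceTokenizer(reader);
        // LowerCaseFilter lowercases the text of every token it receives.
        result = new LowerCaseFilter(result);
        return result;
    }
}
```

With this chain, tIr_InvoiceBtnWtn01 is emitted as the single token tir_invoicebtnwtn01. Because it splits only on whitespace, adjacent punctuation such as brackets or commas also stays attached to a token.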

edited Dec 01 '10 at 16:59

answered Dec 01 '10 at 16:16

Xodarap



I think I will want lowercase tokens. I'm assuming there is no "non-source compile" way of combining whitespace splitting and lowercasing. What is the difference between using LowerCaseFilter and LowerCaseTokenizer? – automatic Dec 01 '10 at 16:34
@automatic: I have added an example of how to chain filters/tokenizers together. In general, Solr is intended to be the "easy to use" version of Lucene, so yes, if you use only Lucene there is no way of doing this that doesn't require writing code. But that is quasi-intentional. – Xodarap Dec 01 '10 at 17:01
@automatic: Also, LowerCaseTokenizer is LowerCaseFilter + LetterTokenizer; looking at LetterTokenizer, though, it will split at underscore too. So that is not what you want. Sorry. – Xodarap Dec 01 '10 at 17:03
