How to customize Lucene.NET to search for words with symbols without case-sensitivity (e.g. "C#" or ".net")?

Question

The StandardAnalyzer does not work. From what I can understand, it strips the symbols, so this becomes a search for c and net.

The WhitespaceAnalyzer would work, but it is case-sensitive.

The general rule is that search should work like Google. Since .net and c# have been around for a while, I'm hoping this is a configuration setting or that there is a workaround.

Per the suggestions below, I tried a custom WhitespaceAnalyzer, but keywords separated by commas with no spaces are not handled correctly, e.g.

java,.net,c#,oracle

will not be matched when searching, which is incorrect.

I came across PatternAnalyzer, which is used to split the tokens, but I can't figure out how to use it in this scenario.

I'm using Lucene.Net 3.0.3 and .NET 4.0

Is your domain source code? Or are those just examples? — phanin, Mar 03 '13 at 02:43
@phani those are just examples — Kumar, Mar 07 '13 at 16:28

groverboy · Answer 1 · 2013-02-25T02:49:26.127

Write your own custom analyzer class similar to SynonymAnalyzer in Lucene.Net – Custom Synonym Analyzer. Your override of TokenStream could solve this by pipelining the stream using WhitespaceTokenizer and LowerCaseFilter.

Remember that your indexer and searcher need to use the same analyzer.
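
The pipeline described above can be sketched roughly as follows. This is a minimal sketch against the Lucene.Net 3.0.3 API, not a tested implementation; the class name LowerCaseWhitespaceAnalyzer is a hypothetical choice made for illustration.

```csharp
using System.IO;
using Lucene.Net.Analysis;

// Hypothetical class name; assumes the Lucene.Net 3.0.3 Analyzer API.
public class LowerCaseWhitespaceAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // WhitespaceTokenizer splits on whitespace only, so the symbols in
        // "C#" and ".net" stay part of the token. LowerCaseFilter then
        // lower-cases each token so that matching ignores case.
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}
```

An instance of this class would be passed both to the IndexWriter and to the QueryParser, so that indexed text and search terms are tokenized the same way. Note that it does not split on commas, so a token like java,.net,c#,oracle is still indexed as one term.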

Update: Handling multiple comma-delimited keywords

If you only need to handle unspaced comma-delimited keywords for searching, not indexing, then you could convert the search expression expr as below.

expr = expr.Replace(',', ' ');

Then pass expr to the QueryParser. If you want to support other delimiters like ';' you could do it like this:

var terms = expr.Split(new char[] { ',', ';'} );
expr = String.Join(" ", terms);

But you also need to check for a phrase expression like "sybase,c#,.net,oracle" (expression includes the quote " chars) which should not be converted (the user is looking for an exact match):

expr = expr.Trim();
if (!(expr.StartsWith("\"") && expr.EndsWith("\"")))
{
    expr = expr.Replace(',', ' ');
}

The expression might include both a phrase and some keywords, like this:

"sybase,c#,.net,oracle" server,c#,.net,sybase

Then you need to parse and translate the search expression to this:

"sybase,c#,.net,oracle" server c# .net sybase

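One way to do that translation is a single pass that tracks whether the current character is inside a quoted phrase. The sketch below is an illustration only: the names SearchExpression and NormalizeDelimiters are hypothetical, and it assumes quotes are balanced and never escaped.

```csharp
using System.Text;

public static class SearchExpression
{
    // Hypothetical helper: replaces ',' and ';' with a space, except inside
    // double-quoted phrases, which are copied through unchanged.
    public static string NormalizeDelimiters(string expr)
    {
        var result = new StringBuilder(expr.Length);
        bool inPhrase = false;

        foreach (char c in expr.Trim())
        {
            if (c == '"')
            {
                // Every quote toggles between "inside" and "outside" a phrase.
                inPhrase = !inPhrase;
            }

            bool isDelimiter = c == ',' || c == ';';
            result.Append(!inPhrase && isDelimiter ? ' ' : c);
        }

        return result.ToString();
    }
}
```

Given the mixed expression above, NormalizeDelimiters leaves the quoted phrase untouched and returns the keywords separated by spaces, ready to pass to the QueryParser.
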
If you also need to handle unspaced comma-delimited keywords for indexing, then you need to parse the text for unspaced comma-delimited keywords and store them in a distinct field, e.g. Keywords (which must be associated with your custom analyzer). Then your search handler needs to convert a search expression like this:

server,c#,.net,sybase

to this:

Keywords:server Keywords:c# Keywords:.net Keywords:sybase

or more simply:

Keywords:(server c# .net sybase)
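
A search handler could build the grouped form with a small helper such as the one below. This is a sketch only: KeywordQuery and ToKeywordsQuery are hypothetical names, the field name Keywords is taken from the example above, and the helper handles plain keyword lists, not quoted phrases. The terms are joined with spaces so that no delimiter ends up inside a term when a whitespace-based analyzer is used.

```csharp
using System;

public static class KeywordQuery
{
    // Hypothetical helper: turns "server,c#,.net,sybase" into a grouped
    // query string on the Keywords field.
    public static string ToKeywordsQuery(string expr)
    {
        // Split on commas, semicolons and spaces, dropping empty entries
        // that come from repeated delimiters such as ", ".
        string[] terms = expr.Split(
            new[] { ',', ';', ' ' },
            StringSplitOptions.RemoveEmptyEntries);

        return "Keywords:(" + string.Join(" ", terms) + ")";
    }
}
```

The resulting string would then be handed to the QueryParser in place of the raw expression.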

Tried that, but then it does not recognize keywords like oracle,.net,c#,sybase etc., which Google handles, and Google is the gold standard for our users, as it were. Will look for more info, perhaps on customizing the tokenizer if possible — Kumar, Feb 24 '13 at 06:34
This is needed for searching AND indexing. Interesting idea about parsing into a separate field, but that would be more work, considering we'd have to handle, say, .net4/.net4.5 etc. rather than a .net* (silent) search. If I can't figure out a way to customize the tokenizer then I will have to do something like this — Kumar, Feb 25 '13 at 04:04

score 4 · Answer 2 · answered Feb 23 '13 at 03:12

Use the WhitespaceAnalyzer and chain it with a LowerCaseFilter.

Use the same chain at search and index time. By converting everything to lower case, you effectively make the search case-insensitive.

According to your problem description, that should work and be simple to implement.

score -2 · Accepted Answer · answered Mar 07 '13 at 16:36

For others who might be looking for an answer as well:

The final answer turned out to be a custom TokenFilter and a custom Analyzer that uses that token filter along with WhitespaceTokenizer, LowerCaseFilter, etc., about 30 lines of code in all. I will create a blog post and post the link here when I do; I have to create a blog first!

Kumar

Hi any chance you could publish it on a Gist? It sounds very useful – mcintyre321 Jul 02 '13 at 13:01