Searching inside Lucene for raw-text without using any analyzer

Question

I have a Lucene index built using an analyzer. I am using the index to search for content and in most of the cases I would need an analyzer.

Now, in a few cases, suppose I want to look for text in a field without the effect of the analyzer. Is it still possible to search a field of the same index? How should I go about constructing the query?

If I use a WildcardQuery, it will still look inside the analyzed text, whereas I want to search the raw text.

What do you mean by "without the analyzer"? Can you give an example? You can search for phrases using "", but if you want to find exact text, you might need to index using a WhitespaceAnalyzer. - Rob Audenaerde, Jun 11 '13 at 05:41
I want to search using the analyzed text, but at times I want to do a simple text search on the same text (without the effect of the analyzer). For example, the analyzer indexes Boxing as Box. So when I search for the text "Boxing", the results include documents that contain just Box as well as documents that contain Boxing. - London guy, Jun 11 '13 at 05:54

femtoRgon · Answer 1 · 2013-06-11T18:50:39.347

The case you describe in the comments indicates that you are using an Analyzer with a stemmer, possibly EnglishAnalyzer (which incorporates a Porter stemmer). Rather than going without an Analyzer at all, which would result in an untokenized field and make searching difficult, I would look into Analyzers that do not stem.

StandardAnalyzer - A good standard, implements Unicode standard text segmentation, largely non-language-specific.
SimpleAnalyzer - A very simple analyzer, as the name indicates. Tokenizes into groups of contiguous letters and lowercases them. Warning: numbers are lost by this tokenizer!
WhitespaceAnalyzer - Also very simple, it simply creates tokens around whitespace. Doesn't lowercase or otherwise normalize tokens. This is often too simple to be useful.
ClassicAnalyzer - Implements the logic of what used to be the StandardAnalyzer in 3.X. Still a useful analyzer.

If you really do want to go without an Analyzer, indexing the value in a StringField bypasses any tokenization or analysis: the entire value is indexed as a single term.
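
To make the trade-off concrete, here is a minimal sketch of what an untokenized StringField looks like at search time. It assumes Lucene 4.x (lucene-core and lucene-analyzers-common on the classpath); the class name, the field name "raw", and the sample text are made up for illustration.

```java
import java.io.IOException;

import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Hypothetical demo class; assumes the Lucene 4.x API (Version-taking constructors).
public class StringFieldSketch {

    /** Indexes one value in an untokenized field and counts hits for an exact term. */
    static int countHits(String value, String termText) throws IOException {
        Directory dir = new RAMDirectory();
        // The writer config needs an analyzer, but StringField values are never analyzed.
        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_40, new WhitespaceAnalyzer(Version.LUCENE_40));
        IndexWriter writer = new IndexWriter(dir, config);
        Document doc = new Document();
        doc.add(new StringField("raw", value, Field.Store.NO)); // whole value = one term
        writer.addDocument(doc);
        writer.close();

        DirectoryReader reader = DirectoryReader.open(dir);
        try {
            IndexSearcher searcher = new IndexSearcher(reader);
            // TermQuery does no analysis either: it matches the term text verbatim.
            return searcher.search(new TermQuery(new Term("raw", termText)), 1).totalHits;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countHits("Boxing Day sale", "Boxing Day sale"));
        System.out.println(countHits("Boxing Day sale", "Boxing"));
    }
}
```

Under these assumptions the first query should find the document and the second should not, because "Boxing" on its own is not a term in the index; only the full string is. That is the reason an untokenized field is awkward for searching inside longer text.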

score 1 · Answer 2 · answered Jun 11 '13 at 13:00

I would suggest building an index with two fields: one that contains the document text analyzed with your default analyzer, and one that contains the same text analyzed with a WhitespaceAnalyzer.

You can create this using http://lucene.apache.org/core/4_0_0-BETA/analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html

If you need exact search, search in the field indexed with the WhitespaceAnalyzer; otherwise, use the field whose text is handled by your default Analyzer.
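
A rough sketch of this two-field setup follows, assuming Lucene 4.x and EnglishAnalyzer as the default (stemming) analyzer. The class name, the field names "content" and "content_exact", and the sample documents are illustrative assumptions, not part of the original setup.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Hypothetical demo class; assumes the Lucene 4.x API (Version-taking constructors).
public class DualFieldSketch {
    static final Version V = Version.LUCENE_40;

    /** Indexes every text twice: stemmed in "content", whitespace-only in "content_exact". */
    static Directory buildIndex(String... texts) throws IOException {
        Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
        perField.put("content_exact", new WhitespaceAnalyzer(V));
        // Fields not listed in the map fall back to the default analyzer.
        Analyzer analyzer = new PerFieldAnalyzerWrapper(new EnglishAnalyzer(V), perField);

        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(V, analyzer));
        for (String text : texts) {
            Document doc = new Document();
            doc.add(new TextField("content", text, Field.Store.NO));
            doc.add(new TextField("content_exact", text, Field.Store.NO));
            writer.addDocument(doc);
        }
        writer.close();
        return dir;
    }

    /** Counts documents whose given field contains exactly this indexed term. */
    static int countHits(Directory dir, String field, String term) throws IOException {
        DirectoryReader reader = DirectoryReader.open(dir);
        try {
            IndexSearcher searcher = new IndexSearcher(reader);
            return searcher.search(new TermQuery(new Term(field, term)), 10).totalHits;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Directory dir = buildIndex("Boxing gloves", "Box of gloves");
        // Stemmed field: "Boxing" and "Box" are both indexed as "box".
        System.out.println(countHits(dir, "content", "box"));
        // Exact field: only the document containing the literal token "Boxing".
        System.out.println(countHits(dir, "content_exact", "Boxing"));
    }
}
```

TermQuery does not analyze its input, so the sketch passes terms in their indexed form ("box" for the stemmed field). If you build queries with a query parser instead, passing it the same PerFieldAnalyzerWrapper should apply the matching analysis to each field. The cost of this design is that the text is indexed twice, which increases index size.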
