I'm using Solr 4.1.0 and I'm trying to get common-word phrase search to work. That is, when searching for "the cat" I want only documents containing this exact phrase to be returned, not documents that merely contain "the" and "cat" somewhere, or in different fields.
What I have:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.CommonGramsFilterFactory" words="lang/stopwords.txt" format="snowball" />
    <filter class="solr.StopFilterFactory" words="lang/stopwords.txt" format="snowball" enablePositionIncrements="true" />
  </analyzer>
</fieldType>
This should output special gram tokens whenever a "normal" word is adjacent to a stop word from stopwords.txt. In the analysis view this works as expected: "the cat" gets common-grammed to "the_cat cat".
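To make the expected token stream concrete, here is a rough simulation of that filter chain in Python. It is only a sketch of the behaviour described above, not Lucene's actual implementation: `STOPWORDS` is a made-up stand-in for lang/stopwords.txt, whitespace splitting stands in for the standard tokenizer, and token positions are ignored.

```python
# Simplified simulation of CommonGramsFilter followed by StopFilter.
# STOPWORDS is a hypothetical stand-in for lang/stopwords.txt.
STOPWORDS = {"the", "a", "in", "is", "its", "my", "our"}


def common_grams(tokens, common):
    """Emit every unigram, plus a bigram 'a_b' whenever a or b is a common word."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok in common or nxt in common:
                out.append(tok + "_" + nxt)
    return out


def remove_stopwords(tokens, common):
    """Drop plain stop words; bigrams such as 'the_cat' are kept."""
    return [t for t in tokens if t not in common]


def analyze(text, common=STOPWORDS):
    # Lowercasing + whitespace split stands in for StandardTokenizer + LowerCaseFilter.
    tokens = text.lower().split()
    return remove_stopwords(common_grams(tokens, common), common)


print(analyze("the cat"))          # ['the_cat', 'cat']
print(analyze("the cat is evil"))  # ['the_cat', 'cat', 'cat_is', 'is_evil', 'evil']
```

Note that the plain token "cat" survives next to "the_cat", so a query analyzed the same way can still match on "cat" alone.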
What my client is after is this: when stop words in the query are used in conjunction with normal words, only documents containing this exact phrase (a stop-word 2-shingle) should match. The overall default operator is still AND.
For example, I have documents with the following fields:
- id: 1; title: my cat in its natural surroundings; desc: the nicest animal in da world is a cat
- id: 2; title: the cat is evil; desc: everyone knows that cats are pure evil
- id: 3; title: cat solving mysteries; desc: our cat is called Sherlock
The following are examples of what I'd like to achieve. Basically, the users are more or less illiterate with respect to searches, queries and operators, so the search should interpret the input and "do the right thing". The right thing would be:
- input: cat
  result: docs 1, 2, 3 (ignoring scoring for simplicity)
- input: cat world
  result: doc 1
  AND is the default
- input: cat everyone
  result: doc 2
  AND spanning multiple fields
- input: the cat
  result: doc 1
  because only this field contains the phrase "the cat", which somehow has to magically appear at query time
- input: the nice cat
  result: []
  because no document contains the phrase "the nice", and the algorithm would interpret this as a common-word phrase
- input: the cat world
  result: doc 1
- input: the pure
  result: []
The reasoning behind this is that the client has some specific ideas regarding some (carefully selected) stop words.
So is this a realistic way of doing it? Is it necessary to do some kind of query pre-parsing before passing the query to Solr? Are there other ways to achieve the desired results?
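For illustration, the kind of pre-parsing I have in mind could look roughly like the sketch below. It is a minimal sketch under assumptions: `preparse` and `STOPWORDS` are hypothetical names, the stop-word set is a placeholder for the client's selected words, and nothing here escapes query-parser special characters. Each stop word is glued to the word that follows it and the pair is quoted, so the query parser would see it as a phrase.

```python
# Hypothetical pre-parser: quote each "stop word + next word" pair as a phrase
# before the string is handed to the search engine.
STOPWORDS = {"the", "a", "in", "is"}  # placeholder for the selected stop words


def preparse(user_input, stopwords=STOPWORDS):
    words = user_input.lower().split()
    clauses = []
    i = 0
    while i < len(words):
        if words[i] in stopwords and i + 1 < len(words):
            # Stop word followed by another word: emit a quoted two-word phrase.
            clauses.append('"%s %s"' % (words[i], words[i + 1]))
            i += 2
        elif words[i] in stopwords:
            # Trailing stop word with nothing to attach to: dropped in this sketch.
            i += 1
        else:
            clauses.append(words[i])
            i += 1
    return " ".join(clauses)


print(preparse("cat world"))      # cat world
print(preparse("the cat world"))  # "the cat" world
print(preparse("the nice cat"))   # "the nice" cat
print(preparse("the pure"))       # "the pure"
```

With AND as the default operator, each quoted pair would then have to match as a phrase alongside the remaining terms. How two consecutive stop words or a trailing stop word should be handled is an open design choice; the sketch simply pairs left to right and drops a trailing stop word.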