How to work with Solr phrases

Question

I'm using Solr 4.1.0 and trying to get common-word phrase search to work. That is, when searching for "the cat", I want documents containing this exact phrase to be returned, but not documents that merely contain "the" and "cat" somewhere, or in different fields.

What I have:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory" />
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.CommonGramsFilterFactory" words="lang/stopwords.txt" format="snowball" />
            <filter class="solr.StopFilterFactory" words="lang/stopwords.txt" format="snowball" enablePositionIncrements="true" />
        </analyzer>
    </fieldType>

This should output special gram tokens whenever a "normal" word is adjacent to a stop word from stopwords.txt. In the analysis view this works as expected: "the cat" gets common-grammed to "the_cat cat".
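The token stream described above can be imitated outside Solr with a minimal sketch. It is a simplification, not Solr's implementation: it ignores token positions, uses whitespace splitting as a stand-in for the tokenizer, and `STOP_WORDS` is a hypothetical subset of lang/stopwords.txt chosen for illustration.

```python
# Hypothetical subset of lang/stopwords.txt, for illustration only.
STOP_WORDS = {"the", "in", "is", "a", "its", "my"}


def common_grams(tokens, stop_words):
    """Emit each token, plus a joined bigram for every adjacent pair
    in which at least one token is a stop word."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok in stop_words or nxt in stop_words:
                out.append(tok + "_" + nxt)
    return out


def drop_stop_words(tokens, stop_words):
    """Remove plain stop words; joined bigrams survive because the
    bigram string itself is not in the stop list."""
    return [t for t in tokens if t not in stop_words]


def analyze(text, stop_words=STOP_WORDS):
    # Whitespace split + lower() stands in for tokenizer and lowercase filter.
    tokens = text.lower().split()
    return drop_stop_words(common_grams(tokens, stop_words), stop_words)


print(analyze("the cat"))  # ['the_cat', 'cat']
```

With this subset, "the cat" comes out as the_cat and cat, which matches what the analysis view shows.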

What my client is after: when a stop word in the query appears next to a normal word, only documents containing that exact phrase (a stop-word 2-shingle) should match. The overall default operator is still AND.

For example, I have documents with the following fields:

id: 1; title: my cat in its natural surroundings; desc: the nicest animal in da world is a cat
id: 2; title: the cat is evil; desc: everyone knows that cats are pure evil
id: 3; title: cat solving mysteries; desc: our cat is called Sherlock

The following are examples of what I'd like to achieve. Basically, the users are more or less illiterate with respect to searches, queries, and operators, so the search should interpret the input and "do the right thing". The right thing would be:

input: cat
result: docs 1, 2, 3 (ignoring scoring, for simplicity)

input: cat world
result: doc 1 (AND is the default)

input: cat everyone
result: doc 2 (AND spanning multiple fields)

input: the cat
result: doc 1, because only this field contains the phrase "the cat", which somehow has to magically appear at query time

input: the nice cat
result: [] (no document contains the phrase "the nice", and the algorithm would interpret this as a common-word phrase)

input: the cat world
result: doc 1

input: the pure
result: []

The reasoning behind this is that the client has some specific ideas regarding some (carefully selected) stop words.
So is this a realistic way of doing it? Is it necessary to do some kind of query pre-parsing before passing the query to Solr? Are there other ways to achieve the desired results?
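The pre-parsing idea could look roughly like the following minimal sketch. It only rewrites the raw input into a query string and does not talk to Solr. The function name `preparse`, the `STOP_WORDS` subset, and the rule that a stop word is paired only with the word that follows it are all assumptions made for illustration.

```python
# Hypothetical subset of the client's carefully selected stop words.
STOP_WORDS = {"the", "in", "is", "a"}


def preparse(user_input, stop_words=STOP_WORDS):
    """Rewrite raw user input into AND-ed clauses. A stop word followed
    by another word becomes a quoted two-word phrase (assumed rule)."""
    tokens = user_input.lower().split()
    clauses = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in stop_words:
            if i + 1 < len(tokens):
                # Pair the stop word with the next word as a phrase clause.
                clauses.append('"%s %s"' % (tok, tokens[i + 1]))
                i += 2
            else:
                # Trailing stop word with nothing to pair with: drop it.
                i += 1
            continue
        clauses.append(tok)
        i += 1
    return " AND ".join(clauses)


print(preparse("the nice cat"))   # "the nice" AND cat
print(preparse("the cat world"))  # "the cat" AND world
```

Whether the quoted phrases then match as intended still depends on how the field's analyzers treat stop words inside phrases, so this would only be one half of a solution.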

Thanks, this (and a bug I fixed in my schema.xml) at least yields hits, thus "the cat" finds "the cat". Now I see that I have to change my question because what my client _actually_ wants is something slightly different ;) — chammp, Feb 21 '14 at 11:40
