I'm using Solr 4.1.0 and I'm trying to get common-word phrase search to work. That is, when searching for "the cat" I want only documents containing this exact phrase to be shown, not documents that merely contain "the" and "cat" somewhere or in different fields.

What I have:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory" />
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.CommonGramsFilterFactory" words="lang/stopwords.txt" format="snowball" />
            <filter class="solr.StopFilterFactory" words="lang/stopwords.txt" format="snowball" enablePositionIncrements="true" />
        </analyzer>
    </fieldType>

This should output special gram tokens whenever a "normal" word is adjacent to a stopword from stopwords.txt. In the Analysis view this works as expected: "the cat" gets common-grammed to "the_cat cat".
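
For comparison, here is a rough sketch of the same field type split into separate index and query analyzers. It is an untested assumption, not the configuration currently in use: the name `text_cg` is made up for illustration, and `solr.CommonGramsQueryFilterFactory` is Solr's query-side counterpart of the common grams filter, which emits the gram and suppresses single tokens that are already part of one.

    <!-- Hypothetical variant; the name "text_cg" is illustrative only -->
    <fieldType name="text_cg" class="solr.TextField" positionIncrementGap="100">
        <!-- Index side: same chain as above, emits grams plus single words -->
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory" />
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.CommonGramsFilterFactory" words="lang/stopwords.txt" format="snowball" />
            <filter class="solr.StopFilterFactory" words="lang/stopwords.txt" format="snowball" enablePositionIncrements="true" />
        </analyzer>
        <!-- Query side (assumption): the query variant keeps the gram and drops the single words it covers -->
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory" />
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.CommonGramsQueryFilterFactory" words="lang/stopwords.txt" format="snowball" />
            <filter class="solr.StopFilterFactory" words="lang/stopwords.txt" format="snowball" enablePositionIncrements="true" />
        </analyzer>
    </fieldType>

Whether the query-side output for an unquoted "the cat" then matches only the indexed "the_cat" token would still have to be checked in the Analysis view and against the query parser in use.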

The solution my client is after is that when stop words in the query are used in conjunction with normal words, only elements with this exact phrase (stop-word-2-shingle) should match. The overall default operator is still AND.

For example, I have documents with the following fields

  1. id: 1; title: my cat in its natural surroundings; desc: the nicest animal in da world is a cat
  2. id: 2; title: the cat is evil; desc: everyone knows that cats are pure evil
  3. id: 3; title: cat solving mysteries; desc: our cat is called Sherlock

The following are examples of what I'd like to achieve. Basically, the users are more or less illiterate with respect to searches, queries, and operators, so the search should interpret the input and "do the right thing". The right thing would be:

  1. input: cat
    result: docs 1, 2, 3 (w/o scoring for the sake of easiness)
  2. input: cat world
    result: doc 1
    AND is default
  3. input: cat everyone
    result: doc 2
    AND spanning multiple fields
  4. input: the cat
    result: doc 2, because only its title field contains the phrase "the cat", which somehow has to magically appear at query time
  5. input: the nice cat
    result: []
    because no document contains the phrase "the nice" and the algorithm would interpret this as a common word phrase
  6. input: the cat world
    result: doc 1
  7. input: the pure
    result: []

The reasoning behind this is that the client has some specific ideas regarding some (carefully selected) stop words.
So is this a realistic way of doing it? Is it necessary to do some kind of query pre-parsing before passing the query to Solr? Are there other ways to achieve the desired results?

chammp
  • Did you try querying `q=text:"*the cat*"` ? – buddy86 Feb 21 '14 at 05:44
  • you don't even have to use *'s, just query q=text:"the cat" – MYK Feb 21 '14 at 07:14
  • Thanks, this (and a bug I fixed in my schema.xml) at least yields hits, thus "the cat" finds "the cat". Now I see that I have to change my question because what my client _actually_ wants is something slightly different ;) – chammp Feb 21 '14 at 11:40
