
In our search based on Solr, we have started by using phrases. For example, when the user types

blue dress

then the Solr query will be

title:"blue dress" OR description:"blue dress"
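
As a minimal sketch of how such a query string might be assembled on the client side (the `PhraseQueryBuilder` class and its method names are hypothetical and only serve the illustration; escaping is limited to the backslash and the double quote, the two characters that would break a quoted phrase):

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper: one quoted phrase clause per field, joined with OR. */
final class PhraseQueryBuilder {

    /** Escapes backslashes and double quotes so the phrase stays well-formed. */
    static String escapePhrase(String text) {
        return text.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Wraps the user input in quotes and repeats it for every field. */
    static String build(String userInput, List<String> fields) {
        String phrase = "\"" + escapePhrase(userInput.trim()) + "\"";
        return fields.stream()
                .map(field -> field + ":" + phrase)
                .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        // prints: title:"blue dress" OR description:"blue dress"
        System.out.println(build("blue dress", List.of("title", "description")));
    }
}
```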

We now want to remove stop words. Using the default StopFilterFactory, the query

the blue dress

will match documents containing "blue dress" or "the blue dress".

However, when typing

blue the dress

then it does not match documents containing "blue dress".

I am starting to wonder if we shouldn't instead only search using single terms. That is, convert the above user search into

title:the OR title:blue OR title:dress OR description:the OR description:blue OR description:dress

I am a bit reluctant to do this, though, as it seems to duplicate the work of the StandardTokenizerFactory.

Here is my schema.xml:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" />
  </analyzer>
</fieldType>

The title and the description fields are both of type text_general.

Is searching by single terms the standard way of searching in Solr? Am I exposing us to problems (performance issues, maybe) by tokenising the words before calling Solr? Or is thinking in terms of single terms vs. phrases simply wrong, and should we leave it to the user to decide?

Eric
  • try by adding the StopFilterFactory in the query analyser as well... – Abhijit Bashetti Aug 24 '15 at 07:04
  • We do have the StopFilterFactory in both the query and the index analysers. – Eric Aug 24 '15 at 07:21
  • could you please share the schema.xml – Abhijit Bashetti Aug 24 '15 at 07:49
  • I've added the portion of schema.xml that seems relevant. – Eric Aug 24 '15 at 08:22
  • is there a reason why you are not using the DisMax queryhandler? This is usually the most reasonable choice as soon as you start spreading searches over several fields. – cheffe Aug 24 '15 at 09:02
  • Eric, I tried your fieldType and it works fine for what you are expecting. Just check that the stop word has been added to your stopwords.txt file. I analysed this in the analysis tool. Have you done the same? – Abhijit Bashetti Aug 24 '15 at 09:38
  • @cheffe well, I'm not using DisMax/edismax mostly because the various examples generally use the default handler. After researching edismax, I do think now it would solve my problem. – Eric Aug 24 '15 at 10:12
  • @AbhijitBashetti did you do a query with single terms or a phrase query (that is: q:"title:\""blue dress\"")? I think that the phrase query is my problem. If I replace it either with a (manually) tokenised query or with a q:"blue dress", then I should be fine. – Eric Aug 24 '15 at 10:14

2 Answers


What you are stumbling over is that the stop word filter prevents stop words from being indexed, but their positions are recorded nevertheless. Something like a placeholder is kept in the index where the stop word occurred.

So when you put this to your index

the blue dress

it will be indexed as

* blue dress

The same happens when you hand in the phrase

"blue the dress"

as a query. It will be treated as

"blue * dress"

Now Solr compares these two fragments and it does not match as the * is at the wrong position.
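
To make the position argument concrete, here is a small self-contained simulation. It is a simplified model and not Lucene's actual code: the whitespace splitting stands in for the real tokenizer, the other filters in the schema are ignored, and `StopWordPositions` with its methods is a hypothetical name. Stop words are dropped, but the position counter still advances, and a phrase only matches if all remaining terms line up at the same relative positions.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Simplified model of stop word removal that keeps position gaps. */
final class StopWordPositions {

    /** Maps position -> term; removed stop words leave a hole at their position. */
    static Map<Integer, String> analyze(String text, Set<String> stopWords) {
        Map<Integer, String> termsByPosition = new TreeMap<>();
        int position = 0;
        for (String word : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!stopWords.contains(word)) {
                termsByPosition.put(position, word);
            }
            position++; // advances even when the word was a stop word
        }
        return termsByPosition;
    }

    /** Exact phrase match: every query term must sit at the same relative position. */
    static boolean phraseMatches(Map<Integer, String> doc, Map<Integer, String> query) {
        if (query.isEmpty()) {
            return false;
        }
        Map.Entry<Integer, String> first = query.entrySet().iterator().next();
        for (Map.Entry<Integer, String> candidate : doc.entrySet()) {
            if (!candidate.getValue().equals(first.getValue())) {
                continue;
            }
            // Shift the query so that its first term lands on this candidate.
            int offset = candidate.getKey() - first.getKey();
            boolean allAligned = query.entrySet().stream()
                    .allMatch(q -> q.getValue().equals(doc.get(q.getKey() + offset)));
            if (allAligned) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("the");
        Map<Integer, String> doc = analyze("blue dress", stop);
        System.out.println(phraseMatches(doc, analyze("the blue dress", stop))); // true
        System.out.println(phraseMatches(doc, analyze("blue the dress", stop))); // false
    }
}
```

In this model the query "the blue dress" becomes blue at position 1 and dress at position 2, which are still adjacent, whereas "blue the dress" becomes blue at 0 and dress at 2, so the gap keeps it from matching a document where the two terms are adjacent.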

Prior to Solr 4.4, this used to be handled by setting enablePositionIncrements="true" in the StopFilterFactory, as described by Pascal Dimassimo. Apparently a later refactoring broke that option on the StopFilterFactory, as discussed on SO and in Solr's Jira.


Update: While reading through the reference documentation of the Extended DisMax Query Parser, I found this:

The stopwords Parameter

A Boolean parameter indicating if the StopFilterFactory configured in the query analyzer should be respected when parsing the query: if it is false, then the StopFilterFactory in the query analyzer is ignored.

I will check if this helps with the problem.

cheffe
  • Well, the stopwords parameter is not really an issue, as I was not using the edismax parser. – Eric Aug 24 '15 at 12:32
  • Anyway, it seems that my problem is precisely that I was not using edismax. I now am. I must write that up as an answer. Thanks for your help in confirming that! – Eric Aug 24 '15 at 12:42

Although the initial approach might work if the query were split into multiple title:term clauses, this is error-prone (the tokens might be split in the wrong places) and duplicates, probably badly, the work done by the built-in tokenizer.

The right approach is to keep the initial query as is and rely on the Solr configuration to handle it properly. This makes sense, but the difficulty was that I wanted to specify the fields to search in. It turns out that the default query parser, the one known as LuceneQParserPlugin, offers no way to list several search fields without writing field prefixes into the query itself (confusingly, there is a parameter called fl, for Field List, but it specifies the returned fields, not the fields to search in).

For completeness, it is possible to simulate a list of fields to search in by using the copyField configuration in schema.xml. I find this neither elegant nor flexible enough.

The elegant solution is to use the ExtendedDisMax query parser, aka edismax. With it, we can maintain the query as is, and fully leverage the configuration in the schema. In our case, it looks like this:

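        import org.apache.solr.client.solrj.SolrQuery;

        SolrQuery solrQuery = new SolrQuery();
        solrQuery.set("defType", "edismax");      // switch to the ExtendedDisMax parser
        solrQuery.set("q", query);                // raw user input, e.g. "blue the dress"
        solrQuery.set("qf", "description title"); // the fields to search in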
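
For anyone not going through SolrJ, the same three parameters can be passed as plain request parameters. The following is only a sketch using the JDK: `EdismaxParams` is a hypothetical helper, and where the resulting query string is sent (host, core or collection, request handler) depends on your setup and is deliberately left out.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper: URL-encodes the edismax parameters used above. */
final class EdismaxParams {

    static String toQueryString(String userInput, String... searchFields) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("defType", "edismax");
        params.put("q", userInput);                       // passed through untouched
        params.put("qf", String.join(" ", searchFields)); // space-separated field list
        return params.entrySet().stream()
                .map(e -> e.getKey() + "="
                        + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // prints: defType=edismax&q=blue+the+dress&qf=description+title
        System.out.println(toQueryString("blue the dress", "description", "title"));
    }
}
```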

According to this page:

(e)Dismax generally makes the best first choice query parser for user facing Solr applications

It would have helped if this had indeed been the default choice.

Eric
  • Is the type of both the `title` and `description` fields `text_general`? Can you please add that information to the question? Also, your index and query analyzers are now the same, so you can merge them under a plain `analyzer` tag. – YoungHobbit Aug 24 '15 at 13:52
  • Yes, they both use the text_general type (I've updated the question). I'm curious about your recommendation to merge the two configs under the same tag. I hadn't realised that was possible, as most examples show both of them. Is that the recommended way when they are the same? – Eric Aug 25 '15 at 07:23
  • We separate them when a different set of tokenizers and filters has to run for indexing and for querying. But when it is the same set of operations, it is better to combine them. It looks cleaner and simpler. – YoungHobbit Aug 25 '15 at 07:32
  • Right. Thanks, @abhishekbafna. – Eric Aug 25 '15 at 12:21