I have basically the same problem as the one discussed in "Solr wildcard query with whitespace", but that question was never answered.

I'm using a wildcard in a filter query on a field called "brand."

I'm having trouble when the brand name has whitespace in it. For instance, filtering on the brand "Lexington" works fine with fq={!tag=brand}brand:Lexing*n. A multi-word brand like "Authentic Models" causes problems, however: it seems the name must be wrapped in double quotes.

Inside double quotes, the * wildcard has no effect, i.e. brand:"Authentic Mode*" or brand:"Lexingt*" won't match anything. Without quotes and without a space, brand:Authen* does work and matches Authentic Models. But once whitespace is included in the brand name, Solr seems to consider only the string up to the first space when matching.

The brand field is of type

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

which is not whitespace tokenized, to my understanding. It is populated with a copyField from a whitespace tokenized field, though.

Is there something I can do to stop Solr from tokenizing the filter query without using double quotes?

Jon B

3 Answers

14

Just like Rob said in his answer, I've posted my own answer on the question he linked to.

All you need to do is escape the space in your query (as in, customer_name:Pop *Tart --> customer_name:Pop\ *Tart). In my experience, this method works no matter where you place the wildcard, which is backed up by what Solr reports, namely that something like:

customer_name:Pop\ *Tart

is parsed as:

customer_name:Pop *Tart
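
A minimal sketch of that escaping step in Python, assuming the filter string is built client-side before it is sent to Solr. The helper names `escape_term` and `wildcard_filter` are made up for illustration; the character list is my reading of the special characters in the standard Lucene query syntax, minus `*` and `?` so they keep acting as wildcards.

```python
import re

# Whitespace plus the characters the standard query parser treats as syntax.
# '*' and '?' are deliberately left out so they still work as wildcards.
_NEEDS_ESCAPE = re.compile(r'([\\+\-!():^\[\]"{}~|&/\s])')


def escape_term(term: str) -> str:
    """Put a backslash in front of whitespace and query-syntax characters."""
    return _NEEDS_ESCAPE.sub(r'\\\1', term)


def wildcard_filter(field: str, pattern: str) -> str:
    """Build an unquoted field:pattern clause with the pattern escaped."""
    return f"{field}:{escape_term(pattern)}"


if __name__ == "__main__":
    # Multi-word brand with a trailing wildcard.
    print(wildcard_filter("brand", "Authentic Mode*"))
    # Single-word brand with a wildcard in the middle: nothing to escape.
    print(wildcard_filter("brand", "Lexing*n"))
```

The resulting clause can be prefixed with local params such as {!tag=brand} exactly as in the question. The same helper would also cover a brand name in which a problematic symbol has first been replaced by `*`, since the wildcard is left untouched and only the spaces are escaped.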
Aubergine
1

Try changing the field type from string to something like text. The string type is not tokenized, so when a string field contains whitespace, Solr will try to match your query against the whole value, whitespace included.

In the default schema file you can see this line just above the string field type:

<!-- The StrField type is not analyzed, but indexed/stored verbatim. -->

Using a text type, such as text_general or a similar one, should fix your problem.

denizdurmus
  • I would think that I do want a non-tokenized field to store brands. I want to be able to filter on a brand by specifying its full name in a filter query, whitespace included. – Jon B Sep 12 '12 at 13:36
  • I tried using a wildcard on a tokenized field, and the matches it returned were too permissive...something like only requiring that the first token of the query matches a token in the index. – Jon B Sep 12 '12 at 19:25
  • For sure, using string for a field that you will filter or sort on makes sense for performance, but that leaves the space problem. You can run some benchmarks to check the performance difference between string and text fields, or you can try other tricks like sorting on the first N letters or tokens of the field. I am not sure whether you can define customized fields, though it could be worth looking into. – denizdurmus Sep 13 '12 at 04:42
  • I'm not so worried about performance, actually. I found that doing the filter query on a text field resulted in matches that were not exact. For example, fq:"My Brand" matched not only "My Brand," but also "My Brand Foo" and "My Brand Bar," etc. I don't fully understand the behavior of this field, but I believe it won't meet my requirements. The reason I am using a wildcard is that we are having trouble matching brands with a TM symbol because of encoding. I'd like to replace TM with a wildcard in the query and match the rest of the brand exactly. – Jon B Sep 13 '12 at 14:08
  • For the TM issue, you could use a transformer: replace or remove the symbol from the field, and then you don't need a wildcard. For matching on a text field, I would recommend experimenting with the analyzers and filters in the schema.xml file, though I don't think there are many good tutorials. You can also post your question here: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html which is one of the main lists of the Solr/Lucene community. – denizdurmus Sep 14 '12 at 00:31
0

I have added a possible solution back on the original question, "Solr wildcard query with whitespace".

Note this ONLY works with trailing wildcards. I know this question example uses the wildcard within the string, but it serves to answer a specific case of the question in point.

Basically it amounts to using the FieldQParserPlugin query parser. Check my post on the original question for more details so I don't get scorn for repeating myself.
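
A rough sketch of what a local-params filter could look like when built in Python. This is my own illustration, not a copy of the linked post: the helper name `local_params_filter` is made up, `{!field f=...}` is the local-params form of FieldQParserPlugin, and `{!prefix f=...}` is a different built-in parser (the prefix query parser) swapped in here because it covers the trailing-wildcard case without any `*` in the value. Both take the rest of the string as the raw value, so the space needs no quoting or escaping.

```python
from typing import Optional
from urllib.parse import urlencode


def local_params_filter(parser: str, field: str, value: str,
                        tag: Optional[str] = None) -> str:
    """Build a filter like {!prefix f=brand tag=brand}Authentic Mode."""
    tag_part = f" tag={tag}" if tag else ""
    return f"{{!{parser} f={field}{tag_part}}}{value}"


# Exact match on the whole brand name, whitespace included.
exact = local_params_filter("field", "brand", "Authentic Models", tag="brand")

# Prefix match: equivalent in spirit to a trailing wildcard (Authentic Mode*).
prefix = local_params_filter("prefix", "brand", "Authentic Mode", tag="brand")

if __name__ == "__main__":
    print(exact)
    print(prefix)
    # URL-encoded request parameters, ready to append to a select URL.
    print(urlencode({"q": "*:*", "fq": prefix}))
```

As noted above, this only helps when the wildcard is at the end of the value; a wildcard in the middle of the brand name still needs a different approach.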

Rob