
I have made the following type definition in Solr:

<fieldType name="text_phrase" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>    
</fieldType>

It should index values verbatim (no tokenization).

I add the value "skinny jeans" to my index.

When I run the following search query (url decoded for reading) I get no results:

http://myvm:8983/solr/mycore/select?q=*:*&fq=name:("skinny jeans")&wt=json&indent=true&debugQuery=true

You can see the URL is searching for everything (*:*) with a filter query for the exact value "skinny jeans".

I then add the value "jeans" to my index, and run a similar query with

&fq=name:("jeans")

And I do find the "jeans" element.


So it works for a single word, but not for multiple words. Why would this be? I'm searching for an exact value after all. It makes me suspect that the KeywordTokenizerFactory is doing something odd. Can anyone please advise why no results are being returned from such a basic setup?

Thanks,

mils

1 Answer

This happens because your index analyzer uses KeywordTokenizerFactory, which keeps the whole input as a single token and does no splitting at all. Your query analyzer, however, uses WhitespaceTokenizerFactory, which splits the input into separate tokens at whitespace.

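So at index time KeywordTokenizerFactory stores "skinny jeans" in the index as a single token.

At query time WhitespaceTokenizerFactory produces two tokens, "skinny" and "jeans".

You can see the difference: it won't match. You are searching for the two tokens "skinny" and "jeans" against the single indexed token "skinny jeans". A single word such as "jeans" comes out the same from both tokenizers, which is why that query does work.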
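To make the mismatch concrete, here is a minimal Python sketch that only imitates the two tokenizers. It is not Solr code: the function names `keyword_tokenize`, `whitespace_tokenize` and `phrase_matches` are made up for illustration, and the matching is a simplified stand-in for how a phrase query compares consecutive terms.

```python
def keyword_tokenize(text):
    # Imitates a keyword tokenizer: the whole input becomes one token.
    return [text]


def whitespace_tokenize(text):
    # Imitates a whitespace tokenizer: split on runs of whitespace.
    return text.split()


def phrase_matches(indexed_tokens, query_tokens):
    # Simplified phrase match: the query tokens must appear
    # consecutively and in order among the indexed tokens.
    n = len(query_tokens)
    if n == 0:
        return False
    return any(
        indexed_tokens[i:i + n] == query_tokens
        for i in range(len(indexed_tokens) - n + 1)
    )


indexed = keyword_tokenize("skinny jeans")  # ['skinny jeans']

# Query split on whitespace: two tokens cannot match the single indexed token.
print(phrase_matches(indexed, whitespace_tokenize("skinny jeans")))  # False

# Query kept whole: identical single token, so it matches.
print(phrase_matches(indexed, keyword_tokenize("skinny jeans")))  # True

# A single word is the same under both tokenizers.
print(phrase_matches(keyword_tokenize("jeans"), whitespace_tokenize("jeans")))  # True
```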

You need to change either the index tokenizer or the query tokenizer so that both produce the same tokens.

If you want an exact match, use KeywordTokenizerFactory in both the index analyzer and the query analyzer, as shown below:

<fieldType name="text_phrase" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>    
</fieldType>

You can check the tokens created at index time and at query time using the Solr analysis tool.
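If you prefer to run the same check over HTTP rather than in the admin UI, the sketch below builds a request URL for Solr's field analysis handler. It assumes that handler is reachable at `/analysis/field` on your core (that depends on your Solr version and configuration), and the helper name `analysis_url` is hypothetical. The host, core and field type are taken from the question.

```python
from urllib.parse import urlencode


def analysis_url(base, core, field_type, index_value, query_value):
    # Assumption: the field analysis handler is available at /analysis/field.
    params = {
        "analysis.fieldtype": field_type,    # field type to analyse with
        "analysis.fieldvalue": index_value,  # text run through the index analyzer
        "analysis.query": query_value,       # text run through the query analyzer
        "wt": "json",
    }
    return f"{base}/solr/{core}/analysis/field?{urlencode(params)}"


url = analysis_url("http://myvm:8983", "mycore", "text_phrase",
                   "skinny jeans", "skinny jeans")
print(url)
```

Opening the printed URL in a browser should show the tokens each analyzer produces for "skinny jeans", so you can compare the index side and the query side.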

marc_s
Abhijit Bashetti
  • Ah I think I understand, are you saying that even though in my query I double-quoted the phrase "skinny jeans", it won't actually be queried as a phrase, but rather as individual tokens? – mils May 25 '16 at 06:07
  • If I used a WhitespaceTokenizerFactory and ShingleFilterFactory on the query side could I get similar results to what I'm after? – mils May 25 '16 at 07:18
  • @mils: I'm not very sure about this, but yes, give it a try: build the fieldType and apply it to the field. The best way is to analyse it in the Solr analysis page/tool. – Abhijit Bashetti May 25 '16 at 07:42