I have a SOLR DB with ca. 70M documents. A certain query returns about 300 documents. With

  • facet.field=A it takes only 4 ms,
  • facet.field=B needs 800 ms to return!

Are there errors in my schema? Can it be done faster?

<fieldtype name="B_type" class="solr.TextField" positionIncrementGap="100"    
           sortMissingLast="true" omitNorms="true">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.StandardFilterFactory" ignoreCase="true" />
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.StandardFilterFactory" ignoreCase="true" />
    </analyzer>
</fieldtype>

<field name="A" type="string" indexed="true" stored="true" multiValued="false" />
<field name="B" type="B_type" indexed="true" stored="false" multiValued="true" />
cheffe
Stefan Weiss

1 Answer

Field A is of type string, which is well suited for use as a facet. Your field B is analyzed: you strip off special chars and you lowercase it, which is not so good for use as a facet. Both of these things are done when the StandardFilterFactory is applied.

In Solr's Wiki there is an interesting part about facets:

Because faceting fields are often specified to serve two purposes, human-readable text and drill-down query value, they are frequently indexed differently from fields used for searching and sorting:

  • They are often not tokenized into separate words
  • They are often not mapped into lower case
  • Human-readable punctuation is often not removed (other than double-quotes)
  • There is often no need to store them, since stored values would look much like indexed values and the faceting mechanism is used for value retrieval.

As you can see, you are missing the two points in the middle: you lowercase the values and you remove special chars.

As advised in Indexing Fields with SOLR and LowerCaseFilterFactory, you should introduce a new field in your schema, which should be of type string and be kept in sync with your field B via copyField. That new field should then be used for faceting, and it should be quicker. We usually name such fields with a suffix, like B_raw.
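
A minimal sketch of what that could look like in the schema, assuming the `string` type already used by field A is the usual `solr.StrField` and that B_raw is only needed for faceting (so it is not stored). The field name B_raw is just the naming convention mentioned above, not something Solr requires:

```xml
<!-- Facet-only copy of B: type string, so values are indexed verbatim (no analysis). -->
<!-- multiValued="true" mirrors field B, which is multi-valued in the question's schema. -->
<field name="B_raw" type="string" indexed="true" stored="false" multiValued="true" />

<!-- Copy the raw input of B into B_raw at index time, before B's analyzer runs. -->
<copyField source="B" dest="B_raw" />
```

After reindexing, the facet would be requested with facet.field=B_raw instead of facet.field=B. Documents indexed before the schema change have no B_raw values until they are indexed again.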

Since you have 70M documents, it would be a good idea to test this on a subset first to save time.

Community
cheffe