I have been saving product specifications into Solr 5. Most products have unique variant IDs that use dashes or dots, like this: Samsung TV 54 : AD-oi-230, Sony TV 24 : 1.849.32s.s.

But occasionally I come across variant IDs that use spaces instead of dashes, like Samsung 54 : OPD 1 jud, Sony 32 : s1 90 b33 9 337.

Since the spaces in those IDs carry no meaning, would removing them (Samsung 54 : OPD1jud, Sony 32 : s190b339337) make the index scale better or reduce its size?

Here is my field that stores the model name. I have enabled the WordDelimiterFilterFactory:

  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="0" generateNumberParts="1" splitOnCaseChange="0" catenateWords="1" splitOnNumerics="1" stemEnglishPossessive="0" generateWordParts="1" catenateAll="0" catenateNumbers="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.LengthFilterFactory" min="2" max="20"/>
    </analyzer>
  </fieldType>
RedGiant
Why do you need to make the index smaller (removing spaces would just buy you a little bit of time before the index grows to that size anyway)? Do you need the exact value? Are you using this as an actual ID field? – MatsLindh Jan 01 '16 at 20:08

1 Answer

Index size is not an issue here, especially since whatever you do with analyzers, the original stored values are kept as they are.

However, what you describe (removing spaces) does make sense as normalization: it ensures that a search matches whether the ID was written with spaces or with dashes. That is a better reason to look at this anyway.
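One way to do that kind of normalization is sketched below. It assumes the variant ID is indexed in its own field, separate from the product name, which the question does not say. The field type name `variant_id` and the separator pattern are illustrative, not taken from the original schema. The char filter strips spaces, dashes and dots before tokenizing, so `OPD 1 jud`, `OPD-1-jud` and `OPD1jud` should all index and query as the same single token.

```xml
<!-- Hypothetical field type for a dedicated variant-id field; the name is illustrative. -->
<fieldType name="variant_id" class="solr.TextField">
  <analyzer>
    <!-- Remove whitespace, dots and dashes before tokenizing, e.g. "OPD 1 jud" becomes "OPD1jud" -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\s.\-]+" replacement=""/>
    <!-- Keep the whole normalized id as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Make matching case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the same analyzer runs at index and query time, users can type the ID with any of those separators. The stored value is untouched, so the original formatting is still returned in results.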

Alexandre Rafalovitch