1

How do I make Solr/Lucene ignore spaces? What I want to achieve is to make the search engine match search phrases such as "Hongkong" when only "hong kong" is indexed.

As far as I know, I should play with some text analyzers, but I cannot find any good source describing this approach.

Thanks!

Jacek Francuz

3 Answers

2

In your case the search term differs from the text that is indexed.
You would need to use solr.SynonymFilterFactory and define this combination as synonyms.
Check out the examples in the SynonymFilterFactory documentation.
That would enable you to search for either hong kong or hongkong and still get the result.
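
As a rough sketch of the synonym approach, assuming an index-time synonym list: the field type name text_syn and the file name synonyms.txt are placeholders of mine, not anything from the question. Newer Solr releases also offer SynonymGraphFilterFactory, so check which one your version expects.

```xml
<!-- synonyms.txt (hypothetical file in the core's conf directory) would hold one line:
       hong kong, hongkong
-->
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <!-- Expand synonyms at index time so "hong kong" is also indexed as "hongkong" -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
  <!-- Query side only lowercases; "hongkong" then matches the expanded token -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Expanding at index time sidesteps the usual trouble with multi-word synonyms at query time, at the cost of reindexing whenever the synonym list changes.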

Usually WordDelimiterFilterFactory is used for combinations written without a space.
It handles situations such as a change in case or alphanumeric combinations, where you want a search with any variant to match.

e.g.
Wi-fi should be searchable by wi-fi, wifi, wi fi etc.
iPhone should be searchable as iphone, iPhone, i phone etc.
j2se should be searchable by j2se, j 2 se etc.
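
A minimal sketch of a field type for the cases above, assuming you split and catenate at index time and only split at query time. The name text_wdf is a placeholder, and the exact flags are a starting point to tune, not a recommendation from the Solr docs.

```xml
<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Split on hyphens, case changes and letter/digit boundaries,
         and also index the joined forms (wifi, iphone, j2se) -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            splitOnCaseChange="1"/>
    <!-- Lowercase after splitting, otherwise the case change in iPhone is lost -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Only split at query time; the joined forms already exist in the index -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that this only helps when the indexed text carries a delimiter, a case change or a digit boundary. Plain "hong kong" has none of these, which is why the synonym filter is the better fit for the question.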

Jayendra
  • 2
    I'd also enable spell checking. Solving that problem via synonyms is an endless battle. I would use your analytics package to help identify the main offenders and let spell check pick up the rest. – Mike R. May 31 '12 at 16:33
2

You must know when those spaces are relevant and when they are not. If you have the list of words, you should use synonyms; see the documentation for SynonymFilterFactory.

Persimmonium
2

You can use ShingleFilterFactory to create word combinations. You need to set tokenSeparator="" in order to remove space between tokens. You may want to leave outputUnigrams=true if you still want to search individual words.

    <fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2"
                outputUnigrams="true" outputUnigramsIfNoShingles="false" tokenSeparator=""/>
      </analyzer>
    </fieldType>

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory

You need to be careful, though. ShingleFilter will create combinations for everything in your document. For example, "need to be careful" will produce "needto tobe becareful". That example looks harmless, but consider this one: "this is no table" will produce "thisis isno notable", so a query for "notable" will return a false positive hit.

If you are indexing short documents such as people's names, then I certainly suggest ShingleFilter, because such combinations are common in personal names. However, if you are indexing longer documents, you need to know what you are combining. A synonym filter may suit better in that case: you can create your combinations from a dictionary and use them with SynonymFilterFactory.

ali