
I was about to integrate Sphinx-based search into the website, but I've found that it has no built-in support for spelling correction.

People on the web suggest using pspell or other third-party libraries, but the data I'm going to search contains mostly "technical" terms such as brand names, so I don't think common dictionaries will include them.

On the other hand, Xapian claims to support spelling correction based on the indexed data, which is exactly what I want. Is it worth using Xapian instead? I'm still quite confused about which full-text search engine I should use: Sphinx seems to be quite good but lacks some cool features of Xapian (or maybe Lucene?), while it looks like the latter has a smaller community and less documentation.

I think I can handle words missing from the pspell dictionary by adding a custom dictionary, but I'm not sure whether that will noticeably hurt performance. I'm going to use the search system for spotlight-style search (a separate AJAX request on every letter entered) on a pretty popular website, so performance matters.

Ideally, I'd like some fields, such as brand names, to take priority over the common dictionary, but I guess that's not really important since most brand names are quite distinct from other words.

Any suggestions on the general design of the custom full-text search engine are welcome too.

Thanks

htf
  • Did you consider switching to Apache Solr? It is a search platform built on top of Lucene: http://lucene.apache.org/solr/features.html#Detailed+Features – nuqqsa May 19 '10 at 10:37

2 Answers


Sphinx has no built-in spelling correction, but it can be implemented on top of Sphinx. The only how-to article about this, written by the Sphinx author, is at http://habrahabr.ru/blogs/sphinx/61807 (in Russian; you can use Google Translate to read it. See the second part of the article, titled "Я понял, это намек.", roughly "I get it, that's a hint.")

I implemented that method recently, and it works perfectly!
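For readers who cannot follow the Russian article, here is a rough Python sketch of the general idea: build the suggestion dictionary from the indexed data itself, narrow the candidates by shared character trigrams, then rank them by edit distance and word frequency. This is an illustration under assumptions, not the code from the article or from the Sphinx tarball. It keeps everything in an in-memory dict instead of a Sphinx index, and the names `Suggester`, `trigrams`, `levenshtein`, and `max_distance` are made up for the example.

```python
from collections import Counter


def trigrams(word):
    """Return the set of character trigrams of a word, padded with '__'."""
    padded = "__" + word.lower() + "__"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


class Suggester:
    """Hypothetical in-memory stand-in for a trigram index kept in Sphinx."""

    def __init__(self, documents):
        # The vocabulary comes from the indexed text, so brand names are included.
        self.freq = Counter(w for doc in documents for w in doc.lower().split())
        self.index = {}
        for word in self.freq:
            for gram in trigrams(word):
                self.index.setdefault(gram, set()).add(word)

    def suggest(self, query, max_distance=2):
        query = query.lower()
        if query in self.freq:
            return query  # already a known word, nothing to correct
        # Count how many trigrams each candidate shares with the query.
        overlap = Counter()
        for gram in trigrams(query):
            for word in self.index.get(gram, ()):
                overlap[word] += 1
        best = None
        for word, shared in overlap.items():
            dist = levenshtein(query, word)
            if dist > max_distance:
                continue
            # Prefer smaller distance, then more shared trigrams, then frequency.
            key = (dist, -shared, -self.freq[word], word)
            if best is None or key < best[0]:
                best = (key, word)
        return best[1] if best else None
```

With documents such as `["Nikon D90 camera body", "Canon EOS camera"]`, `Suggester(...).suggest("nikkon")` returns `"nikon"`. Because the dictionary is derived from the indexed text, no external word list is needed for brand names.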

seriyPS
  • Google's Russian is way better than mine, but it's still pretty much useless for technical instruction purposes. – Brad Mace Nov 06 '10 at 03:27
  • @bemace, look into the misc/suggest directory in the source tarball. It gives a basic idea of how it works. – user187291 Nov 07 '10 at 23:59
  • Yeah! My implementation of the suggestion feature was based on the contents of the misc/suggest folder of the Sphinx tarball. @stereofrog thanks! – seriyPS Nov 08 '10 at 10:01

Sphinx allows you to use morphology preprocessors and word forms dictionaries. Combined, they could get you closer to what you want to achieve. You can read more about both topics here: http://sphinxsearch.com/docs/manual-0.9.8.html#conf-morphology and in the sections that follow it.

There are several "flavours" of morphology preprocessors available; choose the one that best fits your needs. The docs also mention the Snowball project, which can be used to add stemmers for languages other than the built-in English and Russian, if needed. The project website: http://snowball.tartarus.org/

Sphinx is a very fast full-text search engine, and using stemmers is not likely to slow it down to the extent that you would notice.
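To illustrate what a word forms dictionary does, here is a small Python sketch that maps known misspellings to a canonical term before searching. It assumes the one-mapping-per-line `source > destination` layout used by Sphinx wordforms files; the function names and the sample entries are hypothetical, and Sphinx applies such mappings itself, so this only shows the effect.

```python
def parse_wordforms(text):
    """Parse 'source > destination' lines into a lowercase lookup dict."""
    mapping = {}
    for line in text.splitlines():
        if ">" not in line:
            continue  # skip blank or malformed lines
        source, dest = line.split(">", 1)
        mapping[source.strip().lower()] = dest.strip().lower()
    return mapping


def normalize(query, mapping):
    """Replace each query token with its canonical form, if one is listed."""
    return " ".join(mapping.get(tok, tok) for tok in query.lower().split())


# Hypothetical entries: common misspellings of two brand names.
WORDFORMS = """
nikkon > nikon
cannon > canon
"""

mapping = parse_wordforms(WORDFORMS)
normalized = normalize("Nikkon lens", mapping)
```

This only covers misspellings that were listed in advance, so it complements a suggestion feature and does not replace one.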

guntars