
I'm using Crate for a German news site and use full-text search extensively (which generally works well enough). However, I was wondering about stop word usage. I'd like to use as few stop words as possible: search is plenty fast, so I'm not too worried about performance. Is this advisable? And which stop words are actually used by default? Is there a list of built-in stop words somewhere?

admdrew
Peter Sabaini

1 Answer


The built-in stop words actually come from Lucene and are in the lucene-analyzers-common*.jar file in the lib directory of the Crate tarball.

If you extract the contents of the jar file, you'll find a file called german_stop.txt, which contains all the German stop words.
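A jar is a zip archive, so the list can also be read without unpacking it by hand. The following is a minimal sketch, not an official tool. It assumes the file uses the Snowball stop word format (one or more words per line, with `|` starting a comment), and it searches for any entry whose name ends in german_stop.txt rather than assuming a particular path inside the jar. The function names are made up for this example.

```python
import zipfile


def parse_snowball_stop_words(text):
    """Parse stop words from Snowball-style text, where '|' starts a comment."""
    words = []
    for line in text.splitlines():
        # Keep only the part before the comment marker, then split on whitespace.
        content = line.split("|", 1)[0]
        words.extend(content.split())
    return words


def read_stop_words(jar_path, suffix="german_stop.txt"):
    """Return the stop words from the first jar entry whose name ends with suffix."""
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if name.endswith(suffix):
                return parse_snowball_stop_words(jar.read(name).decode("utf-8"))
    raise FileNotFoundError(f"no entry ending in {suffix!r} in {jar_path}")
```

To use it, call `read_stop_words` with the path of the actual lucene-analyzers-common jar in your lib directory.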

There is also a set of words in the Lucene source code that is marked as deprecated, so I assume it's no longer in use. These words are:

"einer", "eine", "eines", "einem", "einen",
"der", "die", "das", "dass", "daß",
"du", "er", "sie", "es",
"was", "wer", "wie", "wir",
"und", "oder", "ohne", "mit",
"am", "im", "in", "aus", "auf",
"ist", "sein", "war", "wird",
"ihr", "ihre", "ihres",
"als", "für", "von", "mit",
"dich", "dir", "mich", "mir",
"mein", "sein", "kein",
"durch", "wegen", "wird"

I think the default is good enough. Unless you run into trouble with specific words, I don't see a reason to tweak the stop words.

mfussenegger
  • Thanks, found it! The reason I'd like to tweak this is search precision. Stop words are very useful for cutting down index size, but since I'm not worried about performance right now, I'd like to buy some search precision for a little extra load. (E.g. consider searching for phrases like "von einem zum anderen" ("from one to the other"), which consist entirely of stop words.) I'll do some experimentation... – Peter Sabaini Feb 21 '14 at 23:32
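The precision concern in the comment can be illustrated with a small sketch. This is not how Crate or Lucene implements filtering. It is a plain whitespace tokenizer with a hand-written sample stop set, which is assumed for illustration to contain all four words of the phrase, as the comment describes. It shows that such a query keeps no searchable terms once stop words are dropped.

```python
# Illustrative stop set only, not the real default list; assumed to
# contain every word of the example phrase.
SAMPLE_STOP_WORDS = {"von", "einem", "zum", "anderen", "der", "die", "das"}


def remove_stop_words(query, stop_words):
    """Lowercase the query, split on whitespace, and drop stop words."""
    tokens = query.lower().split()
    return [token for token in tokens if token not in stop_words]


# With the sample stop set, nothing is left to match against the index.
filtered = remove_stop_words("von einem zum anderen", SAMPLE_STOP_WORDS)

# With an empty stop set, every word stays searchable.
unfiltered = remove_stop_words("von einem zum anderen", set())
```

Here `filtered` is an empty list, while `unfiltered` keeps all four words, which is the trade-off being weighed: a larger index in exchange for being able to match such phrases.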