I'm using Crate for a German news site and use full-text search extensively, which generally works well enough. However, I was wondering about stop word usage. I'd like to minimize it: search is plenty fast, so I'm not too worried about performance. Is this advisable? Also, which stop words are actually used by default? Is there a list of built-in stop words somewhere?
1 Answer
The built-in stop words actually come from Lucene and live in the lucene-analyzers-common*.jar file inside the lib directory of the Crate tarball. If you extract the contents of the jar file, you'll find a file called german_stop.txt, which contains all German stop words.
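A jar file is an ordinary zip archive, so the list can also be read without unpacking anything. The following is a minimal sketch, not an official Crate or Lucene tool. It assumes that german_stop.txt uses the Snowball stop word format (whitespace-separated words, with `|` starting a comment), and the function names are made up for illustration. It searches the archive by file name rather than hard-coding the path inside the jar.

```python
import zipfile


def parse_snowball_stop_words(text):
    """Parse stop word text, assuming Snowball format: '|' starts a comment."""
    words = []
    for line in text.splitlines():
        # Drop everything after the first '|' (the comment part, if any).
        content = line.split("|", 1)[0]
        # A line may hold zero, one, or several whitespace-separated words.
        words.extend(content.split())
    return words


def read_stop_words(jar_path, file_name="german_stop.txt"):
    """Return the stop words stored under file_name somewhere inside the jar."""
    with zipfile.ZipFile(jar_path) as jar:
        matches = [
            name for name in jar.namelist()
            if name == file_name or name.endswith("/" + file_name)
        ]
        if not matches:
            raise FileNotFoundError(f"{file_name} not found in {jar_path}")
        raw = jar.read(matches[0])
    return parse_snowball_stop_words(raw.decode("utf-8"))
```

To use it, pass the path of the lucene-analyzers-common jar from your own Crate lib directory (the exact file name depends on the bundled Lucene version) to `read_stop_words` and print or count the result.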
There is also a set of words inside the Lucene source code that is marked as deprecated, so I assume it is no longer in use. These words are:
"einer", "eine", "eines", "einem", "einen",
"der", "die", "das", "dass", "daß",
"du", "er", "sie", "es",
"was", "wer", "wie", "wir",
"und", "oder", "ohne", "mit",
"am", "im", "in", "aus", "auf",
"ist", "sein", "war", "wird",
"ihr", "ihre", "ihres",
"als", "für", "von", "mit",
"dich", "dir", "mich", "mir",
"mein", "sein", "kein",
"durch", "wegen", "wird"
I think the default is good enough. Unless you run into trouble with some specific words, I don't see a reason to tweak the stop words.

mfussenegger
Thanks, found it! The reason I'd like to tweak this is search precision. Stop words are very useful to cut down index size, but since I'm not worried about performance right now, I'd like to buy some search precision for a little extra load. (E.g. consider searching for phrases like "von einem zum anderen" which are all stop words). I'll do some experimentation... – Peter Sabaini Feb 21 '14 at 23:32
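To make the precision concern concrete, here is a small sketch of what stop word removal does to a query before it reaches the index. It is plain Python, not Crate's or Lucene's actual analysis chain: the tokenizer is a simple lowercase-and-split, and the stop set is the deprecated list quoted in the answer, used here only because it is the one list shown in this thread.

```python
# Stop set copied from the deprecated Lucene list quoted in the answer above.
# Assumption: this is for illustration only; the list actually used by default
# (german_stop.txt) is a different, larger file.
STOP_WORDS = {
    "einer", "eine", "eines", "einem", "einen",
    "der", "die", "das", "dass", "daß",
    "du", "er", "sie", "es",
    "was", "wer", "wie", "wir",
    "und", "oder", "ohne", "mit",
    "am", "im", "in", "aus", "auf",
    "ist", "sein", "war", "wird",
    "ihr", "ihre", "ihres",
    "als", "für", "von",
    "dich", "dir", "mich", "mir",
    "mein", "kein",
    "durch", "wegen",
}


def remove_stop_words(query, stop_words=STOP_WORDS):
    """Lowercase the query, split on whitespace, and drop stop words."""
    return [token for token in query.lower().split() if token not in stop_words]


print(remove_stop_words("von einem zum anderen"))  # ['zum', 'anderen']
```

Even this short list discards half of the phrase. With a stop list that covers all four words, as the comment describes for the default, nothing would be left to match, which is why phrase queries built from function words cannot be answered precisely while stop word filtering is active.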