
When I query for "elegant" in Solr, I get results for "elegance" too.

I use these filters for index-time analysis:

WhitespaceTokenizerFactory
StopFilterFactory
WordDelimiterFilterFactory
LowerCaseFilterFactory
SynonymFilterFactory
EnglishPorterFilterFactory
RemoveDuplicatesTokenFilterFactory
ReversedWildcardFilterFactory

and for query-time analysis:

WhitespaceTokenizerFactory
SynonymFilterFactory
StopFilterFactory
WordDelimiterFilterFactory
LowerCaseFilterFactory
EnglishPorterFilterFactory
RemoveDuplicatesTokenFilterFactory 

I want to know which filter is affecting my search results.

Romi

1 Answer


EnglishPorterFilterFactory

That's the short answer ;)

A little more information:

"English Porter" refers to the English Porter stemming algorithm. A stemmer is a heuristic that reduces words to a root form, and according to this stemmer both "elegant" and "elegance" have the same stem.

You can verify this online, e.g. here. Basically you will see "elegant" and "elegance" both reduced to the same stem: "eleg".
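
To make the idea concrete, here is a minimal sketch of a single rule in the style of the classic Porter algorithm: drop a suffix such as "-ant" or "-ance" when the remaining stem has measure m > 1. It is not the full stemmer and not Solr's code (the Snowball English stemmer states the condition in terms of word regions rather than m). The class and method names are made up for illustration.

```java
import java.util.List;
import java.util.Locale;

// Sketch of ONE Porter-style rule, not the complete algorithm.
// All names here are hypothetical and do not come from Solr or Snowball.
class PorterSuffixSketch {

    // Only four suffixes are handled; the real step covers many more.
    private static final List<String> SUFFIXES = List.of("ance", "ence", "ant", "ent");

    // A letter is a vowel if it is a, e, i, o, u, or a 'y' that follows a consonant.
    static boolean isVowel(String w, int i) {
        char c = w.charAt(i);
        if ("aeiou".indexOf(c) >= 0) {
            return true;
        }
        return c == 'y' && i > 0 && !isVowel(w, i - 1);
    }

    // Porter's measure m: the number of vowel-to-consonant transitions in the stem.
    static int measure(String stem) {
        int m = 0;
        boolean prevVowel = false;
        for (int i = 0; i < stem.length(); i++) {
            boolean vowel = isVowel(stem, i);
            if (prevVowel && !vowel) {
                m++;
            }
            prevVowel = vowel;
        }
        return m;
    }

    // Strip a known suffix only if the remaining stem has measure > 1.
    static String stripSuffix(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix)) {
                String stem = w.substring(0, w.length() - suffix.length());
                return measure(stem) > 1 ? stem : w;
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println("elegant -> " + stripSuffix("elegant"));   // elegant -> eleg
        System.out.println("elegance -> " + stripSuffix("elegance")); // elegance -> eleg
    }
}
```

Because both words lose their suffix and end up as "eleg", they are indexed and queried as the same term, which is why a search for one matches the other.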

From Solr source:

        public void inform(ResourceLoader loader) {
            String wordFiles = args.get(PROTECTED_TOKENS);
            if (wordFiles != null) {
                try {

This is exactly where the protwords file comes into play:

                    File protectedWordFiles = new File(wordFiles);
                    if (protectedWordFiles.exists()) {
                        List<String> wlist = loader.getLines(wordFiles);
                        //This cast is safe in Lucene
                        protectedWords = new CharArraySet(wlist, false);//No need to go through StopFilter as before, since it just uses a List internally
                    } else {
                        List<String> files = StrUtils
                                .splitFileNames(wordFiles);
                        for (String file : files) {
                            List<String> wlist = loader.getLines(file
                                    .trim());
                            if (protectedWords == null)
                                protectedWords = new CharArraySet(wlist,
                                        false);
                            else
                                protectedWords.addAll(wlist);
                        }
                    }
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
        }

This is the part that affects the stemming. Here you can see the invocation of the Snowball library:

        public EnglishPorterFilter create(TokenStream input) {
            return new EnglishPorterFilter(input, protectedWords);
        }

    }

    /**
     * English Porter2 filter that doesn't use reflection to
     * adapt lucene to the snowball stemmer code.
     */
    @Deprecated
    class EnglishPorterFilter extends SnowballPorterFilter {
        public EnglishPorterFilter(TokenStream source,
                CharArraySet protWords) {
            super (source, new org.tartarus.snowball.ext.EnglishStemmer(),
                    protWords);
        }
    }
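
The role of the protected-word set in the source above can be sketched as follows: a token found in the set is passed through unchanged, and everything else is stemmed. This is only an illustration under simple assumptions. `toyStem` is a made-up stand-in for the Snowball stemmer, and a plain `Set<String>` stands in for Lucene's `CharArraySet`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustration only: hypothetical names, not Solr's implementation.
class ProtectedWordsSketch {

    // Toy stand-in for a real stemmer: strips just "-ance" and "-ant".
    static String toyStem(String token) {
        if (token.endsWith("ance")) {
            return token.substring(0, token.length() - 4);
        }
        if (token.endsWith("ant")) {
            return token.substring(0, token.length() - 3);
        }
        return token;
    }

    // Tokens listed in protectedWords skip stemming; all others are stemmed.
    static List<String> filter(List<String> tokens, Set<String> protectedWords) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(protectedWords.contains(token) ? token : toyStem(token));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("elegant", "elegance");
        // Empty set (like an empty protwords.txt): both tokens are stemmed.
        System.out.println(filter(tokens, Set.of()));           // [eleg, eleg]
        // "elegance" protected: it keeps its original form.
        System.out.println(filter(tokens, Set.of("elegance"))); // [eleg, elegance]
    }
}
```

With an empty protected set nothing is exempt, so every token is stemmed. Adding a word to the set keeps that word out of the stemmer.
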
fyr
  • @fyr: Yes, I used the Solr admin page to see the effect :). But EnglishPorterFilter uses protwords.txt, in which I included nothing. Then how is it doing this? – Romi Jun 29 '11 at 10:07
  • What is the use of protwords.txt? – Romi Jun 29 '11 at 10:07
  • No, it only uses protwords for words whose form you want to fix. It is a heuristic, so it will make mistakes. The English Porter algorithm uses the Snowball library. – fyr Jun 29 '11 at 10:08
  • I used it as : then what is protwords.txt here? – Romi Jun 29 '11 at 10:11
  • Look at my edit. protwords are words which are not stemmed: "protected words". – fyr Jun 29 '11 at 10:13
  • You mean that words I do not want stemmed should be included in the protwords.txt file? – Romi Jun 29 '11 at 10:27
  • @fyr let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/955/discussion-between-romi-and-fyr) – Romi Jun 29 '11 at 10:30