
I am using the MoreLikeThisComponent (MLT) to find similar documents. For one of the results, I noticed that the "interestingTerms" reported by MLT contained terms that were not part of the text analysis results.

Here are the terms identified during text analysis:

  • 1er
  • anlag
  • loesch
  • reservierung

And here is what the TermsComponent returned:

  • 1er
  • anlag
  • geloscht
  • losch
  • p12
  • reservierung
  • schneider.go

So, according to the text analysis results, the terms "p12" and "schneider.go" should not appear in the terms list returned by the TermsComponent. The term "geloscht" was replaced by "loesch" during text analysis and should therefore not appear in that list either.
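To make the mismatch explicit, the two lists above can be compared as plain sets. This is only a sketch over the values quoted in this question; the variable names are made up for illustration. Note that a literal comparison also reports "losch" versus "loesch", which differ only in spelling.

```python
# Term lists copied verbatim from the two lists in this question.
analysis_terms = {"1er", "anlag", "loesch", "reservierung"}
indexed_terms = {
    "1er", "anlag", "geloscht", "losch",
    "p12", "reservierung", "schneider.go",
}

# Terms the TermsComponent returned that the analysis did not produce.
unexpected = sorted(indexed_terms - analysis_terms)

# Terms the analysis produced that the TermsComponent did not return.
missing = sorted(analysis_terms - indexed_terms)

print("unexpected:", unexpected)
print("missing:", missing)
```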

My approach to text analysis: first, I remove parts of the text passed to the text field using PatternReplaceCharFilter. I do this because all documents contain repeating text parts. Those parts carry no semantic meaning; they only denote the type of text, the user who added it, and the date the text block was added.

The two additional terms returned by the TermsComponent come from the original text and were removed by PatternReplaceCharFilter.
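As a rough illustration of what I expect the stripping step to do, here is a minimal sketch using Python's `re` module in place of the Solr char filter. The header format, the regular expression, and the sample text are all hypothetical; my real pattern and documents look different.

```python
import re

# Hypothetical header layout: "[type | user | date] ". This is an assumption
# for illustration, not the pattern from my actual schema.
HEADER = re.compile(r"\[\w+ \| [\w.]+ \| \d{4}-\d{2}-\d{2}\]\s*")


def strip_headers(text: str) -> str:
    """Remove every header block, as a pattern-replace step would."""
    return HEADER.sub("", text)


# Made-up sample text containing one header block.
raw = "[P12 | schneider.go | 2011-12-07] Reservierung geloescht"
cleaned = strip_headers(raw)

# Very rough stand-in for tokenizing and lowercasing (no stemming).
tokens = cleaned.lower().split()
print(tokens)
```

If the filter works as intended, none of the header parts should survive into the token stream, which is what I would also expect of the indexed terms.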

I verified that the "interestingTerms" identified by MLT are the same as the terms returned by the TermsComponent. I also checked whether it makes a difference if the field stores a TermVector or not. In both cases the TermsComponent returns the same terms.

Since the terms used by MLT differ from the terms identified during text analysis, MLT returns too many documents.

Does anybody know why MLT uses, and the TermsComponent returns, terms that were not part of the text analysis results?

And does anybody know a solution?

For completeness: I'm using the Solr 4 Trunk binary build from 7th Dec. 2011.

javanna
Jan Rasehorn