I'm using Mahout with the Pearson correlation algorithm to find similar users based on their preferences for several items. The problem I'm running into is that Mahout and/or Pearson ignores users who select the same preference value for every item. Does anyone know if there is a way to configure Mahout to NOT ignore people who select the same preference value for every item?

SGT Grumpy Pants

1 Answer

It is not a question of configuration. The Pearson correlation is undefined in this case, so there can be no similarity computed between them using this metric.

Essentially, Pearson is the ratio of the two preference series' covariance to the product of their standard deviations. But when all the values in one or both series are identical, that series' standard deviation is 0, as is the covariance, so the correlation is 0/0.
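To see the 0/0 concretely, here is a minimal sketch in plain Java. It is not Mahout's code; the class name `PearsonDemo` and the method `pearson` are made up for illustration, and it simply applies the textbook formula to two equal-length arrays of preference values.

```java
public class PearsonDemo {

    // Textbook Pearson correlation: covariance / (stddev(x) * stddev(y)).
    // The 1/n normalization factors cancel, so raw sums are used instead.
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // For a constant series both cov and its variance are 0,
        // so this is the floating-point division 0.0 / 0.0.
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        double[] uniform = {3, 3, 3, 3}; // same preference for every item
        double[] varied = {1, 4, 2, 5};

        System.out.println(pearson(uniform, varied)); // prints NaN
        System.out.println(pearson(varied, varied));  // prints 1.0
    }
}
```

In Java floating-point arithmetic 0.0 / 0.0 evaluates to NaN, which is why no usable similarity comes out for a user with uniform preferences.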

(This and a few other Pearson gotchas are covered in Chapter 4 of Mahout in Action, and I'm the author of this part of the book and code.)

Sean Owen
  • Thank you. Is there another algorithm that Mahout supports that would work as well as Pearson but would allow users to select uniform preference values? – SGT Grumpy Pants Oct 17 '11 at 13:25
  • You could try EuclideanDistanceSimilarity. LogLikelihoodSimilarity is another good choice; it doesn't even use the pref value. – Sean Owen Oct 17 '11 at 13:48
  • Thank you for your responses. I have a follow-up question that is related but doesn't fit under this topic. I wonder if you might look at it? http://stackoverflow.com/questions/7821944/apache-mahout-euclidean-distance-unexpected-results Thank you. – SGT Grumpy Pants Oct 19 '11 at 13:23
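
As a rough illustration of why a distance-based measure copes with uniform preferences, here is a minimal sketch in plain Java. It does not reproduce Mahout's `EuclideanDistanceSimilarity`; the class `EuclideanDemo`, the method `similarity`, and the mapping 1 / (1 + distance) are assumptions chosen for illustration, and Mahout's exact scaling may differ.

```java
public class EuclideanDemo {

    // Hypothetical similarity: 1 / (1 + Euclidean distance).
    // Nothing is divided by a standard deviation, so a constant
    // series never produces 0/0.
    public static double similarity(double[] x, double[] y) {
        double sumSq = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sumSq += d * d;
        }
        return 1.0 / (1.0 + Math.sqrt(sumSq));
    }

    public static void main(String[] args) {
        double[] uniform = {3, 3, 3, 3}; // same preference for every item
        double[] varied = {1, 4, 2, 5};

        // Identical uniform users: distance 0, so similarity is 1.0.
        System.out.println(similarity(uniform, uniform));
        // Uniform vs. varied user: a finite value between 0 and 1.
        System.out.println(similarity(uniform, varied));
    }
}
```

The denominator here is always at least 1, so the result is defined for any pair of users, including those who give every item the same value.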