
Using Python 3.6. I am not getting logical results when using Manhattan distance for similarity measurement. Even compared to the results from Pearson and Euclidean correlation, the units for Euclidean and Manhattan look off.

I am working on a crude recommendation model that recommends similar items by measuring the similarity between a user's rating for a preferred item X and other users' ratings for the same item, then recommending the items of the users who match the requesting user most strongly.

The results I got are:

Pearson: 
 [('Men in Black II', 0.12754201365635218), ('Fried Green Tomatoes', 0.11361596992427059), ('Miami Vice', 0.11068770878125743), ('The Dark', 0.11035867466994702), ('Comanche Station', 0.10994620915146613), ('Terminator 3: Rise of the Machines', 0.10802689932238932), ('Stand by Me', 0.10797224471029637), ('Dancer in the Dark', 0.10241410378191894), ('Los Olvidados', 0.10044018848844877), ('A Shot in the Dark', 0.10036315249837004)]

Euclidean: 
 [('...And the Pursuit of Happiness', 1.0), ('12 Angry Men', 1.0), ('4 Little Girls', 1.0), ('4교시 추리영역', 1.0), ('8MM', 1.0), ('A Band Called Death', 1.0), ('A Blank on the Map', 1.0), ('A Dandy in Aspic', 1.0), ('A Date with Judy', 1.0), ('A Zona', 1.0)]

Manhattan: 
 [('...And the Pursuit of Happiness', 1.0), ('12 Angry Men', 1.0), ('4 Little Girls', 1.0), ('4교시 추리영역', 1.0), ('8MM', 1.0), ('A Band Called Death', 1.0), ('A Blank on the Map', 1.0), ('A Dandy in Aspic', 1.0), ('A Date with Judy', 1.0), ('A Zona', 1.0)]

Cosine: 
 [('...And the Pursuit of Happiness', 1.0), ('4 Little Girls', 1.0), ('4교시 추리영역', 1.0), ('8MM', 1.0), ('A Band Called Death', 1.0), ('A Blank on the Map', 1.0), ('A Dandy in Aspic', 1.0), ('A Date with Judy', 1.0), ('A Zona', 1.0), ('A.I. Artificial Intelligence', 1.0)]
user1940212

1 Answer


I cannot tell you why you get strange results without seeing your code. However, I can give you some explanation of the difference between the Pearson, Euclidean and Manhattan similarities between two vectors.

  1. Pearson: this can be thought of as the cosine between the two mean-centred vectors, and is therefore scale invariant. Thus, if two vectors are the same but scaled differently, this will be 1. With movie recommendations, I assume this means that if I rated movie 1: 2/5, movie 2: 1/5 and movie 3: 2/5, and you rated the same movies 4/5, 2/5 and 4/5 respectively, then we will have the same movies recommended to us.

  2. Euclidean: this is the usual way to measure the distance between vectors. Note that large differences are exaggerated and small differences are suppressed (small numbers squared become tiny, large numbers squared become huge). Thus, if two vectors almost agree everywhere, they will be regarded as very similar. Additionally, scale matters, so the example above would give a relatively large dissimilarity.

  3. Manhattan: this is similar to Euclidean in that scale matters, but it differs in that it does not suppress small differences. If two vectors differ slightly almost everywhere, those many small differences add up, so the Manhattan distance will still be large. On the other hand, a large difference in a single index will not have as big an impact on the final similarity as it would with the Euclidean distance.
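To make the difference concrete, here is a small sketch in NumPy using the hypothetical ratings from point 1 (my ratings vs. yours, the same pattern scaled by 2):

```python
import numpy as np

x = np.array([2., 1., 2.])  # my ratings of three movies
y = np.array([4., 2., 4.])  # your ratings: same pattern, scaled by 2

pearson = np.corrcoef(x, y)[0, 1]          # scale invariant -> 1.0
euclidean = np.sqrt(np.sum((x - y) ** 2))  # scale matters -> sqrt(4+1+4) = 3.0
manhattan = np.sum(np.abs(x - y))          # scale matters -> 2+1+2 = 5.0
print(pearson, euclidean, manhattan)
```

Pearson sees the two rating patterns as identical, while both distance measures report a sizeable dissimilarity.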

I assume the source of your confusion is that small dissimilarities add up to a large dissimilarity with Manhattan, but not with Pearson or Euclidean.

Ok, so upon looking at your code some more, I see that you use 1/(1+euclidean_distance) for the Euclidean similarity, but the raw manhattan_distance for the Manhattan similarity. Try this instead:

import numpy as np

def Manhattan(x, y):
    # similarity in (0, 1]: distance 0 -> 1, large distance -> close to 0
    return 1 / (1 + np.sum(np.abs(x - y)))
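As a quick sanity check (assuming x and y are NumPy arrays of ratings), the corrected formula maps a distance of 0 to a similarity of 1, and larger distances to smaller similarities:

```python
import numpy as np

def Manhattan(x, y):
    # 1/(1 + L1 distance): same shape as the Euclidean similarity above
    return 1 / (1 + np.sum(np.abs(x - y)))

a = np.array([1., 2., 3.])
print(Manhattan(a, a))                       # identical vectors -> 1.0
print(Manhattan(a, np.array([2., 3., 4.])))  # L1 distance 3 -> 0.25
```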

Ps. Sorry for any typos, I'm on my phone. Hopefully everything is still understandable.

Pps. Note that you can write np.linalg.norm(x-y) for the Euclidean distance between x and y, and np.linalg.norm(x-y, 1) for the Manhattan distance between x and y (instead of dealing with np.sqrt(np.sum((x-y)**2)) and np.sum(np.abs(x-y))).
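A short sketch (with made-up vectors) confirming that np.linalg.norm computes exactly those quantities:

```python
import numpy as np

x = np.array([1., 2., 3.])
y = np.array([4., 0., 3.])
d = x - y

# ord=2 (the default) is the Euclidean norm: sqrt of the sum of squares
assert np.isclose(np.linalg.norm(d), np.sqrt(np.sum(d ** 2)))
# ord=1 is the Manhattan norm: sum of absolute values
assert np.isclose(np.linalg.norm(d, 1), np.sum(np.abs(d)))
```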

Yngve Moe
  • My confusion is that the answers seem weird. When I use the Pearson correlation formula, the correlations I get are around 0.12, 0.11, etc. When I use the Euclidean correlation formula, return 1/(1+(np.sqrt(np.sum((x1-x2)**2)))), every answer shows a correlation of 1.0, which is weird. – user1940212 Apr 18 '18 at 21:56
  • When using Manhattan distance, I am getting distances like 674, 579, etc. Manhattan does not have a correlation formula, so is it meaningful to use distance? I added my code to my question. – user1940212 Apr 18 '18 at 22:03
  • Oh, you are using 1/(1+Euclidean), but not 1/(1+Manhattan). I'll edit my post and add this. – Yngve Moe Apr 19 '18 at 05:20
  • Thanks, one follow-up question. For Pearson and Euclidean I was using a correlation formula, but for Manhattan I only know of the distance formula. – user1940212 Apr 19 '18 at 18:01
  • If I were to use the Euclidean distance, the formula would simply be np.sqrt(np.sum((x1-x2)**2)). The reason I am using the correlation formula is that, for example, if a user X liked movie M and wants recommendations, the code finds the correlation between the ratings of movie M and the ratings of other movies, and returns the movies with the highest correlation. I am using the Manhattan distance because I only know its distance formula. Would you say it doesn't matter whether one uses correlation or distance to identify similarity? – user1940212 Apr 19 '18 at 18:35
  • Let me start by saying that I think the Euclidean distance (L2) makes more sense than the Manhattan distance (L1) in this case. However, you can define your own "Manhattan correlation" by computing `1/(1+L1(x, y))`; it should give results similar to the L2 correlation. But, as mentioned above, L2 suppresses small differences and emphasises large ones, which I think makes sense in this case. – Yngve Moe Apr 20 '18 at 08:18
  • Thanks! One last question: why are the results for Euclidean, Manhattan, and Cosine showing up as 1.0 for all the movies? How would the sorting work if all the correlation results are 1.0? Am I missing something? Can I include more decimals in the output? For example, if the results were movie 1: 0.87, movie 2: 0.97, etc., then the movie with the highest correlation could be shown first. Please clarify. – user1940212 Apr 20 '18 at 19:45
  • It is indeed odd that you get correlation 1 with Euclidean, Manhattan and Cosine. This seems like an error of sorts. Try with the np.linalg.norm thing. If you still get the same problem, then I am not sure what's going on. – Yngve Moe Apr 24 '18 at 12:14
  • I would try some serious debugging, because this indicates that some of your vectors are equal. Are you sure you are extracting the right dimension from your data matrix? You're not taking rows instead of columns or something like that? – Yngve Moe Apr 24 '18 at 12:16
  • I tried using np.linalg.norm(x-y) for Euclidean and np.linalg.norm(x-y, 1) for Manhattan, but I am getting NO results when I use those formulas. I am not sure why you think np.linalg.norm(x-y) and np.linalg.norm(x-y, 1) are the same as the formulas for Euclidean and Manhattan. In Euclidean we are squaring the result, but in np.linalg.norm(x-y) we are not. – user1940212 Apr 30 '18 at 21:58