I have a question about using the Pearson correlation coefficient in a recommender system.

I currently have three collections in my database: one for users, one for restaurants and one for reviews.

I have written a function which takes two user IDs and their lists of submitted reviews and returns a double, which is the Pearson correlation coefficient between the two users based on the reviews they've submitted.

The function works with two lists holding all the reviews each user has submitted. A loop then checks for reviews left on the same restaurant and collects them in a list, which is used to calculate the coefficient.

I just wanted to know if I'm using this coefficient the right way. I want to give recommendations to the first user. Is this coefficient a good indicator of how well another user's taste matches theirs?

And if it's not a good way to match users, what would be a better way to do so?

In case anyone wonders, here's my function which calculates the coefficient.

public static double CalculatePearsonCorrelation(Guid userId1, List<Review> user1Reviews, 
                                                Guid userId2, List<Review> user2Reviews)
    {
        //Resetting the dictionary
        restaurantRecommendations = new Dictionary<Guid, List<Review>>();
        //Matching the reviews with the corresponding user
        restaurantRecommendations.Add(userId1, user1Reviews);
        restaurantRecommendations.Add(userId2, user2Reviews);
        //Check if users have enough reviews to get a correct correlation
        if (restaurantRecommendations[userId1].Count < 4)
            throw new NotEnoughReviewsException("UserId " + userId1 + " doesn't contain enough reviews for this correlation");
        if (restaurantRecommendations[userId2].Count < 4)
            throw new NotEnoughReviewsException("UserId " + userId2 + " doesn't contain enough reviews for this correlation");                
        //This will be the list of reviews that are the same per subject for the two users.
        List<Review> shared_items = new List<Review>();
        //Loops through the list of reviews of the selected user (userId1)
        foreach (var item in restaurantRecommendations[userId1])
        {
            //Checks if they have any reviews on subjects in common
            if (restaurantRecommendations[userId2].Any(x => x.subj.Id == item.subj.Id))
            {
                //Adds these reviews to a list on which the correlation will be based
                shared_items.Add(item);
            }
        }
        //If they don't have anything in common, the correlation will be 0
        if (shared_items.Count() == 0)
            return 0;
        //I decided users need at least 4 subjects in common, else there won't be an accurate correlation
        if (shared_items.Count() < 4)
            throw new NotEnoughReviewsException("UserId " + userId1 + " and UserId " + userId2 + " don't have enough reviews in common for a correlation");
        ////////////////////////// Calculating the pearson correlation //////////////////////////
        double product1_review_sum = 0.0;
        double product2_review_sum = 0.0;
        double product1_rating = 0.0;
        double product2_rating = 0.0;
        double critics_sum = 0.0;
        foreach (Review item in shared_items)
        {
            //Look up each user's rating for this subject once instead of re-querying for every term
            double rating1 = restaurantRecommendations[userId1].First(x => x.subj.Id == item.subj.Id).rating;
            double rating2 = restaurantRecommendations[userId2].First(x => x.subj.Id == item.subj.Id).rating;
            product1_review_sum += rating1;
            product2_review_sum += rating2;
            product1_rating += rating1 * rating1;
            product2_rating += rating2 * rating2;
            critics_sum += rating1 * rating2;
        }
        //Calculate pearson correlation
        double num = critics_sum - (product1_review_sum * product2_review_sum / shared_items.Count);
        double denominator = Math.Sqrt((product1_rating - Math.Pow(product1_review_sum, 2) / shared_items.Count) * 
                                    (product2_rating - Math.Pow(product2_review_sum, 2) / shared_items.Count));
        //The denominator is 0 when either user gave every shared subject the same rating
        if (denominator == 0)
            return 0;

        return num / denominator;
    }
}
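For reference, the computation in the function above corresponds to the sample Pearson coefficient taken over the n shared restaurants. The symbols are my own notation, not names from the code: x_i and y_i are the two users' ratings for shared restaurant i, so `critics_sum` is the sum of x_i y_i, `product1_review_sum` is the sum of x_i, and `product1_rating` is the sum of x_i squared.

$$
r = \frac{\sum_{i=1}^{n} x_i y_i - \dfrac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}
{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right)\left(\sum_{i=1}^{n} y_i^2 - \dfrac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right)}}
$$

A small made-up example: with shared ratings x = (1, 2, 3, 4) and y = (2, 3, 4, 5), the sums are 10 and 14, the sum of products is 40, and the sums of squares are 30 and 54.

$$
r = \frac{40 - \frac{10 \cdot 14}{4}}{\sqrt{\left(30 - \frac{10^2}{4}\right)\left(54 - \frac{14^2}{4}\right)}} = \frac{5}{\sqrt{5 \cdot 5}} = 1
$$

So r is 1 even though the second user rates every restaurant one point higher: the coefficient measures whether the two users' ratings move together, not whether they give the same scores.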
  • The larger the commonality of the two users, the better the fit is going to be. If the two users only have one or two restaurants that both have gone to, your fit isn't going to be very accurate. It may be better to use age and gender to get a larger commonality. – jdweng Sep 23 '16 at 12:20
  • I want the correlation to be based on their taste. If they both reviewed the same restaurant, the chance of their taste being the same is high enough to make recommendations. Thoughts? –  Sep 23 '16 at 12:28
  • Try type of restaurant : Italian, Mexican, Chinese. You are not going to get good fit on a small common sample. – jdweng Sep 23 '16 at 12:44
  • Okay, let's say I'll just use the type of a restaurant. I was wondering about the accuracy of the correlation. If the first user has 5 reviews and the second one has 7, the shared_items list will only contain 5 reviews (if they have the same type of restaurant in common). How will this impact the correlation, and do you have any idea on how to do this better? –  Sep 27 '16 at 09:32
  • My issue is when there are zero or 1 items in common. Yes, going to the same restaurant is better, but only if the number of fits is high. So you have a trade-off to get a higher number of fits. No correlation is perfect; there are always errors. Look at exit polling during elections. Usually the accuracy has an error of 2%-4%. I would expect your error under any condition would be a lot higher. You are asking for a good indicator. What is GOOD: 5%, 10%, 15% error? Your correlation is better than not using a correlation, but you will never get 100% accuracy. I don't have a better method. – jdweng Sep 27 '16 at 09:43
  • Thanks a lot, you've been a great help for me. I'm still not sure about the error percentage, I'll have to ask my product owner. But coming back on my code. It only checks each review for the first user, but how can I say which review it has to match with from the second user when he has more reviews? And does this matter when calculating the coefficient? –  Sep 27 '16 at 10:04
  • The correlation coefficient is a combination of two numbers. In this case, the number of matches and the ratings. One good match can yield better results than 10 matches with poorer ratings. Read the wiki article: https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient – jdweng Sep 27 '16 at 11:20
  • Again, thanks a lot! If you'd submit an answer, I can mark it as the answer –  Sep 27 '16 at 12:08
  • I had another question. I have done research on different correlation coefficients, and I wanted to know what the better option is: The pearson correlation or the spearman's correlation. I'm using a basic rating system, with ratings ranging from 1 until 5 per restaurant. Would it make sense to use a spearman's correlation instead of a pearson correlation? –  Sep 30 '16 at 11:21
  • Not sure. See the Wiki article. I don't think Spearman's is going to be good for a small number of matches because the rank is going to skew results. Wiki: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient – jdweng Sep 30 '16 at 12:51
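
On the question raised in the comments about which of the second user's reviews each review gets matched with: one way to make the pairing explicit is to key each user's ratings by restaurant ID, so that reviews of restaurants the other user never rated simply drop out. The following is only a minimal sketch, assuming one rating per user per restaurant. `PearsonSketch`, `Correlate` and `minShared` are hypothetical names that are not part of the original code, and returning null instead of throwing or returning 0 is a design choice made for the sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PearsonSketch
{
    // Hypothetical helper: each dictionary maps a restaurant ID to that user's rating.
    // Returns null when the coefficient cannot be computed meaningfully.
    public static double? Correlate(
        IReadOnlyDictionary<Guid, double> ratingsA,
        IReadOnlyDictionary<Guid, double> ratingsB,
        int minShared = 4)
    {
        // Pair ratings by restaurant ID; restaurants only one user reviewed drop out.
        var pairs = ratingsA
            .Where(kv => ratingsB.ContainsKey(kv.Key))
            .Select(kv => (X: kv.Value, Y: ratingsB[kv.Key]))
            .ToList();

        if (pairs.Count < minShared)
            return null;

        double meanX = pairs.Average(p => p.X);
        double meanY = pairs.Average(p => p.Y);

        // Mean-centred form of the same coefficient.
        double covariance = pairs.Sum(p => (p.X - meanX) * (p.Y - meanY));
        double spreadX = pairs.Sum(p => (p.X - meanX) * (p.X - meanX));
        double spreadY = pairs.Sum(p => (p.Y - meanY) * (p.Y - meanY));

        double denominator = Math.Sqrt(spreadX * spreadY);

        // Undefined when either user gave every shared restaurant the same rating.
        return denominator == 0 ? (double?)null : covariance / denominator;
    }
}
```

With this shape it does not matter that one user has 5 reviews and the other 7: only the restaurants present in both dictionaries contribute, and each rating is paired with the other user's rating for that same restaurant.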
