
I have found what must be dozens of articles on Towards Data Science, Medium, etc. of people building recommendation engines with IMDb data (based on the ratings that users gave to movies, which movies should we recommend to those users?). These articles begin with 'memory-based approaches': user-based collaborative filtering and item-based collaborative filtering. I have been tasked with building a recommendation engine, and since none of the suits really care or know anything about this, I want to do the bare minimum (which seems to be user-based collaborative filtering).

Problem is, all of my data is binary: no ratings, just the items that users bought. Based on the items that other users bought, should we recommend items to similar users? This is actually similar to the cartoons that all of the Medium articles have stolen from each other, but none of them give an example of how to do that.

All of the articles use Pearson correlation or cosine similarity to determine user similarity. Can I use these approaches with binary dimensions (bought or not)? If so, how? And if not, is there a different way to measure user similarity?

I am working in Python, by the way, and I was thinking of maybe using Hamming distance (is there a reason that wouldn't work well?).
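For concreteness, something like this is what I had in mind (toy buy vectors):

from scipy.spatial.distance import hamming

# two users' buy vectors: 1 = bought, 0 = not bought
u = [1, 1, 1, 0, 1]
v = [1, 1, 1, 0, 0]

# scipy's hamming() returns the fraction of positions that differ
print(hamming(u, v))  # 0.2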

amchugh89

3 Answers

Similarity-score-based approaches do work even with binary dimensions. When you have scores, two similar users may look like [5, 3, 2, 0, 1] and [4, 3, 3, 0, 0], whereas in your case they would look something like [1, 1, 1, 0, 1] and [1, 1, 1, 0, 0].
from scipy.spatial.distance import cosine

# scipy's cosine() returns a distance, so 1 - cosine() is the similarity
print(1 - cosine([5, 3, 2, 0, 1], [4, 3, 3, 0, 0]))  # 0.961161313666907
print(1 - cosine([1, 1, 1, 0, 1], [1, 1, 1, 0, 0]))  # 0.8660254037844386
Another approach: if you can get the number of times a user bought each product, that count can be used as a rating, and similarities can then be calculated the same way, as sketched below.
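A minimal sketch of that count-based variant, assuming a pandas DataFrame of purchase events (column names and data are made up):

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical purchase log: one row per purchase event
log = pd.DataFrame({
    'user': ['a', 'a', 'a', 'b', 'b', 'c'],
    'item': ['x', 'x', 'y', 'x', 'z', 'y'],
})

# user-by-item matrix of purchase counts
counts = pd.crosstab(log['user'], log['item'])

# pairwise cosine similarity between the users' count vectors
sims = cosine_similarity(counts)
print(pd.DataFrame(sims, index=counts.index, columns=counts.index))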
  • Thanks for the answer. I thought of this, but ended up using Hamming distance instead, just because I forgot the formula for distance in 3+ dimensions and already had a Hamming distance formula – amchugh89 Nov 25 '19 at 18:29

The data you have is implicit feedback, which means an interaction does not necessarily indicate the user's interest; it is just an interaction. An interaction value of 1 and an interaction value of 1000 make no difference in this case; they both show interaction and nothing else, so memory-based algorithms are useless here. If you are not familiar with neural networks, then you should at least use matrix factorization techniques to make a meaningful recommendation from this data. You can start with the Surprise library, which has a bunch of matrix factorization models.

It will be better if you use ALS as the optimization technique, but SGD will also do the job. If you are OK with deep learning, I can point you to the sources of the best work so far.

I once used the non-negative matrix factorization (NMF) algorithm in Surprise for data like yours, and the results were good enough.
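For reference, a minimal sketch of that setup with Surprise, assuming a pandas DataFrame of purchases (column names and data are made up):

import pandas as pd
from surprise import Dataset, NMF, Reader

# hypothetical purchase log: 1 = bought (only observed interactions appear)
df = pd.DataFrame({
    'user': ['a', 'a', 'b', 'b', 'c'],
    'item': ['x', 'y', 'x', 'z', 'y'],
    'bought': [1, 1, 1, 1, 1],
})

# Surprise treats the third column as a rating; binary data fits scale (0, 1)
reader = Reader(rating_scale=(0, 1))
data = Dataset.load_from_df(df[['user', 'item', 'bought']], reader)

algo = NMF(n_factors=15)
algo.fit(data.build_full_trainset())

# predicted affinity of user 'a' for an item they haven't bought
print(algo.predict('a', 'z').est)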

Abdirahman
  • Could you share a link to your NNMF code? – Sujit Feb 03 '21 at 10:21
  • If you are looking for starter code for these algorithms, check this repo: https://github.com/jiristo/recsys_matrixfactorization/blob/master/recsys.ipynb – Abdirahman Feb 03 '21 at 11:57

It seems that in your situation the best approach would be collaborative filtering. You don't need scores; all you need is a user-item interaction matrix. The simplest algorithm in this case is Alternating Least Squares (ALS).

There are already a few implementations in Python, for instance this one. There is also an implementation in PySpark's recommendation module.
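As a minimal sketch, here is ALS via the implicit library (one common Python implementation, and possibly the one linked above; its fit/recommend signatures have changed across versions, so check the docs for yours):

import numpy as np
import scipy.sparse as sparse
from implicit.als import AlternatingLeastSquares

# hypothetical binary user-item matrix: rows = users, columns = items
user_items = sparse.csr_matrix(np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 1, 0],
], dtype=np.float32))

model = AlternatingLeastSquares(factors=16, iterations=20)
model.fit(user_items)  # recent versions expect user-item orientation

# top 2 items for user 0, excluding items they already bought
ids, scores = model.recommend(0, user_items[0], N=2)
print(ids, scores)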

Danylo Baibak