I'm trying to set up an item-item matrix for a collaborative filtering system. I have a movie review system, so I want a matrix whose columns are the movies (titles or IDs) and whose rows are the reviewers.

So I tried pivoting a pandas DataFrame with that information, and it worked, but only on a small amount of data. I have around 4,500,000 reviews, and pandas pivoting can't handle that much data. So I changed my approach and tried to create a sparse matrix with scipy.sparse.csr_matrix. The problem here is that my movie IDs and reviewer IDs are strings and the ratings are doubles, and I get an error saying that scipy tried to convert those values to int.

The pandas approach (overall is the 5-star rating given by the reviewer):

import pandas as pd 
import numpy as np

reviews = pd.read_json('reviews_Movies_and_TV.json', lines=True)
reviews = reviews[pd.notnull(reviews['reviewText'])]

movie_titles = pd.read_json('meta_Movies.json', lines=True)
reviews = pd.merge(reviews, movie_titles, on='asin')

ratings = pd.DataFrame(reviews.groupby('title')['overall'].mean())
ratings['number_of_ratings'] = reviews.groupby('title')['overall'].count()

movie_matrix = reviews.pivot_table(index='reviewerID', columns='title', values='overall').fillna(0)

The csr matrix approach:

import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix

reviews = pd.read_json('reviews_Movies_and_TV.json', lines=True)
reviews = reviews[pd.notnull(reviews['reviewText'])]
reviews = reviews.filter(['reviewerID', 'asin', 'overall'])

movie_titles = pd.read_json('meta_Movies_and_TV.json', lines=True)
movie_titles = movie_titles.filter(['asin', 'title'])
reviews = pd.merge(reviews, movie_titles, on='asin')

ratings = pd.DataFrame(reviews.groupby('title')['overall'].mean())
ratings['number_of_ratings'] = reviews.groupby('title')['overall'].count()

reviews_u = list(reviews.reviewerID.unique())
movie_titles_u = list(reviews.asin.unique())

data = np.array(reviews['overall'].tolist(),copy=False)
row = np.array(pd.Series(reviews.reviewerID).astype(pd.api.types.CategoricalDtype(categories = reviews_u)),copy=False)
col = np.array(pd.Series(reviews.asin).astype(pd.api.types.CategoricalDtype(categories = movie_titles_u)),copy=False)
sparse_matrix = csr_matrix((data, (row, col)), shape=(len(reviews_u), len(movie_titles_u)))

df = pd.DataFrame(sparse_matrix.toarray())

So now I'm stuck and I don't know how to solve this. Pivoting with pandas is off the table, unless there is another pandas solution I haven't found. And csr_matrix could work if there were a way to associate a reviewer or movie ID such as 'X953D' with an integer (which I haven't found yet).

Marilou

2 Answers

You can use two dictionaries, one for movie titles and one for reviewer IDs. A Python dict stores each string as a key and returns an int value for it. It's similar to the word-to-ID dictionary used for word embeddings.
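
As a rough illustration of the two-dictionary idea, here is a minimal sketch. The tiny `reviews` frame and the names `reviewer_to_idx`, `movie_to_idx` and `idx_to_movie` are made up for the example; only the column names (`reviewerID`, `asin`, `overall`) come from the question. It assumes each reviewer rates a given movie at most once, since csr_matrix sums duplicate (row, col) entries.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-in for the merged reviews frame (hypothetical values).
reviews = pd.DataFrame({
    "reviewerID": ["A1", "A2", "A1", "A3"],
    "asin": ["X953D", "X953D", "B0002", "B0002"],
    "overall": [5.0, 3.0, 4.0, 1.0],
})

# One dictionary per axis: every distinct string gets the next integer,
# in order of first appearance.
reviewer_to_idx = {rid: i for i, rid in enumerate(reviews["reviewerID"].unique())}
movie_to_idx = {asin: j for j, asin in enumerate(reviews["asin"].unique())}

# Translate the string columns into integer row/column positions.
row = reviews["reviewerID"].map(reviewer_to_idx).to_numpy()
col = reviews["asin"].map(movie_to_idx).to_numpy()
data = reviews["overall"].to_numpy(dtype=np.float64)

# Rows are reviewers, columns are movies; unrated cells stay implicit zeros.
sparse_matrix = csr_matrix(
    (data, (row, col)),
    shape=(len(reviewer_to_idx), len(movie_to_idx)),
)

# Reverse lookup, to get from a column index back to the movie ID.
idx_to_movie = {j: asin for asin, j in movie_to_idx.items()}
```

A single rating can then be read with something like `sparse_matrix[reviewer_to_idx["A3"], movie_to_idx["B0002"]]`, and the matrix never has to be converted to a dense array.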

joyzaza
  • Could you give me an example? Python is not my strongest skill. I actually thought of that solution, but I find it kind of messy... but if there is absolutely no other idea, then it'll have to do – Marilou Apr 24 '19 at 06:53
  • You can search for the dict part of word embeddings in Python – joyzaza Apr 24 '19 at 07:04
  • What do word embeddings have to do with that? I mean, word embeddings produce vectors, not numbers – Marilou Apr 24 '19 at 07:09
  • You need a word-to-id mapping first. It – joyzaza Apr 24 '19 at 07:11
  • OK, maybe not word embeddings, but what I had in mind is a dictionary with ascending numbers that maps reviewer IDs, and another one that maps movie IDs. Like movie 'AX934' is 1, movie 'AX935' is 2, etc... but it feels kind of messy. There has got to be a better solution... If there is no better idea by tomorrow, when I have some time, I will definitely try this. Thanks! – Marilou Apr 24 '19 at 07:17

Please see this post (if the question is still relevant). Basically, there is no need to convert to np.array. All you need is the integer codes of the categorical column, which .cat.codes returns:

row = reviews.reviewerID.astype(pd.api.types.CategoricalDtype(categories = reviews_u)).cat.codes
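
For completeness, a sketch of how the same trick might be applied to both axes and fed into csr_matrix. The toy `reviews` frame and the name `ratings_matrix` are assumptions for illustration; `reviews_u` and `movie_titles_u` follow the naming in the question.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-in for the merged reviews frame (hypothetical values).
reviews = pd.DataFrame({
    "reviewerID": ["A1", "A2", "A1", "A3"],
    "asin": ["X953D", "X953D", "B0002", "B0002"],
    "overall": [5.0, 3.0, 4.0, 1.0],
})

# Distinct IDs in order of first appearance, as in the question.
reviews_u = reviews["reviewerID"].unique()
movie_titles_u = reviews["asin"].unique()

# .cat.codes yields the integer position of each value within `categories`
# (a value that is not in `categories` would get the code -1).
row = reviews["reviewerID"].astype(
    pd.api.types.CategoricalDtype(categories=reviews_u)
).cat.codes.to_numpy()
col = reviews["asin"].astype(
    pd.api.types.CategoricalDtype(categories=movie_titles_u)
).cat.codes.to_numpy()
data = reviews["overall"].to_numpy(dtype=float)

# Rows are reviewers, columns are movies.
ratings_matrix = csr_matrix(
    (data, (row, col)),
    shape=(len(reviews_u), len(movie_titles_u)),
)
```

Because the codes index into `reviews_u` and `movie_titles_u`, `movie_titles_u[j]` gives back the movie ID for column `j`, so no separate dictionary is needed.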

Nazz