I'm trying to set up an item-item matrix for a collaborative filtering system. I have a movie review system, so I want a matrix where the columns are the movies (titles or ids) and the rows are the reviewers.
So I tried pivoting a pandas DataFrame with that information, and it worked, but only with a small amount of data. I have around 4,500,000 reviews, and pandas pivoting can't handle that much data. So I changed my approach and tried to create a sparse matrix with scipy.sparse's csr_matrix. The problem here is that my movie ids and reviewer ids are strings and the ratings are doubles, and I get an error because scipy tries to convert those values to int.
The pandas approach ('overall' is the 5-star rating given by the reviewer):
import pandas as pd
import numpy as np
reviews = pd.read_json('reviews_Movies_and_TV.json', lines=True)
reviews = reviews[pd.notnull(reviews['reviewText'])]
movie_titles = pd.read_json('meta_Movies.json', lines=True)
reviews = pd.merge(reviews, movie_titles, on='asin')
ratings = pd.DataFrame(reviews.groupby('title')['overall'].mean())
ratings['number_of_ratings'] = reviews.groupby('title')['overall'].count()
movie_matrix = reviews.pivot_table(index='reviewerID', columns='title', values='overall').fillna(0)
The csr matrix approach:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
reviews = pd.read_json('reviews_Movies_and_TV.json', lines=True)
reviews = reviews[pd.notnull(reviews['reviewText'])]
reviews = reviews.filter(['reviewerID', 'asin', 'overall'])
movie_titles = pd.read_json('meta_Movies_and_TV.json', lines=True)
movie_titles = movie_titles.filter(['asin', 'title'])
reviews = pd.merge(reviews, movie_titles, on='asin')
ratings = pd.DataFrame(reviews.groupby('title')['overall'].mean())
ratings['number_of_ratings'] = reviews.groupby('title')['overall'].count()
reviews_u = list(reviews.reviewerID.unique())
movie_titles_u = list(reviews.asin.unique())
data = np.array(reviews['overall'].tolist(),copy=False)
row = np.array(pd.Series(reviews.reviewerID).astype(pd.api.types.CategoricalDtype(categories = reviews_u)),copy=False)
col = np.array(pd.Series(reviews.asin).astype(pd.api.types.CategoricalDtype(categories = movie_titles_u)),copy=False)
sparse_matrix = csr_matrix((data, (row, col)), shape=(len(reviews_u), len(movie_titles_u)))
df = pd.DataFrame(sparse_matrix.toarray())
So now I'm stuck and I don't know how to solve this. Pivoting with pandas is off the table, unless there is another pandas solution I haven't found. And csr_matrix could work if there were a way to associate a string id such as 'X953D' (reviewer or movie) with an integer, which I haven't found yet.
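On that last point, one possible way to associate string ids with integers is pd.factorize, which returns integer codes together with the array of unique labels. Below is a minimal sketch, not a tested solution for the full dataset: the toy frame stands in for the merged reviews frame, and its ids and ratings are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-in for the merged reviews frame (hypothetical ids and ratings).
reviews = pd.DataFrame({
    "reviewerID": ["X953D", "A1B2C", "X953D", "Q7Z8Y"],
    "asin": ["B0001", "B0001", "B0002", "B0003"],
    "overall": [5.0, 3.0, 4.0, 2.0],
})

# factorize maps each distinct string to an integer code (0..n-1, in order
# of first appearance) and also returns the unique labels, so that
# uniques[code] gives back the original string.
row, reviewer_ids = pd.factorize(reviews["reviewerID"])
col, movie_ids = pd.factorize(reviews["asin"])
data = reviews["overall"].to_numpy(dtype=np.float64)

# Build the reviewers x movies matrix from integer coordinates.
sparse_matrix = csr_matrix(
    (data, (row, col)),
    shape=(len(reviewer_ids), len(movie_ids)),
)

# Look up one rating by the original string ids.
r = reviewer_ids.get_loc("X953D")
c = movie_ids.get_loc("B0002")
rating = sparse_matrix[r, c]
```

Two things to keep in mind with this sketch. If the same reviewer rated the same movie more than once, csr_matrix sums the duplicate entries, whereas pivot_table averages them by default, so deduplicating or averaging first may be needed. Also, calling toarray() on the full matrix would build the dense array again, which is likely to hit the same memory limits as the pivot.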