I have a CSR-format sparse matrix (scipy.sparse.csr_matrix) with around 100,000 rows and 10,000 columns. The rows represent users, the columns represent items, and each value is the rating that user gave that item.
I am trying to calculate the correlation between pairs of users. So I loop over each user (call it user_a) and use matrix operations to get the correlation of user_a against all other users.
The first step is to generate the current user matrix. This matrix has one row per user; each row holds user_a's ratings, masked to the items that user_a and that other user have both rated.
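To make the goal concrete, here is a minimal sketch on a toy matrix. The 3x4 ratings values are made up for illustration, and a 0 is assumed to mean "not rated"; the construction is the same repeat-and-mask approach as in my code.

```python
import numpy as np
from scipy import sparse

# Toy data (made up for illustration): 3 users x 4 items, 0 = not rated.
ratings = sparse.csr_matrix(np.array([
    [5, 0, 3, 0],
    [4, 2, 0, 0],
    [0, 1, 4, 0],
]))
user_id = 0  # user_a

# Repeat user_a's row once per user in the matrix.
user_row = ratings.getrow(user_id)
row_index = np.zeros(ratings.shape[0], dtype=int)
repeated = user_row[row_index, :]

# Element-wise multiply by the boolean "has rated" mask, so row i keeps
# user_a's ratings only at the items user i has also rated.
user_matrix = repeated.multiply(ratings.astype(bool))

print(user_matrix.toarray())
# [[5 0 3 0]
#  [5 0 0 0]
#  [0 0 3 0]]
```

Row 1, for example, keeps only item 0, because that is the only item both user 0 and user 1 have rated.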
My code at the moment is:
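import numpy
from scipy import sparse
# ratings is the big original matrix (users x items, CSR)
R = ratings.getrow(user_id)
user_matrix = sparse.csr_matrix(R)
# repeat user_a's row once per user
user_matrix = user_matrix[numpy.array([0]).repeat(ratings.shape[0]), :]
# keep only the items that each other user has also rated
user_matrix = user_matrix.multiply(ratings.astype(bool))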
(https://stackoverflow.com/a/25342156/947194)
But these lines take 4 seconds for a user with just 500 items, and I need to run them for every user (100,000 times), so this is too slow.
I tried generating user_matrix using vstack instead, but that took 7 seconds.
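A vstack-based construction of the same repeated-row matrix can be sketched roughly as follows. This is an illustrative reconstruction on made-up toy data, not necessarily the exact code that was timed; the name user_matrix_vstack is only for this example.

```python
import numpy as np
from scipy import sparse

# Toy data (made up for illustration): 3 users x 4 items, 0 = not rated.
ratings = sparse.csr_matrix(np.array([
    [5, 0, 3, 0],
    [4, 2, 0, 0],
    [0, 1, 4, 0],
]))
user_id = 0  # user_a

# Stack one copy of user_a's row per user instead of fancy-indexing.
R = ratings.getrow(user_id)
stacked = sparse.vstack([R] * ratings.shape[0], format="csr")

# Same masking step as before.
user_matrix_vstack = stacked.multiply(ratings.astype(bool))
```

Building a Python list with one sparse row per user and concatenating it is a plausible reason this variant is slower at 100,000 rows, though I have not profiled it.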
Is there a way to make these lines faster?