
I'm doing an assignment in which I am trying to build a collaborative filtering model for the Netflix Prize data. The data is in a CSV file, which I easily imported into a data frame. I now need to create a sparse matrix with users as the rows and movies as the columns, where each cell holds the corresponding rating. To fill in the values I currently loop over every row of the data frame, which takes a lot of time in R. Can anyone suggest a better approach? Here is the sample code and data:

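library(Matrix)

buildUserMovieMatrix <- function(trainingData)
{
  # Start from an all-zero sparse matrix: one row per user ID, one column per movie ID
  UIMatrix <- Matrix(0, nrow = max(trainingData$UserID), ncol = max(trainingData$MovieID), sparse = TRUE);
  # Fill in one rating per iteration (this loop is the slow part)
  for(i in 1:nrow(trainingData))
  {
    UIMatrix[trainingData$UserID[i], trainingData$MovieID[i]] = trainingData$Rating[i];
  }
  return(UIMatrix);
}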

Sample of data in the dataframe from which the sparse matrix is being created:

    MovieID UserID  Rating
1       1      2       3
2       2      3       3
3       2      4       4
4       2      6       3
5       2      7       3

So in the end I want something like this: The columns are the movie IDs and the rows are the user IDs

    1   2   3   4   5   6   7
1   0   0   0   0   0   0   0
2   3   0   0   0   0   0   0
3   0   3   0   0   0   0   0
4   0   4   0   0   0   0   0
5   0   0   0   0   0   0   0
6   0   3   0   0   0   0   0
7   0   3   0   0   0   0   0

So the interpretation is: user 2 rated movie 1 with 3 stars, user 3 rated movie 2 with 3 stars, and so on for the other users and movies. There are about 8,500,000 rows in my data frame, and my code takes about 30-45 minutes to create this user-item matrix. I would appreciate any suggestions.

user37940

2 Answers


The Matrix package has a constructor made especially for your type of data:

library(Matrix)
UIMatrix <- sparseMatrix(i = trainingData$UserID,
                         j = trainingData$MovieID,
                         x = trainingData$Rating)
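
As a quick check, here is a minimal sketch that runs the same call on the five sample rows from the question. The data frame is typed in by hand for illustration. When dims is not given, sparseMatrix sizes the result from the largest row and column index it sees.

```r
library(Matrix)

# Sample rows copied from the question (hand-typed for illustration)
trainingData <- data.frame(
  MovieID = c(1, 2, 2, 2, 2),
  UserID  = c(2, 3, 4, 6, 7),
  Rating  = c(3, 3, 4, 3, 3)
)

# i = row index (user), j = column index (movie), x = value stored in the cell
UIMatrix <- sparseMatrix(i = trainingData$UserID,
                         j = trainingData$MovieID,
                         x = trainingData$Rating)

dim(UIMatrix)    # 7 2: rows up to max(UserID), columns up to max(MovieID)
UIMatrix[4, 2]   # 4: the rating user 4 gave movie 2
UIMatrix[1, 1]   # 0: user 1 has no rating stored
```

No loop is involved, so the cost does not grow with one sparse-matrix assignment per rating.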

Otherwise, you might like to know about a handy feature of the [ function known as matrix indexing. You could have tried:

buildUserMovieMatrix <- function(trainingData) {
  UIMatrix <- Matrix(0, nrow = max(trainingData$UserID),
                        ncol = max(trainingData$MovieID), sparse = TRUE);
  UIMatrix[cbind(trainingData$UserID,
                 trainingData$MovieID)] <- trainingData$Rating;
  return(UIMatrix);
}

(but I would definitely recommend the sparseMatrix approach over this.)
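
To illustrate what matrix indexing means, here is a small sketch on an ordinary base R matrix. The values are made up for illustration. Indexing with a two-column matrix treats each row of that matrix as one (row, column) pair, so all cells are assigned in a single vectorized call.

```r
# A plain 3 x 3 matrix of zeros
m <- matrix(0, nrow = 3, ncol = 3)

# Each row of idx is one (row, column) pair: (1,3), (2,1), (3,2)
idx <- cbind(c(1, 2, 3), c(3, 1, 2))

# Assign one value per pair, with no explicit loop
m[idx] <- c(10, 20, 30)

m[1, 3]   # 10
m[2, 1]   # 20
m[3, 2]   # 30
```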

flodel

This will probably be faster than a loop.

library(reshape2)
m <- dcast(df,UserID~MovieID,fill=0)[-1]
m
#   1 2
# 1 3 0
# 2 0 3
# 3 0 4
# 4 0 3
# 5 0 3

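If you use data.table, it will be a lot faster:

library(data.table)
DT <- as.data.table(df)
# [, -1, with = FALSE] drops the UserID column; a bare [-1] on a data.table would drop the first row instead
m  <- dcast(DT,UserID~MovieID,fill=0)[, -1, with = FALSE]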

And as I'm sure someone will point out, you can use this instead

setDT(df)
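# [, -1, with = FALSE] drops the UserID column; a bare [-1] on a data.table would drop the first row instead
m  <- dcast(df,UserID~MovieID,fill=0)[, -1, with = FALSE]

This converts df to a data.table in place (without making a copy). If your data set is enormous, that can make a difference.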

jlhoward
    There is a slight problem with this approach: it does not map the user IDs and movie IDs correctly. For example, user 11 was missing from the training data, so row 11 now holds the ratings for user 12, and all subsequent rows are shifted by one. I have 10916 users in my training set and want to keep all of them in the user-item matrix; if a user is missing from the training data, that whole row can be zero. This would prevent any mismatch between the training data frame and the matrix. Can you suggest another approach? Thanks. – user37940 Oct 05 '14 at 23:24
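
One possible way to handle the shifted rows described in this comment is sketched below, assuming user and movie IDs are positive integers that can be used directly as row and column indices. Because sparseMatrix indexes by ID rather than by order of appearance, a user with no ratings is left as an all-zero row, and the dims argument fixes the total size. The toy data and the names nUsers and nMovies are hypothetical; in the commenter's case nUsers would presumably be 10916.

```r
library(Matrix)

# Toy data: user 11 has no ratings (hypothetical values for illustration)
trainingData <- data.frame(
  UserID  = c(10, 12),
  MovieID = c(1, 2),
  Rating  = c(5, 4)
)

nUsers  <- 12   # assumed total number of users, including those without ratings
nMovies <- 2    # assumed total number of movies

# dims fixes the matrix size regardless of which IDs appear in the data
UIMatrix <- sparseMatrix(i = trainingData$UserID,
                         j = trainingData$MovieID,
                         x = trainingData$Rating,
                         dims = c(nUsers, nMovies))

dim(UIMatrix)         # 12 2
sum(UIMatrix[11, ])   # 0: the missing user is an all-zero row
UIMatrix[12, 2]       # 4: user 12 stays in row 12
```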