Hi, over the last few days I have run into a small/big problem.

I have a transaction dataset with 1 million rows and two columns (client ID and product ID), and I want to transform it into a binary matrix. I used the reshape and spread functions, but in both cases the process used all 64 MB of RAM and RStudio/R crashed. Because I only use 1 CPU, the process also takes a lot of time. My question is: what is the next step in this transition from small to big data? How can I use more CPUs?

I searched and found a couple of solutions, but I need an expert opinion:

1 - Using SparkR?

2 - H2O.ai? http://h2o.ai/product/enterprise-support/

3 - Revolution Analytics? http://www.revolutionanalytics.com/big-data

4 - Going to the cloud, like Microsoft Azure?

If needed, I can use a virtual machine with a lot of cores, but I need to know the smoothest way to make this transition.

My specific problem

I have this data.frame (but with 1 million rows):

Sell <- data.frame(UserId = c(1, 1, 1, 2, 2, 3, 4), Code = c(111, 12, 333, 12, 111, 2, 3))

and I did:

library(tidyr)   # for spread()

Sell[, 3] <- 1                  # indicator column, created as V3
test <- spread(Sell, Code, V3)  # one column per product code

This works with a small dataset, but with 1 million rows it takes a long time (12 hours) and crashes because my maximum RAM is 64 MB. Any suggestions?

    Your question is too broad and asks for opinions (both are off-topic). Show your actual problem (with a reproducible example) and someone might offer a viable alternative. Probably you can stay in vanilla R without parallelization. – Roland Nov 19 '15 at 10:56
  • Hi Roland, thanks for your comment. I've added the example just now. Regards – Kardu Nov 19 '15 at 11:27

2 Answers


You don't say what you want to do with the result, but the most efficient way to create such a matrix is a sparse matrix.

The result of spread is a dense matrix-like object that wastes a lot of RAM on all those NA values:

test
#  UserId  2  3 12 111 333
#1      1 NA NA  1   1   1
#2      2 NA NA  1   1  NA
#3      3  1 NA NA  NA  NA
#4      4 NA  1 NA  NA  NA

You can avoid this with a sparse matrix, which internally is still basically a long-format structure, but has methods for matrix operations.

library(Matrix)
# convert both columns to factors: their integer codes become the row/column
# indices and their levels become the dimnames
Sell[] <- lapply(Sell, factor)
test1 <- sparseMatrix(i = as.integer(Sell$UserId), 
                      j = as.integer(Sell$Code), 
                      x = rep(1, nrow(Sell)), 
                      dimnames = list(levels(Sell$UserId), 
                                      levels(Sell$Code)))
#4 x 5 sparse Matrix of class "dgCMatrix"
#  2 3 12 111 333
#1 . .  1   1   1
#2 . .  1   1   .
#3 1 .  .   .   .
#4 . 1  .   .   .

You would need even less RAM with a logical sparse matrix:

test2 <- sparseMatrix(i = as.integer(Sell$UserId), 
                      j = as.integer(Sell$Code), 
                      x = rep(TRUE, nrow(Sell)), 
                      dimnames = list(levels(Sell$UserId), 
                                      levels(Sell$Code)))
#4 x 5 sparse Matrix of class "lgCMatrix"
#  2 3 12 111 333
#1 . .  |   |   |
#2 . .  |   |   .
#3 | .  .   .   .
#4 . |  .   .   .
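
For a quick sanity check you can compare the memory footprints directly (on this toy example the differences are tiny, but they grow with data of your size):

object.size(test)   # dense data.frame from spread()
object.size(test1)  # numeric sparse matrix (dgCMatrix)
object.size(test2)  # logical sparse matrix (lgCMatrix)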
Roland
  • I want a matrix with 1s and 0s (1 if the client buys a product and 0 if not) and then to use this matrix with the recommenderlab package, in this code: binary_matrix <- as(test, "binaryRatingMatrix") – Kardu Nov 19 '15 at 12:56
  • @Kardu Your question is still not sufficiently specific. We still don't know what exactly you are trying to do, but you probably need to get more low-level than just using the package (if it doesn't offer facilities for data the size of yours). A dense binary matrix will be huge if you have many users and many codes. Even if you can fit it in your RAM you won't be able to work with it. – Roland Nov 19 '15 at 13:03
  • However, after looking into the documentation, a `binaryRatingMatrix` seems to be a sparse matrix object. Your question actually seems to be how to create that from your data. – Roland Nov 19 '15 at 13:05
  • I have 200k users and more than 100k products... it's a huge matrix... this is the point. I don't know how to solve this problem. Parallelize the code? – Kardu Nov 19 '15 at 13:08
  • No, simply study the documentation of the recommenderlab and arules packages. They use sparse matrices internally and should be able to deal with this (see the sketch after these comments). However, don't attempt to create a dense matrix-like object with tidyr since that will be too big for your RAM. – Roland Nov 19 '15 at 13:26
  • Ok Roland, I will try another way. thanks. – Kardu Nov 19 '15 at 14:25
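
Following up on the comment thread above, here is a minimal sketch of one way to get from the raw long-format data to a binaryRatingMatrix without ever materialising a dense matrix. This is an untested sketch: it assumes that binaryRatingMatrix accepts an arules itemMatrix (of which transactions is a subclass) in its data slot, as its class documentation suggests.

library(arules)          # transactions / itemMatrix classes
library(recommenderlab)  # binaryRatingMatrix

# one character vector of purchased product codes per user; split() keeps the
# data in long format, so nothing dense is ever built
trans <- as(split(as.character(Sell$Code), Sell$UserId), "transactions")

# binaryRatingMatrix stores its data as an arules itemMatrix, so wrapping the
# transactions object should sidestep the dense 200k x 100k matrix entirely
brm <- new("binaryRatingMatrix", data = trans)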

I'm not sure this is a coding question...BUT...

The new Community Technology Preview of SQL Server 2016 has R built in on the server, and you can download the preview to try it here: https://www.microsoft.com/en-us/evalcenter/evaluate-sql-server-2016

Doing this brings your R code to your data and runs it on top of the SQL engine, allowing for the same sort of scalability you get built in with SQL.
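
As a rough sketch of what that looks like from the R side (the server name, database, and connection string below are hypothetical placeholders), the RevoScaleR package that ships with SQL Server R Services lets you switch the compute context so that subsequent rx* functions run on the server rather than on your own machine:

library(RevoScaleR)  # ships with SQL Server 2016 R Services / Microsoft R Server

# hypothetical connection string -- replace server, database, and credentials
conStr <- "Driver=SQL Server;Server=MYSERVER;Database=Sales;Trusted_Connection=True"

# route subsequent rx* computations (rxSummary, rxDataStep, ...) to the server
rxSetComputeContext(RxInSqlServer(connectionString = conStr, wait = TRUE))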

Or you can stand up a VM in Azure by going to the new portal, selecting "New" > "Virtual Machine", and searching for "SQL".

David Crook