Revo Scale R rxCube and other open source parallel package?

Question

I have a data frame with 72 million observations and two columns, my_id and my_rank. It has about 6 million unique my_id values. I need to calculate the average my_rank by my_id (group by my_id). I tried the regular R command below, but it seems to freeze R (maybe the data is too big to fit in memory).

avg_rank_by_id<-aggregate(dataframe1["my_rank"],by=dataframe1["my_id"], mean, na.rm=TRUE)

Is there a way to use Revo Scale R functions such as rxCube to achieve this? I am running on Linux. I tried the rxCube call below, but got an error.

I am new to R. Besides Revo Scale R, are there other open source R packages for high-performance computing? Thanks.

acct_avg_rank <- rxCube( N(m13_rank)~acct_id, data=payee_merge, means=TRUE, returnDataFrame=TRUE)

All independent variables must be factors for rxCube and rxCrossTabs: "acct_id". Use F(x) to declare that a continuous variable x is to be treated as a factor.

Error in rxCall("RxCrossTabs", params) :
Calls: rxCube -> rxCubeBase -> rxCall -> .Call

There is a dedicated CRAN Task View: http://cran.r-project.org/web/views/HighPerformanceComputing.html (comment, Jul 06 '15 at 22:21)

score 0 · Answer 1 · answered Jul 08 '15 at 16:12

It looks like acct_id was treated as numeric when you imported the data, but rxCube needs it to be a factor.

You have three options:
1. Import again with colClasses = c(acct_id = "factor").
2. Use rxFactors to convert acct_id to a factor.
3. Convert it to a factor in the formula with F():

acct_avg_rank <- rxCube(N(m13_rank) ~ F(acct_id), data=payee_merge,
                        means=TRUE, returnDataFrame=TRUE)
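
On the open source part of the question: if the data frame does fit in memory, base R's rowsum() is usually much faster than aggregate() when there are many groups. Below is a minimal sketch, not a benchmarked solution. The column names my_id and my_rank come from the question; the toy values are made up, and dropping NA ranks up front is my assumption about what na.rm=TRUE was meant to achieve.

```r
# Toy stand-in for the 72-million-row data frame (values are made up).
dataframe1 <- data.frame(
  my_id   = c(101, 101, 102, 103, 103, 103),
  my_rank = c(1, 3, 5, NA, 2, 4)
)

# Drop missing ranks first, similar in spirit to na.rm = TRUE.
# Note: an id whose ranks are all NA disappears from the result.
ok    <- !is.na(dataframe1$my_rank)
ids   <- dataframe1$my_id[ok]
ranks <- dataframe1$my_rank[ok]

# rowsum() returns one row per group; sum divided by count gives the mean.
sums   <- rowsum(ranks, ids)
counts <- rowsum(rep(1, length(ids)), ids)

avg_rank_by_id <- data.frame(
  my_id   = as.numeric(rownames(sums)),
  my_rank = as.vector(sums / counts)
)
print(avg_rank_by_id)
```

The result has one row per my_id, sorted by id. If the data really does not fit in memory, this will not help, and a chunked approach such as rxCube is the better route.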
