
I have a data frame with 72 million observations and two columns, my_id and my_rank. It contains about 6 million unique my_id values. I need to calculate the average my_rank value by my_id (group by my_id). I tried the regular R command below, but it seems to freeze R (maybe the data is too big to fit in memory).

avg_rank_by_id<-aggregate(dataframe1["my_rank"],by=dataframe1["my_id"], mean, na.rm=TRUE)

Is there a way to use a RevoScaleR function such as rxCube to achieve this? I am running on Linux. I tried the code below, but got an error.

I am new to R. Besides RevoScaleR, is there an open-source high-performance computing package for R? Thanks.

acct_avg_rank <- rxCube( N(m13_rank)~acct_id, data=payee_merge, means=TRUE, returnDataFrame=TRUE)

All independent variables must be factors for rxCube and rxCrossTabs: "acct_id". Use F(x) to declare that a continuous variable x is to be treated as a factor.

Error in rxCall("RxCrossTabs", params) :
Calls: rxCube -> rxCubeBase -> rxCall -> .Call

Eric_IL
  • There is a dedicated CRAN Task View: http://cran.r-project.org/web/views/HighPerformanceComputing.html –  Jul 06 '15 at 22:21

1 Answer


It looks like acct_id was read in as numeric when you imported the data, but rxCube requires its independent variables to be factors.

You have three options:
1. Import again with colClasses = c(acct_id = "factor").
2. Use rxFactors to change acct_id to a factor.
3. Change to a factor in the formula.

acct_avg_rank <- rxCube(N(m13_rank) ~ F(acct_id), data=payee_merge,
                        means=TRUE, returnDataFrame=TRUE)
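
On the open-source part of the question: one option that needs no extra packages is base R's rowsum(), which computes per-group sums and is often considerably faster than aggregate() on large inputs. The following is only a minimal sketch, assuming the column names my_id and my_rank from the question. The toy data is made up for illustration, and I have not timed this on 72 million rows.

```r
# Toy data standing in for the real data frame (hypothetical values).
dataframe1 <- data.frame(
  my_id   = c(101, 102, 101, 103, 102, 101),
  my_rank = c(1, 4, 3, NA, 6, 5)
)

# Drop rows with a missing rank, similar in spirit to mean(..., na.rm = TRUE).
# Note: an id whose ranks are all missing (103 here) is dropped entirely,
# whereas aggregate(..., mean, na.rm = TRUE) would keep it with NaN.
ok <- !is.na(dataframe1$my_rank)

# rowsum() returns a one-column matrix of group sums, with the (sorted)
# group values as row names.
rank_sum <- rowsum(dataframe1$my_rank[ok], dataframe1$my_id[ok])
rank_n   <- rowsum(rep(1, sum(ok)), dataframe1$my_id[ok])

# Mean per id = sum per id / count per id.
avg_rank_by_id <- data.frame(
  my_id   = as.numeric(rownames(rank_sum)),
  my_rank = as.vector(rank_sum / rank_n)
)

print(avg_rank_by_id)
```

With the toy data this gives a mean of 3 for id 101 and 5 for id 102. If the result still does not fit in memory, the CRAN Task View linked in the comment above lists packages aimed at larger-than-memory data.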