I have recently been using R to fit a generalised linear model (GLM) to a 100 MB CSV file (9 million rows by 5 columns). The file has five columns, called depvar, var1, var2, var3 and var4, all filled with randomly distributed small integers (the exact values each column takes are shown below). Using the bigglm function from the biglm package, R fits the model in approximately 2 minutes. This was on a Linux machine running R version 2.10 (I am currently updating to 2.14) with 4 cores and 8 GB of RAM.

I want to get the run time down to around 30 to 60 seconds. One solution is adding more cores and RAM, but that would only be a temporary fix, as datasets will only get bigger; ideally I want to make the bigglm call itself faster. So I profiled the run, adding the following line before the code whose speed I wanted to check:
Rprof('method1.out')
After that command, I run my bigglm code, which looks something like this:
library(biglm)                                  # provides bigglm
x <- read.csv('file location of 100mb file')    # read the whole file into memory
form <- depvar ~ var1 + var2 + var3 + var4      # model formula
a <- bigglm(form, data = x, chunksize = 800000, sandwich = FALSE, family = binomial())
summary(a)
AIC(a)
deviance(a)
After running this code, which takes around 2 to 3 minutes, I stop the profiler and summarise the output:
Rprof(NULL)
summaryRprof('method1.out')
What I then get is a breakdown of the bigglm run showing which individual calls take the longest. Looking at it, I was surprised to see a call into Fortran code that was taking a very long time (around 20 seconds). That code ships with the biglm package, available at:
http://cran.r-project.org/web/packages/biglm/index.html
in the biglm_0.8.tar.gz source file.
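For anyone who wants to poke at it directly: the Fortran lives under src/ inside the package tarball. Assuming the tarball has already been downloaded into the working directory, a couple of lines of R will list the files (the paths here are just illustrative):

## assumes biglm_0.8.tar.gz is already in the working directory
untar('biglm_0.8.tar.gz', exdir = 'biglm-src')                  # unpack the source tarball
list.files('biglm-src', recursive = TRUE, pattern = '\\.f$')    # show the Fortran files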
Basically, what I am asking the community is: can this code be made faster? For example, by changing how the Fortran routine that does the QR decomposition is called? There were also other functions, like as.character and model.matrix, that took a long time. I have not attached the full profiling file here, as I believe it can easily be reproduced from the information I have supplied, but basically I am pointing at the big problem of big data and of fitting GLMs to it. This is a problem shared across the R community, and any feedback or help on it would be gratefully received. You can probably replicate this example with a different dataset, look at what takes so long in the bigglm run, and check whether you hit the same hotspots I did. If so, can someone please help me figure out how to make bigglm run faster? (One possible direction is sketched at the end of this post.) After Ben requested it, I have uploaded the relevant snippet of my profiling output as well as the first 10 lines of my CSV file:
CSV File:
var1,var2,var3,var4,depvar
1,0,2,2,1
0,0,1,2,0
1,1,0,0,1
0,1,1,2,0
1,0,0,3,0
0,0,2,2,0
1,1,0,0,1
0,1,2,2,0
0,0,2,2,1
This CSV output was copied from my text editor, UltraEdit. As can be seen, var1 takes the values 0 or 1, var2 takes the values 0 or 1, var3 takes the values 0, 1 or 2, var4 takes the values 0, 1, 2 or 3, and depvar takes the values 0 or 1. A CSV like this can be replicated in Excel using the RAND function: fill one column with =RAND() for around 1 million rows, put =ROUND(...) in the column beside it to turn the draws into 1s and 0s, then copy and paste the result several times over in a text editor like UltraEdit to reach the full row count. The same sort of thinking applies to getting 0, 1, 2, 3. A pure-R equivalent follows below.
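For anyone reproducing this without Excel, here is a rough R equivalent (purely illustrative: sample() stands in for the RAND/ROUND trick, and the row count and file name are placeholders):

set.seed(1)                     # reproducible draws
n <- 9e6                        # roughly the size of my file
sim <- data.frame(var1   = sample(0:1, n, replace = TRUE),
                  var2   = sample(0:1, n, replace = TRUE),
                  var3   = sample(0:2, n, replace = TRUE),
                  var4   = sample(0:3, n, replace = TRUE),
                  depvar = sample(0:1, n, replace = TRUE))
write.csv(sim, 'simdata.csv', row.names = FALSE)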
The profiling file is long, so I have attached the lines that took the most time:
summaryRprof('method1.out')
$by.self
self.time self.pct total.time total.pct
"model.matrix.default" 25.40 20.5 26.76 21.6
".Call" 20.24 16.3 20.24 16.3
"as.character" 17.22 13.9 17.22 13.9
"[.data.frame" 14.80 11.9 22.12 17.8
"update.bigqr" 5.72 4.6 14.38 11.6
"-" 4.36 3.5 4.36 3.5
"anyDuplicated.default" 4.18 3.4 4.18 3.4
"|" 3.98 3.2 3.98 3.2
"*" 3.44 2.8 3.44 2.8
"/" 3.18 2.6 3.18 2.6
"unclass" 2.28 1.8 2.28 1.8
"sum" 2.26 1.8 2.26 1.8
"attr" 2.12 1.7 2.12 1.7
"na.omit" 2.02 1.6 20.00 16.1
"%*%" 1.74 1.4 1.74 1.4
"^" 1.56 1.3 1.56 1.3
"bigglm.function" 1.42 1.1 122.52 98.8
"+" 1.30 1.0 1.30 1.0
"is.na" 1.28 1.0 1.28 1.0
"model.frame.default" 1.20 1.0 22.68 18.3
">" 0.84 0.7 0.84 0.7
"strsplit" 0.62 0.5 0.62 0.5
$by.total
total.time total.pct self.time self.pct
"standardGeneric" 122.54 98.8 0.00 0.0
"bigglm.function" 122.52 98.8 1.42 1.1
"bigglm" 122.52 98.8 0.00 0.0
"bigglm.data.frame" 122.52 98.8 0.00 0.0
"model.matrix.default" 26.76 21.6 25.40 20.5
"model.matrix" 26.76 21.6 0.00 0.0
"model.frame.default" 22.68 18.3 1.20 1.0
"model.frame" 22.68 18.3 0.00 0.0
"[" 22.14 17.9 0.02 0.0
"[.data.frame" 22.12 17.8 14.80 11.9
".Call" 20.24 16.3 20.24 16.3
"na.omit" 20.00 16.1 2.02 1.6
"na.omit.data.frame" 17.98 14.5 0.02 0.0
"model.response" 17.44 14.1 0.10 0.1
"as.character" 17.22 13.9 17.22 13.9
"names<-" 17.22 13.9 0.00 0.0
"<Anonymous>" 15.10 12.2 0.00 0.0
"update.bigqr" 14.38 11.6 5.72 4.6
"update" 14.38 11.6 0.00 0.0
"data" 10.26 8.3 0.00 0.0
"-" 4.36 3.5 4.36 3.5
"anyDuplicated.default" 4.18 3.4 4.18 3.4
"anyDuplicated" 4.18 3.4 0.00 0.0
"|" 3.98 3.2 3.98 3.2
"*" 3.44 2.8 3.44 2.8
"/" 3.18 2.6 3.18 2.6
"lapply" 3.04 2.5 0.04 0.0
"sapply" 2.44 2.0 0.00 0.0
"as.list.data.frame" 2.30 1.9 0.02 0.0
"as.list" 2.30 1.9 0.00 0.0
"unclass" 2.28 1.8 2.28 1.8
"sum" 2.26 1.8 2.26 1.8
"attr" 2.12 1.7 2.12 1.7
"etafun" 1.88 1.5 0.14 0.1
"%*%" 1.74 1.4 1.74 1.4
"^" 1.56 1.3 1.56 1.3
"summaryRprof" 1.48 1.2 0.02 0.0
"+" 1.30 1.0 1.30 1.0
"is.na" 1.28 1.0 1.28 1.0
">" 0.84 0.7 0.84 0.7
"FUN" 0.70 0.6 0.36 0.3
"strsplit" 0.62 0.5 0.62 0.5
I was mainly surprised by the .Call entry, which is what calls into the Fortran. Maybe I didn't understand it: it seems all the calculations are done once this function is used, whereas I had thought of it as just a linking step that dispatched to the Fortran code. Furthermore, if Fortran is doing all the work, all of the iteratively reweighted least squares and the QR updating, why is the rest of the code taking so long?
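For what it's worth, the profile suggests most of the time goes into model.frame/model.matrix, [.data.frame and as.character rather than the Fortran itself, so one direction I have been considering is feeding bigglm its data in chunks through a reader function (an interface biglm supports) with explicit colClasses, instead of materialising the whole data frame up front. Below is a minimal sketch of what I mean; the file name and chunk size are placeholders, and I have not verified that this is actually faster:

library(biglm)

## Chunked reader in the style bigglm expects: called with reset = TRUE to
## rewind to the start of the data, and with reset = FALSE to return the
## next chunk as a data frame, or NULL once the file is exhausted.
make_chunk_reader <- function(path, chunksize = 800000) {
  con <- NULL
  header <- names(read.csv(path, nrows = 1))      # grab the column names once
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(con)) close(con)
      con <<- file(path, open = "r")
      readLines(con, n = 1)                       # skip the header line
      return(invisible(NULL))
    }
    chunk <- tryCatch(
      read.csv(con, header = FALSE, nrows = chunksize,
               col.names = header,
               colClasses = "integer"),           # skip type-sniffing on every chunk
      error = function(e) NULL)                   # read.csv errors at end of file
    if (is.null(chunk) || nrow(chunk) == 0) NULL else chunk
  }
}

a <- bigglm(depvar ~ var1 + var2 + var3 + var4,
            data = make_chunk_reader('simdata.csv', chunksize = 800000),
            family = binomial(), sandwich = FALSE)
summary(a)

Note that the reader reopens and re-parses the file on every IRLS iteration, so it trades memory for repeated parsing; the colClasses argument is there to keep that re-parsing cheap.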