
I have recently been using R to run a generalised linear model (GLM) on a 100 MB CSV file (9 million rows by 5 columns). The file has 5 columns, called depvar, var1, var2, var3 and var4, all randomly distributed and containing small integer values (0, 1, 2 and so on). I used the biglm package to run the GLM on this data file, and R processed it in approximately 2 minutes. This was on a Linux machine with 4 cores and 8 GB of RAM, running R version 2.10 (I am currently updating to 2.14). I want the code to run in around 30 to 60 seconds. One solution is adding more cores and RAM, but that would only be a temporary fix, as I realise datasets will only get bigger; ideally I want to find a way to make bigglm itself faster. I have run some R profiling code on the dataset, adding the following line before the code I want to time:

Rprof('method1.out')

Then after typing this command I run my bigglm code, which looks something like this:

library(biglm)
x <- read.csv('file location of 100mb file')
form <- depvar ~ var1 + var2 + var3 + var4
a <- bigglm(form, data = x, chunksize = 800000, sandwich = FALSE, family = binomial())
summary(a)
AIC(a)
deviance(a)
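Since read.csv loads all 9 million rows before bigglm ever sees them, one alternative worth trying is bigglm's other data interface: instead of a data frame, data can be a function that returns one chunk of rows per call (and NULL when exhausted), so the whole file never has to sit in memory at once. A rough sketch — the file name `bigdata.csv` and the chunk size below are placeholders:

```r
## Returns a function suitable for bigglm's 'data' argument: called with
## reset = TRUE it rewinds to the start of the file; otherwise it returns
## the next chunk of rows, or NULL once the file is exhausted.
make_chunk_reader <- function(path, chunksize = 800000) {
  con <- NULL
  header <- names(read.csv(path, nrows = 1))   # grab column names once
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(con)) close(con)
      con <<- file(path, open = "r")
      readLines(con, n = 1)                    # skip the header row
      return(NULL)
    }
    tryCatch(read.csv(con, header = FALSE, nrows = chunksize,
                      col.names = header),
             error = function(e) NULL)         # end of file reached
  }
}

## 'bigdata.csv' is a placeholder for the real file location.
if (requireNamespace("biglm", quietly = TRUE) && file.exists("bigdata.csv")) {
  a <- biglm::bigglm(depvar ~ var1 + var2 + var3 + var4,
                     data = make_chunk_reader("bigdata.csv"),
                     family = binomial())
}
```

With this interface the chunk size is controlled by the reader rather than by bigglm's chunksize argument; whether it helps depends on how much of the 2 minutes is spent on I/O versus the model fitting itself.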

After running this code, which takes around 2 to 3 minutes, I stop the profiler and view the results:

Rprof(NULL)
summaryRprof('method1.out')

What I then get is a breakdown of the bigglm run showing which individual calls are taking the longest. After viewing this I was surprised to see that a call into Fortran code was taking a very long time (around 20 seconds). This code can be found in the source of biglm at:

http://cran.r-project.org/web/packages/biglm/index.html

in the biglm_0.8.tar.gz file.

What I am asking the community is: can this code be made faster? For example, by changing how the Fortran code that performs the QR decomposition is called. There were also other functions, such as as.character and model.matrix, which took a long time. I had not attached the profiling file originally because I believe it can easily be reproduced given the information supplied; what I am really pointing at is the broader problem of fitting GLMs on big data. This is a problem shared across the R community, and any feedback or help on it would be gratefully received. You can probably replicate this example with a different dataset, look at what is taking so long in the bigglm code, and see whether it matches what I found; if so, please help me figure out how to make bigglm run faster. After Ben requested it, I have uploaded the relevant snippet of the profiling output as well as the first 10 lines of my CSV file:

 CSV File:

 var1,var2,var3,var4,depvar
   1,0,2,2,1
   0,0,1,2,0
   1,1,0,0,1
   0,1,1,2,0
   1,0,0,3,0
   0,0,2,2,0
   1,1,0,0,1
   0,1,2,2,0
   0,0,2,2,1

This CSV output was copied from my text editor, UltraEdit. You can see that var1 and var2 take on values 0 or 1, var3 takes on values 0, 1 or 2, var4 takes on values 0, 1, 2 or 3, and depvar takes on values 0 or 1. The CSV can be replicated in Excel using the RAND function for up to around 1 million rows, then copied and pasted several times over in a text editor like UltraEdit to get a large number of rows: type RAND() down one column for 1 million rows, then ROUND that column in the column beside it to get 1s and 0s. The same sort of thinking applies to 0, 1, 2, 3.
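For what it's worth, a file like this can also be generated directly in R rather than via Excel and copy-paste. A small sketch — the file name is a placeholder, and n is kept small here (raise it to 9e6 to match the file described above):

```r
set.seed(1)    # make the random draws reproducible
n <- 1e5       # placeholder row count; the real file has 9e6 rows
sim <- data.frame(
  var1   = sample(0:1, n, replace = TRUE),
  var2   = sample(0:1, n, replace = TRUE),
  var3   = sample(0:2, n, replace = TRUE),
  var4   = sample(0:3, n, replace = TRUE),
  depvar = sample(0:1, n, replace = TRUE)
)
write.csv(sim, "bigdata.csv", row.names = FALSE)
```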

The profiling file is long, so I have attached the lines that took the most time:

 summaryRprof('method1.out')



$by.self
                        self.time self.pct total.time total.pct
"model.matrix.default"      25.40     20.5      26.76      21.6
".Call"                     20.24     16.3      20.24      16.3
"as.character"              17.22     13.9      17.22      13.9
"[.data.frame"              14.80     11.9      22.12      17.8
"update.bigqr"               5.72      4.6      14.38      11.6
"-"                          4.36      3.5       4.36       3.5
"anyDuplicated.default"      4.18      3.4       4.18       3.4
"|"                          3.98      3.2       3.98       3.2
"*"                          3.44      2.8       3.44       2.8
"/"                          3.18      2.6       3.18       2.6
"unclass"                    2.28      1.8       2.28       1.8
"sum"                        2.26      1.8       2.26       1.8
"attr"                       2.12      1.7       2.12       1.7
"na.omit"                    2.02      1.6      20.00      16.1
"%*%"                        1.74      1.4       1.74       1.4
"^"                          1.56      1.3       1.56       1.3
"bigglm.function"            1.42      1.1     122.52      98.8
"+"                          1.30      1.0       1.30       1.0
"is.na"                      1.28      1.0       1.28       1.0
"model.frame.default"        1.20      1.0      22.68      18.3
">"                          0.84      0.7       0.84       0.7
"strsplit"                   0.62      0.5       0.62       0.5

$by.total
                        total.time total.pct self.time self.pct
"standardGeneric"           122.54      98.8      0.00      0.0
"bigglm.function"           122.52      98.8      1.42      1.1
"bigglm"                    122.52      98.8      0.00      0.0
"bigglm.data.frame"         122.52      98.8      0.00      0.0
"model.matrix.default"       26.76      21.6     25.40     20.5
"model.matrix"               26.76      21.6      0.00      0.0
"model.frame.default"        22.68      18.3      1.20      1.0
"model.frame"                22.68      18.3      0.00      0.0
"["                          22.14      17.9      0.02      0.0
"[.data.frame"               22.12      17.8     14.80     11.9
".Call"                      20.24      16.3     20.24     16.3
"na.omit"                    20.00      16.1      2.02      1.6
"na.omit.data.frame"         17.98      14.5      0.02      0.0
"model.response"             17.44      14.1      0.10      0.1
"as.character"               17.22      13.9     17.22     13.9
"names<-"                    17.22      13.9      0.00      0.0
"<Anonymous>"                15.10      12.2      0.00      0.0
"update.bigqr"               14.38      11.6      5.72      4.6
"update"                     14.38      11.6      0.00      0.0
"data"                       10.26       8.3      0.00      0.0
"-"                           4.36       3.5      4.36      3.5
"anyDuplicated.default"       4.18       3.4      4.18      3.4
"anyDuplicated"               4.18       3.4      0.00      0.0
"|"                           3.98       3.2      3.98      3.2
"*"                           3.44       2.8      3.44      2.8
"/"                           3.18       2.6      3.18      2.6
"lapply"                      3.04       2.5      0.04      0.0
"sapply"                      2.44       2.0      0.00      0.0
"as.list.data.frame"          2.30       1.9      0.02      0.0
"as.list"                     2.30       1.9      0.00      0.0
"unclass"                     2.28       1.8      2.28      1.8
"sum"                         2.26       1.8      2.26      1.8
"attr"                        2.12       1.7      2.12      1.7
"etafun"                      1.88       1.5      0.14      0.1
"%*%"                         1.74       1.4      1.74      1.4
"^"                           1.56       1.3      1.56      1.3
"summaryRprof"                1.48       1.2      0.02      0.0
"+"                           1.30       1.0      1.30      1.0
"is.na"                       1.28       1.0      1.28      1.0
">"                           0.84       0.7      0.84      0.7
"FUN"                         0.70       0.6      0.36      0.3
"strsplit"                    0.62       0.5      0.62      0.5

I was mainly surprised by the .Call function that calls into Fortran. Maybe I didn't understand it: it seems all the calculations are done once this function is used, whereas I thought it was just a linking function that dispatched to the Fortran code. Furthermore, if Fortran is doing all the work, including the iteratively reweighted least squares/QR, why is the rest of the code taking so long?

Naz
  • Could you post your profiling results somewhere? What is the name of the Fortran function that takes the time? The easier you can make it for someone to replicate/diagnose, the more likely they actually will. Why are you surprised that the bulk of the effort is spent in the underlying Fortran code, which is doing most of the real work? – Ben Bolker Jan 03 '12 at 13:58
  • Is your code taking much longer than you expected? In regard to your question, you could adapt the code yourself. – Paul Hiemstra Jan 03 '12 at 23:10
  • Hi Paul, I don't really have a benchmark to compare against. But I think if I was a user and wanted to work on a 100 MB file I would probably want it done within 1 minute with the computing power available today. 2 minutes is already fast, I realise, but can it be made faster? I've been looking at some pseudocode for this currently and I might post it up soon as well. – Naz Jan 03 '12 at 23:18
  • Here are some thoughts: (1) "it's too slow" is a bit like "my piece of string is too short"; without a benchmark it's very hard to have any idea whether 2 minutes is reasonable or not -- that may be why you may not be getting many responses. (2) Related to this; speeding the code up by a factor of 2 to 4 might be pretty easy, if there is some obvious and dumb bottleneck in the code, or might be almost impossible. Getting more memory and cores might indeed be the best solution. In my next comment I will give some specific thoughts (none very well worked out) about directions you *might* go. – Ben Bolker Jan 04 '12 at 00:25
  • Here is a grab bag of ideas. (1) You *might* find analogues of the QR code used somewhere in an optimized linear algebra package (BLAS, ATLAS) ... (2) byte compilation might help, much of that should come for free with R 2.14 (3) I'm not sure where `as.character` is even getting used -- when I `debug(as.character)` and run a small example it doesn't seem to get hit (4) have you tried adjusting the chunk size up or down? (5) there's some fast regression code somewhere in an `Rcpp*` package [might be a lot of work to incorporate it here] ... ? – Ben Bolker Jan 04 '12 at 00:35

1 Answer


My lay understanding is that biglm breaks the data into chunks and processes them sequentially.

  • So you could speed things up by tuning the chunk size to be just as big as your memory allows.
  • It also uses only one of your cores; the code is not multi-threaded, and you'd need to do some magic to get that working.
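On the first bullet: the best chunk size is easy to find empirically before committing to one. A minimal sketch, using simulated data in place of the 9-million-row file (the row count and candidate chunk sizes below are placeholders, scaled down for illustration):

```r
set.seed(1)    # simulated stand-in for the real data frame 'x'
n <- 1e5       # scale up toward 9e6 when testing for real
x <- data.frame(var1   = sample(0:1, n, replace = TRUE),
                var2   = sample(0:1, n, replace = TRUE),
                var3   = sample(0:2, n, replace = TRUE),
                var4   = sample(0:3, n, replace = TRUE),
                depvar = sample(0:1, n, replace = TRUE))
form <- depvar ~ var1 + var2 + var3 + var4

## Time a fit at several candidate chunk sizes and print the results.
if (requireNamespace("biglm", quietly = TRUE)) {
  for (cs in c(2e4, 5e4, 1e5)) {   # try larger values on the real data
    elapsed <- system.time(
      biglm::bigglm(form, data = x, chunksize = cs,
                    sandwich = FALSE, family = binomial())
    )["elapsed"]
    cat(sprintf("chunksize %.0f: %.2f s\n", cs, elapsed))
  }
}
```

Plotting elapsed time against chunk size on the real data should show where the memory/overhead trade-off bottoms out on your machine.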
Louis