
I have built a logistic regression model using the glm function from the stats package. I would now like to use this model to predict the outcome for a large number of values stored in an "ffdf" object (see the ff package), but I cannot work out how to proceed:

  1. How can I create a subset of my ffdf object that keeps only the variables (i.e. columns) used in the prediction, so that it can be passed as input to the predict function?

  2. How should I proceed next? Which function should I use: predict(), predict.glm() or predict.bigglm() (maybe the biglm package is helpful)?

Thank you in advance for your views on this!

Best regards

UPDATE

Thank you for your feedback BondedDust.
Let me be more precise. It is indeed a coding question: I want to fit a logistic regression on one ffdf object (the learning dataset) and predict the outcome of the model for another ffdf object (the test dataset).

(1/3) Learning data set: ffdf object (created with ff package).

` class(train.random.sample)` >   
[1] "ffdf"

Below is the structure of the ffdf object, in case it is needed:

`str(train.random.sample) ` >

List of 3   
 $ virtual: 'data.frame':   27 obs. of  7 variables:   
 .. $ VirtualVmode     : chr  "integer" "integer" "integer" "integer" ...   
 .. $ AsIs             : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
 .. $ VirtualIsMatrix  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
 .. $ PhysicalIsMatrix : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
 .. $ PhysicalElementNo: int  1 2 3 4 5 6 7 8 9 10 ...   
 .. $ PhysicalFirstCol : int  1 1 1 1 1 1 1 1 1 1 ...   
 .. $ PhysicalLastCol  : int  1 1 1 1 1 1 1 1 1 1 ...   
 .. - attr(*, "Dim")= int  500000 27   
 .. - attr(*, "Dimorder")= int  1 2   
 $ physical: List of 27   
 .. $ id                : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ click             : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ hour              : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ C1                : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ banner_pos        : list()   
 ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr>    
 ..  .. ..- attr(*, "vmode")= chr "integer"   
 ..  .. ..- attr(*, "maxlength")= int 500000   
 ..  .. ..- attr(*, "pattern")= chr "ffdf"   
 ..  .. ..- attr(*, "filename")= chr "anonymized.ff"   
 ..  .. ..- attr(*, "pagesize")= int 65536   
 ..  .. ..- attr(*, "finalizer")= chr "delete"   
 ..  .. ..- attr(*, "finonexit")= logi TRUE   
 ..  .. ..- attr(*, "readonly")= logi FALSE   
 ..  .. ..- attr(*, "caching")= chr "mmnoflush"   
 ..  ..- attr(*, "virtual")= list()   
 ..  .. ..- attr(*, "Length")= int 500000   
 ..  .. ..- attr(*, "Symmetric")= logi FALSE    
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ site_id           : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ site_domain       : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ site_category     : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ app_id            : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ app_domain        : list()   
…  
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ app_category      : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_id         : list()   
 …   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_ip         : list()   
….   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_os         : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_make       : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_model      : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_type       : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_conn_type  : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_geo_country: list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ C17               : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
$ row.names:  NULL   
- attributes: List of 2   
 .. $ names: chr [1:3] "virtual" "physical" "row.names"   
 .. $ class: chr "ffdf"   

(2/3) Logistic regression based on learning dataset:

The objective is to learn/predict the ‘click’ outcome based on the ‘banner_pos’ input.

`logreg1 <- glm(click ~ banner_pos, data = train.random.sample, family = "binomial")   
summary(logreg1)` >   


Call:
glm(formula = click ~ banner_pos, family = "binomial", data = train.random.sample)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0555  -0.6495  -0.5951  -0.5951   1.9071  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.641416   0.004702 -349.12   <2e-16 ***
banner_pos   0.192534   0.007595   25.35   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 458848  on 499999  degrees of freedom
Residual deviance: 458215  on 499998  degrees of freedom
AIC: 458219

Number of Fisher Scoring iterations: 4

`class(logreg1)`>
[1] "glm" "lm" 

(3/3) Test dataset: ffdf object (created with ff package).

`class(df.test)` >   
[1] "ffdf"

The test dataset has the same structure as the training dataset, except that it has 26 columns instead of 27, and about 4.77 million rows:

`str(df.test)`>   

List of 3   
 $ virtual: 'data.frame':   26 obs. of  7 variables:   
 .. $ VirtualVmode     : chr  "integer" "integer" "integer" "integer" ...   
.. $ AsIs             : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
.. $ VirtualIsMatrix  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
.. $ PhysicalIsMatrix : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
.. $ PhysicalElementNo: int  1 2 3 4 5 6 7 8 9 10 ...   
.. $ PhysicalFirstCol : int  1 1 1 1 1 1 1 1 1 1 ...   
.. $ PhysicalLastCol  : int  1 1 1 1 1 1 1 1 1 1 ...   
.. - attr(*, "Dim")= int  4769401 26   
.. - attr(*, "Dimorder")= int  1 2   
$ physical: List of 26   
…   

I did not manage to predict the click outcome. I first tried to create a data frame or ffdf object containing only the banner_pos variable:

`modeldata <- df.test[["banner_pos"]]`

Then I tried to predict the outcome:

`predict.glm(object = logreg1, newdata = modeldata, type = "response")`

Error in as.data.frame.default(data) : 
  cannot coerce class "c("ff_vector", "ff")" to a data.frame

Is there something wrong in my code? Should I use another function from another package, such as biglm?
Many thanks in advance for your views on this issue.
Best regards

cho7tom
  • You should not ask us to guess at the methods you used. Provide code and output of `str(ff_object)`. You would not expect to use "predict" functions until you had created model objects. That second request is suggesting to me that you have not really done this sort of R operation in the past on regular dataframe data. Voting to close as too broad. If you edit to make this more of a coding questions, perhaps others will disagree with me and you can get an answer. – IRTFM Nov 10 '14 at 23:54

1 Answer


Something like this will score your ffdf with your glm model, one chunk at a time:

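    library(ff)
    # Add an empty (NA) numeric ff column to hold the predictions
    df.test$score <- ff(as.numeric(NA), length = nrow(df.test))
    # Score one chunk at a time: each chunk is read into RAM as a data.frame
    chunks <- chunk(df.test)
    for (chunkrangeindex in chunks) {
      df.test$score[chunkrangeindex] <- predict(object = logreg1, newdata = df.test[chunkrangeindex, ], type = "response")
    }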
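For experimenting without the original files, here is a minimal self-contained sketch of the same chunked scoring. Everything in it is an assumption made for illustration: the toy data, the object names (`train`, `test.df`, `test.ffdf`, `fit`, `score`), the chunk size, and the choice to fit the model on an ordinary data.frame rather than on an ffdf. It also shows one way to keep only the predictor column in each chunk, using `drop = FALSE` so that the result stays a data.frame.

```r
library(ff)

set.seed(1)
# Hypothetical toy data standing in for the real training and test sets
train <- data.frame(banner_pos = sample(0:1, 1000, replace = TRUE))
train$click <- rbinom(1000, 1, plogis(-1.6 + 0.2 * train$banner_pos))
test.df <- data.frame(banner_pos = sample(0:1, 5000, replace = TRUE),
                      hour = sample(0:23, 5000, replace = TRUE))

fit <- glm(click ~ banner_pos, data = train, family = "binomial")

# Disk-backed copy of the test set, plus an empty ff vector for the scores
test.ffdf <- as.ffdf(test.df)
score <- ff(as.numeric(NA), length = nrow(test.ffdf))

# chunk() returns a list of row ranges; `by` is set only to force several
# chunks on this small example
for (idx in chunk(test.ffdf, by = 1000)) {
  # Read one chunk into RAM, keeping only the predictor column as a data.frame
  newdata <- test.ffdf[idx, "banner_pos", drop = FALSE]
  score[idx] <- predict(fit, newdata = newdata, type = "response")
}
```

The key point is that predict() only ever sees an ordinary data.frame with a column named `banner_pos`; the ff objects are touched only through `[` with a row range.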
  • Thank you jwijffels. I have however a new error with your suggestion while running the for() loop: opening ff C:/Users/XXXXX/AppData/Local/Temp/RtmpCE64Sq/ffdf218891f61b7.ff Error: file.access(filename, 0) == 0 is not TRUE. I can't understand this error.. Do you have any idea? Many thanks in advance for your help – cho7tom Nov 12 '14 at 22:08
  • The error states that the ff file is no longer there on the disk where it was before. You probably closed R and forgot to save your df.test ffdf with either ffsave or ffdfsave. –  Nov 12 '14 at 22:40
  • Thank you jwijffels. I had indeed an issue while saving my ffdf objects. Using save.ffdf instead of ffsave helped! (your document [here](http://www.bnosac.be/images/blog/user2013_presentation_ffbase.pdf) is very interesting by the way). I now have a new error once launching the ‘for’ loop, which is: **Error in `[[<-.ff`(`*tmp*`, chunkrangeindex, value = c(0.162272524618093, : i must have length 1** Would you have an idea? Many thanks in advance for your help – cho7tom Nov 14 '14 at 13:15
  • yes, I mistyped something. It should be df.test$score[chunkrangeindex] instead of df.test$score[[chunkrangeindex]]. Updated the answer accordingly. –  Nov 16 '14 at 14:47
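
The difference between `[` and `[[` behind the last error can be seen on a small ff vector. This is only a sketch: the vector `v` and the index `idx` are made up for illustration, and `ri()` builds the same kind of range index that `chunk()` returns.

```r
library(ff)

v <- ff(as.numeric(NA), length = 6)  # disk-backed double vector filled with NA
idx <- ri(1, 3)                      # range index covering elements 1 to 3

v[idx] <- c(0.1, 0.2, 0.3)           # `[<-` accepts a whole range of positions
v[[4]] <- 0.4                        # `[[<-` accepts exactly one position

v[]                                  # read the whole vector back into RAM
```

Using `v[[idx]] <- ...` with a multi-element range is what triggers the "i must have length 1" error quoted above.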