I want to train a random forest model with a repeatedcv procedure using caret::train. My data has some missing values, so I want to use the preProcess="bagImpute" option within the train function. I do not want to use the preProcess function outside of train, because I want to bagImpute my data for each iteration of the repeatedcv procedure. However, when I attempt to do this, an error is thrown:

Error in { : task 1 failed - "'n' must be a positive integer >= 'x'"
In addition: There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In eval(expr, envir, enclos) :
  model fit failed for Fold01.Rep01: mtry=2 Error in na.fail.default(structure(list(Sepal.Length = c(5.1, 4.9, 4.7,  : 
  missing values in object

Below is a minimal reproducible example using the iris data. I borrowed the initial code for the dataset prep from Minkoo at his website: http://mkseo.pe.kr/stats/?p=719. Many thanks Minkoo!

library(caret)

data(iris)
inTrain <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[inTrain, ]

# Set one randomly chosen predictor to NA in roughly 10% of the rows
fillInNa <- function(d) {
  naCount <- NROW(d) * 0.1
  for (i in sample(NROW(d), naCount)) {
    d[i, sample(4, 1)] <- NA
  }
  return(d)
}

training <- fillInNa(training)

tc <- trainControl("repeatedcv", repeats = 30, selectionFunction = "oneSE",
                   returnData = TRUE, classProbs = TRUE, number = 10,
                   preProcOptions = "bagImpute",
                   summaryFunction = multiClassSummary, savePredictions = TRUE)

training.x <- training[, 1:4]
training.y <- training[, 5]

rfTri_Bag <- train(training.x, training.y,
                   method = "rf",
                   trControl = tc,
                   preProcess = c("bagImpute"),
                   tuneLength = 10,
                   control = rpart.control(usesurrogate = 0),
                   ntree = 250,
                   proximity = TRUE)

Edit: Here is my session info:

R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ipred_0.9-5         e1071_1.6-7         latticeExtra_0.6-28 RColorBrewer_1.1-2  randomForest_4.6-12 caret_6.0-71       
 [7] rpart_4.1-10        party_1.0-25        strucchange_1.5-1   sandwich_2.3-4      zoo_1.7-13          modeltools_0.2-21  
[13] mvtnorm_1.0-5       gdata_2.17.0        DMwR_0.4.1          pROC_1.8            Metrics_0.1.1       raster_2.5-8       
[19] sp_1.2-3            gridExtra_2.2.1     readr_1.0.0         tidyr_0.6.0         tibble_1.2          tidyverse_1.0.0    
[25] MuMIn_1.15.6        merTools_0.2.2      devtools_1.12.0     plyr_1.8.4          arm_1.9-1           lattice_0.20-33    
[31] MASS_7.3-45         xtable_1.8-2        lmerTest_2.0-32     lme4_1.1-12         Matrix_1.2-6        xlsx_0.5.7         
[37] xlsxjars_0.6.1      rJava_0.9-8         AICcmodavg_2.0-4    pander_0.6.0        ggplot2_2.1.0       purrr_0.2.2        
[43] dplyr_0.5.0         broom_0.4.1        

loaded via a namespace (and not attached):
 [1] TH.data_1.0-7      VGAM_1.0-2         minqa_1.2.4        colorspace_1.2-6   class_7.3-14       MatrixModels_0.4-1
 [7] DT_0.2             prodlim_1.5.7      coin_1.1-2         codetools_0.2-14   splines_3.3.1      mnormt_1.5-4      
[13] knitr_1.14         Formula_1.2-1      nloptr_1.0.4       pbkrtest_0.4-6     cluster_2.0.4      shiny_0.14        
[19] compiler_3.3.1     httr_1.2.1         assertthat_0.1     lazyeval_0.2.0     acepack_1.3-3.3    htmltools_0.3.5   
[25] quantreg_5.29      tools_3.3.1        coda_0.18-1        gtable_0.2.0       reshape2_1.4.1     Rcpp_0.12.7       
[31] nlme_3.1-128       iterators_1.0.8    psych_1.6.6        stringr_1.1.0      mime_0.5           gtools_3.5.0      
[37] scales_0.4.0       parallel_3.3.1     SparseM_1.7        yaml_2.1.13        quantmod_0.4-6     curl_1.2          
[43] memoise_1.0.0      reshape_0.8.5      stringi_1.1.1      foreach_1.4.3      blme_1.0-4         TTR_0.23-1        
[49] caTools_1.17.1     boot_1.3-18        lava_1.4.4         chron_2.3-47       bitops_1.0-6       evaluate_0.9      
[55] ROCR_1.0-7         htmlwidgets_0.7    labeling_0.3       magrittr_1.5       R6_2.1.3           gplots_3.0.1      
[61] Hmisc_3.17-4       multcomp_1.4-6     DBI_0.5            foreign_0.8-66     withr_1.0.2        mgcv_1.8-12       
[67] xts_0.9-7          survival_2.39-4    abind_1.4-5        nnet_7.3-12        car_2.1-3          KernSmooth_2.23-15
[73] rmarkdown_1.0      data.table_1.9.6   git2r_0.15.0       digest_0.6.10      httpuv_1.3.3       munsell_0.4.3     
[79] unmarked_0.11-0   

Edit 2: An almost identical question has been asked here (https://stackoverflow.com/a/20081954/5617640), but the answer given only shows how to predict from a preProcess() object outside of the train() function. As @Misconstruction points out in a comment, with this method the imputation is "not included inside the CV loop." My thoughts exactly.

jlab

1 Answer

This is not the solution to the error message, but will hopefully resolve your question.

If you are running a random forest model, it inherently "cross-validates" itself, in a sense, through the out-of-bag (OOB) error estimate. According to Breiman's random forests page at Berkeley, no separate cross-validation is needed when using random forests:

"In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run..." (https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm)
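
As a rough illustration of the OOB estimate described above, here is a minimal sketch that reads it directly from a randomForest fit. It assumes the randomForest package is installed and, because randomForest() cannot fit data containing NAs, it swaps in na.roughfix (median/mode fill) as a simple stand-in for bagImpute; it is not equivalent to bagged imputation. The names d, na_rows, fit and oob_error are made up for this example.

```r
library(randomForest)

set.seed(42)
data(iris)

# Knock out one randomly chosen predictor in about 10% of the rows,
# mirroring the fillInNa() helper from the question.
d <- iris
na_rows <- sample(nrow(d), floor(nrow(d) * 0.1))
for (i in na_rows) {
  d[i, sample(4, 1)] <- NA
}

# Assumption: na.roughfix replaces bagImpute here. It fills numeric NAs
# with the column median before the forest is grown.
fit <- randomForest(Species ~ ., data = d, ntree = 250,
                    na.action = na.roughfix)

# err.rate has one row per tree; the "OOB" column of the last row is the
# out-of-bag misclassification rate of the full forest.
oob_error <- fit$err.rate[fit$ntree, "OOB"]
print(oob_error)
```

Note that the imputation in this sketch is still done once on the whole data set before any tree is grown, so the OOB estimate does not account for the imputation step. That is the same concern the question raises about imputing outside the CV loop.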

ttkwan