I am trying to impute missing values by passing "knnImpute" to the preProcess argument of caret's train() function. In the following example, it appears that the values are not imputed: they remain NA and are then ignored. What am I doing wrong?

Any help is much appreciated.

library("caret")

set.seed(1234)
data(iris)

# mark 8 of the cells as NA, so they can be imputed
row <- sample (1:nrow (iris), 8)
iris [row, 1] <- NA

# split test vs training
train.index <- createDataPartition (y = iris[,5], p = 0.80, list = FALSE)
train <- iris [ train.index, ]
test  <- iris [-train.index, ]

# train the model after imputing the missing data
fit <- train (Species ~ ., 
              train, 
              preProcess = c("knnImpute"), 
              na.action  = na.pass, 
              method     = "rpart" )
test$species.hat <- predict (fit, test)

# there is 1 obs. (of 30) in the test set equal to NA  
# this 1 obs. was not returned from predict
Error in `$<-.data.frame`(`*tmp*`, "species.hat", value = c(1L, 1L, 1L,  : 
  replacement has 29 rows, data has 30

UPDATE: I have been able to use the preProcess function directly to impute the values. I still don't understand why this does not seem to occur within the train function.

# attempt to impute using nearest neighbors
x <- iris [, 1:4]
pp <- preProcess (x, method = c("knnImpute"))
x.imputed <- predict (pp, newdata = x)

# expect all NAs were populated with an imputed value
stopifnot( all (!is.na (x.imputed)))
# length() on a data frame counts columns only, so compare full dimensions
stopifnot( identical (dim (x), dim (x.imputed)))
Richie Cotton
Nick Allen

1 Answer

See ?predict.train:

 ## S3 method for class 'train'
 predict(object, newdata = NULL, type = "raw", na.action = na.omit, ...)

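predict.train also defaults to na.action = na.omit, so rows with missing values are dropped at prediction time unless you pass na.action = na.pass: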

 > length(predict (fit, test))
 [1] 29
 > length(predict (fit, test, na.action = na.pass))
 [1] 30
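
Putting that together with the code from the question, a minimal sketch of the whole workflow might look like the following. The names `train_df`, `test_df`, and `pred` are mine, not from the original post, and I am assuming the same seed and split as in the question; the only substantive change is passing `na.action = na.pass` to `predict` as well as to `train`.

```r
library(caret)

set.seed(1234)
data(iris)

# Mark 8 cells of the first column as NA so they can be imputed
iris[sample(nrow(iris), 8), 1] <- NA

# Split into training and test sets
idx      <- createDataPartition(iris$Species, p = 0.80, list = FALSE)
train_df <- iris[idx, ]
test_df  <- iris[-idx, ]

# na.pass lets rows with NAs reach the knnImpute pre-processing step
fit <- train(Species ~ .,
             data       = train_df,
             method     = "rpart",
             preProcess = "knnImpute",
             na.action  = na.pass)

# predict.train defaults to na.omit, so override it here too
pred <- predict(fit, newdata = test_df, na.action = na.pass)

# One prediction per test row, so the assignment no longer fails
stopifnot(length(pred) == nrow(test_df))
test_df$species.hat <- pred
```

With `na.action = na.pass` in both calls, the imputation stored in the fitted object is applied to the new data before prediction, rather than the incomplete rows being dropped.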

Max

topepo
This shows how to handle NAs directly using the predict function - is there any way to specify the handling of missing values inside the train() function? Otherwise it's not included inside the CV loop. – Misconstruction Dec 08 '15 at 12:05

@Misconstruction Remember to include `na.action = na.pass` in both `train` and `predict`. – adatum Jan 23 '17 at 20:38