I get the following error when trying to use the preProcess function from the caret package. The predict function works for knn and median imputation, but gives an error for bagging. How should I edit my call to the predict function?

Reproducible example:

library(caret)

data = iris
set.seed(1)
data = as.data.frame(lapply(data, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.8, 0.2), size = length(cc), replace = TRUE) ]))

preprocess_values = preProcess(data, method = c("bagImpute"), verbose = TRUE)
data_new = predict(preprocess_values, data)

This gives the following error:

> data_new = predict(preprocess_values, data)
Error in UseMethod("predict") : 
  no applicable method for 'predict' applied to an object of class "NULL"
Aveshen Pillay

1 Answer


The preprocessing/imputation functions in caret work only for numerical variables. As stated in the help of preProcess:

x: a matrix or data frame. Non-numeric predictors are allowed but will be ignored.

You have most likely found a bug in the code that is supposed to ignore non-numeric variables: instead of skipping them, it throws an uninformative error.
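A quick way to see which columns are affected is to check the column types before calling preProcess. This is a minimal base R sketch on the unmodified iris data; the names is_num and numeric_only are only illustrative.

```r
# Flag the numeric columns; caret's imputation only applies to these.
is_num <- vapply(iris, is.numeric, logical(1))

# Columns that preProcess is documented to ignore.
names(iris)[!is_num]
# "Species"

# Keep only the numeric columns for imputation.
numeric_only <- iris[, is_num, drop = FALSE]
```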

If you remove the factor variable, the imputation works:

library(caret)

df <- iris
set.seed(1)
df <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.8, 0.2), size = length(cc), replace = TRUE) ]))
df <- df[,-5] # remove the factor variable (Species)
           
preprocess_values <- preProcess(df, method = c("bagImpute"), verbose = TRUE)

data_new <- predict(preprocess_values, df)

The last line of code works but produces a number of warnings:

In cprob[tindx] + pred :
  longer object length is not a multiple of shorter object length

These warnings do not come from caret itself but from ipred::bagging, which caret::preProcess calls internally. They are caused by rows in which three of the four values are NA. When those rows are removed

df <- df[rowSums(sapply(df, is.na)) != 3,]

preprocess_values <- preProcess(df, method = c("bagImpute"), verbose = TRUE)

data_new <- predict(preprocess_values, df)

the warnings disappear.
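To see how many rows are affected before filtering, you can tabulate the number of missing values per row. This is a base R sketch that rebuilds the same kind of data as above; the names na_per_row and df_filtered are illustrative, and dropping rows with three or more NAs (rather than exactly three) is an assumption made here to also cover rows that are missing entirely.

```r
set.seed(1)
df <- iris[, -5]  # numeric columns only
# Randomly replace roughly 20% of the values in each column with NA.
df <- as.data.frame(lapply(df, function(cc) {
  cc[sample(c(TRUE, NA), prob = c(0.8, 0.2), size = length(cc), replace = TRUE)]
}))

# Number of missing values in each row.
na_per_row <- rowSums(is.na(df))
table(na_per_row)

# Drop the rows with three or more missing values.
df_filtered <- df[na_per_row < 3, , drop = FALSE]
```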

To overcome the limitations mentioned above, have a look at the recipes package, and specifically step_bagimpute (named step_impute_bag in more recent releases of recipes).
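A rough sketch of what that could look like is below. It assumes a recent version of recipes, where the bagged-tree imputation step is named step_impute_bag, and it keeps the factor column in the data. I have not checked which warnings, if any, it produces on this particular data.

```r
library(recipes)

set.seed(1)
# Same construction as in the question: about 20% NA in every column,
# including the factor Species.
df <- as.data.frame(lapply(iris, function(cc) {
  cc[sample(c(TRUE, NA), prob = c(0.8, 0.2), size = length(cc), replace = TRUE)]
}))

# A recipe without an outcome: every column is treated as a predictor.
rec <- recipe(~ ., data = df)
# Impute each predictor with bagged trees fitted on the other predictors.
rec <- step_impute_bag(rec, all_predictors())

# Estimate the imputation models, then apply them to the data.
prepped <- prep(rec, training = df)
data_new <- bake(prepped, new_data = df)

# Remaining missing values per column.
colSums(is.na(data_new))
```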

missuse