I am struggling with missing values in a Date column. In my pre-processing pipeline (a recipe object) I used the `step_impute_knn` function to fill missing values in all my Date columns. Unfortunately, I got the following error:

Assigned data `pred_vals` must be compatible with existing data.
Error occurred for column `avg_begin_first_contract`.
Can't convert double to date.

Below is a reprex in which I impute values in multiple columns, including a Date column. Imputing only the Date column made no difference; the result was the same. Further down there is a second reprex, which does not throw an error because no Date column is used.

Has someone had this issue before?

library(tidyverse)
library(tidymodels)

iris <- iris %>%
  mutate(Plucked = sample(seq(as.Date("1999/01/01"), as.Date("2000/01/01"),
    by = "day"
  ), size = 150))

# Introduce one missing value per predictor column
iris[45, 2] <- NA_real_    # Sepal.Width
iris[37, 3] <- NA_real_    # Petal.Length
iris[78, 4] <- NA_real_    # Petal.Width
iris[9, 5] <- NA           # Species (factor)
iris[15, 6] <- as.Date(NA) # Plucked (Date)

set.seed(456)

iris_split <- iris %>%
  initial_split(strata = Sepal.Length)


iris_training <- training(iris_split)
iris_testing <- testing(iris_split)

iris_rf_model <- rand_forest(
  mtry = 10,
  min_n = 10,
  trees = 500
) %>%
  set_engine("ranger") %>%
  set_mode("regression")


base_rec <- recipe(Sepal.Length ~ .,
  data = iris_training
) %>%
  step_impute_knn(Sepal.Width, Petal.Length, Petal.Width, Species, Plucked) %>%
  step_date(Plucked) %>%
  step_dummy(Species)

iris_workflow <- workflow() %>%
  add_model(iris_rf_model) %>%
  add_recipe(base_rec)

iris_rf_wkfl_fit <- iris_workflow %>%
  last_fit(iris_split)
#> x train/test split: preprocessor 1/1: Error: Assigned data `pred_vals` must be compatible wi...
#> Warning: All models failed. See the `.notes` column.
Created on 2021-06-15 by the reprex package (v2.0.0)

Here is the reprex that does not throw an error:

library(tidyverse)
library(tidymodels)

iris[45, 2] <- as.numeric(NA)
iris[37 ,3] <- as.numeric(NA)
iris[78, 4] <- as.numeric(NA)
iris[9, 5] <- as.numeric(NA)

set.seed(123)

iris_split <- iris %>% 
  initial_split(strata = Sepal.Length)

iris_training <- training(iris_split)
iris_testing <- testing(iris_split)

iris_rf_model <- rand_forest(
  mtry = 5,
  min_n = 5,
  trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")


base_rec <- recipe(Sepal.Length ~ .,
                   data = iris_training) %>% 
  step_impute_knn(Sepal.Width, Petal.Length, Petal.Width, Species) %>%
  step_dummy(Species)

iris_workflow <- workflow() %>% 
  add_model(iris_rf_model) %>% 
  add_recipe(base_rec)

iris_rf_wkfl_fit <- iris_workflow %>%
  last_fit(split = iris_split)
Created on 2021-06-15 by the reprex package (v2.0.0)

Thanks in advance! M.

Mischa
  • Hard to say without a minimal reproducible example but _maybe_ you just need an `as.Date()` call wrapped around your imputation. For example, `as.Date(1.23)` gives us Jan 2, 1970, as expected (given that the epoch of zero is Jan 1, 1970). Whether that is the right thing to do for your model only you can tell, but it should give you the right _type_. – Dirk Eddelbuettel Jun 14 '21 at 12:53
  • Thanks for your answer, I really appreciate it. I updated the question and added a `reprex`; it does not use the original data, but it reproduces the error. – Mischa Jun 15 '21 at 12:35

1 Answer

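I think I found a solution and want to share it. The key is to convert the Date column to a numeric value (days since 1970-01-01) before imputing; after that, the imputation is straightforward. In the reprex below, the numeric `Plucked` column is imputed with `step_impute_bag`.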

library(tidyverse)
library(tidymodels)

iris <- iris %>%
  mutate(Plucked = sample(seq(as.Date("1999/01/01"), as.Date("2000/01/01"),
    by = "day"
  ), size = 150))

# Introduce one missing value per predictor column
iris[45, 2] <- NA_real_    # Sepal.Width
iris[37, 3] <- NA_real_    # Petal.Length
iris[78, 4] <- NA_real_    # Petal.Width
iris[9, 5] <- NA           # Species (factor)
iris[15, 6] <- as.Date(NA) # Plucked (Date)

set.seed(456)

iris_split <- iris %>%
  initial_split(strata = Sepal.Length)


iris_training <- training(iris_split)
iris_testing <- testing(iris_split)

iris_rf_model <- rand_forest(
  mtry = 10,
  min_n = 10,
  trees = 500
) %>%
  set_engine("ranger") %>%
  set_mode("regression")


base_rec <- recipe(Sepal.Length ~ .,
  data = iris_training
) %>% 
  step_mutate_at(
    where(lubridate::is.Date),
    fn = ~ as.numeric(lubridate::ymd(.x))
  ) %>%
  step_impute_bag(c("Plucked")) %>% 
  step_impute_knn(Sepal.Width, Petal.Length, Petal.Width, Species) %>%
  step_dummy(Species)

iris_workflow <- workflow() %>%
  add_model(iris_rf_model) %>%
  add_recipe(base_rec)

iris_rf_wkfl_fit <- iris_workflow %>%
  last_fit(iris_split)
#> ! train/test split: preprocessor 1/1, model 1/1: 10 columns were requested but there were 6 ...
Created on 2021-06-29 by the reprex package (v2.0.0)

To convert the numeric values back to Dates before fitting, add the following step to the recipe after the imputation steps:

step_mutate_at(c("Plucked"), fn = ~ as.Date(.x, origin = "1970-01-01"))
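
The round trip behind this can be checked in base R without any recipe. The sketch below is only an illustration: the dates are made up, and a plain mean stands in for the imputation step, which is not what `step_impute_bag` or `step_impute_knn` compute.

```r
# Illustrative values only; these dates are not from the original data.
d <- as.Date(c("1999-01-01", NA, "1999-01-03"))

# Date -> numeric: number of days since 1970-01-01
n <- as.numeric(d)

# Stand-in for the imputation step: fill the gap with the mean of the observed values
n[is.na(n)] <- mean(n, na.rm = TRUE)

# numeric -> Date, using the same origin as above
d_imputed <- as.Date(n, origin = "1970-01-01")
print(d_imputed)
```

Because the imputed column is an ordinary double, any imputation step that handles numeric columns can fill it, and the conversion back to Date only needs the origin.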

Thanks again, M.

Mischa