
This question is a duplicate of "Tidymodels: What is the correct way to impute missing values in a Date column?". Since that question was closed, I am asking again, this time with a reprex.

I am struggling with missing values in a Date column. In my pre-processing pipeline (a recipe object) I used the step_impute_knn function to fill missing values in all my Date columns. Unfortunately, I got the following error:

Assigned data `pred_vals` must be compatible with existing data. Error occurred for column `avg_begin_first_contract`. x Can't convert <double> to <date>.

Here is a reprex in which I impute values in multiple columns, including a Date column. Imputing only the Date column made no difference: the result was the same. Further below is a second reprex that does not throw an error, because it contains no Date column.

Has someone had this issue before?

library(tidyverse)
library(tidymodels)

iris <- iris %>%
  mutate(Plucked = sample(seq(as.Date("1999/01/01"), as.Date("2000/01/01"),
    by = "day"
  ), size = 150))

iris[45, 2] <- as.numeric(NA)
iris[37, 3] <- as.numeric(NA)
iris[78, 4] <- as.numeric(NA)
iris[9, 5] <- as.numeric(NA)
iris[15, 6] <- as.factor(NA)

set.seed(456)

iris_split <- iris %>%
  initial_split(strata = Sepal.Length)


iris_training <- training(iris_split)
iris_testing <- testing(iris_split)

iris_rf_model <- rand_forest(
  mtry = 10,
  min_n = 10,
  trees = 500
) %>%
  set_engine("ranger") %>%
  set_mode("regression")


base_rec <- recipe(Sepal.Length ~ .,
  data = iris_training
) %>%
  step_impute_knn(Sepal.Width, Petal.Length, Petal.Width, Species, Plucked) %>%
  step_date(Plucked) %>%
  step_dummy(Species)

iris_workflow <- workflow() %>%
  add_model(iris_rf_model) %>%
  add_recipe(base_rec)

iris_rf_wkfl_fit <- iris_workflow %>%
  last_fit(iris_split)
#> x train/test split: preprocessor 1/1: Error: Assigned data `pred_vals` must be compatible wi...
#> Warning: All models failed. See the `.notes` column.
Created on 2021-06-15 by the reprex package (v2.0.0)

Here is the reprex that does not throw an error:

library(tidyverse)
library(tidymodels)

iris[45, 2] <- as.numeric(NA)
iris[37, 3] <- as.numeric(NA)
iris[78, 4] <- as.numeric(NA)
iris[9, 5] <- as.numeric(NA)

set.seed(123)

iris_split <- iris %>% 
  initial_split(strata = Sepal.Length)

iris_training <- training(iris_split)
iris_testing <- testing(iris_split)

iris_rf_model <- rand_forest(
  mtry = 5,
  min_n = 5,
  trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")


base_rec <- recipe(Sepal.Length ~ .,
                   data = iris_training) %>% 
  step_impute_knn(Sepal.Width, Petal.Length, Petal.Width, Species) %>%
  step_dummy(Species)

iris_workflow <- workflow() %>% 
  add_model(iris_rf_model) %>% 
  add_recipe(base_rec)

iris_rf_wkfl_fit <- iris_workflow %>%
  last_fit(split = iris_split)
Created on 2021-06-15 by the reprex package (v2.0.0)

Thanks in advance! M.

Mischa
  • I don't think that `step_impute_knn()` works on dates, but I believe that [`step_impute_linear()`](https://recipes.tidymodels.org/reference/step_impute_linear.html) will. Give that a try! – Julia Silge Jun 17 '21 at 23:01
  • Hi @JuliaSilge! Thank you for your comment (and your wonderful screencasts). If I use `step_impute_linear` on the column `Plucked`, unfortunately there is still an error, though a different one. I get: `"preprocessor 1/1: Error: Variable 'Plucked' chosen for linear regression imputation must be of type numeric."` – Mischa Jun 18 '21 at 08:30

1 Answer

I suspect that step_impute_knn does not work on Date columns. You might have to convert the column into a factor first. Can you try the code below?

iris_n <- iris %>%
  mutate(Plucked = sample(seq(as.Date("1999/01/01"), as.Date("2000/01/01"),
    by = "day"
  ), size = 150))  %>% 
  mutate(Plucked = as.factor(Plucked)) # convert the Date column into a factor

iris_n[45, 2] <- NA
iris_n[37, 3] <- NA
iris_n[78, 4] <- NA
iris_n[9, 5] <- NA
iris_n[15, 6] <- NA

set.seed(456)

iris_split <- iris_n %>%
  initial_split(strata = Sepal.Length)


iris_training <- training(iris_split)
iris_testing <- testing(iris_split)

iris_rf_model <- rand_forest(
  mtry = 10,
  min_n = 10,
  trees = 500
) %>%
  set_engine("ranger") %>%
  set_mode("regression")


base_rec <- recipe(Sepal.Length ~ .,
  data = iris_training
) %>%
  step_impute_knn(Sepal.Width, Petal.Length, Petal.Width, Species, Plucked) %>%
  # step_date(Plucked) %>% # might not be needed anymore, since Plucked is now a factor
  step_dummy(Species)

iris_workflow <- workflow() %>%
  add_model(iris_rf_model) %>%
  add_recipe(base_rec)

iris_rf_wkfl_fit <- iris_workflow %>%
  last_fit(iris_split)
marqui
  • Hi @marqui, thank you for your answer. Unfortunately your code fails for me. The last line of code causes this error: `Error in summary.connection(connection) : invalid connection` – Mischa Jun 17 '21 at 11:08
  • That's strange, I can access the `iris_rf_wkfl_fit` object without issues. From a quick Google search, your error might be related to parallel computing jobs. Can you try a) cleaning your workspace, restarting RStudio and running the workflow again, and b) maybe also updating to the latest tidymodels version (I am using 0.1.3.9000)? – marqui Jun 17 '21 at 15:43
  • Thanks, with an update the error disappeared. I am not sure how I will continue. Since my dataset is pretty big, it is not really an option to convert the date column to a factor and do one-hot encoding (or something similar). But thank you anyway. – Mischa Jun 18 '21 at 12:38
  • You're welcome. Depending on your use case, you can also think about converting it into a numeric value, e.g. milliseconds since some time origin. – marqui Jun 18 '21 at 15:25
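
The numeric conversion suggested in the last comment might look roughly like the sketch below. It is only an illustration under some assumptions, not a confirmed fix for the original dataset: the Date is turned into the number of days since 1970-01-01 with `step_mutate()`, imputed as an ordinary numeric column, and then converted back so that `step_date()` can still be applied. The names `iris_dates`, `date_rec` and `baked` are made up for this example, and rounding the imputed value to a whole day is a choice made here, not something recipes requires.

```r
library(recipes)

set.seed(456)

# Hypothetical example data: iris plus a Date column with one missing value
iris_dates <- iris
iris_dates$Plucked <- sample(
  seq(as.Date("1999-01-01"), as.Date("2000-01-01"), by = "day"),
  size = nrow(iris_dates)
)
iris_dates$Plucked[15] <- NA

date_rec <- recipe(Sepal.Length ~ ., data = iris_dates) %>%
  # Date -> days since 1970-01-01, so the column is a plain double
  step_mutate(Plucked = as.numeric(Plucked)) %>%
  # KNN imputation now only sees a numeric column
  step_impute_knn(Plucked) %>%
  # back to Date; round() snaps the imputed value to a whole day
  step_mutate(Plucked = as.Date(round(Plucked), origin = "1970-01-01")) %>%
  step_date(Plucked)

baked <- bake(prep(date_rec), new_data = NULL)

# The previously missing entry should now hold an imputed date
baked$Plucked[15]
```

This avoids the one-hot encoding that a factor conversion would need on a large dataset. Whether a KNN average of neighbouring dates is a sensible imputation depends on the data.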