My data is in data_tbl.xlsx. I don't know how to upload the file here.

I get an error when I try to fit a workflow to my training data. I don't understand what the error means or why it occurs.

Here is my juiced data (recipe_num_only recipe): juiced_recipe.xlsx

Here is my splits object:

splits <- initial_time_split(
  data_final_tbl
  , prop = 0.8
  , cumulative = TRUE
)

Here are my recipes (the one in question is recipe_num_only):

# Features ----------------------------------------------------------------

recipe_base <- recipe(value ~ ., data = training(splits))

recipe_date <- recipe_base %>%
  step_timeseries_signature(date_col) %>%
  step_rm(matches("(iso$)|(xts$)|(hour)|(min)|(sec)|(am.pm)")) %>%
  step_normalize(contains("index.num"), contains("date_col_year"))

recipe_fourier <- recipe_date %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_fourier(date_col, period = 365/12, K = 1) %>%
  step_YeoJohnson(value, limits = c(0,1))

recipe_fourier_final <- recipe_fourier %>%
  step_nzv(all_predictors())

recipe_pca <- recipe_base %>%
  step_timeseries_signature(date_col) %>%
  step_rm(matches("(iso$)|(xts$)|(hour)|(min)|(sec)|(am.pm)")) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_normalize(value) %>%
  step_fourier(date_col, period = 365/52, K = 1) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_pca(
    all_numeric_predictors(), -date_col_index.num
    , threshold = .95
  )

recipe_num_only <- recipe_pca %>%
  step_rm(-all_numeric_predictors())

Here is my XGBoost model spec, followed by the test fit:

# XGBoost -----------------------------------------------------------------

model_spec_boost <- boost_tree(
  mode  = "regression"
  # , mtry  = 25
  # , trees = 25
  # , min_n = 10
  # , tree_depth = 2
  # , learn_rate = 0.3
  # , loss_reduction = 0.01
) %>%
  set_engine("xgboost")

# * * Testing ----
set.seed(123)
workflow() %>%
  add_model(model_spec_boost) %>%
  add_recipe(recipe_num_only) %>%
  fit(training(splits))
# * * End Test ----

The error I get is the following:

> workflow() %>%
+   add_model(model_spec_boost) %>%
+   add_recipe(recipe_num_only) %>%
+   fit(training(splits))
Error in setinfo.xgb.DMatrix(dmat, names(p), p[[1]]) : 
  The length of labels must equal to the number of rows in the input data
Timing stopped at: 0 0 0

Everything works until I call fit(training(splits)). Building the workflow without fitting prints as expected:

> workflow() %>%
+   add_model(model_spec_boost) %>%
+   add_recipe(recipe_num_only)
== Workflow ==========================================================================================
Preprocessor: Recipe
Model: boost_tree()

-- Preprocessor --------------------------------------------------------------------------------------
9 Recipe Steps

* step_timeseries_signature()
* step_rm()
* step_dummy()
* step_normalize()
* step_fourier()
* step_normalize()
* step_nzv()
* step_pca()
* step_rm()

-- Model ---------------------------------------------------------------------------------------------
Boosted Tree Model Specification (regression)

Computational engine: xgboost 

I'm at a bit of a loss here.

MCP_infiltrator

1 Answer

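I had accidentally removed my outcome variable (value) while building the recipe. The selector -all_numeric_predictors() matches every column that is not a numeric predictor, and that includes the outcome, which is consistent with xgboost complaining about the length of the labels.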

I had this:

recipe_num_only <- recipe_pca %>%
  step_rm(-all_numeric_predictors())
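
To illustrate, here is a minimal sketch on made-up data. The table toy_tbl and its columns x1 and grp are hypothetical stand-ins for the real training set, and only the recipes package is assumed. It preps and bakes a recipe that uses the same negated selector, then checks whether the outcome column survives:

```r
library(recipes)

# Hypothetical toy data standing in for training(splits)
set.seed(123)
toy_tbl <- data.frame(
  value = rnorm(20),                      # outcome
  x1    = rnorm(20),                      # numeric predictor
  grp   = factor(rep(c("a", "b"), 10))    # nominal predictor
)

# Same pattern as recipe_num_only: remove everything that is
# NOT a numeric predictor. The outcome is not a predictor,
# so it is selected for removal as well.
rec_broken <- recipe(value ~ ., data = toy_tbl) %>%
  step_rm(-all_numeric_predictors())

# Bake the training data to see what the model would receive
baked_broken <- bake(prep(rec_broken), new_data = NULL)

names(baked_broken)
# FALSE here means the outcome was dropped by step_rm()
"value" %in% names(baked_broken)
```

Running the same check on the real recipe (bake(prep(recipe_num_only), new_data = NULL)) is a quick way to confirm what the workflow actually hands to the model.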

I changed it to this:

recipe_num_only <- recipe_pca %>%
  step_rm(-value, -all_numeric_predictors())
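
As a rough check that this keeps the outcome, the sketch below bakes the corrected selector on made-up data (toy_tbl, x1 and grp are hypothetical names, not columns from my real data). It also tries an alternative spelling, step_rm(all_predictors(), -all_numeric_predictors()), which limits the removal to predictors. That variant is my own suggestion rather than something the original recipe used:

```r
library(recipes)

# Hypothetical toy data standing in for training(splits)
set.seed(123)
toy_tbl <- data.frame(
  value = rnorm(20),                      # outcome
  x1    = rnorm(20),                      # numeric predictor
  grp   = factor(rep(c("a", "b"), 10))    # nominal predictor
)

# Corrected selector: exclude the outcome from the removal set
rec_fixed <- recipe(value ~ ., data = toy_tbl) %>%
  step_rm(-value, -all_numeric_predictors())

# Alternative (assumption, not from the original post): start from
# predictors only, then keep the numeric ones
rec_alt <- recipe(value ~ ., data = toy_tbl) %>%
  step_rm(all_predictors(), -all_numeric_predictors())

baked_fixed <- bake(prep(rec_fixed), new_data = NULL)
baked_alt   <- bake(prep(rec_alt),   new_data = NULL)

# Both should retain value and x1 and drop grp
names(baked_fixed)
names(baked_alt)
```

The second form does not depend on the outcome being named value, so it may be easier to reuse if the outcome column changes.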
MCP_infiltrator