While trying to add a recipe column to a tibble, following the steps in this rsample (tidymodels) article, I got the following error message:
Error: Not all variables in the recipe are present in the supplied training set: 'ticker', 'ret_3m', 'lead_ret', 'p_l', 'vpa', 'lpa', 'roe', 'payout', 'dy_12m', 'p_vpa', 'ativo_circulante', 'lc', 'divida_bruta', 'qt_on', 'ret_ibov_3m', 'volume_3m', 'volat_3m', 'alvo'.
Here is a sample of the data I'm using: sample_data
First step was to nest the tibble by quarter:
# Nest every column except quarter into a list-column named "data"
df_tri_nested <- sample_data %>%
  nest(data = -quarter)
> df_tri_nested
# A tibble: 55 x 2
quarter data
<fct> <list>
1 2017.1 <tibble [160 x 19]>
2 2017.2 <tibble [162 x 19]>
3 2017.3 <tibble [160 x 19]>
4 2017.4 <tibble [161 x 19]>
5 2018.1 <tibble [173 x 19]>
6 2018.2 <tibble [165 x 19]>
7 2018.3 <tibble [158 x 19]>
8 2018.4 <tibble [169 x 19]>
9 2019.1 <tibble [167 x 19]>
10 2019.2 <tibble [164 x 19]>
# ... with 45 more rows
The second step was to create rolling windows of 8 periods each, as demonstrated here:
df_tri_roll <- rolling_origin(df_tri_nested, initial = 8, cumulative = FALSE)
> df_tri_roll
# Rolling origin forecast resampling
# A tibble: 47 x 2
splits id
<list> <chr>
1 <split [8/1]> Slice01
2 <split [8/1]> Slice02
3 <split [8/1]> Slice03
4 <split [8/1]> Slice04
5 <split [8/1]> Slice05
6 <split [8/1]> Slice06
7 <split [8/1]> Slice07
8 <split [8/1]> Slice08
9 <split [8/1]> Slice09
10 <split [8/1]> Slice10
# ... with 37 more rows
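One thing that may be worth checking at this point is what each split actually contains. The sketch below uses a small hypothetical tibble (`toy`, with made-up columns `x` and `alvo`) in place of `sample_data`, and assumes the same nest-then-roll steps as above. It only illustrates that the analysis set of a split built from a nested tibble is itself nested, with one row per quarter.

```r
library(rsample)
library(tidyr)
library(tibble)

# Hypothetical toy data standing in for sample_data: 10 quarters, 3 rows each
toy <- tibble(
  quarter = factor(rep(sprintf("Q%02d", 1:10), each = 3)),
  x       = as.numeric(seq_len(30)),
  alvo    = factor(rep(c("a", "b", "a"), times = 10))
)

toy_nested <- nest(toy, data = -quarter)
toy_roll   <- rolling_origin(toy_nested, initial = 8, cumulative = FALSE)

# The analysis set is still nested: its columns are the nested tibble's columns
first_analysis <- analysis(toy_roll$splits[[1]])
colnames(first_analysis)

# Unnesting the list-column restores the original row-level columns
first_flat <- unnest(first_analysis, cols = data)
colnames(first_flat)
```

In this toy case the first call shows only `quarter` and `data`, while the second shows the original column names again.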
Third, the recipe:
recipe <- recipe(alvo ~ .,
                 data = sample_data) %>%
  update_role(ticker, data, ret_3m, lead_ret,
              ret_ibov_3m, volume_3m, volat_3m, quarter,
              new_role = "ID") %>%
  step_log(c(ativo_circulante, divida_bruta, dy_12m, lc, qt_on),
           signed = TRUE) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())
Finally, when I try to add a recipe column to the tibble, using the command below, I get the error message above.
df_tri_roll$recipe <- map(df_tri_roll$splits, prepper, recipe = recipe)
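A possible workaround, sketched under the assumption that the recipe is meant to be prepped on the row-level data of each window: `prepper()` calls `prep()` on `analysis(split)`, so the hypothetical helper `prep_unnested()` below does the same but unnests the `data` list-column first. The toy tibble and every name other than the rsample, tidyr, purrr, and recipes functions are illustrative and not taken from the original data.

```r
library(recipes)
library(rsample)
library(tidyr)
library(purrr)
library(tibble)

# Hypothetical toy data standing in for sample_data
toy <- tibble(
  quarter = factor(rep(sprintf("Q%02d", 1:10), each = 3)),
  x       = as.numeric(seq_len(30)),
  alvo    = factor(rep(c("a", "b", "a"), times = 10))
)

toy_rec <- recipe(alvo ~ ., data = toy) %>%
  update_role(quarter, new_role = "ID") %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

toy_roll <- rolling_origin(nest(toy, data = -quarter),
                           initial = 8, cumulative = FALSE)

# Hypothetical helper: like prepper(), but flattens the nested window first
prep_unnested <- function(split_obj, recipe, ...) {
  training <- unnest(analysis(split_obj), cols = data)
  prep(recipe, training = training, fresh = TRUE, ...)
}

toy_roll$recipe <- map(toy_roll$splits, prep_unnested, recipe = toy_rec)
```

Whether this is the right fix depends on how the windows are meant to be modeled; the sketch only shows that prepping succeeds once the training set has the same columns as the data the recipe was defined on.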
While searching for the cause, I found the source of the function that throws the error (check_training_set) here.
I reproduced the function's checks against sample_data and got the following results:
> (vars <- unique(recipe$var_info$variable))
[1] "ticker" "data" "quarter" "ret_3m"
[5] "lead_ret" "p_l" "vpa" "lpa"
[9] "roe" "payout" "dy_12m" "p_vpa"
[13] "ativo_circulante" "lc" "divida_bruta" "qt_on"
[17] "ret_ibov_3m" "volume_3m" "volat_3m" "alvo"
> (in_data <- vars %in% colnames(sample_data))
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[19] TRUE TRUE
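Note that the check above compares against `sample_data`, whereas `prepper()` passes `analysis(split)` to `prep()` as the training set. The error message also lists every recipe variable except `quarter` and `data`, which happen to be the two column names of the nested tibble. The sketch below repeats the same check against the analysis set of a split, using a hypothetical toy tibble (`toy`) and recipe (`toy_rec`) rather than the real data.

```r
library(recipes)
library(rsample)
library(tidyr)
library(tibble)

# Hypothetical toy data standing in for sample_data
toy <- tibble(
  quarter = factor(rep(sprintf("Q%02d", 1:10), each = 3)),
  x       = as.numeric(seq_len(30)),
  alvo    = factor(rep(c("a", "b", "a"), times = 10))
)

toy_rec  <- recipe(alvo ~ ., data = toy)
toy_roll <- rolling_origin(nest(toy, data = -quarter),
                           initial = 8, cumulative = FALSE)

# Same check as above, but against what prep() would receive from prepper()
vars         <- unique(toy_rec$var_info$variable)
training     <- analysis(toy_roll$splits[[1]])
missing_vars <- vars[!vars %in% colnames(training)]
missing_vars
```

In the toy case, every recipe variable other than `quarter` is reported as missing, which has the same shape as the original error.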
It also does not seem to be a problem with the roles of the variables:
> summary(recipe)
# A tibble: 20 x 4
variable type role source
<chr> <chr> <chr> <chr>
1 ticker nominal ID original
2 data date ID original
3 quarter nominal ID original
4 ret_3m numeric ID original
5 lead_ret numeric ID original
6 p_l numeric predictor original
7 vpa numeric predictor original
8 lpa numeric predictor original
9 roe numeric predictor original
10 payout numeric predictor original
11 dy_12m numeric predictor original
12 p_vpa numeric predictor original
13 ativo_circulante numeric predictor original
14 lc numeric predictor original
15 divida_bruta numeric predictor original
16 qt_on numeric predictor original
17 ret_ibov_3m numeric ID original
18 volume_3m numeric ID original
19 volat_3m numeric ID original
20 alvo nominal outcome original
sessionInfo():
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
other attached packages:
[1] vctrs_0.3.8 rlang_0.4.11 forcats_0.5.1 stringr_1.4.0
[5] readr_1.4.0 tidyverse_1.3.1 yardstick_0.0.8 workflowsets_0.0.2
[9] workflows_0.2.2 tune_0.1.5 tidyr_1.1.3 tibble_3.1.2
[13] rsample_0.1.0 recipes_0.1.16 purrr_0.3.4 parsnip_0.1.6
[17] modeldata_0.1.0 infer_0.5.4 ggplot2_3.3.3 dplyr_1.0.6
[21] dials_0.0.9 scales_1.1.1 broom_0.7.6 tidymodels_0.1.3
I'm really out of ideas and hope you can help; otherwise I'll have to fit about 270 models by hand.