While trying to add a recipe column to a tibble, following the steps of this Rsample Tidymodels article, I got the following error message:

Error: Not all variables in the recipe are present in the supplied training set: 'ticker', 'ret_3m', 'lead_ret', 'p_l', 'vpa', 'lpa', 'roe', 'payout', 'dy_12m', 'p_vpa', 'ativo_circulante', 'lc', 'divida_bruta', 'qt_on', 'ret_ibov_3m', 'volume_3m', 'volat_3m', 'alvo'.

Here is a sample of the data I'm using: sample_data

First step was to nest the tibble by quarter:

df_tri_nested <- sample_data %>%
     nest(data = -quarter)
> df_tri_nested
# A tibble: 55 x 2
   quarter data               
   <fct>   <list>             
 1 2017.1  <tibble [160 x 19]>
 2 2017.2  <tibble [162 x 19]>
 3 2017.3  <tibble [160 x 19]>
 4 2017.4  <tibble [161 x 19]>
 5 2018.1  <tibble [173 x 19]>
 6 2018.2  <tibble [165 x 19]>
 7 2018.3  <tibble [158 x 19]>
 8 2018.4  <tibble [169 x 19]>
 9 2019.1  <tibble [167 x 19]>
10 2019.2  <tibble [164 x 19]>
# ... with 45 more rows

Second step was to create rolling tibbles for every 8 periods, as demonstrated here:

df_tri_roll <- rolling_origin(df_tri_nested, initial = 8, cumulative = FALSE)
> df_tri_roll
# Rolling origin forecast resampling 
# A tibble: 47 x 2
   splits        id     
   <list>        <chr>  
 1 <split [8/1]> Slice01
 2 <split [8/1]> Slice02
 3 <split [8/1]> Slice03
 4 <split [8/1]> Slice04
 5 <split [8/1]> Slice05
 6 <split [8/1]> Slice06
 7 <split [8/1]> Slice07
 8 <split [8/1]> Slice08
 9 <split [8/1]> Slice09
10 <split [8/1]> Slice10
# ... with 37 more rows

Third, the recipe:

recipe <- recipe(alvo ~ .,
                 data = sample_data) %>%
     update_role(ticker, data, ret_3m, lead_ret,
                   ret_ibov_3m, volume_3m, volat_3m, quarter,
                 new_role = "ID") %>%
     step_log(c(ativo_circulante, divida_bruta, dy_12m, lc, qt_on), 
              signed = TRUE) %>%
     step_center(all_predictors()) %>%
     step_scale(all_predictors())

Finally, when I try to add a recipe column to the tibble, using the command below, I get the error message above.

df_tri_roll$recipe <- map(df_tri_roll$splits, prepper, recipe = recipe)

While searching for the problem, I found the source of the function that throws the error (check_training_set) here. I repeated the checks it performs and got the following results:

> (vars <- unique(recipe$var_info$variable))
 [1] "ticker"           "data"             "quarter"          "ret_3m"          
 [5] "lead_ret"         "p_l"              "vpa"              "lpa"             
 [9] "roe"              "payout"           "dy_12m"           "p_vpa"           
[13] "ativo_circulante" "lc"               "divida_bruta"     "qt_on"           
[17] "ret_ibov_3m"      "volume_3m"        "volat_3m"         "alvo"            
> (in_data <- vars %in% colnames(sample_data))
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[19] TRUE TRUE

It also does not seem to be a problem with the roles of the variables:

> summary(recipe)
# A tibble: 20 x 4
   variable         type    role      source  
   <chr>            <chr>   <chr>     <chr>   
 1 ticker           nominal ID        original
 2 data             date    ID        original
 3 quarter          nominal ID        original
 4 ret_3m           numeric ID        original
 5 lead_ret         numeric ID        original
 6 p_l              numeric predictor original
 7 vpa              numeric predictor original
 8 lpa              numeric predictor original
 9 roe              numeric predictor original
10 payout           numeric predictor original
11 dy_12m           numeric predictor original
12 p_vpa            numeric predictor original
13 ativo_circulante numeric predictor original
14 lc               numeric predictor original
15 divida_bruta     numeric predictor original
16 qt_on            numeric predictor original
17 ret_ibov_3m      numeric ID        original
18 volume_3m        numeric ID        original
19 volat_3m         numeric ID        original
20 alvo             nominal outcome   original

sessionInfo:

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

other attached packages:
 [1] vctrs_0.3.8        rlang_0.4.11       forcats_0.5.1      stringr_1.4.0     
 [5] readr_1.4.0        tidyverse_1.3.1    yardstick_0.0.8    workflowsets_0.0.2
 [9] workflows_0.2.2    tune_0.1.5         tidyr_1.1.3        tibble_3.1.2      
[13] rsample_0.1.0      recipes_0.1.16     purrr_0.3.4        parsnip_0.1.6     
[17] modeldata_0.1.0    infer_0.5.4        ggplot2_3.3.3      dplyr_1.0.6       
[21] dials_0.0.9        scales_1.1.1       broom_0.7.6        tidymodels_0.1.3

I'm really out of ideas and hope you can help me, otherwise I'm gonna have to fit ~ 270 models by hand.

  • Please don't post external links to your data. Give an example using `dput(YourData)` or `dput(head(YourData))`. – Martin Gal Jun 06 '21 at 18:48

1 Answer

It doesn't work because your data is nested: when prepper applies the recipe to a split, it only sees the nested data frame (the quarter column and the list column), not the variables inside it. To make this easier to follow, sort your data frame by date first:

sample_data = sample_data[order(sample_data$data),]

This is what prepper will see:

df_tri_nested <- sample_data %>% nest(data = -quarter)
df_tri_roll <- rolling_origin(df_tri_nested, initial = 8, cumulative = FALSE)

head(analysis(df_tri_roll$splits[[1]]))
# A tibble: 6 x 2
  quarter data              
  <fct>   <list>            
1 2007.1  <tibble [32 × 19]>
2 2007.2  <tibble [26 × 19]>
3 2007.3  <tibble [36 × 19]>
4 2007.4  <tibble [45 × 19]>
5 2008.1  <tibble [43 × 19]>
6 2008.2  <tibble [52 × 19]>

Hence, apart from quarter, none of the variables in the recipe exist as columns of the split, and you get the error. To fit a model on the above you would need to unnest each split first, and prepper is not meant for that.
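
To make the mismatch concrete, here is a minimal sketch on made-up data. The `toy` data frame and the `prep_unnested` helper are hypothetical names for illustration, not part of rsample or recipes. It shows that the analysis set of a nested split only has the nesting columns, and one possible way to prep a recipe on the unnested rows instead of calling prepper directly:

```r
library(tidyr)
library(rsample)
library(recipes)

set.seed(1)
# Made-up data: 6 "quarters" with 5 rows each (assumption, not the asker's data)
toy <- data.frame(
  quarter = rep(paste0("Q", 1:6), each = 5),
  x = rnorm(30),
  y = rnorm(30)
)

toy_nested <- nest(toy, data = -quarter)
toy_roll <- rolling_origin(toy_nested, initial = 3, cumulative = FALSE)

# The analysis set only contains the nesting columns, not x or y
split_cols <- colnames(analysis(toy_roll$splits[[1]]))
print(split_cols)

toy_recipe <- recipe(y ~ x, data = toy) %>%
  step_center(all_predictors())

# Hypothetical helper: unnest the analysis rows, then prep the recipe on them
prep_unnested <- function(split, recipe) {
  training <- unnest(analysis(split), cols = c(data))
  prep(recipe, training = training)
}

toy_preps <- lapply(toy_roll$splits, prep_unnested, recipe = toy_recipe)
```

This keeps the nested rolling_origin splits, at the cost of writing the unnesting step yourself for every later use of the split (for example when baking the assessment set).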

Since your data has a time component, you most likely want to use one of rsample's time-based resampling functions, such as sliding_period:

sliding_df = sliding_period(sample_data, index = "data",
                            period = "quarter", lookback = 7)

We can check that the first analysis set has the same dimensions as the one from your original split once it is unnested:

dim(analysis(sliding_df$splits[[1]]))
[1] 324  20

dim(unnest(analysis(df_tri_roll$splits[[1]]),cols=c(data)))
[1] 324  20
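
The same comparison can be tried on a small made-up data set. This is only a sketch: the `toy` data frame and its `date`, `x` and `quarter` columns are assumptions for illustration, with one row per month of 2020 so that each calendar quarter has three rows.

```r
library(tidyr)
library(rsample)

# Made-up data: one observation per month of 2020 (assumption)
toy <- data.frame(
  date = seq(as.Date("2020-01-15"), by = "month", length.out = 12),
  x = 1:12
)
# Label each row with its calendar quarter, e.g. "2020.1"
toy$quarter <- paste0(
  format(toy$date, "%Y"), ".",
  (as.integer(format(toy$date, "%m")) - 1) %/% 3 + 1
)

# Nested approach: 2 quarters per analysis set
nested_roll <- rolling_origin(nest(toy, data = -quarter),
                              initial = 2, cumulative = FALSE)

# Sliding approach: current quarter plus 1 quarter of lookback
sliding <- sliding_period(toy, index = date, period = "quarter", lookback = 1)

from_nested <- unnest(analysis(nested_roll$splits[[1]]), cols = c(data))
from_sliding <- analysis(sliding$splits[[1]])

# Both first analysis sets cover the same rows
print(dim(from_nested))
print(dim(from_sliding))
```

Here `lookback = 1` plays the role of `lookback = 7` above: the analysis window is the current period plus that many earlier periods, so it corresponds to `initial = 2` in the nested version.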

Then run:

recipe <- recipe(alvo ~ .,
                 data = sample_data) %>%
     update_role(ticker, data, ret_3m, lead_ret,
                   ret_ibov_3m, volume_3m, volat_3m, quarter,
                 new_role = "ID") %>%
     step_log(c(ativo_circulante,divida_bruta, dy_12m, lc, qt_on), 
              signed = TRUE) %>%
     step_center(all_predictors()) %>%
     step_scale(all_predictors())

map(sliding_df$splits[1:2], prepper, recipe = recipe)


[[1]]
Data Recipe

Inputs:

      role #variables
        ID          8
   outcome          1
 predictor         11

Training data contained 324 data points and no missing data.

Operations:

Signed log transformation on ativo_circulante, divida_bruta, dy_12m, ... [trained]
Centering for p_l, vpa, lpa, roe, payout, dy_12m, ... [trained]
Scaling for p_l, vpa, lpa, roe, payout, dy_12m, ... [trained]

[[2]]
Data Recipe

Inputs:

      role #variables
        ID          8
   outcome          1
 predictor         11

Training data contained 337 data points and no missing data.

Operations:

Signed log transformation on ativo_circulante, divida_bruta, dy_12m, ... [trained]
Centering for p_l, vpa, lpa, roe, payout, dy_12m, ... [trained]
Scaling for p_l, vpa, lpa, roe, payout, dy_12m, ... [trained]
StupidWolf