
I noticed that, when training with certain engines (e.g. keras and xgboost), the recipe returns more ys than Xs.

Here you'll find a minimal reproducible example:

library(themis)
library(recipes)
library(tune)
library(parsnip)
library(workflows)
library(dials)
library(rsample)

xg_mod <- parsnip::boost_tree(mode = "classification",
                              trees = tune(),    
                              tree_depth = tune(),    
                              min_n = tune(),         
                              loss_reduction = tune(),
                              learn_rate = tune()) %>%
    set_engine("xgboost")

xg_grid <- grid_latin_hypercube(over_ratio(range = c(0,1)),
                                trees(),
                                tree_depth(),
                                min_n(),
                                loss_reduction(),
                                learn_rate(),
                                size = 5)

my_recipe <- recipe(class ~ ., data = circle_example) %>%
    step_rose(class, over_ratio = tune())

workflow() %>%
    add_model(xg_mod) %>%
    add_recipe(my_recipe) %>%
    tune_grid(resamples = mc_cv(circle_example, strata = class),
                        grid = xg_grid)

The resulting error is:

Error in data.frame(ynew, Xnew): arguments imply differing number of rows: 385, 386
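A minimal sketch (not part of the original question) of why tuning `over_ratio` is special here: prepping the ROSE step with different fixed ratios changes how many rows come out of the recipe, so the processed data size varies across tuning candidates. Column names `x`, `y`, and `class` are from themis's `circle_example` dataset; `rows_for()` is a hypothetical helper.

```r
library(recipes)
library(themis)

data(circle_example, package = "themis")
set.seed(1234)

# Hypothetical helper: prep a step_rose() recipe with a fixed over_ratio
# and count the rows of the processed training data.
rows_for <- function(ratio) {
  recipe(class ~ x + y, data = circle_example) %>%
    step_rose(class, over_ratio = ratio) %>%
    prep() %>%
    bake(new_data = NULL) %>%   # training data with the ROSE step applied
    nrow()
}

rows_for(0.5)
rows_for(1)
```

The two calls return different row counts, which is the kind of size change the internal `data.frame(ynew, Xnew)` call appears to stumble over.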

Marco Repetto

1 Answer


It is related to tuning the over_ratio: if you skip tuning it, the example runs with no errors.

library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 0.1.1   
library(themis)
data(iris)

iris_imbalance <- iris %>%
  filter(Species != "setosa") %>% 
  slice_sample(n = 60, weight_by = case_when(
                                    Species == "virginica" ~ 60,
                                    TRUE ~ 1)) %>% 
  mutate(Species = factor(Species))

xg_mod <- parsnip::boost_tree(mode = "classification",
                             trees = tune(),    
                             tree_depth = tune(),    
                             min_n = tune(),         
                             loss_reduction = tune(),
                             learn_rate = tune()) %>%
  set_engine("xgboost")

xg_grid <- grid_latin_hypercube(#over_ratio(range = c(0,1)),
                                trees(),
                                tree_depth(),
                                min_n(),
                                loss_reduction(),
                                learn_rate(),
                                size = 5)

my_recipe <- recipe(Species ~ ., data = iris_imbalance) %>%
  step_rose(Species) #, over_ratio = tune())

workflow() %>%
  add_model(xg_mod) %>%
  add_recipe(my_recipe) %>%
  tune_grid(resamples = mc_cv(iris_imbalance, strata = Species),
            grid = xg_grid)
#> # Tuning results
#> # Monte Carlo cross-validation (0.75/0.25) with 25 resamples  using stratification 
#> # A tibble: 25 x 4
#>    splits          id         .metrics          .notes          
#>    <list>          <chr>      <list>            <list>          
#>  1 <split [46/14]> Resample01 <tibble [10 × 9]> <tibble [0 × 1]>
#>  2 <split [46/14]> Resample02 <tibble [10 × 9]> <tibble [0 × 1]>
#>  3 <split [46/14]> Resample03 <tibble [10 × 9]> <tibble [0 × 1]>
#>  4 <split [46/14]> Resample04 <tibble [10 × 9]> <tibble [0 × 1]>
#>  5 <split [46/14]> Resample05 <tibble [10 × 9]> <tibble [0 × 1]>
#>  6 <split [46/14]> Resample06 <tibble [10 × 9]> <tibble [0 × 1]>
#>  7 <split [46/14]> Resample07 <tibble [10 × 9]> <tibble [0 × 1]>
#>  8 <split [46/14]> Resample08 <tibble [10 × 9]> <tibble [0 × 1]>
#>  9 <split [46/14]> Resample09 <tibble [10 × 9]> <tibble [0 × 1]>
#> 10 <split [46/14]> Resample10 <tibble [10 × 9]> <tibble [0 × 1]>
#> # … with 15 more rows

Created on 2020-11-13 by the reprex package (v0.3.0)

hnagaty
  • I know; the problem is that I would like to tune the over_ratio – Marco Repetto Nov 13 '20 at 09:37
  • I guess this is inherent to the way that `tune_grid()` works; you can't tune a parameter that changes the number of rows. This is just my guess or assumption. – hnagaty Nov 13 '20 at 10:53
  • The thing is that if I do it with step_smote or simply step_upsample it works. – Marco Repetto Nov 13 '20 at 11:15
  • Yeah, something strange is going on here. I saw that [you posted a GitHub issue](https://github.com/tidymodels/themis/issues/36), and we can follow up there. Thank you for bringing this to our attention! – Julia Silge Nov 14 '20 at 00:54
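Building on the comment that step_smote and step_upsample do work, here is a hedged sketch of that workaround (an assumption based on the comments, not a confirmed fix): the same tuning setup, but with `step_upsample()` in place of `step_rose()`, tuning `over_ratio` without the row-count error. Grid and resample sizes are kept small purely for speed.

```r
library(tidymodels)
library(themis)

data(circle_example, package = "themis")
set.seed(1234)

# Same kind of model as in the question, with fewer tuned parameters
# to keep this sketch quick.
xg_mod <- boost_tree(mode = "classification",
                     trees = tune(),
                     min_n = tune(),
                     learn_rate = tune()) %>%
  set_engine("xgboost")

# step_upsample() also exposes over_ratio, and tuning it is reported
# to work in the comments above.
up_recipe <- recipe(class ~ x + y, data = circle_example) %>%
  step_upsample(class, over_ratio = tune())

up_grid <- grid_latin_hypercube(over_ratio(range = c(0.5, 1)),
                                trees(range = c(10L, 50L)),  # small for speed
                                min_n(),
                                learn_rate(),
                                size = 2)

res <- workflow() %>%
  add_model(xg_mod) %>%
  add_recipe(up_recipe) %>%
  tune_grid(resamples = mc_cv(circle_example, strata = class, times = 2),
            grid = up_grid)

res
```

This tunes the sampling ratio and the model hyperparameters together, as the question intended, just with a different oversampling step.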