I am applying a WOE (weight of evidence) transformation to my features, using 'step_woe' from the 'embed' package within the 'recipes' framework. By default it takes the value 0 as the reference, so the WOE values come out with the opposite sign from what I need.

I tried releveling the target to set "1" as the reference level, but the results are identical: the direction of the WOE values does not change. How can I get the WOE computed with "1" as the reference?

Here is an example. First I create an example dataset with a binary target (0s and 1s) and one feature ('yes', 'no') that is perfectly related to the target. Then I apply the step_woe transformation twice, setting the reference level to '0' in one recipe and to '1' in the other, and compare the results: there is no difference.


library(tidyverse)
library(recipes)
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stringr':
#> 
#>     fixed
#> The following object is masked from 'package:stats':
#> 
#>     step
library(embed)
  
example_df <- 
  tibble(
    target  = rbinom(1000, 1, 0.5),
    feature = ifelse(target == 1, "yes", "no")
  ) %>% 
  mutate_all(as.factor) %>% 
  print()
#> # A tibble: 1,000 x 2
#>    target feature
#>    <fct>  <fct>  
#>  1 0      no     
#>  2 1      yes    
#>  3 0      no     
#>  4 0      no     
#>  5 1      yes    
#>  6 0      no     
#>  7 1      yes    
#>  8 1      yes    
#>  9 0      no     
#> 10 0      no     
#> # … with 990 more rows

woe_recipe_0 <- 
  recipe(target ~ feature, data = example_df) %>% 
  step_relevel(target, ref_level = "0") %>% 
  embed::step_woe(all_nominal_predictors(), outcome = "target") %>% 
  prep(., retain = FALSE)

tidy(woe_recipe_0, number = 2)
#> # A tibble: 2 x 10
#>   terms   value n_tot   n_0   n_1   p_0   p_1   woe outcome id       
#>   <chr>   <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>   <chr>    
#> 1 feature no      493   493     0     1     0  20.0 target  woe_nY7AB
#> 2 feature yes     507     0   507     0     1 -20.0 target  woe_nY7AB

woe_recipe_1 <- 
  recipe(target ~ feature, data = example_df) %>% 
  step_relevel(target, ref_level = "1") %>% 
  embed::step_woe(all_nominal_predictors(), outcome = "target") %>% 
  prep(., retain = FALSE)

tidy(woe_recipe_1, number = 2)
#> # A tibble: 2 x 10
#>   terms   value n_tot   n_0   n_1   p_0   p_1   woe outcome id       
#>   <chr>   <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>   <chr>    
#> 1 feature no      493   493     0     1     0  20.0 target  woe_Lt6pK
#> 2 feature yes     507     0   507     0     1 -20.0 target  woe_Lt6pK

sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Red Hat Enterprise Linux
#> 
#> Matrix products: default
#> BLAS: /opt/R/3.5.1/lib64/R/lib/libRblas.so
#> LAPACK: /opt/R/3.5.1/lib64/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] embed_0.1.5     recipes_0.1.17  forcats_0.4.0   stringr_1.4.0  
#>  [5] dplyr_1.0.7     purrr_0.3.4     readr_1.3.1     tidyr_1.1.2    
#>  [9] tibble_3.0.4    ggplot2_3.3.5   tidyverse_1.3.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] httr_1.4.1            jsonlite_1.6          splines_3.5.1        
#>  [4] prodlim_2019.11.13    modelr_0.1.5          RcppParallel_5.0.2   
#>  [7] assertthat_0.2.1      highr_0.8             cellranger_1.1.0     
#> [10] yaml_2.2.0            ipred_0.9-12          pillar_1.6.2         
#> [13] backports_1.2.1       lattice_0.20-35       glue_1.5.1           
#> [16] reticulate_1.13       digest_0.6.27         rvest_0.3.5          
#> [19] colorspace_2.0-0      htmltools_0.4.0       Matrix_1.2-14        
#> [22] timeDate_3043.102     pkgconfig_2.0.3       broom_0.7.6          
#> [25] haven_2.2.0           scales_1.1.0          whisker_0.4          
#> [28] gower_0.2.1           lava_1.6.6            generics_0.1.0       
#> [31] ellipsis_0.3.2        withr_2.4.1           keras_2.2.5.0        
#> [34] nnet_7.3-12           cli_2.4.0             survival_2.42-3      
#> [37] magrittr_2.0.1        crayon_1.4.1          readxl_1.3.1         
#> [40] evaluate_0.14         fs_1.3.1              fansi_0.4.2          
#> [43] MASS_7.3-51.4         xml2_1.2.2            class_7.3-14         
#> [46] tools_3.5.1           hms_1.1.1             lifecycle_1.0.1      
#> [49] munsell_0.5.0         reprex_0.3.0          compiler_3.5.1       
#> [52] rlang_0.4.12          grid_3.5.1            rstudioapi_0.11      
#> [55] base64enc_0.1-3       rmarkdown_1.18        gtable_0.3.0         
#> [58] DBI_1.1.1             R6_2.5.0              tfruns_1.4           
#> [61] lubridate_1.7.4       knitr_1.26            tensorflow_2.0.0     
#> [64] uwot_0.1.5            utf8_1.2.1            zeallot_0.1.0        
#> [67] stringi_1.4.3         Rcpp_1.0.7            vctrs_0.3.8          
#> [70] rpart_4.1-15          dbplyr_2.1.1          tidyselect_1.1.1.9000
#> [73] xfun_0.11

Created on 2022-02-02 by the reprex package (v0.3.0)
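
To make the expected behaviour concrete, here is a minimal base R sketch of the same computation done by hand in both directions. It assumes WOE is the log of the ratio of the two class shares, with a small constant added to each count to avoid log(0). That smoothing form is an assumption chosen because it gives values close to the ±20 shown above; it is not confirmed to be what 'embed' does internally. The helper `woe_by_hand()` and its arguments are hypothetical names used only for illustration.

```r
# Hand computation of WOE for one binary feature, in both directions.
# Assumption: WOE = log(share of one class / share of the other class),
# with a small constant added to each count to avoid log(0).
# woe_by_hand() is a hypothetical helper, not part of 'embed' or 'recipes'.
woe_by_hand <- function(feature, target, numerator, smooth = 1e-6) {
  counts <- table(feature, target)
  denominator <- setdiff(colnames(counts), numerator)
  # Share of each feature level within the numerator / denominator class
  p_num <- (counts[, numerator] + smooth) / sum(counts[, numerator])
  p_den <- (counts[, denominator] + smooth) / sum(counts[, denominator])
  log(p_num / p_den)
}

# Same class counts as in the tidy() output above: 493 zeros, 507 ones
target_num <- rep(c(0, 1), times = c(493, 507))
target     <- factor(target_num)
feature    <- factor(ifelse(target_num == 1, "yes", "no"))

woe_ref_0 <- woe_by_hand(feature, target, numerator = "0")
woe_ref_1 <- woe_by_hand(feature, target, numerator = "1")

print(woe_ref_0)  # "no" is positive, "yes" is negative
print(woe_ref_1)  # signs are reversed relative to woe_ref_0
```

Under this assumption, swapping which class sits in the numerator only negates every WOE value, which is the change I expected from releveling the target.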

  • In my real project I have hundreds of features, most with many levels, and it still does not work with step_relevel or even with the standard relevel functions. I will try your example, but the number of levels should not matter: the WOE values should still reverse depending on the reference level. – martin_hulin Feb 02 '22 at 16:31
  • Sorry, there was a mistake in my example, as I called `sample` repeatedly without `set.seed` – akrun Feb 02 '22 at 16:43
  • This [is a bug](https://github.com/tidymodels/embed/issues/109) in how the WOE dictionary is computed. Thank you so much for this report! You may be able to [use the custom dictionary functionality](https://embed.tidymodels.org/reference/dictionary.html) as a workaround for now. – Julia Silge Feb 23 '22 at 22:36
