I have a data frame in which I want to divide a specific set of predictors by the number of predictors in that set that are larger than zero, counted per row. When I include this operation in a recipe, it instead seems to divide by the total number of predictors in the set, ignoring the condition that they should be larger than zero.

Example:

library(tidymodels)  # loads dplyr, recipes, parsnip, workflows and broom

df <- data.frame(matrix(c(16, 8, 4, 2, 32, 16, 8, 4, 0, 32, 16, 8, 0, 0, 32, 16, 0, 0, 0, 32), 4, 5))

  X1 X2 X3 X4 X5
1 16 32  0  0  0
2  8 16 32  0  0
3  4  8 16 32  0
4  2  4  8 16 32

vars <- names(df)[-1]

df_temp <- df %>% 
  mutate(pos_count = rowSums(df %>% select(all_of(vars)) > 0))

df_temp <- df_temp %>% 
  mutate(across(all_of(vars), .fns = ~./pos_count))

lm_recipe <- 
  recipe(X1 ~ X2 + X3 + X4 + X5, data = df_temp) 

lm_model <- 
  linear_reg(penalty = 0) %>%  
  set_engine("glmnet", lower.limits = rep(0, 5), upper.limits = rep(1, 5), intercept = FALSE)

lm_wflow <- 
  workflow() %>% 
  add_model(lm_model) %>%
  add_recipe(lm_recipe)

lm_fit <- fit(lm_wflow,  df_temp)
lm_fit %>% tidy()

  term        estimate penalty
1 (Intercept)   0            0
2 X2            0.492        0
3 X3            0.240        0
4 X4            0.112        0
5 X5            0.0256       0

This seems to work more or less (the estimates should be 0, 1/2, 1/4, 1/8 and 1/16).
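As a quick sanity check on those target values, here is a minimal base-R sketch (no tidymodels needed) that applies the coefficients 1/2, 1/4, 1/8 and 1/16 to the scaled predictors and compares the result with X1. The names `scaled`, `beta` and `fitted_x1` are only illustrative and do not appear in the code above.

```r
# Rebuild the example data and scale X2..X5 by the per-row count of positive values.
df <- data.frame(matrix(c(16, 8, 4, 2, 32, 16, 8, 4, 0, 32, 16, 8, 0, 0, 32, 16, 0, 0, 0, 32), 4, 5))
vars <- names(df)[-1]

pos_count <- rowSums(df[vars] > 0)          # one count per row
scaled <- as.matrix(df[vars]) / pos_count   # row i is divided by pos_count[i]

# Expected coefficients for X2..X5 (the intercept is fixed at 0).
beta <- c(1/2, 1/4, 1/8, 1/16)
fitted_x1 <- drop(scaled %*% beta)

# Compare the reconstructed values with the outcome column.
isTRUE(all.equal(unname(fitted_x1), df$X1))
```

The division relies on R recycling `pos_count` down the columns of the matrix, which works here because its length equals the number of rows.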

But when I incorporate the data prep in the recipe, all the predictors are divided by the total number of predictors (in this case four):

lm_recipe <- 
  recipe(X1 ~ X2 + X3 + X4 + X5, data = df) %>% 
  step_mutate(pos_count = sum(all_of(vars) > 0)) %>%
  step_mutate(across(all_of(vars), .fns = ~./pos_count)) 

lm_wflow <- 
  workflow() %>% 
  add_model(lm_model) %>%
  add_recipe(lm_recipe)

lm_fit <- fit(lm_wflow,  df)

lm_fit %>% tidy()

  term        estimate penalty
1 (Intercept)    0           0
2 X2             1           0
3 X3             0.478       0
4 X4             0           0
5 X5             0           0
6 pos_count      0           0

augment(lm_fit, df)

     X1    X2    X3    X4    X5 .pred
1    16    32     0     0     0  8   
2     8    16    32     0     0  7.82
3     4     8    16    32     0  3.91
4     2     4     8    16    32  1.96

How do I need to change the recipe to fix this? Thanks!

ThePhil

1 Answer

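The problem is that you used sum() inside step_mutate() instead of the rowSums() you used earlier. sum() collapses its input to a single number, so every row is divided by the same value, whereas rowSums() returns one count per row. In the recipe, pick() selects the columns that rowSums() should work on.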
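To make the difference concrete, here is a small base-R sketch on the same data, outside of any recipe. The variable name `positive` is only illustrative.

```r
df <- data.frame(matrix(c(16, 8, 4, 2, 32, 16, 8, 4, 0, 32, 16, 8, 0, 0, 32, 16, 0, 0, 0, 32), 4, 5))
vars <- names(df)[-1]

# Logical matrix: TRUE where a predictor is larger than zero.
positive <- df[vars] > 0

total <- sum(positive)        # a single number for the whole table
per_row <- rowSums(positive)  # one count per row

print(total)
print(per_row)
```

Here `sum()` gives 10 and `rowSums()` gives 1, 2, 3, 4. The question reports a divisor of 4 rather than 10, which suggests that `all_of(vars) > 0` was not evaluated on the column values at all. I have not checked the exact mechanism, but either way the result is one value shared by all rows.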

df <- data.frame(matrix(c(16, 8, 4, 2, 32, 16, 8, 4, 0, 32, 16, 8, 0, 0, 32, 16, 0, 0, 0, 32), 4, 5))

vars <- names(df)[-1]

library(recipes)

lm_recipe <- 
  recipe(X1 ~ X2 + X3 + X4 + X5, data = df) %>% 
  step_mutate(pos_count = rowSums(pick(any_of(vars)) > 0)) %>%
  step_mutate(across(any_of(vars), .fns = ~./pos_count))

prep(lm_recipe) |>
  bake(new_data = NULL)
#> # A tibble: 4 × 6
#>      X2    X3    X4    X5    X1 pos_count
#>   <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
#> 1 32     0      0       0    16         1
#> 2  8    16      0       0     8         2
#> 3  2.67  5.33  10.7     0     4         3
#> 4  1     2      4       8     2         4

Created on 2023-02-17 with reprex v2.0.2

EmilHvitfeldt
  • Thanks, that worked! I did try rowSums, but with 'select' instead of 'pick' which did not work :) – ThePhil Feb 20 '23 at 09:44