I have a data frame in which I want to divide a specific set of predictors by the number of predictors in that same set that are larger than zero, counted per row. When I try to include this operation in a recipe, it seems to divide by the total number of predictors in the set instead, ignoring the condition that they be larger than zero.
Example:
library(tidymodels)  # attaches dplyr, recipes, parsnip, workflows and broom

df <- data.frame(matrix(c(16, 8, 4, 2, 32, 16, 8, 4, 0, 32, 16, 8, 0, 0, 32, 16, 0, 0, 0, 32), 4, 5))
df
  X1 X2 X3 X4 X5
1 16 32  0  0  0
2  8 16 32  0  0
3  4  8 16 32  0
4  2  4  8 16 32
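# Predictors to rescale: every column except the outcome X1
vars <- names(df)[-1]

# Per-row count of predictors that are larger than zero
df_temp <- df %>%
  mutate(pos_count = rowSums(df %>% select(all_of(vars)) > 0))

# Divide each predictor by its row's count
df_temp <- df_temp %>%
  mutate(across(all_of(vars), .fns = ~ .x / pos_count))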
lm_recipe <-
  recipe(X1 ~ X2 + X3 + X4 + X5, data = df_temp)

lm_model <-
  linear_reg(penalty = 0) %>%
  set_engine("glmnet", lower.limits = rep(0, 5), upper.limits = rep(1, 5), intercept = FALSE)

lm_wflow <-
  workflow() %>%
  add_model(lm_model) %>%
  add_recipe(lm_recipe)

lm_fit <- fit(lm_wflow, df_temp)
lm_fit %>% tidy()
  term        estimate penalty
1 (Intercept)   0            0
2 X2            0.492        0
3 X3            0.240        0
4 X4            0.112        0
5 X5            0.0256       0
This seems to work more or less (the estimates should be 0, 1/2, 1/4, 1/8 and 1/16).
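To make the target explicit, here is a minimal base-R sketch (no tidymodels needed) of the per-row count and of the coefficients an exact fit would give. The names `pos_count`, `scaled` and `exact` are only for illustration, and `solve()` stands in for the constrained glmnet fit, on the assumption that the 0 to 1 limits are not binding for this data.

```r
# Same example data as above.
df <- data.frame(matrix(c(16, 8, 4, 2, 32, 16, 8, 4, 0, 32, 16, 8, 0, 0, 32, 16, 0, 0, 0, 32), 4, 5))
vars <- names(df)[-1]

# Number of strictly positive predictors in each row.
pos_count <- rowSums(df[vars] > 0)

# Divide every predictor by its own row's count
# (the count vector is recycled down the rows of each column).
scaled <- df[vars] / pos_count

# Four rows and four predictors, so the no-intercept system can be solved exactly.
exact <- solve(as.matrix(scaled), df$X1)
exact
```

For this data `pos_count` is 1, 2, 3, 4 and the exact solution is 1/2, 1/4, 1/8, 1/16, which is what the glmnet estimates above are approximating.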
But when I incorporate the data prep in the recipe, all the predictors are divided by the total number of predictors (in this case four):
lm_recipe <-
  recipe(X1 ~ X2 + X3 + X4 + X5, data = df) %>%
  step_mutate(pos_count = sum(all_of(vars) > 0)) %>%
  step_mutate(across(all_of(vars), .fns = ~./pos_count))

lm_wflow <-
  workflow() %>%
  add_model(lm_model) %>%
  add_recipe(lm_recipe)

lm_fit <- fit(lm_wflow, df)
lm_fit %>% tidy()
  term        estimate penalty
1 (Intercept)   0            0
2 X2            1            0
3 X3            0.478        0
4 X4            0            0
5 X5            0            0
6 pos_count     0            0
augment(lm_fit, df)
  X1 X2 X3 X4 X5 .pred
1 16 32  0  0  0  8
2  8 16 32  0  0  7.82
3  4  8 16 32  0  3.91
4  2  4  8 16 32  1.96
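As a sanity check on the "divided by four" reading, the sketch below recomputes `.pred` by hand in base R, assuming `pos_count` is the constant 4 in every row. The names `est` and `pred_by_hand` are mine, and the coefficients are copied from the rounded `tidy()` output, so the match can only be expected up to rounding.

```r
# Same example data as above.
df <- data.frame(matrix(c(16, 8, 4, 2, 32, 16, 8, 4, 0, 32, 16, 8, 0, 0, 32, 16, 0, 0, 0, 32), 4, 5))
vars <- names(df)[-1]

# Assumption: the recipe's sum() produced a single scalar, 4, shared by all rows.
scaled <- df[vars] / 4

# Rounded estimates for X2..X5 copied from the tidy() output
# (pos_count has estimate 0, so it does not contribute).
est <- c(X2 = 1, X3 = 0.478, X4 = 0, X5 = 0)

# No intercept, so each prediction is the weighted sum of the scaled predictors.
pred_by_hand <- as.vector(as.matrix(scaled) %*% est)
round(pred_by_hand, 2)
```

These values agree with the `.pred` column, which is why I believe the count is not being computed per row inside the recipe.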
How should I change the recipe so that each predictor is divided by the per-row count of predictors larger than zero? Thanks!