I am using a lasso regression to classify some text as either related to AI or not. When I calculate variable importance using vip and tidymodels, the sign is opposite of expected -- words like "machine", "learning", and "algorithm" have a negative sign.
Apologies for the lack of reprex, but here is my code:
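library(tidymodels)
library(textrecipes) # step_tokenize(), step_tokenfilter(), step_tfidf()
library(themis)      # step_upsample()
library(vip)         # vi()
library(doParallel)  # registerDoParallel(); also attaches parallel
fy21_raw %>%
  sample_n(5)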
# A tibble: 5 x 3
# prog_title text artificial_intel
# <chr> <chr> <fct>
#1 Advanced Batt~ "ABMS l~ not
#2 Energy Effici~ "This e~ not
#3 Development o~ "This P~ artificial_intel
#4 Unmanned Logi~ "This U~ artificial_intel
#5 FY 2020 SBIR/~ "Fundin~ not
# Note: the artificial_intel column is a factor with 2 levels: "artificial_intel" and "not"
set.seed(123)
budget_split <- initial_split(fy21_raw, strata = artificial_intel)
budget_train <- training(budget_split)
budget_test <- testing(budget_split)
set.seed(234)
budget_folds <- vfold_cv(budget_train, strata = artificial_intel, v = 5)
budget_rec <- recipe(artificial_intel ~ ., data = budget_train) %>% # update dv with actual name
  update_role(prog_title, new_role = "id") %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 1000) %>%
  step_upsample(artificial_intel) %>% # update dv with actual name
  step_tfidf(text) %>%
  step_normalize(recipes::all_predictors())
budget_wf <- workflow() %>%
  add_recipe(budget_rec)
lasso_spec <- logistic_reg(penalty = 0.1, mixture = 1) %>%
  set_mode("classification") %>%
  set_engine("glmnet")
all_cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
set.seed(1234)
lasso_res <- budget_wf %>%
  add_model(lasso_spec) %>%
  fit_resamples(resamples = budget_folds,
                metrics = metric_set(roc_auc, accuracy, sens, spec),
                # fit_resamples() takes control_resamples(), not control_grid()
                control = control_resamples(save_pred = TRUE, pkgs = c('textrecipes')))
set.seed(123)
budget_imp <- budget_wf %>%
  add_model(lasso_spec) %>%
  fit(budget_train) %>%
  pull_workflow_fit() %>%
  vi()
# A tibble: 1,000 x 3
# Variable Importance Sign
# <chr> <dbl> <chr>
# 1 tfidf_text_machine -6.82 NEG
# 2 tfidf_text_artificial -5.84 NEG
# 3 tfidf_text_learning -3.69 NEG
Is it calculating the importance relative to the "not" outcome rather than "artificial_intel"?
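One way to check this hypothesis is with a toy example. The following is only a minimal sketch: it uses base R's glm() instead of glmnet to stay self-contained, and the data (x, y) are simulated, not taken from fy21_raw. glm() treats the first factor level as "failure" and models the probability of the second level; glmnet's documentation describes the same convention for family = "binomial" (the last factor level is the target class), so the same behaviour would be expected there, though I have not confirmed it on the real data.

```r
# Simulated data (hypothetical): x tends to be large for "artificial_intel" rows
set.seed(1)
y <- factor(rep(c("artificial_intel", "not"), each = 50))
x <- c(rnorm(50, mean = 1), rnorm(50, mean = -1))

# Levels are sorted alphabetically, so "not" is the second level
levels(y)

# glm() models P(second level), i.e. P("not"), so the coefficient on x
# describes how x changes the odds of "not"
fit_default <- glm(y ~ x, family = binomial)
coef_default <- unname(coef(fit_default)["x"])

# Making "not" the first (reference) level turns "artificial_intel"
# into the modelled class, which flips the sign of the coefficient
y_rev <- relevel(y, ref = "not")
fit_rev <- glm(y_rev ~ x, family = binomial)
coef_rev <- unname(coef(fit_rev)["x"])

c(default = coef_default, releveled = coef_rev)
```

If the same thing is happening in my workflow, running levels(budget_train$artificial_intel) should show "not" as the second level, and the negative signs would then mean "less likely to be not" rather than "less likely to be AI".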