
I am using lasso regression to classify some text as either related to AI or not. When I calculate variable importance with vip and tidymodels, the signs are the opposite of what I expect -- words like "machine", "learning", and "algorithm" come out negative.

Apologies for the lack of reprex, but here is my code:

library(tidymodels)
library(textrecipes)  # step_tokenize(), step_tokenfilter(), step_tfidf()
library(themis)       # step_upsample()
library(vip)          # vi()
library(doParallel)   # registerDoParallel(); also attaches parallel

fy21_raw %>%
    sample_n(5)

# A tibble: 5 x 3
#  prog_title     text     artificial_intel
#  <chr>          <chr>    <fct>           
#1 Advanced Batt~ "ABMS l~ not             
#2 Energy Effici~ "This e~ not             
#3 Development o~ "This P~ artificial_intel
#4 Unmanned Logi~ "This U~ artificial_intel
#5 FY 2020 SBIR/~ "Fundin~ not 

# Note: the artificial_intel column is a factor with 2 levels: "artificial_intel" and "not"
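
For reference, the level order can be checked directly; the output shown here is what that note implies, not a fresh run:

levels(fy21_raw$artificial_intel)
# [1] "artificial_intel" "not"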

set.seed(123)
budget_split <- initial_split(fy21_raw, strata = artificial_intel) 
budget_train <- training(budget_split)
budget_test  <- testing(budget_split)

set.seed(234)
budget_folds <- vfold_cv(budget_train, strata = artificial_intel, v = 5) 

budget_rec <- recipe(artificial_intel ~ ., data = budget_train) %>%
    update_role(prog_title, new_role = "id") %>% # keep the title as an ID, not a predictor
    step_tokenize(text) %>%
    step_tokenfilter(text, max_tokens = 1000) %>%
    step_upsample(artificial_intel) %>% # themis: upsample to balance the classes
    step_tfidf(text) %>%
    step_normalize(recipes::all_predictors())

budget_wf <- workflow() %>%
    add_recipe(budget_rec)

lasso_spec <- logistic_reg(penalty = 0.1, mixture = 1) %>%
    set_mode("classification") %>%
    set_engine("glmnet")

all_cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)

set.seed(1234)
lasso_res <- budget_wf %>%
    add_model(lasso_spec) %>%
    fit_resamples(resamples = budget_folds,
                  metrics = metric_set(roc_auc, accuracy, sens, spec),
                  control = control_grid(save_pred = TRUE, pkgs = c('textrecipes')))

set.seed(123)
budget_imp <- budget_wf %>%
    add_model(lasso_spec) %>%
    fit(budget_train) %>%
    pull_workflow_fit() %>%
    vi()

# A tibble: 1,000 x 3
#   Variable              Importance Sign 
#   <chr>                      <dbl> <chr>
# 1 tfidf_text_machine        -6.82  NEG  
# 2 tfidf_text_artificial     -5.84  NEG  
# 3 tfidf_text_learning       -3.69  NEG

Is it calculating the importance relative to the "not" outcome rather than "artificial_intel"?

CGP
  • Without data to check it is hard to say for sure, but I expect that the levels of `artificial_intel` are the opposite of what you expect, in terms of which is the positive vs. negative event. You can [control this in tidymodels with the `event_level` argument](https://yardstick.tidymodels.org/dev/reference/roc_auc.html). – Julia Silge Sep 20 '20 at 20:07
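
To make that concrete, here is a small sketch (assuming the lasso_res object from the question, fit with save_pred = TRUE, and parsnip's usual .pred_artificial_intel probability column) that states the event level explicitly when computing metrics from the resampling predictions:

preds <- collect_predictions(lasso_res)

# event_level controls which factor level yardstick treats as the positive
# class: "first" is "artificial_intel" here, "second" would be "not"
preds %>%
    roc_auc(truth = artificial_intel, .pred_artificial_intel,
            event_level = "first")

preds %>%
    sens(truth = artificial_intel, estimate = .pred_class,
         event_level = "first")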

1 Answer


From the glmnet vignette:

Note that for "binomial" models, results are returned only for the class corresponding to the second level of the factor response.

So if you want the coefficient signs to refer to your positive class, that class ("artificial_intel" here) has to be the second level of the factor when you fit with glmnet. If you then evaluate with yardstick, keep in mind that yardstick treats the first factor level as the event by default, so after reordering the levels you also need to set options(yardstick.event_first = FALSE) (or use the event_level argument mentioned in the comment above) so the metrics refer to the same class.
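
For example, a minimal sketch of that fix, assuming you reorder the levels with forcats::fct_relevel() before the split:

library(forcats)  # for fct_relevel(); not attached by tidymodels

fy21_raw <- fy21_raw %>%
    mutate(artificial_intel = fct_relevel(artificial_intel, "not"))
# levels are now "not", "artificial_intel", so glmnet (and therefore vi())
# reports coefficients for "artificial_intel", the second level

# make yardstick treat the second level as the event as well
# (newer yardstick versions expose this per metric via event_level)
options(yardstick.event_first = FALSE)

After refitting the workflow, terms like "machine" and "learning" should then show up with positive signs.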

Fran_civ