
I would like to extract feature_importances from my model in sparklyr. So far I have the following reproducible code, which is working:

library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")

dtrain <- data_frame(text = c("Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "Chinese Macao",
                              "Tokyo Japan Chinese"),
                     doc_id = 1:4,
                     class = c(1, 1, 1, 0))

dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)

pipeline <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  ft_count_vectorizer(sc, input_col = 'tokens', output_col = 'myvocab'),
  ml_decision_tree_classifier(sc, label_col = "class", 
                 features_col = "myvocab", 
                 prediction_col = "pcol",
                 probability_col = "prcol", 
                 raw_prediction_col = "rpcol")
)

model <- ml_fit(pipeline, dtrain_spark)

When I run the ml_stage step below, I find that I cannot extract a vector of feature_importances; instead, the feature_importances element is a function. A prior post (how to extract the feature importances in Sparklyr?) displays it as a vector, which is what I would like to obtain. What could my error be here? Is there another step I need to take to unwrap the function and get a vector of values?

ml_stage(model, 3)$feature_importances

Here is what the output of the ml_stage call looks like (instead of a vector of values):

function (...) 
{
    tryCatch(.f(...), error = function(e) {
        if (!quiet) 
            message("Error: ", e$message)
        otherwise
    }, interrupt = function(e) {
        stop("Terminated by user", call. = FALSE)
    })
}
<bytecode: 0x559a0d438278>
<environment: 0x559a0ce8e840>
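
One guess, since the printed object looks like a purrr::possibly() wrapper, is that the accessor simply needs to be invoked rather than referenced; a minimal sketch of that idea (I have not verified that this actually returns the importances on these Spark/sparklyr versions):

# Untested guess: if feature_importances is stored as a zero-argument function
# wrapper, calling it (note the trailing parentheses) might return the vector.
ml_stage(model, 3)$feature_importances()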
piper180
  • I'm not savvy on `sparklyr`, but have you tried `ml_feature_importances(model)`? As for the output from `sparklyr`, you'll get a data frame from the model, and a vector (as you said you wanted) if you use an object created from `ml_prediction_model` in the function `ml_feature_importances()`. – Kat Mar 17 '22 at 02:32
  • Thanks @Kat, I actually have tried using `ml_feature_importances(model)`, but I'm unable to in this case since the model is an `ml_pipeline_model`; I get the following error: "no applicable method for 'ml_feature_importances' applied to an object of class "c('ml_pipeline_model', 'ml_transformer', 'ml_pipeline_stage')". I'm open to any solution that would let me use `ml_feature_importances()` here, if you know of a way to implement it in this case (a sketch of this idea follows below). – piper180 Mar 17 '22 at 13:26
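
To make the comment thread concrete, here is a sketch of what I understand Kat to be suggesting (untested on this cluster): extract the fitted decision tree stage, which is an ml_prediction_model, and pass that stage rather than the whole pipeline model to ml_feature_importances():

# Sketch based on the comments above (not verified): the fitted stage is an
# ml_prediction_model, so ml_feature_importances() should accept it and
# return the importances instead of erroring on the ml_pipeline_model.
dt_stage <- ml_stage(model, 'decision_tree_classifier')
ml_feature_importances(dt_stage)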

1 Answer


I am not sure if this is exactly what you want, but you could combine the count vectorizer model's vocabulary with the feature_importances of your classifier, which results in a table with the importance of each token in your text. You could use the following code:

library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")

dtrain <- data_frame(text = c("Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "Chinese Macao",
                              "Tokyo Japan Chinese"),
                     doc_id = 1:4,
                     class = c(1, 1, 1, 0))

dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)

pipeline <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  ft_count_vectorizer(sc, input_col = 'tokens', output_col = 'myvocab'),
  ml_decision_tree_classifier(sc, label_col = "class", 
                              features_col = "myvocab", 
                              prediction_col = "pcol",
                              probability_col = "prcol", 
                              raw_prediction_col = "rpcol")
)

model <- ml_fit(pipeline, dtrain_spark)

tibble(
  token = unlist(ml_stage(model, 'count_vectorizer')$vocabulary),
  importance = ml_stage(model, 'decision_tree_classifier')$feature_importances
)
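
This pairing works because Spark's CountVectorizerModel assigns feature index i to the i-th entry of its vocabulary, so the classifier's importance vector lines up with the vocabulary one-to-one. If you want the most informative tokens first, you could additionally sort the result; a small optional extension of the code above (the name importances is just illustrative):

# Optional: store the table and order tokens by importance (descending).
importances <- tibble(
  token = unlist(ml_stage(model, 'count_vectorizer')$vocabulary),
  importance = ml_stage(model, 'decision_tree_classifier')$feature_importances
)

importances %>% arrange(desc(importance))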
Quinten