
Consider this simple example:

dtrain <- data_frame(text = c("Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "Chinese Macao",
                              "Tokyo Japan Chinese"),
                     doc_id = 1:4,
                     class = c(1, 1, 1, 0))

dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)

> dtrain_spark
# Source:   table<dtrain> [?? x 3]
# Database: spark_connection
  text                     doc_id class
  <chr>                     <int> <dbl>
1 Chinese Beijing Chinese       1     1
2 Chinese Chinese Shanghai      2     1
3 Chinese Macao                 3     1
4 Tokyo Japan Chinese           4     0

I can easily train a `decision_tree_classifier` with the following pipeline:

pipeline <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  ft_count_vectorizer(sc, input_col = "tokens", output_col = "myvocab"),
  ml_decision_tree_classifier(sc, label_col = "class",
                              features_col = "myvocab",
                              prediction_col = "pcol",
                              probability_col = "prcol",
                              raw_prediction_col = "rpcol")
)

model <- ml_fit(pipeline, dtrain_spark)
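
As a quick sanity check, the fitted pipeline can be applied back to the training data with `ml_transform` (a sketch; it assumes the `model` and `dtrain_spark` objects above and a live Spark connection):

```r
library(dplyr)

# Score the training set with the fitted pipeline and pull the results locally
ml_transform(model, dtrain_spark) %>%
  select(doc_id, class, pcol) %>%
  collect()
```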

Now the issue is that I cannot extract the feature_importances in a meaningful way.

Running

> ml_stage(model, 'decision_tree_classifier')$feature_importances
[1] 0 0 1 0 0 0

But what I want is the tokens! In my real-life example I have thousands of them, and as shown above it is hard to understand anything from a bare importance vector.

Is there any way to back out the tokens from the matrix representation above?

Thanks!

  • I'm trying to reproduce this. Everything you did up until the `ml_stage` step is working. When I run `ml_stage` as you did, I get a function value instead of the set of 0's and 1's like in your output (reads like "function (...) { tryCatch...etc."). The accepted solution seems to give me trouble too. Are you familiar with how to correct for this? – piper180 Mar 16 '22 at 21:54

1 Answer


You can easily combine the `CountVectorizerModel` vocabulary with the `feature_importances`:

library(tibble)

tibble(
  token = unlist(ml_stage(model, 'count_vectorizer')$vocabulary),
  importance = ml_stage(model, 'decision_tree_classifier')$feature_importances
)
# A tibble: 6 x 2
  token    importance
  <chr>         <dbl>
1 chinese           0
2 japan             1
3 shanghai          0
4 beijing           0
5 tokyo             0
6 macao             0
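
With thousands of tokens (as in the real-life case), it helps to drop zero-importance features and sort; a minimal sketch using dplyr, assuming the same `model` as above:

```r
library(tibble)
library(dplyr)

importances <- tibble(
  token = unlist(ml_stage(model, 'count_vectorizer')$vocabulary),
  importance = ml_stage(model, 'decision_tree_classifier')$feature_importances
)

# Keep only tokens that actually contribute, most important first
importances %>%
  filter(importance > 0) %>%
  arrange(desc(importance))
```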
zero323
  • 322,348
  • 103
  • 959
  • 935
  • ha!!! thats great!!! thank you so much. on a side note, do you know what is the algo used by spark to compute the feature importance? – ℕʘʘḆḽḘ Jun 08 '18 at 14:40
  • 1
    https://github.com/apache/spark/blob/252468a744b95082400ba9e8b2e3b3d9d50ab7fa/mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala#L185-L193 and https://github.com/apache/spark/blob/252468a744b95082400ba9e8b2e3b3d9d50ab7fa/mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala#L126-L143 – zero323 Jun 08 '18 at 15:16
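
For what it's worth, the linked `treeModels.scala` code accumulates each split's impurity gain, weighted by the number of instances reaching that node, into the split feature's total, and then normalizes the vector to sum to 1. An illustrative sketch with made-up gains, not the actual Spark internals:

```r
# Hypothetical weighted impurity gains accumulated per feature
gains <- c(japan = 2.0, chinese = 0.5, tokyo = 0.0)

# Normalize so the importances sum to 1, as Spark does for tree models
gains / sum(gains)
```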