
Consider this simple example:

dtrain <- data_frame(text = c("Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "Chinese Macao",
                              "Tokyo Japan Chinese"),
                     doc_id = 1:4,
                     class = c(1, 1, 1, 0))

dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)

> dtrain_spark
# Source:   table<dtrain> [?? x 3]
# Database: spark_connection
  text                     doc_id class
  <chr>                     <int> <dbl>
1 Chinese Beijing Chinese       1     1
2 Chinese Chinese Shanghai      2     1
3 Chinese Macao                 3     1
4 Tokyo Japan Chinese           4     0

I can easily train a `decision_tree_classifier` with the following pipeline:

pipeline <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  ft_count_vectorizer(sc, input_col = "tokens", output_col = "myvocab"),
  ml_decision_tree_classifier(sc, label_col = "class",
                              features_col = "myvocab",
                              prediction_col = "pcol",
                              probability_col = "prcol",
                              raw_prediction_col = "rpcol")
)

model <- ml_fit(pipeline, dtrain_spark)
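
As a quick sanity check, the fitted pipeline can be applied back to the training data with `ml_transform` (a sketch; it assumes the `model` and `dtrain_spark` objects above and a live Spark connection):

```r
library(dplyr)

# Score the training set with the fitted pipeline and pull the results locally
ml_transform(model, dtrain_spark) %>%
  select(doc_id, class, pcol) %>%
  collect()
```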

Now the issue is that I cannot extract the feature_importances in a meaningful way.

Running

> ml_stage(model, 'decision_tree_classifier')$feature_importances
[1] 0 0 1 0 0 0

But what I want is the tokens! In my real-life example I have thousands of them, and as shown above it is hard to understand anything from a bare importance vector.

Is there any way to back out the tokens from the matrix representation above?

Thanks!

  • I'm trying to reproduce this. Everything you did up until the `ml_stage` step is working. When I run `ml_stage` as you did, I get a function value instead of the set of 0's and 1's like in your output (reads like "function (...) { tryCatch...etc."). The accepted solution seems to give me trouble too. Are you familiar with how to correct for this? – piper180 Mar 16 '22 at 21:54

1 Answer


You can easily combine the `CountVectorizerModel` vocabulary with the `feature_importances`:

library(tibble)

tibble(
  token = unlist(ml_stage(model, 'count_vectorizer')$vocabulary),
  importance = ml_stage(model, 'decision_tree_classifier')$feature_importances
)
# A tibble: 6 x 2
  token    importance
  <chr>         <dbl>
1 chinese           0
2 japan             1
3 shanghai          0
4 beijing           0
5 tokyo             0
6 macao             0
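
With thousands of tokens (as in the real-life case), it helps to drop zero-importance features and sort; a minimal sketch using dplyr, assuming the same `model` as above:

```r
library(tibble)
library(dplyr)

importances <- tibble(
  token = unlist(ml_stage(model, 'count_vectorizer')$vocabulary),
  importance = ml_stage(model, 'decision_tree_classifier')$feature_importances
)

# Keep only tokens that actually contribute, most important first
importances %>%
  filter(importance > 0) %>%
  arrange(desc(importance))
```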
zero323
  • 322,348
  • 103
  • 959
  • 935
  • ha!!! thats great!!! thank you so much. on a side note, do you know what is the algo used by spark to compute the feature importance? – ℕʘʘḆḽḘ Jun 08 '18 at 14:40
  • 1
    https://github.com/apache/spark/blob/252468a744b95082400ba9e8b2e3b3d9d50ab7fa/mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala#L185-L193 and https://github.com/apache/spark/blob/252468a744b95082400ba9e8b2e3b3d9d50ab7fa/mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala#L126-L143 – zero323 Jun 08 '18 at 15:16
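
For what it's worth, the linked `treeModels.scala` code accumulates each split's impurity gain, weighted by the number of instances reaching that node, into the split feature's total, and then normalizes the vector to sum to 1. An illustrative sketch with made-up gains, not the actual Spark internals:

```r
# Hypothetical weighted impurity gains accumulated per feature
gains <- c(japan = 2.0, chinese = 0.5, tokyo = 0.0)

# Normalize so the importances sum to 1, as Spark does for tree models
gains / sum(gains)
```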