Consider the following data frame:

outcome <- c(1,0,0,1,1)
string <- c('I love pasta','hello world', '1+1 = 2','pasta madness', 'pizza madness')

df <- data.frame(outcome, string)


> df
  outcome        string
1       1  I love pasta
2       0   hello world
3       0       1+1 = 2
4       1 pasta madness
5       1 pizza madness

I would like to use a random forest to find out which words in the sentences held in the string variable are strong predictors of the outcome variable.

Is there a (simple) way to do that in R?

ℕʘʘḆḽḘ
  • You will probably need quite a lot of text preprocessing. For example, for all pasta- and pizza-related texts you can create a binary column "is_food" and voilà, 100% accuracy (of course, this is an extremely simplified example). – m-dz Oct 21 '16 at 14:23
  • Thanks! But I actually don't need preprocessing. I am just interested in the forecasting power of single words, such as pizza, pasta, etc. – ℕʘʘḆḽḘ Oct 21 '16 at 14:29
  • Split by space, transform into columns, then into indicators, and run rf? – Vlo Oct 21 '16 at 14:37
  • Interesting. Do you know how to code that up for the example above? – ℕʘʘḆḽḘ Oct 21 '16 at 14:38
  • @Noobie, I am still convinced you need some preprocessing, such as removing stop words and converting the other words to their roots (https://en.wikipedia.org/wiki/Root_(linguistics)). – m-dz Oct 23 '16 at 22:36

1 Answer

What you want are the variable importance measures produced by randomForest. You obtain them with the importance function. Here is some code that should get you started:

outcome <- c(1,0,0,1,1)
string <- c('I love pasta','hello world', '1+1 = 2','pasta madness', 'pizza madness')

Step 1: Make outcome a factor so that randomForest performs classification rather than regression, and keep string as a character vector.

df <- data.frame(outcome=factor(outcome,levels=c(0,1)),string, stringsAsFactors=FALSE)

Step 2: Tokenize the string column into words. Here, I'm using dplyr and tidyr just for convenience. The key is to have just word tokens that you want as your predictor variable.

library(dplyr)
library(tidyr)
inp <- df %>% mutate(string=strsplit(string,split=" ")) %>% unnest(string)
##   outcome  string
##1        1       I
##2        1    love
##3        1   pasta
##4        0   hello
##5        0   world
##6        0     1+1
##7        0       =
##8        0       2
##9        1   pasta
##10       1 madness
##11       1   pizza
##12       1 madness

Step 3: Construct a model matrix and feed it to randomForest:

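library(randomForest)
# Dummy-code the tokens; one factor level is absorbed as the reference level
mm <- model.matrix(outcome~string,inp)
# importance=TRUE is needed to get the permutation-based measures (MeanDecreaseAccuracy)
rf <- randomForest(mm, inp$outcome, importance=TRUE)
imp <- importance(rf)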
##                     0        1 MeanDecreaseAccuracy MeanDecreaseGini
##(Intercept)   0.000000 0.000000             0.000000        0.0000000
##string1+1     0.000000 0.000000             0.000000        0.3802400
##string2       0.000000 0.000000             0.000000        0.4514319
##stringhello   0.000000 0.000000             0.000000        0.4152465
##stringI       0.000000 0.000000             0.000000        0.2947108
##stringlove    0.000000 0.000000             0.000000        0.2944955
##stringmadness 4.811252 5.449195             5.610477        0.5733814
##stringpasta   4.759957 5.281133             5.368852        0.6651675
##stringpizza   0.000000 0.000000             0.000000        0.3025495
##stringworld   0.000000 0.000000             0.000000        0.4183821

As you can see, pasta and madness have the highest importance on every measure, so they are the key words for predicting the outcome.
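
To read the ranking off directly rather than scanning the table, one option is to sort the importance matrix by one of its columns. This is a minimal sketch, assuming the randomForest package is installed. It rebuilds the toy data with base R so that it runs on its own; the seed value and the name `ranked` are my own choices, and because the forest is random the exact numbers will not match the output above.

```r
library(randomForest)

# Rebuild the toy data, one row per word token (base R instead of dplyr/tidyr).
outcome <- c(1, 0, 0, 1, 1)
string <- c('I love pasta', 'hello world', '1+1 = 2', 'pasta madness', 'pizza madness')
tokens <- strsplit(string, split = " ")
inp <- data.frame(
  outcome = factor(rep(outcome, lengths(tokens)), levels = c(0, 1)),
  string = unlist(tokens),
  stringsAsFactors = FALSE
)

set.seed(42)  # arbitrary seed, only so that repeated runs give the same numbers
mm <- model.matrix(outcome ~ string, inp)
rf <- randomForest(mm, inp$outcome, importance = TRUE)
imp <- importance(rf)

# Sort the words by permutation importance, largest first.
ranked <- imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]
print(head(ranked, 5))

# varImpPlot(rf) draws the same information as a dot chart.
```

Sorting on "MeanDecreaseGini" instead works the same way; the two measures do not always agree, so it can be worth looking at both.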

Please note: randomForest has many parameters that will matter once you tackle the real problem at scale. This is by no means a complete solution to your problem. It is only meant to illustrate how the importance function can answer your question. For the details of using randomForest, you may want to ask on Cross Validated.
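
A different layout, along the lines of the "split by space, transform into columns, then indicators" suggestion in the comments on the question, keeps one row per sentence and one 0/1 column per word. This is only a sketch of that idea, not part of the code above: the names `vocab`, `dtm`, `rf_doc` and `imp_doc` and the seed are my own, and with just five sentences the importance values will be very noisy.

```r
library(randomForest)

outcome <- c(1, 0, 0, 1, 1)
string <- c('I love pasta', 'hello world', '1+1 = 2', 'pasta madness', 'pizza madness')

# Tokenize each sentence on single spaces, as in Step 2.
tokens <- strsplit(string, split = " ")
vocab <- sort(unique(unlist(tokens)))

# One row per sentence, one column per word: 1 if the word occurs in the sentence.
dtm <- t(vapply(tokens, function(tk) as.integer(vocab %in% tk), integer(length(vocab))))
colnames(dtm) <- vocab

set.seed(42)  # arbitrary seed for repeatability
rf_doc <- randomForest(x = dtm, y = factor(outcome, levels = c(0, 1)), importance = TRUE)
imp_doc <- importance(rf_doc)
print(imp_doc)
```

In this layout each observation is a whole sentence, so the forest sees which words occur together, whereas the token-per-row layout treats every word as a separate observation that inherits its sentence's outcome.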

aichao
  • In your experience, does this algorithm work well with very large and sparse matrices (a lot of words, but many of them with only 1 or 2 occurrences)? – ℕʘʘḆḽḘ Oct 21 '16 at 14:51
  • @Noobie: what will happen there is that your input to random forest will have a lot of predictors, since the model matrix, which is a contrast matrix of your words in `string`, will have a lot of columns. Random forest scales well with the number of predictors but, as with anything, parallelization is required for truly large problems. A better alternative may be `xgboost`, which also has an `importance` capability. – aichao Oct 21 '16 at 15:01
  • @Noobie: one other point. For really large and sparse model matrices, you may want/need to use `sparse.model.matrix` from the `Matrix` package. Unfortunately, I don't believe `randomForest` supports a sparse matrix input, so that is a limitation. However, `xgboost` does and you can use its `importance` function to do the same analysis. So, I would suggest you try `xgboost`. – aichao Oct 21 '16 at 15:29