What you want are the variable importance measures produced by randomForest. These are obtained from the importance function. Here is some code that should get you started:
outcome <- c(1,0,0,1,1)
string <- c('I love pasta','hello world', '1+1 = 2','pasta madness', 'pizza madness')
Step 1: We want outcome to be a factor, so that randomForest will do classification, and string to be a character vector.
df <- data.frame(outcome=factor(outcome,levels=c(0,1)),string, stringsAsFactors=FALSE)
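As a quick sanity check on Step 1, the sketch below shows the difference between the raw numeric vector and the factor version. The names y_num and y_fac are made up for illustration. According to the randomForest documentation, a factor response is treated as classification and anything else as regression, so this conversion is what selects the model type.

```r
outcome <- c(1, 0, 0, 1, 1)

# Hypothetical names, used only for this illustration
y_num <- outcome                            # numeric: would be treated as regression
y_fac <- factor(outcome, levels = c(0, 1))  # factor: treated as classification

is.factor(y_num)  # FALSE
is.factor(y_fac)  # TRUE
levels(y_fac)     # "0" "1"
```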
Step 2: Tokenize the string column into words. Here, I'm using dplyr and tidyr just for convenience. The key is to end up with just the word tokens that you want as your predictor variable.
library(dplyr)
library(tidyr)
inp <- df %>% mutate(string=strsplit(string,split=" ")) %>% unnest(string)
##    outcome  string
## 1        1       I
## 2        1    love
## 3        1   pasta
## 4        0   hello
## 5        0   world
## 6        0     1+1
## 7        0       =
## 8        0       2
## 9        1   pasta
## 10       1 madness
## 11       1   pizza
## 12       1 madness
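Since dplyr and tidyr are only a convenience here, the same token table can be built in base R. This is a minimal sketch; the names tokens and inp_base are made up for illustration, and it assumes that splitting on a single space is a good enough tokenizer for your data.

```r
outcome <- c(1, 0, 0, 1, 1)
string <- c("I love pasta", "hello world", "1+1 = 2", "pasta madness", "pizza madness")

# Split each string on single spaces; the result is a list of character vectors
tokens <- strsplit(string, split = " ", fixed = TRUE)

# Repeat each outcome once per token so that every row is (outcome, word)
inp_base <- data.frame(
  outcome = factor(rep(outcome, lengths(tokens)), levels = c(0, 1)),
  string = unlist(tokens),
  stringsAsFactors = FALSE
)

inp_base
```

The resulting data frame has one row per word, with the outcome of the string it came from, so it can be passed to model.matrix in the same way.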
Step 3: Construct a model matrix and feed it to randomForest:
library(randomForest)
mm <- model.matrix(outcome~string,inp)
rf <- randomForest(mm, inp$outcome, importance=TRUE)
imp <- importance(rf)
##                      0        1 MeanDecreaseAccuracy MeanDecreaseGini
## (Intercept)   0.000000 0.000000             0.000000        0.0000000
## string1+1     0.000000 0.000000             0.000000        0.3802400
## string2       0.000000 0.000000             0.000000        0.4514319
## stringhello   0.000000 0.000000             0.000000        0.4152465
## stringI       0.000000 0.000000             0.000000        0.2947108
## stringlove    0.000000 0.000000             0.000000        0.2944955
## stringmadness 4.811252 5.449195             5.610477        0.5733814
## stringpasta   4.759957 5.281133             5.368852        0.6651675
## stringpizza   0.000000 0.000000             0.000000        0.3025495
## stringworld   0.000000 0.000000             0.000000        0.4183821
As you can see, pasta and madness are the key words for predicting the outcome: they are the only tokens with a non-zero MeanDecreaseAccuracy.
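With a realistic vocabulary the importance matrix has far too many rows to read by eye, so you will probably want to rank it. The sketch below rebuilds the toy model in base R and sorts the words by MeanDecreaseAccuracy. The name ranked and the seed value are arbitrary choices for illustration, and the exact numbers will vary from run to run because the forest is random.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only to make repeated runs comparable

outcome <- c(1, 0, 0, 1, 1)
string <- c("I love pasta", "hello world", "1+1 = 2", "pasta madness", "pizza madness")

# One row per word token, carrying the outcome of its source string
tokens <- strsplit(string, split = " ", fixed = TRUE)
inp <- data.frame(
  outcome = factor(rep(outcome, lengths(tokens)), levels = c(0, 1)),
  string = unlist(tokens),
  stringsAsFactors = FALSE
)

mm <- model.matrix(outcome ~ string, inp)
rf <- randomForest(mm, inp$outcome, importance = TRUE)
imp <- importance(rf)

# Sort rows from most to least important by permutation importance
ranked <- imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), , drop = FALSE]
head(ranked, 3)
```

If you prefer a plot to a table, varImpPlot(rf) from the same package draws both importance measures as dot charts.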
Please note: There are many parameters to randomForest that will be relevant for tackling the real problem at scale. This is by no means a complete solution to your problem. It is only meant to illustrate the use of the importance function in answering your question. You may want to ask appropriate questions on Cross Validated concerning the details of using randomForest.
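To give an idea of what those parameters look like, here is a sketch that sets a few of the common ones explicitly: ntree (number of trees), mtry (number of variables tried at each split) and nodesize (minimum size of terminal nodes). The values are arbitrary placeholders for illustration, not recommendations, and rf_tuned is a made-up name.

```r
library(randomForest)

set.seed(42)  # arbitrary seed

outcome <- c(1, 0, 0, 1, 1)
string <- c("I love pasta", "hello world", "1+1 = 2", "pasta madness", "pizza madness")

# One row per word token, carrying the outcome of its source string
tokens <- strsplit(string, split = " ", fixed = TRUE)
inp <- data.frame(
  outcome = factor(rep(outcome, lengths(tokens)), levels = c(0, 1)),
  string = unlist(tokens),
  stringsAsFactors = FALSE
)
mm <- model.matrix(outcome ~ string, inp)

# Placeholder values: tune these on your own data
rf_tuned <- randomForest(
  mm, inp$outcome,
  ntree = 1000,      # more trees give more stable importance estimates
  mtry = 3,          # variables sampled as split candidates at each node
  nodesize = 1,      # minimum terminal node size
  importance = TRUE
)

importance(rf_tuned, type = 1)  # type = 1 selects the permutation (accuracy) measure
```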