
I'm using the rpart package for decision tree classification. My data frame has around 4000 features (columns), and I want to use all of them in rpart() for my model. How can I do that? As far as I can tell, rpart() expects the formula to be written out like this:

dt <- rpart(class ~ feature1 + feature2 + ....)

My features are words in documents, which is why there are more than 4000 of them; each feature corresponds to one word. Is there a way to use all features without typing them out?

Rich Scriven
user3430235

2 Answers


I figured it out:

dt <- rpart(class ~ ., data)

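The "." on the right-hand side of the formula stands for every column in data other than the response (here, class), so none of the features has to be typed out.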
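To make the dot formula concrete, here is a minimal sketch on R's built-in iris data. The dataset and the names fit_all, fit_subset, and used are assumptions for illustration only; iris stands in for the 4000-column word data and Species plays the role of class.

```r
library(rpart)

# iris is a stand-in for the real document/word data frame (assumption);
# Species is the response, and "." pulls in every other column.
fit_all <- rpart(Species ~ ., data = iris, method = "class")

# Individual columns can still be dropped from "." with a minus sign.
fit_subset <- rpart(Species ~ . - Sepal.Length, data = iris, method = "class")

# List the variables the fitted tree actually split on.
split_vars <- fit_all$frame$var
used <- unique(as.character(split_vars[split_vars != "<leaf>"]))
print(used)
```

Even when every column is offered to rpart(), the tree only splits on a subset of them, so checking which variables were used is a quick sanity check on a wide data frame.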

user3430235

The caret library is really useful because you can easily apply different models and compare their performance. It can call rpart but uses a slightly different syntax to include all features.

library(caret)

library(data.table)

mt <- data.table(mtcars)

tr <- train(x=mt[,-'hp', with=FALSE], y = mt[, hp], method='rpart')

plot(tr$finalModel)
text(tr$finalModel)

Using all 4000 features in a single decision tree could result in overfitting, especially if the number of observations is not large. caret provides built-in cross-validation. You might also want to look at method='rf' for random forests.
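As a rough sketch of the cross-validation point, the snippet below uses caret's formula interface with 5-fold cross-validation. The iris data, the seed, the fold count, and the names ctrl and cv_fit are assumptions chosen for illustration, not part of the original answer.

```r
library(caret)

set.seed(42)  # arbitrary seed so the fold assignment is reproducible

# 5-fold cross-validation; the number of folds is an illustrative choice.
ctrl <- trainControl(method = "cv", number = 5)

# train() also accepts a formula, so "." works here as it does in rpart().
# iris stands in for the real data (assumption).
cv_fit <- train(Species ~ ., data = iris,
                method = "rpart",
                trControl = ctrl,
                tuneLength = 5)  # try 5 candidate values of the cp parameter

# Complexity parameter selected by cross-validation.
print(cv_fit$bestTune)
```

Swapping method = "rpart" for method = "rf" would fit a random forest instead, provided the randomForest package is installed.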

C8H10N4O2