
I have a large dataset of almost 10,000 rows and 10 columns. I want to run a classification on it using the rpart package, but each column has many (more than 50) classes, so R just hangs.

What are my options to limit the scope of data or reduce the number of classes in each column?

Sim101011

1 Answer


This is called stratified sampling: you reduce the dataset while keeping the proportion of each class the same. Use createDataPartition from the caret package.

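library(caret)
# Class counts in the full dataset
table(iris$Species)
# Row indices of a stratified 80% sample
trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE, times = 1)
# Class counts in the sample keep the original proportions
table(iris[trainIndex, ]$Species)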

Replace iris with your dataset name and Species with your class column. Lower p to keep a smaller share of the rows.
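If you would rather not depend on caret, the same idea can be sketched in base R. This is only an illustration, not what createDataPartition does internally: the data frame `df`, its column `class`, and the helper `stratified_index` are all made-up names, and the 20% fraction is arbitrary.

```r
# Hypothetical data: 'df' and 'class' are placeholders, not from the question.
set.seed(42)
df <- data.frame(
  class = factor(rep(c("a", "b", "c"), times = c(500, 300, 200))),
  x = rnorm(1000)
)

# Hypothetical helper: draw the same fraction p of row indices from every class.
stratified_index <- function(y, p) {
  # Group row numbers by class, ignoring unused factor levels
  idx_by_class <- split(seq_along(y), y, drop = TRUE)
  picked <- lapply(idx_by_class, function(idx) {
    n_keep <- max(1, round(p * length(idx)))  # keep at least one row per class
    idx[sample.int(length(idx), size = n_keep)]
  })
  sort(unlist(picked, use.names = FALSE))
}

keep <- stratified_index(df$class, p = 0.2)
small <- df[keep, ]

table(df$class)     # class counts before sampling
table(small$class)  # class counts after sampling, same proportions
```

Sampling within each class separately is what keeps the proportions intact; a plain sample() over all rows would only preserve them approximately, and rare classes could disappear entirely.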