I'm trying to predict a vehicle's type (model) from its vehicle identification number (VIN). The first 10 positions of the VIN say something about the type, so I use them as predictor variables. An example of the data is shown below:
positie_1_tm_3  positie_4  positie_5  positie_6  positie_7  positie_8  positie_9  positie_10  MODEL
MBL             B          7          5          L          7          A          6           SKODA YETI
JNF             A          A          E          1          1          U          2           NISSAN NOTE
VWZ             Z          Z          5          Z          Z          9          4           VOLKSWAGEN FOX
F1D             Z          0          V          0          6          4          2           RENAULT MEGANE
NAK             U          8          1          1          C          A          5           KIA SORENTO
F1B             R          1          J          0          H          4          1           RENAULT CLIO
I used this R code for it:
# Make stratified train and test sets (60% / 20% / 20%, stratified on MODEL):
library(caret)
train.index <- createDataPartition(VIN1$MODEL, p = .6, list = FALSE)
train <- VIN1[train.index, ]
# Split the remaining 40% into two equal hold-out sets
overige_data <- VIN1[-train.index, ]
test.index <- createDataPartition(overige_data$MODEL, p = .5, list = FALSE)
test <- overige_data[test.index, ]
testset2 <- overige_data[-test.index, ]
# Fit the decision tree:
library(rpart)
library(rpart.plot)
library(rattle)
library(RColorBrewer)
tree <- rpart(MODEL ~ ., data = train, method = "class")
But the last step, fitting the tree, has already been running for more than two weeks. The dataset has around 3 million rows, so the training set has around 1.8 million rows. Is it taking so long because that is too much data for rpart, or is there another problem?
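One way to narrow this down is to look at two things separately: how many distinct values each column has, and how the fitting time grows with the number of rows. The sketch below does both on a small synthetic data frame. `VIN_demo`, its values, and the subsample sizes are made up for illustration and are not the real `VIN1` data. As far as I understand rpart, a categorical predictor combined with a response that has more than two classes makes it evaluate subsets of that predictor's levels, so a column with many levels can cost far more than the row count alone suggests. Treat that as an assumption to check against the rpart documentation.

```r
library(rpart)

# Hypothetical stand-in for VIN1: same kind of columns, tiny and synthetic.
set.seed(1)
n <- 2000
VIN_demo <- data.frame(
  positie_1_tm_3 = factor(sample(c("MBL", "JNF", "VWZ", "F1D", "NAK", "F1B"), n, replace = TRUE)),
  positie_4      = factor(sample(c("A", "B", "R", "U", "Z"), n, replace = TRUE)),
  positie_5      = factor(sample(c("0", "1", "7", "8", "A", "Z"), n, replace = TRUE)),
  MODEL          = factor(sample(c("SKODA YETI", "NISSAN NOTE", "VOLKSWAGEN FOX", "RENAULT CLIO"),
                                 n, replace = TRUE))
)

# 1. Count the distinct values in every column (predictors and response).
n_levels <- sapply(VIN_demo, function(col) length(unique(col)))
print(n_levels)

# 2. Time the fit on growing random subsamples to see how the cost scales.
sizes <- c(500, 1000, 2000)
timings <- sapply(sizes, function(k) {
  sub <- droplevels(VIN_demo[sample(nrow(VIN_demo), k), ])
  system.time(rpart(MODEL ~ ., data = sub, method = "class"))[["elapsed"]]
})
print(data.frame(rows = sizes, seconds = timings))
```

Running the same two steps on the real data, with `VIN1` in place of `VIN_demo` and small subsample sizes at first, should show whether the time is driven mainly by the number of rows or by columns such as `positie_1_tm_3` and `MODEL` that have many distinct values.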