
I'm trying to predict the type (model) of a vehicle from its vehicle identification number (VIN). The first 10 positions of the VIN say something about the type, so I use them as variables. See an example of the data below:

positie_1_tm_3 positie_4 positie_5 positie_6 positie_7 positie_8 positie_9 positie_10          MODEL
       MBL         B         7         5         L         7         A          6     SKODA YETI
       JNF         A         A         E         1         1         U          2    NISSAN NOTE
       VWZ         Z         Z         5         Z         Z         9          4 VOLKSWAGEN FOX
       F1D         Z         0         V         0         6         4          2 RENAULT MEGANE
       NAK         U         8         1         1         C         A          5    KIA SORENTO
       F1B         R         1         J         0         H         4          1   RENAULT CLIO

I used this R code for it:

# make stratified train and test sets:
library(caret)
train.index <- createDataPartition(VIN1$MODEL, p = .6, list = FALSE)
train <- VIN1[train.index, ]
overige_data <- VIN1[-train.index, ]   # remaining 40% ("overige_data" = "remaining data")
test.index <- createDataPartition(overige_data$MODEL, p = .5, list = FALSE)
test <- overige_data[test.index, ]
testset2 <- overige_data[-test.index, ]

# make decision tree:
library(rpart)
library(rpart.plot)
library(rattle)
library(RColorBrewer)
tree <- rpart(MODEL ~ ., data = train, method = "class")

But the last step, growing the tree, has been running for more than 2 weeks already. The dataset is around 3 million rows, so the training set is around 1.8 million rows. Is it running so long because it's too much data for rpart, or is there another problem?
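
A quick way to check whether sheer data volume explains this is to time rpart on growing subsamples and extrapolate; a minimal sketch, assuming the train set created above (the sample sizes are illustrative):

# Time rpart on increasing subsample sizes to extrapolate the full runtime.
library(rpart)

sizes <- c(1e3, 1e4, 1e5)
timings <- sapply(sizes, function(n) {
  sub <- train[sample(nrow(train), n), ]
  sub$MODEL <- droplevels(factor(sub$MODEL))  # keep only levels present in the subsample
  system.time(rpart(MODEL ~ ., data = sub, method = "class"))["elapsed"]
})
data.frame(rows = sizes, seconds = timings)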

Donald
  • I suspect it's the size of the data set. I have seen 6 hour training runs on 50k x 20 dataframes. Can you train on a smaller set to benchmark, or split up the job to multiple machines? Are you running in parallel mode? – varontron Oct 03 '16 at 12:27
  • I have to train on all types of vehicles, so a smaller set is not an option, and I cannot split the job across multiple machines. I was not familiar with running in parallel mode, so I read about it on the internet. It's hard for me to convert my script to run in parallel; I'm not a pro in R. But I found something about the rxDTree algorithm, which handles big data better and already runs in parallel. Maybe I can try that algorithm, but I see it is a paid product from Revolution. So if you have tips or an example script to make my script run in parallel, I'm happy to hear them (see the sketch after these comments). – Donald Oct 06 '16 at 06:48
  • To clarify, I was suggesting training on a smaller set just to evaluate performance. You could possibly extrapolate the total runtime by training on 1000 rows. Also, I realize you've implied the answer to this question already (i.e., "no"), but is there any opportunity for dimensionality reduction? Clustering? Collinearity? – varontron Oct 06 '16 at 14:06
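
In response to the parallel-mode question: a rough sketch of one way to parallelise on a single machine, assuming a Unix-like OS (parallel::mclapply runs serially on Windows). The first three VIN positions form the World Manufacturer Identifier, and every MODEL in the example data belongs to a single manufacturer, so one much smaller tree can be fitted per positie_1_tm_3 group:

# Fit one tree per manufacturer group, in parallel.
library(rpart)
library(parallel)

groups <- split(train, train$positie_1_tm_3)
trees <- mclapply(groups, function(g) {
  g$MODEL <- droplevels(factor(g$MODEL))  # keep only this group's models
  if (nlevels(g$MODEL) < 2) return(NULL)  # a single model: nothing to split on
  rpart(MODEL ~ . - positie_1_tm_3, data = g, method = "class")
}, mc.cores = max(1, detectCores() - 1))

Prediction then means picking the tree that matches a new VIN's first three characters; a group with a NULL entry maps straight to its single model. This shrinks both the number of rows and the number of classes each rpart call sees.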

1 Answer


No, something is obviously wrong. It may take a long time, but not 2 weeks.

The question is: how many labels (classes) are there? Decision trees tend to be slow when the number of classes is large (by large I mean more than 50).
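
A quick check, plus the rpart controls that usually dominate runtime; a sketch assuming the train split from the question (the minsplit and maxdepth values here are illustrative, not tuned):

# How many classes does rpart have to handle?
length(unique(train$MODEL))

# By default rpart runs 10-fold cross-validation (xval = 10) for pruning,
# which multiplies the work roughly tenfold; xval = 0 disables it.
library(rpart)
tree <- rpart(MODEL ~ ., data = train, method = "class",
              control = rpart.control(xval = 0,        # skip cross-validation
                                      minsplit = 1000, # don't split small nodes
                                      maxdepth = 20))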

Tomasz R