
I am working on the Titanic dataset and trying to fill in the blanks in the Cabin column. I extracted the first letter of each Cabin value and stored it in a new Cabin_New column. I then use rpart to predict the missing values, but every time I run the code below, R takes a very long time. It has never finished; I have had to terminate it every time.

The dataset has 1309 rows, and the columns I am using are shown in the code below. My machine has 4 GB of RAM, an i5 processor, and Windows 7.

combifit  <- rpart(Cabin_New ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title
                     + FamilySize + Surname + FamilyID,
                    data = combi[!is.na(combi$Cabin_New),], method = 'class')
lmo
Abhishek

1 Answer


I see that you have used a lot of factor variables. Please check how many levels each factor has. If a factor has many levels (say, 100 for Surname), rpart has to consider a very large number of candidate splits for it: with a multi-class response such as Cabin_New, the number of ways to group a factor's levels into two sets grows exponentially with the number of levels. So my guess is that these high-cardinality factors (Surname and FamilyID in particular) are why rpart takes so long to decide on each split.
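One way to check the level counts is sketched below. The `combi_demo` data frame is a made-up stand-in for the asker's `combi` (its values are illustrative only); on the real data you would run the same two `vapply` lines against `combi`.

```r
# Hypothetical stand-in for the asker's `combi` data frame (made-up values).
combi_demo <- data.frame(
  Sex     = factor(c("male", "female", "female", "male", "male", "female")),
  Pclass  = c(3, 1, 3, 1, 3, 2),
  Surname = factor(c("Braund", "Cumings", "Heikkinen", "Futrelle", "Allen", "Moran"))
)

# Flag the columns that rpart would treat as categorical predictors.
is_categorical <- vapply(
  combi_demo,
  function(x) is.factor(x) || is.character(x),
  logical(1)
)

# Count the distinct values in each categorical column.
level_counts <- vapply(
  combi_demo[is_categorical],
  function(x) length(unique(x)),
  integer(1)
)

# Columns with the most levels come first; these are the likely culprits.
print(sort(level_counts, decreasing = TRUE))
```

Any column near the top of that list with dozens or hundreds of levels is a candidate for dropping from the formula or collapsing into fewer groups before fitting.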

Also read up on rpart.control, since the number of splits rpart makes depends on the parameters passed to it. For example, cp (the complexity parameter) defaults to 0.01. Try larger values, say between 0.1 and 0.5, and experiment with the other parameters in the same way; you might be able to make rpart run faster.
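A minimal sketch of passing such settings through `rpart.control` follows. The `demo` data frame is synthetic and only loosely imitates the Titanic columns, and the specific values (`cp = 0.1`, `maxdepth = 5`, no cross-validation or surrogates) are assumptions chosen for illustration, not tuned recommendations.

```r
library(rpart)

# Synthetic data, assumed for illustration only (not the real Titanic data).
set.seed(1)
n <- 200
demo <- data.frame(
  Pclass = factor(sample(1:3, n, replace = TRUE)),
  Sex    = factor(sample(c("male", "female"), n, replace = TRUE)),
  Fare   = runif(n, min = 5, max = 100)
)
# Made-up rule so that the response has three classes, like a cabin letter.
demo$Cabin_New <- factor(
  ifelse(demo$Pclass == "1", "C", ifelse(demo$Fare > 50, "E", "F"))
)

# A larger cp prunes more aggressively; the other settings cut extra work
# (cross-validation, competing and surrogate splits) that rpart does by default.
ctrl <- rpart.control(
  cp = 0.1,
  maxdepth = 5,
  xval = 0,
  maxcompete = 0,
  maxsurrogate = 0
)

fit <- rpart(
  Cabin_New ~ Pclass + Sex + Fare,
  data = demo,
  method = "class",
  control = ctrl
)
print(fit)
```

Note that this formula leaves out high-cardinality factors such as Surname and FamilyID; if those are what slows the fit down, changing cp alone may not help much.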

Kumar Manglam