R: A very big cross-validation error

Question

I have 303 data points in the training set (see the picture). Many of these points have a Y value of 0.

Now I want to train a GBM model to predict the Y value. Here is my code:

train.subset <- data.frame(registered=train$yval,  # response; named to match its use below
                           hour=train$hour,
                           daymoment=train$daymoment,
                           year=train$year,
                           log.windspeed=log(train$windspeed+1),
                           weather=train$weather,
                           workingday=train$workingday,
                           log.temp=log(train$temp+1),
                           log.atemp=log(train$atemp+1),
                           log.humidity=log(train$humidity+1))

inTrain <- caret::createDataPartition(train.subset$registered, 
                                      p = .85, list = FALSE)
train.registered <- train.subset[inTrain, ]

cv.registered <- train.subset[-inTrain, ]

fitControl <- caret::trainControl(## 10-fold CV
                method = "repeatedcv",
                number = 10,
                ## repeated ten times
                repeats = 10)

gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
                       n.trees = (5:25)*50,
                       shrinkage = 0.1,
                       ## current caret versions also expect this column; 10 is gbm's default
                       n.minobsinnode = 10)

fit.registered <- caret::train(registered ~ ., data = train.registered,
                               method = "gbm",
                               trControl = fitControl,
                               verbose = FALSE,
                               tuneGrid = gbmGrid)

prediction.registered<-predict(fit.registered, newdata = cv.registered)
prediction.registered[prediction.registered<0] <- min(prediction.registered[prediction.registered > 0])

RMSE <- sqrt(mean((prediction.registered - cv.registered$registered)^2))
RMSE

I then get a rather high RMSE of about 28.
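Whether an RMSE of about 28 is large depends on the scale and spread of the response. One rough check is to compare it with a naive baseline that always predicts the training mean. This is only a minimal sketch in base R: the helper names `rmse` and `baseline_rmse` and the toy vectors are made up for illustration and are not the data from this question.

```r
# Root mean squared error between predictions and observed values.
rmse <- function(pred, actual) {
  sqrt(mean((pred - actual)^2))
}

# RMSE of a naive model that predicts the training mean for every hold-out row.
baseline_rmse <- function(train_actual, holdout_actual) {
  rmse(rep(mean(train_actual), length(holdout_actual)), holdout_actual)
}

# Toy values for illustration only (hypothetical, not the question's data).
toy_train   <- c(0, 0, 0, 10, 20, 30)
toy_holdout <- c(0, 20)

baseline_rmse(toy_train, toy_holdout)
```

With the objects from the code above, the comparison would presumably be `baseline_rmse(train.registered$registered, cv.registered$registered)` against the model's RMSE. A model RMSE close to that baseline would suggest the predictors add little.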

Here is the plot that shows both predicted and actual yval for the cross-validation set.


I don't understand why there is such a big error for this relatively simple curve. Any ideas? Should I perhaps try another package with the tuning parameters found by caret?

In case this information is helpful:

> summary(fit.registered)

                        var   rel.inf
hour                   hour 23.385420
log.atemp         log.atemp 12.959972
daymoment.C     daymoment.C 11.605700
log.humidity   log.humidity 10.972162
log.windspeed log.windspeed  9.627754
daymoment.L     daymoment.L  7.517074
daymoment^4     daymoment^4  4.658695
log.temp           log.temp  4.567798
workingday       workingday  4.135300
daymoment.Q     daymoment.Q  3.766462
year                   year  3.763452
weather             weather  3.040211
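The names daymoment.L, daymoment.Q, daymoment.C and daymoment^4 are what R's formula interface produces for an ordered factor with five levels (polynomial contrasts), so daymoment shows up as four separate rows in the table. A rough way to see its combined share is to sum those rows. The sketch below uses the rel.inf values printed above; the level names of the toy factor are placeholders, since the actual levels are not shown here.

```r
# An ordered factor with five levels is expanded into polynomial contrasts
# (.L, .Q, .C, ^4) by model.matrix(); the level names are placeholders.
daymoment <- factor(c("a", "b", "c", "d", "e"),
                    levels = c("a", "b", "c", "d", "e"), ordered = TRUE)
contrast.cols <- colnames(model.matrix(~ daymoment))
contrast.cols

# Relative influence values copied from the summary output above.
rel.inf <- c(hour = 23.385420, log.atemp = 12.959972,
             daymoment.C = 11.605700, log.humidity = 10.972162,
             log.windspeed = 9.627754, daymoment.L = 7.517074,
             "daymoment^4" = 4.658695, log.temp = 4.567798,
             workingday = 4.135300, daymoment.Q = 3.766462,
             year = 3.763452, weather = 3.040211)

# Map each contrast column back to its source variable, then sum per variable.
source.var <- ifelse(grepl("^daymoment", names(rel.inf)),
                     "daymoment", names(rel.inf))
combined <- sort(sapply(split(rel.inf, source.var), sum), decreasing = TRUE)
combined
```

Summed this way, the four daymoment rows add up to about 27.5, which is more than hour on its own. Relative influence is only a descriptive measure, so this regrouping is a reading aid rather than a formal test.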

UPDATE:

Train set

Test set

I wouldn't jump to another algorithm quite yet (you can try it, sure, but why not run some diagnostics first?). Can you provide the data so I can reproduce it? If not, can you see what happens when you try different specifications of the dependent variable, such as cutting it into quintiles, making it a factor, and then generating a confusion matrix? It's easier to diagnose as a classifier than as a regression when you're using a machine learning package. Alternatively, since you said that many y-values are zero, perhaps try a zero-inflated Poisson regression. - Hack-R, Mar 23 '15 at 14:47
This is really not a coding question in its current form. I was actually surprised that the curve was as close to the points as it was. That's not a lot of data points to support building a complex model. - IRTFM, Mar 23 '15 at 16:27
@Hack-R: Thanks for sharing your ideas. I tried the "zeroinfl" function, but it fails with the error message "System is computationally singular". I also performed some exploratory analysis, which showed dependencies between the dependent and independent variables. I uploaded both the train and test sets (see updated thread). It would be great if you could explain in detail why all my models fail. - Klausos Klausos, Mar 24 '15 at 19:40
Sure, you're welcome. I'll try to take a stab at it later this evening. Collinearity (dependence) between the independent and dependent variables is good, but collinearity among the independent variables is bad, and that is what leads to the singularity. So you'll probably need to figure out which of the independent variables are multicollinear and then get rid of the redundant ones. - Hack-R, Mar 24 '15 at 19:58
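
Two of the diagnostics mentioned in the comments can be sketched in base R: binning the response into quantile classes for a confusion table, and checking the predictors for near-collinearity. Everything below runs on simulated toy data. The variable names, the simulated relationships and the 0.95 correlation cutoff are assumptions for illustration, not values from the question.

```r
set.seed(1)
n <- 200

# Simulated predictors: atemp is built to track temp closely (assumption).
temp     <- runif(n, 0, 40)
atemp    <- temp + rnorm(n, sd = 0.5)
humidity <- runif(n, 20, 100)

# Simulated zero-heavy response and some noisy stand-in "predictions" of it.
actual    <- ifelse(runif(n) < 0.4, 0, round(5 * temp + rnorm(n, sd = 10)))
actual    <- pmax(actual, 0)
predicted <- pmax(actual + rnorm(n, sd = 15), 0)

# 1) Quintile classes. With many zeros several quantiles coincide, so the
#    breaks are de-duplicated before calling cut().
breaks <- unique(quantile(actual, probs = seq(0, 1, by = 0.2)))
# Open the outer bins so every value, including predictions outside the
# observed range, falls into a class.
breaks[1] <- -Inf
breaks[length(breaks)] <- Inf
actual.class    <- cut(actual, breaks = breaks)
predicted.class <- cut(predicted, breaks = breaks)
confusion <- table(predicted = predicted.class, actual = actual.class)
confusion

# 2) Pairwise correlations between predictors; flag pairs above the cutoff.
X <- cbind(temp = temp, atemp = atemp, humidity = humidity)
cors <- cor(X)
high <- which(abs(cors) > 0.95 & upper.tri(cors), arr.ind = TRUE)
flagged <- data.frame(var1 = rownames(cors)[high[, "row"]],
                      var2 = colnames(cors)[high[, "col"]],
                      r    = cors[high])
flagged
```

Applied to the data in the question, log.temp and log.atemp would be natural candidates to check first, although whether they are actually collinear there is not shown in this thread.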
