The problem:
I have a dataset inputAll.data. I want to use 80% of the data as model construction input and validate the model on the remaining 20% of the data. I have manually split the dataset into two smaller datasets, input80.data and input20.data, containing 80% and 20% of the data respectively.
Format of data in my datasets:
Name xvalues yvalues
Prog1 0.654219 59.70282
Prog2 0.149516 49.59548
Prog3 0.50577 50.53859
Prog4 0.77783 59.95499
Prog5 0.237923 49.61133
Prog6 0.756063 50.63021
Prog7 0.015625 53.77959
I am using 80% of the data to construct a non-linear regression model using nls:
df = read.table("input80.data", header = TRUE)  # load the 80% training split
yval = df$yvalues
xval = df$xvalues
model1 = nls(formula = yval ~ exp(xval + beta * xval), start = list(beta = 0))
sm1 = summary(model1)
fit1 = fitted.values(model1)
I am taking the remaining 20% of the data to obtain predicted values. I saved a copy of this data, which contains the actual y values, in another file called input20Actual.data; input20.data only contains the x values.
dfNew = read.table("input20.data", header = TRUE)
xpred = dfNew$xvalues
dfVerify = read.table("input20Actual.data", header = TRUE)
yverify = dfVerify$yvalues
xverify = dfVerify$xvalues
# the new-data column must be named xval, matching the variable in the model
# formula; otherwise predict() falls back to the training xval
obtainedPred = predict(model1, newdata = data.frame(xval = xpred))
I am then using a custom function called RMSE to calculate the error between the prediction and the actual value.
RMSE <- function(fitted, actual){
  sqrt(mean((fitted - actual)^2))
}
The error calculation is done by taking each predicted value and comparing it to the actual value that I had stored in input20Actual.data. I am storing the output in a file.
sink("ErrorsOut.txt")
cat("\n\nRMSE:\n")
for (i in 1:13) {
#There are 13 values to be predicted in input20.data
corr = obtainedPred[[i]]
act = yverify[[i]]
err = RMSE(act, corr)
cat(err)
cat(" ")
}
cat("\n")
sink()
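Note that for a single pair of values RMSE reduces to the absolute difference, so the loop above prints per-point errors. Since RMSE already averages over its inputs, the whole hold-out set can also be summarised in one vectorised call. A small sketch, using toy vectors standing in for obtainedPred and yverify:

```r
RMSE <- function(fitted, actual){
  sqrt(mean((fitted - actual)^2))
}

# toy vectors standing in for obtainedPred and yverify
pred   <- c(50.1, 49.8, 60.2)
actual <- c(50.0, 50.0, 60.0)

RMSE(pred, actual)  # one overall error across all points: sqrt(0.03), about 0.173
```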
The problem is that I have split the input set manually. I would like to automate this, and do the same thing for different splits (different data each time) and obtain an average of the calculated errors.
What I tried:
I have read on StackOverflow about cross-validation in R. My understanding is that it iteratively takes some percentage of the data for model creation and the remainder for testing. If I can use a cross-validation function with nls, I won't have to split my input data into two files.
I have searched on SO a lot for a solution. Many answers about cross-validation were for lm, but I specifically require cross-validation for nls. I also read about the caret package, but when I try to install it I usually end up with package installation errors, like the one below:
Warning: dependency ‘plyr’ is not available
package ‘plyr’ is not available (for R version 3.0.2)
So I was hoping there was a direct way to perform cross-validation (in RKWard) without installing more packages. Is there a function or API in base R that I can use to iteratively create and test models?
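For reference, here is the kind of loop I imagine, written with only base R (sample, read.table, nls, predict): assign each row to one of k folds, fit on the other k-1 folds, predict the held-out fold, and average the per-fold RMSE. This is only a sketch; the synthetic data below (with a made-up true beta of 3) stands in for inputAll.data, which you would instead load with read.table("inputAll.data", header = TRUE).

```r
set.seed(42)  # reproducible fold assignment

# synthetic stand-in for inputAll.data (65 rows, like 52 + 13 in my split)
n <- 65
xvalues <- runif(n)
all.df <- data.frame(xvalues = xvalues,
                     yvalues = exp(xvalues + 3 * xvalues) + rnorm(n, sd = 0.5))

RMSE <- function(fitted, actual) sqrt(mean((fitted - actual)^2))

k <- 5  # 5 folds, i.e. repeated 80/20 splits
folds <- sample(rep(1:k, length.out = nrow(all.df)))  # fold label per row

fold.rmse <- numeric(k)
for (f in 1:k) {
  train <- all.df[folds != f, ]  # 80% for model construction
  test  <- all.df[folds == f, ]  # 20% for validation
  m <- nls(yvalues ~ exp(xvalues + beta * xvalues),
           data = train, start = list(beta = 0))
  pred <- predict(m, newdata = test)  # uses test$xvalues via the formula
  fold.rmse[f] <- RMSE(pred, test$yvalues)
}

fold.rmse        # per-fold errors
mean(fold.rmse)  # average error over all splits
```

Because nls takes a data argument here, predict() picks up xvalues from each test fold directly and no manual file splitting is needed.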
Please note that I am a complete newbie to R. Sorry if this is an obvious question.