The problem:

I have a dataset inputAll.data. I want to use 80% of the data as model construction input and validate the model on the remaining 20% of data.

I have manually split the dataset into two smaller datasets input80.data and input20.data containing 80% and 20% of the data respectively.

Format of data in my datasets:

Name      xvalues     yvalues
Prog1     0.654219    59.70282
Prog2     0.149516    49.59548
Prog3     0.50577     50.53859
Prog4     0.77783     59.95499
Prog5     0.237923    49.61133
Prog6     0.756063    50.63021
Prog7     0.015625    53.77959

I am using 80% of the data to construct a non-linear regression model using nls.

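df = data.frame(input80.data)
# Refer to the data frame columns by name (with data = df) so that
# predict() can later match a new data frame that has an xvalues column
model1 = nls(yvalues ~ exp(xvalues + beta * xvalues), data = df, start = list(beta = 0))
sm1 = summary(model1)
fit1 = fitted.values(model1)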

I am taking the remaining 20% data to obtain predicted values. I saved a copy of this data which contains the actual y values in another file called input20Actual.data, but input20.data only contains the x values.

dfNew = data.frame(input20.data)
xpred = dfNew$xvalues
dfVerify = data.frame(input20Actual.data)
yverify = dfVerify$yvalues
xverify = dfVerify$xvalues

obtainedPred = predict(model1, data.frame(xvalues = c(xpred) ))

I am then using a custom function called RMSE to calculate the error between the prediction and the actual value.

RMSE <- function(fitted, actual){
  sqrt(mean((fitted - actual)^2))
}

The error calculation is done by taking each predicted value and comparing it to the actual value that I had stored in input20Actual.data. I am storing the output in a file.

sink("ErrorsOut.txt")
cat("\n\nRMSE:\n")
for (i in seq_along(yverify)) {
    # One iteration per value to be predicted (13 in input20.data)
    corr = obtainedPred[[i]]
    act = yverify[[i]]
    err = RMSE(corr, act)
    cat(err)
    cat(" ")
}
cat("\n")
sink()
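
Because RMSE is called on one pair of numbers at a time in the loop, each value written to the file is just the absolute error between that prediction and the actual value. A single RMSE for the whole held-out set comes from passing the full vectors instead. A small sketch of the difference, where the `actual` values are taken from the sample rows shown earlier and the `predicted` values are made up purely for illustration:

```r
RMSE <- function(fitted, actual) {
  sqrt(mean((fitted - actual)^2))
}

actual    <- c(59.70282, 49.59548, 50.53859)  # y values from the sample rows
predicted <- c(58.0, 51.0, 50.0)              # hypothetical predictions, not real output

# Calling RMSE pair by pair gives the absolute error of each prediction
per_point <- mapply(RMSE, predicted, actual)
all.equal(per_point, abs(predicted - actual))

# Calling RMSE once on the full vectors gives one error for the whole set
overall <- RMSE(predicted, actual)
overall
```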

The problem is that I have split the input set manually. I would like to automate this, and do the same thing for different splits (different data each time) and obtain an average of the calculated errors.

What I tried:

I have read on StackOverflow about cross-validation in R. My understanding is that it iteratively takes some % of data for model creation and the remaining for testing. If I can use a cross validation function in nls, I don't have to split my input data into two files.

I have searched SO extensively for a solution. Many answers about cross-validation were for lm, but I specifically need cross-validation for nls. I also read about the caret package, but when I try to install it I usually end up with package installation errors, like the one below:

Warning: dependency ‘plyr’ is not available
package ‘plyr’ is not available (for R version 3.0.2)

So I was hoping there was a direct way to perform cross-validation (in rkward) without installing more packages. Is there a function or API in R that I can use for iteratively creating models and testing them?

Please note that I am a complete newbie to R. Sorry if this is an obvious question.

Kajal
  • You need to update your R. You're on version 3.0.2 and we're currently on 3.3.0. Once you do that, you'll be able to install the `caret` package – Cyrus Mohammadian May 31 '16 at 11:25
  • @CyrusMohammadian But I installed R using `sudo apt-get install r-base`. I just repeated it and it says `r-base is already the newest version`. Is the 3.3.0 version some sort of package? I am using R with rkward and I had followed the steps I had seen here: http://www.r-bloggers.com/download-and-install-r-in-ubuntu/ – Kajal May 31 '16 at 11:34
  • Also, if I absolutely _have_ to update, does that mean that there is no way to perform cross validation on this version itself, and no function/API in R that I can use directly? – Kajal May 31 '16 at 11:36
  • You need to update your R. Of course there are ways to perform it without an update, but they may require you to hunt down and fetch previous versions of packages, which isn't easy. Your version of R is out of date. See here https://cloud.r-project.org – Cyrus Mohammadian May 31 '16 at 11:42
  • R code posted in questions on SO should be reproducible. See [mcve]. – G. Grothendieck May 31 '16 at 14:23
  • I had already added all the code I had. I edited my question to add the dataset format as well. I hope it is reproducible now. – Kajal Jun 03 '16 at 09:14

1 Answer

Using the built-in data frame BOD, try the simple model shown in fo below. First use sample to get the indexes of the in-sample rows and fit the model on those. predict.nls is then used to get predicted values for the out-of-sample data from the in-sample model. From those, the residual sum of squares (RSS) and other measures can be calculated. Each time this is run, sample will generate a possibly different set of indexes (provided set.seed is not rerun). This could be packaged in a function and run repeatedly. No packages are used.

set.seed(123) # for reproducibility

n <- nrow(BOD)
frac <- 0.8
ix <- sample(n, frac * n) # indexes of in sample rows

fo <- demand ~ a + Time * b
fm <- nls(fo, BOD, start = c(a = 0, b = 0), subset = ix) # in sample model

BOD.out <- BOD[-ix, ] # out of sample data
pred <- predict(fm, newdata = BOD.out)
act <- BOD.out$demand
RSS <- sum( (pred - act)^2 )
RSS
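
The single split above can be wrapped in a function and repeated to average the error over many random splits. The following is only a minimal sketch under stated assumptions: the helper names `holdout_rmse` and `repeated_holdout` are hypothetical, the 20 repetitions are an arbitrary choice, and RMSE is reported instead of RSS to match the measure used in the question. Only base R and the stats package are used.

```r
# Hypothetical helper: make one random split, fit nls on the in-sample rows,
# and return the RMSE on the out-of-sample rows.
holdout_rmse <- function(data, formula, start, response, frac = 0.8) {
  n <- nrow(data)
  ix <- sample(n, floor(frac * n))          # indexes of in-sample rows
  fm <- nls(formula, data = data[ix, ], start = start)
  out <- data[-ix, ]                        # out-of-sample rows
  pred <- predict(fm, newdata = out)
  sqrt(mean((pred - out[[response]])^2))
}

# Hypothetical helper: repeat the split `reps` times and summarise the errors.
# vapply with an explicit function is used so that `...` is passed through.
repeated_holdout <- function(reps, ...) {
  errs <- vapply(seq_len(reps), function(i) holdout_rmse(...), numeric(1))
  c(mean = mean(errs), sd = sd(errs))
}

set.seed(123) # for reproducibility
res <- repeated_holdout(20, data = BOD, formula = demand ~ a + Time * b,
                        start = c(a = 0, b = 0), response = "demand")
res
```

To apply this to another dataset, pass its data frame, a formula written in terms of its column names, suitable starting values, and the name of the response column. A model that fails to converge on some split would stop the loop with an error, so wrapping the nls call in tryCatch may be needed for harder models.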
G. Grothendieck
  • Thanks! This does generate different data after repeating it (without repeating `set.seed`). I will tweak this to use it with my dataset. Thanks for the help. – Kajal Jun 03 '16 at 09:09