Compute model efficiency in a cross validation leave one subject out mode in R

Question

I have a dataframe df

structure(list(x = c(49, 50, 51, 52, 53, 54, 55, 56, 1, 2, 3, 
    4, 5, 14, 15, 16, 17, 163, 164, 165, 153, 154, 72, 38, 39, 40, 
    23, 13, 14, 15, 5, 6, 74, 75, 76, 77, 78, 79, 80, 81, 82, 127, 
    128, 129, 130, 131, 132, 71, 72, 73, 74, 75, 76, 2, 3, 4, 5, 
    6, 99, 100, 101, 10, 11, 3, 30, 50, 51, 52, 53, 54, 56, 64, 66, 
    67, 68, 69, 34, 35, 37, 39, 2, 46, 47, 17, 18, 99, 100, 102, 
    103, 84, 85, 86, 87, 88, 67, 70, 72), y = c(2268.14043972082, 
    2147.62290922552, 2269.1387550775, 2247.31983098201, 1903.39138268307, 
    2174.78291538358, 2359.51909126411, 2488.39004804939, 212.851575751527, 
    461.398994384333, 567.150629704352, 781.775113821961, 918.303706148872, 
    1107.37695799186, 1160.80594193377, 1412.61328924168, 1689.48879626486, 
    685.154353165934, 574.088067465695, 650.30821636616, 494.185166497016, 
    436.312162090908, 641.231373456365, 494.374217984441, 201.745910386788, 
    486.030122926459, 483.045965372879, 265.693628564011, 285.116345260642, 
    291.023782284086, 229.606221692753, 230.952761338012, 1089.06303295676, 
    1255.88808925333, 1087.75402177912, 1068.248897182, 1212.17254891642, 
    884.222588171535, 938.887718005513, 863.582247020735, 1065.91969416523, 
    790.570635510725, 834.500908313203, 710.755041345197, 814.002362551197, 
    726.814950022846, 828.559687148314, 611.564698476112, 603.238720579422, 
    524.322001078981, 565.296378873638, 532.431853589369, 597.174114277044, 
    260.737164468854, 306.72700499362, 283.410379620422, 366.813913489692, 
    387.570173754128, 606.075737104722, 686.408686154056, 705.914347674276, 
    388.602676983443, 477.858510450125, 128.198042456082, 535.519377609133, 
    1893.38468471169, 1819.83262739703, 1827.31409981102, 1640.5816780664, 
    1689.0365549922, 2112.67917439342, 1028.8780498564, 1098.54431357711, 
    1265.26965941035, 1129.58344809909, 820.922447928053, 749.343583476846, 
    779.678206156474, 646.575242339517, 733.953282899613, 461.156280127354, 
    1184.81825619942, 1281.2920902365, 906.813018662913, 798.186995701282, 
    831.365377249207, 764.519073183124, 672.076289062505, 669.879217186302, 
    1265.48484068702, 1193.29000986667, 1156.81486114406, 1199.7373066445, 
    1116.24029749935, 1341.47673353751, 1401.44881976186, 1640.27575962036
    ), ID = 1:97), .Names = c("x", "y", "ID"), row.names = c(NA, 
    -97L), class = "data.frame")

I now want to compute the model efficiency of one non-linear model based on a cross validation leave one ID out. I implemented this line of code.

library(stats)
library (hydroGOF)

id <- unique(df$ID)
for (i in id){
  fit1 <- try(nls(y~A*x^3+B*x^2+C*x+D, data = df[df$ID != i,], start = list(A=0.02, B=-0.6, C= 50, D=200)), silent=TRUE)
  Out <- if (inherits(fit1, "nls")) NSE(sim = predict(fit1, newdatadata=df[df$ID==i,]), obs = df$y, na.rm=T)
  }

However, I have this error message:

    Error in valindex.default(sim, obs) : 
  Invalid argument: 'length(sim) != length(obs)' !! (96!=97) !!

Can someone help me out with that?

@LyzandeR. Well I have tried that too but I still get the same error message. It seems that the `predict` function predicts on my train data and not on the test data. it is a bit weird. — SimonB, Oct 06 '15 at 13:18
@SimonB I provided an answer that shows what the problem is. — LyzandeR, Oct 06 '15 at 13:26

LyzandeR · Accepted Answer · 2015-10-06T13:31:12.003

You have a few small mistakes and a big logic mistake in the above code which I address below:

First of all the code should be like this:

library(stats)
library (hydroGOF)

id <- unique(df$ID)
for (i in id){
  fit1 <- try(nls(y~A*x^3+B*x^2+C*x+D, data = df[df$ID != i,], start = list(A=0.02, B=-0.6, C= 50, D=200)), silent=TRUE)
  Out <- if (inherits(fit1, "nls")) NSE(sim = predict(fit1, newdata=df[df$ID==i,]), obs = df$y[df$ID==i], na.rm=T)
}

In your question, in the NSE function you set the argument as newdatadata=df[df$ID==i,]) instead of newdata=df[df$ID==i,]) i.e. there is an additional data in there which causes trouble when running the function (you misspelled the argument :) ) . Also, the obs argument does not have the correct length as the whole y column will be used but you only need df$y[df$ID==i] which is of length one (in order to match the prediction which now is of length one too).

Now after the corrections the above code will run.

However, once you run it you will see that it produces a warning that says that you cannot use NSE when 'sum((obs - mean(obs))^2)=0' => it is not possible to compute 'NSE'. In your case, since you only have one obs the calculation 'sum((obs - mean(obs))^2)=0' will always be zero.

So, you cannot use this technique with NSE because it will fail by definition (since you try to calculate NSE on a single observation). You should probably collect all the leave-one-out predictions, store them in a variable and then use NSE on that variable against df$y. That will work.

What I mean is the following (done with leave-one-out cv):

Out <- c()
id <- unique(df$ID)
for (i in id){
  fit1 <- try(nls(y~A*x^3+B*x^2+C*x+D, data = df[df$ID != i,], start = list(A=0.02, B=-0.6, C= 50, D=200)), silent=TRUE)
  Out[i] <- if (inherits(fit1, "nls")) sim = predict(fit1, newdata=df[df$ID==i,])
}

Which will now work:

> NSE(Out, df$y)
[1] 0.3440862

Nice. Thanks. And what about trying computing two functions in the same loop and have an Out2 object. Let;s say I have this gamma function as well `fit1 <- try(nls(y~A*(x^B)*(exp(k*x)), data = dft[df$ID != i,], start = list(A = 1000, B = 0.170, k = -0.00295)), silent=TRUE)`. How can I implement it? — SimonB, Oct 06 '15 at 14:00
@SimonB Save that to `fit2` and then use an `out2` variable in exactly the same way as `out`. I can't think of anything better. — LyzandeR, Oct 06 '15 at 14:03
Alright. This should work indeed. I also want to apply this loop in a list of dataframe and not on only one i.e.`df`. I guess I have to make a function out of it, right? Any ideas how I could make it? — SimonB, Oct 06 '15 at 14:08
@SimonB Yeah I think converting it into a function will be the best choice. There are lots of questions on SO for getting some help on it and feel free to ask a new one if you find yourself in trouble. Thanks! — LyzandeR, Oct 06 '15 at 14:12
I have tried to play around with implementing the function without success. I have asked a new question. Here is the link if you would like to try it out. http://stackoverflow.com/questions/32972778/compute-a-function-from-a-loop-to-apply-it-in-a-list-of-dataframe — SimonB, Oct 06 '15 at 14:40

Compute model efficiency in a cross validation leave one subject out mode in R

1 Answers1