
I need to fit x-y data with a model that is non-analytic. I have a function f(x) that calculates the model prediction for each x numerically, but there is no analytical equation. For the fit I use optim in R, minimising the RMS between the model and the data. This works well and returns reasonable parameters.

I would like to find confidence intervals (or at least standard errors) on the best-fitting parameters. I found on the internet that this can be done from the Hessian matrix, but only when maximising a log-likelihood function, and I am not sure how that applies here: all I have are x, y and f(x), from which I compute the RMS. Alas, I have no good way of estimating the errors on y.
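
For reference, this is as far as I got with the Hessian route. It is only a sketch: it assumes the errors on y are roughly Gaussian with constant variance (which I cannot verify) and uses the model function f from the example below.

# sketch: minimise the sum of squares (same optimum as minimising RMS)
# and ask optim for the numerical Hessian at the minimum
sse <- function(par, x, y) sum((f(x, par[1], par[2]) - y)^2)
fit <- optim(c(1, 0), sse, x = x, y = y, hessian = TRUE)

n <- length(y)
p <- length(fit$par)
s2 <- fit$value / (n - p)              # residual variance estimate
covmat <- 2 * s2 * solve(fit$hessian)  # Hessian of SSE ~ 2 J'J, so cov ~ 2 s2 H^-1
se <- sqrt(diag(covmat))               # standard errors
cbind(fit$par - 1.96 * se, fit$par + 1.96 * se)  # approximate 95% CI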

How can I find confidence intervals on my fit parameters?

Edit: perhaps an example in R might help explain what I'm asking for. This example uses a simple analytic function to fit the data; in my real case the function is non-analytic, so I cannot use, e.g., nls.

set.seed(666)

# generate data
x <- seq(100) / 100
y <- 0.5 * x + rnorm(100, sd = 0.03) + 0.2

# function to fit
f <- function(x, a, b) {
  a * x + b
}

# error function to minimise: RMS between model and data
errfun <- function(par, x, y) {
  a <- par[1]
  b <- par[2]
  sqrt(mean((f(x, a, b) - y)^2))
}

# use optim to fit the model to the data
par0 <- c(1, 0)  # starting values for (a, b)
res <- optim(par0, errfun, x = x, y = y)

# best-fitting parameters
best_a <- res$par[1]
best_b <- res$par[2]

The best-fitting parameters come out as a ≈ 0.50 and b ≈ 0.20. I need to find 95% confidence intervals on these.

  • If there is no (easy) analytical expression for the likelihood, you are probably better off using a (nonparametric) bootstrap. Calculate f(x*) for many different x*, which are sampled with replacement from x. – JDL Jan 08 '18 at 09:53
  • How do I find confidence intervals from the bootstrap? I wasn't very clear with my definitions, the function I fit to the data is f(x; a, b, c), where a, b and c are model parameters. – Marek Gierliński Jan 08 '18 at 10:56
  • I've turned it into a full answer. Hopefully that is more helpful (assuming you are using the standard notation of `y` being the outcome, `x` being the data, `f(x)` being some sort of estimator and `a,b,c` being independent model parameters (e.g. number of iterations) that don't depend on `x`). – JDL Jan 08 '18 at 16:54

1 Answer


This is a job for the bootstrap:

(1) Create a large number of synthetic datasets x*. Each one is created by sampling from x with replacement, drawing the same number of observations as in x. For example, if your data are (1,2,3,4,5,6), an x* might be (5,2,4,4,2,3). (Note that values may appear multiple times, or not at all, because we are sampling with replacement.)

(2) For each x*, calculate f(x*). If there are other parameters that don't depend on the data, don't change them (so f(x,a,b,c) becomes f(x*,a,b,c), as long as a, b, c don't depend on x). Call these quantities f*.

(3) You can estimate anything you want from these f*. If you want the standard deviation of f(x), take the standard deviation of f*. If you want a 95% confidence interval, take the range from the 2.5th to the 97.5th percentile of f*. More formally, if you want to estimate g(f(x)), you estimate it as g(f(x*)). A generic skeleton of these three steps is sketched below.
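
Here `bootstrap` and `statistic` are hypothetical names for illustration: `statistic` stands for any function that maps a resampled dataset to the quantity you care about.

# generic skeleton of steps (1)-(3); `bootstrap` and `statistic` are
# illustrative names, not functions from base R
bootstrap <- function(data, statistic, n_boot = 10000) {
  replicate(n_boot, {
    idx <- sample(nrow(data), replace = TRUE)  # step (1): resample rows
    statistic(data[idx, , drop = FALSE])       # step (2): recompute on the resample
  })
}
# step (3): summarise the replicates, e.g.
# quantile(reps, c(0.025, 0.975)) for a 95% interval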

I should say this is a very practically-oriented explanation of the bootstrap. I have glossed over many theoretical details, but the bootstrap is near-universally applicable (basically as long as the thing you are trying to estimate actually exists, you are usually okay).

To apply this to the example you have given in your code:

# generate the example data, as in the question
set.seed(666)
x <- seq(100) / 100
y <- 0.5 * x + rnorm(100, sd = 0.03) + 0.2

# function to fit
f <- function(x, a, b) {
  a * x + b
}

# error function to minimise: RMS between model and data
errfun <- function(par, x, y) {
  a <- par[1]
  b <- par[2]
  sqrt(mean((f(x, a, b) - y)^2))
}

# this is the part where we bootstrap:
# resample (x, y) pairs with replacement, refit, and keep the parameters
n_boot <- 10000
best_a <- best_b <- numeric(n_boot)
for (i in seq_len(n_boot)) {
  j <- sample(length(x), replace = TRUE)  # indices of one resample
  x.boot <- x[j]
  y.boot <- y[j]
  res <- optim(c(1, 0), errfun, x = x.boot, y = y.boot)
  # best-fitting parameters for this resample
  best_a[i] <- res$par[1]
  best_b[i] <- res$par[2]
}

# now, we look at the *vector* best_a
# for example, if you want the standard deviation of a,
sd(best_a)
# or a 95% confidence interval for b,
quantile(best_b, c(0.025, 0.975))
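
The same resampling scheme can also be written with the boot package, which computes percentile intervals for you. A sketch, assuming boot is installed; it reuses f and errfun from above:

library(boot)

# statistic: refit the model on the resampled rows, return the parameters
fit_stat <- function(data, idx) {
  optim(c(1, 0), errfun, x = data$x[idx], y = data$y[idx])$par
}

b <- boot(data.frame(x = x, y = y), fit_stat, R = 10000)
boot.ci(b, type = "perc", index = 1)  # 95% percentile CI for a
boot.ci(b, type = "perc", index = 2)  # 95% percentile CI for b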
JDL
  • Thank you, JDL. This is a very clear explanation. However, I need confidence intervals on fit parameters, a, b and c. That is, apart from the best-fitting a, I need a_lo and a_up, which are, e.g., 95% CI for a. – Marek Gierliński Jan 09 '18 at 07:57
  • Then these are part of the results of `f(x)`. By putting them in the brackets it looks like they are user-specified. `f(x)` is allowed to have multiple elements in it. – JDL Jan 09 '18 at 13:00
  • Sorry, I still cannot see how bootstrapping x can help me with my problem. I extended the question to include a simple example. Perhaps you could explain how to apply bootstrap on this example. – Marek Gierliński Jan 10 '18 at 15:58
  • Ah! I see now! It is sampling with replacement from data (x, y), fitting each sample and finding parameters a* and b*. I have done quite a bit of bootstrapping in my life, but haven't seen this approach. But now I see that this makes perfect sense. Thanks a lot for your effort! – Marek Gierliński Jan 11 '18 at 07:46
  • Glad to be of help :) – JDL Jan 11 '18 at 08:35