nls function does not perform well

Question

I need to fit this formula

 y ~ 1/(pi*a*(1+((x-2.15646)/a)^2))+1/(pi*b*(1+((x-2.16355)/b)^2))

to the variables x and y

 x<- c(2.15011, 2.15035, 2.15060, 2.15084, 2.15109, 2.15133, 2.15157, 2.15182, 2.15206, 2.15231, 2.15255, 2.15280, 2.15304, 2.15329, 2.15353, 2.15377, 2.15402, 2.15426, 2.15451, 2.15475, 2.15500, 2.15524, 2.15549, 2.15573, 2.15597, 2.15622, 2.15646, 2.15671, 2.15695, 2.15720, 2.15744, 2.15769, 2.15793, 2.15817, 2.15842, 2.15866, 2.15891, 2.15915, 2.15940, 2.15964, 2.15989, 2.16013, 2.16037, 2.16062, 2.16086, 2.16111, 2.16135, 2.16160, 2.16184, 2.16209, 2.16233, 2.16257, 2.16282, 2.16306, 2.16331, 2.16355, 2.16380, 2.16404, 2.16429, 2.16453, 2.16477, 2.16502, 2.16526, 2.16551, 2.16575, 2.16600, 2.16624, 2.16649, 2.16673, 2.16697, 2.16722, 2.16746, 2.16771, 2.16795, 2.16820, 2.16844, 2.16869, 2.16893, 2.16917, 2.16942, 2.16966, 2.16991)
 y<- c(3.77212,  3.79541,  3.84574,  3.91918, 4.01056,  4.11677,  4.23851,  4.37986, 4.54638,  4.74367,  4.97765,  5.25593,  5.58823,  5.98405,  6.44850,  6.98006, 7.57280,  8.22085,  8.92094,  9.66990, 10.45900, 11.26720, 12.05540, 12.76920, 13.34830, 13.74250, 13.92420, 13.89250, 13.67090, 13.29980, 12.82780, 12.30370, 11.76950, 11.25890, 10.80020, 10.41860, 10.13840,  9.98005,  9.95758, 10.07690, 10.33680, 10.73210, 11.25730, 11.90670, 12.67240, 13.54110, 14.49530, 15.51670, 16.58660, 17.67900, 18.75190, 19.74600, 20.59680, 21.24910, 21.66800, 21.83910, 21.76560, 21.46020, 20.94020, 20.22730, 19.35360, 18.36460, 17.31730, 16.26920, 15.26920, 14.35320, 13.54360, 12.85230, 12.28520, 11.84690, 11.54040, 11.36610, 11.32130, 11.39980, 11.59230, 11.88310, 12.25040, 12.66660, 13.09810, 13.50220, 13.82580, 14.01250)

for estiming 'a' and 'b' values according to x and y. 'a' and 'b' are in the range between 0 and 1.

However, when I used the nls command:

nls(y ~1/(pi*a*(1+((x-2.15646)/a)^2))+1/(pi*b*(1+((x-2.16355)/b)^2)), control = list(maxiter = 500), start=list(a=0.4,b=0.4))

The console reported the following error:

singular gradient

Can anyone explain to me why does the console print this message?

Your starting values are not good and it might be better to use a different optimizer (assuming the model actually fits the data well). — Roland, Jan 29 '14 at 15:30
which ones do you suggest? I tried the 'nls2' package, but I got the same error. In theory, the model fits the data well. It's a piece of an NMR spectrum and all peaks has a lorentzian shape. As I've observed 2 overlapped peaks, the input formula I chose is the sum of 2 lorentzian curves. — user3032330, Jan 29 '14 at 15:36
Fitting a pair of peaks to data w/ 2 peaks often does not work. It's just not that straightforward. You know "In theory the model works. In practice, the theory doesn't" In your case, the paucity of data is going to make it quite hard to fit any model formula very well. — Carl Witthoft, Jan 29 '14 at 15:41
Do you mean that with more data points I should be able to get any approximation? Or may I have any other handicap that I overlooked? — user3032330, Jan 29 '14 at 16:03
How would you do the fitting if it was your problem (i.e. other programming language, other statistical approximation,...) — user3032330, Jan 29 '14 at 16:05
I wouldn't. I might run a spline smoothing algorithm (`spline` or `smooth.spline` and then use one of the many peak-finding tools and be done w/ it. You've posted variations on this question several times now; it's still not clear why you need to fit a curve rather than ID the peaks. — Carl Witthoft, Jan 29 '14 at 18:07

jlhoward · Accepted Answer · 2014-01-29T22:04:24.390

This gives a better fit:

Before getting into the code (below), there are several issues with your model:

Assuming this is proton NMR, the area under the peaks is proportional to the proton abundance (so, number of protons). Your model does not allow for this, essentially forcing all peaks to have the same area. This is the main reason for the poor fit. We can accommodate this easily by including a "height" factor for each peak.
Your model assumes the peak positions. Why not just let the algorithm find the true peak positions?
Your model does not account for baseline drift, which as you can see is quite severe in your dataset. We can accommodate this by adding a linear drift function to the model.

nls(...) is poor for this type of modeling - the algorithms it uses are not especially robust. The default algorithm, Gauss-Newton, is especially poor when fitting offsetted data. So estimating p1 and p2 in a model with f(x-p1,x-p2) nearly always fails.

A better approach is to use the exceptionally robust Levenberg-Marquardt algorithm implemented in nls.lm(...) in package minpack. This package is a bit harder to use but it is capable of dealing with problems inaccessible with nls(...). If you're going to do a lot of this, you should read the documentation to understand how this example works.

Finally, even with nls.lm(...) the starting points have to be reasonable. In your model a and b are the peak widths. Clearly they must be comparable to or smaller than the difference in peak positions or the peaks will get smeared together. Your estimates of (a,b) = (0.4, 0.4) were much too large.

plot(x,y)
library(minpack.lm)
lorentzian <- function(par,x){ 
  a  <- par[1]
  b  <- par[2]
  p1 <- par[3]
  p2 <- par[4]
  h1 <- par[5]
  h2 <- par[6]
  drift.a <- par[7]
  drift.b <- par[8]
  h1/(pi*a*(1+((x-p1)/a)^2))+h2/(pi*b*(1+((x-p2)/b)^2)) + drift.a + drift.b*x
}

resid   <- function(par,obs,xx) {obs-lorentzian(par,xx)}
par=c(a=0.001,b=0.001, p1=2.157, p2=2.163, h1=1, h2=1, drift.a=0, drift.b=0)
lower=c(a=0,b=0,p1=0,p2=0,h1=0, h2=0,drift.a=NA,drift.b=NA)
nls.out <- nls.lm(par=par, lower=lower, fn=resid, obs=y, xx=x, 
                  control = nls.lm.control(maxiter=500))
coef(nls.out)
#             a             b            p1            p2            h1            h2       drift.a       drift.b 
#  1.679632e-03  1.879690e-03  2.156308e+00  2.163500e+00  4.318793e-02  8.199394e-02 -9.273083e+02  4.323897e+02 

lines(x,lorentzian(coef(nls.out), x), col=2, lwd=2)

One last thing: the convention on SO is to wait a day before "accepting" an answer. The reason is that questions with accepted answers rarely get additional attention - once you accept an answer, no one else will look at it.

agree that this is a better answer (@G.Grothendieck's is correct, but this one is more complete) — Ben Bolker, Jan 29 '14 at 22:02
Wow! I am shocked about your implementation! You're a genius! I spent too much time in this that I couldn't figure out how wrong I was in my code. I just want to point out that (2) how I was unable of making my function worked I removed peak positions variables out of the formula, trying to be faster(?) but unefficient. Also, (3) the data is previously treated with and special NMR software, so it might not have a sharp drift drawn. In this special case, I pick the data from a quintuplet, as I didn't want to use too many variables in SO. — user3032330, Jan 30 '14 at 00:20

G. Grothendieck · Answer 2 · 2014-01-30T14:27:30.443

2

Try using the reciprocals of a and b:

fm<-nls(y~1/(pi*(1/a)*(1+((x-2.15646)/(1/a))^2))+1/(pi*(1/b)*(1+((x-2.16355)/(1/b))^2)), 
   lower = 1, alg = "port",
   control = list(maxiter = 500), 
   start = list(a = 1/.4, b = 1/.4))

which gives:

> 1/coef(fm)
         a          b 
1.00000000 0.02366843

Unfortunately the model does not work very well as the graph at the bottom shows.

plot(y ~ x, pch = 20)
lines(x, fitted(fm), col = "red")

ADDED:

In his answer, @jlhoward provided a better fitting model based on 8 parameters. I will just point out that if we used his model with his starting values for a, b, p1 and p2 (we don't need starting values for the linear parameters if we specify alg = "plinear") then nls would work too.

fo <- y ~ cbind(1/(pi*a*(1+((x-p1)/a)^2)), 1/(pi*b*(1+((x-p2)/b)^2)), 1, x)
start <- c(a = 0.001, b = 0.001, p1 = 2.157, p2 = 2.163)
fm2 <- nls(fo, start = start, alg = "plinear")

giving:

> coef(fm2)
            a             b            p1            p2         .lin1 
 1.679635e-03  1.879682e-03  2.156308e+00  2.163500e+00  4.318798e-02 
        .lin2         .lin3        .lin.x 
 8.199364e-02 -9.273104e+02  4.323907e+02

Graph showing poor fit for fm:

enter image description here

REVISED to add constraints.

edited Jan 30 '14 at 14:27

answered Jan 29 '14 at 17:04

G. Grothendieck

254,981
17
203
341

Indeed, 'a' shouldn't be negative, but I guess it's because of the last points of the data corresponds to a third peak that is not defined in the function. Do you know if I can define any constraint when the 'nls' is trying to find the values of my parameters? – user3032330 Jan 29 '14 at 17:15
Optimization routines don't work well with flat functions and taking the reciprocals helps with that here. I have also added the constraints now but we still get a poor visual fit. – G. Grothendieck Jan 29 '14 at 17:18
Does 'lower=1' mean that the max value of 'a' or 'b' is 1, doesn't it? Or does it mean that 'a' and 'b' can only be positive? Tomorrow I will try to replicate your experiment in the whole dataset (it comprises 5 overlapped peaks) and posted here. I guess the result will be better if both sides of the spectrum are near 0. – user3032330 Jan 29 '14 at 17:28
When the reciprocal is taken the constraint interval `(0,1)` maps to the interval `(1, Inf)`. – G. Grothendieck Jan 29 '14 at 17:34

nls function does not perform well

2 Answers2

Linked