0

I have some time series data (x, y) which is consistent with a linear model. I would like to construct a regression formula in which one of the coefficients corresponds to the y-value at a specific moment in time (for example, Jan 1, 2013), rather than the y-intercept at x=0 (which would correspond to Jan 1, year 0 A.D., so far in the past as to be not of interest).

Using the nls(), solver, I am able to construct the type of model that I want, and using simulated test data, I am able to get reasonable results out of it (meaning that the fit results approximately reproduce the "truth" values that I used to generate the artificial sample data). Using the lm() solver, I am also able to construct a linear model, however it defaults to providing me with a fitted y-intercept at a time value that I don't want; i.e., Jan 1, 0 A.D. When I attempt to give the lm() solver the same formula that works with the nls() solver, it returns an invalid model formula error:

# Force warnings to be printed as they are generated
options(warn=1)

# -------------------
# Define Linear Model
# -------------------

# Express the same linear model three ways:
#   1.) As a function (needed for constructing test data)
#   2.) As a formula appropriate for nls()
#   3.) As a formula appropriate for lm() [has 2013 offset removed]
# In first two versions, physical interpretations of model coefficients are:
#   sv:    starting value on Jan 1, 2013
#   slope: annual rate of linear increase
linear_func <- function(year, sv, slope) {
  sv + slope * (year-2013)
}
linear_form_offset <- (value ~ sv + slope * (year-2013))
linear_form_nooffset <- (value ~ year)

# -------------------
# Construct Test Data
# -------------------

sv_true <- 5000
slope_true <- 1500
year <- c(2013.5, 2014.5, 2015.5, 2016.5, 2017.5, 2018.5, 2019.5)
# Use truth values, and add some Gaussian noise
value <- linear_func(year, sv_true, slope_true) + rnorm(length(year), sd=100)
dftest <- data.frame(year, value)

# ------------------
# Obtain Fit Results
# ------------------

# nls solver requires approximate starting values, somewhere near the local
# vicinity of the final optimized values.
print("Now running nls (with offset)")
initcoef <- c(sv=3000, slope=1000)
fitresult <-  nls(formula=linear_form_offset, data=dftest, start=initcoef)
print(coef(fitresult))

# lm solver, by contrast, has no concept of starting values, so omit them here
print("Now running lm (no offset)")
fitresult <- lm(formula=linear_form_nooffset, data=dftest)
print(coef(fitresult))

# lm solver using the offset formula that I would actually like to use --
# this results in an invalid model formula error.
print("Now running lm (with offset)")
fitresult <- lm(formula=linear_form_offset, data=dftest)
print(coef(fitresult))

When I run this example, I obtain the following as a typical result:

source("test_fit.R")
[1] "Now running nls (with offset)"
      sv    slope 
5002.463 1518.854 
[1] "Now running lm (no offset)"
 (Intercept)         year 
-3052450.171     1518.854 
[1] "Now running lm (with offset)"
Error in terms.formula(formula, data = data) : 
  invalid model formula in ExtractVars
>

Question 1 (simple): I am aware that, given a y-intercept at Jan 1, 0 A.D. and slope, I could easily calculate a corresponding y-value at Jan 1, 2013. However, for various reasons, I would prefer not to do that. I'd like to just construct the actual regression model that I really want, and solve that model natively using lm(). Is there any way (e.g., a preferred alternative syntax) to do that?

Question 2 (deeper): What is actually going on here, anyway? My naive understanding of formulas is that they are a specific object type in R--a formula should stand on its own as either an intrinsically valid or invalid construction, without external reference to the solver algorithm that is attempting to regress against it. But here the formula's validity seems to depend entirely upon which solver is actually using it. Why is this? And if the rules for constructing valid formulas are different for lm() vs. nls(), where are those rules written down, so that I can avoid running afoul of them next time?

stachyra
  • 4,423
  • 4
  • 20
  • 34
  • 2
    Can you cut this down to something minimal. There is way too much code relative to what is needed to express this. – G. Grothendieck Nov 27 '19 at 21:55
  • 1
    try `value ~ sv + slope * I(year-2013)` for your offset formula? The `-` operator has a different meaning in formula contexts *when used with linear machinery* (`lm`, `lme`, `gls`, etc.) (in particular on the right side of a formula), if you want to use the `^`, `+`, `-`, `/` operators with their usual arithmetic meaning you need to protect them with `I()` – Ben Bolker Nov 27 '19 at 22:11
  • @G.Grothendieck: O.K., I accept your suggestion, and I've slimmed down my example a bit. Do you have any idea how to do what I want, but without the error? – stachyra Nov 28 '19 at 04:25
  • @BenBolker: O.K., I tried enclosing `year-2013` in `I()`, as per your suggestion, and now the same line generates a different error: `Error in eval(predvars, data, env) : object 'sv' not found`. Do I need more `I()` enclosures, around other terms in the formula, or something different? – stachyra Nov 28 '19 at 04:29
  • @BenBolker: after a little additional investigation, I discovered that modifying the other formula (the one that I had labeled `linear_form_noffset`) so that it reads `value ~ I(year-2013)` results in a formula that can be interpreted by `lm()` with no errors, and also returns an "Intercept" value corresponding to the correct time point, Jan 1, 2013. This is much closer to what I had wanted. However, it would still be nice to understand how to construct a single unified version of the `linear_form_offset` formula which works identically inside either `nls()` or `lm()`. – stachyra Nov 28 '19 at 05:11
  • The problem is that the question is assuming that formulas have or ought to have a meaning that is independent of the function using them; however, that is not the case. Each function can define their meaning independently of any other function. `lm` and `nls` simply do not interpret the formula passed to them in the same way. – G. Grothendieck Nov 29 '19 at 15:17
  • @G.Grothendieck: can you cite a source that explains these differences in interpretation, and why they exist? Neither [`lm`](https://www.rdocumentation.org/packages/stats/versions/3.6.1/topics/lm) nor [`nls`](https://www.rdocumentation.org/packages/stats/versions/3.6.1/topics/nls) mentions any special idiosyncrasies in its approach to formula interpretation, and moreover the [`formula`](https://www.rdocumentation.org/packages/stats/versions/3.6.1/topics/formula) docs make it seem like a well-defined representation of an equation, which normally does not leave much room for interpretation. – stachyra Dec 01 '19 at 02:12
  • Both `?lm` and `?nls` do have references. – G. Grothendieck Dec 01 '19 at 12:23

0 Answers0