I have some time series data (x, y) which is consistent with a linear model. I would like to construct a regression formula in which one of the coefficients corresponds to the y-value at a specific moment in time (for example, Jan 1, 2013), rather than the y-intercept at x=0 (which would correspond to Jan 1, year 0 A.D., so far in the past as to be not of interest).
Using the nls()
, solver, I am able to construct the type of model that I want, and using simulated test data, I am able to get reasonable results out of it (meaning that the fit results approximately reproduce the "truth" values that I used to generate the artificial sample data). Using the lm()
solver, I am also able to construct a linear model, however it defaults to providing me with a fitted y-intercept at a time value that I don't want; i.e., Jan 1, 0 A.D. When I attempt to give the lm()
solver the same formula that works with the nls()
solver, it returns an invalid model formula
error:
# Force warnings to be printed as they are generated
options(warn=1)
# -------------------
# Define Linear Model
# -------------------
# Express the same linear model three ways:
# 1.) As a function (needed for constructing test data)
# 2.) As a formula appropriate for nls()
# 3.) As a formula appropriate for lm() [has 2013 offset removed]
# In first two versions, physical interpretations of model coefficients are:
# sv: starting value on Jan 1, 2013
# slope: annual rate of linear increase
linear_func <- function(year, sv, slope) {
sv + slope * (year-2013)
}
linear_form_offset <- (value ~ sv + slope * (year-2013))
linear_form_nooffset <- (value ~ year)
# -------------------
# Construct Test Data
# -------------------
sv_true <- 5000
slope_true <- 1500
year <- c(2013.5, 2014.5, 2015.5, 2016.5, 2017.5, 2018.5, 2019.5)
# Use truth values, and add some Gaussian noise
value <- linear_func(year, sv_true, slope_true) + rnorm(length(year), sd=100)
dftest <- data.frame(year, value)
# ------------------
# Obtain Fit Results
# ------------------
# nls solver requires approximate starting values, somewhere near the local
# vicinity of the final optimized values.
print("Now running nls (with offset)")
initcoef <- c(sv=3000, slope=1000)
fitresult <- nls(formula=linear_form_offset, data=dftest, start=initcoef)
print(coef(fitresult))
# lm solver, by contrast, has no concept of starting values, so omit them here
print("Now running lm (no offset)")
fitresult <- lm(formula=linear_form_nooffset, data=dftest)
print(coef(fitresult))
# lm solver using the offset formula that I would actually like to use --
# this results in an invalid model formula error.
print("Now running lm (with offset)")
fitresult <- lm(formula=linear_form_offset, data=dftest)
print(coef(fitresult))
When I run this example, I obtain the following as a typical result:
source("test_fit.R")
[1] "Now running nls (with offset)"
sv slope
5002.463 1518.854
[1] "Now running lm (no offset)"
(Intercept) year
-3052450.171 1518.854
[1] "Now running lm (with offset)"
Error in terms.formula(formula, data = data) :
invalid model formula in ExtractVars
>
Question 1 (simple): I am aware that, given a y-intercept at Jan 1, 0 A.D. and slope, I could easily calculate a corresponding y-value at Jan 1, 2013. However, for various reasons, I would prefer not to do that. I'd like to just construct the actual regression model that I really want, and solve that model natively using lm()
. Is there any way (e.g., a preferred alternative syntax) to do that?
Question 2 (deeper): What is actually going on here, anyway? My naive understanding of formulas is that they are a specific object type in R--a formula should stand on its own as either an intrinsically valid or invalid construction, without external reference to the solver algorithm that is attempting to regress against it. But here the formula's validity seems to depend entirely upon which solver is actually using it. Why is this? And if the rules for constructing valid formulas are different for lm()
vs. nls()
, where are those rules written down, so that I can avoid running afoul of them next time?