Difficulty fitting piecewise linear data in R

Question

I have the following data (cost of a product vs. time) that looks like the following:

annum <- c(1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 
    1914, 1915, 1916, 1917, 1918, 1919)
cost <- c(0.0000,  18.6140,  92.1278, 101.9393, 112.0808, 122.5521, 
    133.3532, 144.4843, 244.5052, 275.6068, 295.2592, 317.3145, 
    339.6527, 362.3537, 377.7775, 402.8443, 437.5539)

mydata <- as.data.frame(cbind(annum, cost))

g <- ggplot(mydata, aes(x = annum, y = cost))
g <- g + geom_point()
g <- g + scale_y_continuous(labels=scales::dollar_format())
g

This is the resulting plot of this data using this code The plot shows something that looks piecewise linear to me; there's a step from 1904 to 1905; then a clear line from 1905 to 1910; then a step; and then another line from 1911 to the end. (The first point (1903, 0) is fictitious.)

I've tried to use the segmented package to model this, but instead of picking something like 1904.5 and 1910.5 as breakpoints, it finds two points between 1911 and 1912.

I've tried some other techniques (e.g., "brute force" from "The R Book," and direct fitting), but I clearly don't understand this as much as I need to. Any help would be very much appreciated.

Ideally, I would end up with an equation for each segment and a single plot showing the piecewise fit and a confidence interval for the fit.

tpetzoldt · Accepted Answer · 2021-11-28T20:33:03.400

One can use package struccchange for this. Here a simplified code version:

library("strucchange")

startyear <- startyear
cost <- c(0.0000,  18.6140,  92.1278, 101.9393, 112.0808, 122.5521, 
          133.3532, 144.4843, 244.5052, 275.6068, 295.2592, 317.3145, 
          339.6527, 362.3537, 377.7775, 402.8443, 437.5539)

ts <- ts(cost, start=1903)
plot(ts)

## for small data sets you might consider to reduce segment length
bp <- breakpoints(ts ~ time(ts), data=ts, h = 5)

## BIC selection of breakpoints
plot(bp)
breakdates(bp)
fm1 <- lm(ts ~ time(ts) * breakfactor(bp), data=ts)
coef(fm1)

plot(ts, type="p")
lines(ts(fitted(fm1),  start = startyear),  col = 4)
lines(bp)
confint(bp)

lines(confint(bp))

More can be found in the package vignette or one of the related publications, e.g. https://doi.org/10.18637/jss.v007.i02 So it is for example possible to make significance tests, to estimate confidence intervals or to include covariates.

A segment length of 2 is not possible, because residual variance cannot be estimated. Similarly confidence intervals can only be estiated if segments are long enough. Therefore, only one breakpoint is shown below, while the excellent answer of @Rui Barradas omits confidence intervals but shows two breakpoints.

one breakpoint

Her an example without the first two points and an additional assumption to estimate the confidence interval in case of a small segment:

library("strucchange")

startyear <- 1905
cost <- c(92.1278, 101.9393, 112.0808, 122.5521, 
          133.3532, 144.4843, 244.5052, 275.6068, 295.2592, 317.3145, 
          339.6527, 362.3537, 377.7775, 402.8443, 437.5539)

ts <- ts(cost, start=startyear)
bp <- breakpoints(ts ~ time(ts), data=ts, h = 5)
fm1 <- lm(ts ~ time(ts) * breakfactor(bp), data=ts)
plot(ts, type="p")
lines(ts(fitted(fm1),  start = startyear),  col = 4)
lines(confint(bp, het.err=FALSE))

first two values omitted

Edit:

bugs of the original version corrected
coefficients and confidence interval added
images added
example with omitted first 2 values added

When I delete the first two points (the first of which is fictitious), the fit fails for the early segment (the slope is incorrect). Any thoughts on this? — Karl Wolfschtagg, Nov 28 '21 at 17:01
Thanks for the comment. The original version contained 2 bugs. The most important was to use `*` in the `lm` model formula, the other one a mixture between the time series object and the original vectors. — tpetzoldt, Nov 28 '21 at 20:02

Rui Barradas · Answer 2 · 2021-11-28T18:29:28.833

Here is another solution with package strucchange but without creating a time series first.

library(strucchange)

# first get a segment size as a fraction 
# of the number of observations
n <- nrow(mydata)
segmts <- 3
h <- (segmts + 1)/n

# now estimate the breakpoints
b <- breakpoints(cost ~ annum, h = h, breaks = (segmts - 1L), data = mydata)
bp <- mydata[b$breakpoints, "annum"]

# create a grouping variable for `ggplot`
# each group is a segment
bp <- c(bp, Inf)
mydata$grp <- findInterval(mydata$annum, bp, left.open = TRUE)

# plot the linear regressions
g + geom_smooth(
  mapping = aes(group = grp),
  method = "lm",
  formula = y ~ x,
  se = FALSE
)

If the first data points are removed, there will be only two segments but the code above will still work.

mydata <- mydata[-(1:2), ]
n <- nrow(mydata)
segmts <- 2
h <- (segmts + 1)/n
b <- breakpoints(cost ~ annum, h = h, breaks = segmts - 1L, data = mydata)
bp <- mydata[b$breakpoints, "annum"]
bp <- c(bp, Inf)
mydata$grp <- findInterval(mydata$annum, bp, left.open = TRUE)
mydata$grp <- factor(mydata$grp)

g + geom_smooth(
  mapping = aes(group = grp),
  method = "lm",
  formula = y ~ x,
  se = FALSE
)

If I cut off the first two points (the first of which is fictitious anyway), this leaves just two segments. When I run this method, setting `segmnts <- 2`, I get the error: `Error in breakpoints.formula(Total ~ Year, h = h, breaks = (segmts - 1L), : minimum segment size must be greater than the number of regressors.` Any thoughts? — Karl Wolfschtagg, Nov 28 '21 at 16:25
Try `h <- (segmts + 1)/n`. It worked without the first 2 points. I will edit my answer. — Rui Barradas, Nov 28 '21 at 18:26

score 1 · Answer 3 · answered Nov 29 '21 at 19:41

Confidence intervals for change point problems is a hard problem for frequentist methods, such as strucchange. Often, you simply get confidence intervals for each segment, i.e., hard breaks between segments rather than smooth transitions.

It's more straightforward using Bayesian methods. Here is a solution using the mcp package. Just to show off, we plot both the fitted interval and (dashed red lines) and the prediction interval (dashed green lines). The gray lines are random draws from the posterior distribution and the densities on the x-axis are the posteriors for the change point locations.

data = data.frame(
  annum = 1903:1919,
  cost = c(0.0000,  18.6140,  92.1278, 101.9393, 112.0808, 122.5521, 
          133.3532, 144.4843, 244.5052, 275.6068, 295.2592, 317.3145, 
          339.6527, 362.3537, 377.7775, 402.8443, 437.5539)
)

# Model as three disjoined slopes
model = list(
  cost ~ 1 + annum,
  ~ 1 + annum,
  ~ 1 + annum
)

library(mcp)
fit = mcp(model, data)
plot(fit, q_fit = TRUE, q_predict = TRUE)

If you're interested in the parameter estimates for the change points and the segments, just call summary(fit):

        name    mean  lower    upper Rhat n.eff
     annum_1   -0.11   -0.2 -6.6e-04  2.5    25
     annum_2   10.36    7.4  1.3e+01  1.0   609
     annum_3   22.74   21.2  2.4e+01  1.0   264
        cp_1 1904.50 1904.0  1.9e+03  2.5    24
        cp_2 1910.46 1910.0  1.9e+03  1.0   778
 Intercept_1  221.39   10.8  3.9e+02  1.0   948
 Intercept_2   86.77   75.0  9.8e+01  1.0  1297
 Intercept_3  236.03  221.7  2.5e+02  1.0   237
     sigma_1    5.97    3.6  8.9e+00  1.0  1709

This looks like an interesting approach, but for whatever reason, I cannot reproduce what you've done. I get a huge JAGS error that starts with: `Error : .onLoad failed in loadNamespace() for 'rjags', details: call: dyn.load(file, DLLpath = DLLpath, ...) error: unable to load shared object '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rjags/libs/rjags.so': I went to the lindeloev website to try to fix it, but no joy. Any thoughts? — Karl Wolfschtagg, Nov 30 '21 at 03:29
Did you install JAGS? https://sourceforge.net/projects/mcmc-jags/files/JAGS/4.x/ Googling your error message, this has been the issue for others: https://gist.github.com/casallas/8411082. JAGS is the software used for MCMC sampling. — Jonas Lindeløv, Nov 30 '21 at 07:49
Excellent! I know this is off-topic, but how would you plot this using ggplot? — Karl Wolfschtagg, Nov 30 '21 at 17:13
The plot is a ggplot, so you can do `plot(fit) + labs(title = "This is the title")`. To do it from scratch, there's an example here: https://lindeloev.github.io/mcp/articles/predict.html#forecasting-with-future-change-points. Basically, just use `fitted(fit)` or `fitted(fit, summary = FALSE)` as data, and I'm sure you'll figure it out :-) — Jonas Lindeløv, Dec 01 '21 at 13:42

Difficulty fitting piecewise linear data in R

3 Answers3