3

I want to estimate two breakpoints of a function with the next data:

    df = data.frame (x = 1:180,
                y = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 2, 2, 4, 2, 2, 3, 2, 1, 2,0, 1, 0, 1, 4, 0, 1, 2, 3, 1, 1, 1, 0, 2, 0, 3,  2, 1, 1, 1, 1, 5, 4, 2, 1, 0, 2, 1, 1, 2, 0, 0, 2, 2, 1, 1, 1, 0, 0, 0, 0, 
                    2, 3, 0, 3, 2, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
# plotting y ~ x 
plot(df)

enter image description here

I know that the function have two breakpoints such that:

y = y1 if x < b1;
y = y2 if b1 < x < b2;
y = y3 if b2 < x;

And I want to find b1 and b2 to fit a kind of rectangular function with the following form

enter image description here

Can anyone help me or point me in the right direction? Thanks!

jealcalat
  • 147
  • 9
  • If you could tell us where the data is coming from, then it is most likely there already is some domain specific R package for this problem. – zx8754 Sep 24 '18 at 13:53
  • 1
    @zx8754 It comes from behavioral experimental psychology in animals; y is response rate (responses/sec) as a function of time elapsed (x, in sec). I would like to use simple OLS. – jealcalat Sep 24 '18 at 14:41

2 Answers2

3

1) kmeans Try kmeans like this:

set.seed(123)
km <- kmeans(df, 3, nstart = 25)

> fitted(km, "classes") # or equivalently km$cluster
  [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[112] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

> unique(fitted(km, "centers")) # or except for order km$centers
      x         y
3  30.5 0.5166667
1  90.5 0.9000000
2 150.5 0.0000000

> # groups are x = 1-60, 61-120 and 121-180
> simplify2array(tapply(df$x, km$cluster, range))
       1   2  3
[1,]  61 121  1
[2,] 120 180 60

plot(df, col = km$cluster)
lines(fitted(km)[, "y"] ~ x, df)

screenshot

2) brute force Another approach is a brute force approach in which we calculate every possible pair of breakpoints and choose the pair whose sum of squares in a linear model is least.

grid <- subset(expand.grid(b1 = 1:180, b2 = 1:80), b1 < b2)

# the groups are [1, b1], (b1, b2], (b2, Inf)
fit <- function(b1, b2, x, y) {
   grp <- factor((x > b1) + (x > b2))
   lm(y ~ grp)
}

dv <- function(...) deviance(fit(...))

wx <- which.min(mapply(dv, grid$b1, grid$b2, MoreArgs = df))

grid[wx, ]
##       b1 b2
## 14264 44 80

plot(df)
lines(fitted(fit(grid$b1[wx], grid$b2[wx], x, y)) ~ x, df)

screenshot

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Interesting approach. Two qs: 1) How can I know how well the means are fitting the dataset? 2) Is this method sensitive to both the number of points with the same value as well as the values of the points? – jealcalat Sep 24 '18 at 17:16
  • (1) you can check the between sum of squares as a percentage of the total sum of squares. Just type `km` to see this. (2) it tries to minimize the sum of the squares of the residuals where the residual of a point is the distance to the center of its cluster. – G. Grothendieck Sep 24 '18 at 17:30
  • Have added a second (non-equivalent) approach. – G. Grothendieck Sep 24 '18 at 17:47
  • the second one seems more natural to me. Thanks! – jealcalat Sep 24 '18 at 22:58
  • About 2) Is there a reason why the range of b1 is wider than b2? – jealcalat Sep 27 '18 at 04:22
1

I can see that y is integer numbers, so maybe this is best estimated with a Poisson or Binomial model. Here is a solution using the R package mcp:

# Three intercept segments
model = list(
  y ~ 1,
  ~ 1,
  ~ 1
)

library(mcp)
fit = mcp(model, df, family = poisson(), par_x = "x", adapt = 2000)
plot(fit)

enter image description here

Notice that mcp is one of the only packages to estimate the uncertainty around the change point ant parameter estimates. The summary shows where the change point is estimated to be (cp_1 and cp_2) as well as the other parameters (on a log-scale since that's the default link function for Poisson models):

summary(fit)

Family: poisson(link = 'log')
Iterations: 9000 from 3 chains.
Segments:
  1: y ~ 1
  2: y ~ 1 ~ 1
  3: y ~ 1 ~ 1

Population-level parameters:
  name   mean lower  upper Rhat n.eff
  cp_1  39.57  37.8  45.00    1    54
  cp_2  99.82  99.0 101.21    1  2211
 int_1  -4.00  -6.5  -1.88    1   577
 int_2   0.32   0.1   0.54    1  6288
 int_3 -11.02 -20.9  -3.56    1  2487
Jonas Lindeløv
  • 5,442
  • 6
  • 31
  • 54