Assume I have data (t, y) for which I expect a linear dependency y(t). Furthermore, each observation carries attributes par1, par2, par3. Is there an algorithm or technique to decide whether one, two, or all three of these parameters are relevant for the fit? I tried leaps::regsubsets(y ~ t + par1 + par2 + par3, data = mydata, nbest = 10) but was not able to extract the formula for the best fit.
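For reference, a minimal sketch (using the mydata defined below) of how the selected terms can at least be read off via summary() on the regsubsets object; it still does not give me the per-group fits I am after:

library(leaps)
fit_sub <- regsubsets(y ~ t + par1 + par2 + par3, data = mydata, nbest = 10)
s <- summary(fit_sub)
best <- which.min(s$bic)           # best stored model by BIC
colnames(s$which)[s$which[best, ]] # terms included in that model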

If plotted, the final result should look like this (for the data, see below):

[Plot of the desired result: three groups of points, each with its own fitted line ("Fit for three different groups").]
Thus, I want the following information:

  • Adding par1 and par2 gives the best fit.
  • The models are y_i = a_i * t_i + b_i with given a_i and b_i.

Reproducible example:

t <- seq(0, 10, length.out = 1000) # large pool of t values to sample from
# Create 3 linear equations of the form y_i = a*t_i + b
a <- c(1, 0.3, 0.2) # slope
b <- c(-0.5, 0.5, 0.1) # offset

# create t_i (stored in d), the noisy observations y and the noise-free line y_t
d <- list()
y <- list()
y_t <- list()
for (i in 1:3) {
  set.seed(33*i)
  d[[i]] <- sort(sample(t, 50, replace = FALSE))
  set.seed(33*i)
  noise <- rnorm(10) # note: this length-10 vector is recycled over the 50 points
  y[[i]] <- a[i]*d[[i]] + b[i] + noise
  y_t[[i]] <- a[i]*d[[i]] + b[i]
}
# Final data set
df1 <- data.frame(t = d[[1]], y = y[[1]], par1 = 1, par2 = 10,
                  par3 = sample(c(100, 200, 300), length(d[[1]]), replace = TRUE))
df2 <- data.frame(t = d[[2]], y = y[[2]], par1 = 2, par2 = 20,
                  par3 = sample(c(100, 200, 300), length(d[[2]]), replace = TRUE))
df3 <- data.frame(t = d[[3]], y = y[[3]], par1 = 2, par2 = 30,
                  par3 = sample(c(100, 200, 300), length(d[[3]]), replace = TRUE))
mydata <- rbind(df1, df2, df3)
mydata <- mydata[sample(nrow(mydata)), ] # shuffle the rows

# That is what the data is looking like:
plot(mydata$t, mydata$y)

# This is the result I am looking for (ideally):
plot(d[[1]], y[[1]], col = "black", xlim = c(0, 10), ylim = c(-2, 10), xlab = "t", ylab = "y",
     main = "Fit for three different groups")
points(d[[2]], y[[2]], col = "red")
points(d[[3]], y[[3]], col = "blue")
lines(d[[1]], y_t[[1]],col = "black")
lines(d[[2]], y_t[[2]], col = "red")
lines(d[[3]], y_t[[3]], col = "blue")

Comment and question on @Roland's answer:

I understand that with the given three parameters there are 2*3*3 = 18 possible combinations of factor levels. But I would expect only 2^3 = 8 relevant groups, because for each parameter the choice is simply "include it or not". To me it does not make sense to include only level x of parameter y. (The snippet below makes this counting concrete.)
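An illustrative snippet of the counting, not part of the analysis:

# 2*3*3 = 18 possible combinations of the factor levels ...
nrow(expand.grid(par1 = c(1, 2), par2 = c(10, 20, 30), par3 = c(100, 200, 300)))
# [1] 18
# ... but only 2^3 = 8 ways to include or exclude whole parameters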

I tried the following:

# g and pred.1se are the columns created in @Roland's answer below
g <- 0
t_lin1 <- mydata$t[mydata$g == g] # points with group code 0
y_lin1 <- mydata$y[mydata$g == g]
plot(mydata$t, mydata$y)
points(t_lin1, y_lin1, col = "red")
abline(lm(y_lin1 ~ t_lin1), col = "red") # separate lm fit to that group
points(pred.1se ~ t, data = mydata, col = as.integer(mydata$g), pch = 16)

and realized that the fit is off. In hindsight this is clear, because

  • I include the wrong factor levels (most likely parameter 3 is not relevant)
  • and thus select the wrong data for the fit.

So my last questions are:

  • Where can I find the relevant groups included in the best model and
  • what are the corresponding fit parameters from the regression?

Sorry if this is obvious, but to me it is a mystery.

Christoph
  • Please provide a reproducible example without `eval(parse())`. Either write proper R code or use the output of `dput` to share the result. As a principle, I don't run `eval(parse())` code from the internet on my machine. – Roland Nov 23 '16 at 13:42
  • @Roland Is there a reason why you don't run `eval(parse())` code from the internet? I know it is slow (and bad style), but in the above case the script runs in less than a second. If you have an elegant suggestion for how to get the test data, I would be glad; just edit the question. – Christoph Nov 23 '16 at 13:50
  • It's a security risk. I'm not inclined to study such code in detail in order to be sure that it can safely be run. The usual advice applies here: instead of creating objects `a1`, `a2`, ..., use an object (vector, list, matrix) that you can subset as `a[1]`, `a[2]`, ... – Roland Nov 23 '16 at 14:40
  • Ok. I `dput` the data. See my edit... – Christoph Nov 23 '16 at 14:43

1 Answer

The LASSO can come pretty close (although it still identifies too many effects):

#I assume these are supposed to be factors:
mydata$par1 <- factor(mydata$par1)
mydata$par2 <- factor(mydata$par2)
mydata$par3 <- factor(mydata$par3)

#create model matrix, remove intercept since glmnet adds it
x <- model.matrix(y ~ (par1 * par2 * par3) * t, data = mydata)[,-1]

#cross-validated LASSO
library(glmnet)
set.seed(42)
fit <- cv.glmnet(x, mydata$y, intercept = TRUE, nfolds = 10, alpha = 1)
plot(fit)

[Plot: cross-validation curve from plot(fit).]
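For reference, the two conventional lambda choices from the cross-validation are stored on the fit object:

fit$lambda.min # lambda minimizing the cross-validated error
fit$lambda.1se # largest lambda with error within 1 SE of that minimum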

# nonzero coefficients at the 1-SE lambda
coef <- as.matrix(coef(fit, s = "lambda.1se"))
coef[coef != 0,]
#(Intercept)      par230           t     par12:t    par230:t   par3300:t 
# 0.47542479 -0.27612966  0.75497711 -0.42493030 -0.15044371  0.03033057

#The groups:
mydata$g <- factor((mydata$par2 == 30) + 10 * (mydata$par1 == 2) + 100 * (mydata$par3 == 300))
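Not all 18 level combinations are present in the data; a quick check (illustrative, not part of the original answer) of the group codes that actually occur:

# group codes actually occurring in the data
table(mydata$g)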



mydata$pred.1se <- predict(fit, newx = x, s = "lambda.1se") # fitted values at the 1-SE lambda

library(ggplot2)
ggplot(mydata, aes(x = t, color = g)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = pred.1se))

[Plot: data points colored by group g with the predicted lines.]

You can then calculate the desired intercepts and slopes from the coefficients.
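For example, a sketch (coefficient names as printed above) for the group with par1 == 2, par2 == 30, par3 == 300:

# sum the coefficients that are active for this group
cf <- coef[coef != 0, ]
intercept <- cf["(Intercept)"] + cf["par230"]
slope <- cf["t"] + cf["par12:t"] + cf["par230:t"] + cf["par3300:t"]
unname(c(intercept, slope))
# approx. 0.199 and 0.210; the true values of this group are b = 0.1, a = 0.2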

Roland
  • `predict` here is the function `glmnet::predict.cv.glmnet`? If I understand correctly, predict gives the fit (in some way I don't understand) for all possible models? Is it obvious why one model is missing (3 parameters should yield 7 models)? Sorry for so many questions... – Christoph Nov 23 '16 at 16:04
  • Yes it is. Note how I set `s = "lambda.1se"`, i.e., ask for predictions for the "largest value of lambda such that error is within 1 standard error of the minimum" using the coefficients shown above. – Roland Nov 23 '16 at 16:09
  • I don't understand "Is it obvious why one model is missing (3 pars yield 7 models)?" – Roland Nov 23 '16 at 16:09
  • There are the following possibilities (let's call the parameters 1, 2 and 3): the combinations 1, 2, 3, 12, 13, 23, 123 are relevant => seven possibilities (I excluded "none of the parameters is significant" as it is impossible in my case). In general there are 2^(number of parameters) possibilities. – Christoph Nov 24 '16 at 08:21
  • But not all of these exist in your data. (There are 2*3*3 = 18 possible combinations of the factor levels of your parameters.) – Roland Nov 24 '16 at 08:39
  • Please see the new section at the end of my question. I hope that makes it clear. And thanks for the support! – Christoph Nov 24 '16 at 09:08
  • I think there is a lack of knowledge here. This is really not the format to teach you. First, you should be aware that you probably won't be able to recreate your expected result exactly, since there is uncertainty in your data. Also, I never recommended fitting a separate `lm` to the groups. Use the result of the LASSO, i.e., `abline(coef["(Intercept)",], coef["t",], col = "red")` – Roland Nov 24 '16 at 09:23