
I have used `predict` to draw a fitted line for a linear model (`lm`) I created. Because the model is built on only 2 data points and needs to have a positive slope, I have forced it through the origin (0, 0). I have also weighted the fit by the number of observations underlying each data point.

Question 1: (SOLVED -see comment by @Gregor) Why does the predicted line lie so much closer to my second data point (B) than my first data point (A), when B has fewer underlying observations? Did I code something wrong here when weighting the model?

Question 2: I am now plotting a GLM (link = logit), but how can I still force it through (0, 0)? I've tried adding `formula = y ~ 0 + x` in several places, none of which seem to work.

M <- data.frame("rate" = c(0.4643,0.2143), "conc" = c(300,6000), "nr_dead" = c(13,3), "nr_surv" = c(15,11), "region" = c("A","B"))
M$tot_obsv <- (M$nr_dead+M$nr_surv)
M_conc <- M$conc
M_rate <- M$rate
M_tot_obsv <- M$tot_obsv
#**linear model of data, force 0,0 intercept, weighted by nr. of observations of each data point.**
M_lm <- lm(data = M, rate~0+conc, weights = tot_obsv)
#**plot line using "predict" function**
x_conc <- c(0, 7000)  # start at 0 so the forced origin is visible
y_rate <- predict(M_lm, newdata = list(conc = x_conc))  # weights affect the fit, not the prediction
plot(x = M$conc, y = M$rate, pch = 16, ylim = c(0, 0.5), xlim = c(0,7000), xlab = "conc", ylab = "death rate")
lines(x_conc, y_rate, col = "red", lwd = 2)
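To see Question 1 numerically, one can compare the weighted sums of squared residuals of the two candidate through-origin lines directly; the weighted least-squares slope lands almost on top of the line through B (a quick check using the same data as above):

```r
# Recreate the two data points and weights from above
conc <- c(300, 6000)
rate <- c(0.4643, 0.2143)
w    <- c(28, 14)             # tot_obsv = nr_dead + nr_surv

# Candidate slopes: through-origin line hitting point A exactly vs. point B exactly
slope_A <- rate[1] / conc[1]
slope_B <- rate[2] / conc[2]

# Weighted sum of squared residuals (what weighted lm minimizes) for each line
wss <- function(slope) sum(w * (rate - slope * conc)^2)
wss(slope_A)   # huge: the line through A overshoots B badly
wss(slope_B)   # small: the line through B barely misses A

# The fitted slope is therefore pulled almost onto slope_B
fitted_slope <- coef(lm(rate ~ 0 + conc, weights = w))
fitted_slope
```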

#**EDIT 1:**

M_glm <- glm(cbind(nr_dead, nr_surv) ~ (0+conc), data = M, family = "binomial")

#*plot using 'predict' function*
library(tidyverse)  # tibble, dplyr, ggplot2

binomial_smooth <- function(formula = y ~ 0 + x, ...) {
    geom_smooth(method = "glm", method.args = list(family = "binomial"),
                formula = formula, ...)
}
tibble(x_conc = c(seq(300, 7000, 1), M$conc),
       y_rate = predict.glm(M_glm, list(conc = x_conc), type = "response")) %>%
    left_join(M, by = c("x_conc" = "conc")) %>%
    ggplot(aes(x = x_conc, y = y_rate)) +
    xlab("concentration") + ylab("death rate") +
    geom_point(aes(y = rate, size = tot_obsv)) +
    binomial_smooth(formula = y ~ 0 + x) +
    theme_bw()
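Note, regarding Question 2: with a logit link, dropping the intercept pins the *linear predictor*, not the probability, to zero at conc = 0. Since `plogis(0)` is 0.5, the intercept-free curve passes through (0, 0.5) and can never pass through (0, 0) on the response scale, no matter where `y ~ 0 + x` is supplied. A quick check against the model above:

```r
# Refit the intercept-free binomial GLM from above
M <- data.frame(conc = c(300, 6000), nr_dead = c(13, 3), nr_surv = c(15, 11))
M_glm <- glm(cbind(nr_dead, nr_surv) ~ 0 + conc, data = M, family = "binomial")

# At conc = 0 the linear predictor is 0 * coef = 0, so the
# predicted death rate is plogis(0) = 0.5, not 0:
p_at_0 <- predict(M_glm, newdata = data.frame(conc = 0), type = "response")
p_at_0  # -> 0.5
```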
  • You artificially anchor your line through (0, 0). Your second point is 20 times farther from 0 than your first point, so even though it has half as many observations it has loads of leverage. – Gregor Thomas Apr 19 '19 at 22:00
  • Let's say you fit a line from the origin straight through the first point, (300, 0.4643). Extrapolating out to 6000, the y value would be about 9.29, with a residual of 9.29 - 0.2143 ≈ 9.07. The squared residual (what `lm` is minimizing) is about 82.3. On the other hand, if we fit the line straight from (0, 0) to the second point at (6000, 0.2143), the predicted value for the first point would be about 0.011, for a residual of 0.4643 - 0.011 ≈ 0.45. The squared residual is about 0.21. – Gregor Thomas Apr 19 '19 at 22:07
  • The weights essentially multiply the squared residuals. Since there's a 2:1 weight ratio (28 vs. 14 observations), the 0.21 becomes 0.41, while the 82.3 stays as is and absolutely dominates---it's roughly 200 times greater! So of course the best fit line is going to get pulled a lot closer to the 6000 point. – Gregor Thomas Apr 19 '19 at 22:09
  • Judging from your column names, you're modeling rates. So instead let me suggest a `glm`. Use logistic regression (GLM with logit link) to model the proportions, or use Poisson regression (GLM with log link, using the number of observations as an offset). That way you won't be extrapolating straight lines way out of the range of possibility. – Gregor Thomas Apr 19 '19 at 22:12
  • Thanks for your help! I've tried the glm function (see edited code), but now I can't figure out how to force it through 0,0. I tried adding the `formula = y~0+x` in several places, and none seem to work. Any thoughts on what I'm missing? – Lumi Apr 20 '19 at 06:44
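Gregor's second suggestion from the comments (a Poisson GLM with log link, using the number of observations as an offset) can be sketched as follows; the exact model specification is an assumption based on his description:

```r
# Poisson regression of death counts with log(tot_obsv) as an offset,
# so the model describes the death *rate* per observation. The log
# link keeps predicted rates strictly positive, unlike the straight
# line from lm.
M <- data.frame(conc = c(300, 6000), nr_dead = c(13, 3), nr_surv = c(15, 11))
M$tot_obsv <- M$nr_dead + M$nr_surv

M_pois <- glm(nr_dead ~ conc + offset(log(tot_obsv)),
              data = M, family = "poisson")

# Predicted death rate per single observation at conc = 1000
# (tot_obsv = 1 makes the offset log(1) = 0):
p_1000 <- predict(M_pois, newdata = data.frame(conc = 1000, tot_obsv = 1),
                  type = "response")
p_1000
```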
