Fitting linear model on log transformed data where n% of the data is below the line

Question

I want to fit a model to data that is assumed to follow the form y = alpha*x^beta. My data looks like this:

And can be reproduced with this dput:

structure(list(y = c(15.8999997973442, 34.4999990463257, 60.0000017285347, 
234.099998548627, 15.3000003099442, 89.8999990224838, 30, 28.9999990463257, 
370.600006774068, 80.2999995946884, 91.3000009059906, 39.9000015258789, 
71.0999984741211, 6.20000004768372, 234.099998548627, 8.99999995529652, 
38.0000007152557, 17.5000001490116, 29.400000333786, 125.399999916553, 
4.80000007152557, 0.899999976158142, 40.0999994277954, 2.5, 45.8000001907349, 
0.899999976158142, 133.599999904633, 6.09999990463257, 70.7999984622002, 
17.5, 38.2999992370605, 33.4000001698732, 44.3000001907349, 45.8000001907349, 
0.800000011920929, 90.7999993562698, 29.5, 0.5, 130.800000190735, 
195.300004005432, 0.300000011920929, 27.8999991416931, 3.70000004768372, 
1, 4.79999995231628, 14.4999996423721, 46.599998831749, 3.3999999165535, 
7.40000009536743, 370.600006774068, 18.5, 37.6999998092651, 24.800000667572, 
34.9000000953674, 89.8999990224838, 92.7000005245209, 13.1999998092651, 
21.400000333786, 110.799999713898, 0.699999988079071, 44.3999996185303, 
20.8999996185303, 73.0000009536743, 86.5000005364418, 101.599999248981, 
32.3000005036592, 4.1000000834465, 167.699998855591, 65.4999992847443, 
15.0999998152256, 0.200000002980232, 30.0999995470047, 30.5, 
37.6999995708466, 92.7999982833862, 33.4000001698732, 83.5999986678362, 
24.7000007629395, 127.699999332428, 25, 27.8000001907349, 29.6999999582767, 
62.800000667572, 0.300000011920929, 37.9999990463257, 1, 9.10000009834766, 
33.8000000119209, 40.0999994277954, 15.5000000298023, 292.299997776747, 
15.9999995231628, 33.4000001698732, 0.899999976158142, 68.3000026345253, 
28, 30.3999996185303, 20, 30.3999996185303, 5), x = c(3L, 2L, 
6L, 22L, 4L, 6L, 2L, 2L, 13L, 7L, 5L, 1L, 2L, 3L, 22L, 3L, 2L, 
3L, 3L, 9L, 2L, 1L, 2L, 1L, 2L, 1L, 6L, 2L, 2L, 1L, 1L, 7L, 2L, 
2L, 1L, 11L, 1L, 1L, 5L, 4L, 1L, 3L, 1L, 1L, 2L, 2L, 3L, 2L, 
1L, 13L, 2L, 5L, 2L, 2L, 6L, 8L, 1L, 4L, 5L, 1L, 3L, 1L, 5L, 
8L, 3L, 7L, 2L, 7L, 3L, 2L, 1L, 5L, 1L, 4L, 5L, 7L, 3L, 1L, 5L, 
1L, 2L, 5L, 4L, 1L, 3L, 1L, 3L, 2L, 2L, 6L, 16L, 4L, 7L, 1L, 
6L, 2L, 2L, 1L, 2L, 1L)), row.names = c("494", "7", "476", "478", 
"462", "68", "357", "397", "105", "216", "53", "248", "366", 
"338", "478.1", "190", "119", "147", "371", "418", "231", "208", 
"19", "337", "408", "90", "44", "488", "435", "13", "249", "434", 
"419", "408.1", "209", "120", "47", "526", "82", "84", "3", "1", 
"485", "278", "15", "414", "467", "459", "137", "105.1", "425", 
"492", "532", "170", "68.1", "429", "347", "491", "29", "215", 
"151", "316", "352", "116", "465", "237", "376", "513", "472", 
"186", "453", "504", "157", "261", "403", "434.1", "469", "333", 
"83", "417", "301", "242", "46", "234", "487", "278.1", "134", 
"183", "19.1", "288", "98", "411", "434.2", "117", "375", "5", 
"356", "313", "356.1", "359"), class = "data.frame")

I know there are many (really good!!) answers on similar questions like:

https://stats.stackexchange.com/questions/61747/linear-vs-nonlinear-regression?rq=1

Fitting logarithmic curve in R, or

Exponential curve fitting in R

However, I cannot get my head around it for some reason.

What I thought about doing is the following. I want to fit a linear model in the log-transformed space of both variables, because a linear model in log-log space corresponds to a power-law model (y = alpha*x^beta) in the untransformed space, right? I know there are many assumptions about the distribution of the errors. Let's put them aside for the moment, as this is really more about understanding the fitting mechanism. I also want to make sure that only n% of the data is below the fitted line. This seems like a perfect case for quantile regression. So I did the following:

plot(df$x, df$y)
# fit a linear quantile regression to the data
library(quantreg)
lm =rq(log(y) ~ log(x), data=df, tau = .05)
pr = predict(lm)
lines(exp(pr))

But what I get out is the following:

While I expected something like:

I am really sorry for these bad examples and the complete misunderstanding of basic topics. But maybe someone has an idea on what I'm not getting here.

Update

I mean something like this with the mammals data in R

library(MASS)      # provides the mammals data set
library(quantreg)  # provides rq()

# log-transformed data
hist(log(mammals$body))
plot(log(brain) ~ log(body), mammals)
lm_log = lm(log(brain) ~ log(body), mammals)
qr_log = rq(log(brain) ~ log(body), data = mammals, tau = .05)
abline(lm_log)
abline(qr_log)

# use the models fitted on the log-transformed variables to predict,
# then plot the back-transformed predictions on the untransformed scale
new_data = data.frame(body = seq(min(mammals$body), max(mammals$body), by = .5))
pr = predict(lm_log, newdata = new_data)
pr_qr = predict(qr_log, newdata = new_data)

plot(brain ~ body, mammals)
lines(new_data$body, exp(pr), col = "green")
lines(new_data$body, exp(pr_qr), col = "blue")

Which gives this plot

score 0 · Answer 1 · answered May 07 '21 at 10:13

If you just want the median line, I would suggest the following:

library(ggplot2)
ggplot(data = df, aes(x = x, y = y)) + geom_point() + geom_quantile(quantiles = 0.5)
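
For the 5% line in log-log space that the question asks about, something along these lines might work. This is only a sketch: it uses simulated stand-in data rather than the question's df, and the names sim, grid, alpha_hat, beta_hat and the simulation parameters are all assumptions made for illustration. The point it tries to show is that the back-transformed predictions need to be drawn against x, not against their index.

```r
library(quantreg)

# Simulated stand-in data (hypothetical, NOT the question's df):
# y = alpha * x^beta with multiplicative noise.
set.seed(1)
n   <- 100
sim <- data.frame(x = sample(1:22, n, replace = TRUE))
sim$y <- 10 * sim$x^0.9 * exp(rnorm(n, sd = 1))

# 5% quantile regression in log-log space.
tau <- 0.05
fit <- rq(log(y) ~ log(x), data = sim, tau = tau)

# Back-transform: log(y) = a + b*log(x)  <=>  y = exp(a) * x^b.
alpha_hat <- exp(coef(fit)[[1]])
beta_hat  <- coef(fit)[[2]]

# Predict on a sorted grid of x values and pass x explicitly to lines();
# lines(exp(pred)) alone would draw the values against their index 1, 2, 3, ...
grid <- data.frame(x = seq(min(sim$x), max(sim$x), length.out = 200))
grid$y_hat <- exp(predict(fit, newdata = grid))

plot(sim$x, sim$y)
lines(grid$x, grid$y_hat, col = "blue")

# Share of points strictly below the fitted curve (should be close to tau).
res <- log(sim$y) - predict(fit, newdata = sim)
frac_below <- mean(res < -1e-8)
```

Since quantiles are preserved under monotone transformations such as exp(), the back-transformed curve should still have roughly tau of the points below it on the original scale. Replacing sim with the question's df and inspecting frac_below would be one way to check that.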

Elias