-2

R: df looks like this (x = a, y = b, group = c):

  a      b      c
-------------------
-2.1    1203    5
 1.4    1103    10
-2.1    1203    5
..       ..    ..

I created a scatterplot with around 10 linear regression lines (ggplot2/geom_smooth) from this df, for each group in c(e.g. 5, 10). These have all different slopes and y-intersects

Is there any way I can approximate the function for the family of curves for these linear regression lines in R and display it with custom parameters in a new plot(ggplot2)?


edit:

Here is the dput(df): (will remove it again later)

structure(list(X = 1:102, a = c("0", "0", "0", "0", "0", "219.399.914.550.781", 
"987", "0", "0", "0", "0", "1320", "0", "0", "0", "0", "0", "0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0", "144.595.013.427.734", 
"176.470.013.427.734", "167.919.995.117.188", "125.239.242.553.711", 
"247.582.397.460.938", "173.550.708.007.812", "149.010.693.359.375", 
"908", "638.5", "81.173.999.023.438", "1472", "120.632.000.732.422", 
"2028", "154.019.999.694.824", "785.5", "118.188.000.488.281", 
"149.930.010.986.328", "-712", "-2100", "1628", "925", "1161", 
"516", "64.840.002.441.406", "426.5", "106.810.998.535.156", 
"92.175.994.873.047", "648.5", "125.379.998.779.297", "1120", 
"821", "795", "129.179.998.779.297", "137.460.000.610.352", "127.231.660.461.426", 
"148.579.998.779.297", "115.440.997.314.453", "4.482.857.905.469", 
"1163", "97.440.002.441.406", "130.298.852.539.062", "195.956.491.088.867", 
"504.998.779.296.989", "1043", "228.998.406.982.422", "128.374.566.650.391", 
"153.219.995.117.188", "111.604.742.431.641", "108.100.006.103.516", 
"1364.5", "102.669.999.694.824", "141.820.001.220.703", "83.124.743.652.344", 
"93.209.649.658.203", "149.629.656.982.422", "150.215.002.441.406", 
"161.379.998.779.297", "41.129.998.779.297", "91.320.001.220.703", 
"83.047.998.046.875", "1144.5", "104.020.001.220.703", "171.528.002.929.688", 
"1519", "123.510.003.662.109", "106.240.002.441.406", "145.934.997.558.594", 
"177.939.999.389.648", "195.360.003.662.109", "164.140.002.441.406", 
"113.640.002.441.406", "146.676.000.976.562", "1.769.916.015.625", 
"53.389.654.541.016", "685.018.981.933.594"), c = c(88L, 88L, 
88L, 88L, NA, 88L, 88L, 88L, NA, NA, NA, 86L, 86L, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 90L, 88L, 88L, 88L, 
88L, 88L, 88L, 88L, 88L, 88L, 88L, 88L, 86L, 88L, 88L, 88L, 84L, 
84L, 84L, 84L, 86L, 84L, 84L, 84L, 84L, 84L, 84L, 84L, 84L, 84L, 
82L, 82L, 84L, 82L, 82L, 84L, 82L, 82L, 82L, 82L, 82L, 82L, 82L, 
82L, 82L, 82L, 82L, 82L, 82L, 82L, 82L, 82L, 82L, 82L, 80L, 80L, 
80L, 80L, 80L, 82L, 84L, 82L, 82L, 80L, 80L, 80L, 80L, 80L, 80L, 
80L, 80L, 80L, 80L, 80L, 80L), c_null = c(88L, 88L, 88L, 88L, 
0L, 88L, 88L, 88L, 0L, 0L, 0L, 86L, 86L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 90L, 88L, 88L, 88L, 88L, 
88L, 88L, 88L, 88L, 88L, 88L, 88L, 86L, 88L, 88L, 88L, 84L, 84L, 
84L, 84L, 86L, 84L, 84L, 84L, 84L, 84L, 84L, 84L, 84L, 84L, 82L, 
82L, 84L, 82L, 82L, 84L, 82L, 82L, 82L, 82L, 82L, 82L, 82L, 82L, 
82L, 82L, 82L, 82L, 82L, 82L, 82L, 82L, 82L, 82L, 80L, 80L, 80L, 
80L, 80L, 82L, 84L, 82L, 82L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 
80L, 80L, 80L, 80L, 80L), c_orig = c("88.000.096.643.065", "88.000.096.643.065", 
"88.000.096.643.065", "0", "874.979.654.044.919", "873.618.932.305.081", 
"869.990.179.502.541", "0", "0", "0", "861.825.503", "861.825.503", 
"0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", 
"0", "899.000.015.258.789", "87.5", "880.999.984.741.211", "88", 
"879.000.015.258.789", "87", "869.000.015.258.789", "87", "868.000.030.517.578", 
"878.000.030.517.578", "876.999.969.482.422", "865.999.984.741.211", 
"861.999.969.482.422", "870.999.984.741.211", "869.000.015.258.789", 
"865.999.984.741.211", "871.999.969.482.422", "841.999.969.482.422", 
"84.5", "840.999.984.741.211", "845.999.984.741.211", "843.000.030.517.578", 
"841.999.969.482.422", "83", "834.000.015.258.789", "83", "825.999.984.741.211", 
"834.000.015.258.789", "831.999.969.482.422", "826.999.969.482.422", 
"823.000.030.517.578", "821.999.969.482.422", "825.999.984.741.211", 
"821.999.969.482.422", "82", "825.999.984.741.211", "816.999.969.482.422", 
"814.000.015.258.789", "81", "819.000.015.258.789", "816.999.969.482.422", 
"81.5", "821.999.969.482.422", "811.999.969.482.422", "814.000.015.258.789", 
"813.000.030.517.578", "808.000.030.517.578", "815.999.984.741.211", 
"818.000.030.517.578", "814.000.015.258.789", "814.000.015.258.789", 
"809.000.015.258.789", "809.000.015.258.789", "805.999.984.741.211", 
"801.999.969.482.422", "796.999.969.482.422", "801.999.969.482.422", 
"803.000.030.517.578", "804.000.015.258.789", "811.999.969.482.422", 
"825.999.984.741.211", "82.5", "819.000.015.258.789", "804.000.015.258.789", 
"795.999.984.741.211", "804.000.015.258.789", "80", "801.999.969.482.422", 
"798.000.030.517.578", "80", "80", "795.999.984.741.211", "800.999.984.741.211", 
"799.000.015.258.789", "791.999.969.482.422", "791.999.969.482.422"
), b = c("0", "0", "0", NA, NA, "-0.136072173983791", "-0.362875280254002", 
NA, NA, NA, NA, "0", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, "-240.000.152.587.891", "0.599998474121094", 
"-0.0999984741210938", "-0.0999984741210938", "-0.900001525878906", 
"-0.0999984741210938", "0.0999984741210938", "-0.199996948242202", 
"1", "-0.100006103515597", "-109.999.847.412.111", "-0.400001525878892", 
"0.900001525878892", "-0.199996948242188", "-0.300003051757812", 
"0.599998474121108", "0.5", "0.300003051757798", "-0.400001525878906", 
"0.5", "-0.299995422363295", "-0.100006103515597", "-11.999.969.482.422", 
"0.400001525878906", "-0.400001525878906", "-0.400001525878906", 
"0.800003051757812", "-0.200004577636705", "-0.5", "-0.399993896484403", 
"-0.100006103515597", "0.400001525878892", "-0.400001525878892", 
"-0.199996948242202", "0.599998474121094", "-0.900001525878892", 
"-0.299995422363295", "-0.400001525878906", "0.900001525878906", 
"-0.200004577636705", "-0.199996948242202", "0.699996948242202", 
"-1", "0.200004577636705", "-0.099998474121108", "-0.5", "0.799995422363295", 
"0.200004577636705", "-0.400001525878892", "0", "-0.5", "0", 
"-0.300003051757812", "-0.400001525878892", "-0.5", "0.5", "0.100006103515597", 
"0.099998474121108", "0.799995422363295", "140.000.152.587.889", 
"-0.0999984741210938", "-0.599998474121094", "-1.5", "-0.800003051757812", 
"0.800003051757812", "-0.400001525878906", "0.199996948242202", 
"-0.399993896484403", "0.199996948242202", "0", "-0.400001525878906", 
"0.5", "-0.199996948242188", "-0.700004577636705", NA), b_null = c("0", 
"0", "0", "0", "0", "-0.136072173983791", "-0.362875280254002", 
"0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", 
"0", "0", "0", "0", "0", "0", "0", "-240.000.152.587.891", "0.599998474121094", 
"-0.0999984741210938", "-0.0999984741210938", "-0.900001525878906", 
"-0.0999984741210938", "0.0999984741210938", "-0.199996948242202", 
"1", "-0.100006103515597", "-109.999.847.412.111", "-0.400001525878892", 
"0.900001525878892", "-0.199996948242188", "-0.300003051757812", 
"0.599998474121108", "0.5", "0.300003051757798", "-0.400001525878906", 
"0.5", "-0.299995422363295", "-0.100006103515597", "-11.999.969.482.422", 
"0.400001525878906", "-0.400001525878906", "-0.400001525878906", 
"0.800003051757812", "-0.200004577636705", "-0.5", "-0.399993896484403", 
"-0.100006103515597", "0.400001525878892", "-0.400001525878892", 
"-0.199996948242202", "0.599998474121094", "-0.900001525878892", 
"-0.299995422363295", "-0.400001525878906", "0.900001525878906", 
"-0.200004577636705", "-0.199996948242202", "0.699996948242202", 
"-1", "0.200004577636705", "-0.099998474121108", "-0.5", "0.799995422363295", 
"0.200004577636705", "-0.400001525878892", "0", "-0.5", "0", 
"-0.300003051757812", "-0.400001525878892", "-0.5", "0.5", "0.100006103515597", 
"0.099998474121108", "0.799995422363295", "140.000.152.587.889", 
"-0.0999984741210938", "-0.599998474121094", "-1.5", "-0.800003051757812", 
"0.800003051757812", "-0.400001525878906", "0.199996948242202", 
"-0.399993896484403", "0.199996948242202", "0", "-0.400001525878906", 
"0.5", "-0.199996948242188", "-0.700004577636705", "0")), .Names = c("X", 
"a", "c", "c_null", "c_orig", "b", "b_null"), class = "data.frame", row.names = c(NA, 
-102L))

And I have just plotted it with ggplot2:

ggplot(aes(x = a, y = b, group = c), data = Health, na.rm = TRUE) + 
  geom_point(aes(color = c, size = factor(c)), alpha=0.3) +
  scale_color_distiller(palette="YlGnBu", na.value="white") +
  geom_smooth(method = "lm", aes(group = factor(c), color = c), se = F)

And now I want to approximate the family of curves for all the geom_smooth lines to predict in another plot different scenarios with other parameters!

Max
  • 397
  • 5
  • 15
  • 1
    Yes. Please provide sample data (paste into your question the output of `dput(data.sample)`) and the code you've run already and we can show you how. – eipi10 Jun 23 '16 at 17:11
  • Thanks. For future reference, it's better to provide a small data sample that illustrates your problem, rather than a large data set. – eipi10 Jun 23 '16 at 17:35
  • Ok, I will delete it later so other ppl. with the same question can read it more easily. – Max Jun 23 '16 at 17:37
  • The data frame you posted has just one column of character data rather than multiple of columns of data that match the column names in your code. Please fix. – eipi10 Jun 23 '16 at 17:38
  • Done, sorry was an input mistake. Please tell me how to set it to code formatting all at once ;) Sry! – Max Jun 23 '16 at 17:43
  • 1
    Highlight the entire code section and then click the `{}` at the top of the question window. – eipi10 Jun 23 '16 at 17:45
  • Thanks, also corrected a typo in the ggplot (GG -> c). Hope someone can help me here. – Max Jun 23 '16 at 17:57
  • `a` and `b` are character variables in the data you posted. – eipi10 Jun 23 '16 at 18:05
  • But it does work for me like this? How to change it in R? – Max Jun 23 '16 at 18:11
  • Yes, that data produces a plot, but the "numbers" it's plotting are the underlying factor codes, rather than actual data values. Take a look at the data you posted and see if it matches the data you're plotting with in your R session. – eipi10 Jun 23 '16 at 18:13
  • It is numeric in my file, plots and values are correct – Max Jun 23 '16 at 18:28

1 Answers1

1

Here's a simple example of how to create and plot the prediction model.

First, let's create some fake data:

## Fake data
set.seed(595)
a = runif(50, 0, 100)
c = runif(50, 0, 100)

# Add equation for b in terms of a and c
dat = data.frame(a,c, b = 2*a + 3*c + 10 + rnorm(50, 0, 20))

Now, to predict b, from a and c we need a regression model:

## Regression model
m1 = lm(b ~ a + c, data=dat)

Here's a summary of that model:

summary(m1)    
Call:
  lm(formula = b ~ a + c, data = dat)

Residuals:
  Min      1Q  Median      3Q     Max 
-63.169 -10.364   0.385  12.959  53.623 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.63897    9.07948  -0.401     0.69    
a            2.19203    0.09701  22.595   <2e-16 ***
c            3.00188    0.11356  26.435   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 20.58 on 47 degrees of freedom
Multiple R-squared:  0.9565,  Adjusted R-squared:  0.9547 
F-statistic:   517 on 2 and 47 DF,  p-value: < 2.2e-16

To plot the model predictions, let's create a data frame of those predictions. We'll predict b for four values of c.

## Predict b for various values of c and the entire range of a values
newdat = expand.grid(a=0:100, c=c(5,20,80,95))
newdat = data.frame(newdat, b_pred=predict(m1, newdata=newdat))

Now plot the data and the four predictions from the model:

# Plot points for b vs. a and then show prediction lines for various values of c.
ggplot(dat, aes(a, b)) +
  geom_point() +
  geom_line(data=newdat, aes(a, b_pred, color=factor(c))) +
  guides(colour=guide_legend(reverse=TRUE))

enter image description here

UPDATE: Following up on my last comment, maybe it would help to visualize the regression surface for the model. It's a surface because b = f(a,c) is a function of two variables. The function is: b = f(a,c) = -3.64 + 2.19*a + 3.00*c, as shown above in the regression summary. This is the equation of a plane. Here's what it looks like when we plot it:

# Set up data grid for plotting
a=seq(1, 100, length.out=20)
c=seq(1, 100, length.out=20)
z = outer(a,c, FUN=function(a,c) -3.64 + 2.19*a + 3.00*c)

# Plot regression surface: b = f(a,c) = -3.64 + 2.19*a + 3.00*c
mat=persp(a, c, z, ylim=c(0,100), theta=35, phi=20, zlab="b", border="grey30")

# Add red lines of constant c
lapply(c(5,20,80,95), function(c_val) {
  lines(trans3d(x=a, y=c_val, z=-3.64 + 2.19*a + 3.00*c_val, pmat=mat), 
        col="red", lwd=2, lty="12")

Note in the plot below that each red line has a constant value of c. Only a varies. This is what the plot we made with ggplot becomes when you expand it into the 3rd dimension.

enter image description here

In fact, you can perhaps see this more easily if we rotate the 3D plot so that we are looking in the direction of the c axis (and perpendicular to the a axis) and reduce the perspective effect a bit. Compare the plot below with the one we made with ggplot:

# Plot regression surface: b = f(a,c) = -3.64 + 2.19*a + 3.00*c
mat=persp(a, c, z, ylim=c(0,100), theta=0, phi=0, zlab="b", border="grey30", d=5)

# Add red lines of constant c
lapply(c(5,20,80,95), function(c_val) {
  lines(trans3d(x=a, y=c_val, z=-3.64 + 2.19*a + 3.00*c_val, pmat=mat), 
        col="red", lwd=2, lty="12")
}) 

enter image description here

eipi10
  • 91,525
  • 24
  • 209
  • 285
  • looks somewhat similar to mine, although it is numeric in my script. Would it be possible to approximate a function for the family of curves of the geom_smooth lines, if so, how? In my df they are more similar (I removed a lot data to keep it small enough to post it) – Max Jun 23 '16 at 18:25
  • What do you mean by "approximate a function for the family of curves of the geom_smooth lines." Do you mean you want the parameters for each of the regression lines? – eipi10 Jun 23 '16 at 18:32
  • no, I am looking for a function `y(x,k)` which approximates the **family of functions** of all regression lines – Max Jun 23 '16 at 18:37
  • sorry I'm not a native English speaker. Simply looking for a function (obviously only possible with an approximative approach) which represents all regression lines (`x = a; y = b`) dependent of `c`. With this model I want to predict the regression line for a new `c` – Max Jun 23 '16 at 18:42
  • No problem. I'd be pretty excited if I knew another language as well (or even half as well) as you obviously know English! – eipi10 Jun 23 '16 at 18:47
  • 1
    As far as a regression function, if `a`,`b`, and `c` are all numeric (as your gpplot code implies), then the most basic regression function would be `lm(b ~ a + c, data=Health)`. From that you can predict `b` for any values of `a` and `c`. Note however, that this will in general be different from the regression functions you get if you repeatedly run `lm(b ~ a)` for various fixed values of `c`. – eipi10 Jun 23 '16 at 18:49
  • But with this model I cant predict the line for single `c` values? I'm looking for a fixed function `b(a,c)`. so obviously the slope and intersect should somehow be dependent of `c`? e.g. I input `c = 80`, I need the function to return the approximated regression line `(a,b)` (based on `data = Health`). Or simply: a method which outputs for a random `c`(e.g. `c = 20`) the slope and intersect of the approximated regression line (based on the available `df`) – Max Jun 23 '16 at 19:05
  • So the R code should return a function/formula (e.g. `y = x(2+c-3c^2) + 2*c`) where I input a random `c` (e.g. `c = 2` and get as output the function of the specific regression line (e.g. `y = -10*x + 4`). The function/formula which returns this new/specific regresion function should be approximated, based on the provided dataframe. How can I achieve this? – Max Jun 23 '16 at 19:44
  • I created data with a particular functional form. In your real data you're trying to discover the appropriate function that relates `b` to `a` and `c`. Assuming a linear model, you run `m1 = lm(...)` to discover that function. In other words, `m1` (the object returned by `lm`) is the "function" that R returns. Then you use `predict` to get predictions for `b` given the model `m1` and input values for `a` and `c`. – eipi10 Jun 23 '16 at 19:47
  • Ok thanks for that, seems great so far. The only thing I am concerned is that all regression lines have the same slope. When I look at the original df and pull some regression lines for 5 `c`values, the slope differs very much. It almost seems like all regression lines are pivoting around a point. Is there any way to achieve this in R, to predict the **slope as well as the intersect** based on available data for a new `c`? – Max Jun 23 '16 at 19:56
  • 1
    The only way to get different slopes is by adding interactions to the regression model. With only two variables, there's only one possible interaction: `lm(b ~ a*c, data=Health)`. In your case, I think you may be getting different slopes because ggplot is calculating a separate model `lm(b ~ a)` for each value of `c`, rather than a model including both variables. As a result, the slope for `a` is not constrained to be the same for each value of `c`. (FYI, it's "y-intercept" rather than "y-intersect".) – eipi10 Jun 23 '16 at 20:09
  • Is there no way that R looks at different regression lines (or `c's? and then approximates a formula of a new regression line for any `c`? This would really help me, because `lm(b ~ a*c)` is also far away. – Max Jun 23 '16 at 20:36
  • The coefficients of the model `m1` give you the regression function. The equation is `b = -3.64 + 2.19*a + 3.00*c`. This equation (which is what `predict` is using) gives you the modeled value of `b` for any values of `a` and `c` you choose to put into the equation. This regression equation is actually a regression "surface", because there a two predictor variables (`a` and `c`) involved. When you choose a particular value of `c` and create a regression line for `b` vs. `a`, you're looking at one particular line (a line of constant `c`) along that surface. – eipi10 Jun 24 '16 at 03:06