
I have a data.frame containing a continuous predictor and a dichotomous response variable.

> head(df)
  position response
1        0        1
2        3        1
3       -4        0
4       -1        0
5       -2        1
6        0        0

Fitting a logistic regression with glm() works without problems up to this point.

Next, I want to create a plot with ggplot that contains both the empirical probabilities for each of the 11 predictor values and the fitted regression line.

I went ahead and computed the probabilities with cast() and saved them in another data.frame:

> probs
   position   prob
1        -5 0.0500
2        -4 0.0000
3        -3 0.0000
4        -2 0.2000
5        -1 0.1500
6         0 0.3684
7         1 0.4500
8         2 0.6500
9         3 0.7500
10        4 0.8500
11        5 1.0000
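For readers who want to reproduce a table like this without cast(), here is a minimal sketch using base R's aggregate(). The data is simulated, not the original data, and the names sim and probs_sim are made up for illustration.

```r
# Simulated stand-in for the original data (assumption: 20 rows per position)
set.seed(1)
sim <- data.frame(position = rep(-5:5, each = 20))
sim$response <- rbinom(nrow(sim), size = 1, prob = plogis(0.6 * sim$position))

# The mean of a 0/1 variable within a group is the empirical probability of a 1
probs_sim <- aggregate(response ~ position, data = sim, FUN = mean)
names(probs_sim)[2] <- "prob"
probs_sim
```

The result has one row per distinct position, which is the shape geom_point() needs for the empirical probabilities.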

I plotted the probabilities:

p <- ggplot(probs, aes(x=position, y=prob)) + geom_point()

But when I try to add the fitted regression line

p <- p + stat_smooth(method="glm", family="binomial", se=F)

it returns a warning: "non-integer #successes in a binomial glm!". I know that in order to plot the stat_smooth "correctly", I'd have to call it on the original df data with the dichotomous variable. However, if I use the df data in ggplot(), I see no way to plot the probabilities.
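The warning is not specific to ggplot2. As a rough illustration, calling glm() directly on the aggregated table from above (fractional values as the response, no weights) should raise the same warning. The handler below only collects the warning text; the variable name msgs is chosen for this sketch.

```r
# The aggregated probabilities as printed in the question
probs <- data.frame(
  position = -5:5,
  prob = c(0.05, 0, 0, 0.2, 0.15, 0.3684, 0.45, 0.65, 0.75, 0.85, 1)
)

# Fit a binomial glm on fractional responses and collect any warnings
msgs <- character(0)
fit <- withCallingHandlers(
  glm(prob ~ position, family = binomial, data = probs),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(msgs)
```

A binomial family expects 0/1 outcomes (or counts with weights), which is why the smoother has to be fed the dichotomous variable rather than the per-position proportions.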

How can I combine the probabilities and the regression line in one plot, in the way it's meant to be in ggplot2, i.e. without getting any warning or error messages?

vincentqu
  • Plot the data and the `+stat_smooth` first, and then add the line plot for the probabilities you want with a call to: `+geom_line(aes(x=position, y=prob), data=probs)`. Untested in the absence of a data example. – IRTFM Jun 09 '13 at 16:19

1 Answer


There are basically three solutions:

Merging the data.frames

The easiest option, once you have your data in two separate data.frames, is to merge them by position:

mydf <- merge( mydf, probs, by="position")

Then you can call ggplot on this data.frame without warnings:

ggplot( mydf, aes(x=position, y=prob)) +
  geom_point() +
  geom_smooth(method = "glm", 
    method.args = list(family = "binomial"), 
    se = FALSE) 


Avoiding the creation of two data.frames

In the future, you could avoid creating two separate data.frames that have to be merged later. Personally, I like to use the plyr package for that:

library(plyr)
mydf <- ddply( mydf, "position", mutate, prob = mean(response)  )
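If plyr is not available, a similar per-row probability column can be sketched in base R with ave(). This is an alternative to the ddply() call above, not the method from the answer, and it runs on simulated data with the made-up name sim.

```r
# Simulated stand-in for mydf (assumption: 20 rows per position)
set.seed(1)
sim <- data.frame(position = rep(-5:5, each = 20))
sim$response <- rbinom(nrow(sim), size = 1, prob = plogis(0.6 * sim$position))

# ave() returns the group mean aligned to every row, like mutate within ddply()
sim$prob <- ave(as.numeric(sim$response), sim$position, FUN = mean)
head(sim)
```

Each row keeps its original response and gains the empirical probability of its position, so a single data.frame carries both variables.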

Edit: Use different data for each layer

I forgot to mention that each layer can use its own data.frame, which is a strong advantage of ggplot2:

ggplot( probs, aes(x=position, y=prob)) +
  geom_point() +
  geom_smooth(data = mydf, aes(x = position, y = response),
    method = "glm", method.args = list(family = "binomial"), 
    se = FALSE)
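To try the per-layer approach end to end, here is a self-contained sketch on simulated data. The names sim and probs_sim and the slope 0.6 are assumptions for illustration; only the ggplot2 calls mirror the snippet above.

```r
library(ggplot2)

# Simulated raw data with a dichotomous response (not the asker's data)
set.seed(42)
sim <- data.frame(position = rep(-5:5, each = 20))
sim$response <- rbinom(nrow(sim), size = 1, prob = plogis(0.6 * sim$position))

# Empirical probability per position, used only for the point layer
probs_sim <- aggregate(response ~ position, data = sim, FUN = mean)
names(probs_sim)[2] <- "prob"

# Points come from the aggregated table, the logistic fit from the raw 0/1 data
p <- ggplot(probs_sim, aes(x = position, y = prob)) +
  geom_point() +
  geom_smooth(data = sim, aes(x = position, y = response),
              method = "glm", method.args = list(family = "binomial"),
              se = FALSE)
print(p)
```

Because the smoother sees only 0/1 values, the binomial fit has integer outcomes to work with, while the points still show the aggregated proportions.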

As an additional hint: avoid using the variable name df, since assigning to it masks the built-in function stats::df.

PatrickT
Beasterfield
  • Thanks a lot! The second alternative seems pretty elegant. I guess I have to delve into the `plyr` package a bit. Seems pretty useful! – vincentqu Jun 09 '13 at 17:05
  • This code no longer works. `Error: Unknown parameters: family`. One has to use `stat_smooth(method="glm", se=F, method.args = list(family="binomial"))`. The `...` passes the family parameter to the layer, not the method. – CoderGuy123 Oct 03 '16 at 10:07