Problem

I have some data points stored in a data.frame with three variables: x, y, and gender. My goal is to draw several generally fitted lines, plus lines fitted specifically for males and females, over a scatter plot with points coloured by gender. It sounds easy, but some issues persist.

What I currently do is predict y for every model on a new set of x values, combine the fitted lines in one data.frame, and convert it from wide to long, with the model name as the third variable. (From the posts "ggplot2: how to add the legend for a line added to a scatter plot?" and "Add legend to ggplot2 line plot" I learnt that mapping should be used instead of setting colours and legends separately.) However, while I get a multicolour line plot, the points are not coloured by gender (already a factor), as I expected from the posts I referenced.

I also know it might be possible to use aes(y = predict(model)), but I ran into other problems with that. I also tried colouring the points directly in aes and assigning a colour to each line separately, but then no legend is generated for the lines unless I use lty, which gives legend keys that are all the same colour.

I would appreciate any ideas, and I am open to changing the whole method.


Code

Note that two pairs of lines overlap, so there appear to be only two lines. I guess adding some jitter to the data might make it look different.

slrmen<-lm(tc~x+I(x^2),data=data[data['gender']==0,])
slrwomen<-lm(tc~x+I(x^2),data=data[data['gender']==1,])
prdf <- data.frame(x = seq(from = range(data$x)[1], 
                  to = range(data$x)[2], length.out = 100),
                  gender = as.factor(rep(1,100)))
prdm <- data.frame(x = seq(from = range(data$x)[1], 
                  to = range(data$x)[2], length.out = 100),
                  gender = as.factor(rep(0,100)))
# fullmodel is fitted elsewhere; its definition is not shown in this post
prdf$fit <- predict(fullmodel, newdata = prdf)
prdm$fit <- predict(fullmodel, newdata = prdm)
rawplotdata<-data.frame(x=prdf$x, fullf=prdf$fit, fullm=prdm$fit, 
                     linf=predict(slrwomen, newdata = prdf),
                     linm=predict(slrmen, newdata = prdm))
plotdata<-reshape2::melt(rawplotdata,id.vars="x",
                         measure.vars=c("fullf","fullm","linf","linm"),
                         variable.name="fitmethod", value.name="y")
plotdata$fitmethod<-as.factor(plotdata$fitmethod)

plt <- ggplot() + 
       geom_line(data = plotdata, aes(x = x, y = y, group = fitmethod, 
                                      colour=fitmethod)) +
       scale_colour_manual(name = "Fit Methods", 
                           values = c("fullf" = "lightskyblue", 
                                      "linf" = "cornflowerblue",
                                      "fullm"="darkseagreen", "linm" = "olivedrab")) +
       geom_point(data = data, aes(x = x, y = y, fill = gender)) +
       scale_fill_manual(values=c("blue","green"))  ## This does not work as I expected...
show(plt)

Points cannot be coloured

Code for another method (two lines omitted), which produces a legend with same-coloured keys and a multicolour plot:

ggplot(data = prdf, aes(x = x, y = fit)) +  # prdf and prdm are just data frames containing the x's and fitted values for different models
       geom_line(aes(lty="Female"),colour = "chocolate") +
       geom_line(data = prdm, aes(x = x, y = fit, lty="Male"), colour = "darkblue") + 
       geom_point(data = data, aes(x = x, y = y, colour = gender)) +
       scale_colour_discrete(name="Gender", breaks=c(0,1), 
                             labels=c("Male","Female"))

user48867

2 Answers

This is related to using the colour aesthetic for lines and the fill aesthetic for points in your own (first) example. In the second example, it works because the colour aesthetic is used for both lines and points.

By default, mapping a variable to fill in geom_point has no visible effect, because the default point shape (19) doesn't have a fill.

For fill to work on points, you have to specify a fillable shape, one of 21 to 25 (for example shape = 21), in geom_point(), outside of aes().
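
A minimal sketch of the difference, assuming a made-up four-row data frame (`demo`, `p_default` and `p_filled` are hypothetical names, not from the question). With the default shape the fill mapping is accepted but not drawn; with shape 21, fill sets the interior and colour sets the outline.

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
demo <- data.frame(x = 1:4, y = c(2, 3, 1, 4),
                   gender = factor(c("Female", "Male", "Female", "Male")))

# Default shape (19): the fill mapping has no visible effect
p_default <- ggplot(demo, aes(x = x, y = y)) +
  geom_point(aes(fill = gender), size = 4)

# Shape 21: fill colours the interior, colour sets the outline
p_filled <- ggplot(demo, aes(x = x, y = y)) +
  geom_point(aes(fill = gender), shape = 21, colour = "black", size = 4)
```

Printing `p_default` and `p_filled` side by side should show the contrast.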

Perhaps this small reproducible example helps to illustrate the point:

Simulate data

set.seed(4821)
x1 <- rnorm(100, mean = 5)

set.seed(4821)
x2 <- rnorm(100, mean = 6)

data <- data.frame(x = rep(seq(20,80,length.out = 100),2),
                   tc = c(x1, x2),
                   gender = factor(c(rep("Female", 100), rep("Male", 100))))

Fit models

slrmen <-lm(tc~x+I(x^2), data = data[data["gender"]=="Male",])
slrwomen <-lm(tc~x+I(x^2),data = data[data["gender"]=="Female",])

newdat <- data.frame(x = seq(20,80,length.out = 200))

# Use the x column itself (newdat$x) rather than the whole data frame
fitted.male <- data.frame(x = newdat$x,
                          gender = "Male",
                          tc = predict(object = slrmen, newdata = newdat))
fitted.female <- data.frame(x = newdat$x,
                            gender = "Female",
                            tc = predict(object = slrwomen, newdata = newdat))

Plot using colour aesthetics

Use the colour aesthetic for both points and lines (specify it in ggplot() so that it is inherited by every layer). By default, geom_point can map a variable to colour.

library(ggplot2)

ggplot(data, aes(x = x, y = tc, colour = gender)) +
  geom_point() +
  geom_line(data = fitted.male) +
  geom_line(data = fitted.female) +
  scale_colour_manual(values = c("tomato","blue")) +
  theme_bw()

Plot using colour and fill aesthetics

Use the fill aesthetic for points and the colour aesthetic for lines (specify aesthetics in geom_* to prevent them being inherited). This will reproduce the problem.

ggplot(data, aes(x = x, y = tc)) +
  geom_point(aes(fill = gender)) +
  geom_line(data = fitted.male, aes(colour = gender)) +
  geom_line(data = fitted.female, aes(colour = gender)) +
  scale_colour_manual(values = c("tomato","blue")) +
  scale_fill_manual(values = c("tomato","blue")) +
  theme_bw()

To fix this, change the shape argument in geom_point to a point shape that can be filled (one of 21 to 25).

ggplot(data, aes(x = x, y = tc)) +
  geom_point(aes(fill = gender), shape = 21) +
  geom_line(data = fitted.male, aes(colour = gender)) +
  geom_line(data = fitted.female, aes(colour = gender)) +
  scale_colour_manual(values = c("tomato","blue")) +
  scale_fill_manual(values = c("tomato","blue")) +
  theme_bw()

Created on 2021-09-19 by the reprex package (v2.0.1)

Note that the scales for colour and fill get merged automatically if the same variable is mapped to both aesthetics.
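
If separate legends are preferred, one option that should work is to give the two scales different names, since legends are only merged when their titles and labels match. The sketch below uses made-up data (`toy` and `p_split` are hypothetical names) rather than the reprex objects above.

```r
library(ggplot2)

# Toy data standing in for `data` above (values are made up)
set.seed(1)
toy <- data.frame(x = rep(seq(20, 80, length.out = 50), 2),
                  gender = factor(rep(c("Female", "Male"), each = 50)))
toy$tc <- 5 + (toy$gender == "Male") + rnorm(100)

# Fitted values from one model with a gender interaction
toy$fit <- fitted(lm(tc ~ x * gender, data = toy))

# Different scale names keep the fill and colour legends apart
p_split <- ggplot(toy, aes(x = x, y = tc)) +
  geom_point(aes(fill = gender), shape = 21) +
  geom_line(aes(y = fit, colour = gender)) +
  scale_fill_manual(name = "Points", values = c("tomato", "blue")) +
  scale_colour_manual(name = "Fitted lines", values = c("tomato", "blue"))
```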

scrameri
  • Thank you so much scrameri! I didn't realize that points in geom_point are not "fillable" by default. I guess what I would use is the last method, changing the points to another shape, because I need to draw multiple lines and assign colours separately (say, 4 fitted lines and 2 different point types), so the previous two methods might not be applicable. – user48867 Sep 19 '21 at 21:42
  • Also do you know why in my second method, I cannot assign two colours to the two lines even if I have manually put in different colours to `colour` outside of `aes` but within `geom_line`? (In fact, I am not really expecting a dashed line) – user48867 Sep 19 '21 at 21:44
  • Hi Jasper :) I'm not sure what happens exactly with your linetype, but you probably shouldn't map a character such as "Male" or "Female" to the linetype aesthetic. Did you try specifying `geom_line(colour = "chocolate", linetype = 1)` and `geom_line(data = prdm, aes(x = x, y = fit), linetype = 2, colour = "darkblue")` instead? – scrameri Sep 19 '21 at 22:22
  • Alternatively, you could `rbind` all your `data.frames` with x and fit values into `prdm`, add the variables `gender` and `model` (specifying the model type), and then map both to the colour and linetype aesthetics in one go: `geom_line(data = prdm, aes(x = x, y = fit, linetype = model, colour = gender))` – scrameri Sep 19 '21 at 22:22
  • Just used a character (not a variable name) in `aes` myself. Such an approach is similar to mapping a variable from your data.frame to an aesthetic: in your example, you first mapped `lty` to "Female", which produced the solid line, and then you mapped `lty` to "Male", which is then interpreted as a second level of some factor. But that factor isn't your `gender` variable in `prdf` and `prdm`. To map `gender` to `linetype` (lty), you can use `aes(linetype = gender)`. If you don't want different linetypes, you shouldn't specify `linetype` inside `aes`. Hope this helps! – scrameri Sep 21 '21 at 08:38
  • Thank you so much scrameri! Your examples help a lot. Yes, `lty` seems not meant to be used in this way... The reason I used it was just I wanted to make legends directly for the data within `geom_line`. Now I believe it shouldn't be used in this way. Thanks again! – user48867 Sep 22 '21 at 16:20
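
The `rbind` suggestion from the comments could look roughly like the sketch below. It assumes made-up data; `toy`, `newx`, `fits` and the helper `fit_one()` are hypothetical names introduced only for illustration.

```r
library(ggplot2)

# Toy data (values are made up)
set.seed(1)
toy <- data.frame(x = rep(seq(20, 80, length.out = 50), 2),
                  gender = factor(rep(c("Female", "Male"), each = 50)))
toy$tc <- 5 + (toy$gender == "Male") + rnorm(100)

newx <- data.frame(x = seq(20, 80, length.out = 100))

# Hypothetical helper: fit one model to one gender, return long-format predictions
fit_one <- function(g, form, label) {
  m <- lm(form, data = toy[toy$gender == g, ])
  data.frame(x = newx$x, fit = predict(m, newdata = newx),
             gender = g, model = label)
}

# One long data frame holding all four fitted lines
fits <- rbind(fit_one("Female", tc ~ x,          "linear"),
              fit_one("Female", tc ~ x + I(x^2), "quadratic"),
              fit_one("Male",   tc ~ x,          "linear"),
              fit_one("Male",   tc ~ x + I(x^2), "quadratic"))

# colour = gender is inherited; linetype distinguishes the models
p_lines <- ggplot(toy, aes(x = x, y = tc, colour = gender)) +
  geom_point() +
  geom_line(data = fits, aes(y = fit, linetype = model))
```

Because both `gender` and `model` are real columns, each aesthetic gets its own legend without any character constants inside `aes`.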

It seems to me that what you really want to do is use ggplot2::stat_smooth instead of trying to predict yourself.

Borrowing the data from @scrameri:

ggplot(data, aes(x = x, y = tc, color = gender)) +
   geom_point() +
   stat_smooth(aes(linetype = "X^2"), method = 'lm',formula = y~x + I(x^2)) +
   stat_smooth(aes(linetype = "X^3"), method = 'lm',formula = y~x + I(x^2) + I(x^3)) +
   scale_color_manual(values = c("darkseagreen","lightskyblue"))

Ian Campbell
  • Thank you Ian! I didn't know until now that I could fit directly in `ggplot2`. Interesting feature. However, simple regression might only be a special case, and oftentimes I think more advanced fits need to be done outside of `ggplot()`, with only the fitted points fed into it. In that case I guess more precise manual control would usually be desired. Still, thanks for your great answer! – user48867 Sep 19 '21 at 21:48
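
For the case raised in the comment, where the fit happens outside `ggplot()`, a minimal sketch under assumed toy data follows (`toy`, `fit`, `grid` and `p_manual` are hypothetical names). It uses `predict()` with `interval = "confidence"` and feeds the result to `geom_ribbon()` and `geom_line()`; any model with a suitable `predict` method could take the place of `lm`.

```r
library(ggplot2)

# Toy data (values are made up)
set.seed(1)
toy <- data.frame(x = seq(20, 80, length.out = 60))
toy$tc <- 5 + 0.02 * toy$x + rnorm(60)

# Fit outside ggplot2
fit <- lm(tc ~ x + I(x^2), data = toy)

# Prediction grid; for lm, interval = "confidence" yields columns fit, lwr, upr
grid <- data.frame(x = seq(20, 80, length.out = 100))
grid <- cbind(grid, predict(fit, newdata = grid, interval = "confidence"))

# Only the fitted values and band limits are handed to ggplot2
p_manual <- ggplot(toy, aes(x = x, y = tc)) +
  geom_point() +
  geom_ribbon(data = grid, aes(x = x, ymin = lwr, ymax = upr),
              inherit.aes = FALSE, alpha = 0.2) +
  geom_line(data = grid, aes(y = fit))
```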