I'm totally new to ggplot and relatively fresh with R, and I want to make a smashing "before-and-after" scatterplot with connecting lines to illustrate how the percentages of different subgroups moved before and after a special training initiative. I've tried some options, but have yet to:

  • show each individual observation separately (currently, identical values overlap)
  • connect the related before and after measures (x=0 and x=1) with lines to more clearly illustrate the direction of change
  • distinguish class and id using shapes and colors

How can I best create a scatter plot with ggplot (or another tool) that meets the requirements above?

Main alternative: geom_point()

Here is some sample data and example code using geom_point():

    x <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1) # 0=before, 1=after
    y <- c(45,30,10,40,10,NA,30,80,80,NA,95,NA,90,NA,90,70,10,80,98,95) # percentage of "feelings of peace"
    class <- c(0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1) # 0=multiple days 1=one day
    id <- c(1,1,2,3,4,4,4,4,5,6,1,1,2,3,4,4,4,4,5,6) # id = per individual

    df <- data.frame(x,y,class,id)

    ggplot(df, aes(x=x, y=y), fill=id, shape=class) + geom_point()

Click here for example image based on geom_point()

Alternative: scale_size()

I have explored stat_sum() to summarize the frequencies of overlapping observations, but then I cannot distinguish the subgroups by color and shape, because the overlapping points are merged into one.

    ggplot(df, aes(x=x, y=y)) +
      stat_sum()

Click here for example image based on scale_size()

Alternative: geom_dotplot()

I have also explored geom_dotplot() to separate the overlapping observations that arise from geom_point(), as in the example below. However, I have yet to understand how to combine the before and after measures in the same plot.

    df1 <- df[1:10,] # data before
    df2 <- df[11:20,] # data after

    p1 <- ggplot(df1, aes(x=x, y=y)) +
      geom_dotplot(binaxis = "y", stackdir = "center",stackratio=2,
           binwidth=(1/0.3))

    p2 <- ggplot(df2, aes(x=x, y=y)) +
      geom_dotplot(binaxis = "y", stackdir = "center",stackratio=2,
           binwidth=(1/0.3))

    grid.arrange(p1, p2, nrow = 1) # from the gridExtra package

Click here for example image based on geom_dotplot()

camille
  • There are points/people with multiple values at a specific time (i.e., two `id`s of '1' at time zero). Is this intentional? – wibeasley Jun 18 '18 at 14:15
  • I think it would be best to figure out what type of chart you want first, and then people can help you build it. This is a situation where a slopegraph would probably work well, and is easy to implement in ggplot – camille Jun 18 '18 at 14:39
  • @wibeasley Yes, this is intentional and I will be subsetting on this parameter later. Thanks for pointing out that this was unclear! – Mathilda Lindgren Jun 19 '18 at 06:13
  • @camille Thank you generous genius! I had completely forgotten about slopegraphs and I think you're totally right that this is where I should be with the kind of data I have. Will explore further and let you all know how it goes. Until then, thanks again! – Mathilda Lindgren Jun 19 '18 at 06:15

2 Answers

Or maybe it is better to summarize the data by x, id and class as the mean or median of y, filter out the ids that produce NAs (here ids 3 and 6, which have no non-missing value at one of the two time points), and connect the points with lines? If you don't really need to show the variability within each id (which may be the case if the plot only illustrates tendencies), you can do it this way:

    library(ggplot2)
    library(dplyr)
    #library(ggthemes)

    df <- df %>%
      # one median per id and time point; missing values are ignored
      group_by(x, id, class) %>%
      summarize(y = median(y, na.rm = TRUE)) %>%
      ungroup() %>%
      mutate(
        id = factor(id),
        x = factor(x, labels = c("before", "after")),
        # in the question, 0 = multiple days and 1 = one day
        class = factor(class, labels = c("multiple days", "one day"))
      ) %>%
      # drop every id that has no usable value at one of the time points
      group_by(id) %>%
      mutate(nas = any(is.na(y))) %>%
      ungroup() %>%
      filter(!nas) %>%
      select(-nas)

    ggplot(df, aes(x = x, y = y, col = id, group = id)) +
      geom_point(aes(shape = class)) +
      geom_line(show.legend = FALSE) +
      #theme_few() +
      #theme(legend.position = "none") +
      ylab("Feelings of peace, %") +
      xlab("")
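To see which ids the filter removes, here is a minimal sketch that rebuilds the sample data from the question and runs only the summarizing step. The names raw, medians and dropped_ids are made up for this illustration and are not part of the answer's code.

```r
library(dplyr)

# Sample data from the question, re-created so this block runs on its own
raw <- data.frame(
  x     = rep(c(0, 1), each = 10),                       # 0 = before, 1 = after
  y     = c(45, 30, 10, 40, 10, NA, 30, 80, 80, NA,
            95, NA, 90, NA, 90, 70, 10, 80, 98, 95),
  class = rep(c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1), times = 2),
  id    = rep(c(1, 1, 2, 3, 4, 4, 4, 4, 5, 6), times = 2)
)

# One median per id and time point; a group that holds only NAs yields NA
medians <- raw %>%
  group_by(x, id, class) %>%
  summarize(y = median(y, na.rm = TRUE)) %>%
  ungroup()

# ids with no usable value at one of the two time points
# (dropped_ids is a hypothetical name used only in this sketch)
dropped_ids <- medians %>%
  filter(is.na(y)) %>%
  pull(id) %>%
  unique() %>%
  sort()

print(dropped_ids)
```

With the sample data this should report ids 3 and 6: id 3 has only an NA after the training, and id 6 has only an NA before it.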

utubun
  • Thank you so much @utubun for this suggestion and code! I have now tried this approach and it is indeed useful for illustrating tendencies, which I am also interested in as a first start. Together with slopegraphs, as suggested above in the comments by camille, I feel like I will be able to illustrate the things I want to illustrate, so thanks a lot! :) – Mathilda Lindgren Jun 26 '18 at 09:42
Here's one possible solution for you.

First, to have the colors and shapes determined by variables, you need to put those variables inside the aes function. I converted several of them to factors within aes, so I use the labs function to set the axis and legend titles; otherwise they would read "factor(x)" instead of just "x".

To address the multiple points per id, one solution is to use geom_smooth with method = "lm". This plots a regression line for each id instead of connecting all the dots. The option se = FALSE prevents confidence intervals from being plotted. I don't think they add a lot to your plot, but play with it. Connecting the dots is done by geom_line, so feel free to try that as well.

Within geom_point, the option position = position_jitter(width = .1) adds random noise along the x-axis so that points do not overlap. When height is not given, position_jitter also adds a small amount of vertical noise by default; add height = 0 if the y values should stay exact.

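    library(ggplot2)

    # df is the original data frame from the question
    ggplot(df, aes(x = factor(x), y = y, color = factor(id), shape = factor(class), group = id)) +
      geom_point(position = position_jitter(width = .1)) +  # spread overlapping points sideways
      geom_smooth(method = 'lm', se = FALSE) +              # one straight trend line per id
      labs(
        x = "x",
        color = "ID",
        shape = 'Class'
      )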
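For the geom_line variant mentioned above, here is a rough sketch that draws one line per before/after pair instead of a trend line per id. It rests on an assumption that the question does not confirm: that the k-th "before" row of an id belongs with its k-th "after" row. The pair column and the paired data frame are hypothetical names introduced only for this illustration.

```r
library(ggplot2)
library(dplyr)

# Sample data from the question, re-created so this block runs on its own
df <- data.frame(
  x     = rep(c(0, 1), each = 10),                       # 0 = before, 1 = after
  y     = c(45, 30, 10, 40, 10, NA, 30, 80, 80, NA,
            95, NA, 90, NA, 90, 70, 10, 80, 98, 95),
  class = rep(c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1), times = 2),
  id    = rep(c(1, 1, 2, 3, 4, 4, 4, 4, 5, 6), times = 2)
)

# ASSUMPTION: within each id, rows are matched by their order of appearance
paired <- df %>%
  group_by(id, x) %>%
  mutate(pair = row_number()) %>%          # hypothetical pairing index
  group_by(id, pair) %>%
  filter(n() == 2, !any(is.na(y))) %>%     # keep only complete before/after pairs
  ungroup()

p <- ggplot(paired,
            aes(x = factor(x, labels = c("before", "after")), y = y,
                color = factor(id), shape = factor(class),
                group = interaction(id, pair))) +
  geom_line() +                            # one segment per matched pair
  geom_point() +
  labs(x = NULL, y = "Feelings of peace, %", color = "ID", shape = "Class")

p
```

Pairs with a missing value on either side are dropped, so fewer observations are shown than in the jittered plot. Points with identical values still overlap here; if that matters, the pairing would need to be combined with some form of jitter or dodging.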
Melissa Key