
Below is a sample of the data:

df_1 <- data.frame(total = c(0.9, 0.4, 0.2), white = c(0.6, 0.2, 0.1), black = c(0.3, 0.2, 0.1), immigrant = c(0.7, 0.3, 0.9))

df_2 <- data.frame(total = c(0.8, 0.7, 0.6), white = c(0.4, 0.3, 0.2), black = c(0.4, 0.4, 0.4), immigrant = c(0.9, 0.2, 0.1))

df_3 <- data.frame(total = c(0.6, 0.8, 0.9), white = c(0.4, 0.2, 0.7), black = c(0.2, 0.6, 0.2), immigrant = c(0.6, 0.8, 0.5))

Hi, I am interested in using ggplot2 to graph the dataframes above. In my example, each dataframe represents a different year: df_1 is 1930, df_2 is 1990, and df_3 is 2020. I want to calculate the mean of each of the four columns in each dataframe and then graph the results, with the year (1930, 1990, and 2020) on the x-axis and the calculated means (which should range from 0 to 1) on the y-axis. Each column is a different demographic group and should be drawn as a point in the graph. Below is an idea of what I am envisioning.

[Image: illustration of the desired graph]

I tried grouping the dataframes first, but I am not sure how to label each dataframe as a different year. The code below is adapted from another graph I made, but it didn't work as expected. Note that 'ratio' is meant to represent the calculated means of each column.

Consideration:

  • The number of rows may differ from one dataframe to the next
list(df_1, 
     df_2,
     df_3) %>%
     lapply(function(x) setNames(x, 'ratio')) %>%
     {do.call(bind_rows, c(., .id = 'demographic'))} %>%
     mutate(ratio = mean(ratio)) %>%
     group_by(demographic) %>%
     ggplot(aes(ratio, n, colour = demographic, group = demographic)) +
     labs(x="Mean", y="Year", ))
jrcalabrese
Kimberly

1 Answer


If you want your plot to be a ggplot, then it's important for your data to be tidy. That means that 1) each variable must have its own column, 2) each observation must have its own row, and 3) each value must have its own cell. These requirements also imply that all relevant values are in one dataset, not distributed over multiple datasets.

One option is to assign a year variable to each dataset, bind your datasets together, and then "lengthen" your dataset using pivot_longer(), so you can see each combination of year and your grouping variable. Then you can use summarize() to average by year and your grouping variable.

library(tidyverse)
df_1 <- data.frame(total = c(0.9, 0.4, 0.2), white = c(0.6, 0.2, 0.1), black = c(0.3, 0.2, 0.1), immigrant = c(0.7, 0.3, 0.9))
df_2 <- data.frame(total = c(0.8, 0.7, 0.6), white = c(0.4, 0.3, 0.2), black = c(0.4, 0.4, 0.4), immigrant = c(0.9, 0.2, 0.1))
df_3 <- data.frame(total = c(0.6, 0.8, 0.9), white = c(0.4, 0.2, 0.7), black = c(0.2, 0.6, 0.2), immigrant = c(0.6, 0.8, 0.5))

df_1$year <- 1930
df_2$year <- 1990
df_3$year <- 2020

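# Stack the three data frames, then reshape to one row per year/group/value
bigdf <- rbind(df_1, df_2, df_3) %>%
  pivot_longer(cols = -year) %>%
  mutate(year = as.factor(year)) %>%
  group_by(year, name) %>%
  # Average within each year/group; .groups = "drop" returns an ungrouped result
  summarize(value = mean(value), .groups = "drop")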

ggplot(bigdf, aes(x = year, y = value, 
                  color = name, group = name)) + 
  geom_path() + geom_point()
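
A variation on the same idea, offered as a sketch rather than the only way to do it: put the data frames in a named list and let `bind_rows()` write each name into a `year` column through its `.id` argument. Since the mean is taken per year and group after binding, this should also cope with data frames that have different numbers of rows. The object names `dfs` and `means` and the column names `demographic` and `ratio` are my own choices for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

df_1 <- data.frame(total = c(0.9, 0.4, 0.2), white = c(0.6, 0.2, 0.1), black = c(0.3, 0.2, 0.1), immigrant = c(0.7, 0.3, 0.9))
df_2 <- data.frame(total = c(0.8, 0.7, 0.6), white = c(0.4, 0.3, 0.2), black = c(0.4, 0.4, 0.4), immigrant = c(0.9, 0.2, 0.1))
df_3 <- data.frame(total = c(0.6, 0.8, 0.9), white = c(0.4, 0.2, 0.7), black = c(0.2, 0.6, 0.2), immigrant = c(0.6, 0.8, 0.5))

# Name each list element after its year; bind_rows() copies the name into `year`
dfs <- list("1930" = df_1, "1990" = df_2, "2020" = df_3)

means <- bind_rows(dfs, .id = "year") %>%
  # One row per year/demographic/value
  pivot_longer(cols = -year, names_to = "demographic", values_to = "ratio") %>%
  group_by(year, demographic) %>%
  # Mean of each column within each year
  summarize(ratio = mean(ratio), .groups = "drop")

# `year` is a character column here, so the x-axis is discrete
ggplot(means, aes(x = year, y = ratio,
                  color = demographic, group = demographic)) +
  geom_line() +
  geom_point() +
  ylim(0, 1) +
  labs(x = "Year", y = "Mean")
```

Keeping the year in the list names means a fourth data frame only needs one more list entry, with no separate `df_4$year <- ...` step.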


small edit

If you want to reorder the labels in the legend, you can turn name into a factor whose levels are listed in the order you want.

bigdf <- bigdf %>%
  mutate(name = factor(name,
                          levels = c("total",
                                     "black",
                                     "white",
                                     "immigrant")))
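
If you would rather leave the data alone, another option (a sketch, assuming the legend comes from the `color` aesthetic as in the plot above) is to pass the desired order to `limits` on the colour scale. `scale_x_discrete()` only controls the x-axis, so it has no effect on the legend. The name `legend_order` is just an illustrative variable, and `bigdf` is rebuilt here only so the example runs on its own.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

df_1 <- data.frame(total = c(0.9, 0.4, 0.2), white = c(0.6, 0.2, 0.1), black = c(0.3, 0.2, 0.1), immigrant = c(0.7, 0.3, 0.9))
df_2 <- data.frame(total = c(0.8, 0.7, 0.6), white = c(0.4, 0.3, 0.2), black = c(0.4, 0.4, 0.4), immigrant = c(0.9, 0.2, 0.1))
df_3 <- data.frame(total = c(0.6, 0.8, 0.9), white = c(0.4, 0.2, 0.7), black = c(0.2, 0.6, 0.2), immigrant = c(0.6, 0.8, 0.5))

# Rebuild the summary table of means by year and group
bigdf <- bind_rows(list("1930" = df_1, "1990" = df_2, "2020" = df_3), .id = "year") %>%
  pivot_longer(cols = -year) %>%
  group_by(year, name) %>%
  summarize(value = mean(value), .groups = "drop")

# Desired legend order (hypothetical variable name)
legend_order <- c("total", "black", "white", "immigrant")

p <- ggplot(bigdf, aes(x = year, y = value,
                       color = name, group = name)) +
  geom_line() +
  geom_point() +
  # limits on the colour scale sets the order of the legend entries
  scale_color_discrete(limits = legend_order)
p
```

Note that the default colours are assigned in the order given by `limits`, so the groups may change colour compared with the original plot.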
jrcalabrese
  • Thank you! This worked exactly how I was hoping for~~ – Kimberly Jan 26 '23 at 23:39
  • If you're available, how can I reorder the legend items? I tried adding `+ scale_x_discrete(limits = c('total', 'black', 'white', 'immigrant'))` and a few different iterations, but it did not work. Thanks – Kimberly Jan 27 '23 at 00:06
  • One way to reorder the legend items is to turn `name` into an ordered factor. I updated my answer; make sure to run that code before making the plot with `ggplot()`. – jrcalabrese Jan 27 '23 at 01:54