Don't understand why this book solution is "right" for cleaning up data in flights dataset

Question

I'm working through the questions of R for Data Science and the website's solutions here: https://jrnold.github.io/r4ds-exercise-solutions/exploratory-data-analysis.html.

The question I'm working on is:

Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?

This is the data they are working with, which comes from the nycflights13 "flights" dataset:

dput(head(flights))
structure(list(year = c(2013L, 2013L, 2013L, 2013L, 2013L, 2013L
), month = c(1L, 1L, 1L, 1L, 1L, 1L), day = c(1L, 1L, 1L, 1L, 
1L, 1L), dep_time = c(517L, 533L, 542L, 544L, 554L, 554L), sched_dep_time = c(515L, 
529L, 540L, 545L, 600L, 558L), dep_delay = c(2, 4, 2, -1, -6, 
-4), arr_time = c(830L, 850L, 923L, 1004L, 812L, 740L), sched_arr_time = c(819L, 
830L, 850L, 1022L, 837L, 728L), arr_delay = c(11, 20, 33, -18, 
-25, 12), carrier = c("UA", "UA", "AA", "B6", "DL", "UA"), flight = c(1545L, 
1714L, 1141L, 725L, 461L, 1696L), tailnum = c("N14228", "N24211", 
"N619AA", "N804JB", "N668DN", "N39463"), origin = c("EWR", "LGA", 
"JFK", "JFK", "LGA", "EWR"), dest = c("IAH", "IAH", "MIA", "BQN", 
"ATL", "ORD"), air_time = c(227, 227, 160, 183, 116, 150), distance = c(1400, 
1416, 1089, 1576, 762, 719), hour = c(5, 5, 5, 5, 6, 5), minute = c(15, 
29, 40, 45, 0, 58), time_hour = structure(c(1357034400, 1357034400, 
1357034400, 1357034400, 1357038000, 1357034400), tzone = "America/New_York", class = c("POSIXct", 
"POSIXt"))), row.names = c(NA, -6L), class = c("tbl_df", "tbl", 
"data.frame"))

So this is the answer they give in the link I shared:

flights %>%
      group_by(month, dest) %>%                                 # This gives us (month, dest) pairs
      summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
      group_by(dest) %>%                                        # group all (month, dest) pairs by dest ..
      filter(n() == 12) %>%                                     # and only select those that have one entry per month 
      ungroup() %>%
      mutate(dest = reorder(dest, dep_delay)) %>%
      ggplot(aes(x = factor(month), y = dest, fill = dep_delay)) +
      geom_tile() +
      labs(x = "Month", y = "Destination", fill = "Departure Delay")
    #> `summarise()` regrouping output by 'month' (override with `.groups` argument)

Which yields this:

I have so many questions about this:

First off, why does he/she group this twice? I see that there is an initial grouping by month/dest, but then they group again by dest two lines down.
Next, what is the purpose of then ungrouping it? Maybe the ungroup function serves a purpose I'm not aware of, but it sounds counterintuitive.
Finally, this data still doesn't look "clean" like the book seems to want. Sure, it shows some heatmapping by month, but the destinations plotted on y just look like alphabet soup, so it's hard to derive any actual context.

I guess my two major problems right now are that I don't understand how they came up with this, nor do I understand why this is an acceptable answer given it doesn't show much.

If my answer is correct please mark it as being correct, thank you. — Hansel Palencia, Sep 03 '21 at 12:06

Hansel Palencia · Accepted Answer · 2021-09-03T13:21:54.347

I'll answer your question in parts.

First off, why does he/she group this twice? I see that there is an initial grouping by month/dest, but then they group again by dest two lines down.

First of all, the two groupings do different jobs. The first one groups by month and destination:

group_by(month, dest)

This group by was then combined with a summarise to calculate the average departure delay per destination per month.

summarise(dep_delay = mean(dep_delay, na.rm = TRUE))
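
To see what these two lines produce, here is a minimal sketch on a made-up table (the `toy` data and its values are assumptions for illustration, not the real flights data). It suggests that `summarise()` collapses each (month, dest) group to a single row and, by default, leaves the result grouped by `month` only, which is what the "regrouping output by 'month'" message in the question refers to.

```r
library(dplyr)

# Hypothetical toy data: two destinations, two months, two flights each
toy <- tibble(
  month     = c(1, 1, 1, 1, 2, 2, 2, 2),
  dest      = c("ATL", "ATL", "ORD", "ORD", "ATL", "ATL", "ORD", "ORD"),
  dep_delay = c(10, 20, 0, NA, 5, 15, 30, 50)
)

monthly <- toy %>%
  group_by(month, dest) %>%
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE))

nrow(monthly)        # one row per (month, dest) pair
group_vars(monthly)  # the grouping that remains after summarise()
```

Printing `monthly` and comparing it with `toy` is a quick way to check what the summary step did before adding the next line of the pipeline.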

They then grouped again, this time on destination only, and filtered that grouped dataset (this matters for your second question).

group_by(dest) %>%                                        
filter(n() == 12) %>%

The purpose of the regrouping is to make `n()` count the right thing, so that the filter doesn't remove the entire dataset. After `summarise()` the table has one row per (month, dest) pair and is still grouped by `month`, so without regrouping `n()` would count destinations per month. Grouping by `dest` makes `n()` count the number of monthly rows for each destination, and `filter(n() == 12)` keeps only the destinations that have an average for every one of the 12 months, i.e. destinations that were flown to in every month of the year. Destinations with missing months would otherwise leave gaps in the heatmap. In other words, our final data frame will be a long table in which each remaining destination has exactly 12 rows.
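
The effect of the regrouping can be sketched on a small made-up summary table. The destinations "AAA" and "BBB" and their delay values are hypothetical; "AAA" has a row for every month and "BBB" only for three.

```r
library(dplyr)

# Hypothetical monthly summary: "AAA" appears in all 12 months,
# "BBB" only in January to March
monthly <- tibble(
  month     = c(1:12, 1:3),
  dest      = c(rep("AAA", 12), rep("BBB", 3)),
  dep_delay = c(1:12, 5, 6, 7)
)

# Grouped by dest, n() is the number of monthly rows per destination
kept <- monthly %>%
  group_by(dest) %>%
  filter(n() == 12) %>%
  ungroup()

# Grouped by month, n() is the number of destinations in that month
# (1 or 2 here), so no group passes the n() == 12 test
dropped <- monthly %>%
  group_by(month) %>%
  filter(n() == 12) %>%
  ungroup()

unique(kept$dest)  # only the destination with all 12 months remains
nrow(dropped)      # nothing survives with the wrong grouping
```

Swapping the grouping variable like this is also an easy way to experiment with the book's pipeline and see why the second `group_by()` is there.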

Next, what is the purpose of then ungrouping it? Maybe the ungroup function serves a purpose I'm not aware of, but it sounds counterintuitive.

`ungroup()` is not counterintuitive; it is essential here. Per the help file, `group_by()` takes an existing table and converts it into a grouped table, so that operations are performed by group rather than on the whole table. `ungroup()` removes that grouping. The next operation is `reorder()`, and the author wanted to order the destinations across the entire dataset, not within each destination's own 12 points, so the grouping has to be removed first.
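
As a rough illustration of the difference, the sketch below runs the same `mutate()` with and without grouping. It uses `mean()` instead of `reorder()` because the result is easier to read, and the table and its values are made up.

```r
library(dplyr)

# Hypothetical monthly averages for two made-up destinations
monthly <- tibble(
  dest      = c("AAA", "AAA", "BBB", "BBB"),
  dep_delay = c(10, 20, 30, 40)
)

# While grouped, mutate() computes the mean separately inside each dest
grouped <- monthly %>%
  group_by(dest) %>%
  mutate(avg = mean(dep_delay))

# After ungroup(), the same mutate() sees the whole table at once
ungrouped <- monthly %>%
  group_by(dest) %>%
  ungroup() %>%
  mutate(avg = mean(dep_delay))

grouped$avg    # 15 15 35 35
ungrouped$avg  # 25 25 25 25
```

Ordering destinations by delay needs the second behaviour, because every destination has to be compared against all the others.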

Finally, this data still doesn't look "clean" like the book seems to want. Sure, it shows some heatmapping by month, but the destinations plotted on y just look like alphabet soup, so it's hard to derive any actual context.

The same plot looks different depending on the monitor or window it is drawn in, and the same goes for copying or saving it. The plot scales to the space available. It is most likely that the author had a larger monitor than you, and the default dimensions were fine in their case. For you I would recommend increasing the dimensions of the plot or simply maximizing your plot window.
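
One possible way to control the dimensions is to save the plot with an explicit width and height using `ggsave()`. This is only a sketch: the toy data, the file name, and the 8 x 14 inch size are arbitrary assumptions, and the real heatmap would be passed as `plot` instead of `p`.

```r
library(ggplot2)

# Hypothetical stand-in for the heatmap: 3 made-up destinations, 12 months
toy <- expand.grid(month = 1:12, dest = c("AAA", "BBB", "CCC"))
toy$dep_delay <- seq_len(nrow(toy))

p <- ggplot(toy, aes(x = factor(month), y = dest, fill = dep_delay)) +
  geom_tile()

# A tall canvas gives each destination label more vertical room
out_file <- file.path(tempdir(), "delays_heatmap.pdf")
ggsave(out_file, plot = p, width = 8, height = 14, units = "in")
```

A taller file tends to help most here, since the crowding is on the y-axis where the destination codes are stacked.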

Overall, your questions were simple to answer if you had looked in the help files, played with the solution (i.e. commenting out a group by to see what would happen), and really inspected the plot.

I try using help all the time in R, but I have no idea what it's trying to say half the time, even when I check again with R Documentation, which often has bizarrely confusing explanations for even simple operations for newbs. I tried cutting the command up into pieces to see what it was doing each time, but as each part seems to display rows upon rows of data, it doesn't give me a good idea of what's actually going on. I guess I still have many questions: why does month need to be filtered by 12 when it's already numbered? Why do you need grouped tables? It's all confusing. — Shawn Hemelstrand, Sep 03 '21 at 12:38
It's numbered by month, and then the filter is filtering on the count `n()` of each destination. In other words, it is making sure that there are exactly 12 observations per destination. Grouped tables, as I explained in my answer, make it so that you can perform a function/operation on a specific group. For example, if I want to count the number of destinations then I would group by the destinations and then use the summarise function to count. If I wanted to count the number of destinations per month then I would group by destinations and month then count. It's actually not confusing at all. — Hansel Palencia, Sep 03 '21 at 12:41
This is what ungroup says where I looked it up: "Most data operations are done on groups defined by variables. group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed "by group". ungroup() removes grouping." How is this not confusing? It literally says what I just did. — Shawn Hemelstrand, Sep 03 '21 at 12:53
Right, so how do you go from "ungroup ungroups the data", as I previously said, to what you said earlier, which is "it groups it into a table"? These are two sentences that are contradictory in their meaning. And no, you're not really being considerate considering this is now the third time you've placed a condescending comment about what an idiot I must be for not understanding. If you don't want to help, there are thousands of other posts you can be rude about instead. — Shawn Hemelstrand, Sep 03 '21 at 13:14
Apologies, I made a slight error and included the information for ?group_by instead of ?ungroup — Hansel Palencia, Sep 03 '21 at 13:19
I think it is worth noting as well that `summarize()` will typically change the status of your groups. It takes a block of rows and (generally) consolidates them into one. That means you no longer have the groups you started with; each one is now a single entry. So it is quite common to `group_by()` again after a `summarize()` to redefine the "chunks" of rows that you are interested in. — , Sep 03 '21 at 13:23

score 2 · Answer 2 · answered Sep 03 '21 at 12:09

I agree the plot is difficult to read. I think it's easier to understand with a higher-contrast color scale (white to red), smaller y-axis labels, and month names instead of numbers, like this:

library(ggplot2)
library(dplyr)

nycflights13::flights %>%
  group_by(month, dest) %>%
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  group_by(dest) %>%
  filter(n() == 12) %>%
  ungroup() %>%
  mutate(dest = reorder(dest, dep_delay)) %>%
  ggplot(aes(
    x = factor(month.name[month], levels = month.name),
    y = dest,
    fill = dep_delay
  )) +
  geom_tile() +
  labs(x = "Month", y = "Destination", fill = "Departure Delay") +
  scale_fill_gradient(low = "white", high = "red") +
  theme(
    axis.text.y = element_text(size = 4),
    axis.text.x = element_text(size = 7),
    legend.position = "bottom"
  )

That makes it clearer that June, July, and December are the worst months.
