I'm working through the exercises in R for Data Science, along with the solutions on this website: https://jrnold.github.io/r4ds-exercise-solutions/exploratory-data-analysis.html.
The question I'm working on is:
Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?
This is the data they are working with, which comes from the "flights" dataset in the nycflights13 package:
dput(head(flights))
structure(list(year = c(2013L, 2013L, 2013L, 2013L, 2013L, 2013L
), month = c(1L, 1L, 1L, 1L, 1L, 1L), day = c(1L, 1L, 1L, 1L,
1L, 1L), dep_time = c(517L, 533L, 542L, 544L, 554L, 554L), sched_dep_time = c(515L,
529L, 540L, 545L, 600L, 558L), dep_delay = c(2, 4, 2, -1, -6,
-4), arr_time = c(830L, 850L, 923L, 1004L, 812L, 740L), sched_arr_time = c(819L,
830L, 850L, 1022L, 837L, 728L), arr_delay = c(11, 20, 33, -18,
-25, 12), carrier = c("UA", "UA", "AA", "B6", "DL", "UA"), flight = c(1545L,
1714L, 1141L, 725L, 461L, 1696L), tailnum = c("N14228", "N24211",
"N619AA", "N804JB", "N668DN", "N39463"), origin = c("EWR", "LGA",
"JFK", "JFK", "LGA", "EWR"), dest = c("IAH", "IAH", "MIA", "BQN",
"ATL", "ORD"), air_time = c(227, 227, 160, 183, 116, 150), distance = c(1400,
1416, 1089, 1576, 762, 719), hour = c(5, 5, 5, 5, 6, 5), minute = c(15,
29, 40, 45, 0, 58), time_hour = structure(c(1357034400, 1357034400,
1357034400, 1357034400, 1357038000, 1357034400), tzone = "America/New_York", class = c("POSIXct",
"POSIXt"))), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
So this is the answer they give in the link I shared:
flights %>%
  group_by(month, dest) %>% # This gives us (month, dest) pairs
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  group_by(dest) %>% # group all (month, dest) pairs by dest ..
  filter(n() == 12) %>% # and only select those that have one entry per month
  ungroup() %>%
  mutate(dest = reorder(dest, dep_delay)) %>%
  ggplot(aes(x = factor(month), y = dest, fill = dep_delay)) +
    geom_tile() +
    labs(x = "Month", y = "Destination", fill = "Departure Delay")
#> `summarise()` regrouping output by 'month' (override with `.groups` argument)
Which yields this:
I have so many questions about this:
- First off, why do they group twice? I see that there is an initial grouping by month and dest, but then they group again by dest two lines down.
- Next, what is the purpose of ungrouping afterwards? Maybe the ungroup() function serves a purpose I'm not aware of, but it seems counterintuitive.
- Finally, the plot still doesn't look "clean" in the way the book seems to want. Sure, it shows a heatmap by month, but the destinations plotted on the y-axis just look like alphabet soup, so it's hard to derive any actual context.
I guess my two major problems right now are that I don't understand how they came up with this, and I don't understand why this is an acceptable answer given that it doesn't show much.
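To make the question concrete, here is a minimal sketch of the same pipeline steps on a small made-up table, so the intermediate results can be inspected one step at a time. The `toy` data frame, its values, and the threshold of 3 months (instead of 12) are my own assumptions for illustration and are not part of the linked answer.

```r
library(dplyr)

# Made-up toy data (an assumption, not the real flights data):
# "AAA" has flights in all 3 months, "BBB" only in 2 of them.
toy <- tibble(
  month     = c(1, 1, 2, 2, 3, 1, 2),
  dest      = c("AAA", "AAA", "AAA", "AAA", "AAA", "BBB", "BBB"),
  dep_delay = c(10, 20, 5, NA, 0, 30, 40)
)

# Step 1: collapse to one row per (month, dest) pair with the mean delay.
monthly <- toy %>%
  group_by(month, dest) %>%
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE), .groups = "drop")

# Step 2: regroup by dest so that n() counts rows (months) per destination,
# then keep only destinations that appear in every month (3 here, 12 in the answer).
complete <- monthly %>%
  group_by(dest) %>%
  filter(n() == 3) %>%
  ungroup()

# Step 3: with the grouping removed, mutate() works on the whole table,
# so reorder() can rank all destinations against each other by delay.
result <- complete %>%
  mutate(dest = reorder(dest, dep_delay))

print(monthly)
print(result)
```

Printing `monthly` and `result` side by side shows which rows each step keeps: in this toy case "BBB" is dropped at step 2 because it has only two monthly rows.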