
I'm looking at some code:

df1 <- inner_join(metadata, otu_counts, by="sample_id") %>%
  inner_join(., taxonomy, by="otu") %>% 
  group_by(sample_id) %>%
  mutate(rel_abund = count / sum(count)) %>% 
  ungroup() %>% 
  select(-count)

I completely understand this first chunk, but I'm new to R and can only assume that the '.groups = "drop"' in the second chunk does the same thing as the ungroup() in the first.

If so, then does it have to do with the last function being a summarize() function?

df2 <- df1 %>%
  filter(level=="phylum") %>%
  group_by(disease_stat, sample_id, taxon) %>%
  summarize(rel_abund = sum(rel_abund), .groups="drop") %>%
  group_by(disease_stat, taxon) %>%
  summarize(mean_rel_abund = 100*mean(rel_abund), .groups="drop") 

Can someone explain?

UPDATE: I realize that the first .groups = "drop" eliminates a newly created variable which was sample_id. Is there more to this?

Antonio
  • We would have to see a `dput(head(df))` in order to tell exactly what is happening with the newly created variable. My guess is that summarizing automatically gets rid of variables/columns that are not included in the grouping or output, and that the issue does not pertain to the `.groups = "drop"` argument. – dcsuka Aug 03 '22 at 00:52

1 Answer


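This is a special behavior of summarize(). When the data are grouped by more than one variable, summarize() by default drops only the last grouping variable, so the output is still grouped by the remaining ones. With two grouping variables, as in the example here, the result stays grouped by the first.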

library(wec)
library(dplyr)

data(PUMS)

PUMS %>%
  group_by(race, education.cat) %>%
  summarise(hi = mean(wage))

# # A tibble: 8 × 3
# # Groups:   race [4]
#   race     education.cat     hi
#   <fct>    <fct>          <dbl>
# 1 Hispanic High school   35149.
# 2 Hispanic Degree        52344.
# 3 Black    High school   30552.
# 4 Black    Degree        48243.
# 5 Asian    High school   35350 
# 6 Asian    Degree        78213.
# 7 White    High school   38532.
# 8 White    Degree        69135.

Notice that the data frame above is still grouped by race (4 groups). If you pass .groups = "drop" to summarize(), the output values are identical but the data frame is no longer grouped.

PUMS %>%
  group_by(race, education.cat) %>%
  summarise(hi = mean(wage), .groups = "drop")

# # A tibble: 8 × 3
#   race     education.cat     hi
#   <fct>    <fct>          <dbl>
# 1 Hispanic High school   35149.
# 2 Hispanic Degree        52344.
# 3 Black    High school   30552.
# 4 Black    Degree        48243.
# 5 Asian    High school   35350 
# 6 Asian    Degree        78213.
# 7 White    High school   38532.
# 8 White    Degree        69135.
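Rather than reading the "Groups:" line in the printed header, the grouping can also be checked directly with group_vars(). The following is a minimal sketch, assuming dplyr 1.0.0 or later; the tibble `toy` is hypothetical, with column names that mirror the question and values invented purely for illustration.

```r
library(dplyr)

# Hypothetical toy data: names mirror the question, values are made up.
toy <- tibble(
  disease_stat = c("case", "case", "case", "control", "control", "control"),
  sample_id    = c("s1", "s1", "s2", "s3", "s3", "s4"),
  taxon        = c("A", "A", "A", "A", "B", "B"),
  rel_abund    = c(0.2, 0.3, 0.5, 0.4, 0.6, 1.0)
)

grouped <- toy %>% group_by(disease_stat, sample_id, taxon)

# Default behavior: only the last grouping variable (taxon) is dropped.
default_out <- grouped %>% summarize(rel_abund = sum(rel_abund))
group_vars(default_out)   # "disease_stat" "sample_id"

# With .groups = "drop", no grouping remains at all.
dropped_out <- grouped %>% summarize(rel_abund = sum(rel_abund), .groups = "drop")
group_vars(dropped_out)   # character(0)

# The sample_id column itself is still present after the drop.
names(dropped_out)
```

In this sketch .groups = "drop" removes only the grouping, not the sample_id column. A column like sample_id disappears only in a later summarize() whose group_by() no longer includes it.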

The mutate() function in your first example does not have a .groups argument and leaves the existing grouping in place, so you need a separate ungroup() call if you want to remove the grouping afterwards.
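To illustrate the mutate() case, here is a small sketch using the built-in mtcars data set. The choice of cyl as the grouping column and the name mpg_share are arbitrary assumptions for the example and are not taken from the question.

```r
library(dplyr)

# mtcars ships with R; cyl is an arbitrary grouping column for this sketch.
with_share <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpg_share = mpg / sum(mpg))   # share of mpg within each cyl group

group_vars(with_share)             # "cyl": mutate() left the grouping in place
group_vars(ungroup(with_share))    # character(0): ungroup() removed it
```

Forgetting the ungroup() is easy to miss, because any later grouped verb would silently keep operating within each cyl group.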

dcsuka