5

When running unnest on a data.frame is there a way to add the group name of nested item to the individual columns it contains (either as a suffix or prefix). Or does renaming have to be done manually via rename?

This is particularly relevant with 'unnesting' multiple groups that contain columns with the same names.

In the example below the base aggregate command does this well (eg. Petal.Length.mn), but I couldn't find an option to get unnest to do the same thing?

I'm using nest with purrr::map as I want the flexibility to mix functions, eg. calculate means and sd on a couple of variables and also run a t test to look at differences between them.


library(dplyr, warn.conflicts = FALSE)

msd_c <- function(x) c(mn = mean(x), sd = sd(x))
msd_df <- function(x) bind_rows(c(mn = mean(x), sd = sd(x)))

aggregate(cbind(Petal.Length, Petal.Width) ~ Species, 
          data = iris, FUN = msd_c)
#>      Species Petal.Length.mn Petal.Length.sd Petal.Width.mn Petal.Width.sd
#> 1     setosa       1.4620000       0.1736640      0.2460000      0.1053856
#> 2 versicolor       4.2600000       0.4699110      1.3260000      0.1977527
#> 3  virginica       5.5520000       0.5518947      2.0260000      0.2746501

iris %>% 
  select(Petal.Length:Species) %>% 
  group_by(Species) %>% 
  tidyr::nest() %>% 
  mutate(
    Petal.Length = purrr::map(data, ~ msd_df(.$Petal.Length)),
    Petal.Width = purrr::map(data, ~ msd_df(.$Petal.Width)),
    Correlation = purrr::map(data, ~ broom::tidy(cor.test(.$Petal.Length, .$Petal.Width))),
  ) %>% 
  select(-data) %>% 
  tidyr::unnest(c(Petal.Length, Petal.Width, Correlation), names_repair = tidyr::tidyr_legacy)
#> # A tibble: 3 x 13
#> # Groups:   Species [3]
#>   Species    mn    sd   mn1   sd1 estimate statistic  p.value parameter conf.low
#>   <fct>   <dbl> <dbl> <dbl> <dbl>    <dbl>     <dbl>    <dbl>     <int>    <dbl>
#> 1 setosa   1.46 0.174 0.246 0.105    0.332      2.44 1.86e- 2        48   0.0587
#> 2 versic~  4.26 0.470 1.33  0.198    0.787      8.83 1.27e-11        48   0.651 
#> 3 virgin~  5.55 0.552 2.03  0.275    0.322      2.36 2.25e- 2        48   0.0481
#> # ... with 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>

Created on 2020-05-20 by the reprex package (v0.3.0)

JWilliman
  • 3,558
  • 32
  • 36

3 Answers3

11

The answer to this was somewhat obvious, use the names_sep option rather than the names_repair option. As quoted from the nest help menu under names_sep:

If a string, the inner and outer names will be used together. In nest(), the names of the new outer columns will be formed by pasting together the outer and the inner column names, separated by names_sep. In unnest(), the new inner names will have the outer names (+ names_sep) automatically stripped. This makes names_sep roughly symmetric between nesting and unnesting.


library(dplyr, warn.conflicts = FALSE)

msd_c <- function(x) c(mn = mean(x), sd = sd(x))
msd_df <- function(x) bind_rows(c(mn = mean(x), sd = sd(x)))

iris %>% 
  select(Petal.Length:Species) %>% 
  group_by(Species) %>% 
  tidyr::nest() %>% 
  mutate(
    Petal.Length = purrr::map(data, ~ msd_df(.$Petal.Length)),
    Petal.Width = purrr::map(data, ~ msd_df(.$Petal.Width)),
    Correlation = purrr::map(data, ~ broom::tidy(cor.test(.$Petal.Length, .$Petal.Width))),
  ) %>% 
  select(-data) %>% 
  tidyr::unnest(c(Petal.Length, Petal.Width, Correlation), names_sep = ".")
#> # A tibble: 3 x 13
#> # Groups:   Species [3]
#>   Species Petal.Length.mn Petal.Length.sd Petal.Width.mn Petal.Width.sd
#>   <fct>             <dbl>           <dbl>          <dbl>          <dbl>
#> 1 setosa             1.46           0.174          0.246          0.105
#> 2 versic~            4.26           0.470          1.33           0.198
#> 3 virgin~            5.55           0.552          2.03           0.275
#> # ... with 8 more variables: Correlation.estimate <dbl>,
#> #   Correlation.statistic <dbl>, Correlation.p.value <dbl>,
#> #   Correlation.parameter <int>, Correlation.conf.low <dbl>,
#> #   Correlation.conf.high <dbl>, Correlation.method <chr>,
#> #   Correlation.alternative <chr>

Created on 2020-06-10 by the reprex package (v0.3.0)

JWilliman
  • 3,558
  • 32
  • 36
0

To apply multiple functions to multiple columns I would use summarise_at/mutate_at instead of nesting and unnesting data.

For example, in this case we can do :

library(dplyr)
iris %>% 
  group_by(Species) %>% 
  summarise_at(vars(Petal.Length:Petal.Width), list(mn = mean, sd = sd))


#  Species    Petal.Length_mn Petal.Width_mn Petal.Length_sd Petal.Width_sd
#  <fct>                <dbl>          <dbl>           <dbl>          <dbl>
#1 setosa                1.46          0.246           0.174          0.105
#2 versicolor            4.26          1.33            0.470          0.198
#3 virginica             5.55          2.03            0.552          0.275

This automatically adds a prefix to column names which we are applying the function to. Also, this is equivalent dplyr version of aggregate function you tried.

Also note that summarise_at will soon be replaced with across in upcoming version of dplyr.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thanks. This works if you want the same functions to be applied to all variables (as in my `aggregate` example). But it doesn't work if you want to mix (complex) functions - hence why I was using `purrr::map`. I couldn't add in the 'Ratio' column above or add the output from a t.test for example. – JWilliman May 20 '20 at 01:18
  • Sorry, I don't get it. As far as I see you are passing the same function i.e `msd_c` or `msd_df` for multiple columns. As far as `Ratio` column is concerned you can just create that column and pass it to same function. `iris %>% mutate(Ratio = Petal.Length/Petal.Width) %>% group_by(Species) %>% summarise_at(vars(Petal.Length,Petal.Width, Ratio), list(mn = mean, sd = sd))` – Ronak Shah May 20 '20 at 01:31
  • Sorry I didn't come up with a good example. Have edited the question to replace 'Ratio' with 'Correlation' which makes it more obvious that the functions are quite different. – JWilliman May 20 '20 at 01:53
  • @JWilliman Well, you can do `iris %>% group_by(Species) %>% mutate_at(vars(Petal.Length:Petal.Width), list(mn = mean, sd = sd)) %>% mutate(Correlation = list(broom::tidy(cor.test(Petal.Length, Petal.Width)))) %>% slice(1L) %>% unnest_wider(Correlation)` in that case. I still don't get the purpose of `nest` and `unnest`. Maybe I am missing something. – Ronak Shah May 20 '20 at 02:40
0

You can use setNames like below. It is a little bit wordy, but it seems like you plan to specify each function for each column, this may be of interest.

iris %>% 
  select(Petal.Length:Species) %>% 
  group_by(Species) %>% 
  tidyr::nest() %>% 
  mutate(
    Petal.Length = purrr::map(data, ~ msd_df(.x$Petal.Length) %>%
                                setNames(paste0("Petal.Length.", names(.)))),
    Petal.Width = purrr::map(data, ~ msd_df(.$Petal.Width) %>%
                                setNames(paste0("Petal.Width.", names(.)))),
    Ratio = purrr::map(data, ~ msd_df(.$Petal.Length/.$Petal.Width) %>%
                               setNames(paste0("Ratio.", names(.))))
  ) %>% 
  select(-data) %>% 
  tidyr::unnest(c(Petal.Length, Petal.Width, Ratio))
# A tibble: 3 x 7
# Groups:   Species [3]
  Species    Petal.Length.mn Petal.Length.sd Petal.Width.mn Petal.Width.sd Ratio.mn Ratio.sd
  <fct>                <dbl>           <dbl>          <dbl>          <dbl>    <dbl>    <dbl>
1 setosa                1.46           0.174          0.246          0.105     6.91    2.85 
2 versicolor            4.26           0.470          1.33           0.198     3.24    0.312
3 virginica             5.55           0.552          2.03           0.275     2.78    0.407

Or modify your function to allow it being able to modify the column name like this.

msd_df_name <- function(x, name){
  bind_rows(c(mn = mean(x), sd = sd(x))) %>%
    setNames(paste0(name, ".", names(.)))
}

iris %>% 
  select(Petal.Length:Species) %>% 
  group_by(Species) %>% 
  tidyr::nest() %>% 
  mutate(
    Petal.Length = purrr::map(data, ~ msd_df_name(.x$Petal.Length, "Petal.Length")),
    Petal.Width = purrr::map(data, ~ msd_df_name(.$Petal.Width, "Petal.Width")),
    Ratio = purrr::map(data, ~ msd_df_name(.$Petal.Length/.$Petal.Width, "Ratio"))
  ) %>% 
  select(-data) %>% 
  tidyr::unnest(c(Petal.Length, Petal.Width, Ratio))
www
  • 38,575
  • 12
  • 48
  • 84
  • Thank you. This probably is the way to go. I just wonder if I was missing something in the `names.repair` argument of `unnest`. It says that you can pass a function (for custom name repair), or 'a purrr-style anonymous function' but I wasn't sure how to write one that would capture the group name. – JWilliman May 20 '20 at 01:37
  • @JWilliman I don't know how to use `names.repair`, either. It would be great if someone can share their insights. – www May 20 '20 at 01:38
  • Adding the group name seems to me to be the most intuitive way of avoiding duplicate names IMO, but perhaps it isn't so easy to implement. I'll see if someone comes up with a solution using it, otherwise I'll tick your answer. – JWilliman May 20 '20 at 01:44
  • I discovered that I'd overlooked the `names_sep` option which does exactly what I wanted. Regarding the `names_repair` option this is passed to `vctrs::vec_as_names` – JWilliman Jun 10 '20 at 00:15