I'm trying to simplify a piece of code in my script.

I want to group by each possible combination of two categorical variables and compute the mean of my explanatory variable for each group.

Example using the mpg dataset from ggplot2:

    library(tidyverse)

    mpg %>% group_by(manufacturer, model) %>% summarise(mean = mean(hwy))
    mpg %>% group_by(manufacturer, year) %>% summarise(mean = mean(hwy))
    mpg %>% group_by(manufacturer, cyl) %>% summarise(mean = mean(hwy))

(this would continue until every pair of categorical columns has been covered)

    mpg %>% group_by(cyl, year) %>% summarise(mean = mean(hwy))

etc...

My actual dataset has hundreds of categorical variables, so I would like to automate the process, for example with a for loop or purrr.

Thanks

JmezR

1 Answer

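This uses purrr to select the character and factor columns and then combn() to build every pair of them. Note that in mpg, year and cyl are stored as integers, so they are not picked up unless they are converted to factors first.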

library(ggplot2)
library(purrr)
library(dplyr)

# Names of the character/factor columns, then a summary of hwy for every pair.
map_lgl(mpg, ~ is.character(.) | is.factor(.)) %>%
  names(.)[.] %>%
  combn(2, function(x) {
    mpg %>% group_by_at(x) %>% summarize(mean = mean(hwy))
  }, simplify = FALSE)

Note that this can become unwieldy: with 100 categorical columns, choose(100, 2) gives 4,950 pairs.
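
With that many results, it may help to name each list element after the pair of columns it summarises. The following is only a sketch of that idea, not part of the answer's code: the names `cat_cols`, `pairs` and `results` are illustrative, and it assumes dplyr 1.0.0 or later for `across()` and the `.groups` argument.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)
library(purrr)

# Treat character and factor columns as categorical (an assumption:
# integer-coded columns such as year or cyl would need converting first).
cat_cols <- names(mpg)[map_lgl(mpg, ~ is.character(.x) || is.factor(.x))]

# Every unordered pair of categorical columns, as a list of length-2 vectors.
pairs <- combn(cat_cols, 2, simplify = FALSE)

# Label each pair, e.g. "manufacturer_model".
pair_names <- map_chr(pairs, paste, collapse = "_")

# One summary table per pair, stored under the pair's label.
results <- map(set_names(pairs, pair_names), function(cols) {
  mpg %>%
    group_by(across(all_of(cols))) %>%
    summarise(mean = mean(hwy), .groups = "drop")
})

# The number of tables should match choose(n, 2).
length(results) == choose(length(cat_cols), 2)
```

Individual tables can then be pulled out by name, for example `results[["manufacturer_model"]]`.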

Cole
  • Thanks. I intend to filter the resulting list of data frames on certain criteria, e.g. the maximum mean value in each data frame, to make it more manageable. – JmezR Nov 13 '19 at 13:58
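
The filtering step mentioned in the comment could be sketched with purrr's `keep()`. This is a rough illustration under stated assumptions: `results` here is a small stand-in list built by hand, `threshold` is a hypothetical cut-off, and dplyr 1.0.0 or later is assumed for the `.groups` argument.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)
library(purrr)

# Two example summaries standing in for the full list of results.
results <- list(
  drv_fl    = mpg %>% group_by(drv, fl) %>%
    summarise(mean = mean(hwy), .groups = "drop"),
  drv_class = mpg %>% group_by(drv, class) %>%
    summarise(mean = mean(hwy), .groups = "drop")
)

threshold <- 35  # hypothetical cut-off, chosen only for illustration

# Keep only the tables whose largest group mean exceeds the threshold.
kept <- keep(results, ~ max(.x$mean) > threshold)

names(kept)
```

Because `keep()` preserves element names, the surviving tables remain identifiable by the pair of columns they came from.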