I'm new to purrr, Hadley's promising functional programming R library. I'm trying to take a grouped and split dataframe and run a t-test on a variable. An example using a sample dataset might look like this.

mtcars %>% 
  dplyr::select(cyl, mpg) %>% 
  group_by(as.character(cyl)) %>% 
  split(.$cyl) %>% 
  map(~ t.test(.$`4`$mpg, .$`6`$mpg))

This results in the following error:

Error in var(x) : 'x' is NULL
In addition: Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
2: In mean.default(x) : argument is not numeric or logical: returning NA

Am I just misunderstanding how map works? Or is there a better way to think about this?

  • One thing I noticed is that the examples in the `map` docs show `map` operating on each list item from `split` separately, but your example is trying to operate between list items. – steveb Feb 22 '16 at 17:08
  • Yes, very true. Any idea if there is an easy way to operate between list items? – Samarth Bhaskar Feb 22 '16 at 17:36
  • I am not sure how you would get it to work with `map`, but you could capture the results up to `split` and then use `lapply` on that result. – steveb Feb 22 '16 at 17:40
  • I was thinking something like the following. If you capture the results up to `split` in `mtcars_split` then you could do something like `lapply(names(mtcars_split)[2:length(mtcars_split)], function(x) { t.test(mtcars_split[['4']]$mpg, mtcars_split[[x]]$mpg) })`. I suspect there is a cleaner (i.e. more readable) way to do this though. – steveb Feb 22 '16 at 17:51
  • this might be helpful additional reading http://stackoverflow.com/questions/35505187/comparison-between-dplyrdo-purrrmap-what-advantages – timelyportfolio Feb 22 '16 at 22:18
  • Why is it being called "Hadley's" (with a link no less) when its creator _and_ maintainer was/is Lionel Henry? – IRTFM May 05 '18 at 02:12

3 Answers

I don't fully understand the expected result, but this might be a starting point for an answer. Inside a formula passed to map(), purrr uses .x to refer to the current list element.

Here is one way to accomplish what I think you are trying to do with just purrr.

mtcars %>%
  split(as.character(.$cyl)) %>%
  map(~t.test(.x$mpg)) 
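
To make the per-element behaviour concrete, here is a minimal sketch (the names `groups` and `res` are mine, not from the question). It shows why the original call fails: inside map() each `.x` is a single group's data frame, so a lookup like ``.x$`4` `` returns NULL. Comparing two groups means indexing the split list itself.

```r
library(purrr)

# Each list element is the data frame for one cylinder count ("4", "6", "8").
groups <- split(mtcars, mtcars$cyl)

# Inside map(), .x is ONE group, which has no column named `4`,
# so the lookup is NULL for every element.
no_such_column <- map_lgl(groups, ~ is.null(.x$`4`))

# A between-group comparison has to index the list directly.
res <- t.test(groups[["4"]]$mpg, groups[["6"]]$mpg)
res$estimate
```

This is only one way to phrase it; the point is that the two-sample call lives outside the per-element mapping.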

But, purrr::by_slice() pairs nicely with dplyr::group_by().

library(purrr)
library(dplyr)

mtcars %>% 
  dplyr::select(cyl, mpg) %>% 
  group_by(as.character(cyl)) %>%
  by_slice(~ t.test(.x$mpg))

Or, you could skip purrr entirely using dplyr::summarise() and store each test in a list column.

library(dplyr)

mtcars %>% 
  dplyr::select(cyl, mpg) %>% 
  group_by(as.character(cyl)) %>%
  # wrap the htest object in list() so each group gets one list-column entry
  summarise(t_test = list(t.test(mpg)))

If the nested list column is confusing, broom can turn the results into a tidy data.frame summary.

purrr + broom + tidyr

library(broom)
library(tidyr)
mtcars %>%
  group_by(as.character(cyl)) %>%
  by_slice(~tidy(t.test(.x$mpg))) %>%
  unnest()

dplyr + broom

library(broom)

mtcars %>% 
  dplyr::select(cyl, mpg) %>% 
  group_by(as.character(cyl)) %>%
  do(tidy(t.test(.$mpg)))
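
Since by_slice() is reported as deprecated in the comments below, here is a hedged sketch of the same per-group summary using tidyr::nest() and purrr::map() instead. The column name `test` and the object name `per_group` are my own choices, and this assumes a reasonably recent tidyr (1.0 or later).

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

per_group <- mtcars %>%
  dplyr::select(cyl, mpg) %>%
  group_by(cyl) %>%
  nest() %>%                                          # one row per cyl, data in a list column
  mutate(test = map(data, ~ tidy(t.test(.x$mpg)))) %>% # one-sample t-test per group
  dplyr::select(-data) %>%
  unnest(test) %>%                                    # spread the tidy() columns out
  ungroup()

per_group
```

The result should have one row per cylinder count, with the usual broom columns (estimate, statistic, p.value, and so on).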

Edited to include response to comment

With pipes, we can get carried away quite quickly. I think Walt did a nice job in his answer, but I wanted to make sure that I provided a purrr-ty answer. I hope the use of pipeR is not overly confusing.

library(purrr)
library(dplyr)
library(broom)
library(tidyr)
library(pipeR)

mtcars %>>%
  (split(.,.$cyl)) %>>%
  (split_cyl~
    names(split_cyl) %>>%
     (
       cross_d(
         list(against=.,tested=.),
         .filter = `==`
       )
     ) %>>%
     by_row(
       ~tidy(t.test(split_cyl[[.x$tested]]$mpg,split_cyl[[.x$against]]$mpg))
     )
  ) %>>%
  unnest()
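
For readers without pipeR, or on a purrr version where cross_d() and by_row() are no longer available, a rough equivalent can be sketched with expand.grid() and map2(). The names `pairs`, `results` and `pairwise` are mine; like the chain above, it keeps every ordered pair of distinct groups.

```r
library(purrr)
library(dplyr)
library(broom)

# mpg values per cylinder count, as a named list of numeric vectors.
split_cyl <- split(mtcars$mpg, mtcars$cyl)

# All ordered pairs of group names, dropping pairs of a group with itself.
pairs <- expand.grid(
  tested = names(split_cyl),
  against = names(split_cyl),
  stringsAsFactors = FALSE
) %>%
  filter(tested != against)

# One Welch two-sample t-test per pair, tidied into a one-row data frame.
results <- map2(
  pairs$tested, pairs$against,
  ~ tidy(t.test(split_cyl[[.x]], split_cyl[[.y]]))
) %>%
  bind_rows()

pairwise <- bind_cols(pairs, results)
pairwise
```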
timelyportfolio
  • This is great, thanks very much! I didn't know about that dplyr::data_frame usage. That'll come in handy – Samarth Bhaskar Feb 22 '16 at 22:24
  • The broom updates are very helpful! A quick clarification: this result set shows a `t.test` for each slice or group. How can I compare one group to another. Something like `t.test(4$mpg, 6$mpg)`? – Samarth Bhaskar Feb 22 '16 at 22:49
  • The **skip purrr entirely using dplyr:::summarise()** does not work: `Error: Variables must be length 1 or 9. Problem variables: 'as.character(cyl)'`; `summarise` does not like the returned data frames. I like the purrr + dplyr solution :) – Costin Nov 14 '16 at 00:01
  • `by_slice` and `by_row` in `purrr` are now deprecated. So the workable solution now is to use `dplyr` + `broom` to summarize the grouped statistics. – raymkchow May 16 '18 at 02:11

Especially when dealing with pipes that require multiple inputs (we don't have Haskell's Arrows here), I find it easier to reason by types/signatures first, then encapsulate logic in functions (which you can unit test), then write a concise chain.

In this case you want to compare all possible pairs of vectors, so I would set a goal of writing a function that takes a pair (i.e. a list of 2) of vectors and returns the 2-way t.test of them.

Once you've done this, you just need some glue. So the plan is:

  1. Write a function that takes a list of vectors and performs the 2-way t-test.
  2. Write a function/pipe that fetches the vectors from mtcars (easy).
  3. Map the above over the list of pairs.

It's important to have this plan before writing any code. Things are somewhat obscured by the fact that R is not strongly typed, but this way you reason about "types" first, implementation second.

Step 1

t.test takes dots, so we use purrr::lift to have it take a list. Since we don't want to match on the names of the elements of the list, we use .unnamed = TRUE. We also make it extra clear that we're using the t.test function with an arity of 2 (though this extra step is not needed for the code to work).

t.test2 <- function(x, y) t.test(x, y)
liftedTT <- lift(t.test2, .unnamed = TRUE)
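
To show what the lifted function does, here is a base R sketch of the same idea using do.call(). The helper name `call_with_list` is hypothetical, and `pair` is just illustrative data. As far as I know, lift() has been deprecated in recent purrr releases, so something like this may be a useful fallback.

```r
# Two-argument wrapper, as in the step above.
t.test2 <- function(x, y) t.test(x, y)

# Hypothetical helper: call f with the elements of a list as positional
# arguments, ignoring the list's names (the role of .unnamed = TRUE).
call_with_list <- function(f, args) do.call(f, unname(args))

# A named list of two vectors; the names are dropped before the call.
pair <- list(
  a = mtcars$mpg[mtcars$cyl == 4],
  b = mtcars$mpg[mtcars$cyl == 6]
)

res <- call_with_list(t.test2, pair)
res$p.value
```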

Step 2

Wrap the function we got in step 1 into a functional chain that takes a simple pair. Here I use indexes; it should be easy to use cyl factor levels instead, but I don't have time to figure it out.

doTT <- function(pair) {
  mtcars %>%
    split(as.character(.$cyl)) %>%
    map(~ dplyr::select(., mpg)) %>%
    # magrittr's extract() is `[`; qualify it so tidyr::extract() cannot mask it
    magrittr::extract(pair) %>%
    liftedTT %>%
    broom::tidy
}

Step 3

Now that we have all our lego pieces ready, composition is trivial.

1:length(unique(mtcars$cyl)) %>% 
  combn(2) %>% 
  as.data.frame %>% 
  as.list %>% 
  map(~ doTT(.))

$V1
  estimate estimate1 estimate2 statistic      p.value parameter conf.low conf.high
1 6.920779  26.66364  19.74286  4.719059 0.0004048495  12.95598 3.751376  10.09018

$V2
  estimate estimate1 estimate2 statistic      p.value parameter conf.low conf.high
1 11.56364  26.66364      15.1  7.596664 1.641348e-06  14.96675 8.318518  14.80876

$V3
  estimate estimate1 estimate2 statistic      p.value parameter conf.low conf.high
1 4.642857  19.74286      15.1  5.291135 4.540355e-05  18.50248 2.802925  6.482789

There's quite a bit here to clean up, mainly using factor levels and preserving them in the output (and not using globals in the second function) but I think the core of what you wanted is here. The trick not to get lost, in my experience, is to work from the inside out.

Roberto
  • this doesn't work for me. `Error in UseMethod("extract_") : no applicable method for 'extract_' applied to an object of class "list"`. i'm using packages: [1] dplyr_0.5.0 purrr_0.2.2 readr_1.0.0 tidyr_0.6.0 [5] tibble_1.2 ggplot2_2.1.0.9001 tidyverse_1.0.0 – Dominik Oct 09 '16 at 20:00

To perform the two-sample t-tests, you have to create the combinations of the numbers of cylinders. I don't see a way to create the combinations using purrr functions alone. However, one way that uses only purrr and base R functions is:

library(purrr)
t_test2 <- mtcars %>%
  split(.$cyl) %>%
  transpose() %>%
  .[["mpg"]] %>%
  (function(x) {
    combn(names(x), m = 2, simplify = FALSE,
          FUN = function(y) t.test(flatten_dbl(x[y[1]]), flatten_dbl(x[y[2]])))
  })

although this does seem a bit contrived.

A similar approach which uses only base R functions with chaining is

t_test <- mtcars %>%
  split(.$cyl) %>%
  (function(x) combn(names(x), m = 2, function(y) x[y], simplify = FALSE)) %>%
  lapply(function(x) t.test(x[[1]]$mpg, x[[2]]$mpg))
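
One drawback of both versions is that the resulting list is unnamed, so it is not obvious which comparison each element holds. A small base R sketch that labels each pair follows; the names `groups`, `pairs`, `tests` and `p_values` are my own, and the "4 vs 6" label format is just a suggestion.

```r
# mpg values per cylinder count, as a named list of numeric vectors.
groups <- split(mtcars$mpg, mtcars$cyl)

# All unordered pairs of group names, each labelled like "4 vs 6".
pairs <- combn(names(groups), 2, simplify = FALSE)
names(pairs) <- vapply(pairs, paste, character(1), collapse = " vs ")

# One Welch two-sample t-test per labelled pair.
tests <- lapply(pairs, function(p) t.test(groups[[p[1]]], groups[[p[2]]]))

# Pull out the p-values as a named numeric vector.
p_values <- vapply(tests, function(tt) tt$p.value, numeric(1))
p_values
```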
WaltS