
Ultimately, I am trying to achieve something similar to the following, but leveraging dplyr instead of plyr:

library(dplyr)  
probs = seq(0, 1, 0.1)

plyr::ldply(tapply(mtcars$mpg, 
                   mtcars$cyl, 
                   function(x) { quantile(x, probs = probs) }))

#   .id   0%   10%   20%   30%   40%  50%   60%   70%   80%   90% 100%
# 1   4 21.4 21.50 22.80 22.80 24.40 26.0 27.30 30.40 30.40 32.40 33.9
# 2   6 17.8 17.98 18.32 18.98 19.40 19.7 20.48 21.00 21.00 21.16 21.4
# 3   8 10.4 11.27 13.90 14.66 15.04 15.2 15.44 15.86 16.76 18.28 19.2

The best dplyr equivalent I can come up with is something like this:

library(tidyr)
probs = seq(0, 1, 0.1)

mtcars %>%
  group_by(cyl) %>%
  do(data.frame(prob = probs, stat = quantile(.$mpg, probs = probs))) %>%
  spread(prob, stat)

#   cyl    0   0.1   0.2   0.3   0.4  0.5   0.6   0.7   0.8   0.9    1
# 1   4 21.4 21.50 22.80 22.80 24.40 26.0 27.30 30.40 30.40 32.40 33.9
# 2   6 17.8 17.98 18.32 18.98 19.40 19.7 20.48 21.00 21.00 21.16 21.4
# 3   8 10.4 11.27 13.90 14.66 15.04 15.2 15.44 15.86 16.76 18.28 19.2

Notice that I also need to use tidyr::spread. In addition, I have lost the % formatting of the column headers, although I gained a first column named cyl instead of .id.

Questions:

  1. Is there a better dplyr-based approach to accomplishing this tapply %>% ldply chain?
  2. Is there a way to get the best of both worlds without jumping through too many hoops? That is, get the % formatting and the proper cyl column name for the first column?
JasonAizkalns

2 Answers


Using dplyr

library(dplyr)
mtcars %>% 
   group_by(cyl) %>% 
   do(data.frame(as.list(quantile(.$mpg,probs=probs)), check.names=FALSE))
#  cyl   0%   10%   20%   30%   40%  50%   60%   70%   80%   90% 100%
#1   4 21.4 21.50 22.80 22.80 24.40 26.0 27.30 30.40 30.40 32.40 33.9
#2   6 17.8 17.98 18.32 18.98 19.40 19.7 20.48 21.00 21.00 21.16 21.4
#3   8 10.4 11.27 13.90 14.66 15.04 15.2 15.44 15.86 16.76 18.28 19.2

Or an option using data.table

library(data.table)
as.data.table(mtcars)[, as.list(quantile(mpg, probs=probs)) , cyl]
#   cyl   0%   10%   20%   30%   40%  50%   60%   70%   80%   90% 100%
#1:   6 17.8 17.98 18.32 18.98 19.40 19.7 20.48 21.00 21.00 21.16 21.4
#2:   4 21.4 21.50 22.80 22.80 24.40 26.0 27.30 30.40 30.40 32.40 33.9
#3:   8 10.4 11.27 13.90 14.66 15.04 15.2 15.44 15.86 16.76 18.28 19.2
akrun
  • care to explain `check.names = FALSE`? – JasonAizkalns Jun 02 '15 at 13:53
  • @JasonAizkalns It is an argument of `data.frame` that defaults to `check.names = TRUE`. With the default, column names that are not syntactically valid R names (for example, names that start with a digit, such as `0%`) are altered: an `X` is prepended and invalid characters are replaced with `.`. The relevant code is `if (check.names) vnames <- make.names(vnames, unique = TRUE)` – akrun Jun 02 '15 at 13:56
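
To make the effect of `check.names` concrete, here is a minimal base R sketch (the variable names `q`, `checked`, and `unchecked` are only illustrative). It compares the column names produced with and without the check for the same quantile vector:

```r
probs <- seq(0, 1, 0.1)
q <- quantile(mtcars$mpg, probs = probs)   # named vector: "0%", "10%", ..., "100%"

# Default: check.names = TRUE, so names go through make.names()
checked <- data.frame(as.list(q))
names(checked)[1:3]

# With check.names = FALSE the original "%" labels are kept
unchecked <- data.frame(as.list(q), check.names = FALSE)
names(unchecked)[1:3]

# What make.names() does to a few example names; "cyl" is already valid
make.names(c("0%", "10%", "cyl"))
```

The first call should show names such as `X0.` and `X10.`, while the second keeps `0%` and `10%`.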

@akrun's version is good, but I would use data_frame_ inside the do statement.

mtcars %>% 
  group_by(cyl) %>% 
  do(data_frame_(quantile(.$mpg, probs = probs)))
## Source: local data frame [3 x 12]
## Groups: cyl
## 
##   cyl   0%   10%   20%   30%   40%  50%   60%   70%   80%   90% 100%
## 1   4 21.4 21.50 22.80 22.80 24.40 26.0 27.30 30.40 30.40 32.40 33.9
## 2   6 17.8 17.98 18.32 18.98 19.40 19.7 20.48 21.00 21.00 21.16 21.4
## 3   8 10.4 11.27 13.90 14.66 15.04 15.2 15.44 15.86 16.76 18.28 19.2

After further investigation into why this works, it looks like data_frame_ differs from the usual SE logic used in dplyr. data_frame_ takes only one argument, columns, and really expects a lazy_dots object.

If it gets a vector instead, it still works, because lazy evaluation of the individual elements works. So being able to use data_frame_ on a vector like this may actually be a bug.
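
For comparison, a route that does not depend on this behavior is to convert the named vector to a list explicitly, as in the other answer. The following base R sketch (with illustrative variable names `long` and `wide`) shows why the conversion matters: a bare vector becomes a single column, whereas a list becomes one column per element.

```r
q <- quantile(mtcars$mpg, probs = c(0, 0.5, 1))

# A bare named vector becomes ONE column with one row per quantile (long shape)
long <- data.frame(q)
dim(long)

# A list becomes one column per element, giving a single wide row
wide <- data.frame(as.list(q), check.names = FALSE)
dim(wide)
names(wide)
```

Here `long` should have 3 rows and 1 column, and `wide` 1 row and 3 columns named `0%`, `50%`, and `100%`.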

shadow
  • Didn't know that `data_frame_` works in a compact manner. Good info! – akrun Jun 02 '15 at 13:58
  • Is there a way to generate the output in long form using `data_frame_()` then? – Arun Jun 02 '15 at 13:59
  • @Arun: You could use `lazy_dots`, but that seems a bit overly complicated: `data_frame_(lazyeval::lazy_dots(quantile(.$mpg, probs = probs)))`. Don't know of a simpler solution. Of course this is equivalent to `data_frame(quantile(.$mpg, probs = probs))`. – shadow Jun 02 '15 at 14:08
  • @shadow - very interesting, not sure I completely understand **why** this works, and I'm guessing others would benefit from an explanation in your answer. – JasonAizkalns Jun 02 '15 at 14:14
  • @JasonAizkalns +1. shadow, seems to me that `data_frame()` and `data_frame_()` should yield identical results.. (as one is the SE and other's the NSE)? – Arun Jun 02 '15 at 14:23
  • Thank you @shadow for the additional explanation. I've added this as [issue #1194](https://github.com/hadley/dplyr/issues/1194) on GitHub. – JasonAizkalns Jun 02 '15 at 14:58