I have a data frame with multiple variables and I would like to find the quantiles of each of these variables.

Sample code:

testtable = data.frame(groupvar = c(rep('x',100), rep('y',100)), 
                       numericvar = rnorm(200))

I want to apply quantile(., c(.05, .1, .25, .5, .75, .9, .95)) to each of the variables in testtable. The ideal result would look like

   x    y
  .05 .05
  .1  .1
  .25 .25
  .5  .5
  .75 .75
  .9  .9
  .95 .95

where each entry is a quantile of x or y. For example, .05 is the 5th percentile of the distribution of x, .1 is the 10th percentile of x, etc.

I tried summarise in dplyr but ran into a problem because my quantile function is returning a vector of length 7.

What is the best way to do this?

Micha Wiedenmann
Amazonian

3 Answers

Here is a base R solution where we unstack the data frame and calculate the quantiles for each column, i.e.

sapply(unstack(testtable, numericvar ~ groupvar), function(i) quantile(i, v1))

which gives,

              x           y
5%  -1.82980882 -1.49900735
10% -1.26047295 -1.02626933
25% -0.83928910 -0.68248217
50%  0.02757385 -0.02096953
75%  0.64842517  0.48624513
90%  1.63382801  1.09722178
95%  1.91104161  1.72846846

where v1 <- c(0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95)

Sotos
  • a tiny bit shorter: `sapply(unstack(testtable, numericvar ~ groupvar), quantile, probs = v1)` – Jaap Nov 26 '18 at 10:51

Another possibility uses lapply; we first need to split the data into a list:

l <- split(testtable$numericvar, testtable$groupvar)

Now we can get the quantiles, then transform the result back to a data.frame:

ll <- lapply(l, function(x) quantile(unlist(x), c(.05, .1, .25, .5, .75, .9, .95)))
as.data.frame(ll)
#             x           y
# 5%  -1.8028162 -1.69293054
# 10% -1.3129427 -1.23125086
# 25% -0.7335853 -0.57010352
# 50% -0.1223181  0.05119533
# 75%  0.6727871  0.66203631
# 90%  1.3411195  1.08830220
# 95%  1.7068070  1.54248740

This could be turned into a function; you can extend it to make it more general:

quantile_grouped <- function(data, group_var = "groupvar", quantile_var = "numericvar") {

  # split the numeric column of `data` into one vector per group
  l <- split(data[, quantile_var], data[, group_var])

  # quantiles per group, combined into one data frame
  ll <- lapply(l, function(x) quantile(unlist(x), c(.05, .1, .25, .5, .75, .9, .95)))
  as.data.frame(ll)

}
quantile_grouped(testtable)
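
One way to make it more general, as a rough sketch: pass the probabilities in as an argument instead of hard-coding them. The name `quantile_by_group`, its `probs` parameter, and the seed are made up here for illustration and are not part of the function above.

```r
# Hypothetical generalization: the function name and the `probs`
# argument are illustrative assumptions.
quantile_by_group <- function(data, group_var = "groupvar",
                              quantile_var = "numericvar",
                              probs = c(.05, .1, .25, .5, .75, .9, .95)) {
  # split the numeric column into one vector per group
  groups <- split(data[[quantile_var]], data[[group_var]])
  # one column of quantiles per group, rows named by percentage
  as.data.frame(lapply(groups, quantile, probs = probs))
}

set.seed(1)  # arbitrary seed so the sample data is reproducible
testtable <- data.frame(groupvar = c(rep('x', 100), rep('y', 100)),
                        numericvar = rnorm(200))

# quartiles only, instead of the seven default probabilities
res <- quantile_by_group(testtable, probs = c(.25, .5, .75))
res
```

Called without `probs`, it should behave like `quantile_grouped(testtable)`.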
RLave
  • For your first line, you can avoid `lapply` by simply doing `split(testtable$numericvar, testtable$groupvar)` – Sotos Nov 26 '18 at 10:29

Another option:

pr <- c(0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95)
as.data.frame.list(tapply(testtable$numericvar, testtable$groupvar,
                          quantile, probs = pr))

which gives:

              x          y
5%  -1.57823487 -1.5142682
10% -1.28807795 -1.2153000
25% -0.60598752 -0.6889401
50% -0.07536852 -0.2036487
75%  0.57269482  0.4892494
90%  1.04087379  1.2231926
95%  1.22329927  1.7421848
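
Note that the printed numbers differ between the answers here only because `rnorm` was called without a seed. A quick sketch to check that the three base R approaches agree on the same data; the seed value and the variable names `a`, `b`, `d` are arbitrary choices for this illustration:

```r
set.seed(42)  # arbitrary seed, only so the comparison is reproducible
testtable <- data.frame(groupvar = c(rep('x', 100), rep('y', 100)),
                        numericvar = rnorm(200))
pr <- c(0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95)

# unstack + sapply (returns a matrix)
a <- sapply(unstack(testtable, numericvar ~ groupvar), quantile, probs = pr)

# split + lapply (returns a data frame)
b <- as.data.frame(lapply(split(testtable$numericvar, testtable$groupvar),
                          quantile, probs = pr))

# tapply (returns a data frame)
d <- as.data.frame.list(tapply(testtable$numericvar, testtable$groupvar,
                               quantile, probs = pr))

# compare the values only, ignoring matrix vs. data frame attributes
all.equal(as.vector(a), unlist(b, use.names = FALSE))
all.equal(unlist(b, use.names = FALSE), unlist(d, use.names = FALSE))
```

The main practical difference is the return type: `sapply` gives a matrix, while the other two give a data frame.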
Jaap