How to calculate percentile with group by?

Question

I have a data.table with over ten thousand of rows and it looks like this:

DT1 <- data.table(ID = 1:10,
                  result_2010 = c("TRUE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE"),
                  result_2011 = c("FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE"),
                  years = c(15, 16.5, 31, 1, 40.2, 0.3, 12, 22.7, 19, 12))

    ID result_2010 result_2011 years
 1:  1        TRUE       FALSE  15.0
 2:  2       FALSE        TRUE  16.5
 3:  3        TRUE       FALSE  31.0
 4:  4       FALSE       FALSE   1.0
 5:  5       FALSE       FALSE  40.2
 6:  6        TRUE       FALSE   0.3
 7:  7       FALSE        TRUE  12.0
 8:  8       FALSE       FALSE  22.7
 9:  9        TRUE       FALSE  19.0
10: 10       FALSE        TRUE  12.0

For "result_2010" and "result_2011" I want to make a percentile analysis of the "years" but only if the value in for the individual is "TRUE". The code that I tried seems to work, but it gives back the same results for "result_2010" and "result_2011", which is for sure incorrect:

DT1 %>%
  group_by(result_2010 == "TRUE") %>%
  summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
            "25.quantile"= round(quantile(years,c(.25)),digits=1),
            "Median"= round(quantile(years,c(.50)),digits=1),
            "75.quantile"= round(quantile(years,c(.75)),digits=1),
            "90.quantile"= round(quantile(years,c(.90)),digits=1),
            "Mean" = round(mean(years),digits=1))
DT1 %>%
  group_by(result_2011 == "TRUE") %>%
  summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
            "25.quantile"= round(quantile(years,c(.25)),digits=1),
            "Median"= round(quantile(years,c(.50)),digits=1),
            "75.quantile"= round(quantile(years,c(.75)),digits=1),
            "90.quantile"= round(quantile(years,c(.90)),digits=1),
            "Mean" = round(mean(years),digits=1))

Could anyone help how to correct my code?

you probably wanted to use `filter` instead of `group_by` i.e `filter(result_2010 == "TRUE")` — Ronak Shah, Oct 10 '21 at 09:33
Is there a particular reason you're using `"TRUE"`/`"FALSE"` instead of the more-direct `TRUE`/`FALSE`? I find that efficient processing often starts with efficient data. — r2evans, Oct 10 '21 at 14:31
We seem to overwhelm the questioner with the number of solutions and their complexity. Gabesh must be afraid to try to do all of these. Not to mention deciding who to give 15 reputation points to. And he himself received +30 points for his question :-( ! PS. I checked your solution again, both when the variables `result_2010` and `result_2011` in `DT1` are `logical` and `character` type and every time I get the error "Error ... object 'value' not found". — Marek Fiołka, Oct 12 '21 at 21:41

jay.sf · Answer 1 · 2021-10-10T14:35:47.217

4

Using melt and aggregate.

library(data.table)
melt(DT1, c(1, 4), 2:3) |>
  transform(variable=substring(variable, 8)) |>
  subset(value == TRUE) |>
  with(aggregate(list(q=years), list(year=variable), \(x)
                 c(quantile(x), mean=mean(x))))
#   year   q.0%  q.25%  q.50%  q.75% q.100% q.mean
# 1 2010  0.300 11.325 17.000 22.000 31.000 16.325
# 2 2011 12.000 12.000 12.000 14.250 16.500 13.500

Note: Please use R>=4.1 for the |> pipes and \(x) function shorthand notation (or write function(x)).

edited Oct 10 '21 at 14:35

answered Oct 10 '21 at 09:40

jay.sf

60,139
8
53
110

1

I particularly like the `dplyr`-like pipe flow of this, sans `dplyr`. It's too bad (in my opinion) that group-wise `transform` (not used here, just in general) seems not-as-smooth (i.e., needing to use `ave`). – r2evans Oct 10 '21 at 14:49
@r2evans Did you ever inspect the `ave` code, hides `lapply(split())` inside. – jay.sf Oct 10 '21 at 14:50
1

Yes, in the past I've looked at it, and the use of `\`split<-\`` is quite enlightening. In general, dplyr's `group_by(grp) %>% mutate(a = ...)` does not seem to translate as well into `transform(a = ave(a, grp, FUN = \(x) ...))`, and works even less-well when transforming multiple variables at once. – r2evans Oct 10 '21 at 14:52

Marek Fiołka · Answer 2 · 2021-10-10T21:09:09.717

library(tidyverse)
DT1 <- tibble(ID = 1:10,
                  result_2010 = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
                  result_2011 = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),
                  years = c(15, 16.5, 31, 1, 40.2, 0.3, 12, 22.7, 19, 12))

fQuantMean = function(x) t(quantile(x)) %>% 
  as_tibble() %>% bind_cols(mean = mean(x))

tibble(
  year = c(2010, 2011),
  data = list(DT1$years[DT1$result_2010],
              DT1$years[DT1$result_2011])
) %>% group_by(year) %>% 
  group_modify(~fQuantMean(.x$data[[1]]))

output

# A tibble: 2 x 7
# Groups:   year [2]
   year  `0%` `25%` `50%` `75%` `100%`  mean
  <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
1  2010   0.3  11.3    17  22     31    16.3
2  2011  12    12      12  14.2   16.5  13.5

An update for anyone interested!

Hello dear colleagues. As you can see, each task can be solved in several different ways. So I decided to compare the methods proposed by here. Since @Gabesz mentioned that he has 10000 observations, I decided to check each of the solutions in terms of performance.

n=10000
set.seed(1234)
DT1 <- tibble(ID = 1:n,
              result_2010 = sample(c(TRUE, FALSE), n, replace = TRUE),
              result_2011 = sample(c(TRUE, FALSE), n, replace = TRUE),
              years = rnorm(n, 20, 5))

Then I did a little benchmark

fQuantMean = function(x) t(quantile(x)) %>% 
  as_tibble() %>% bind_cols(mean = mean(x))

fFiolka = function(){
  tibble(
    year = c(2010, 2011),
    data = list(DT1$years[DT1$result_2010],
                DT1$years[DT1$result_2011])
  ) %>% group_by(year) %>% 
    group_modify(~fQuantMean(.x$data[[1]]))
}
fFiolka()
# # A tibble: 2 x 7
# # Groups:   year [2]
#    year     `0%` `25%` `50%` `75%` `100%`  mean
#    <dbl>    <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
# 1  2010 -0.00697  16.4  19.9  23.3   38.6  19.9
# 2  2011 -0.633    16.5  20.0  23.4   38.6  20.0

library(data.table)

fjay_sf = function(){
  melt(DT1, c(1, 4), 2:3) |>
    transform(variable=substring(variable, 8)) |>
    subset(value == TRUE) |>
    with(aggregate(list(q=years), list(year=variable), \(x)
                   c(quantile(x), mean=mean(x))))
}
fjay_sf()
# year         q.0%        q.25%        q.50%        q.75%       q.100%       q.mean
# 1 2010 -0.006968224 16.447077579 19.947385976 23.348571278 38.636456902 19.944574420
# 2 2011 -0.633138113 16.530534403 20.043636844 23.424378551 38.636456902 20.013130400
# Warning message:
#   In melt(DT1, c(1, 4), 2:3) :
#   The melt generic in data.table has been passed a tbl_df and will attempt to redirect 
#   to the relevant reshape2 method; please note that reshape2 is deprecated, and this 
#   redirection is now deprecated as well. To continue using melt methods from reshape2
#    while both libraries are attached, e.g. melt.list, you can prepend the namespace 
#    like reshape2::melt(DT1). In the next version, this warning will become an error.


cols <- grep('result_', names(DT1), value = TRUE)

get_stats_fun <- function(DT, col) {
  DT %>%
    filter(.data[[col]] == "TRUE") %>%
    summarise("quantile" = list(round(quantile(years,c(.10,.25,.50,.75,.90)),1)),
              "median" = round(median(years), 1),
              "Mean" = round(mean(years),1)) %>%
    unnest_wider(quantile)
}

fShah = function(){
map_df(cols, ~get_stats_fun(DT1, .x), .id = 'Year') %>%
  mutate(Year = cols)
}
fShah()
# # A tibble: 2 x 8
#   Year        `10%` `25%` `50%` `75%` `90%` median  Mean
#   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
# 1 result_2010  13.5  16.4  19.9  23.3  26.4   19.9  19.9
# 2 result_2011  13.4  16.5  20    23.4  26.6   20    20  

library(microbenchmark)
ggplot2::autoplot(microbenchmark(fFiolka(), fjay_sf(), fShah(), times=100))

Hope the chart above explains it all.

@r2evans please don't blame me for skipping your solution but it caused me some errors.

score 2 · Answer 3 · answered Oct 10 '21 at 09:55

You may write-up a function and run it on every result column.

library(tidyverse)

cols <- grep('result_', names(DT1), value = TRUE)

get_stats_fun <- function(DT, col) {
  DT %>%
    filter(.data[[col]] == "TRUE") %>%
    summarise("quantile" = list(round(quantile(years,c(.10,.25,.50,.75,.90)),1)),
              "median" = round(median(years), 1),
              "Mean" = round(mean(years),1)) %>%
    unnest_wider(quantile)
}

map_df(cols, ~get_stats_fun(DT1, .x), .id = 'Year') %>%
  mutate(Year = cols)

#  Year        `10%` `25%` `50%` `75%` `90%` median  Mean
#  <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
#1 result_2010   4.7  11.3    17  22    27.4     17  16.3
#2 result_2011  12    12      12  14.2  15.6     12  13.5

r2evans · Answer 4 · 2021-10-10T14:45:26.977

A melt/dcast option:

library(data.table)
tmp <- melt(DT1, c("ID", "years"), variable.name = "year"
  )[ value == "TRUE",
   ][, .(variable = c(paste0("q", c(10, 25, 50, 75, 90)), "mu"),
         value = c(quantile(years, c(0.1, 0.25, 0.5, 0.75, 0.9)), 
                  mean(years)))
    , by = .(year)]
tmp
#            year variable  value
#          <fctr>   <char>  <num>
#  1: result_2010      q10  4.710
#  2: result_2010      q25 11.325
#  3: result_2010      q50 17.000
#  4: result_2010      q75 22.000
#  5: result_2010      q90 27.400
#  6: result_2010       mu 16.325
#  7: result_2011      q10 12.000
#  8: result_2011      q25 12.000
#  9: result_2011      q50 12.000
# 10: result_2011      q75 14.250
# 11: result_2011      q90 15.600
# 12: result_2011       mu 13.500

dcast(tmp, year ~ variable, value.var = "value")
#           year     mu   q10    q25   q50   q75   q90
#         <fctr>  <num> <num>  <num> <num> <num> <num>
# 1: result_2010 16.325  4.71 11.325    17 22.00  27.4
# 2: result_2011 13.500 12.00 12.000    12 14.25  15.6

You have complete control over the names, just assign then (in order) within the "variable" column (you might choose to name it better).

Or a solitary melt:

melt(DT1, c("ID", "years"), variable.name = "year"
  )[ value == "TRUE",
   ][, setNames(as.list(c(quantile(years, c(0.1, 0.25, 0.5, 0.75, 0.9)), 
                          mean(years))),
                c(paste0("q", c(10, 25, 50, 75, 90)), "mu"))
    , by = .(year)][]
#           year   q10    q25   q50   q75   q90     mu
#         <fctr> <num>  <num> <num> <num> <num>  <num>
# 1: result_2010  4.71 11.325    17 22.00  27.4 16.325
# 2: result_2011 12.00 12.000    12 14.25  15.6 13.500

Names are again controlled easily, now in the 2nd argument of setNames. The premise is that returning a named-list in data.table processing will convert it to named columns, so any function that does this is easily usable.

score 1 · Answer 5 · answered Oct 13 '21 at 15:48

This will be my first answer, so please forgive me if I do something wrong. By reading your question carefully, you wanted someone to help you improve your code. Here it is, please.

library(tidyverse)
library(data.table)

DT1 <- data.table(ID = 1:10,
                  result_2010 = c("TRUE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE"),
                  result_2011 = c("FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE"),
                  years = c(15, 16.5, 31, 1, 40.2, 0.3, 12, 22.7, 19, 12))
DT1 %>%
  filter(result_2010 == "TRUE") %>%
  summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
            "25.quantile"= round(quantile(years,c(.25)),digits=1),
            "Median"= round(quantile(years,c(.50)),digits=1),
            "75.quantile"= round(quantile(years,c(.75)),digits=1),
            "90.quantile"= round(quantile(years,c(.90)),digits=1),
            "Mean" = round(mean(years),digits=1))
DT1 %>%
  filter(result_2011 == "TRUE") %>%
  summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
            "25.quantile"= round(quantile(years,c(.25)),digits=1),
            "Median"= round(quantile(years,c(.50)),digits=1),
            "75.quantile"= round(quantile(years,c(.75)),digits=1),
            "90.quantile"= round(quantile(years,c(.90)),digits=1),
            "Mean" = round(mean(years),digits=1))

In the first case, it returns the values 4.7, 11.3, 17, 22, 27.4, 16.3. In the second case, it returns 12, 12, 12, 14.2, 15.6, 13.5. I see so many different answers here. Although I honestly admit some of them I don't understand (yet). I really like the solution with quantile%>% tibble%>% bind_cols. But knock on what I have a low reputation for pointing to this as helpful.

How to calculate percentile with group by?

5 Answers5