
I have the following df:

group = rep(seq(1,3),30)
variable = runif(90, 5.0, 7.5)
df = data.frame(group,variable)

I need to i) compute quantiles within each group, and ii) assign each person to her quantile with respect to her group.

Thus, the output would look like:

id    group  variable  quantile_with_respect_to_the_group
1      1      6.430002     1
2      2      6.198008     3
          .......

There is a complicated way to do it with loops and the `cut` function over each group, but it is not efficient at all. Does someone know a better solution?

Thanks !

Jb_Eyd
  • you can use `tapply(df$variable, df$group, FUN = function(x) quantile(x, prob = 0.5), simplify = TRUE)` or something else like `aggregate`, or even the package `dplyr` – Mamoun Benghezal Feb 17 '16 at 10:28
  • It works for computing the quantile but it does not assign each person to his own quantile in the df. – Jb_Eyd Feb 17 '16 at 10:50
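A call like `tapply` returns one value per group, so its result cannot be attached to the rows directly. As a minimal base-R sketch, assuming quartiles are what is wanted, `ave` applies a function within each group and returns a vector in the original row order. The column name `quantile_in_group` and the seed are illustrative assumptions, not part of the question.

```r
set.seed(42)  # assumption: seed added only so the sketch is reproducible
df <- data.frame(group = rep(seq(1, 3), 30),
                 variable = runif(90, 5.0, 7.5))

# ave() splits `variable` by `group`, applies FUN to each piece,
# and puts the results back in the original row order.
df$quantile_in_group <- ave(df$variable, df$group, FUN = function(x) {
  cut(x, quantile(x, probs = 0:4/4), labels = FALSE, include.lowest = TRUE)
})

head(df)
```

Each row now carries the quartile (1 to 4) of its value relative to its own group.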

2 Answers


In data.table:

library(data.table)

setDT(df)[,quantile := cut(variable, quantile(variable, probs = 0:4/4),
                         labels = FALSE, include.lowest = TRUE), by = group]

head(df)
#    group variable quantile
# 1:     1 6.103909        2
# 2:     2 6.511485        3
# 3:     3 5.091684        1
# 4:     1 6.966461        4
# 5:     2 6.613441        4
mtoto
  • Could you explain the ":=" in your function and the setDT(df), thanks. It works pretty well :) ! – Jb_Eyd Feb 17 '16 at 17:04
  • it's part of the `data.table` syntax, you can read more about it [here](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reference-semantics.html) – mtoto Feb 17 '16 at 17:08
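For readers with the same question about `setDT` and `:=`, here is a small sketch of what the two do. The column name `group_mean` is a hypothetical example chosen only to illustrate the syntax.

```r
library(data.table)

df <- data.frame(group = rep(seq(1, 3), 30),
                 variable = runif(90, 5.0, 7.5))

# setDT() converts the data.frame to a data.table in place (no copy is made),
# so there is no need to write df <- ...
setDT(df)
class(df)

# `:=` adds or updates a column by reference, again without reassigning df.
# With `by = group`, the right-hand side is evaluated once per group.
df[, group_mean := mean(variable), by = group]

head(df)
```

Because both operations modify `df` by reference, the answer's one-liner changes `df` itself rather than returning a modified copy.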

Another version with dplyr/findInterval

library(dplyr)
df %>%
  group_by(group) %>% 
  mutate(Quantile = findInterval(variable, 
                quantile(variable, probs=0:4/4)))
akrun
  • Comparing with the `data.table` solution from @mtoto, this does not create grouping within the "group", but only overall. – Dima Oct 24 '19 at 11:36
  • @Dima. Perhaps you have also loaded `plyr`, which has its own `mutate` that masks `dplyr::mutate`. Use `dplyr::mutate` explicitly. – akrun Oct 24 '19 at 16:34
  • Thanks a lot! You're totally right. However, this also produces slightly different results from the `data.table` solution. The differences appear to come from how values on the quartile borders are assigned: `dplyr` puts a value equal to a quartile in the next _higher_ group, while `data.table` puts it in the next _lower_ group. Furthermore, in my data set the `dplyr` solution even assigns some values to a **fifth** group, namely the group **maximum values**. So the two methods seem to differ in rounding or in where they use ">=" rather than ">". – Dima Oct 24 '19 at 22:08
  • @Dima It can be fixed by changing some of the parameters in `findInterval` – akrun Oct 25 '19 at 16:24
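One way to do this, as a sketch: by default `findInterval` uses intervals that are closed on the left and open on the right, so a group's maximum falls past the last break and gets index 5. Setting `rightmost.closed = TRUE` keeps the maximum in bin 4, and `left.open = TRUE` switches to right-closed intervals, which is the convention `cut(..., include.lowest = TRUE)` uses. The `Quantile_cut` column and the seed are assumptions added only for the comparison; `left.open` needs a reasonably recent R version.

```r
library(dplyr)

set.seed(1)  # assumption: seed added only so the sketch is reproducible
df <- data.frame(group = rep(seq(1, 3), 30),
                 variable = runif(90, 5.0, 7.5))

res <- df %>%
  group_by(group) %>%
  dplyr::mutate(
    Quantile = findInterval(variable,
                            quantile(variable, probs = 0:4/4),
                            rightmost.closed = TRUE,  # group maximum stays in bin 4
                            left.open = TRUE),        # bins are (lo, hi], as in cut()
    # comparison column: the cut() call from the data.table answer
    Quantile_cut = cut(variable, quantile(variable, probs = 0:4/4),
                       labels = FALSE, include.lowest = TRUE)
  ) %>%
  ungroup()

# Cross-tabulate the two assignments; all counts should lie on the diagonal.
table(res$Quantile, res$Quantile_cut)
```

With these two arguments the `findInterval` version should agree with the `cut` version, including at the quartile borders.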