How to filter the top 10 percent of a column in a data frame, grouped by id, using dplyr

Question

I have the following data frame:

id   total_transfered_amount day
1       1000                 2
1       2000                 3
1       3000                 4
1       1000                 1
1       10000                4
2       5000                 3
2       6000                 4
2       40000                2
2       4000                 3
2       4000                 3
3       1000                 1
3       2000                 2
3       3000                 3
3       30000                3
3       3000                 3

I need to filter out the rows that fall above the 90th percentile of the 'total_transfered_amount' column, for every id separately, preferably using the dplyr package. For example, I need to filter out the following rows:

2       40000                2
3       30000                3

@Mateusz1981 I doubt sample_frac works on the percentile concept. I don't want to sample the column; I want to keep the lower 90 percent and get rid of the rows that fall in the top 10 percent. – chessosapiens, Jun 27 '16 at 09:38
How can I write it in dplyr syntax using group_by and filter? – chessosapiens, Jun 27 '16 at 09:40

Mateusz1981 · Answer 1 · 2016-06-27T14:27:56.627

9

Check this out. I do not understand why you have the first row in your output.

    library(dplyr)

    dane <- data.frame(
      id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
      total_trans = c(1000, 2000, 3000, 1000, 10000, 5000, 6000, 40000,
                      4000, 4000, 1000, 2000, 3000, 30000, 3000),
      day = c(2, 3, 4, 1, 4, 3, 4, 2, 3, 3, 1, 2, 3, 3, 3)
    )

    dane %>% group_by(id) %>% filter(quantile(total_trans, 0.9) < total_trans)

         id total_trans   day
      (dbl)       (dbl) (dbl)
    1     1       10000     4
    2     2       40000     2
    3     3       30000     3
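
The question can be read two ways: return the rows above each id's 90th percentile, or drop them and keep the rest. The sketch below shows both, along with the per-id thresholds that drive the result. It assumes the `dane` data frame from this answer (rebuilt here so the block runs on its own) and relies on the default `quantile()` method (`type = 7`). The names `thresholds`, `top_rows` and `kept_rows` are only illustrative.

```r
library(dplyr)

dane <- data.frame(
  id = rep(1:3, each = 5),
  total_trans = c(1000, 2000, 3000, 1000, 10000,
                  5000, 6000, 40000, 4000, 4000,
                  1000, 2000, 3000, 30000, 3000),
  day = c(2, 3, 4, 1, 4, 3, 4, 2, 3, 3, 1, 2, 3, 3, 3)
)

# 90th percentile computed separately for each id
thresholds <- dane %>%
  group_by(id) %>%
  summarise(p90 = quantile(total_trans, 0.9))

# Rows strictly above their own group's 90th percentile
top_rows <- dane %>%
  group_by(id) %>%
  filter(total_trans > quantile(total_trans, 0.9)) %>%
  ungroup()

# Complement: keep only rows at or below their group's 90th percentile
kept_rows <- dane %>%
  group_by(id) %>%
  filter(total_trans <= quantile(total_trans, 0.9)) %>%
  ungroup()
```

With this sample data the per-id thresholds work out to 7200, 26400 and 19200, which is why the 10000 row of id 1 also ends up above its group's cutoff.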

edited Jun 27 '16 at 14:27

answered Jun 27 '16 at 09:47

Mateusz1981


Edited. I just guessed that 10000 might fall above the 90th percentile. – chessosapiens Jun 27 '16 at 09:49
What I think your answer is missing is that it calculates the percentile for the whole column, but we need to do it separately for every id group. – chessosapiens Jun 27 '16 at 09:56
But quantile is not an aggregation function, is it? – chessosapiens Jun 27 '16 at 10:01
Adding group_by before mutate would not solve the problem. – chessosapiens Jun 27 '16 at 10:02
OK, I am lost. Do you want the 0.9 quantile for each 'id', or the quantile computed from all observations, and then to select everything above it across all ids? – Mateusz1981 Jun 27 '16 at 10:05
The 90th percentile for each id, and then filter out the top 10 percent. Your code is not consistent with your output. – chessosapiens Jun 27 '16 at 10:32
I do not understand how my answer is different from @akrun's. – Mateusz1981 Jun 27 '16 at 10:43
Try your code: dane %>% group_by(id) %>% mutate(li = quantile(total_trans, 0.9)) %>% filter(total_trans > li). It returns something else. – chessosapiens Jun 27 '16 at 10:45
If you add ' %>% select(-li)' it returns exactly the same; the 'li' column is not shown. – Mateusz1981 Jun 27 '16 at 10:46
It is still missing the first line, as it calculates the percentile for the whole column. – chessosapiens Jun 27 '16 at 10:51
The dplyr command you are looking for is `dane %>% group_by(id) %>% filter(quantile(total_trans, 0.9) – ArunK Jun 27 '16 at 12:32
And it returns the same as mine; I know mine has more syntax. – Mateusz1981 Jun 27 '16 at 14:28

akrun · Accepted Answer · 2016-06-27T10:27:49.827

1

We can use data.table

 library(data.table)
 setDT(df1)[,.SD[quantile(total_transfered_amount, 0.9) < 
                total_transfered_amount] , by = id]
 #    id total_transfered_amount day
 #1:  1                   10000   4
 #2:  2                   40000   2
 #3:  3                   30000   3

Or we can use base R

df1[with(df1, as.logical(ave(total_transfered_amount, id, 
              FUN=function(x) quantile(x, 0.9) < x))),]
#   id total_transfered_amount day
#5   1                   10000   4
#8   2                   40000   2
#14  3                   30000   3
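
A short sketch of why the base R version wraps `ave()` in `as.logical()`: `ave()` writes the per-group results back into a copy of its numeric input, so the TRUE/FALSE comparison comes back as 1/0. Indexing rows with those numbers directly would pick row 1 repeatedly instead of the flagged rows. The data frame is rebuilt from the question so the block runs on its own; `flag` and `above` are illustrative names.

```r
df1 <- data.frame(
  id = rep(1:3, each = 5),
  total_transfered_amount = c(1000, 2000, 3000, 1000, 10000,
                              5000, 6000, 40000, 4000, 4000,
                              1000, 2000, 3000, 30000, 3000),
  day = c(2, 3, 4, 1, 4, 3, 4, 2, 3, 3, 1, 2, 3, 3, 3)
)

# Per-id comparison against the group's 90th percentile; ave() returns
# a numeric vector of 0s and 1s, aligned with the rows of df1
flag <- ave(df1$total_transfered_amount, df1$id,
            FUN = function(x) quantile(x, 0.9) < x)

# Convert to a logical mask before using it as a row index
above <- as.logical(flag)

df1[above, ]   # rows above their group's 90th percentile
df1[!above, ]  # the remaining rows
```

Negating the mask, as in the last line, gives the data with the top rows of each id removed, which may be what the question is ultimately after.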

edited Jun 27 '16 at 10:27

answered Jun 27 '16 at 10:07

akrun


Yes, correct. What if we want to keep it as a data frame and use dplyr? – chessosapiens Jun 27 '16 at 10:15
@sanaz the `data.table` should work with `dplyr`. If you need to change to a `data.frame`, use `setDF(res)` – akrun Jun 27 '16 at 10:16
The problem is that I may want to migrate the code to R Spark, and there is no data.table concept in R Spark yet. – chessosapiens Jun 27 '16 at 10:21
@sanaz In that case, you can still use `base R`, right? `df1[with(df1, as.logical(ave(total_transfered_amount, id, FUN=function(x) quantile(x, 0.9) < x))),]` – akrun Jun 27 '16 at 10:25
