Different percentiles using ecdf() and dplyr::percent_rank()

Question

I've been trying to calculate percentiles on a fairly large number of observations and came across two different ways of doing it. Since I am working with a panel data set, I would like to compute the percentiles within each time period. To this end, I followed the questions "Use dplyr::percent_rank() to compute percentile ranks within group" and "Percentile for Each Observation w/r/t Grouping Variable".

The problem is that the two approaches apparently return different percentiles, and I would like to know whether both are "correct". To demonstrate the point:

library(data.table)
library(plyr)   # ddply()
library(dplyr)  # percent_rank(); loaded after plyr

years <- c(2006, 2006, 2006, 2006, 2001, 2001, 2001, 2001, 2001)
scores <- c(13, 65, 23, 34, 78, 56, 89, 98, 100)

dt <- data.table(years, scores)

# plyr: percentile per year, once via the empirical CDF and once via percent_rank()
ddply(dt, .(years), transform, percentile = ecdf(scores)(scores))
ddply(dt, .(years), transform, percentile = round(percent_rank(scores), 4))

# data.table: both measures side by side, grouped by year
dt[, .(scores,
       ecdf.percentile = ecdf(scores)(scores),
       p.rank.percentile = round(percent_rank(scores), 4)),
   by = list(years)][order(years), ]

As the output shows, the two measures are similar but not identical:

   years scores ecdf.percentile p.rank.percentile
1:  2001     78            0.40            0.2500
2:  2001     56            0.20            0.0000
3:  2001     89            0.60            0.5000
4:  2001     98            0.80            0.7500
5:  2001    100            1.00            1.0000
6:  2006     13            0.25            0.0000
7:  2006     65            1.00            1.0000
8:  2006     23            0.50            0.3333
9:  2006     34            0.75            0.6667

Look at the definition of `ecdf` and `percent_rank`. I think you should be using `dplyr::cume_dist`. That would give you the same results as `ecdf`. `dt[, .( scores , ecdf.percentile = ecdf(scores)(scores) , p.rank.percentile = round(cume_dist(scores), 4) ) , by = list(years)][order(years),]` — A Gore, Jun 20 '17 at 15:37
Ah ok, thanks. But I think there's quite some confusion around these functions, which can also be seen in this answer. I don't know if that really is the correct way to calculate quartiles, which would be the next step in my case as well. — hannes101, Jun 20 '17 at 15:51
In that case I would suggest to understand the difference between percentile and quantile. https://stats.stackexchange.com/questions/156778/percentile-vs-quantile-vs-quartile — A Gore, Jun 20 '17 at 16:05
Yeah, sorry was pretty late for me yesterday, anyway thanks. — hannes101, Jun 21 '17 at 07:13
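
The difference the comments point to can be sketched in base R. This is only a minimal illustration, assuming the usual definitions: `ecdf()` and `cume_dist()` report the share of values less than or equal to each observation, while `percent_rank()` rescales the minimum rank to the interval [0, 1]. The variable names `cume_dist_manual` and `percent_rank_manual` are hand-written stand-ins introduced here for illustration; they are not the dplyr functions themselves. The data are the 2006 group from the question.

```r
# Scores for the 2006 group from the question
scores <- c(13, 65, 23, 34)
n <- length(scores)

# ecdf(): share of observations less than or equal to each value
ecdf_pct <- ecdf(scores)(scores)

# Hand-written stand-in for cume_dist(): maximum rank divided by n
cume_dist_manual <- rank(scores, ties.method = "max") / n

# Hand-written stand-in for percent_rank(): minimum rank rescaled to [0, 1]
percent_rank_manual <- (rank(scores, ties.method = "min") - 1) / (n - 1)

print(data.frame(scores, ecdf_pct, cume_dist_manual, percent_rank_manual))
```

Under these assumptions the first two columns agree with `ecdf.percentile` in the question's output, and the last one agrees with `p.rank.percentile`. Both are therefore consistent with their own definition. The lowest score in a group gets 1/n from the first measure and 0 from the second.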

