I've been trying to calculate percentiles for a fairly large number of observations. Since I am working with a panel data set, I would like to compute the percentiles within each time period. While looking for a way to do this I came across two different approaches, described in "Use dplyr::percent_rank() to compute percentile ranks within group" and in the question "Percentile for Each Observation w/r/t Grouping Variable".
The problem is that the two commands apparently return different percentiles, and I would like to know whether both are "correct". To demonstrate the point:
library(data.table)
library(plyr)   # ddply()
library(dplyr)  # percent_rank(); loaded after plyr
years = c(2006, 2006, 2006, 2006, 2001, 2001, 2001, 2001, 2001)
scores = c(13, 65, 23, 34, 78, 56, 89, 98, 100)
dt <- data.table(years, scores)
ddply(dt, .(years), transform, percentile = ecdf(scores)(scores))
ddply(dt, .(years), transform, percentile = round(percent_rank(scores), 4))
dt[, .(scores,
       ecdf.percentile = ecdf(scores)(scores),
       p.rank.percentile = round(percent_rank(scores), 4)),
   by = list(years)][order(years), ]
As the output shows, the two sets of percentiles are similar but not identical:
   years scores ecdf.percentile p.rank.percentile
1:  2001     78            0.40            0.2500
2:  2001     56            0.20            0.0000
3:  2001     89            0.60            0.5000
4:  2001     98            0.80            0.7500
5:  2001    100            1.00            1.0000
6:  2006     13            0.25            0.0000
7:  2006     65            1.00            1.0000
8:  2006     23            0.50            0.3333
9:  2006     34            0.75            0.6667
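For reference, the gap between the two columns seems to come from the definitions rather than from a bug. As far as I can tell, ecdf(x)(x) returns the share of observations less than or equal to each value, i.e. rank / n, while percent_rank() is documented in dplyr as the minimum rank rescaled to [0, 1], i.e. (rank - 1) / (n - 1). Below is a minimal base R sketch for the 2006 group; the names ecdf_manual and prank_manual are my own and only illustrative.

```r
# Scores of the 2006 group from the example above
scores <- c(13, 65, 23, 34)
n <- length(scores)

# ecdf-style percentile: proportion of values <= each score (rank / n)
ecdf_manual <- rank(scores, ties.method = "max") / n

# percent_rank-style percentile: minimum rank rescaled to [0, 1],
# (rank - 1) / (n - 1); assumes no missing values in scores
prank_manual <- (rank(scores, ties.method = "min") - 1) / (n - 1)

print(ecdf_manual)
print(round(prank_manual, 4))
```

If that reading is right, the smallest value in a group gets 1/n from ecdf() but 0 from percent_rank(), which would explain why the two columns agree only at the maximum.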