9

I Have table called timings where we are storing 1 million response timings for load testing , now we need to divide this data into 100 groups i.e. - first 500 records as one group and so on , and calculate percentile of each group , rather than average.

so far i tried this query

Select quartile
     , avg(data) 
     , max(data) 
  FROM (

        SELECT data
             , ntile(500) over (order by data) as quartile
          FROM data
       ) x
 GROUP BY quartile
 ORDER BY quartile

but how do i have find the percentile

sshet
  • 1,152
  • 1
  • 6
  • 15
  • Paul A Jungwirthjungwirth's answer is great. But there is a more complete tutorial here : [PostgreSQL NTILE Function](https://www.postgresqltutorial.com/postgresql-ntile-function/) – ABS Oct 24 '20 at 10:28

2 Answers2

13

Usually, if you want to know the percentile, you are safer using cume_dist than ntile. That is because ntile behaves strangely when given few inputs. Consider:

=# select v, 
          ntile(100) OVER (ORDER BY v),
          cume_dist() OVER (ORDER BY v)
   FROM (VALUES (1), (2), (4), (4)) x(v);

 v | ntile | cume_dist 
---+-------+-----------
 1 |     1 |      0.25
 2 |     2 |       0.5
 4 |     3 |         1
 4 |     4 |         1

You can see that ntile only uses the first 4 out of 100 buckets, where cume_dist always gives you a number from 0 to 1. So if you want to find out the 99th percentile, you can just throw away everything with a cume_dist under 0.99 and take the smallest v from what's left.

If you are on Postgres 9.4+, then percentile_cont and percentile_disc make it even easier, because you don't have to construct the buckets yourself. The former even gives you interpolation between values, which again may be useful if you have a small data set.

Paul A Jungwirth
  • 23,504
  • 14
  • 74
  • 93
  • This is by design.. you are asking it to bucket using an n that is larger than your data.. not sure what you would expect it to give you. – karuhanga Mar 12 '21 at 09:58
10

Edit:

Please note that since I originally answered this question, Postgres has gotten additional aggregate functions to help with this. See percentile_disc and percentile_cont here. These were introduced in 9.4.

Original Answer:

ntile is how one calculates percentiles (among other n-tiles, such as quartile, decile, etc.).

ntile groups the table into the specified number of buckets as equally as possible. If you specified 4 buckets, that would be a quartile. 10 would be a decile.

For percentile, you would set the number of buckets to be 100.

I'm not sure where the 500 comes in here... if you want to determine which percentile your data is in (i.e. divide the million timings as equally as possible into 100 buckets), you would use ntile with an argument of 100, and the groups would have more than 500 entries.

If you don't care about avg nor max, you can drop a bunch from your query. So it would look something like this:

SELECT data, ntile(100) over (order by data) AS percentile
FROM data
ORDER BY data
khampson
  • 14,700
  • 4
  • 41
  • 43
  • each group contains 500 records , as there would be 100 groups.. so ntile will automatically calculate percentile? for each group? – sshet Jan 11 '15 at 05:16
  • Not sure what you mean by "each group contains 500 records". What groups? All of your data is in the `data` table. `ntile(100)` on that data will bucket the data into 100 groups, however many rows that will be. And that's how one would calculate the percentile for any given set of data. If you broke it into 500 groups first, that's no longer percentile... – khampson Jan 11 '15 at 05:23
  • we need to plot a percentile graph that goes from asc to desc order ..table contains 50,000 timings,, now we need to divide this 50, 000 timings into groups and calculate percentile for each group,, i am little confused about this. or how could we achieve this – sshet Jan 11 '15 at 05:26
  • Why do you need to divide it into groups first? That's the part I'm not clear on. If you want to calculate the percentile across your timings data, generally it would be done across the entire set, or else the final result really isn't the percentile for your data set anymore. How would you divide it into groups first? Based on what criteria? – khampson Jan 11 '15 at 05:30
  • we need to divide it into 100 groups, to plot a graph, as we cant show 500000 records in x- axis , the response timings are - how many circles lies in a polygon , these are the random polygons and circles, we divide the grps as first 1 - 500 as one group and so on. – sshet Jan 11 '15 at 05:37
  • Dividing it into 100 groups is what `ntile(100)` will give you. Where does the 500 come in? – khampson Jan 11 '15 at 05:39
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/68581/discussion-between-lampdev-and-khampson). – sshet Jan 11 '15 at 05:53
  • 3
    NTILE is NOT able to calculate Percentiles correctly (or quartiles or any other type of quantile). NTILE does not consider ties which means equal values can end up in different buckets. In other words - Sally and Joe both scored 81%. Using NTILE to calculate each person's percentile, you may see Sally or Joe ranked differently. That's not how percentiles work. – Alan Burstein Feb 15 '19 at 16:45