7

I have a column of data, some of which are NULL values, from which I wish to extract the single 90th percentile value:

ColA
-----
NULL
100
200
300
NULL
400
500
600
700
800
900
1000

For the above, I am looking for a technique which returns the value 900 when searching for the 90th percentile, 800 for the 80th percentile, etc. An analogous function would be AVG(ColA) which returns 550 for the above data, or MIN(ColA) which returns 100, etc.

Any suggestions?

jbeldock
  • 2,755
  • 3
  • 18
  • 31

3 Answers3

14

If you want to get exactly the 90th percentile value, excluding NULLs, I would suggest doing the calculation directly. The following version calculates the row number and number of rows, and selects the appropriate value:

select max(case when rownum*1.0/numrows <= 0.9 then colA end) as percentile_90th
from (select colA,
             row_number() over (order by colA) as rownum,
             count(*) over (partition by NULL) as numrows
      from t
      where colA is not null
     ) t

I put the condition in the SELECT clause rather than the WHERE clause, so you can easily get the 50th percentile, 17th, or whatever values you want.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • If the op is looking for just one value to be the 90th percrntile, shouldn't be either the mean or median value of all the 90th petcentiles? Say, position 90.5 instead of position 90.0? – MatBailie Aug 10 '12 at 22:24
  • @Dems ... Not exactly. In general, the split would be exactly the value at 90%, if it exists. Othwise, it is traditionally taken to be the average of the value immediately before and immediately after. The average of the 90th tile is, I think, closer to what you are thinking. A valid computation, but not what inwould call the 90th percentile. There may be sone disagreement on what to do with duplicate values, since the same value could be the 80th percentile and 90th (in either computation). – Gordon Linoff Aug 10 '12 at 22:50
  • Apologies for the late "accept"! Yours was the answer I used, Gordon. Thanks again. – jbeldock Jan 24 '14 at 01:50
6
WITH
  percentiles AS
(
  SELECT
    NTILE(100) OVER (ORDER BY ColA) AS percentile,
    *
  FROM
    data
)
SELECT
  *
FROM
  percentiles
WHERE
  percentile = 90


Note: If the data has less than 100 observations, not all percentiles will have a value. Equally, if you have more than 100 observations, some percentiles will contain more values.

MatBailie
  • 83,401
  • 18
  • 103
  • 137
  • +1 for `NTILE`. Note that if there's less than 100 values, lets say only 50 values... then the percentile will be `1 - 50`... so there will only be a 90 percentile if there is more than 90 values. – Michael Fredrickson Aug 10 '12 at 17:18
  • Thank you. This is a step in the right direction. I am editing my post to indicate two other requirements: the data contain many NULLs, and they should be ignored, and my objective is to derive a single value (**the** 90th percentile value). – jbeldock Aug 10 '12 at 17:21
2

Starting with SQL Server 2012, there are now PERCENTILE_DISC and PERCENTILE_CONT inverse distribution functions. These are (so far) only available as window functions, not as aggregate functions, so you would have to remove redundant results because of the lacking grouping, e.g. by using DISTINCT or TOP 1:

WITH t AS (
  SELECT *
  FROM (
    VALUES(NULL),(100),(200),(300),
      (NULL),(400),(500),(600),(700),
      (800),(900),(1000)
  ) t(ColA)
)
SELECT DISTINCT percentile_disc(0.9) WITHIN GROUP (ORDER BY ColA) OVER()
FROM t
;

I have blogged about percentiles more in detail here.

Lukas Eder
  • 211,314
  • 129
  • 689
  • 1,509