Another approach to percentiles?

Question

I have a dataset which essentially consists of a list of job batches, the number of jobs contained in each batch, and the duration of each job batch. Here is a sample dataset:

CREATE TABLE test_data
(
   batch_id    NUMBER,
   job_count   NUMBER,
   duration    NUMBER
);

INSERT INTO test_data VALUES (1, 37, 9);
INSERT INTO test_data VALUES (2, 47, 4);
INSERT INTO test_data VALUES (3, 66, 6);
INSERT INTO test_data VALUES (4, 46, 6);
INSERT INTO test_data VALUES (5, 54, 1);
INSERT INTO test_data VALUES (6, 35, 1);
INSERT INTO test_data VALUES (7, 55, 9);
INSERT INTO test_data VALUES (8, 82, 7);
INSERT INTO test_data VALUES (9, 12, 9);
INSERT INTO test_data VALUES (10, 52, 4);
INSERT INTO test_data VALUES (11, 3, 9);
INSERT INTO test_data VALUES (12, 90, 2);

Now, I want to calculate some percentiles for the duration field. Typically, this is done with something like the following:

SELECT
       PERCENTILE_DISC( 0.75 )
          WITHIN GROUP (ORDER BY duration ASC)
          AS third_quartile
FROM
       test_data;

(Which gives the result of 9)

My problem here is that I don't want the percentiles based on batches; I want them based on individual jobs. I can work this out by hand quite easily by generating a running total of the job_count:

SELECT
       batch_id,
       job_count,
       SUM(
            job_count
       )
       OVER (
              ORDER BY duration
              ROWS UNBOUNDED PRECEDING
             )
          AS total_jobs,
        duration
FROM
       test_data
ORDER BY
       duration ASC;

BATCH_ID     JOB_COUNT    TOTAL_JOBS   DURATION     
6            35           35           1            
5            54           89           1            
12           90           179          2            
2            47           226          4            
10           52           278          4            
3            66           344          6            
4            46           390          6            
8            82           472          7            
9            12           484          9            
1            37           521          9            
11           3            524          9            
7            55           579          9

Since I have 579 jobs, the 75th percentile would be job 434 (0.75 × 579 ≈ 434). Looking at the above result set, that job falls in the batch whose running total reaches 472, which corresponds to a duration of 7. This differs from the 9 that the standard function returns.
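The by-hand lookup can be sketched in Python as a cross-check. This is only an illustration and not part of the original question: the function name `batch_holding_job` is hypothetical, and the target position is taken as the ceiling of 0.75 × 579 (job 435 rather than 434; both fall in the same batch).

```python
import math
from bisect import bisect_left
from itertools import accumulate

# (batch_id, job_count, duration) rows copied from the sample data.
ROWS = [
    (1, 37, 9), (2, 47, 4), (3, 66, 6), (4, 46, 6),
    (5, 54, 1), (6, 35, 1), (7, 55, 9), (8, 82, 7),
    (9, 12, 9), (10, 52, 4), (11, 3, 9), (12, 90, 2),
]

def batch_holding_job(rows, fraction):
    """Return (batch_id, duration) of the batch containing the job at `fraction`."""
    ordered = sorted(rows, key=lambda r: r[2])         # order by duration
    running = list(accumulate(r[1] for r in ordered))  # running total of job_count
    target = math.ceil(fraction * running[-1])         # 1-based job position
    idx = bisect_left(running, target)                 # first running total >= target
    batch_id, _, duration = ordered[idx]
    return batch_id, duration

print(batch_holding_job(ROWS, 0.75))
```

For the sample data this prints `(8, 7)`: the target job sits in batch 8, whose duration is 7.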

Essentially, I want to consider each job in a batch as a separate observation, and determine percentiles based on those instead of on the batches.

Is there a relatively simple way to accomplish this?

Do you mean that you're looking for "`per job`" duration? If so, can you use `duration/job_count` as a measure? Please clarify your requirements. Your second approach does not make much sense (at least mathematically). — PM 77-1, May 08 '13 at 22:17
While correct, that still leaves the problem in place. (I had omitted that for simplicity in the mock data) If I do that then the reported 75th percentile from the above dataset is 0.16, but the desired 75th percentile should be 0.13 because it's still determining the 75th percentile based on batches and not jobs. — emiller42, May 08 '13 at 22:27
Also worth noting that functionally, no job in a batch is considered finished until the entire batch is finished. So from the perspective of an end user, all jobs in a batch take the same amount of time. — emiller42, May 08 '13 at 22:31

score 3 · Accepted Answer · answered May 08 '13 at 23:51

I would think of this as "weighted" percentiles. I don't know if there is a built-in analytic function for this in Oracle, but it is easy enough to calculate. And you are on the way there.

The additional idea is to calculate the total number of jobs, and then use arithmetic to select the value you want. For the 75th percentile, the value is the smallest duration such that the cumulative number of jobs is at least 0.75 times the total number of jobs.

Here is the example in SQL:

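-- Smallest duration whose cumulative job count reaches each requested fraction of all jobs.
select pcs.percentile,
       min(case when t.cumjobs >= t.totjobs * pcs.percentile then t.duration end) as duration
from (select batch_id, job_count, duration,
             -- default RANGE window: batches with equal duration are accumulated together
             sum(job_count) over (order by duration) as cumjobs,
             sum(job_count) over () as totjobs
      from test_data
     ) t cross join
     (select 0.25 as percentile from dual union all
      select 0.5 from dual union all
      select 0.75 from dual
     ) pcs
group by pcs.percentile;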

This example gives you the percentile values (and as an added bonus, for three different percentiles) with each percentile on its own row. If you want the values attached to each row of the original data, you need to join back to your original table.
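The weighted-percentile rule can also be checked outside the database. The following Python sketch is an assumption-laden illustration rather than the answer's own code: `weighted_percentiles` is a hypothetical name, and batches with equal duration are accumulated together to mirror the default window of `SUM(...) OVER (ORDER BY duration)`.

```python
from itertools import groupby

# (batch_id, job_count, duration) rows copied from the sample data.
ROWS = [
    (1, 37, 9), (2, 47, 4), (3, 66, 6), (4, 46, 6),
    (5, 54, 1), (6, 35, 1), (7, 55, 9), (8, 82, 7),
    (9, 12, 9), (10, 52, 4), (11, 3, 9), (12, 90, 2),
]

def weighted_percentiles(rows, fractions):
    """Map each fraction to the smallest duration whose cumulative job count
    is at least fraction * total jobs."""
    total = sum(job_count for _, job_count, _ in rows)
    ordered = sorted(rows, key=lambda r: r[2])
    cumulative = []  # (duration, jobs with duration <= this one)
    running = 0
    for duration, group in groupby(ordered, key=lambda r: r[2]):
        running += sum(job_count for _, job_count, _ in group)
        cumulative.append((duration, running))
    return {
        f: min(d for d, cum in cumulative if cum >= total * f)
        for f in fractions
    }

print(weighted_percentiles(ROWS, [0.25, 0.5, 0.75]))
```

On the sample data this yields durations 2, 6 and 7 for the 25th, 50th and 75th percentiles, so the third quartile matches the value worked out by hand in the question.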

Both of these answers get me exactly what I want, but I'm accepting this one as it runs much faster against a larger data set. (Tested against 600k batches, with up to 1,800 jobs per batch) I would upvote both, but I can't yet. Thank you both for the answers! — emiller42, May 09 '13 at 15:05
Just in case anyone else is confused about the table _dual_ used in this answer, it's a default Oracle system table, often used for selecting constants (because the SELECT statement requires FROM). See the [Wikipedia article](https://en.wikipedia.org/wiki/DUAL_table). — revengeoftheants, Dec 09 '15 at 02:33

score 0 · Answer 2 · answered May 08 '13 at 23:56

OK. I think I have your answer. The idea is mine; the implementation is borrowed from this Ask Tom article.

SELECT PERCENTILE_DISC( 0.75 )
       WITHIN GROUP (ORDER BY duration ASC)
       AS third_quartile
FROM (
  -- Row generator: one row per integer from 1 to the largest job_count.
  with data as
    (select level l
     from dual, (select max(job_count) max_jobs from test_data)
     connect by level <= max_jobs
    )
  -- Joining on l <= job_count repeats each batch once per job it contains.
  select *
  from test_data, data
  where l <= job_count
  ) expanded
;

Here is SQL Fiddle.
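The same expansion idea can be sketched in Python: repeat each batch's duration once per job, then take an ordinary discrete percentile. This is an illustrative sketch only; `percentile_disc` here is a hypothetical helper that picks the value at position ceil(fraction × N) in sorted order, which is how `PERCENTILE_DISC` is understood to behave, and is not Oracle's implementation.

```python
import math

# (batch_id, job_count, duration) rows copied from the sample data.
ROWS = [
    (1, 37, 9), (2, 47, 4), (3, 66, 6), (4, 46, 6),
    (5, 54, 1), (6, 35, 1), (7, 55, 9), (8, 82, 7),
    (9, 12, 9), (10, 52, 4), (11, 3, 9), (12, 90, 2),
]

def percentile_disc(values, fraction):
    """Smallest value whose cumulative share of the sorted values is >= fraction."""
    ordered = sorted(values)
    position = max(1, math.ceil(fraction * len(ordered)))  # 1-based position
    return ordered[position - 1]

# One observation per job: each duration repeated job_count times.
per_job = [duration for _, job_count, duration in ROWS for _ in range(job_count)]
# One observation per batch, as in the original PERCENTILE_DISC query.
per_batch = [duration for _, _, duration in ROWS]

print(percentile_disc(per_batch, 0.75), percentile_disc(per_job, 0.75))
```

With the sample data the per-batch result is 9 and the per-job result is 7. The expanded list holds 579 entries here, and it grows with the total number of jobs, which is consistent with the asker's comment that this approach ran slower on a large dataset.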
