In Postgresql, how to select top n percent of rows by a column?

Question

In Postgresql (version 10), following sql select all rows order by the avg_grade.

-- query - students list, order by average grade,
select s.student_id, s.student_name, avg(ce.grade) as avg_grade
from students as s
       left join course_enrollment as ce on s.student_id = ce.student_id
group by s.student_id
order by avg_grade desc NULLS LAST;

Relevant tables

students:

create table students (
  student_id   bigserial                           not null primary key,
  student_name varchar(200)                        not null,
  created      timestamp default CURRENT_TIMESTAMP not null
);

course_enrollment:

-- create table,
create table course_enrollment
(
  course_id  bigint                              not null,
  student_id bigint                              not null,
  grade      float                               not null,
  created    timestamp default CURRENT_TIMESTAMP not null,
  unique (course_id, student_id)
);

Questions:

How to retrieve only the top n% (e.g 10%) of rows, whose avg_grade have the highest values?
Wondering is there a window function to do this, or a sub query is required?

BTW:

This is different from Postgresql : How do I select top n percent(%) entries from each group/category
Because that one want top n% in each group, thus it could use partitions in the window functions.
But this one want the top n% overall, thus group by is required.

Possible duplicate of [Postgresql : How do I select top n percent(%) entries from each group/category](https://stackoverflow.com/questions/24626036/postgresql-how-do-i-select-top-n-percent-entries-from-each-group-category) — Joakim Danielson, Mar 02 '19 at 10:35
@JoakimDanielson That question select top n% from each group, but this question select top n% overall, how are they the same ? Have you tried? — Eric, Mar 02 '19 at 10:59
Do they have to be exactly the same to be helpful? Isn’t there enough info in that question to help you solve the issue? — Joakim Danielson, Mar 02 '19 at 11:03
@beatrice This is trivial, but in practical, maybe this would be the most efficient way. — Eric, Mar 02 '19 at 11:25

score 3 · Accepted Answer · edited Mar 02 '19 at 13:05

3

I would use a subquery:

select student_id, student_name, avg_grade, rank() over (order by avg_grade desc)
from (select s.student_id,
             s.student_name,
             avg(ce.grade)                                        as avg_grade,
             rank() over (order by avg(ce.grade) desc nulls last) as seqnum,
             count(*) over ()                                     as cnt
      from students s
             left join
           course_enrollment ce
           on s.student_id = ce.student_id
      group by s.student_id
     ) as ce_avg
where seqnum <= cnt * 0.1;

There are other window functions you can use instead, such as NTILE() and PERCENTILE_DISC(). I prefer the direct calculation because it gives more control over how ties are handled.

edited Mar 02 '19 at 13:05

Eric

22,183
20
145
196

answered Mar 02 '19 at 12:04

Gordon Linoff

1,242,037
58
646
786

This works, though it didn't handle the case when there are multiple rows with the same `avg_grade` at end of the top window, it's better to include all the rows with the same `avg_grade` to be more fair. – Eric Mar 02 '19 at 12:15
1

@EricWang . . . You can use `rank()` instead of `row_number()` to handle that case. – Gordon Linoff Mar 02 '19 at 12:54
Cool, that fixed it. – Eric Mar 02 '19 at 13:00
1

@EricWang . . . Ironically, this is exactly why I prefer to do the calculation directly rather than using `percentile_()` or `ntile()` -- it gives more control over handling ties. – Gordon Linoff Mar 02 '19 at 13:10

Eric · Answer 2 · 2019-03-02T12:59:33.053

After trying for a while, got a ugly yet working solution by myself.

select *, rank() over (order by avg_grade desc)
from (
       select s.student_id, s.student_name, avg(ce.grade) as avg_grade
       from students as s
              left join course_enrollment as ce on s.student_id = ce.student_id
       group by s.student_id
       order by avg_grade desc nulls last
     ) as ce_avg
where avg_grade >= (
  select ce_avg.avg_grade
  from (
         select s.student_id, s.student_name, avg(ce.grade) as avg_grade
         from students as s
                left join course_enrollment as ce on s.student_id = ce.student_id
         group by s.student_id
         order by avg_grade desc nulls last
       ) as ce_avg
  limit 1 offset (select (count(*) * 0.1)::int from students) - 1
);

Tips:

Can't simply use (limit %n * total) or (top n percent) anyway. Since the students with the avg_grade = minimal top avg_grade, might be only partly included, Which is not fair.
The ugly sql above could handle that case, with performance cost.

Here is an example that shows the differences of the running results with duplication handled or unhandled:
- Duplication handled - more fair.
- Duplication unhandled - not as fair

In Postgresql, how to select top n percent of rows by a column?

2 Answers2

Linked