11

We are new to postgres, we have following query by which we can select top N records from each category.

 create table temp (
     gp char,
     val int
 );

 insert into temp values ('A',10);
 insert into temp values ('A',8);
 insert into temp values ('A',6);
 insert into temp values ('A',4);
 insert into temp values ('B',3);
 insert into temp values ('B',2);
 insert into temp values ('B',1);

 select a.gp,a.val
 from   temp a
 where  a.val in (
              select b.val
              from   temp b
              where  a.gp=b.gp
              order by b.val desc
             limit 2);

Output of above query is something like this

 gp   val
 ----------
 A    10
 A    8
 B    3
 B    2

But our requirement is different, we want to select top n% records from each category where n is not fixed, n is based of some percent of elements in each group.

dpilwal
  • 361
  • 1
  • 5
  • 14
  • 2
    "*where [...] n is based of some percent of elements in each group*" - based on which elements? Please give us an example of your expected output based on your sample data. Btw: the "top-n" query is better done using a window function rather than a co-related sub-query –  Jul 08 '14 at 07:34
  • @a_horse_with_no_name Thank you for your quick reply, n is based on group element eg. in above example we want 75% of records from each group in that case we will get 3 records from group A and 2 records from group B. – dpilwal Jul 08 '14 at 07:52

3 Answers3

21

To retrieve the rows based on the percentage of the number of rows in each group you can use two window functions: one to count the rows and one to give them a unique number.

select gp,
       val
from (
  select gp, 
         val,
         count(*) over (partition by gp) as cnt,
         row_number() over (partition by gp order by val desc) as rn
  from temp
) t
where rn / cnt <= 0.75;

SQLFiddle example: http://sqlfiddle.com/#!15/94fdd/1


Btw: using char is almost always a bad idea because it is a fixed-length data type that is padded to the defined length. I hope you only did that for setting up the example and don't use it in your real table.

10

Referencing the response from a_horse_with_no_name, you can achieve something similar using percent_rank()

SELECT
    gp,
    val,
    pct_rank
FROM (
    SELECT
        gp,
        val,
        percent_rank() over (order by val desc) as pct_rank
    FROM variables.temp
    ) t
WHERE pct_rank <= 0.75;

You can then set the final WHERE clause to return data at whatever percent_rank() threshold you require.

xrpza
  • 329
  • 3
  • 6
-2

The accepted answer did not work for me. I find this solution that works for me:

SELECT * FROM temp ORDER BY val DESC
     LIMIT (SELECT (count(*) / 10) AS selnum FROM temp )

It is not optimal (performance) but it works

Diego Quirós
  • 898
  • 1
  • 14
  • 28