20

I was trying to understand PARTITION BY in postgres by writing a few sample queries. I have a test table on which I run my query.

id integer | num integer
___________|_____________
1          | 4 
2          | 4
3          | 5
4          | 6

When I run the following query, I get the output as I expected.

SELECT id, COUNT(id) OVER(PARTITION BY num) from test;

id         | count
___________|_____________
1          | 2 
2          | 2
3          | 1
4          | 1

But, when I add ORDER BY to the partition,

SELECT id, COUNT(id) OVER(PARTITION BY num ORDER BY id) from test;

id         | count
___________|_____________
1          | 1 
2          | 2
3          | 1
4          | 1

My understanding is that COUNT is computed across all rows that fall into a partition. Here, I have partitioned the rows by num. The number of rows in the partition is the same, with or without an ORDER BY clause. Why is there a difference in the outputs?

kishore
  • 604
  • 3
  • 7
  • 13
  • 5
    In the second the case, postgre counts the number of rows with `id` lower or equal to the actual `id` – Radim Bača Jun 20 '18 at 10:04
  • @RadimBača is it something specific to postgres or is that how it is supposed to work? I don't understand how the query is interpreted the way you have described. – kishore Jun 20 '18 at 10:12
  • Using COUNT(*) in place of COUNT(id) gives the same result. – kishore Jun 20 '18 at 10:13
  • 5
    See [the documentation](https://www.postgresql.org/docs/current/static/tutorial-window.html) on window functions, especially: `By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause. When ORDER BY is omitted the default frame consists of all rows in the partition.` – Marth Jun 20 '18 at 10:15
  • Thanks for the info. I now understand what is happening. I missed the info provided in the documentation. – kishore Jun 20 '18 at 10:20
  • @RadimBača: It's [Postgres](https://wiki.postgresql.org/wiki/FAQ#What_is_PostgreSQL.3F_How_is_it_pronounced.3F_What_is_Postgres.3F) not "postgre" –  Jun 20 '18 at 10:24
  • @a_horse_with_no_name I believe it is PostgreSQL ;) – Radim Bača Jun 20 '18 at 10:34

3 Answers3

39

When you add an order by to an aggregate used as a window function that aggregate turns into a "running count" (or whatever aggregate you use).

The count(*) will return the number of rows up until the "current one" based on the order specified.

The following query shows the different results for aggregates used with an order by. With sum() instead of count() it's a bit easier to see (in my opinion).

with test (id, num, x) as (
  values 
    (1, 4, 1),
    (2, 4, 1),
    (3, 5, 2),
    (4, 6, 2)
)
select id, 
       num,
       x,
       count(*) over () as total_rows, -- NB over () is needed
       count(*) over (order by id) as rows_upto,
       count(*) over (partition by x order by id) as rows_per_x,
       sum(num) over (partition by x) as total_for_x,
       sum(num) over (order by id) as sum_upto,
       sum(num) over (partition by x order by id) as sum_for_x_upto
from test;
       

will result in:

id | num | x | total_rows | rows_upto | rows_per_x | total_for_x | sum_upto | sum_for_x_upto
---+-----+---+------------+-----------+------------+-------------+----------+---------------
 1 |   4 | 1 |          4 |         1 |          1 |           8 |        4 |              4
 2 |   4 | 1 |          4 |         2 |          2 |           8 |        8 |              8
 3 |   5 | 2 |          4 |         3 |          1 |          11 |       13 |              5
 4 |   6 | 2 |          4 |         4 |          2 |          11 |       19 |             11

There are more examples in the Postgres manual

CervEd
  • 3,306
  • 28
  • 25
  • ```with test (id, num, x) as ( values (1, 4, 1), (2, 4, 1), (3, 5, 2), (4, 6, 2) ) ``` Is there similar syntax in SQL Server? – Robin Sun Dec 14 '21 at 05:51
  • @RobinSun `with test as ( SELECT * FROM (VALUES (1, 4, 1), (2, 4, 1), (3, 5, 2), (4, 6, 2)) AS t (id, num, x))` – CervEd Feb 24 '23 at 07:24
7

Your two expressions are:

COUNT(id) OVER (PARTITION BY num)

COUNT(id) OVER (PARTITION BY num ORDER BY id)

Why would you expect these to return the same values? The syntax is different for a reason.

The first returns the overall count for each num -- essentially joining back the aggregated value.

The second does a cumulative count. It does the COUNT() for each row of id, for all values up to that ids value.

Note that such cumulative counts would normally be implemented using RANK() (or related functions). The cumulative count is subtly different from RANK(). The cumulative count implements:

COUNT(id) OVER (PARTITION BY num ORDER BY id RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

RANK() is slightly different. The difference only matters when the ORDER BY keys have ties.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • 4
    "Why would you expect these to return the same values?" Because ordering usually has no effect on filtering in sql. WHERE clauses are applied before ordering. It's far from obvious that it effects the function when the syntax suggests that it is applied "over a partition". Instead, an example on how to achieve the same results despite ordering would have been nice. – dube Feb 06 '20 at 14:29
  • 1
    @dube . . . This is not an ordering clause for a result set. It is an ordering clause that is part of a window function. – Gordon Linoff Feb 06 '20 at 15:24
  • 1
    I know and I understand the difference. Your introduction claims that it would be obvious that there is a difference, which I claim it is not. It's impossible to foresee without reading about the details in docs. – dube Feb 06 '20 at 15:43
4

The "why" has already been explained by others. Sometimes you have an ordered window, and you have to do a count over the whole partition despite having an ORDER BY.

To do so, use an unbounded range with RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

create table search_log
(
    id bigint not null primary key,
    query varchar(255) not null,
    stemmed_query varchar(255) not null,
    created timestamp not null,
);

SELECT query,
       created as seen_on,
       first_value(created) OVER query_window as last_seen,
       row_number() OVER query_window AS rn,
       count(*) OVER query_window AS occurence
FROM search_log l
     WINDOW query_window AS (PARTITION BY stemmed_query ORDER BY created DESC 
         RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
dube
  • 4,898
  • 2
  • 23
  • 41