Questions tagged [window-functions]

A window function is a type of SQL operation that performs a calculation across a partition of the result set, without collapsing the rows the way a GROUP BY aggregate does.

As a basic example of a window function, consider a table that contains a list of persons along with their age and gender.
If you wanted a query to return the complete list of people along with their age rank per gender, you could accomplish this most simply by using a ranking window function.
Such a function can be used to rank individual rows within a subset of the results.

Using a window function, such a query would look like this:

SELECT name, 
       age, 
       gender, 
       RANK() OVER (PARTITION BY gender ORDER BY age DESC) AS age_rank
FROM persons
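
Note that RANK() assigns tied rows the same rank and leaves gaps after the ties, while DENSE_RANK() leaves no gaps and ROW_NUMBER() numbers rows consecutively regardless of ties. As a sketch using the same persons table, all three can be compared side by side:

    SELECT name,
           age,
           gender,
           RANK()       OVER (PARTITION BY gender ORDER BY age DESC) AS rnk,
           DENSE_RANK() OVER (PARTITION BY gender ORDER BY age DESC) AS dense_rnk,
           ROW_NUMBER() OVER (PARTITION BY gender ORDER BY age DESC) AS row_num
    FROM persons;

If two people of the same gender share the oldest age, both get rnk = 1 and dense_rnk = 1, but row_num will arbitrarily order them 1 and 2, and the next rnk value will be 3 while dense_rnk continues with 2.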

Window functions as defined by the SQL standard can also access rows before and after the current row using the lag() and lead() functions. The following query calculates, for each customer, the delta of the amount column relative to the previous row:

SELECT customer_id,
       order_date,
       amount, 
       amount - lag(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS delta
FROM orders
ORDER BY customer_id, order_date;

Standard aggregates (e.g. count(), sum()) can also be used as window functions. When combined with an ORDER BY inside the window definition, this produces "running" totals based on the aggregate being used.
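
As a sketch, assuming an orders table like the one in the lag() example above, a running total of each customer's order amounts could be written as:

    SELECT customer_id,
           order_date,
           amount,
           SUM(amount) OVER (PARTITION BY customer_id
                             ORDER BY order_date) AS running_total
    FROM orders
    ORDER BY customer_id, order_date;

The ORDER BY inside the OVER clause is what makes the sum cumulative: each row's running_total covers all of that customer's rows up to and including the current order_date, rather than the whole partition.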

Window functions are available in a wide range of DBMS: Oracle (called "analytic functions" there), Microsoft SQL Server, DB2 (called "OLAP functions" there), PostgreSQL, Teradata, Sybase, Vertica, and partially in the H2 database.

4082 questions
18
votes
2 answers

User defined function to be applied to Window in PySpark?

I am trying to apply a user defined function to Window in PySpark. I have read that UDAF might be the way to go, but I was not able to find anything concrete. To give an example (taken from here: Xinh's Tech Blog and modified for PySpark): from…
17
votes
1 answer

Postgres window function and group by exception

I'm trying to put together a query that will retrieve the statistics of a user (profit/loss) as a cumulative result, over a period of time. Here's the query I have so far: SELECT p.name, e.date, sum(sp.payout) OVER (ORDER BY e.date) -…
Martin
  • 2,180
  • 4
  • 21
  • 41
17
votes
1 answer

PySpark / Spark Window Function First/ Last Issue

From my understanding, the first/last functions in Spark will retrieve the first/last row of each partition. I am not able to understand why the LAST function is giving incorrect results. This is my code. AgeWindow =…
Nikhil Redij
  • 1,011
  • 1
  • 14
  • 21
17
votes
5 answers

Rounding numbers to the nearest 10 in Postgres

I'm trying to solve this particular problem from PGExercises.com: https://www.pgexercises.com/questions/aggregates/rankmembers.html The gist of the question is that I'm given a table of club members and half hour time slots that they have booked…
neuron
  • 551
  • 3
  • 9
  • 16
16
votes
4 answers

MySql using correct syntax for the over clause

What is the correct syntax to get the over clause to work in mysql? I would like to see the total sms's sent by each user without grouping it with the group by clause. SELECT username, count(sentSmsId) OVER (userId) FROM …
Nightwolf
  • 945
  • 2
  • 9
  • 23
16
votes
1 answer

PostgreSQL window function: row_number() over (partition col order by col2)

Following result set is derived from a sql query with a few joins and a union. The sql query already groups rows on Date and game. I need a column to describe the number of attempts at a game partitioned by date column. Username Game ID …
user1951677
  • 329
  • 1
  • 3
  • 11
15
votes
3 answers

GROUP BY and aggregate sequential numeric values

Using PostgreSQL 9.0. Let's say I have a table containing the fields: company, profession and year. I want to return a result which contains unique companies and professions, but aggregates (into an array is fine) years based on numeric…
Joshua Burns
  • 8,268
  • 4
  • 48
  • 61
15
votes
1 answer

Difference between "ROWS BETWEEN" and "RANGE BETWEEN" in (Presto) window function "OVER" clause

This question is primarily about older versions of PrestoSQL, which have been resolved in the (now renamed) Trino project as of version 346. However, Amazon's Athena project is based off of Presto versions 0.217 (Athena Engine 2) and 0.172 (Athena…
mbafford
  • 2,266
  • 1
  • 20
  • 25
15
votes
6 answers

Getting the First and Last Row Using ROW_NUMBER and PARTITION BY

Sample Input Name | Value | Timestamp -----|-------|----------------- One | 1 | 2016-01-01 02:00 Two | 3 | 2016-01-01 03:00 One | 2 | 2016-01-02 02:00 Two | 4 | 2016-01-03 04:00 Desired Output Name | Value | EarliestTimestamp |…
Muhammad Rehan Saeed
  • 35,627
  • 39
  • 202
  • 311
15
votes
2 answers

Partitioning by multiple columns in Spark SQL

With Spark SQL's window functions, I need to partition by multiple columns to run my data queries, as follows: val w = Window.partitionBy($"a").partitionBy($"b").rangeBetween(-100, 0) I currently do not have a test environment (working on settings…
Eric Staner
  • 969
  • 2
  • 9
  • 14
15
votes
1 answer

Spark Task not serializable with lag Window function

I've noticed that after I use a Window function over a DataFrame if I call a map() with a function, Spark returns a "Task not serializable" Exception This is my code: val hc:org.apache.spark.sql.hive.HiveContext = new…
15
votes
2 answers

TSQL OVER clause: COUNT(*) OVER (ORDER BY a)

This is my code: USE [tempdb]; GO IF OBJECT_ID(N'dbo.t') IS NOT NULL BEGIN DROP TABLE dbo.t END GO CREATE TABLE dbo.t ( a NVARCHAR(8), b NVARCHAR(8) ); GO INSERT t VALUES ('a', 'b'); INSERT t VALUES ('a', 'b'); INSERT t VALUES ('a',…
Just a learner
  • 26,690
  • 50
  • 155
  • 234
14
votes
2 answers

Kafka Stream Suppress session-windowed-aggregation

I have written this code in a Kafka stream application: KGroupedStream groupedStream = stream.groupByKey(); groupedStream.windowedBy( SessionWindows.with(Duration.ofSeconds(3)).grace(Duration.ofSeconds(3))) .aggregate(() -> {...}) …
14
votes
1 answer

Select random row for each group in a postgres table

I have a table that is roughly: id | category | link | caption | image My goal is to fetch a random row from each distinct category in the table, for all the categories in the table. The plan is to then assign each row to a variable for its…
Steve
  • 325
  • 5
  • 13
13
votes
1 answer

How to aggregate over rolling time window with groups in Spark

I have some data that I want to group by a certain column, then aggregate a series of fields based on a rolling time window from the group. Here is some example data: df = spark.createDataFrame([Row(date='2016-01-01', group_by='group1', get_avg=5,…
Mike S
  • 935
  • 2
  • 6
  • 18