
I am new to PostgreSQL (specifically, I use TimescaleDB) and have a question regarding time windows.

Data:

date      |customerid|names   
2014-01-01|1         |Andrew 
2014-01-02|2         |Pete   
2014-01-03|2         |Andrew 
2014-01-04|2         |Steve  
2014-01-05|2         |Stef   
2014-01-06|3         |Stef  
2014-01-07|1         |Jason 
2014-01-08|1         |Jason 
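
A minimal setup to reproduce this (the table name t and the column types are assumptions) could be:

-- sample data; table name and column types assumed from the question
create table t (
    date       date,
    customerid int,
    names      text
);

insert into t (date, customerid, names) values
    ('2014-01-01', 1, 'Andrew'),
    ('2014-01-02', 2, 'Pete'),
    ('2014-01-03', 2, 'Andrew'),
    ('2014-01-04', 2, 'Steve'),
    ('2014-01-05', 2, 'Stef'),
    ('2014-01-06', 3, 'Stef'),
    ('2014-01-07', 1, 'Jason'),
    ('2014-01-08', 1, 'Jason');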

The question is: going back x days from each row, how many distinct names are there for the same customerid?

For x=2 days, the result should look like this:

date      |customerid|names  |count 
2014-01-01|1         |Andrew |1 
2014-01-02|2         |Pete   |1 
2014-01-03|2         |Andrew |2 
2014-01-04|2         |Steve  |3 
2014-01-05|2         |Stef   |3 
2014-01-06|3         |Stef   |1
2014-01-07|1         |Jason  |1
2014-01-08|1         |Jason  |1  

(For example, on 2014-01-04 the two-day look-back window covers 2014-01-02 through 2014-01-04, which contains three distinct names for customerid 2: Pete, Andrew, and Steve.)

Is this possible in PostgreSQL without looping over each row?

Additional information: in the real data, the time intervals between rows are not equidistant.

Thank you very much!

Dominik

1 Answer


It would be nice if you could use window functions:

-- what you would like to write ('x day' stands for the chosen look-back, e.g. '2 day')
select t.*,
       count(distinct names) over (partition by customerid
                                   order by date
                                   range between interval 'x day' preceding and current row
                                  ) as cnt_x
from t;
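
Running that fails with an error along these lines (exact wording may vary by version):

ERROR:  DISTINCT is not implemented for window functions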

Alas, that is not possible, so you can use a lateral join instead:

-- for each row, count distinct names for the same customerid in the trailing x-day window
select t.*, tt.cnt_x
from t left join lateral
     (select count(distinct t2.names) as cnt_x
      from t t2
      where t2.customerid = t.customerid and
            t2.date >= t.date - interval 'x day' and
            t2.date <= t.date
     ) tt
     on true;
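
If x should be a query parameter rather than spliced into an interval literal, make_interval() is one way to do that (a sketch; $1 is assumed to be the look-back in whole days):

-- same lateral join, with the look-back window passed as a parameter
select t.*, tt.cnt_x
from t left join lateral
     (select count(distinct t2.names) as cnt_x
      from t t2
      where t2.customerid = t.customerid
        and t2.date >= t.date - make_interval(days => $1)
        and t2.date <= t.date
     ) tt
     on true;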

For performance, you want an index on (customerid, date, names).
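A minimal version of that index (PostgreSQL generates a name when none is given):

-- lets the lateral subquery be answered from the index alone
create index on t (customerid, date, names);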

Gordon Linoff
  • Yes, `COUNT()` is implemented as a window function, but not `COUNT(DISTINCT)`. The lateral query is the solution. – The Impaler Jun 19 '20 at 13:34
  • Thank you very much! In recent weeks I have used Spark to calculate queries like this, and I was curious whether the same is possible in PostgreSQL. – Dominik Jun 19 '20 at 13:39
  • As I see it, this will not perform on big data (200k+ rows). What kind of database solution is suitable for these kinds of problems, in your opinion? – Dominik Jun 19 '20 at 14:29
  • @Dominik . . . Depending on how many ids you have, the index should be a big help. – Gordon Linoff Jun 19 '20 at 15:39
  • Yes, I understand. But I have multiple queries (which are all similar but still different). In the end I will end up with a bunch of indexes and I think this would cause problems. Each row in the table is a purchase transaction. – Dominik Jun 19 '20 at 20:26
  • I am unable to solve two problems at the same time: 1. I have to be able to calculate queries that are very similar to this example on historical transactions, for example 50 such queries on a historical data set with 200k rows. 2. I have to be able to quickly calculate 50 such queries for one row at runtime. I will probably do it this way: use Spark to calculate these queries on the historical records, and use TimescaleDB to calculate queries of this kind at runtime on an incoming transaction. It would be great to have a technology that can do both. – Dominik Jun 19 '20 at 20:27