
I am new to PostgreSQL (specifically, I use TimescaleDB) and have a question regarding time windows.

Data:

date      |customerid|names   
2014-01-01|1         |Andrew 
2014-01-02|2         |Pete   
2014-01-03|2         |Andrew 
2014-01-04|2         |Steve  
2014-01-05|2         |Stef   
2014-01-06|3         |Stef  
2014-01-07|1         |Jason 
2014-01-08|1         |Jason 
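
A minimal setup to reproduce this (the table name t and the column types are assumptions) could be:

-- sample data; table name and column types assumed from the question
create table t (
    date       date,
    customerid int,
    names      text
);

insert into t (date, customerid, names) values
    ('2014-01-01', 1, 'Andrew'),
    ('2014-01-02', 2, 'Pete'),
    ('2014-01-03', 2, 'Andrew'),
    ('2014-01-04', 2, 'Steve'),
    ('2014-01-05', 2, 'Stef'),
    ('2014-01-06', 3, 'Stef'),
    ('2014-01-07', 1, 'Jason'),
    ('2014-01-08', 1, 'Jason');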

The question is: going back x days from each row, how many distinct names are there for the same customerid?

For x=2 days, the result should look like this:

date      |customerid|names  |count 
2014-01-01|1         |Andrew |1 
2014-01-02|2         |Pete   |1 
2014-01-03|2         |Andrew |2 
2014-01-04|2         |Steve  |3 
2014-01-05|2         |Stef   |3 
2014-01-06|3         |Stef   |1
2014-01-07|1         |Jason  |1
2014-01-08|1         |Jason  |1  

(For example, on 2014-01-04 the two-day look-back window covers 2014-01-02 through 2014-01-04, which contains three distinct names for customerid 2: Pete, Andrew, and Steve.)

Is this possible in PostgreSQL without looping over each row?

Additional information: in the real data, the time intervals between rows are not equidistant.

Thank you very much!

Dominik

1 Answer


It would be nice if you could use window functions:

-- what you would like to write ('x day' stands for the chosen look-back, e.g. '2 day')
select t.*,
       count(distinct names) over (partition by customerid
                                   order by date
                                   range between interval 'x day' preceding and current row
                                  ) as cnt_x
from t;
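
Running that fails with an error along these lines (exact wording may vary by version):

ERROR:  DISTINCT is not implemented for window functions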

Alas, that is not possible, so you can use a lateral join instead:

-- for each row, count distinct names for the same customerid in the trailing x-day window
select t.*, tt.cnt_x
from t left join lateral
     (select count(distinct t2.names) as cnt_x
      from t t2
      where t2.customerid = t.customerid and
            t2.date >= t.date - interval 'x day' and
            t2.date <= t.date
     ) tt
     on true;
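
If x should be a query parameter rather than spliced into an interval literal, make_interval() is one way to do that (a sketch; $1 is assumed to be the look-back in whole days):

-- same lateral join, with the look-back window passed as a parameter
select t.*, tt.cnt_x
from t left join lateral
     (select count(distinct t2.names) as cnt_x
      from t t2
      where t2.customerid = t.customerid
        and t2.date >= t.date - make_interval(days => $1)
        and t2.date <= t.date
     ) tt
     on true;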

For performance, you want an index on (customerid, date, names).
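A minimal version of that index (PostgreSQL generates a name when none is given):

-- lets the lateral subquery be answered from the index alone
create index on t (customerid, date, names);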

Gordon Linoff
  • Yes, `COUNT()` is implemented as a window function, but not `COUNT(DISTINCT)`. The lateral query is the solution. – The Impaler Jun 19 '20 at 13:34
  • Thank you very much! In recent weeks I have used Spark to calculate queries like this, and I was curious whether the same is possible in PostgreSQL. – Dominik Jun 19 '20 at 13:39
  • As I see it, this will not perform on big data (200k+ rows). What kind of database solution is suitable for these kinds of problems, in your opinion? – Dominik Jun 19 '20 at 14:29
  • @Dominik . . . Depending on how many ids you have, the index should be a big help. – Gordon Linoff Jun 19 '20 at 15:39
  • Yes, I understand. But I have multiple queries (which are all similar but still different). In the end I will end up with a bunch of indexes and I think this would cause problems. Each row in the table is a purchase transaction. – Dominik Jun 19 '20 at 20:26
  • I am unable to solve two problems at the same time: 1. I have to be able to calculate queries that are very similar to this example on historical transactions, for example 50 such queries on a historical data set with 200k rows. 2. I have to be able to quickly calculate 50 such queries for one row at runtime. I will probably do it this way: use Spark to calculate these queries on the historical records, and use TimescaleDB to calculate queries of this kind at runtime on an incoming transaction. It would be great to have a technology that can do both. – Dominik Jun 19 '20 at 20:27