
I have a Postgres 11 table called sample_a that looks like this:

 time | cat | val
------+-----+-----
    1 |   1 |   5
    1 |   2 |   4
    2 |   1 |   6
    3 |   1 |   9
    4 |   3 |   2

I would like to create a query that, for each unique timestep, gets the most recent value in each category at or before that timestep, and then aggregates those values by dividing their sum by their count.

I believe I have the query to do this for a given timestep. For example, for time 3 I can run the following query:

select sum(val)::numeric / count(val) as result from (
    select distinct on (cat) * from sample_a where time <= 3  order by cat, time desc
) x;

and get 6.5. (This is because at time 3, the latest value from category 1 is 9 and the latest from category 2 is 4. There are two values, they sum to 13, and 13 / 2 is 6.5.)
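Likewise, the same query at time 4 also picks up category 3 (the latest values are 9, 4, and 2), so the result is 15 / 3 = 5:

select sum(val)::numeric / count(val) as result from (
    select distinct on (cat) * from sample_a where time <= 4  order by cat, time desc
) x;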

However, I would ideally like to run a query that will give me all the results for each unique time in the table. The output of this new query would look as follows:

 time | result
------+----------
    1 |   4.5
    2 |   5
    3 |   6.5
    4 |   5

Ideally, this new query would avoid adding another subselect clause; an efficient query is preferred. I could get these results by running the query above inside my application once per timestep, but that doesn't seem efficient for a large sample_a.

What would this new query look like?

Coder
  • I don't understand the requirement at all. Where does the value 5 for time 2 come from? – Laurenz Albe Nov 07 '19 at 06:51
  • @LaurenzAlbe at time 2: the latest from category 1 is 6 and the latest from category 2 is 4. Therefore there are two values, 6 and 4. sum([6,4]) / count([6,4]) = 10 / 2 = 5. – Coder Nov 07 '19 at 06:54
  • @LaurenzAlbe I added an explanation of what happens at time `3` for clarity in the question above. – Coder Nov 07 '19 at 06:59
  • Do you know all the categories in advance? – shawnt00 Nov 07 '19 at 07:08
  • You can add up a series of expressions like below for each category. A similar expression can be used to detect nulls and get a distinct count of categories for the divisor. You might find a self lateral join to be a better approach though. – shawnt00 Nov 07 '19 at 07:33
  • `last_value(case when cat = X then value end) over (order by case when cat = X then value end nulls first, time rows between unbounded preceding and current row)` – shawnt00 Nov 07 '19 at 07:36
  • @shawnt00 the categories can number in the range of 10K to 100K, so this would make your proposed query very long and impractical to use – Coder Nov 07 '19 at 07:54
  • @shawnt00 Depends on how you define 'in advance'. Not in the context you are thinking about here. – Coder Nov 07 '19 at 08:02

2 Answers


See if performance is acceptable this way. Syntax might need minor tweaks:

select t.time, avg(mr.val) as result
from (select distinct time from sample_a) t,
    lateral (
        select distinct on (cat) val
        from sample_a a
        where a.time <= t.time
        order by a.cat, a.time desc
    ) mr
group by t.time
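Depending on table size, an index matching the lateral probe may help; assuming the schema above, something like:

create index on sample_a (cat, time desc);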
shawnt00
  • Clever solution! But it needs some error corrections: https://dbfiddle.uk/?rdbms=postgres_12&fiddle=1d6d0b13d83ba44cae7ef7082552afff – S-Man Nov 07 '19 at 08:27
  • Thanks @S-Man postgres isn't my native platform and I'm working on my cellphone. – shawnt00 Nov 07 '19 at 08:32

I think you just want cumulative functions:

select time,
       sum(sum_val) over (order by time) / sum(num_val) over (order by time) as result
from (select time, sum(val) as sum_val, count(*) as num_val
      from sample_a a
      group by time
     ) a;

Note if val is an integer, you might need to convert to a numeric to get fractional values.

This can be expressed without a subquery as well:

select time,
       sum(sum(val)) over (order by time) / sum(count(*)) over (order by time) as result
from sample_a
group by time
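
As noted above, with an integer val both sums are integers and the division truncates; one way to apply the cast in this last query is inside the window sum:

select time,
       sum(sum(val)::numeric) over (order by time)
           / sum(count(*)) over (order by time) as result
from sample_a
group by time;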
Gordon Linoff