"distinct on" with group by postgres

Question

I have the following records:

id  run_hour               performance_hour      value
2  "2017-06-25 09:00:00"  "2017-06-25 07:00:00"    6
2  "2017-06-25 09:00:00"  "2017-06-25 08:00:00"    5
1  "2017-06-25 09:00:00"  "2017-06-25 08:00:00"    5
2  "2017-06-25 08:00:00"  "2017-06-25 07:00:00"    5
1  "2017-06-25 08:00:00"  "2017-06-25 07:00:00"    5

We run every hour to look at the result of each id for current hour and previous hours.

Only if there was a change from previous hour run we insert a new reocrd (we don't want to overwrite the value because we want to measure the value if looked after 1 hour, or 2 hours etc.

I would like to sum for each id in the latest available value (sort by run_hour) - the values.

In the above example, ad 1 for run 9:00 and performance hour of 7:00 doesn't have a record - because it was the same as the run of 8:00 and performance hour of 7:00

In the above example, if I ask for sum of values of run 2017-06-25 09:00:00 I would expect to get:

id, value
1   10
2   11

for id 1, it's 10 calculated: (run_hour<2017-06-25 08:00:00> + run_hour<2017-06-25 09:00:00>) and for id 2, it's 11 calculated: (run_hour<2017-06-25 09:00:00> + run_hour<2017-06-25 09:00:00>) I wrote the following query:

select distinct on (id, run_hour) id, sum(value) from metrics where  run_hour <= '2017-06-25 09:00' and performance_hour >= '2017-06-25 07:00' and  performance_hour < '2017-06-25 09:00'
group by id
order by id, run_hour

However I get an error that also run_hour has to be in the GROUP BY clause. - but if I add it I get incorrect data - also data for previous hours which I don't need - I need the latest hour that had data.

How can I use "distinct on" with group by?

What makes you decide for the record with value 6 for id=2 , run_hour= "2017-06-25 09:00:00"? Because its earlier performance_hour? — Thorsten Kettner, Jun 28 '17 at 14:28
It's data that I get every hour about performance of our ads. when I ran at 9:00 for hour 7 - I got value 6, which is different that the previous run for that hour (8:00 for hour 7 and ad 2) so I insert a new record — Dejell, Jun 28 '17 at 14:29
No, I don't understand. You have two records for id=2 , run_hour= "2017-06-25 09:00:00". You take the one with value 6 and dismiss the one with value 5 when building the sum. Why that record and not the other one? — Thorsten Kettner, Jun 28 '17 at 14:31
the performance hour is different. for id at 9:00 for performance between 7-9 the sum is 11, however, for id 1, there is no record for run hour 9:00 and performance record 8:00, because I retrieved the same data from the external source so I didn't want to create a redundant record (but since the value is the same - I need to use the record from run 8:00 to performance 7:00 for id 1) — Dejell, Jun 28 '17 at 14:33
Okay, I think I got it now. Complicated. I'll have to think of a solution... — Thorsten Kettner, Jun 28 '17 at 14:37

Thorsten Kettner · Accepted Answer · 2017-06-28T19:41:24.847

2

The task is very complicated. Let's say you want the performance hours 7:00 till 9:00 from the following data:

id  run_hour               performance_hour      value
2   "2017-06-25 09:00:00"  "2017-06-25 06:00:00"    6
2   "2017-06-25 09:00:00"  "2017-06-25 10:00:00"    5

The expected result would be 18 (6 for 7:00 + 6 for 8:00 + 6 for 9:00) all based on the 6:00 record which itself is outside the desired time range.

We need a recursive CTE starting from the first wanted performance hour per id till the last wanted one. Thus we build records that don't exist and that we can sum up later.

with recursive cte(id, run_hour, performance_hour, value) as
(
  select *
  from
  (
    select distinct on (id) 
      id, 
      run_hour,
      greatest(performance_hour, timestamp '2017-06-25 07:00') as performance_hour, 
      value
    from metrics
    where run_hour = timestamp '2017-06-25 09:00' 
      and performance_hour <= timestamp '2017-06-25 07:00'
    order by id, metrics.performance_hour desc
  ) start_by_id
  union all
  select 
    cte.id, 
    cte.run_hour,
    cte.performance_hour + interval '1 hour' as performance_hour,
    coalesce(m.value, cte.value) as value
  from cte
  left join metrics m on m.id = cte.id
                      and m.run_hour = cte.run_hour
                      and m.performance_hour = cte.performance_hour + interval '1 hour'
  where cte.performance_hour < timestamp '2017-06-25 09:00'
)
select id, sum(value)
from cte
group by id;

Rextester link: http://rextester.com/PHC88770

edited Jun 28 '17 at 19:41

answered Jun 28 '17 at 15:28

Thorsten Kettner

89,309
7
49
73

why the expected result will be 18 and not 11 for id: 2? I need the most recent run_time record if exists – Dejell Jun 28 '17 at 16:07
Okay, I missed the part that you want the time till current hour. Yes, this would include the second record, of course. The question however remains how many previous hours. If you look at 7:00 to 10:00 you'll get a sum of 23. If you look at 9:00 to 10:00 you get a sum of 11. If you look at 9:00 to 12:00 you get 21. Just adjust the timestamps in the query so as to get the range you want. – Thorsten Kettner Jun 28 '17 at 19:39
I had a bug in my query, though. In the `DISTINCT ON` query I ordered by `id, performance_hour` instead of `id, metrics.performance_hour`. That could pick the wrong record due to ambiguity. Well, that comes from using an alias name that is also a column name :-| I've updated my answer and the rextester test accordingly. – Thorsten Kettner Jun 28 '17 at 19:43
Thanks. Do you think that there is a better db to save this type of information without the need to look recursively? rather than postgresSQL – Dejell Jul 06 '17 at 14:48
1

Another DB or another DBMS? The database doesn't seem bad, but you are not writing every record. Though the data is regarded to come in hourly, you omit records when they have the same value as the previous. Thus your table has gaps. This makes your query so complicated; as a first step you must generate the missing rows (recursive query). If you want it easier, you'll have to write the records, so there are no gaps. Or use a programming language rather than SQL for calculation. As to your database management system: PostgreSQL is very good; I see no reason why you would change it for another. – Thorsten Kettner Jul 06 '17 at 16:20

Gordon Linoff · Answer 2 · 2017-06-29T02:42:50.760

1

You want the distinct on before the group by:

select id, sum(value)
from (select distinct on (id, run_hour) m.*
      from metrics m
      where run_hour <= '2017-06-25 09:00' and
            performance_hour >= '2017-06-25 07:00' and
            performance_hour < '2017-06-25 09:00'
      order by id, run_hour, performance_hour desc
     ) m
group by id;

edited Jun 29 '17 at 02:42

answered Jun 28 '17 at 14:17

Gordon Linoff

1,242,037
58
646
786

I get missing FROM-clause entry for table "m" – Dejell Jun 28 '17 at 14:21
Change `m.*` to `metrics.*`. – Thorsten Kettner Jun 28 '17 at 14:28
it doesn't work. I get for id: 2 the sum of 10 instead of 11 – Dejell Jun 28 '17 at 14:39
It takes the run of the 8:00 for id 2, however it should take the value of the run of 9:00 which returned 6 – Dejell Jun 28 '17 at 14:47

"distinct on" with group by postgres

2 Answers2