
I'm using TimescaleDB in my PostgreSQL database and I have the following two tables:

windows_log

| windows_log_id |      timestamp      | computer_id | log_count |
------------------------------------------------------------------
|        1       | 2021-01-01 00:01:02 |     382     |     30    |
|        2       | 2021-01-02 14:59:55 |     382     |     20    |
|        3       | 2021-01-02 19:08:24 |     382     |     20    |
|        4       | 2021-01-03 13:05:36 |     382     |     10    |
|        5       | 2021-01-03 22:21:14 |     382     |     40    |

windows_reliability_score

| computer_id (FK) |      timestamp      | reliability_score |
--------------------------------------------------------------
|        382       | 2021-01-01 22:21:14 |          6        |
|        382       | 2021-01-01 22:21:14 |          6        |
|        382       | 2021-01-01 22:21:14 |          6        |
|        382       | 2021-01-02 22:21:14 |          1        |
|        382       | 2021-01-02 22:21:14 |          3        |
|        382       | 2021-01-03 22:21:14 |          7        |
|        382       | 2021-01-03 22:21:14 |          8        |
|        382       | 2021-01-03 22:21:14 |          9        |

Note: Both tables are hypertables, indexed on the timestamp column.
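
If you want to reproduce this, the tables can be created roughly like so (the column types are my assumption; only the names and sample values appear above):

-- Hypothetical DDL matching the sample tables; types are assumed.
CREATE TABLE windows_log (
    windows_log_id BIGSERIAL,
    "timestamp"    TIMESTAMPTZ NOT NULL,
    computer_id    INTEGER,
    log_count      INTEGER
);
SELECT create_hypertable('windows_log', 'timestamp');

CREATE TABLE windows_reliability_score (
    computer_id       INTEGER,  -- FK to a computers table in the real schema (not shown)
    "timestamp"       TIMESTAMPTZ NOT NULL,
    reliability_score INTEGER
);
SELECT create_hypertable('windows_reliability_score', 'timestamp');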

So I'm trying to get the average reliability_score for each time bucket, but it just gives me the overall average repeated for every bucket, instead of the average per bucket...

This is my query:

SELECT time_bucket_gapfill(CAST(1 * INTERVAL '1 day' AS INTERVAL), wl.timestamp) AS timestamp, 
COALESCE(SUM(log_count), 0) AS log_count,
AVG(reliability_score) AS reliability_score
FROM windows_log wl
JOIN reliability_score USING (computer_id)
WHERE wl.time >= '2021-01-01 00:00:00.0' AND wl.time < '2021-01-04 00:00:00.0'
GROUP BY timestamp
ORDER BY timestamp asc

This is the result I'm looking for:

|      timestamp      | log_count | reliability_score |
-------------------------------------------------------
| 2021-01-01 00:00:00 |     30    |          6        |
| 2021-01-02 00:00:00 |     20    |          2        |
| 2021-01-03 00:00:00 |     20    |          8        |

But this is what I get:

|      timestamp      | log_count | reliability_score |
-------------------------------------------------------
| 2021-01-01 00:00:00 |     30    |        5.75       |
| 2021-01-02 00:00:00 |     20    |        5.75       |
| 2021-01-03 00:00:00 |     20    |        5.75       |
– TheStranger

2 Answers

Given what we can glean from your example, there's no simple way to join these two tables with the given functions and achieve the results you want. The schema, as presented, just makes that difficult.

If this is really what your data/schema look like, then one solution is to use multiple CTEs to get the two values from each distinct table and then join based on bucket and computer.

WITH wrs AS (
    SELECT time_bucket_gapfill('1 day', timestamp) AS bucket,
           computer_id,
           AVG(reliability_score) AS reliability_score
    FROM windows_reliability_score
    WHERE timestamp >= '2021-01-01 00:00:00.0' AND timestamp < '2021-01-04 00:00:00.0'
    GROUP BY 1, 2
),
wl AS (
    SELECT time_bucket_gapfill('1 day', timestamp) AS bucket,
           computer_id,
           SUM(log_count) AS total_logs
    FROM windows_log
    WHERE timestamp >= '2021-01-01 00:00:00.0' AND timestamp < '2021-01-04 00:00:00.0'
    GROUP BY 1, 2
)
SELECT wrs.bucket, wrs.computer_id, reliability_score, total_logs
FROM wrs
LEFT JOIN wl ON wrs.bucket = wl.bucket AND wrs.computer_id = wl.computer_id;

The filtering has to be applied inside each CTE because predicate pushdown from the outer query likely wouldn't happen; otherwise the entire hypertable would be scanned before the date filter is applied (not what you want, I assume).
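
For the single-computer case described in the comments below, the computer_id filter would likewise go inside each CTE. A sketch along those lines, reusing the query above with the sample id 382 (hypothetical, untested against your real schema):

-- The computer_id predicate is repeated inside each CTE so it is applied
-- before the hypertable scan, per the pushdown caveat above.
WITH wrs AS (
    SELECT time_bucket_gapfill('1 day', timestamp) AS bucket,
           computer_id,
           AVG(reliability_score) AS reliability_score
    FROM windows_reliability_score
    WHERE timestamp >= '2021-01-01 00:00:00.0' AND timestamp < '2021-01-04 00:00:00.0'
      AND computer_id = 382
    GROUP BY 1, 2
),
wl AS (
    SELECT time_bucket_gapfill('1 day', timestamp) AS bucket,
           computer_id,
           SUM(log_count) AS total_logs
    FROM windows_log
    WHERE timestamp >= '2021-01-01 00:00:00.0' AND timestamp < '2021-01-04 00:00:00.0'
      AND computer_id = 382
    GROUP BY 1, 2
)
SELECT wrs.bucket, wrs.computer_id, reliability_score,
       COALESCE(total_logs, 0) AS total_logs  -- mirror the question's COALESCE for gapfilled rows
FROM wrs
LEFT JOIN wl ON wrs.bucket = wl.bucket AND wrs.computer_id = wl.computer_id
ORDER BY wrs.bucket;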

I tried to quickly re-create your sample schema, so I apologize if I got names wrong somewhere.

– Ryan

The main issue is that the join condition is on the column computer_id, where both tables have the same value 382. Thus each row from table windows_log is joined with each row from table reliability_score (a Cartesian product of all matching rows). Also, the grouping is done on the column timestamp, which is ambiguous and is likely resolved to timestamp from windows_log. This leads to the average using all values of reliability_score for every value of timestamp from windows_log, which explains the undesired result.

The resolution of the grouping ambiguity, in favor of the input column, i.e., the table column, is explained in the GROUP BY description of the SELECT documentation:

In case of ambiguity, a GROUP BY name will be interpreted as an input-column name rather than an output column name.
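
To see the trap in isolation, here is a minimal hypothetical query against windows_log alone (using plain time_bucket so the example is self-contained):

-- GROUP BY timestamp binds to the input column windows_log.timestamp,
-- producing one group per distinct raw timestamp, not per day:
SELECT time_bucket(INTERVAL '1 day', timestamp) AS timestamp,
       SUM(log_count) AS log_count
FROM windows_log
GROUP BY timestamp;

-- Referencing the position (or repeating the expression) groups per bucket:
SELECT time_bucket(INTERVAL '1 day', timestamp) AS timestamp,
       SUM(log_count) AS log_count
FROM windows_log
GROUP BY 1;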

To avoid groups that include all rows matching on computer_id, windows_log_id can be used for grouping. This also makes it possible to bring log_count into the query result. And if it is desired to keep the output name timestamp, GROUP BY should reference the column by position. For example:

SELECT time_bucket_gapfill('1 day'::INTERVAL, rs.timestamp) AS timestamp, 
AVG(reliability_score) AS reliability_score,
log_count
FROM windows_log wl
JOIN reliability_score rs USING (computer_id)
WHERE rs.timestamp >= '2021-01-01 00:00:00.0' AND rs.timestamp < '2021-01-04 00:00:00.0'
GROUP BY 1, windows_log_id, log_count
ORDER BY timestamp asc

For ORDER BY this is not an issue, since the output name is used. From the same documentation:

If an ORDER BY expression is a simple name that matches both an output column name and an input column name, ORDER BY will interpret it as the output column name.
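
So in a hypothetical query like the following, ORDER BY timestamp sorts by the bucketed output column:

SELECT time_bucket(INTERVAL '1 day', timestamp) AS timestamp,
       SUM(log_count) AS log_count
FROM windows_log
GROUP BY 1
ORDER BY timestamp;  -- resolves to the output column (the bucket), not the input column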

– k_rus
  • Hey, thank you. The values of computer_id are not unique; I've already provided the tables where you can see the value appear multiple times. I've tried your solutions and none of them gives the correct result. The third query just returns 1 in the first row and null in the rest... – TheStranger Jul 14 '21 at 06:37
  • @Ben Can you explain why you need join? It is unclear to me what the acceptable join condition will be. – k_rus Jul 14 '21 at 06:43
  • Okay, so in my own table the windows_log table have 4 more columns with the log count for the 4 different severities (critical, error, warning and info). I need those counts and then I also need the reliability score from the reliability table. I have just cut that out to simplify the example as it is irrelevant to this issue. – TheStranger Jul 14 '21 at 06:48
  • By the way, the time is not identical between the two tables. – TheStranger Jul 14 '21 at 06:49
  • @Ben I noticed that the times are different this morning :). Still, I don't understand what the join condition should be. As I explained, joining only on `computer_id` leads to all possible combinations between rows from the tables with the same computer id. Or maybe the grouping condition should include more columns. Should it be grouped on `windows_log_id`? In that case it would be possible to keep values of `windows_log` columns in the query result. – k_rus Jul 14 '21 at 07:02
  • Ah okay :). I'm not sure I understand your confusion... Why would it be grouped on windows_log_id? I'm really confused... let me explain what I want. I have the two tables. I want to get the log count and average reliability score for the 382 computer_id for the chosen time bucket (could be 1 hour, 1 day, etc.). The request/query will always be restricted to one computer_id. I've edited the post to show the reason for the join. – TheStranger Jul 14 '21 at 07:22
  • @Ben the result of the query still doesn't depend on any columns in `windows_log`. Do you really need the join? Or do you want to check that computer id exists in windows log in the time frame? – k_rus Jul 14 '21 at 07:27
  • Okay... So how do I get the log_count without using the log_count column in windows_log ? – TheStranger Jul 14 '21 at 07:31
  • Thank you for clarifying. I updated my answer with my current understanding of your use case. – k_rus Jul 14 '21 at 07:57
  • It still returns the average reliability score for the full table rather than per bucket period. And I also need the result for each time bucket, so I have to group by timestamp... – TheStranger Jul 14 '21 at 08:10
  • I found the bug in my example. `time_bucket_gapfill` should be applied to `reliability_score` (not to `windows_log`). Sorry, I don't have capacity to test the query until the evening or even tomorrow. – k_rus Jul 14 '21 at 08:42
  • Thank you, but you are still not grouping by the timestamp. I need the result grouped by timestamp, just like I've shown in the question, not all these other things. I simply can't use them when they come like that. Anyways, I think I will look somewhere else, because this is dragging on a bit. Thank you for your time and effort, appreciate it. – TheStranger Jul 14 '21 at 08:50
  • @Ben, one of the reasons there seems to be confusion is that your example schema & sample query have column names that don't match. @k_rus was correctly pointing out that there's a trap that is easy to fall into where you can alias an aggregate to the same name as the original column (e.g. `time_bucket_gapfill('1 day', timestamp) AS timestamp`). When you then do something like `GROUP BY timestamp`, Postgres will group by the column (many distinct values) and not the alias (one larger time bucket). Maybe not what you have in real life, but the example is misleading. Just FYI. – Ryan Jul 20 '21 at 01:25