Continuous aggregate refresh takes a long time even for a small volume of data
This is about continuous aggregates (CAGGs) and how they are refreshed.
We ran the following queries and noted our observations.
- Create a table and convert it into a hypertable with a proper primary key and index (a note on the resulting chunk layout follows the SQL).
CREATE TABLE "devices_data"(
time TIMESTAMP WITHOUT TIME ZONE NOT NULL,
device_id INTEGER,
temperature DOUBLE PRECISION,
PRIMARY KEY(time, device_id)
);
SELECT create_hypertable('devices_data', 'time');
CREATE INDEX ON "devices_data"(device_id, time DESC);
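For reference, with create_hypertable's default 7-day chunk interval, the four years of data loaded below should end up in roughly 200 chunks; the layout can be checked with show_chunks once the data is in:
-- List the chunks backing the hypertable (one row per chunk).
SELECT show_chunks('devices_data');
-- Or just count them.
SELECT count(*) FROM show_chunks('devices_data');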
- Create a continuous aggregate to roll the data up hourly, and define a refresh policy (see the note after the SQL about the open-ended start_offset).
CREATE MATERIALIZED VIEW devices_data_summary_hourly
WITH (timescaledb.continuous) AS
SELECT device_id,
time_bucket(INTERVAL '1 hour', time) AS bucket,
AVG(temperature),
MAX(temperature),
MIN(temperature),
SUM(temperature),
COUNT(*)
FROM devices_data
GROUP BY device_id, bucket
WITH NO DATA;
SELECT add_continuous_aggregate_policy('devices_data_summary_hourly',
start_offset => NULL,
end_offset => INTERVAL '1 h',
schedule_interval => INTERVAL '1 minute');
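Note that start_offset => NULL deliberately makes the refresh window open-ended, so every scheduled run covers the CAGG's entire history. For comparison (not what we ran), a bounded policy would look something like this, with the offsets here being illustrative values only:
-- Alternative: bound the refresh window so each scheduled run only
-- re-examines the most recent few hours of raw data.
SELECT add_continuous_aggregate_policy('devices_data_summary_hourly',
    start_offset => INTERVAL '4 hours',
    end_offset => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 minute');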
- Next we add data spanning 4 years for a particular device ID, one point every 5 seconds.
INSERT INTO devices_data
SELECT time, 1, random()*50 + 10
FROM generate_series(TIMESTAMP '2017-03-01 00:00:00',
TIMESTAMP '2021-03-01 00:00:00',
INTERVAL '5 seconds') AS time;
Query output: INSERT 0 25246081 (query returned successfully in 3 min 58 secs).
- Next we observe how much time the refresh job takes to add these points to the hourly aggregate view (a query for reading the duration from job stats follows the numbers).
refresh job time -> 19.078569 seconds
select count(*) from devices_data_summary_hourly -> 35065
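The refresh duration can be read from TimescaleDB's job statistics; on TimescaleDB 2.x a query along these lines should work:
-- Find the CAGG refresh policy's background job and its last run time.
SELECT j.job_id, s.last_run_duration, s.total_runs
FROM timescaledb_information.jobs j
JOIN timescaledb_information.job_stats s USING (job_id)
WHERE j.proc_name = 'policy_refresh_continuous_aggregate';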
- Next we add data for a second device ID, but just one point per day for 4 years.
INSERT INTO devices_data
SELECT time, 2, random()*50 + 10
FROM generate_series(TIMESTAMP '2017-03-01 00:00:00',
TIMESTAMP '2021-03-01 00:00:00',
INTERVAL '1 day') AS time;
Query output: INSERT 0 1462 (query returned successfully in 555 msec).
- Next we observe how much time the refresh job takes to add these points to the hourly aggregate view; a manual narrow refresh for comparison follows the numbers.
refresh job time -> 19.059796 seconds
select count(*) from devices_data_summary_hourly -> 36527
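For comparison, refresh_continuous_aggregate allows refreshing an explicit window manually; if the cost is driven by the width of the refresh window rather than by the number of changed rows, a narrow manual refresh such as the following (window values illustrative) should be fast:
-- Manually refresh only one day of the aggregate instead of the full range.
CALL refresh_continuous_aggregate('devices_data_summary_hourly',
     '2021-02-28 00:00:00', '2021-03-01 00:00:00');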
Brief observations:
Dense insert (steps 3 & 4):
- Points added to main hypertable -> 25246081
- Refresh job time to add these points to the CAGG -> 19.078569 seconds
- Points added to the CAGG -> 35065
Sparse insert (steps 5 & 6):
- Points added to main hypertable -> 1462
- Refresh job time to add these points to the CAGG -> 19.059796 seconds
- Points added to the CAGG -> 1462
Conclusion:
Comparing the two tests, the CAGG refresh takes almost exactly the same time (about 19 seconds) even though the newly inserted volumes differ by four orders of magnitude (25246081 rows vs. 1462 rows). This suggests that TimescaleDB recomputes the entire 4-year range on every refresh, irrespective of how much data actually changed.
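One way to check whether TimescaleDB tracked the second insert as a narrow change would be to inspect its invalidation log. This is an internal catalog table whose name and layout may change between versions, so treat the query below as a diagnostic sketch only:
-- Internal catalog (TimescaleDB 2.x); the modified values are raw internal
-- timestamps, not human-readable ones.
SELECT materialization_id, lowest_modified_value, greatest_modified_value
FROM _timescaledb_catalog.continuous_aggs_materialization_invalidation_log;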
Questions:
- Is this how it is supposed to be?
- Does TimescaleDB only consider the time range covered by the refresh window, rather than being smart enough to recalculate aggregations only for the buckets whose underlying points changed?
- Are we missing something in our schema design or other configuration that leads to this behaviour?