
Continuous aggregate refresh takes a long time even for a small volume of data

This question is about continuous aggregates and refreshing them.

We ran the following queries and noted the observations.

  1. Create a table and convert it into a hypertable with a suitable primary key and index:
CREATE TABLE "devices_data"(
    time  TIMESTAMP WITHOUT TIME ZONE NOT NULL,
    device_id  INTEGER,
    temperature  DOUBLE PRECISION,
    PRIMARY KEY(time, device_id)
);
SELECT create_hypertable('devices_data', 'time');
CREATE INDEX ON "devices_data"(device_id, time DESC);
  2. Create a continuous aggregate to compute hourly aggregations, and define a refresh policy:
CREATE MATERIALIZED VIEW devices_data_summary_hourly
WITH (timescaledb.continuous) AS
SELECT device_id,
       time_bucket(INTERVAL '1 hour', time) AS bucket,
       AVG(temperature),
       MAX(temperature),
       MIN(temperature),
       SUM(temperature),
       COUNT(*)
FROM devices_data
GROUP BY device_id, bucket
WITH NO DATA;
SELECT add_continuous_aggregate_policy('devices_data_summary_hourly',
    start_offset => NULL,
    end_offset => INTERVAL '1 h',
    schedule_interval => INTERVAL '1 minute');
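
(Side note for readers: `start_offset => NULL` tells the policy to refresh from the very beginning of the hypertable's time range on every run. If older buckets rarely change, bounding the window keeps each scheduled refresh cheap. A sketch with an illustrative, untuned offset:

```sql
-- Refresh only buckets from the last 3 days on each scheduled run.
-- Buckets older than start_offset are left alone unless refreshed manually.
SELECT add_continuous_aggregate_policy('devices_data_summary_hourly',
    start_offset => INTERVAL '3 days',   -- illustrative value; tune per workload
    end_offset => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 minute');
```
)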
  3. Next, we add data spanning 4 years for a single device id:
INSERT INTO devices_data
SELECT time, 1, random()*50 + 10
FROM generate_series(TIMESTAMP '2017-03-01 00:00:00',
                     TIMESTAMP '2021-03-01 00:00:00',
                     INTERVAL '5 seconds') AS time;

Query output: INSERT 0 25246081 (query returned successfully in 3 min 58 secs).

  4. Next, we observe how long the refresh job takes to add these points to the hourly aggregate view.

refresh job time -> 19.078569 seconds

select count(*) from devices_data_summary_hourly -> 35065

  5. Next, we add data for one device id, but only one point per day, for 4 years:
INSERT INTO devices_data
SELECT time, 2, random()*50 + 10
FROM generate_series(TIMESTAMP '2017-03-01 00:00:00',
                     TIMESTAMP '2021-03-01 00:00:00',
                     INTERVAL '1 day') AS time;

Query output: INSERT 0 1462 (query returned successfully in 555 msec).

  6. Next, we observe how long the refresh job takes to add these points to the hourly aggregate view.

refresh job time -> 19.059796 seconds

select count(*) from devices_data_summary_hourly -> 36527
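(Aside: the policy job's configuration and run durations can be inspected through TimescaleDB's informational views. View and column names below are as in TimescaleDB 2.x; adjust for your version.

```sql
-- List background jobs and their configuration
SELECT job_id, application_name, schedule_interval, config
FROM timescaledb_information.jobs;

-- Per-job run statistics, including the duration of the last run
SELECT job_id, last_run_duration, total_runs, total_successes
FROM timescaledb_information.job_stats;
```
)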

Brief observations:

Output of steps 3 & 4:

Points added to main hypertable -> 25246081

Refresh job time to add these points to CAGG -> 19.078569 seconds

Points added to CAGG -> 35065

Output of steps 5 & 6:

Points added to main hypertable -> 1462

Refresh job time to add these points to CAGG -> 19.059796 seconds

Points added to CAGG -> 1462

Conclusion :

Comparing the output of steps 3 & 4 with that of steps 5 & 6, the CAGG takes almost the same time to compute the aggregations even though there is a huge difference in data volume. This suggests that TimescaleDB refreshes the entire 4-year dataset regardless of how much data actually changed.

Questions :

  1. Is this how it is supposed to be?
  2. Does timescaledb just consider the time range and is not intelligent enough to recalculate aggregations only for those points that have changed?
  3. Are we missing something in our DB schema design, or is there some other configuration that is leading to this behaviour?
BilalS10

1 Answer


The expectation is that you incrementally load current data, not back-dated data.

It is not surprising that it behaves poorly in the test you show. You are using the tool contrary to its design.
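For one-off bulk backfills, a common workaround (assuming TimescaleDB 2.x) is to refresh the affected window manually once after loading, rather than relying on the frequent scheduled policy to cover old history. The window below matches the backfilled range in the question:

```sql
-- One-off manual refresh over an explicit window after a backfill
CALL refresh_continuous_aggregate('devices_data_summary_hourly',
    '2017-03-01', '2021-03-01');
```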

jjanes
  • So this is how it is supposed to work right? @jjanes for our use case which will see a lot of data backfilled, it is going to behave the way it has behaved in the tests I performed? – BilalS10 Mar 25 '21 at 07:12
  • The tool is expecting that most incoming data will be from `now`, with the occasional backfill of older data. If you're mostly uploading older data then it will behave the way your tests showed. It's possible that in the real world you'll be uploading less old data than your tests! – Topher Jul 22 '21 at 19:38