I'm trying to come up with a database design, and I can't wrap my head around a proper setup. Given my requirements, I thought TimescaleDB and PostGIS would be a good fit, but I quickly hit issues with continuous aggregates.
Inputs: batches of 10 million rows per day, each with a timestamp, a person ID, and a PostGIS location. Let's say this represents people at specific points in time.
Outputs: the centroid of the position of each person each day.
My initial solution was:
1 hypertable for the input
CREATE TABLE datapoints(
    person_id char(40),
    timestamp TIMESTAMPTZ NOT NULL,
    location geometry(point, 4326)
);
SELECT create_hypertable('datapoints', 'timestamp', chunk_time_interval => INTERVAL '1 day');
1 continuous aggregate using ST_Centroid
CREATE VIEW daily_centroid WITH
(timescaledb.continuous, timescaledb.refresh_interval = '1 day')
AS
SELECT
    person_id,
    time_bucket(INTERVAL '1 day', timestamp) AS date,
    ST_Centroid(ST_Collect(location)) AS user_centroid
FROM datapoints
GROUP BY person_id, time_bucket(INTERVAL '1 day', timestamp);
However, doing this results in an error:
ERROR: aggregates which are not parallelizable are not supported by continuous aggregate query
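The error seems to come from ST_Collect not being a parallelizable aggregate. Since the centroid of a set of points is just the mean of their coordinates, I was wondering whether I could work around it with plain AVG, which is parallelizable. A sketch of what I mean (this uses the newer `CREATE MATERIALIZED VIEW ... WITH (timescaledb.continuous)` syntax, and `daily_centroid_raw` is just an illustrative name; I haven't verified which versions allow this):

```sql
-- Workaround sketch: replace ST_Centroid(ST_Collect(...)) with the
-- coordinate means, using only the parallelizable AVG aggregate.
CREATE MATERIALIZED VIEW daily_centroid_raw
WITH (timescaledb.continuous) AS
SELECT
    person_id,
    time_bucket(INTERVAL '1 day', timestamp) AS date,
    AVG(ST_X(location)) AS avg_x,
    AVG(ST_Y(location)) AS avg_y
FROM datapoints
GROUP BY person_id, time_bucket(INTERVAL '1 day', timestamp);

-- Rebuild the geometry in an ordinary view on top, in case the
-- continuous aggregate rejects expressions wrapped around aggregates.
CREATE VIEW daily_centroid AS
SELECT person_id,
       date,
       ST_SetSRID(ST_MakePoint(avg_x, avg_y), 4326) AS user_centroid
FROM daily_centroid_raw;
```

I'm not sure this is idiomatic, though, or whether it holds up for my use case.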
Am I incorrectly assessing TimescaleDB as a good fit for this? If this doesn't work, I'm thinking I will write some scripts:
- to create one table per day for the input
- to create one materialized view per day with just
(person_id, centroid)
- to import data to the corresponding table
- to refresh the materialized views
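Concretely, each day's scripted output would boil down to plain PostgreSQL/PostGIS along these lines (the dated names are illustrative, not my actual schema):

```sql
-- One input table per day
CREATE TABLE datapoints_20240101 (
    person_id char(40),
    timestamp TIMESTAMPTZ NOT NULL,
    location geometry(point, 4326)
);

-- One materialized view per day with just (person_id, centroid)
CREATE MATERIALIZED VIEW daily_centroid_20240101 AS
SELECT person_id,
       ST_Centroid(ST_Collect(location)) AS centroid
FROM datapoints_20240101
GROUP BY person_id;

-- Run after each day's batch import
REFRESH MATERIALIZED VIEW daily_centroid_20240101;
```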
This is similar to the approach above, except that both the table and the view have N versions, one per day. I wanted to use TimescaleDB to avoid writing that automation and to let it scale to new requirements: e.g., adding an output is as easy as writing a view, and I don't have to teach other people my automation system.
Would triggers be a better alternative for implementing this automation transparently, if this doesn't fit within TimescaleDB's scope?
Thanks!!