3

I'm trying to come up with a database design, and I can't wrap my head around a proper setup. Given my requirements, I thought about TimescaleDB and Postgis, but I quickly hit some issues with continuous aggregates.

Inputs: batches of 10 million rows per day with a timestamp, a Person ID, and a Postgis location. Let's say this represents people at specific points in time.

Outputs: the centroid of the position of each person each day.

My initial solution was:

1 hypertable for the input

CREATE TABLE datapoints(
    person_id char(40),
    timestamp TIMESTAMPTZ NOT NULL,
    location geometry(point, 4326)
);

SELECT create_hypertable('datapoints', 'timestamp', chunk_time_interval => INTERVAL '1 day');

1 continuous aggregate using ST_Centroid

CREATE VIEW daily_centroid WITH
(timescaledb.continuous, timescaledb.refresh_interval = '1 day')
AS
SELECT
    person_id,
    time_bucket(INTERVAL '1 day', timestamp) as date,
    ST_Centroid(ST_Collect(location)) AS user_centroid
FROM datapoints
GROUP BY person_id, time_bucket(INTERVAL '1 day', timestamp);

However, doing this results in an error

ERROR:  aggregates which are not parallelizable are not supported by continuous aggregate query

Am I incorrectly assessing TimescaleDB as a good fit for this? If this doesn't work, I'm thinking I will write some scripts:

  • to create one table per day for the input
  • to create one materialized view per day with just (person_id, centroid)
  • to import data to the corresponding table
  • to refresh the materialized views

This is similar to the approach above but both the table and the view have N versions, one for each day. I thought I would use TimescaleDB to avoid writing that automation and allowing it to scale to new requirements. e.g it's as easy as writing a view and I don't have to teach other people my automation system.

Would triggers be a better alternative to implement this automation transparently if this doesn't fit the TimescaleDB's scope?

Thanks!!

Martín Coll
  • 3,368
  • 3
  • 37
  • 52
  • Continuous aggregates require aggregate functions, which support partial aggregation. I can see that `ST_collect` doesn't provide `combinefunc` and thus doesn't support partial aggregation. If TimescaleDB is good fit or not, depends if you have any other timeseries related need. You can always define a materialised view on top of hypertables. – k_rus May 06 '20 at 09:07
  • Would it be easy to add a `combinefunc`? I'm thinking something like `combinefunc(a, b) = st_collect(st_dump(a), st_dump(b))`. I would also say our timeseries needs are targeted towards aggregating and analyzing this input data. We'll probably want to have multiple views to support a couple of other queries. – Martín Coll May 06 '20 at 13:27
  • I guess you can try this by creating new aggregate with combinefunc defined, and may be later submit a contribution to postgis. – k_rus May 06 '20 at 14:00
  • The other thing you can do is simply run some of the aggregations in a cron job and move to another table, we're looking at making some of this functionality available more within Timescale in future versions as well. – davidk May 06 '20 at 14:39
  • @davidk that would be awesome. By the way: I found `st_memcollect` does implement `combinefunc`! I'll test that and report soon :) – Martín Coll May 06 '20 at 16:40
  • Is the aggregate marked as parallel safe? – davidk May 06 '20 at 19:12
  • that one is: https://github.com/postgis/postgis/blob/b28cfb9b923d2190877065fc6dd47a1a6bd961bb/postgis/postgis.sql.in#L3880 – Martín Coll May 06 '20 at 19:47

0 Answers0