I have a table that follows roughly this schema:

Table Name: history
╔════╤══════╤══════════╤═════╤═════════════════════╗
║ id │ stat │ stat_two │ ... │ updated_at          ║
╠════╪══════╪══════════╪═════╪═════════════════════╣
║ 1  │ 100  │ 5        │ ... │ 2019-01-01 12:30 PM ║
╟────┼──────┼──────────┼─────┼─────────────────────╢
║ 1  │ 105  │ 7        │ ... │ 2019-01-02 12:30 PM ║
╟────┼──────┼──────────┼─────┼─────────────────────╢
║ 1  │ 300  │ 10       │ ... │ 2019-02-01 12:30 PM ║
╟────┼──────┼──────────┼─────┼─────────────────────╢
║ 1  │ 700  │ 20       │ ... │ 2019-05-01 12:30 PM ║
╟────┼──────┼──────────┼─────┼─────────────────────╢
║ 2  │ 50   │ 0        │ ... │ 2019-01-01 12:30 PM ║
╟────┼──────┼──────────┼─────┼─────────────────────╢
║ 2  │ 55   │ 0        │ ... │ 2019-01-02 12:30 PM ║
╟────┼──────┼──────────┼─────┼─────────────────────╢
║ 2  │ 75   │ 3        │ ... │ 2019-02-01 12:30 PM ║
╟────┼──────┼──────────┼─────┼─────────────────────╢
║ 2  │ 90   │ 7        │ ... │ 2019-05-01 12:30 PM ║
╚════╧══════╧══════════╧═════╧═════════════════════╝

The table is very large.

I am trying to produce the following result, while filtering to include only some IDs (for example, only 1 and 2):

╔═════════╤═══════════════════╤═══════════════════════════════════════════════╤═══════════════════════════════════════════════════╗
║ month   │ count_of_ids_seen │ sum_of_(last_seen_stat_for_that_month per ID) │ sum_of_(last_seen_stat_two_for_that_month per ID) ║
╠═════════╪═══════════════════╪═══════════════════════════════════════════════╪═══════════════════════════════════════════════════╣
║ 2019-01 │ 2                 │ 160                                           │ 7                                                 ║
╟─────────┼───────────────────┼───────────────────────────────────────────────┼───────────────────────────────────────────────────╢
║ 2019-02 │ 2                 │ 375                                           │ 13                                                ║
╟─────────┼───────────────────┼───────────────────────────────────────────────┼───────────────────────────────────────────────────╢
║ 2019-03 │ 2                 │ 375                                           │ 13                                                ║
╟─────────┼───────────────────┼───────────────────────────────────────────────┼───────────────────────────────────────────────────╢
║ 2019-04 │ 2                 │ 375                                           │ 13                                                ║
╟─────────┼───────────────────┼───────────────────────────────────────────────┼───────────────────────────────────────────────────╢
║ 2019-05 │ 2                 │ 790                                           │ 27                                                ║
╚═════════╧═══════════════════╧═══════════════════════════════════════════════╧═══════════════════════════════════════════════════╝

I've tried last_value window functions and can get the records that do appear, but the issue is that I need the data carried forward when an ID has no record in a given month. For month 3, for example, since there are no records, we should take the last seen record from before that month.

My current solution uses a <= join, which is the bottleneck: when run for millions of IDs it is far too slow and will not hit the speeds I need.

I was joining against a generate_series like so:

    FROM
        (SELECT month::date FROM generate_series('2018-03-01'::date, '2019-06-01'::date, '1 month') month) d
    LEFT JOIN
        history h
    ON date_trunc('month', h.updated_at) <= d.month
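
For context, the overall query is shaped roughly like this (a simplified sketch: the extra stat columns and the ID filter are left out, and DISTINCT ON stands in for the last_value step):

    SELECT month,
           count(id)          AS count_of_ids_seen,
           sum(stat)          AS sum_of_last_seen_stat,
           sum(stat_two)      AS sum_of_last_seen_stat_two
    FROM (
        -- For every month, keep each ID's most recent row on or before that month.
        SELECT DISTINCT ON (d.month, h.id)
               d.month, h.id, h.stat, h.stat_two
        FROM
            (SELECT month::date FROM generate_series('2018-03-01'::date, '2019-06-01'::date, '1 month') month) d
        LEFT JOIN
            history h
        ON date_trunc('month', h.updated_at) <= d.month
        ORDER BY d.month, h.id, h.updated_at DESC
    ) last_rows
    GROUP BY month
    ORDER BY month;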

Any ideas on how to do this more efficiently and remove the <= join? It is causing a nested loop, and the overhead becomes far too large.

Ivan

1 Answer

At Citus, we make use of rollup tables to create intermediate results for our queries on real-time data. You can calculate aggregates for each day (or perhaps each hour), and later use these intermediate values to calculate aggregates for months.

This solution will not remove your need to use joins, but the computation cost will be considerably smaller.
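
For example, a daily rollup along these lines keeps only one row per ID per day (just a sketch; the history_rollup_daily table and its column names are illustrative):

    -- Illustrative daily rollup: one row per (id, day) holding the last
    -- stat values seen on that day.
    CREATE TABLE history_rollup_daily (
        id            bigint NOT NULL,
        day           date   NOT NULL,
        last_stat     bigint,
        last_stat_two bigint,
        PRIMARY KEY (id, day)
    );

    -- Populate (or periodically refresh) the rollup from the raw table;
    -- DISTINCT ON keeps the most recent row per ID per day.
    INSERT INTO history_rollup_daily (id, day, last_stat, last_stat_two)
    SELECT DISTINCT ON (h.id, h.updated_at::date)
           h.id, h.updated_at::date, h.stat, h.stat_two
    FROM history h
    ORDER BY h.id, h.updated_at::date, h.updated_at DESC
    ON CONFLICT (id, day) DO UPDATE
        SET last_stat     = EXCLUDED.last_stat,
            last_stat_two = EXCLUDED.last_stat_two;

The monthly carry-forward query can then run against this much smaller table instead of the raw history, e.g.:

    -- Same pattern as before, but the <= join now scans the compact
    -- per-day rollup rather than every row of history.
    SELECT month,
           count(id)          AS count_of_ids_seen,
           sum(last_stat)     AS sum_of_last_seen_stat,
           sum(last_stat_two) AS sum_of_last_seen_stat_two
    FROM (
        SELECT DISTINCT ON (d.month, r.id)
               d.month, r.id, r.last_stat, r.last_stat_two
        FROM
            (SELECT month::date FROM generate_series('2018-03-01'::date, '2019-06-01'::date, '1 month') month) d
        LEFT JOIN
            history_rollup_daily r
        ON date_trunc('month', r.day) <= d.month
        ORDER BY d.month, r.id, r.day DESC
    ) last_rows
    GROUP BY month
    ORDER BY month;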

You can see our docs for our usage of rollup tables. Even if you do not distribute your tables with Citus, the information there can guide you.

Hanefi