
I want to remove spikes directly from my data, which is stored in a PostgreSQL database with TimescaleDB.

My data is stored as values at 1-second intervals, and I want to calculate 5-minute averages with the spikes removed.

I identify spikes using the standard deviation, excluding all data whose z-score exceeds a fixed threshold.

So in a first step I get all the data relevant for my analysis (data_filtered). Then I calculate the average and standard deviation for each 5-minute bucket (avg_and_stddev_per_interval), join the initial data (data_filtered) with the calculated avg and stddev, exclude all values not meeting my criterion, and finally calculate the 5-minute average without spikes.

with data_filtered as (
    select ts, value
    from schema.table 
    where some_criteria = 42 
    and ts >= '2018-11-12 10:00:00'
    and ts < '2018-11-13 10:00:00'
), 
avg_and_stddev_per_interval as (
    select time_bucket('5 minutes', ts) as five_min,
    avg(value) as avg_value,
    stddev(value) as stddev_value
    from data_filtered
    group by five_min   
)
select 
    time_bucket('5 minutes', ts) as tb,
    avg(value) as value
from data_filtered
left join avg_and_stddev_per_interval 
    on data_filtered.ts >= avg_and_stddev_per_interval.five_min 
    and data_filtered.ts < avg_and_stddev_per_interval.five_min + interval '5 minutes'
    where abs((value-avg_value)/stddev_value) < 1 
    group by tb;

It all works well, but it is incredibly slow. Requesting the full data without any grouping (select * from data_filtered) and calculating my criterion locally is much faster. However, I want to reduce the data volume that is transferred, so that approach is not an option here.

Is there any way to accelerate my query?

Tom
    Basically you need to get rid of the CTEs. Once you select data into a CTE, from that point there are no more indexes when selecting on those results. Can you not perform the CTE calculations as part of the "main" query on the actual table? E.g. do you really need "data_filtered", can't you just put those conditions in the WHERE clause of the main query? – 404 Nov 13 '18 at 12:23

3 Answers


eurotrash's comment led to much faster code, as follows:

select 
    time_bucket('5 minutes', ts) as tb, avg(value) as value
from schema.table   
left join (
    select time_bucket('5 minutes', ts) as five_min,
        avg(value) as avg_value,
        stddev(value) as stddev_value
        from schema.table
        where some_criteria = 42
        and ts >= '2018-11-12 00:00:00'
        and ts < '2018-11-13 00:00:00'
        group by five_min
    ) as fm
    on ts >= fm.five_min 
    and ts < fm.five_min + interval '5 minutes'         
where some_criteria = 42
    and ts >= '2018-11-12 00:00:00'
    and ts < '2018-11-13 00:00:00'
    and abs((value-avg_value)/stddev_value) < 1     
group by tb;

Here I got rid of the CTEs, which were there only for readability anyhow.

This is still 8 times slower than just requesting averaged values without spike removal, though.
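
For reference, the plain averaging I compare against is essentially just the bucketing without the join and without the z-score filter (a sketch, not my exact query):

-- baseline: 5-minute averages with the same filters, no spike removal
select time_bucket('5 minutes', ts) as tb, avg(value) as value
from schema.table
where some_criteria = 42
    and ts >= '2018-11-12 00:00:00'
    and ts < '2018-11-13 00:00:00'
group by tb;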

Explain analyze of the query with spike removal:

Sort  (cost=844212.16..844212.66 rows=200 width=80) (actual time=24090.495..24090.572 rows=288 loops=1)
  Sort Key: (date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts)))
  Sort Method: quicksort  Memory: 65kB
  ->  HashAggregate  (cost=844200.01..844204.51 rows=200 width=80) (actual time=24089.175..24089.822 rows=288 loops=1)
        Group Key: date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts))
        ->  Nested Loop  (cost=48033.56..838525.89 rows=226965 width=32) (actual time=792.374..23747.480 rows=79166 loops=1)
              Join Filter: ((data.ts >= fm.five_min) AND (data.ts < (fm.five_min + '00:05:00'::interval)) AND (abs(((data.angle_x - fm.avg_angle_x) / fm.stddev_angle_x)) < '2'::double precision) AND (abs(((data.angle_y - fm.avg_angle_y) / fm.stddev_angle_y)) < '2'::double precision))
              Rows Removed by Join Filter: 24770914
              ->  Append  (cost=0.00..53976.50 rows=91921 width=32) (actual time=0.276..1264.179 rows=86285 loops=1)
                    ->  Seq Scan on data  (cost=0.00..0.00 rows=1 width=32) (actual time=0.027..0.027 rows=0 loops=1)
                          Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
                    ->  Index Scan using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk  (cost=0.43..53976.50 rows=91920 width=32) (actual time=0.243..1228.940 rows=86285 loops=1)
                          Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
                          Filter: (node_id = 8)
                          Rows Removed by Filter: 949135
              ->  Materialize  (cost=48033.56..48047.06 rows=200 width=40) (actual time=0.010..0.083 rows=288 loops=86285)
                    ->  Subquery Scan on fm  (cost=48033.56..48046.06 rows=200 width=40) (actual time=787.756..791.299 rows=288 loops=1)
                          ->  Finalize GroupAggregate  (cost=48033.56..48044.06 rows=200 width=40) (actual time=787.750..791.071 rows=288 loops=1)
                                Group Key: (time_bucket('00:05:00'::interval, data_1.ts))
                                ->  Sort  (cost=48033.56..48034.56 rows=400 width=136) (actual time=787.680..788.049 rows=853 loops=1)
                                      Sort Key: (time_bucket('00:05:00'::interval, data_1.ts))
                                      Sort Method: quicksort  Memory: 251kB
                                      ->  Gather  (cost=47973.77..48016.27 rows=400 width=136) (actual time=783.341..785.774 rows=853 loops=1)
                                            Workers Planned: 2
                                            Workers Launched: 2
                                            ->  Partial HashAggregate  (cost=46973.77..46976.27 rows=200 width=136) (actual time=758.173..759.378 rows=284 loops=3)
                                                  Group Key: time_bucket('00:05:00'::interval, data_1.ts)
                                                  ->  Result  (cost=0.00..46495.01 rows=38301 width=24) (actual time=0.136..676.873 rows=28762 loops=3)
                                                        ->  Append  (cost=0.00..46016.25 rows=38301 width=24) (actual time=0.131..644.540 rows=28762 loops=3)
                                                              ->  Parallel Seq Scan on data data_1  (cost=0.00..0.00 rows=1 width=24) (actual time=0.003..0.003 rows=0 loops=3)
                                                                    Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
                                                              ->  Parallel Index Scan Backward using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk _hyper_2_22_chunk_1  (cost=0.43..46016.25 rows=38300 width=24) (actual time=0.126..630.920 rows=28762 loops=3)
                                                                    Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
                                                                    Filter: (node_id = 8)
                                                                    Rows Removed by Filter: 316378
Planning time: 17.704 ms
Execution time: 24093.223 ms
Tom
  • Can you post an `EXPLAIN ANALYZE` so we can see how it's being executed? – 404 Nov 13 '18 at 14:44
  • Added explain analyze. – Tom Nov 13 '18 at 15:42
  • Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast. – 404 Nov 13 '18 at 16:15
  • The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem. – Tom Nov 13 '18 at 16:30
  • OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment. – 404 Nov 13 '18 at 16:33
  • @tom: `Parallel Index Scan Backward using _hyper_2_22_chunk_data_ts_idx` is the big consumer here. Maybe the parallel is slowing it down (crabbing locks???) – joop Nov 13 '18 at 17:55
  • Unfortunately not. Setting max_parallel_workers to 0 and therefore forcing a normal index scan leads to slightly worse query times. – Tom Nov 15 '18 at 11:05

The simplest way is to replace the CTE parts with (temp) views. This allows the optimiser to shuffle and re-assemble the query parts.


CREATE TEMP VIEW data_filtered as
    SELECT ts, value
    FROM schema.table
    WHERE some_criteria = 42
    AND ts >= '2018-11-12 10:00:00'
    AND ts < '2018-11-13 10:00:00'
        ;

CREATE TEMP VIEW avg_and_stddev_per_interval as
    SELECT time_bucket('5 minutes', ts) as five_min
    , avg(value) as avg_value
    , stddev(value) as stddev_value
    FROM data_filtered
    GROUP BY 1
        ;

SELECT
    time_bucket('5 minutes', ts) as tb
    , avg(value) as value
FROM data_filtered df
LEFT JOIN avg_and_stddev_per_interval  av
    ON df.ts >= av.five_min
    AND df.ts < av.five_min + interval '5 minutes'
    WHERE abs((value-avg_value)/stddev_value) < 1
    GROUP BY 1
        ;
joop
  • This is exactly as fast/slow as the solution without any CTEs/temp views. – Tom Nov 13 '18 at 15:36
  • In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clause `WHERE abs((value-avg_value)/stddev_value) < 1` could be avoided. – joop Nov 13 '18 at 15:48
  • This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc. – Tom Nov 13 '18 at 16:12
  • I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape. – joop Nov 13 '18 at 17:40
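
A rough sketch of the indexed calendar-table idea from the last comment, with illustrative table and column names (the spike-removal filter and any window functions are left out here):

-- hypothetical sketch: pre-built, indexed 5-minute bins joined to the raw data
CREATE TEMP TABLE bins AS
SELECT gs AS bin_start, gs + interval '5 minutes' AS bin_end
FROM generate_series('2018-11-12 00:00:00'::timestamptz,
                     '2018-11-13 00:00:00'::timestamptz - interval '5 minutes',
                     interval '5 minutes') AS gs;

CREATE INDEX ON bins (bin_start, bin_end);

SELECT b.bin_start, avg(d.value) AS value
FROM bins b
JOIN schema.table d
    ON d.ts >= b.bin_start AND d.ts < b.bin_end
WHERE d.some_criteria = 42
GROUP BY b.bin_start;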

It appears the worst of the performance is happening in the JOIN (as per the query in your answer, not in your question). Ideally you wouldn't be joining on a subquery when it returns a lot of results, but I don't see how you can avoid it given your criteria.

So this is my suggestion:

  1. The subquery results are put into a temp table
  2. The temp table is indexed
  3. The join is performed on the temp table
  4. Encapsulate all this in a function

Now I generally hate doing this as I don't like creating temp tables, but sometimes it really does give you the best performance for something that can't be done any other way. (Not saying it can't be done another way, but I can't think of a more performant one.)

So something like this:

CREATE OR REPLACE FUNCTION schema.my_function()
    RETURNS TABLE (tb SOMETYPE, avg NUMERIC) AS
$BODY$
BEGIN
    CREATE TEMP TABLE fm ON COMMIT DROP AS
        select time_bucket('5 minutes', ts) as five_min,
            avg(value) as avg_value,
            stddev(value) as stddev_value
        from schema.table
        where some_criteria = 42
        and ts >= '2018-11-12 00:00:00'
        and ts < '2018-11-13 00:00:00'
        group by five_min;

    CREATE INDEX ON fm (five_min);

    RETURN QUERY select time_bucket('5 minutes', ts), avg(value)
    from schema.table
    left join fm
        on ts >= fm.five_min 
        and ts < fm.five_min + interval '5 minutes'
    where some_criteria = 42
    and ts >= '2018-11-12 00:00:00'
    and ts < '2018-11-13 00:00:00'
    and abs((value-avg_value)/stddev_value) < 1     
    group by 1;
END
$BODY$
    LANGUAGE plpgsql;

Obviously the index I created is simply based on the example in the query you posted, though I see the actual query contains other things, so you'd want to index any fields being joined on.

I called tb's type SOMETYPE because I don't know what type time_bucket returns. And of course you could pass any parts of the query that should be variable as parameters.
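
For instance, parameterising the criterion, the time range and the z-score threshold could look roughly like this (parameter names and types are illustrative guesses, and SOMETYPE is still the placeholder from above):

CREATE OR REPLACE FUNCTION schema.my_function(
    p_criteria integer,              -- value for some_criteria (type is a guess)
    p_from timestamptz,
    p_to timestamptz,
    p_zscore double precision DEFAULT 1)
    RETURNS TABLE (tb SOMETYPE, avg NUMERIC) AS
$BODY$
BEGIN
    -- per-bucket average and stddev over the requested range
    CREATE TEMP TABLE fm ON COMMIT DROP AS
        select time_bucket('5 minutes', ts) as five_min,
            avg(value) as avg_value,
            stddev(value) as stddev_value
        from schema.table
        where some_criteria = p_criteria
        and ts >= p_from
        and ts < p_to
        group by five_min;

    CREATE INDEX ON fm (five_min);

    -- final averages with the z-score filter applied
    RETURN QUERY select time_bucket('5 minutes', ts), avg(value)
    from schema.table
    left join fm
        on ts >= fm.five_min
        and ts < fm.five_min + interval '5 minutes'
    where some_criteria = p_criteria
    and ts >= p_from
    and ts < p_to
    and abs((value-avg_value)/stddev_value) < p_zscore
    group by 1;
END
$BODY$
    LANGUAGE plpgsql;

It could then be called with something like select * from schema.my_function(42, '2018-11-12 00:00:00', '2018-11-13 00:00:00');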

404
  • Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow – Tom Nov 13 '18 at 17:20
  • Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join *should* be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience. – 404 Nov 13 '18 at 17:25
  • The temp table & index creations don't take long at all, it is really just the bare query that takes longer. – Tom Nov 13 '18 at 17:52