I want to remove spikes directly from my data stored in a PostgreSQL-DB with TimescaleDB.
My data is stored as values with an interval of 1 second, I want to get 5 minute averages calculated without spikes.
I determine spikes using standard deviation and excluding all data that is more than a fixed zscore away.
So in a first step I get all data relevant for my analysis (data_filtered), then I calculate average and standard deviation for each 5-minute chunk (avg_and_stddev_per_interval), then I join the initial data (data_filtered) with the calculated avg and stddev, exclude all values not meeting my criteria and finally calculate the final 5-minute average without spikes.
with data_filtered as (
select ts, value
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 10:00:00'
and ts < '2018-11-13 10:00:00'
),
avg_and_stddev_per_interval as (
select time_bucket('5 minutes', ts) as five_min,
avg(value) as avg_value,
stddev(value) as stddev_value,
from data_filtered
group by five_min
)
select
time_bucket('5 minutes', ts) as tb,
avg(value) as value,
from data_filtered
left join avg_and_stddev_per_interval
on data_filtered.ts >= avg_and_stddev_per_interval.five_min
and data_filtered.ts < avg_and_stddev_per_interval.five_min + interval '5 minutes'
where abs((value-avg_value)/stddev_value) < 1
group by tb;
It all works well, but it is incredibly slow. Requesting the full data without any grouping (select * from data_filtered
) and calculating my criteria locally is much faster. I want to reduce the data volume, however, so this approach is not possible in this case.
Is there any way to accelerate my query?