I am collecting and graphing data and one of the things I need to do is calculate an exponential moving average for some of my data. I have my data stored in postgres.
Based on another Stack Overflow question I read (How to calculate an exponential moving average on postgres?), I have the following function:
CREATE OR REPLACE FUNCTION ema_func(
    state double precision,
    inval double precision,
    alpha double precision)
RETURNS double precision AS
$BODY$
begin
    return case
        when state is null then inval
        else alpha * inval + (1 - alpha) * state
    end;
end
$BODY$
LANGUAGE plpgsql VOLATILE;
which I then use as an aggregate, like so, to put it all together:
CREATE AGGREGATE ema(double precision, double precision) (
SFUNC=ema_func,
STYPE=float8
);
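For reference, the recurrence this aggregate folds over the rows can be mirrored in plain Python (a hypothetical helper for illustration, not part of the Postgres code):

```python
def ema_step(state, inval, alpha):
    # Mirrors ema_func: seed with the first value, then blend new
    # values into the running state.
    if state is None:
        return inval
    return alpha * inval + (1 - alpha) * state

# Folding a series by hand reproduces what
# ema(value, alpha) OVER (ORDER BY datetime ASC) computes.
values = [10.0, 11.0, 12.0]
state = None
for v in values:
    state = ema_step(state, v, 0.5)
print(state)  # 11.25
```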
I am graphing stock information so for a given day I have about 7000-8000 pieces of data. I don't need all of that information to graph the data (depending on my window settings, 1 pixel might be worth somewhere around 60 seconds) so I want to pull snapshots of my data every nth second. I wrote this function to do that for me and it has saved me some time.
CREATE OR REPLACE FUNCTION emasnapshots(
    ptable varchar,
    timestart timestamptz,
    timeend timestamptz,
    duration double precision,
    psymbol varchar,
    alpha double precision)
RETURNS setof timevalue AS
$BODY$
DECLARE
    localstart timestamptz;
    localend timestamptz;
    timevalues timevalue%rowtype;
    groups int := ceil((SELECT EXTRACT(EPOCH FROM (timeend - timestart))) / duration);
BEGIN
    EXECUTE 'CREATE TEMP TABLE allemas ON COMMIT DROP AS '
        || 'SELECT datetime, ema(value, ' || quote_literal(alpha)
        || ') OVER (ORDER BY datetime ASC) FROM ' || quote_ident(ptable)
        || ' WHERE symbol = ' || quote_literal(psymbol)
        || ' AND datetime >= ' || quote_literal(timestart)
        || ' AND datetime <= ' || quote_literal(timeend);
    FOR i IN 1 .. groups LOOP
        localstart := timestart + (duration * (i - 1) * interval '1 second');
        localend := timestart + (duration * i * interval '1 second');
        EXECUTE 'SELECT * FROM allemas WHERE datetime >= ' || quote_literal(localstart)
            || ' AND datetime <= ' || quote_literal(localend)
            || ' ORDER BY datetime DESC LIMIT 1' INTO timevalues;
        RETURN NEXT timevalues;
    END LOOP;
    RETURN;
END
$BODY$
LANGUAGE plpgsql VOLATILE;
Running just my EMA with
select datetime::timestamptz, ema(value, 0.0952380952380952380952380952381 /* alpha */) over (order by datetime asc) from "4" where symbol = 'AAPL' and datetime >= '2015-07-01 7:30' and datetime <= '2015-07-01 14:00:00'
takes about 1.5 seconds to collect all of the data (7733 rows) and push it across the internet (my data is in another state).
Running the emasnapshot function I wrote with
select start, average from emasnapshots ('4', '2015-07-01 9:30-4', '2015-07-01 16:00-4', 60, 'AAPL', 0.0952380952380952380952380952381);
takes about 0.5 seconds to gather all of the data and push it across the internet (390 rows). For clarity: I am pulling from table "4" for July 1st during stock market hours, and I want snapshots every 60 seconds. The last number is my alpha, which means I am calculating the 20-period EMA (alpha = 2/(period + 1)).
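As a sanity check on that alpha (a throwaway snippet, not part of the database code), the standard N-period conversion gives exactly the long constant above:

```python
# alpha = 2 / (period + 1); for a 20-period EMA this yields the
# constant passed to ema() in the queries above.
period = 20
alpha = 2 / (period + 1)
print(alpha)  # 0.09523809523809523
```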
My question is: am I doing this the fastest way possible? Is there a way to tell which part of my function is the slower part, i.e. is it the temp table creation or the snapshot-grabbing part? Should I be selecting the most recent date in an interval a different way? Should I be selecting the latest time in each interval from my original table (which is indexed on time) and joining that with my newly created table?
I just started writing postgres functions about a week ago. I realize that my newly created table is not indexed and so it might take a bit longer to do date-related stuff like I am asking it to do. Is there a way around this? I am dealing with lots of days worth of data with lots of different symbols so I am not sure that creating ema tables for all possibilities is a good idea. I don't want to suck all of the data down and do processing locally because if the graphing software has multiple days open, that could easily encompass 35,000 lines that would have to be transferred then processed.
Btw, I don't think it is indexing speed or anything like that because I can run:
select * from "4" where symbol = 'AAPL' and datetime >= '2015-07-01 07:30' and datetime <= '2015-07-01 14:00' order by datetime asc limit 450
and get a response in under 150ms over the internet. Obviously there is way less processing in this though.
Thank you so much for your time!
EDITED BASED ON PATRICK'S ANSWER.
I now have the query below which I modified from what Patrick said:
SELECT datetime, ema FROM (
SELECT datetime, ema, rank() OVER (PARTITION BY bucket ORDER BY datetime DESC) = 1 as rank
FROM (
SELECT datetime, ema(value, 0.0952380952380952380952380952381) OVER (ORDER BY datetime ASC) AS ema,
ceil(extract(epoch from (datetime - '2015-07-01 7:30')) / 60) AS bucket
FROM "4"
WHERE symbol = 'AAPL'
AND datetime BETWEEN '2015-07-01 7:30' AND '2015-07-01 14:00' ) x ) y
WHERE rank = true;
I was getting an error that I couldn't put rank() in a WHERE clause, so I split it into a separate select statement. Did I do this right? Having three select statements feels odd to me, but I am an SQL newbie trying to learn as I go, so maybe that isn't so bad.
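The bucket-then-rank idea in that query can be sketched outside SQL. This is a standalone illustration with made-up timestamps, assuming 60-second buckets measured from the window start:

```python
import math
from datetime import datetime, timedelta

start = datetime(2015, 7, 1, 7, 30)
# Made-up sample times: seconds after the window start.
rows = [start + timedelta(seconds=s) for s in (10, 50, 70, 130, 170)]

def bucket(ts):
    # Same as ceil(extract(epoch from (datetime - start)) / 60) in the query.
    return math.ceil((ts - start).total_seconds() / 60)

# Keep only the latest row per bucket, like
# rank() OVER (PARTITION BY bucket ORDER BY datetime DESC) = 1.
latest = {}
for ts in rows:
    b = bucket(ts)
    if b not in latest or ts > latest[b]:
        latest[b] = ts
print(sorted(latest.values()))
```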
My explain statement for the above query looks like this.
Subquery Scan on y (cost=6423.35..6687.34 rows=4062 width=16)
Filter: y.rank
-> WindowAgg (cost=6423.35..6606.11 rows=8123 width=24)
-> Sort (cost=6423.35..6443.65 rows=8123 width=24)
Sort Key: x.bucket, x.datetime
-> Subquery Scan on x (cost=5591.23..5895.85 rows=8123 width=24)
-> WindowAgg (cost=5591.23..5814.62 rows=8123 width=16)
-> Sort (cost=5591.23..5611.54 rows=8123 width=16)
Sort Key: "4".datetime
-> Bitmap Heap Scan on "4" (cost=359.99..5063.74 rows=8123 width=16)
Recheck Cond: (((symbol)::text = 'AAPL'::text) AND (datetime >= '2015-07-01 07:30:00-06'::timestamp with time zone) AND (datetime <= '2015-07-01 14:00:00-06'::timestamp with time zone))
-> Bitmap Index Scan on "4_pkey" (cost=0.00..357.96 rows=8123 width=0)
Index Cond: (((symbol)::text = 'AAPL'::text) AND (datetime >= '2015-07-01 07:30:00-06'::timestamp with time zone) AND (datetime <= '2015-07-01 14:00:00-06'::timestamp with time zone))