I am collecting and graphing data, and one of the things I need to do is calculate an exponential moving average for some of it. My data is stored in Postgres.

Based on another Stack Overflow page I read (How to calculate an exponential moving average on postgres?), I have the following function:

CREATE OR REPLACE FUNCTION ema_func(
state double precision,
inval double precision,
alpha double precision)
RETURNS double precision AS
$BODY$
begin
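  -- the first value seeds the EMA; after that: alpha * new + (1 - alpha) * previous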
  return case
     when state is null then inval
     else alpha * inval + (1-alpha) * state
     end;
end
$BODY$
LANGUAGE plpgsql VOLATILE;

which I am then using as an aggregate like so to put it all together:

CREATE AGGREGATE ema(double precision, double precision) (
  SFUNC=ema_func,
  STYPE=float8
);
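
As a quick sanity check (with made-up numbers): running the aggregate as a window function with alpha = 0.5 over the values 1, 2, 3 should yield 1, 1.5, 2.25.

SELECT t, ema(v, 0.5) OVER (ORDER BY t) AS ema
FROM (VALUES (1, 1.0::float8), (2, 2.0), (3, 3.0)) AS samples(t, v);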

I am graphing stock information, so for a given day I have about 7000-8000 data points. I don't need all of that information to graph the data (depending on my window settings, 1 pixel might be worth around 60 seconds), so I want to pull snapshots of my data every nth second. I wrote this function to do that for me, and it has saved me some time.

CREATE OR REPLACE FUNCTION emasnapshots(
    ptable varchar,
    timestart timestamptz,
    timeend timestamptz,
    duration double precision,
    psymbol varchar,
    alpha double precision)
    returns setof timevalue as
$BODY$
    DECLARE
        localstart timestamptz;
        localend timestamptz;
        timevalues timevalue%rowtype;
        groups int := ceil((SELECT EXTRACT(EPOCH FROM (timeend - timestart))) / duration);
    BEGIN
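        -- compute the running EMA once for the entire requested range into a temp table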
        EXECUTE 'CREATE TEMP TABLE allemas ON COMMIT DROP AS select datetime, ema(value, ' || quote_literal(alpha) || ') over (order by datetime asc) from ' || quote_ident(ptable) || ' where symbol = ' || quote_literal(psymbol) || ' and datetime >= ' || quote_literal(timestart) || ' and datetime <= ' || quote_literal(timeend);
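        -- then sample the most recent row from each duration-second interval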
        FOR i in 1 .. groups LOOP
            localStart := timestart + (duration * (i - 1) * interval '1 second');
            localEnd := timestart + (duration * i * interval '1 second');
            EXECUTE 'select * from allemas where datetime >= ' || quote_literal(localstart) || ' and datetime <= ' || quote_literal(localend) || ' order by datetime desc limit 1' into timevalues;
            return next timevalues;
        end loop;
        return;
    END
$BODY$
LANGUAGE plpgsql VOLATILE;

Running just my EMA with

select datetime::timestamptz, ema(value, 0.0952380952380952380952380952381 /* alpha */) over (order by datetime asc) from "4" where symbol = 'AAPL' and datetime >= '2015-07-01 7:30' and datetime <= '2015-07-01 14:00:00'

takes about 1.5 seconds to collect all of the data (7733 rows) and push it across the internet (my data is in another state).

Running the emasnapshot function I wrote with

select start, average from emasnapshots ('4', '2015-07-01 9:30-4', '2015-07-01 16:00-4', 60, 'AAPL', 0.0952380952380952380952380952381);

takes about 0.5 seconds to gather all of the data and push it across the internet (390 rows). For clarity: I am pulling from table "4" for July 1st during stock market hours, and I want snapshots every 60 seconds. The last number is my alpha; it means I am calculating the 20-second EMA (alpha = 2/(period + 1) = 2/21 ≈ 0.0952).

My question is: am I doing this the fastest way possible? Is there a way to tell which part of my function is the slow part, e.g. whether it is the temp table creation or the snapshot grabbing? Should I be selecting the most recent date in an interval a different way? Should I be selecting the latest time in my interval from my original table (which is indexed on time) and joining that with my newly created table?

I just started writing Postgres functions about a week ago. I realize that my newly created table is not indexed, so the date-related work I am asking it to do might take a bit longer. Is there a way around this? I am dealing with many days' worth of data for many different symbols, so I am not sure that creating EMA tables for all possibilities is a good idea. I don't want to pull all of the data down and process it locally, because if the graphing software has multiple days open, that could easily mean 35,000 rows transferred and then processed.
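
For what it's worth, I gather the temp table could be indexed right after it is created, something like this sketch (ANALYZE just refreshes statistics so the planner knows about the new table):

CREATE INDEX ON allemas (datetime);
ANALYZE allemas;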

Btw, I don't think it is indexing speed or anything like that, because I can run:

select * from "4" where symbol = 'AAPL' and datetime >= '2015-07-01 07:30' and datetime <= '2015-07-01 14:00' order by datetime asc limit 450

and get a response in under 150 ms over the internet. Obviously there is far less processing involved in this, though.

Thank you so much for your time!

EDITED BASED ON PATRICK'S ANSWER.

I now have the query below, which I modified from Patrick's answer:

SELECT datetime, ema FROM (
    SELECT datetime, ema, rank() OVER (PARTITION BY bucket ORDER BY datetime DESC) = 1 as rank
        FROM (
        SELECT datetime, ema(value, 0.0952380952380952380952380952381) OVER (ORDER BY datetime ASC) AS ema, 
            ceil(extract(epoch from (datetime - '2015-07-01 7:30')) / 60) AS bucket
        FROM "4" 
        WHERE symbol = 'AAPL' 
        AND datetime BETWEEN '2015-07-01 7:30' AND '2015-07-01 14:00' ) x ) y
WHERE rank = true;

I did it this way because I was getting an error that I couldn't put a rank() expression in a WHERE clause, so I split it into a separate SELECT statement. Did I do this right? Having three SELECT statements feels odd to me, but I am an SQL newbie trying to learn as I go, so maybe that isn't so bad.

The EXPLAIN output for the above query looks like this:

Subquery Scan on y  (cost=6423.35..6687.34 rows=4062 width=16)
  Filter: y.rank
  ->  WindowAgg  (cost=6423.35..6606.11 rows=8123 width=24)
        ->  Sort  (cost=6423.35..6443.65 rows=8123 width=24)
              Sort Key: x.bucket, x.datetime
              ->  Subquery Scan on x  (cost=5591.23..5895.85 rows=8123 width=24)
                    ->  WindowAgg  (cost=5591.23..5814.62 rows=8123 width=16)
                          ->  Sort  (cost=5591.23..5611.54 rows=8123 width=16)
                                Sort Key: "4".datetime
                                ->  Bitmap Heap Scan on "4"  (cost=359.99..5063.74 rows=8123 width=16)
                                      Recheck Cond: (((symbol)::text = 'AAPL'::text) AND (datetime >= '2015-07-01 07:30:00-06'::timestamp with time zone) AND (datetime <= '2015-07-01 14:00:00-06'::timestamp with time zone))
                                      ->  Bitmap Index Scan on "4_pkey"  (cost=0.00..357.96 rows=8123 width=0)
                                            Index Cond: (((symbol)::text = 'AAPL'::text) AND (datetime >= '2015-07-01 07:30:00-06'::timestamp with time zone) AND (datetime <= '2015-07-01 14:00:00-06'::timestamp with time zone))
  • Yes, this looks good, but `rank` is an `integer` so you should use `rank = 1`. In the `EXPLAIN` output you see that your final cost is 6423.35..6687.34. If you work your way down you see that most cost is incurred on `Bitmap Heap Scan on "4" (cost=359.99..5063.74 ...`; i.e. selecting the rows that match the conditions. Putting an index on `(symbol, datetime)` would make that a lot better (but inserting rows would be slower, so weigh your options). Out of curiosity, can you post the `EXPLAIN` of your original query? – Patrick Jul 10 '15 at 00:54
  • @Patrick The `rank() OVER (PARTITION BY bucket ORDER BY datetime DESC) = 1` was being put in as a boolean; if I dropped the `= 1` it gave me the rank. I realize now the `= 1` was there because of the WHERE statement, and I have modified it accordingly - thanks :). When I do EXPLAIN on my function I only get "Function Scan on emasnapshotsold (cost=0.25..10.25 rows=1000 width=16)" - not sure how to make it more verbose? My primary key is (symbol, datetime), so I thought an index would automatically be created for that? – Ben Hernandez Jul 10 '15 at 06:11

1 Answer


First, a few notes about minor efficiency issues with your function:

  • You do not have to quote_literal() anything other than strings. It is impossible for Bobby Tables to be injected into your SQL statement through a double precision or timestamp parameter.
  • In a dynamic SQL statement you only have to splice in table and column names manually; parameter values can be passed in with the USING clause, which saves a lot of parsing time (see the small sketch after this list).
  • Move as many calculations as possible outside of your loop. For instance:
DECLARE
  ...
  dur_int interval := duration * interval '1 second';
  localStart timestamptz := timestart;
  localEnd timestamptz := localStart + dur_int;
BEGIN
  ...
  FOR i in 1 .. groups LOOP
    ...
    localStart := localStart + dur_int;
    localEnd := localEnd + dur_int;
  END LOOP;
  ...
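
To illustrate the second point, here is a minimal sketch of the pattern (cnt stands in for a hypothetical bigint variable):

-- the table name is spliced in with format()'s %I; the value is passed in with USING
EXECUTE format('SELECT count(*) FROM %I WHERE symbol = $1', ptable)
INTO cnt
USING psymbol;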

But this is really all moot...

In your code you first populate a temp table with 7733 rows of data, from which you later extract one record at a time with a dynamic query in a loop that runs 390 times. All very, very wasteful. You can replace the entire function body by just a single statement:

RETURN QUERY EXECUTE format('SELECT datetime, ema '
   'FROM ('
     'SELECT datetime, ema, '
            'rank() OVER (PARTITION BY bucket ORDER BY datetime DESC) AS rank '
     'FROM ('
       'SELECT datetime, ema(value, $1) OVER (ORDER BY datetime ASC) AS ema, '
              'ceil(extract(epoch from (datetime - $2)) / $3) AS bucket '
       'FROM %I '
       'WHERE symbol = $4 '
         'AND datetime BETWEEN $2 AND $5) x) y '
   'WHERE rank = 1 '
   'ORDER BY 1', ptable) USING alpha, timestart, duration, psymbol, timeend;

The principle here is that in the innermost query you calculate the "bucket" into which each processed row from the table would fall. In the next level query you calculate the rank of all rows in each bucket based on the datetime. In the main query you then pick the most recent row from each bucket, i.e. the one where rank = 1.
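
To make the bucketing concrete, here is a tiny standalone sketch with two made-up timestamps: the first lands in bucket 1, the second in bucket 2, and within each bucket the most recent row gets rank 1.

SELECT ts,
       ceil(extract(epoch from (ts - timestamptz '2015-07-01 09:30-4')) / 60) AS bucket
FROM (VALUES (timestamptz '2015-07-01 09:30:30-4'),
             (timestamptz '2015-07-01 09:31:15-4')) AS s(ts);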

On speed: you should really run EXPLAIN on all queries on the server, rather than measuring on the client, which includes network transfer time.
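
For example, something along these lines, run in psql on the server (ANALYZE actually executes the query and reports real timings rather than estimates):

EXPLAIN (ANALYZE, BUFFERS)
SELECT datetime, ema(value, 0.0952380952380952) OVER (ORDER BY datetime ASC)
FROM "4"
WHERE symbol = 'AAPL'
  AND datetime BETWEEN '2015-07-01 7:30' AND '2015-07-01 14:00';

Note that EXPLAIN on a call to the function itself only shows a Function Scan; to see the plans of the statements inside the function you would need something like the auto_explain module.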

Patrick
  • I believe I have this right: I changed my return type to `record` and put your RETURN QUERY EXECUTE format statement between BEGIN and END (after some googling). When I run this now I get "ERROR: window functions are not allowed in WHERE SQL state: 42P20 Context: PL/pgSQL function emasnapshots(character varying,timestamp with time zone,timestamp with time zone,double precision,character varying,double precision) line 3 at RETURN QUERY". When I tried to run the query by hand outside of the function I got LINE 8: WHERE rank() OVER (PARTITION BY bucket ORDER BY datetime DE... – Ben Hernandez Jul 09 '15 at 19:47
  • BTW, thanks for explaining the buckets thing to me. I am starting to work on some other functions that I now realize were not written as efficiently as they could be. I got rid of the error about WHERE rank() by nesting three SELECT statements; I don't really know much SQL, though, so I have edited the bottom of my question with my new query and its EXPLAIN output, and I hope you don't mind helping one more time. I don't want to be the guy who just sits around waiting for someone to fix his problem. – Ben Hernandez Jul 09 '15 at 21:03
  • Ah, yes, my bad. A left-over from the way I craft complex queries, if that is any excuse. Answer corrected. – Patrick Jul 10 '15 at 00:39
  • When writing complex queries you work with "tables" (quotes intended). There are the actual tables, but every statement that returns records (like a `VIEW`, a `SELECT` statement or your `emasnapshots()` function) works just like any other table. In PostgreSQL all of the above are usually referred to as a "row source". You can keep combining these in complex ways. So if you have a complex issue, break it up into smaller problems and combine them: (1) get datetime, ema and bucket; (2) rank them; and (3) find datetime and ema for each bucket with the correct rank. Endless options. – Patrick Jul 10 '15 at 00:44