1

I am searching for a way to get the amount of active trips in a GTFS feed for each minute of a day through a postgresql query.

For every trip I have the start and endtimes (in seconds) in a denormalized table. It looks something like this:

denormalized_trips table

The query to give me trips that are active for a given timeframe (e.g. here 43000 to 43600 seconds) looks like this:

SELECT
COUNT(trips.trip_id)
FROM denormalized_trips AS trips

LEFT JOIN gtfs_calendar_dates AS calendar_dates
ON calendar_dates.service_id = trips.service_id
AND calendar_dates.agency_key = trips.agency_key
AND date = '2017-07-03'
AND exception_type = 1

INNER JOIN gtfs_calendar AS calendar
  ON trips.service_id = calendar.service_id
  AND calendar.agency_key = trips.agency_key
  AND calendar.wednesday = 1

WHERE (
  trip_start_time BETWEEN 46800 AND 47100
  AND '2017-07-03' BETWEEN calendar.start_date AND calendar.end_date
)
AND NOT EXISTS (
  SELECT 0
  FROM gtfs_calendar_dates AS date_exceptions
  WHERE date = '2017-07-03'
  AND date_exceptions.agency_key = trips.agency_key
  AND date_exceptions.service_id = calendar.service_id
  AND exception_type = 2
);

This will result in 12 trips that will start between 13:00pm and 13:05pm.

Now I want to do that for the whole day. I want to get the amount of trips that get active in an interval of e.g. 1 minute or maybe more like 5 minutes. I tried it with a loop but this seems to just give me 1 result back. Here is what I came up with for now:

CREATE OR REPLACE FUNCTION get_active_trips(n int)
RETURNS INTEGER AS
$BODY$
DECLARE 
 count INTEGER;
BEGIN 
  FOR counter IN 43130..50000 BY 60 LOOP
    SELECT
    COUNT(trips.trip_id)
    INTO count
    FROM denormalized_trips AS trips

    LEFT JOIN gtfs_calendar_dates AS calendar_dates
    ON calendar_dates.service_id = trips.service_id
    AND calendar_dates.agency_key = trips.agency_key
    AND date = '2017-07-03'
    AND exception_type = 1

    INNER JOIN gtfs_calendar AS calendar
      ON trips.service_id = calendar.service_id
      AND calendar.agency_key = trips.agency_key
      AND calendar.wednesday = 1

    WHERE (
      trip_start_time BETWEEN counter AND counter + 60
      AND '2017-07-03' BETWEEN calendar.start_date AND calendar.end_date
    )
    AND NOT EXISTS (
      SELECT 0
      FROM gtfs_calendar_dates AS date_exceptions
      WHERE date = '2017-07-03'
      AND date_exceptions.agency_key = trips.agency_key
      AND date_exceptions.service_id = calendar.service_id
      AND exception_type = 2
    );
  END LOOP;
  RETURN count;
END; 
$BODY$ LANGUAGE plpgsql STABLE;

The result of calling SELECT get_active_trips(1);

sum of entries

Now I would like to get something like a table or an array of results back instead of just 1 entry. How would I do that?

Any help is highly appreciated.

Andi-lo
  • 2,244
  • 21
  • 26
  • Instead of `RETURNS INTEGER` use `RETURNS SETOF INTEGER`; instead of `RETURN count;` use `RETURN NEXT count;` just before `END LOOP;` – Abelisto Jul 05 '17 at 13:16
  • This returns just a single entry with a 0 as value. I tried to setup a sqlfiddle but the fiddle is not working and throws errors (while locally it runs without problems) http://sqlfiddle.com/#!17/4f4ba/9 – Andi-lo Jul 05 '17 at 14:07
  • 1
    You are totally missed with semicolons in your function. Also you placed `return next count;` **after** `end loop;`, not **before** as I said. [There is working edition](http://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=984f8407990ef6492104e96e20c0d7de). – Abelisto Jul 05 '17 at 14:18
  • PS: And there is example how to make output more verbose: [dbfiddle](http://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=acd97e168aa24133d7d8473b44e3f7a9) – Abelisto Jul 05 '17 at 14:26
  • Ouh sorry for that inconvenience. Testing both answers seemed to messed things up. I could integrate your solution and it is working, thanks! If you provide the answer I will accept it. – Andi-lo Jul 05 '17 at 15:26

2 Answers2

2

There are two syntaxes to create function returning set of values: returns setof <type> and returns table(<columns definition>). Documentation.

There are also several ways to return those values from the plpgsql function: return next or return query. Documentation.

So if you want just get the series of integers you could to rewrite your function in that way:

CREATE OR REPLACE FUNCTION get_active_trips(n int)
RETURNS SETOF INTEGER AS
$BODY$
DECLARE 
 count INTEGER;
BEGIN 
  FOR counter IN 43130..50000 BY 60 LOOP
    SELECT
    COUNT(trips.trip_id)
    INTO count
    FROM denormalized_trips AS trips
    <rest of query here>
    ;
    RETURN NEXT count;
  END LOOP;
  RETURN;
END; 
$BODY$ LANGUAGE plpgsql STABLE;

or, using RETURN QUERY:

CREATE OR REPLACE FUNCTION get_active_trips(n int)
RETURNS SETOF INTEGER AS
$BODY$
BEGIN 
  FOR counter IN 43130..50000 BY 60 LOOP
    RETURN QUERY
      SELECT
      COUNT(trips.trip_id)
      FROM denormalized_trips AS trips
      <rest of query here>
      ;
  END LOOP;
  RETURN;
END; 
$BODY$ LANGUAGE plpgsql STABLE;

In case if you want to return more then one column:

CREATE OR REPLACE FUNCTION get_active_trips(n int)
RETURNS TABLE (counter_value int, active_trips_count int) AS
$BODY$
BEGIN 
  FOR counter IN 43130..50000 BY 60 LOOP
    SELECT
    COUNT(trips.trip_id)
    INTO active_trips_count
    FROM denormalized_trips AS trips
    <rest of query here>
    ;
    counter_value := counter;
    RETURN NEXT; -- There is no parameters, current values of counter_value and active_trips_count will be returned
  END LOOP;
  RETURN;
END; 
$BODY$ LANGUAGE plpgsql STABLE;

or, using RETURN QUERY:

CREATE OR REPLACE FUNCTION get_active_trips(n int)
RETURNS TABLE (counter_value int, active_trips_count int) AS
$BODY$
BEGIN 
  FOR counter IN 43130..50000 BY 60 LOOP
    RETURN QUERY
      SELECT
        counter,
        COUNT(trips.trip_id)
      FROM denormalized_trips AS trips
      <rest of query here>
      ;
  END LOOP;
  RETURN;
END; 
$BODY$ LANGUAGE plpgsql STABLE;

Finally there is alternative declaration for returns table:

CREATE OR REPLACE FUNCTION get_active_trips(
  n int,
  out counter_value int, 
  out active_trips_count int)
RETURNS SETOF RECORD AS

Update

But(!) I feeling that it is possible to simplify your task using single query, without loop.

Consider the following query (i'v used your simplified query from sqlfiddle):

select
  counter,
  count(trips.trip_id)
from
  generate_series(43130, 50000, 60) as counter left join
    denormalized_trips as trips on (trip_start_time between counter and counter + 60) 
group by counter 
order by counter;
Abelisto
  • 14,826
  • 2
  • 33
  • 41
  • Excellent write up. Are there any performance differences in the different examples (just if you know of). – Andi-lo Jul 05 '17 at 17:13
  • 1
    @Andi-lo No any differences as I know. However look at the updated answer. – Abelisto Jul 05 '17 at 17:38
  • 1
    @Andi-lo BTW there is the bug in my examples in dbfiddle. Instead of `trip_start_time BETWEEN counter_value AND counter_value + 60` should be `trip_start_time BETWEEN counter AND counter + 60` – Abelisto Jul 05 '17 at 17:41
  • your updated answer is executing A LOT faster! Really nice because when running it against the whole day the query before was like really really slow. – Andi-lo Jul 06 '17 at 08:47
  • @Andi-lo Glad to see that I made your day a little better :) Good luck. – Abelisto Jul 06 '17 at 08:55
  • @Andi-lo And PS: I think that it should be `trip_start_time between counter and counter + 59` to avoid data doubling at the edges. – Abelisto Jul 06 '17 at 09:07
0

Create type like below

drop type if exists get_active_trips_out;
create type get_active_trips_out as
(
   Count int
);

use that table type while returning the data from the function like below

CREATE OR REPLACE FUNCTION get_active_trips(n int)
RETURNS setof get_active_trips_out AS
$BODY$
DECLARE 
 count INTEGER;
r get_active_trips_out%rowtype;
BEGIN 
 create temp table tmp_count
(
  Count int
)on commit drop;
  FOR counter IN 43130..50000 BY 60 LOOP
    Insert into tmp_count
    SELECT
    COUNT(trips.trip_id)
    FROM denormalized_trips AS trips

    LEFT JOIN gtfs_calendar_dates AS calendar_dates
    ON calendar_dates.service_id = trips.service_id
    AND calendar_dates.agency_key = trips.agency_key
    AND date = '2017-07-03'
    AND exception_type = 1

    INNER JOIN gtfs_calendar AS calendar
      ON trips.service_id = calendar.service_id
      AND calendar.agency_key = trips.agency_key
      AND calendar.wednesday = 1

    WHERE (
      trip_start_time BETWEEN counter AND counter + 60
      AND '2017-07-03' BETWEEN calendar.start_date AND calendar.end_date
    )
    AND NOT EXISTS (
      SELECT 0
      FROM gtfs_calendar_dates AS date_exceptions
      WHERE date = '2017-07-03'
      AND date_exceptions.agency_key = trips.agency_key
      AND date_exceptions.service_id = calendar.service_id
      AND exception_type = 2
    );
  END LOOP;
 for r in 
 Select * from tmp_count
 loop
 return next r
 end loop;
END; 
  • Is there any reason to create composite type with only one field? Is there any reason to create temporary table (and do not drop it BTW after using) instead of returning values inside the main loop? Your answer is excessively complex and also contains the bug: calling this function several times in one session will causes the error like `table "tmp_count" already exists` – Abelisto Jul 05 '17 at 13:24
  • I edited the query. Added on commit drop which will not throw "tmp_count" already exists error. To return the results in the tabular format, I used type. It would be helpful if you wanted to return any other columns – Anuraag Veerapaneni Jul 05 '17 at 13:30
  • Sadly your edit turns into a syntax error at "commit drop": ERROR: syntax error at or near "(" LINE 9: ( ^ – Andi-lo Jul 05 '17 at 13:37
  • Seems like you can't create a table in a function? ERROR: CREATE TABLE is not allowed in a non-volatile function CONTEXT: SQL statement "create temp table tmp_count ( Count int )on commit drop" PL/pgSQL function get_active_trips(integer) line 6 at SQL statement – Andi-lo Jul 05 '17 at 14:13