0

I'm trying to find the number of revenue miles/kilometers of a "Route" by: day, month, and year; by querying a GTFS database that has the structure described here:

https://developers.google.com/transit/gtfs/reference

And see here for a very clear sketch of the structure:

http://blog.openplans.org/2012/08/the-openplans-guide-to-gtfs-data/

"Revenue distance traveled" definition:

("Available for passengers to use" distance)

The number of miles/kilometers traveled from the first actual bus stop where a passenger can board, to the last drop-off at the last bus stop, for that particular route and bus run. (then aggregated together for all service runs taken by all buses for that particular route)

-

"Revenue hours" definition:

("Available for passengers to use" time span)

The number of hours from the moment the vehicle arrives at the first bus stop, until the moment it drops off its last passenger at the last bus stop. (then aggregated together for all service runs taken by all buses for that particular route)

I'm using SQL Server/MSSQL. Though SQL Lite, or MySQL, or any SQL examples would be perfectly fine.

Basically, I need to be able to SELECT a route, and then correlate the data in the routes, calendar_dates, calendar, stop-times, stops, and trips tables to find how many miles/kilometers were covered from the first stop (stop_times and stops tables) to the last, how many hours elapsed, and find this for a particular service_id (in trips and calendar tables), and then also for all service_ids for a particular route, and be able to get all this for a particular date (in calendar_dates table), or spans of dates (day, month, 3-month period, year, etc).

If a few different queries are needed, that's fine. The revenue distance traveled per route, and the revenue hours per route, can be separate queries.

Has anyone who has done this before be willing to share their query structure for this or has anyone figured this out? Are there any examples of how to write this query? I've been looking everywhere online for weeks.

Here is a diagram image of the database I have created with all relationships shown in detail:

GTFS Database Diagram

FirstFraktal
  • 358
  • 4
  • 6
  • Have you tried any queries yourself? The query to join routes, trips, stops and calendar information is not complex. This should give you a good starting point. – tospig Jun 28 '15 at 23:03
  • 1
    [Something like this](http://stackoverflow.com/a/21213873/4002530) to get you started? – tospig Jun 28 '15 at 23:08
  • I've been trying, but I'm pretty new to this. So I don't yet have anything that works. I can join the tables, but finding and putting together the miles from the first in-service stop, to the last in-service stop for a particular route, and then combining all runs on that route for a particular time span, is a bit beyond me. Even a partial example would be appreciated. – FirstFraktal Jun 28 '15 at 23:09
  • I've been looking at that post you linked to for a few days (and all others on SO), but haven't yet been able to adapt it to my needs... – FirstFraktal Jun 28 '15 at 23:11
  • Do you have the optional shape data for the routes you want to calculate revenue for? Without shape data the only way to measure distance would be to use something like the haversine formula to calculate distances based on lat/long and that won't be exact as public transit routes often does not travel in straight lines. – jpw Jun 28 '15 at 23:11
  • 1
    @jpw the `stop_times` table has a `shape_dist_traveled` field that gives the distance between stops – tospig Jun 28 '15 at 23:12
  • @tospig Indeed, but it's an optional field and sometime not included in the source data (it wasn't included in the gtfs feed I used when I messed around with this some years ago). – jpw Jun 28 '15 at 23:15
  • Couldn't the `shape_dist_traveled` in `stop_times` be used instead of the `shapes` table? I do have a working haversine query, but, as you mentioned, it only gives straight line distances, which isn't useful in this case. Is a GTFS database not capable of yielding this information without added tables and data? – FirstFraktal Jun 28 '15 at 23:16
  • Ah, you beat me to it. I do indeed have `shape_dist_traveled` in `stop_times` – FirstFraktal Jun 28 '15 at 23:17
  • Yes, if you have that piece of data then you can use it. – jpw Jun 28 '15 at 23:17
  • I know it's a lot to ask, but even a partial example SQL query of how to do this would be helpful. If I can come up with even a non-working one that I think is even close, I'll post it here. – FirstFraktal Jun 28 '15 at 23:20

2 Answers2

1

I have done exactly this for scheduled kms, by:

  1. Loading GTFS into a DB via GTFS SQL importer and PostGIS
  2. Making the shape table spatial
  3. Calculate distance for each shape
  4. Aggregate as below (see note on service id).

select t.route_id as id, r.route_short_name as route, sum(l.shape_dist/1000) as sched_kms 
from gtfs_shape_lengths l

inner join gtfs_trips t on t.shape_id = l.shape_id
inner join gtfs_routes r on r.route_id = t.route_id
inner join gtfs_calendar c on t.service_id = c.service_id

where c.service_id ilike '%sat%'

group by t.route_id, r.route_short_name

union all

select 'total' as id, 'total_' as name,
sum(l.shape_dist/1000) as sched_kms

from gtfs_shape_lengths l

inner join gtfs_trips t on t.shape_id = l.shape_id
inner join gtfs_calendar c on t.service_id = c.service_id

where c.service_id ilike '%sat%'

order by sched_kms desc

Original writeup here: http://transitdata.net/using-gtfs-and-postgis-to-calculate-levels-of-scheduled-service/

Tony Laidig
  • 1,048
  • 2
  • 11
  • 33
  • Thank you! I am trying to adapt this to the Microsoft SQL Server implementation I have to use. I'll mark as the answer if I can get it working. Because the client I'm working with has yet to gather all data in a GPS datalogged format to create the paths. In the short term, I may need to do this query in a more simple way by having them just enter the distance per route as an plain integer, and do the spatial path distance entry and calculations later. SQL Server 2014 (that I'm using) does have Geo data types, but I'm so new to this, any more pointers would be helpful. – FirstFraktal Jun 29 '15 at 17:13
  • 1
    This will not work if a particular trip does not follow the entire corresponding shape. This is is moderately common with larger agencies (e.g. short-turn trips or other irregular stopping patterns). I'm not sure if this usage is consistent with the GTFS spec but I've seen it several places. – abeboparebop Jun 30 '15 at 18:27
  • Question: Can this be calculated from the `shape_dist_traveled` column in `stop_times`, or is there some reason it must be `shapes`? – FirstFraktal Jun 30 '15 at 21:55
  • Also: Is the main reason to make the data spatial, because of the curvature of the earth (which would elongate the paths and give different/longer distance result), or is there another reason? E.g. *could* the query be done by "simply" adding up the distances between stops in `stop_times`->`shape_dist_traveled`? (with the only drawback being slightly inaccurate shortening of distances?) – FirstFraktal Jun 30 '15 at 22:01
  • 1
    This relies on shapes that match the pattern of the trip. Speaking from my own (many US large agencies) experience, large scheduling packages (HASTUS, Trapeze, etc) are very good at making shapes that match their understanding of a trip pattern, sometimes to the point of generating too many irrelevant shapes, and sometimes erroneously due to topological errors. If you're working with manually generated GTFS w/o shapes, this is not the direct solution for you. You can naively build shapes with stops, arouting engine (eg PgRouting) and a streets dataset, but I have never done so for a GTFS. – Tony Laidig Jul 01 '15 at 16:47
  • Alright thanks. That is helpful. In this case the client isn't needing to generate shapes for a feed/visual map, but mainly wanting to aggregate the mileage and hours per route (for their internal reports), so a (possibly) more simple solution of just inputting the distance between stops from a GPS unit's data-log (loading data from a GPX or KML file into the database), and adding that up per route might meet their needs. But as you have so much experience with this, do you see any red flags/problems with that approach? Do you think that will that cause issues later? Am I missing something? – FirstFraktal Jul 01 '15 at 17:06
  • just saw your message now. I don't see a fault with that. Hope you had good luck with tit. – Tony Laidig Jul 24 '15 at 17:46
  • Alright thank you. I did realize later that I need to do this on the "trips" table, not "routes" table like in my diagram. I have edited the diagram to show the changes. I added a "trip_total_distance" and a "trip_service_distance" (one can be subtracted from the other to get "revenue distance") to "trips" to help with agencies that don't yet have the per-stop distances or shapes, and for internal Federal reporting purposes (though I know that won't work for a GTFS feed—it's temporary). – FirstFraktal Jul 29 '15 at 18:14
1

Alright, I have come up with the following for getting the Service Hours. In my example, the arrival_time and departure_time columns in the stop_times table are an integer datatype with the number data stored representing "minutes since midnight" (e.g. "29 hours and 45 minutes since midnight" would be "1785 minutes"... Midnight being measured from noon on the service day, minus 12 hours—as the spec requires. Which is also the best way to do it). Also note: I added the column trip_date to the trips table because I'm using this GTFS database for operational/internal Federal reporting uses, and not just for service feeds to the public; so it's necessary to know the individual trip date (and I don't want to make an entry for each day in calendar_dates for this purpose, as some agencies do). This example is for MSSQL/SQL Server:

-- FIRST/LAST TRIPS OF THE DAY AND SPAN OF SERVICE

SELECT

    joinedTables.service_id                  AS 'Service Number',
    joinedTables.trip_date                   AS 'Date',
    joinedTables.route_id                    AS 'Route',

    MIN ( joinedTables.starting_departure )  AS 'First Departure in Minutes',
    MAX ( joinedTables.ending_arrival )      AS 'Last Departure in Minutes',

    -- Decimal hours of minutes integers.
    CAST (
                 (
                       (
                             MAX (ending_arrival) - MIN (starting_departure)
                       ) / 60.00
                 ) AS DECIMAL (9, 2)
         )                                 AS 'Service Hours'


FROM
    (
        SELECT
            SelectedTripsColumns.service_id,
            SelectedTripsColumns.trip_id,
            SelectedTripsColumns.route_id,
            SelectedTripsColumns.trip_date,
            MIN (departure_time) AS starting_departure,
            MAX (arrival_time) AS ending_arrival

        FROM
            stop_times AS stopTimesTable

        JOIN (
                 SELECT
                     service_id,
                     trip_id,
                     route_id,
                     trip_date
                 FROM
                     trips
             ) AS SelectedTripsColumns

        ON stopTimesTable.trip_id = SelectedTripsColumns.trip_id


        JOIN routes

        ON SelectedTripsColumns.route_id = routes.route_id


        GROUP BY
            SelectedTripsColumns.service_id,
            SelectedTripsColumns.trip_id,
            SelectedTripsColumns.route_id,
            SelectedTripsColumns.trip_date

    ) AS joinedTables

-- WHERE trip_date = '2015-07-27'

GROUP BY
    service_id,
    route_id,
    trip_date

ORDER BY
    service_id,
    route_id,
    trip_date;
FirstFraktal
  • 358
  • 4
  • 6