
My goal is to convert GTFS stop and trip information into a graph where the vertices are the stops (from GTFS's stops.txt) and edges are trips (from GTFS's stop_times.txt). The first steps are evident:

> library(igraph)

#Reading in GTFS files
> stops<-read.csv("stops.txt")
> stop_times<-read.csv("stop_times.txt")

My first instinct was simply to use the graph_from_data_frame function from igraph, but there is a serious drawback: the stop_times data frame isn't structured as an edge list. Its schema is the following:

>head(stop_times)
  trip_id stop_id arrival_time departure_time stop_sequence shape_dist_traveled
1 A895151  F04272     06:20:00       06:20:00            10                   0
2 A895151  F04184     06:22:00       06:22:00            20                 648
3 A895151  F04319     06:24:00       06:24:00            30                1224
4 A895151  F04369     06:27:00       06:27:00            40                2779
5 A895151  008264     06:31:00       06:31:00            50                5620
6 A895151  F01520     06:33:00       06:33:00            60                6691

which means that each row contains a stop_id with the arrival and departure times at that stop, while I'd like each row to hold start_stop_id, end_stop_id, start_time and end_time (that is, not "stops" but "transits" derived from consecutive stops). This conversion seems challenging to me, because I would have to iterate over the rows of stop_times and check whether consecutive rows belong to the same trip_id: if so, calculate the start-end data; if not, insert NULL or find some other way to separate trips. I find this very confusing.

Is there an elegant way to combine these two data frames into the desired graph?

Hendrik

1 Answer


The 'from' and 'to' columns can be generated by shifting the values of the following row up into the current row, within each trip. The stop information can then simply be joined on.

Let me explain with an example, using library(data.table).

## here I'm using Melbourne's GTFS ("http://transitfeeds.com/p/ptv/497/latest/download")

#dt_stop_times <- lst[[6]]$stop_times
#dt_stops <- lst[[7]]$stops

#setDT(dt_stop_times)
#setDT(dt_stops)


## join on whatever stop information you want
dt_stop_times <- dt_stop_times[ dt_stops, on = c("stop_id"), nomatch = 0]

## set the order of stops for each group (in this case, each group is a trip_id)
setorder(dt_stop_times, trip_id, stop_sequence)

## create a new column by shifting the stop_id of the following row up 
dt_stop_times[, stop_id_to := shift(stop_id, type = "lead"), by = .(trip_id)]

## you will have NAs at this point because the last stop doesn't go anywhere.

## you can do the same operation on multiple columns at the same time
dt_stop_times[, `:=`(stop_id_to = shift(stop_id, type = "lead"), 
                     arrival_time_stop_to = shift(arrival_time, type = "lead"),
                     departure_time_stop_to = shift(departure_time, type = "lead")),
              by = .(trip_id)]

## now you have your 'from' and 'to' columns from which you can make your igraph

## here's a subset of the result
dt_stop_times[, .(trip_id, stop_id, stop_name_from = stop_name, arrival_time, stop_id_to, arrival_time_stop_to)]

#                           trip_id stop_id                                                  stop_name_from arrival_time stop_id_to
# 1:          1.T0.3-86-A-mjp-1.7.R    4174                                    71-RMIT/Plenty Rd (Bundoora)     25:42:00       4485
# 2:          1.T0.3-86-A-mjp-1.7.R    4485                            70-Janefield Dr/Plenty Rd (Bundoora)     25:43:00       4486
# 3:          1.T0.3-86-A-mjp-1.7.R    4486                              69-Taunton Dr/Plenty Rd (Bundoora)     25:44:00       4487
# 4:          1.T0.3-86-A-mjp-1.7.R    4487                           68-Greenhills Rd/Plenty Rd (Bundoora)     25:45:00       4488
# 5:          1.T0.3-86-A-mjp-1.7.R    4488                      67-Bundoora Square SC/Plenty Rd (Bundoora)     25:46:00       4489
# ---                                                                                                                         
# 9415793: 9999.UQ.3-19-E-mjp-1.1.H   17871           7-Queen Victoria Market/Elizabeth St (Melbourne City)     23:25:00      17873
# 9415794: 9999.UQ.3-19-E-mjp-1.1.H   17873       5-Melbourne Central Station/Elizabeth St (Melbourne City)     23:27:00      17875
# 9415795: 9999.UQ.3-19-E-mjp-1.1.H   17875              3-Bourke Street Mall/Elizabeth St (Melbourne City)     23:30:00      17876
# 9415796: 9999.UQ.3-19-E-mjp-1.1.H   17876                      2-Collins St/Elizabeth St (Melbourne City)     23:31:00      17877
# 9415797: 9999.UQ.3-19-E-mjp-1.1.H   17877 1-Flinders Street Railway Station/Elizabeth St (Melbourne City)     23:32:00         NA
#          arrival_time_stop_to
# 1:                   25:43:00
# 2:                   25:44:00
# 3:                   25:45:00
# 4:                   25:46:00
# 5:                   25:47:00
# ---                     
# 9415793:             23:27:00
# 9415794:             23:30:00
# 9415795:             23:31:00
# 9415796:             23:32:00
# 9415797:                   NA
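
To tie this back to the layout asked for in the question (start stop, end stop, start time, end time), here is a minimal, self-contained sketch on toy data. The first trip reuses rows shown in the question; the second trip ("B000001") and the output column names (from, to, start_time, end_time) are made up for illustration.

```r
library(data.table)

# Toy stop_times: three stops of trip A895151 from the question, plus a
# hypothetical second trip to show that the shift never crosses trips
stop_times <- data.table(
  trip_id        = c("A895151", "A895151", "A895151", "B000001", "B000001"),
  stop_id        = c("F04272", "F04184", "F04319", "F04319", "F04369"),
  arrival_time   = c("06:20:00", "06:22:00", "06:24:00", "07:00:00", "07:03:00"),
  departure_time = c("06:20:00", "06:22:00", "06:24:00", "07:00:00", "07:03:00"),
  stop_sequence  = c(10, 20, 30, 10, 20)
)

# Order stops within each trip, then pull the next row's values up
setorder(stop_times, trip_id, stop_sequence)
stop_times[, `:=`(stop_id_to      = shift(stop_id, type = "lead"),
                  arrival_time_to = shift(arrival_time, type = "lead")),
           by = trip_id]

# Keep only real transits: the last stop of each trip has no 'to' stop
transits <- stop_times[!is.na(stop_id_to),
                       .(trip_id,
                         from       = stop_id,
                         to         = stop_id_to,
                         start_time = departure_time,
                         end_time   = arrival_time_to)]
transits
```

Each trip with n stops contributes n - 1 rows, so no manual loop over rows or NULL separators between trips is needed.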

Now, to use graph_from_data_frame{igraph} you just need to:

# get a df with nodes
nodes <- dt_stops[, .(stop_id, stop_lon, stop_lat)]

# links between stops; drop the last stop of each trip, whose stop_id_to is NA
links <- dt_stop_times[!is.na(stop_id_to), .(stop_id, stop_id_to, trip_id)]

# create graph
g <- graph_from_data_frame(links, directed = TRUE, vertices = nodes)

Keep in mind, however, that a GTFS.zip file may contain more than one transport mode (train, bus, subway, etc.) and that some pairs of stops have much higher connectivity than others because of variations in service frequency. It's not yet clear to me how these two points should be handled when building a graph from a GTFS.zip. Probably the way forward would be to weight each edge according to its frequency and to build a multilayered network in which the stops shared across transport modes are treated as an interdependent layer.
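
As a rough sketch of the frequency-weighting idea only (not the multilayer part), parallel edges can be collapsed into one edge per stop pair, with the number of trips as its weight. The links table below is hypothetical toy data in the same shape as the links table built earlier; whether a plain trip count is the right weight for a given analysis is an assumption.

```r
library(data.table)
library(igraph)

# Hypothetical links table: one row per transit between consecutive stops
links <- data.table(
  stop_id    = c("A", "A", "A", "B", "B"),
  stop_id_to = c("B", "B", "B", "C", "C"),
  trip_id    = c("t1", "t2", "t3", "t1", "t2")
)

# One row per (from, to) pair; weight = number of trips serving that pair
weighted_links <- links[, .(weight = .N), by = .(stop_id, stop_id_to)]

# Columns after the first two become edge attributes, so 'weight' is kept
g <- graph_from_data_frame(weighted_links, directed = TRUE)
E(g)$weight
```

Many igraph algorithms pick up an edge attribute named weight automatically, so it may need to be inverted (for example 1 / weight) if it is meant to act as a cost in shortest-path calculations.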

rafa.pereira
SymbolixAU