
I am working on implementing the Connection Scan Algorithm in Python because I need access to shortest public transport paths. So I am trying to create a connection table from GTFS files.

I have a dataframe (stop_times) that contains the following columns:

    trip_id    arrival_time  departure_time  stop_sequence  stop_id
0   id1        06:02:00      06:02:00        0              stop_id1
1   id1        06:05:00      06:05:00        1              stop_id2
2   id1        06:06:00      06:06:00        2              stop_id3
3   id1        06:08:00      06:08:00        3              stop_id4

The original file is much longer and contains the data of many trips which are defined by their trip_id.

I want to save some of the values contained in that first dataframe in a second one that would list the connections between stations and basically have four columns:

    departure_station  arrival_station  departure_time   arrival_time   

My goal is to extract values from the stop_times dataframe and insert them in the right rows of the empty one I created. However, I encounter problems with that and I have been stuck for quite a while now.


I need to iterate over the stop_times dataframe two rows at a time, with each new iteration starting at the previous iteration's second row. The first iteration would be on indexes 0-1, the second on 1-2, the third on 2-3, etc.

For now I was only able to make the iterations on rows 0-1, 2-3, etc. with the following code, but that is not what I am trying to do here.

for i, g in course.groupby(np.arange(len(course)) // 2):

Any idea how I could manage that?
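To make the pattern concrete, here is a small sketch of the overlapping pairing I mean, zipping the `course` dataframe (defined in the code further down) against a one-row offset of itself; `iterrows` is slow on big frames, so this is just to illustrate the idea, not what I'd want to keep:

# pair each row with the following one: rows (0, 1), (1, 2), (2, 3), ...
for (_, current), (_, nxt) in zip(course.iterrows(), course.iloc[1:].iterrows()):
    print(current['departure_time'], nxt['arrival_time'])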


Now let's consider the first iteration on rows 0-1: I need to append a first row to the empty dataframe with:

  • the departure_time of the stop_times first row
  • the arrival_time of the stop_times second row
  • the stop_sequence of the stop_times first row (corresponding to the departure_station column)
  • the stop_sequence of the stop_times second row (corresponding to the arrival_station column)

That would give me the following:

    departure_station  arrival_station  departure_time   arrival_time
0   0                  1                06:02:00         06:05:00

And then repeat that for the rest of the dataframe:

    departure_station  arrival_station  departure_time   arrival_time
0   0                  1                06:02:00         06:05:00
1   1                  2                06:05:00         06:06:00
2   2                  3                06:06:00         06:08:00

This is what I tried so far:

import numpy as np
import pandas as pd

stop_time = pd.read_csv('/Users/im/Downloads/IDFM_gtfs/stop_times.txt')
stop_time = stop_time[:30]

course = stop_time.loc[stop_time['trip_id'] == 'id1']

# the empty dataframe described above (shown here for completeness)
table_horaire = pd.DataFrame(columns=['arrival_time', 'departure_time', 'arrival_station', 'departure_station'])

for i, g in course.groupby(np.arange(len(course)) // 2):
    connexion = g.reset_index()
    connexion = connexion[['trip_id', 'arrival_time', 'departure_time', 'stop_id', 'stop_sequence']]

    dep_hor = connexion.loc[connexion.index == 0, ['departure_time']]
    arriv_hor = connexion.loc[connexion.index == 1, ['arrival_time']]

    table_horaire = table_horaire.append(dep_hor)
    table_horaire = table_horaire.append(arriv_hor)

Which gives me the following dataframe:

    arrival_time    departure_time  arrival_station  departure_station
0   NaN             06:02:00        NaN              NaN
1   06:05:00        NaN             NaN              NaN
0   NaN             06:06:00        NaN              NaN
1   06:08:00        NaN             NaN              NaN
0   NaN             06:10:00        NaN              NaN
1   06:12:00        NaN             NaN              NaN
0   NaN             06:14:00        NaN              NaN
1   06:16:00        NaN             NaN              NaN

Any help would be greatly appreciated, and please do tell me if some parts are not explained well; I am still quite new to programming and don't know all the right terms yet.

I.M.
  • Iterating over a big dataframe is computationally expensive. Have you tried to use a vectorized approach instead? If you need to append new columns from your first dataframe to the second, have you tried using `map`, since you have ids as a unique identifier? – Joe Aug 27 '19 at 13:20
  • Hi, is `stop_id` a synonym for station in your data? – jottbe Aug 27 '19 at 15:48
  • @Joe Thank you for your input. I thought about the computational cost and was planning to find a way around it. I had never heard of a vectorized approach, but I'll definitely look into it. – I.M. Aug 27 '19 at 17:28
  • @jottbe stop_id is the identifier of the station. Here I simplified the dataframe for easier reading, but each station has an id that allows me to get its name afterward. – I.M. Aug 27 '19 at 17:30

1 Answer


If I got your question right, you don't need groupby at all and can use a combination of shift(1) and concat to get what you want:

import io
import numpy as np
import pandas as pd

# make sure the dataframe is sorted by trip_id and arrival_time;
# please choose what is better according to your data: arrival_time
# or stop_sequence (in case your public transport goes near the
# speed of light :-)
df.sort_values(['trip_id', 'arrival_time'], inplace=True)

# shift the columns we need for the departure part
# by one row and rename them
df_departure = df[['trip_id', 'stop_id', 'departure_time']].shift(1)
df_departure.columns = ['departure_trip_id', 'departure_station', 'departure_time']

# create a subset of the dataframe with the arrival columns
df_arrival = df[['trip_id', 'arrival_time', 'stop_id']].copy()
df_arrival.columns = ['trip_id', 'arrival_time', 'arrival_station']

# concat both together
df_combined = pd.concat([df_departure, df_arrival], axis='columns')

# now take care of the rows at the beginning of each group
# of rows that belong to the same trip_id and delete the
# departure values of these rows, since they belong to another
# trip
df_combined.loc[df_combined['trip_id'] != df_combined['departure_trip_id'], ['departure_station', 'departure_time']] = (np.nan, np.nan)
df_combined.drop(['departure_trip_id'], axis='columns', inplace=True)
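As an aside, an equivalent way to get the NaN values at each trip's first row is a grouped shift; just a sketch of the same idea, in case you prefer to skip the departure_trip_id comparison:

# sketch: shifting within each trip_id puts NaN in the first row of
# every trip automatically
shifted = df.groupby('trip_id')[['stop_id', 'departure_time']].shift(1)
shifted.columns = ['departure_station', 'departure_time']
df_alt = pd.concat(
    [shifted,
     df[['trip_id', 'arrival_time', 'stop_id']].rename(columns={'stop_id': 'arrival_station'})],
    axis='columns')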

With the following test data:

raw="""    trip_id    arrival_time  departure_time  stop_sequence  stop_id
0   id1        06:02:00      06:02:30        0              stop_id1
1   id1        06:05:00      06:05:30        1              stop_id2
2   id1        06:06:00      06:06:30        2              stop_id3
3   id1        06:08:00      06:08:30        3              stop_id4
4   id2        06:12:00      06:12:30        4              stop_id5
5   id2        06:15:00      06:15:30        5              stop_id6
6   id2        06:16:00      06:16:30        6              stop_id7
7   id2        06:18:00      06:18:30        7              stop_id8
"""
df = pd.read_csv(io.StringIO(raw), index_col=0, sep=r'\s+')

The code above outputs:

Out[65]: 
  departure_station departure_time trip_id arrival_time arrival_station
0               NaN            NaN     id1     06:02:00        stop_id1
1          stop_id1       06:02:30     id1     06:05:00        stop_id2
2          stop_id2       06:05:30     id1     06:06:00        stop_id3
3          stop_id3       06:06:30     id1     06:08:00        stop_id4
4               NaN            NaN     id2     06:12:00        stop_id5
5          stop_id5       06:12:30     id2     06:15:00        stop_id6
6          stop_id6       06:15:30     id2     06:16:00        stop_id7
7          stop_id7       06:16:30     id2     06:18:00        stop_id8

If stop_id is not a synonym for station, you can just do a merge (or map) to translate it just before you do the shift.
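For example, a minimal sketch, assuming the station names live in a GTFS stops.txt with stop_id and stop_name columns (adjust the path and column names to your feed):

# sketch: translate stop_id to a human-readable name before the shift,
# assuming a stops.txt with 'stop_id' and 'stop_name' columns
stops = pd.read_csv('stops.txt')
df['stop_name'] = df['stop_id'].map(stops.set_index('stop_id')['stop_name'])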

Hope that's what you were searching for.

jottbe
  • Thank you very much, this is a very effective way of getting the connection table. I have a question: I am working with a very large dataframe (and I want to take transfers between stations into account), which will result in a huge connection table. Wouldn't it be computationally heavy to create both df_departure and df_arrival to get df_combined? – I.M. Aug 27 '19 at 17:45
  • I can't tell with certainty, but by intuition I would say it should be efficient, because `shift` is probably only a simple copy operation, and it is surely much more efficient than a `groupby`. The most cost-intensive operation is probably the `sort`, but if you maintain your data in that order anyway, you can skip it. Maybe the `concat` also costs more (not sure about that). If so and you want to optimize, you can check if `ignore_index=True` helps for the `concat` (but only if you don't need the index values). – jottbe Aug 28 '19 at 06:32
  • So to sum it up, I expect that the `shift` will only have a bigger impact in case memory consumption is an issue; otherwise it is probably not the big influence, as it should have linear complexity (runtime should increase linearly with the number of records), while operations like `sort` are non-linear. For `concat` I'm not sure, but most likely it will also be no problem (it should also be linear in the way it is performed here). – jottbe Aug 28 '19 at 06:35