I am working on implementing the Connexion Scan Algorithm in python because I need to have access to shortests public transport paths. So I am trying to create a connexion table from gtfs files.
I have a dataframe (stop_times) that contains the following columns:
trip_id arrival_time departure_time stop_sequence stop_id
0 id1 06:02:00 06:02:00 0 stop_id1
1 id1 06:05:00 06:05:00 1 stop_id2
2 id1 06:06:00 06:06:00 2 stop_id3
3 id1 06:08:00 06:08:00 3 stop_id4
The original file is much longer and contains the data of many trips which are defined by their trip_id.
I want to save some of the values contained in that first dataframe in a second one that would list the connexion between stations and basically have four columns:
departure_station arrival_station departure_time arrival_time
My goal is to extract values from the stop_times dataframe and insert them in the right rows in the empty one I created. I, however, encounter problems with that andI have been stuck for quite a while now.
I need to iterate over the stop_times dataframe 2 "rows at a time" and starting the new iteration at the previous row. The first iteration would be on indexes 0-1, the second on 1-2, the third on 2-3 etc.
For now I was only able to make the iterations on rows 0-1, 2-3 etc. with the following code but it is not what I am trying to do here.
for i, g in course.groupby(np.arange(len(course)) // 2):
Any idea how I could manage that?
Now let's consider the first iteration on rows 0-1: I need to append the empty dataframe first row with:
- the departure_time of the stop_times first row
- the arrival_time of the stop_times second row
- the stop_sequence of the stop_times first row (corresponding to the departure_station column)
- the stop_sequence of the stop_times second row (corresponding to the arrival_station column)
That would give me the following:
departure_station arrival_station departure_time arrival_time
0 0 1 06:02:00 06:05:00
And then repeat that for the rest of the dataframe:
departure_station arrival_station departure_time arrival_time
0 0 1 06:02:00 06:05:00
1 1 2 06:05:00 06:06:00
2 2 3 06:06:00 06:08:00
This is what I tried so far:
stop_time = pd.read_csv('/Users/im/Downloads/IDFM_gtfs/stop_times.txt')
stop_time = stop_time[:30]
course = stop_time.loc[stop_time['trip_id'] == 'id1']
for i, g in course.groupby(np.arange(len(course)) // 2):
connexion = g.reset_index()
connexion = connexion[['trip_id', 'arrival_time', 'departure_time', 'stop_id', 'stop_sequence']]
dep_hor = connexion.loc[connexion.index == 0, ['departure_time']]
arriv_hor = connexion.loc[connexion.index == 1, ['arrival_time']]
table_horaire = table_horaire.append(dep_hor)
table_horaire = table_horaire.append(arriv_hor)
Which gives me the following dataframe:
arrival_time departure_time arrival_station departure_station
0 NaN 06:02:00 NaN NaN
1 06:05:00 NaN NaN NaN
0 NaN 06:06:00 NaN NaN
1 06:08:00 NaN NaN NaN
0 NaN 06:10:00 NaN NaN
1 06:12:00 NaN NaN NaN
0 NaN 06:14:00 NaN NaN
1 06:16:00 NaN NaN NaN
Any help would be greatly appreciated and please do tell me if some parts are not explained well, I am still quite new at programming and don't know all the right terms yet.