
I am currently parsing historic delay data from a public transport network in Sweden. I have ~5,700 files (one every 15 seconds) from the 27th of January, containing momentary delay data for vehicles on active trips in the network. Unfortunately, there is a lot of overhead / duplicate data, so I want to parse out only the relevant parts to do visualizations on it.

However, when I try to parse and filter out the relevant delay data on a trip level using the script below, it performs really slowly. It has been running for over 1.5 hours now (on my 2019 MacBook Pro 15") and isn't finished yet.

  • How can I optimize / improve this Python parser?
  • Or should I reduce the number of files, i.e. lower the frequency of the data collection, for this task?

Thank you so much in advance.

from google.transit import gtfs_realtime_pb2
import gzip
import os
import datetime
import csv
import numpy as np

directory = '../data/tripu/27/'
datapoints = np.zeros((0,3), int)
read_trips = set()

# Loop through all files in directory
for filename in os.listdir(directory)[::3]:

    try:
        # Uncompress and parse protobuff-file using gtfs_realtime_pb2
        with gzip.open(directory + filename, 'rb') as file:
            response = file.read()
            feed = gtfs_realtime_pb2.FeedMessage()
            feed.ParseFromString(response)

            print("Filename: " + filename, "Total entities: " + str(len(feed.entity)))

            for trip in feed.entity:
                if trip.trip_update.trip.trip_id not in read_trips:

                    try:
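                        # stopsOnTrip (built elsewhere in the script, not shown here) is expected to map trip_id -> the stops on that trip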
                        if len(trip.trip_update.stop_time_update) == len(stopsOnTrip[trip.trip_update.trip.trip_id]):
                            print("\t","Adding delays for",len(trip.trip_update.stop_time_update),"stops, on trip_id",trip.trip_update.trip.trip_id)

                            for i, stop_time_update in enumerate(trip.trip_update.stop_time_update[:-1]):

                                # Store the delay data point (arrival difference of two ascending nodes)
                                delay = int(trip.trip_update.stop_time_update[i+1].arrival.delay-trip.trip_update.stop_time_update[i].arrival.delay)

                                # Store contextual metadata (timestamp and edgeID) for the unique delay data point
                                ts = int(trip.trip_update.stop_time_update[i+1].arrival.time)
                                key = int(str(trip.trip_update.stop_time_update[i].stop_id) + str(trip.trip_update.stop_time_update[i+1].stop_id))

                                # Append data to numpy array
                                datapoints = np.append(datapoints, np.array([[key,ts,delay]]), axis=0)

                            read_trips.add(trip.trip_update.trip.trip_id)
                    except KeyError:
                        continue
                else:
                    continue
    except OSError:
        continue
– eriknson
    Difficult to tell, I strongly suspect that much of the time is being spent in `ParseFromString`, but there is no way of knowing just from this code. Also, `readTrips` is never updated, so your `"if ... not in readTrips:"` code isn't helping any. (Might also want to make `read_trips` a set instead of a list for more optimal searching, but I'm 99-44/100% sure this is not where your performance bottleneck is.) For better responses, post a small sample data file, and the code for `ParseFromString`. Plus actual profiling would be good too. – PaulMcG Mar 10 '20 at 12:50
  • @PaulMcG Thank you for this response! My bad, I now added read_trips as a set; edited above. It seems like the script now reads in data really fast in the beginning and then slows down a lot. Is this a clue to anything specific? ParseFromString is from [google.transit](https://developers.google.com/transit/gtfs-realtime/examples/python-sample). I'll upload some data as well. – eriknson Mar 10 '20 at 15:13
  • What is the total size of data, and what is the available memory? – Serge Ballesta Mar 10 '20 at 15:35
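
Following up on PaulMcG's profiling suggestion in the comments: the standard library's cProfile can show whether the time really goes into ParseFromString or into np.append. A minimal sketch, assuming the parsing loop above is wrapped in a (hypothetical) parse_sample() function that only reads a few dozen of the files:

import cProfile
import pstats

def parse_sample():
    # run the parsing loop from the question over e.g. the first 50 files
    ...

cProfile.run('parse_sample()', 'parse_stats')
pstats.Stats('parse_stats').sort_stats('cumulative').print_stats(15)

Sorting by cumulative time puts the dominant call at the top of the listing.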

2 Answers


I suspect the problem here is repeatedly calling np.append to add a new row to a numpy array. Because the size of a numpy array is fixed when it is created, np.append() must create a new array, which means that it has to copy the previous array. On each loop, the array is bigger and so all these copies add a quadratic factor to your execution time. This becomes significant when the array is quite big (which apparently it is in your application).

As an alternative, you could just create an ordinary Python list of tuples, and then if necessary convert that to a complete numpy array at the end.

That is (only the modified lines):

datapoints = []
# ...
                            datapoints.append((key,ts,delay))
# ...
npdata = np.array(datapoints, dtype=int)
– rici
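
To illustrate rici's point about the quadratic copying, here is a small self-contained comparison (the row count is a made-up placeholder, not taken from the question's data) that times repeated np.append against appending to a plain list and converting once at the end:

import time
import numpy as np

n = 20000  # hypothetical number of rows, for illustration only

# Repeated np.append: every call allocates a new array and copies the old one
start = time.perf_counter()
arr = np.zeros((0, 3), int)
for i in range(n):
    arr = np.append(arr, np.array([[i, i, i]]), axis=0)
print("np.append loop:   ", time.perf_counter() - start, "s")

# Plain list, converted to an array once at the end
start = time.perf_counter()
rows = []
for i in range(n):
    rows.append((i, i, i))
arr2 = np.array(rows, dtype=int)
print("list + np.array():", time.perf_counter() - start, "s")

The first version copies the entire array on every iteration, which is O(n²) work in total; the list version is O(n).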

I still think the parse routine is your bottleneck (even if it did come from Google), but all those '.'s were killing me! (And they do slow down performance somewhat.) Also, I converted your i, i+1 indexing to two iterators zipping through the list of updates; this is a slightly more advanced style of working through a list. Plus the cur_update/next_update names helped me keep straight when you wanted to reference one vs. the other. Finally, I removed the trailing "else: continue", since you are at the end of the for loop anyway.

for trip in feed.entity:
    this_trip_update = trip.trip_update 
    this_trip_id = this_trip_update.trip.trip_id
    if this_trip_id not in read_trips:

        try:
            if len(this_trip_update.stop_time_update) == len(stopsOnTrip[this_trip_id]):
                print("\t", "Adding delays for", len(this_trip_update.stop_time_update), "stops, on trip_id",
                      this_trip_id)

                # create two iterators to walk through the list of updates
                cur_updates = iter(this_trip_update.stop_time_update)
                nxt_updates = iter(this_trip_update.stop_time_update)
                # advance the nxt_updates iter so it is one ahead of cur_updates
                next(nxt_updates)

                for cur_update, next_update in zip(cur_updates, nxt_updates):
                    # Store the delay data point (arrival difference of two ascending nodes)
                    delay = int(next_update.arrival.delay - cur_update.arrival.delay)

                    # Store contextual metadata (timestamp and edgeID) for the unique delay data point
                    ts = int(next_update.arrival.time)
                    key = "{}/{}".format(cur_update.stop_id, next_update.stop_id)

                    # Append data to numpy array
                    datapoints = np.append(datapoints, np.array([[key, ts, delay]]), axis=0)

                read_trips.add(this_trip_id)
        except KeyError:
            continue

This code should be equivalent to what you posted, and I don't really expect major performance gains either, but perhaps this will be more maintainable when you come back to look at it in 6 months.

(This probably is more appropriate for CodeReview, but I hardly ever go there.)

– PaulMcG
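
The two-iterator pairing used in this answer is the classic pairwise recipe from the itertools documentation; in isolation it looks like the minimal sketch below (the stop ids are made-up placeholders). On Python 3.10+ the same thing is available directly as itertools.pairwise.

from itertools import tee

def pairwise(iterable):
    # s -> (s0, s1), (s1, s2), (s2, s3), ...
    a, b = tee(iterable)
    next(b, None)  # advance the second iterator by one element
    return zip(a, b)

stops = ["9021001", "9021002", "9021003"]  # made-up stop ids
for cur_stop, next_stop in pairwise(stops):
    print(cur_stop, "->", next_stop)
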
  • The original data is evidently in a Google Protobuf, and the parse routine is part of the protobuf implementation. It should be well optimised, although it's not clear to me whether the application is using the native Python parser or a bridge to a C++ parser (which would be somewhat faster, but only linearly). – rici Mar 10 '20 at 16:26
  • Thank you a lot @PaulMcG , I'll change my code to this tomorrow morning, I like it way better. Have a nice day! – eriknson Mar 10 '20 at 19:31
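
On rici's point about whether the pure-Python or a native protobuf parser is in use: one way to check is to ask the library itself. This relies on an internal module of the protobuf package, so treat it as a best-effort sketch:

from google.protobuf.internal import api_implementation

# 'python' means the pure-Python parser; 'cpp' or 'upb' means a native one
print(api_implementation.Type())

If a native implementation is installed, the environment variable PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp can be set before starting Python to select it.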