
I'm trying to convert a protobuf feed to a pandas DataFrame for one of my hobby projects. I have tried several different techniques, but none of them solves my issue.

I use the following code to retrieve a GTFS-RT TripUpdates feed:

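import requests
from google.transit import gtfs_realtime_pb2  # assumes the gtfs-realtime-bindings package
from protobuf_to_dict import protobuf_to_dict  # assumes the protobuf-to-dict package

feed = gtfs_realtime_pb2.FeedMessage()
headers = {
    'Accept': 'application/octet-stream',
    'Accept-encoding': 'br, gzip, deflate'
}

response = requests.get('<PROVIDER:APIKEY>', headers=headers, stream=True)

# Parse the binary protobuf payload, then convert the message to a plain dict
feed.ParseFromString(response.content)
test_dict = protobuf_to_dict(feed)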

The result of protobuf_to_dict is a nested dict that prints as a single line:

{'header': {'gtfs_realtime_version': '2.0', 'incrementality': 0, 'timestamp': 1641582104}, 'entity': [{'id': '14050001276385923' [...]

I've tried several things to get around this issue.

Reading the feed message as JSON: this did not work, because the JSON object must be str, bytes or bytearray, not dict.

Iterating through dict:

for entity in test_dict.entity:
    if entity.HasField('vehicle'):
        [logic for building dataframe]

It didn't work either, because 'dict' object has no attribute 'entity'.
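For reference, the attribute error arises because protobuf_to_dict returns plain Python dicts and lists, which are read with key lookups rather than attribute access, and which have no HasField method. A minimal sketch, using a small hypothetical stand-in for the converted feed (the ids and values below are made up for illustration):

```python
# Hypothetical stand-in for the dict returned by protobuf_to_dict(feed).
test_dict = {
    'header': {'gtfs_realtime_version': '2.0', 'timestamp': 1641582104},
    'entity': [
        {'id': 'e1', 'trip_update': {'trip': {'trip_id': 'T1'}}},
        {'id': 'e2', 'vehicle': {'trip': {'trip_id': 'T2'}}},
    ],
}

trip_ids = []
for entity in test_dict['entity']:   # key lookup instead of test_dict.entity
    if 'trip_update' in entity:      # membership test instead of HasField()
        trip_ids.append(entity['trip_update']['trip']['trip_id'])

print(trip_ids)
```

Only the first entity carries a trip update, so only its trip_id is collected.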

OK! After several hours of reading, I tried to flatten and normalize the feed message as described here and in some other threads. Unfortunately, neither json_normalize nor flatten_json solved the issue.

At this point I feel like I'm going in circles and missing something obvious that might help me. The end goal is to create a DataFrame containing the TripUpdates data, which will later be merged with another DataFrame to update arrival and departure times.
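One likely reason generic flattening struggles here is that each entity holds a list of stop time updates, so a single entity has to become several rows. A rough sketch of doing that by hand on the dict form, assuming the snake_case field names from the GTFS-RT schema; the helper names `flatten` and `stop_time_rows` and the sample values are made up for illustration:

```python
def flatten(record, prefix=''):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f'{prefix}{key}'
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f'{name}.'))
        else:
            flat[name] = value  # scalars and lists are kept as-is
    return flat


def stop_time_rows(feed_dict):
    """Build one flat row per stop time update in the converted feed."""
    rows = []
    for entity in feed_dict.get('entity', []):
        trip_update = entity.get('trip_update')
        if trip_update is None:
            continue  # entity carries no trip update
        trip = flatten(trip_update.get('trip', {}), prefix='trip.')
        for stu in trip_update.get('stop_time_update', []):
            rows.append({'entity_id': entity['id'], **trip, **flatten(stu)})
    return rows


# Hypothetical sample in the shape produced by protobuf_to_dict
sample = {
    'header': {'gtfs_realtime_version': '2.0'},
    'entity': [
        {'id': 'e1', 'trip_update': {
            'trip': {'trip_id': 'T1'},
            'stop_time_update': [
                {'stop_id': 'S1', 'arrival': {'delay': 60}},
                {'stop_id': 'S2', 'departure': {'delay': 0}},
            ]}},
        {'id': 'e2', 'vehicle': {'trip': {'trip_id': 'T2'}}},
    ],
}
rows = stop_time_rows(sample)
```

Passing `rows` to `pd.DataFrame(rows)` should then give one row per stop time update, with missing keys filled as NaN.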

user07345
  • Currently trying to use [this](https://stackoverflow.com/questions/63587225/how-to-deal-with-nested-json-in-python-and-pandas) answer to solve my issue. – user07345 Jan 07 '22 at 22:39
  • After reading about Google protobuf I think I found the solution: `MessageToJson(feed)` seems to solve a lot of issues. Going to update the question once I have found a fully working solution. – user07345 Jan 08 '22 at 18:33

2 Answers


The issue can be solved by iterating through the feed message with simple for loops:

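import requests
from google.transit import gtfs_realtime_pb2  # assumes the gtfs-realtime-bindings package

feed = gtfs_realtime_pb2.FeedMessage()
headers = {
    'Accept': 'application/octet-stream',
    'Accept-encoding': 'br, gzip, deflate'
}

response = requests.get('<PROVIDER:APIKEY>', headers=headers, stream=True)

feed.ParseFromString(response.content)

rows = []
for entity in feed.entity:
    # Skip entities that carry no trip update
    if entity.HasField('trip_update'):
        # Values in the feed message are accessed as attributes
        if entity.trip_update.trip.trip_id == something:  # `something`: the trip_id you are looking for
            # Add whichever fields you need to the list
            rows.append({'trip_id': entity.trip_update.trip.trip_id})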

The list can then be converted to a pandas DataFrame.
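As a rough sketch of what the collecting step could look like when the goal is arrival and departure times: the function below builds one dict per stop time update. The field names come from the GTFS-RT schema (`stop_time_update`, `stop_id`, `stop_sequence`, `arrival.time`, `departure.time`); the function name `trip_update_rows` and the choice of columns are assumptions made for illustration, not part of the original answer.

```python
def trip_update_rows(feed):
    """Return one dict per stop time update in a parsed FeedMessage."""
    rows = []
    for entity in feed.entity:
        if not entity.HasField('trip_update'):
            continue  # e.g. vehicle positions or alerts
        trip = entity.trip_update.trip
        for stu in entity.trip_update.stop_time_update:
            rows.append({
                'trip_id': trip.trip_id,
                'stop_id': stu.stop_id,
                'stop_sequence': stu.stop_sequence,
                # arrival and departure are optional sub-messages
                'arrival_time': stu.arrival.time if stu.HasField('arrival') else None,
                'departure_time': stu.departure.time if stu.HasField('departure') else None,
            })
    return rows
```

With the parsed `feed` from above, `pd.DataFrame(trip_update_rows(feed))` should give a table keyed by trip_id and stop_id that can be merged with the static schedule.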

user07345

An easy approach is to first create a Python class, let's say PyFeed, as the plain-Python counterpart of the protobuf feed message. Then you can use something like the following:

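import pandas as pd

class PyFeed:
    """Plain-Python counterpart of one feed entity (fields chosen for illustration)."""
    def __init__(self, entity):
        self.entity_id = entity.id
        self.trip_id = entity.trip_update.trip.trip_id

def feed_to_dataframe(feed):
    my_py_feed_list = []
    for entity in feed.entity:
        if entity.HasField('trip_update'):  # replace with your own condition
            my_py_feed_list.append(PyFeed(entity))
    # use the built-in __dict__ of each instance as one row
    return pd.DataFrame([py_feed.__dict__ for py_feed in my_py_feed_list])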
everGreen