The answers above are excellent, but here's something a little different. The Awkward Array library (note: I'm the author) is meant for working with nested data structures like this at large scale. As a coincidence, I used a GeoJSON file as a motivating example in the documentation, though I'm working on a few more tutorials that take larger Parquet files as example data, unrelated to geography.
(That's how this is different from @kamal-barshevich's geopandas answer: geopandas is a domain-specific library that "knows about" geography and will likely have functionality relevant to domain experts in that field. Awkward Array is a generic library for manipulating data structures that doesn't know anything about geography.)
The documentation I've linked above has some examples of manipulating a GeoJSON file with the array functions themselves, without Pandas, starting from here:
>>> import urllib.request
>>> import awkward as ak
>>>
>>> url = "https://raw.githubusercontent.com/Chicago/osd-bike-routes/master/data/Bikeroutes.geojson"
>>> bikeroutes_json = urllib.request.urlopen(url).read()
>>> bikeroutes = ak.from_json(bikeroutes_json)
>>> bikeroutes
<Record ... [-87.7, 42], [-87.7, 42]]]}}]} type='{"type": string, "crs": {"type"...'>
But in this answer, I'll make the Pandas structure that you want. The ak.to_pandas function turns nested lists into a MultiIndex. Applying it to just the "coordinates"
inside "geometry"
inside "features"
:
>>> bikeroutes.features.geometry.coordinates
<Array [[[[-87.8, 41.9], ... [-87.7, 42]]]] type='1061 * var * var * var * float64'>
>>>
>>> ak.to_pandas(bikeroutes.features.geometry.coordinates)
values
entry subentry subsubentry subsubsubentry
0 0 0 0 -87.788573
1 41.923652
1 0 -87.788646
1 41.923651
2 0 -87.788845
... ...
1060 0 8 1 41.950493
9 0 -87.714819
1 41.950724
10 0 -87.715284
1 41.951042
[96724 rows x 1 columns]
The list nesting is three levels deep, the last of which are longitude, latitude pairs (e.g. [-87.788573, 41.923652]
). You want these in separate columns:
>>> bikeroutes.features.geometry.coordinates[..., 0]
<Array [[[-87.8, -87.8, ... -87.7, -87.7]]] type='1061 * var * var * float64'>
>>> bikeroutes.features.geometry.coordinates[..., 1]
<Array [[[41.9, 41.9, 41.9, ... 42, 42, 42]]] type='1061 * var * var * float64'>
This is using a slice like NumPy's (Awkward Array is a generalization of NumPy), taking everything in the all dimensions except the last (...
); the first expression extracts items 0
(longitude) and the second extracts items 1
(latitude).
We can combine these in a new record type using ak.zip to give them column names:
>>> ak.to_pandas(ak.zip({
... "longitude": bikeroutes.features.geometry.coordinates[..., 0],
... "latitude": bikeroutes.features.geometry.coordinates[..., 1],
... }))
longitude latitude
entry subentry subsubentry
0 0 0 -87.788573 41.923652
1 -87.788646 41.923651
2 -87.788845 41.923650
3 -87.788951 41.923649
4 -87.789092 41.923648
... ... ...
1060 0 6 -87.714026 41.950199
7 -87.714335 41.950388
8 -87.714486 41.950493
9 -87.714819 41.950724
10 -87.715284 41.951042
[48362 rows x 2 columns]
and this is pretty close to what you're looking for. The one last thing you wanted was to match each of these with one of the "properties"
from "features"
. My GeoJSON file doesn't have "MARKET"
:
>>> bikeroutes.features.properties.type
1061 * {"STREET": string, "TYPE": string, "BIKEROUTE": string, "F_STREET": string, "T_STREET": option[string]}
but "STREET"
may be a good stand-in. These properties are at a different level of nesting than the coordinates:
>>> bikeroutes.features.geometry.coordinates[..., 0].type
1061 * var * var * float64
>>> bikeroutes.features.properties.STREET.type
1061 * string
The longitude points are two levels of nested lists deeper than the street names, but ak.zip broadcasts them down (similar to NumPy's concept of broadcasting, with necessary extensions for variable-length lists).
The final expression is:
>>> ak.to_pandas(ak.zip({
... "longitude": bikeroutes.features.geometry.coordinates[..., 0],
... "latitude": bikeroutes.features.geometry.coordinates[..., 1],
... "street": bikeroutes.features.properties.STREET,
... }))
longitude latitude street
entry subentry subsubentry
0 0 0 -87.788573 41.923652 W FULLERTON AVE
1 -87.788646 41.923651 W FULLERTON AVE
2 -87.788845 41.923650 W FULLERTON AVE
3 -87.788951 41.923649 W FULLERTON AVE
4 -87.789092 41.923648 W FULLERTON AVE
... ... ... ...
1060 0 6 -87.714026 41.950199 N ELSTON AVE
7 -87.714335 41.950388 N ELSTON AVE
8 -87.714486 41.950493 N ELSTON AVE
9 -87.714819 41.950724 N ELSTON AVE
10 -87.715284 41.951042 N ELSTON AVE
[48362 rows x 3 columns]
Since you just want to relate market with longitude, latitude points, you can ignore the MultiIndex, or you can use Pandas functions to turn the components of that MultiIndex into columns.