I have a large GIS dataset (167x25e6) that was generated from GeoJSON and converted via .csv to Parquet. This is the first time I've really had to deal with out-of-memory dataframes, and I'm still trying to find out whether Polars is the right option for my task, so if you know a solution that uses a different library, I'd be just as glad to know about that.
Lazily loading the file is no problem, but I need to convert certain values, and that's where I'm stuck. Since the data comes from GeoJSON, I have for example a column 'geometry.coordinates' that contains stringified lists of coordinate pairs: '[[122.9491889477, 24.4571703672], [122.946780324, 24.4541877508]]'.
For the example column above, I need to apply a function that returns the average longitude and average latitude and writes these into a new column.
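On a small sample, this Polars expression does conceptually what I want; it's only a sketch, assuming a Polars version that has str.json_decode (older releases call it str.json_extract), and 'data.parquet' is a placeholder path:

```python
import polars as pl

# Placeholder path; in my case this is the large Parquet file.
lf = pl.scan_parquet("data.parquet")

# Parse the stringified coordinate lists into List(List(Float64)),
# then average the first (longitude) and second (latitude) entry
# of each inner pair. str.json_decode is an assumption about the
# installed Polars version; older releases call it str.json_extract.
coords = pl.col("geometry.coordinates").str.json_decode(
    pl.List(pl.List(pl.Float64))
)

result = (
    lf.with_columns(
        coords.list.eval(pl.element().list.get(0)).list.mean().alias("avg_lon"),
        coords.list.eval(pl.element().list.get(1)).list.mean().alias("avg_lat"),
    )
    .head(5)  # only a small sample, to test the expressions in memory
    .collect()
)
print(result)
```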
I tried casting the column to a pl.List datatype, and I also tried to use apply with a lambda that does a JSON string load:

df_pl.select(pl.col('geometry.coordinates')).with_column(pl.col('geometry.coordinates').cast(pl.List)).collect()

df_pl.select(pl.col('geometry.coordinates')).with_column(pl.col('geometry.coordinates').apply(lambda x: json.loads(x))).collect()
Unfortunately, the first one throws a NotYetImplementedError: Casting from LargeUtf8 to LargeList not supported. The second crashes the Python kernel immediately, since apply does not work out-of-memory.
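Since I'm open to other libraries: one fallback I can think of is reading the Parquet file batch-wise with pyarrow and doing the JSON parsing in plain Python, so only one chunk is in memory at a time. A rough sketch, where the paths, the batch size, and the assumption that every row holds a non-empty list are all mine:

```python
import json
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder paths.
src = pq.ParquetFile("data.parquet")
writer = None

# Stream the file in batches so only one chunk is in memory at a time.
for batch in src.iter_batches(columns=["geometry.coordinates"], batch_size=100_000):
    avg_lon, avg_lat = [], []
    for s in batch.column(0).to_pylist():
        pairs = json.loads(s)  # assumes every row holds a non-empty list
        avg_lon.append(sum(p[0] for p in pairs) / len(pairs))
        avg_lat.append(sum(p[1] for p in pairs) / len(pairs))
    out = pa.RecordBatch.from_pydict(
        {
            "avg_lon": pa.array(avg_lon, pa.float64()),
            "avg_lat": pa.array(avg_lat, pa.float64()),
        }
    )
    if writer is None:
        writer = pq.ParquetWriter("averages.parquet", out.schema)
    writer.write_table(pa.Table.from_batches([out]))

if writer is not None:
    writer.close()
```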
Another thing I tried is sink_parquet instead of collect, so that I could stream the results to disk and later merge them with the data I actually need, but this throws a PanicException: sink_parquet not yet supported in standard engine. Use 'collect().write_parquet()'.
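For completeness, this is roughly the streaming pipeline I was hoping for, under the assumption that the streaming engine supports all of these expressions (I don't know whether the JSON decoding is supported; the error above suggests my query falls back to the standard engine):

```python
import polars as pl

coords = pl.col("geometry.coordinates").str.json_decode(
    pl.List(pl.List(pl.Float64))
)

(
    pl.scan_parquet("data.parquet")  # placeholder path
    .with_columns(
        coords.list.eval(pl.element().list.get(0)).list.mean().alias("avg_lon"),
        coords.list.eval(pl.element().list.get(1)).list.mean().alias("avg_lat"),
    )
    # sink_parquet only works when every operation in the plan is
    # supported by the streaming engine; a Python apply/lambda forces
    # the standard engine and triggers the PanicException above.
    .sink_parquet("averages.parquet")  # placeholder path
)
```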
I'd appreciate your help!