I have a large GIS dataset (167x25e6) that was generated from GeoJSON and converted via .csv to Parquet. This is the first time I've really had to deal with out-of-memory dataframes, and I'm still trying to find out whether Polars is the right option for my task, so if you know a solution that uses a different library, I'd be just as glad to know about that.
Lazily loading the file is no problem, but I need to convert certain values, and that's where I'm stuck. Since the data comes from GeoJSON, I have for example a column 'geometry.coordinates' that contains stringified lists of coordinate pairs: '[[122.9491889477, 24.4571703672], [122.946780324, 24.4541877508]]'.
For the example column above, I need to apply a function that returns the average longitude and average latitude and writes these into a new column.
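On a small sample, this Polars expression does conceptually what I want; it's only a sketch, assuming a Polars version that has str.json_decode (older releases call it str.json_extract), and 'data.parquet' is a placeholder path:

```python
import polars as pl

# Placeholder path; in my case this is the large Parquet file.
lf = pl.scan_parquet("data.parquet")

# Parse the stringified coordinate lists into List(List(Float64)),
# then average the first (longitude) and second (latitude) entry
# of each inner pair. str.json_decode is an assumption about the
# installed Polars version; older releases call it str.json_extract.
coords = pl.col("geometry.coordinates").str.json_decode(
    pl.List(pl.List(pl.Float64))
)

result = (
    lf.with_columns(
        coords.list.eval(pl.element().list.get(0)).list.mean().alias("avg_lon"),
        coords.list.eval(pl.element().list.get(1)).list.mean().alias("avg_lat"),
    )
    .head(5)  # only a small sample, to test the expressions in memory
    .collect()
)
print(result)
```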
I tried casting the column to a pl.List datatype, and I also tried to use apply with a lambda that does a JSON string load:

df_pl.select(pl.col('geometry.coordinates')).with_column(pl.col('geometry.coordinates').cast(pl.List)).collect()

df_pl.select(pl.col('geometry.coordinates')).with_column(pl.col('geometry.coordinates').apply(lambda x: json.loads(x))).collect()
Unfortunately, the first one throws a NotYetImplementedError: Casting from LargeUtf8 to LargeList not supported. The second crashes the Python kernel immediately, since apply does not work out-of-memory.
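Since I'm open to other libraries: one fallback I can think of is reading the Parquet file batch-wise with pyarrow and doing the JSON parsing in plain Python, so only one chunk is in memory at a time. A rough sketch, where the paths, the batch size, and the assumption that every row holds a non-empty list are all mine:

```python
import json
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder paths.
src = pq.ParquetFile("data.parquet")
writer = None

# Stream the file in batches so only one chunk is in memory at a time.
for batch in src.iter_batches(columns=["geometry.coordinates"], batch_size=100_000):
    avg_lon, avg_lat = [], []
    for s in batch.column(0).to_pylist():
        pairs = json.loads(s)  # assumes every row holds a non-empty list
        avg_lon.append(sum(p[0] for p in pairs) / len(pairs))
        avg_lat.append(sum(p[1] for p in pairs) / len(pairs))
    out = pa.RecordBatch.from_pydict(
        {
            "avg_lon": pa.array(avg_lon, pa.float64()),
            "avg_lat": pa.array(avg_lat, pa.float64()),
        }
    )
    if writer is None:
        writer = pq.ParquetWriter("averages.parquet", out.schema)
    writer.write_table(pa.Table.from_batches([out]))

if writer is not None:
    writer.close()
```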
Another thing I tried is sink_parquet instead of collect, so that I could stream the results to disk and later merge them with the data I actually need, but this throws a PanicException: sink_parquet not yet supported in standard engine. Use 'collect().write_parquet()'.
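For completeness, this is roughly the streaming pipeline I was hoping for, under the assumption that the streaming engine supports all of these expressions (I don't know whether the JSON decoding is supported; the error above suggests my query falls back to the standard engine):

```python
import polars as pl

coords = pl.col("geometry.coordinates").str.json_decode(
    pl.List(pl.List(pl.Float64))
)

(
    pl.scan_parquet("data.parquet")  # placeholder path
    .with_columns(
        coords.list.eval(pl.element().list.get(0)).list.mean().alias("avg_lon"),
        coords.list.eval(pl.element().list.get(1)).list.mean().alias("avg_lat"),
    )
    # sink_parquet only works when every operation in the plan is
    # supported by the streaming engine; a Python apply/lambda forces
    # the standard engine and triggers the PanicException above.
    .sink_parquet("averages.parquet")  # placeholder path
)
```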
I'd appreciate your help!