
I have a large GIS dataset (167 × 25e6) that originated as GeoJSON and was converted via .csv to Parquet. This is the first time I really have to deal with out-of-memory dataframes, and I am still trying to find out whether Polars is the right option for my task, so if you know a solution that uses a different library, I am just as glad to hear about it.

Lazily loading the file is no problem, but I need to convert certain values, and that's where I am stuck. Since the data comes from GeoJSON, I have, for example, a column 'geometry.coordinates' that contains stringified lists of coordinate pairs: '[[122.9491889477, 24.4571703672], [122.946780324, 24.4541877508]]'.

For the column above, I need to apply a function that returns the average longitude and the average latitude and writes these into a new column.
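Here is a minimal sample that reproduces the layout of that column (values taken from the example above), in case it helps to sketch a workflow:

import polars as pl

# Tiny in-memory stand-in for the real 167 x 25e6 Parquet file.
df_pl = pl.DataFrame({
    'geometry.coordinates': [
        '[[122.9491889477, 24.4571703672], [122.946780324, 24.4541877508]]',
    ]
}).lazy()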

I tried casting the column to the pl.List datatype, and I also tried using apply with a lambda that does a JSON string load:

df_pl.select(pl.col('geometry.coordinates')).with_column(pl.col('geometry.coordinates').cast(pl.List)).collect()

df_pl.select(pl.col('geometry.coordinates')).with_column(pl.col('geometry.coordinates').apply(lambda x: json.loads(x))).collect()

Unfortunately, the first one throws NotYetImplementedError: Casting from LargeUtf8 to LargeList not supported. The second one immediately crashes the Python kernel, since apply doesn't work out-of-memory.
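For what it's worth, here is a sketch of an expression-only alternative to apply that should stay out of Python, assuming a Polars version where strings can be JSON-decoded via str.json_decode (called str.json_extract in older releases) and where the list namespace is .list rather than .arr; I am not sure it runs in streaming mode:

avg = (
    df_pl
    .with_columns([
        pl.col('geometry.coordinates')
        .str.json_decode(pl.List(pl.List(pl.Float64)))  # parse the stringified pairs
        .alias('coords')
    ])
    .with_columns([
        # first element of each pair = longitude, second = latitude
        pl.col('coords').list.eval(pl.element().list.get(0)).list.mean().alias('avg_lon'),
        pl.col('coords').list.eval(pl.element().list.get(1)).list.mean().alias('avg_lat'),
    ])
    .drop('coords')
    .collect()
)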

Another thing I tried was sink_parquet instead of collect, so that I could stream the results to disk and later merge them with the data I actually need, but this throws PanicException: sink_parquet not yet supported in standard engine. Use 'collect().write_parquet()'.
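For completeness, this is roughly the call that panics for me (the output path is illustrative); as I understand it, newer Polars releases support sink_parquet for more of the streaming engine, so this may be version-dependent:

(
    df_pl
    .select(pl.col('geometry.coordinates'))
    .sink_parquet('coords.parquet')  # raises PanicException on my version
)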

I'd appreciate your help!

  • What did you use to convert the dataset from .csv to .geojson and then to .parquet? It shouldn't have saved the geometry as a string, and that's why you're having issues. That being said, polars doesn't support geometry (GIS) operations, so you'll have to look to geopandas (a minimal sketch follows these comments). There is a geopolars project, but it's in the prototype phase, so it is probably lacking the features you need at the moment. – Dean MacGregor Jan 11 '23 at 12:01
  • Please include a sample of your data, so that we can show you possible workflows, even if we don't approach the memory limits. – mdurant Jan 11 '23 at 19:30
  • `but this throws a PanicException: sink_parquet not yet supported in standard engine. Use 'collect().write_parquet()'` - Exactly, it happens when you use any filter/predicate or joins. – Niladri Jan 31 '23 at 17:00
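A minimal sketch of the geopandas route suggested in the first comment, assuming the geometries are LineStrings and that the original GeoJSON is still available (the file name is hypothetical); note that geopandas reads everything into memory, so this sidesteps rather than solves the out-of-core constraint:

import geopandas as gpd
import numpy as np

gdf = gpd.read_file('data.geojson')  # hypothetical path; loads the full dataset
# Average the raw vertex coordinates of each geometry.
gdf['avg_lon'] = gdf.geometry.apply(lambda g: np.mean([p[0] for p in g.coords]))
gdf['avg_lat'] = gdf.geometry.apply(lambda g: np.mean([p[1] for p in g.coords]))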

0 Answers