I created a Dask dataframe from geopandas futures, each of which yields a pandas dataframe, following the example here: https://gist.github.com/mrocklin/e7b7b3a65f2835cda813096332ec73ca

daskdf = dd.from_delayed(lazy_dataframes, meta=lazy_dataframes[0].compute())

All dtypes seem reasonable:

daskdf.dtypes
left          float64
bottom        float64
right         float64
top           float64
score          object
label          object
height        float64
area          float64
geometry     geometry
shp_path       object
geo_index      object
Year            int64
Site           object
dtype: object

but a Dask groupby operation fails:

daskdf.groupby(['Site']).height.mean().compute()
...
"/Users/ben/miniconda3/envs/crowns/lib/python3.7/site-packages/dask/dataframe/utils.py", line 577, in _nonempty_series
    data = np.array([entry, entry], dtype=dtype)
builtins.TypeError: data type not understood

whereas pandas has no problem with the same operation on the same data:

daskdf.compute().groupby(['Site']).height.mean()
Site
SOAP    15.102355
Name: height, dtype: float64

What might be happening with the metadata types to cause this? As I scale my workflow, I would like to perform distributed operations on persisted data.

bw4sz

1 Answer

The problem is the 'geometry' dtype, which comes from geopandas. My pandas dataframe came from loading a shapefile with geopandas.read_file(). Future users beware: drop this column when creating a Dask dataframe. I know there was a dask-geopandas attempt some time ago. This was hard to track down because the statement

daskdf.groupby(['Site']).height.mean().compute()

does not involve the geometry column. Judging by the traceback, Dask checks the dtypes of all columns in the metadata, not just the ones used in an operation. Be careful!
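To illustrate the failing line from the traceback, here is a minimal sketch of the same NumPy call with a dtype object that NumPy does not recognize. FakeGeometryDtype is a hypothetical stand-in for the geopandas dtype, not the real class, and this reproduces only the np.array call, not the rest of Dask's code path.

```python
import numpy as np


class FakeGeometryDtype:
    """Hypothetical stand-in for a dtype object that NumPy cannot interpret."""


entry = None  # placeholder value, as in np.array([entry, entry], dtype=dtype)

try:
    np.array([entry, entry], dtype=FakeGeometryDtype())
    raised = False
except TypeError as exc:
    # The exact wording of the message depends on the NumPy version.
    raised = True
    print(type(exc).__name__, exc)
```

If this is what happens internally, it would explain why the error appears even when the geometry column is never referenced in the groupby.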

Dropping the geometry column yields the expected result. Note that drop returns a new dataframe, so the result has to be assigned:

daskdf = daskdf.drop(columns="geometry")
daskdf.groupby(['Site']).height.mean().compute()

Tagging with geopandas in hopes future users can find this.

bw4sz
Well, I ran into the opposite problem with dask-geopandas, which "needs" a geometry column. After some aggregate functions I don't want to keep the geometry, and I get an exception: `Unknown column geometry`. – Hugo Roussaffa Jan 04 '22 at 21:54