I used groupby on two columns of a Dask DataFrame (the DataFrame has about 70 columns, all float except the date column, which is datetime), hoping to get a Dask DataFrame back:

result_ddf = base_ddf.groupby(["firts_integer_column","second_integer_column"])

I cannot use the result because it has an unfamiliar type:

dask.dataframe.groupby.DataFrameGroupBy

How can I use the result as a Dask DataFrame? When I try .head() or .compute(), I get errors.

CODE 1

result_ddf.get_partition(1)

ERROR 1

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/anaconda3/envs/rapids/lib/python3.7/site-packages/dask/dataframe/groupby.py in __getattr__(self, key)
   1779         try:
-> 1780             return self[key]
   1781         except KeyError as e:

~/anaconda3/envs/rapids/lib/python3.7/site-packages/dask/dataframe/groupby.py in __getitem__(self, key)
   1765         # error is raised from pandas
-> 1766         g._meta = g._meta[key]
   1767         return g

~/anaconda3/envs/rapids/lib/python3.7/site-packages/pandas/core/groupby/generic.py in __getitem__(self, key)
   1609             )
-> 1610         return super().__getitem__(key)
   1611 

~/anaconda3/envs/rapids/lib/python3.7/site-packages/pandas/core/base.py in __getitem__(self, key)
    227             if key not in self.obj:
--> 228                 raise KeyError(f"Column not found: {key}")
    229             return self._gotitem(key, ndim=1)

KeyError: 'Column not found: get_partition'

The above exception was the direct cause of the following exception:

AttributeError                            Traceback (most recent call last)
<ipython-input-279-2c7697a2a4f8> in <module>
----> 1 result_ddf.get_partition(1)

~/anaconda3/envs/rapids/lib/python3.7/site-packages/dask/dataframe/groupby.py in __getattr__(self, key)
   1780             return self[key]
   1781         except KeyError as e:
-> 1782             raise AttributeError(e) from e
   1783 
   1784     @derived_from(pd.core.groupby.DataFrameGroupBy)

AttributeError: 'Column not found: get_partition'

CODE 2

result_ddf.head()

ERROR 2

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/anaconda3/envs/rapids/lib/python3.7/site-packages/dask/dataframe/groupby.py in __getattr__(self, key)
   1779         try:
-> 1780             return self[key]
   1781         except KeyError as e:

~/anaconda3/envs/rapids/lib/python3.7/site-packages/dask/dataframe/groupby.py in __getitem__(self, key)
   1765         # error is raised from pandas
-> 1766         g._meta = g._meta[key]
   1767         return g

~/anaconda3/envs/rapids/lib/python3.7/site-packages/pandas/core/groupby/generic.py in __getitem__(self, key)
   1609             )
-> 1610         return super().__getitem__(key)
   1611 

~/anaconda3/envs/rapids/lib/python3.7/site-packages/pandas/core/base.py in __getitem__(self, key)
    227             if key not in self.obj:
--> 228                 raise KeyError(f"Column not found: {key}")
    229             return self._gotitem(key, ndim=1)

KeyError: 'Column not found: head'

The above exception was the direct cause of the following exception:

AttributeError                            Traceback (most recent call last)
<ipython-input-277-bf3c0aecfa21> in <module>
----> 1 result_ddf.head()

~/anaconda3/envs/rapids/lib/python3.7/site-packages/dask/dataframe/groupby.py in __getattr__(self, key)
   1780             return self[key]
   1781         except KeyError as e:
-> 1782             raise AttributeError(e) from e
   1783 
   1784     @derived_from(pd.core.groupby.DataFrameGroupBy)

AttributeError: 'Column not found: head'

Things I have tried

sogu

2 Answers

You are missing the operation after .groupby. A groupby on its own only describes how the rows are grouped, so it has no .head() or .compute(). For example, if you want to sum another column, third_integer_column, the code would look like this:

result_ddf = base_ddf.groupby(["firts_integer_column","second_integer_column"]).agg({'third_integer_column': 'sum'})

After that you can run result_ddf.head() to see what the first few results look like.

SultanOrazbayev

After calling groupby, you need to perform an operation on its output. The operations available on a groupby object are listed in the Dask DataFrame groupby documentation.

If you wish to perform a custom operation on the output of groupby then you can use the apply function.

Once this is done, you can use head() or compute() to see the contents of the dataframe.

saloni