
I am reading in large CSV data files using Dask and attempting to perform a groupby on the resulting dataframe. However, I keep receiving

KeyError: 'Column not found: 0'

on the resulting Dask dataframe.

I have replicated the problem on both Dask 1.2.2 and 2.1.0. I do not see the problem with Pandas on the same dataframe. I am using Python 3.6 in all cases.

To help illustrate the problem, I have simplified the code and replicated the error on a much smaller dataset.

import pandas as pd
from dask import dataframe as dd
from dask.distributed import Client

# Start a local Dask client (threads rather than separate processes)
client = Client(processes=False)

data = {
    'col1': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'col2': ['apple','bananna','orange','apple','bananna','orange','apple','bananna','orange'],
    'col3': [34, 12, 1, 36, 22, 6, 22, 16, 4]
    }
pdf = pd.DataFrame(data=data)
print('*************  Pandas DataFrame')
print(pdf.head(5))

print('')
print('Performing groupby on Pandas DataFrame')
pgroup = pdf.groupby(by='col2')
for name, group in pgroup:
    print('')
    print(f'Group: {name}')
    print(group.head(5))


print(' ')
print(' ')


ddf = dd.from_pandas(data=pdf, npartitions=1)
print('*************  Dask DataFrame')
print(ddf.head(5))

print('')
print('Performing groupby on Dask DataFrame')
dgroup = ddf.groupby(by='col2')
# Iterating over the Dask groupby object is where the KeyError is raised
for name, group in dgroup:
    print('')
    print(f'Group: {name}')
    print(group.head(5))

I would have expected the Dask dataframe to produce the same groupby results as Pandas. Instead, I received the following output and error:

*************  Pandas DataFrame
   col1     col2  col3
0     1    apple    34
1     1  bananna    12
2     1   orange     1
3     2    apple    36
4     2  bananna    22

Performing groupby on Pandas DataFrame

Group: apple
   col1   col2  col3
0     1  apple    34
3     2  apple    36
6     3  apple    22

Group: bananna
   col1     col2  col3
1     1  bananna    12
4     2  bananna    22
7     3  bananna    16

Group: orange
   col1    col2  col3
2     1  orange     1
5     2  orange     6
8     3  orange     4


*************  Dask DataFrame
   col1     col2  col3
0     1    apple    34
1     1  bananna    12
2     1   orange     1
3     2    apple    36
4     2  bananna    22

Performing groupby on Dask DataFrame
Traceback (most recent call last):
  File "C:\Users\Craig\source\repos\cevans3098\MarketData_preProcessor\module1.py", line 37, in <module>
    for name, group in dgroup:
  File "F:\anaconda3\lib\site-packages\dask\dataframe\groupby.py", line 1525, in __getitem__
    g._meta = g._meta[key]
  File "F:\anaconda3\lib\site-packages\pandas\core\base.py", line 275, in __getitem__
    raise KeyError("Column not found: {key}".format(key=key))
KeyError: 'Column not found: 0'
Craig Evans
  • Can you show the output of `dgroup` – Akaisteph7 Jul 28 '19 at 23:51
  • @Akaisteph7 Thank you for asking. I assumed the groupby was causing the error, but that call actually goes through. It is the statement `for name, group in dgroup` that fails. `dgroup` is a dask.dataframe.groupby.DataFrameGroupBy object. I just need to figure out how to retrieve the individual groups that the groupby generated. Any ideas? The for statement doesn't seem to work. – Craig Evans Jul 29 '19 at 03:55
  • I think the same issue was discussed in https://stackoverflow.com/questions/39731098/iterate-over-groupby-object-in-dask – Craig Evans Jul 29 '19 at 03:59
  • As a general suggestion: first take a sample of your CSV in pandas and try the function you want to use in apply, take note of the meta, and then use it in Dask. – rpanai Jul 29 '19 at 14:54
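
A rough sketch of the workflow suggested in the last comment (the group_total reduction below is a made-up example, not something from the thread): prototype the per-group function in pandas, note the name and dtype of the result, then pass that as meta to Dask's groupby().apply().

# Sketch only: prototype in pandas first, then reuse the function in Dask
# with an explicit meta. Summing col3 per group is just an illustrative reduction.
def group_total(g):
    return g['col3'].sum()

# Run on the pandas frame to learn the shape and dtype of the result ...
print(pdf.groupby('col2').apply(group_total))

# ... then declare that as meta=(name, dtype) for the Dask version.
print(ddf.groupby('col2').apply(group_total, meta=('group_total', 'int64')).compute())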

1 Answer


`DataFrameGroupBy.__iter__` isn't implemented for Dask DataFrames yet; see https://github.com/dask/dask/issues/5124
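
Until that issue is resolved, one way to iterate the groups is sketched below (my own rough workaround, not part of this answer, relying on get_group, which Dask's groupby does implement, and on computing the unique keys eagerly):

# Workaround sketch: compute the group keys first, then fetch each group.
for name in ddf['col2'].unique().compute():
    group = dgroup.get_group(name)   # still a lazy Dask DataFrame
    print('')
    print(f'Group: {name}')
    print(group.head(5))             # head() triggers the computation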

TomAugspurger