
I have the following Dask DataFrame:

@timestamp                        datetime64[ns]
@version                                  object
dst                                       object
dst_port                                  object
host                                      object
http_req_header_contentlength             object
http_req_header_host                      object
http_req_header_referer                   object
http_req_header_useragent                 object
http_req_method                           object
http_req_secondleveldomain                object
http_req_url                              object
http_req_version                          object
http_resp_code                            object
http_resp_header_contentlength            object
http_resp_header_contenttype              object
http_user                                 object
local_time                                object
path                                      object
src                                       object
src_port                                  object
tags                                      object
type                                       int64
dtype: object

I am trying to run a group-by operation:

grouped_by_df = df.groupby(['http_user', 'src'])['@timestamp'].agg(['min', 'max']).reset_index()

When running `grouped_by_df.count().compute()` I get the following error:

Traceback (most recent call last):
  File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-62-9acb48b4ac67>", line 1, in <module>
    user_host_map.count().compute()
  File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/dask/base.py", line 98, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/dask/base.py", line 205, in compute
    results = get(dsk, keys, **kwargs)
  File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/distributed/client.py", line 1893, in get
    results = self.gather(packed)
  File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/distributed/client.py", line 1355, in gather
    direct=direct, local_worker=local_worker)
  File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/distributed/client.py", line 531, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/distributed/utils.py", line 234, in sync
    six.reraise(*error[0])
  File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/distributed/utils.py", line 223, in f
    result[0] = yield make_coro()
  File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/distributed/client.py", line 1235, in _gather
    traceback)
  File "/home/avlach/virtualenvs/dask/local/lib/python2.7/site-packages/distributed/protocol/pickle.py", line 59, in loads
    return pickle.loads(x)
TypeError: itemgetter expected 1 arguments, got 0

I am using dask version 0.15.1 with a LocalCluster Client. What could be causing the issue?
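
For reference, the setup looks roughly like this (a minimal sketch; the cluster setup and the 'logs.csv' read step are illustrative placeholders, only the group-by itself is from above):

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Local cluster setup, as described above
cluster = LocalCluster()
client = Client(cluster)

# 'logs.csv' is a hypothetical placeholder for the real data source
df = dd.read_csv('logs.csv', parse_dates=['@timestamp'])

# The group-by from the question
grouped_by_df = df.groupby(['http_user', 'src'])['@timestamp'].agg(['min', 'max']).reset_index()

# This is the call that raises the TypeError
grouped_by_df.count().compute()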

Apostolos
  • Few questions: 1. are "http_user" and "src" strings or complex objects? 2. Did you try removing the "@" from "@timestamp"? – Daniel Severo Mar 12 '19 at 03:25

1 Answer


We just had a similar error; we were running something of the form:

df[['col1','col2']].groupby('col1').agg("count")

and were getting a similar error, ending with:

    return pickle.loads(x)
TypeError: itemgetter expected 1 arguments, got 0

But when we reformatted the groupby to be of the form:

df.groupby('col1')['col2'].count()

We stopped getting that error. We have since reproduced this a few times, so it doesn't appear to be a fluke. We're not sure why it happens, but it's worth a try if someone is struggling with the same issue.
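
A minimal sketch contrasting the two forms (the column names and data here are hypothetical, just for illustration):

import pandas as pd
import dask.dataframe as dd

# Hypothetical toy frame, just to contrast the two forms
pdf = pd.DataFrame({'col1': ['a', 'a', 'b'], 'col2': [1, 2, 3]})
df = dd.from_pandas(pdf, npartitions=2)

# The form that triggered the pickling error for us
counts_agg = df[['col1', 'col2']].groupby('col1').agg('count')

# The reformatted version that did not
counts = df.groupby('col1')['col2'].count()
print(counts.compute())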

Louise Fallon