
I'm getting a MemoryError when I try to drop duplicate timestamps on a large dataframe with the following code.

import dask.dataframe as dd

path = f's3://{container_name}/*'
ddf = dd.read_parquet(path, storage_options=opts, engine='fastparquet')
ddf = ddf.reset_index().drop_duplicates(subset='timestamp_utc').set_index('timestamp_utc')
...

Profiling showed it using about 14 GB of RAM on a dataset of 265 MB of gzipped parquet files containing about 40 million rows of data.

Is there an alternative way I can drop duplicate indexes on my data without Dask using so much memory?

The traceback is below:

Traceback (most recent call last):
  File "/anaconda/envs/surb/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda/envs/surb/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/chengkai/surbana_lift/src/consolidate_data.py", line 62, in <module>
    consolidate_data()
  File "/home/chengkai/surbana_lift/src/consolidate_data.py", line 37, in consolidate_data
    ddf = ddf.reset_index().drop_duplicates(subset='timestamp_utc').set_index('timestamp_utc')
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/dataframe/core.py", line 2524, in set_index
    divisions=divisions, **kwargs)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/dataframe/shuffle.py", line 64, in set_index
    divisions, sizes, mins, maxes = base.compute(divisions, sizes, mins, maxes)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/base.py", line 407, in compute
    results = get(dsk, keys, **kwargs)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/threaded.py", line 75, in get
    pack_exception=pack_exception, **kwargs)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/local.py", line 521, in get_async
    raise_exception(exc, tb)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/compatibility.py", line 67, in reraise
    raise exc
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/local.py", line 290, in execute_task
    result = _execute_task(task, data)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/local.py", line 270, in _execute_task
    args2 = [_execute_task(a, cache) for a in args]
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/local.py", line 270, in <listcomp>
    args2 = [_execute_task(a, cache) for a in args]
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/local.py", line 267, in _execute_task
    return [_execute_task(a, cache) for a in arg]
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/local.py", line 267, in <listcomp>
    return [_execute_task(a, cache) for a in arg]
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/local.py", line 271, in _execute_task
    return func(*args2)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/dataframe/core.py", line 69, in _concat
    return args[0] if not args2 else methods.concat(args2, uniform=True)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/dask/dataframe/methods.py", line 329, in concat
    out = pd.concat(dfs3, join=join)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 226, in concat
    return op.get_result()
  File "/anaconda/envs/surb/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 423, in get_result
    copy=self.copy)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/pandas/core/internals.py", line 5418, in concatenate_block_manage
rs
    [ju.block for ju in join_units], placement=placement)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/pandas/core/internals.py", line 2984, in concat_same_type
    axis=self.ndim - 1)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/pandas/core/dtypes/concat.py", line 461, in _concat_datetime
    return _concat_datetimetz(to_concat)
  File "/anaconda/envs/surb/lib/python3.6/site-packages/pandas/core/dtypes/concat.py", line 506, in _concat_datetimetz
    new_values = np.concatenate([x.asi8 for x in to_concat])
MemoryError

1 Answer


It is not too surprising that the data becomes very big in memory. Parquet is a pretty efficient format in terms of space, especially with gzip compression, and strings all become Python objects (which are expensive in memory).

In addition, you have a number of worker threads operating on parts of the overall dataframe. That involves data copying, intermediates, and concatenation of results; the latter is pretty inefficient in pandas.

One suggestion: instead of calling reset_index, you can remove one step by passing index=False to read_parquet, so that no index is set in the first place.
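
For example, a minimal sketch of that change, reusing the same path and storage_options as in the question (container_name and opts are assumed to be defined as in your script):

import dask.dataframe as dd

path = f's3://{container_name}/*'
# index=False stops read_parquet from building an index from the parquet
# metadata, so the reset_index() step can be dropped entirely
ddf = dd.read_parquet(path, storage_options=opts, engine='fastparquet', index=False)
ddf = ddf.drop_duplicates(subset='timestamp_utc').set_index('timestamp_utc')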

Next suggestion: limit the number of threads you use to something smaller than the default, which is probably your number of CPU cores. The easiest way to do that is to use the distributed client in-process:

from dask.distributed import Client

# in-process scheduler with a limited number of threads
c = Client(processes=False, threads_per_worker=4)

It may be better to set the index first, and then do the drop_duplicates with map_partitions to minimise cross-partition communication, for example:

ddf.map_partitions(lambda d: d.drop_duplicates(subset='timestamp_utc'))
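
Putting the pieces together, one possible sketch (note that once timestamp_utc becomes the index, the per-partition de-duplication has to act on the index rather than on a column; this also assumes that set_index's shuffle places equal timestamps in the same partition, which it normally does):

ddf = dd.read_parquet(path, storage_options=opts, engine='fastparquet', index=False)
# set_index shuffles/sorts the data by timestamp_utc
ddf = ddf.set_index('timestamp_utc')
# keep only the first row for each index value within each partition
ddf = ddf.map_partitions(lambda d: d[~d.index.duplicated(keep='first')])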
mdurant
    I'm able to run the calculations now by reducing the number of workers and dropping duplicates with `map_partition` after repartitioning the dataframe with `repartition(freq='W')`. Thanks. – Lee Chengkai Jul 13 '18 at 07:53
  • Could you possibly update the question or the answer with example code of how to use `map_partition`? – Edgar H Nov 20 '18 at 11:20
  • @mdurant I am doing .drop_duplicates(split_out=n) on an 80 GB dataframe, but I always run into memory errors. I noticed that a simple drop_duplicates always creates 1 resulting partition, which I don't want. All my partitions are already deduped, though. I have 64 GB RAM and 2 workers with 25 GB each doing the job on 600 partitions. map_partitions will only drop dupes per partition but not across. Any idea what to do? – user670186 May 13 '19 at 20:07
  • You probably want to set the index to the column you want to dedup on. This causes a shuffle/sort, but means you can act across partitions in parallel. – mdurant May 13 '19 at 20:16
  • Could you please provide sample code for your last comment @mdurant – user2672299 Jun 18 '20 at 09:09