
I am trying to index and save large CSVs that cannot be loaded into memory. My code to load the CSV, perform a computation, and index by the new values works without issue. A simplified version is:

import os

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=6, threads_per_worker=1)
client = Client(cluster, memory_limit='1GB')

df = dd.read_csv(filepath, header=None, sep=' ', blocksize=25e7)
df['new_col'] = df.map_partitions(lambda x: some_function(x))
df = df.set_index(df.new_col, sorted=False)

However, when I use large files (i.e. > 15 GB) I run into a memory error when saving the dataframe to CSV with:

df.to_csv(os.path.join(save_dir, filename+'_*.csv'), index=False, chunksize=1000000)

I have tried setting chunksize=1000000 to see if this would help, but it didn't.

The full stack trace is:

Traceback (most recent call last):
  File "/home/david/data/pointframes/examples/dask_z-order.py", line 44, in <module>
    calc_zorder(fp, save_dir)
  File "/home/david/data/pointframes/examples/dask_z-order.py", line 31, in calc_zorder
    df.to_csv(os.path.join(save_dir, filename+'_*.csv'), index=False, chunksize=1000000)
  File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/core.py", line 1159, in to_csv
    return to_csv(self, filename, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/io/csv.py", line 654, in to_csv
    delayed(values).compute(scheduler=scheduler)
  File "/usr/local/lib/python2.7/dist-packages/dask/base.py", line 156, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dask/base.py", line 398, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dask/threaded.py", line 76, in get
    pack_exception=pack_exception, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dask/local.py", line 459, in get_async
    raise_exception(exc, tb)
  File "/usr/local/lib/python2.7/dist-packages/dask/local.py", line 230, in execute_task
    result = _execute_task(task, data)
  File "/usr/local/lib/python2.7/dist-packages/dask/core.py", line 118, in _execute_task
    args2 = [_execute_task(a, cache) for a in args]
  File "/usr/local/lib/python2.7/dist-packages/dask/core.py", line 119, in _execute_task
    return func(*args2)
  File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/shuffle.py", line 426, in collect
    res = p.get(part)
  File "/usr/local/lib/python2.7/dist-packages/partd/core.py", line 73, in get
    return self.get([keys], **kwargs)[0]
  File "/usr/local/lib/python2.7/dist-packages/partd/core.py", line 79, in get
    return self._get(keys, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/partd/encode.py", line 30, in _get
    for chunk in raw]
  File "/usr/local/lib/python2.7/dist-packages/partd/pandas.py", line 175, in deserialize
    for (h, b) in zip(headers[2:], bytes[2:])]
  File "/usr/local/lib/python2.7/dist-packages/partd/pandas.py", line 136, in block_from_header_bytes
    copy=True).reshape(shape)
  File "/usr/local/lib/python2.7/dist-packages/partd/numpy.py", line 126, in deserialize
    result = result.copy()
MemoryError

I am running dask v1.1.0 on an Ubuntu 18.04 system with Python 2.7. My computer has 32 GB of memory. The code works as expected with small files that would fit into memory anyway, but not with larger ones. Is there something I am missing here?

D.Griffiths
  • Did you try to use Dask Client and dashboard to see where and when the issue happens? Also did you try to reduce `chunksize` even further (10,000 for example)? – Qusai Alothman Jan 31 '19 at 16:10
  • Reducing the chunk size still allows it to work sometimes but not always. I have tried assigning 50mb chunks to 6 cores (total 300mb), but it is still using up a very large chunk of memory: https://pasteboard.co/HZVrGa7.png . This was just before it crashed with the error `distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting`, `distributed.nanny - WARNING - Worker process 26453 was killed by signal 15`, `distributed.scheduler.KilledWorker` – D.Griffiths Feb 06 '19 at 15:59
  • This line `df = df.set_index(df.new_col, sorted=False)` loads all the data, as it's not lazy. Try running the code without it. See [Dask DataFrame Performance Tips](http://docs.dask.org/en/latest/dataframe-performance.html#avoid-shuffles). – moshevi Feb 06 '19 at 17:08
  • some_function() generates an index column, so the `set_index()` is essential for me. What I don't understand is: if I set small partitions (50 MB) and only a few workers (i.e. 6), how is it possible that some workers exceed the memory limit of 4 GB? Is this not the point of having the data in partitions? I'm having this issue with pretty much all of my dask code. – D.Griffiths Feb 06 '19 at 17:15
  • To my understanding `set_index` loads all the data; this is why you get a memory error. At the end you are saving to a csv file (no indices), so try doing your operations on `new_col` as a column and not as an index, filtering and what not (a minimal sketch of this follows these comments). Writing a larger-than-memory pipeline in dask is quite delicate. – moshevi Feb 06 '19 at 17:28
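
A minimal sketch of that suggestion, reusing the placeholder names from the question (filepath, some_function, save_dir, filename): keep new_col as an ordinary column so the shuffle that `set_index` triggers never happens. The trade-off is that the output files are no longer globally sorted by new_col, so this only helps if a global ordering is not strictly required.

import os

import dask.dataframe as dd

# Same read as in the question; new_col stays a plain column, so no shuffle.
df = dd.read_csv(filepath, header=None, sep=' ', blocksize=25e7)
df['new_col'] = df.map_partitions(lambda x: some_function(x))

# Column-wise operations (an arbitrary filter here, just for illustration)
# work without an index, and each partition is written out independently.
df = df[df['new_col'].notnull()]
df.to_csv(os.path.join(save_dir, filename + '_*.csv'), index=False)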

1 Answer


I encourage you to try smaller chunks of data. You should control this in the read_csv part of your computation rather than the to_csv part.
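
For example, a sketch assuming the same read_csv call as in the question: lowering blocksize from 25e7 (roughly 250 MB of text per partition) to 25e6 gives many more, much smaller partitions, so each task holds far less data in memory at once.

# Roughly 25 MB of CSV text per partition instead of ~250 MB.
df = dd.read_csv(filepath, header=None, sep=' ', blocksize=25e6)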

MRocklin
  • Is that with the `blocksize` argument? When I set this to 250 MB it usually breaks on larger datasets. I have 8 workers, so 250 MB * 8 = 2 GB. Why would this run out of memory when 32 GB is available? Where does all of the other memory demand come from? – D.Griffiths Feb 20 '19 at 11:13
  • Honestly I couldn't tell you. Nothing you're doing seems odd. Perhaps your data is larger than you think it is in memory? Text data is particularly bloated in Pandas (a quick way to check this is sketched below). – MRocklin Feb 21 '19 at 02:09
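
A quick way to check that, sketched under the assumption that df is the dataframe returned by the read_csv call in the question: materialise a single partition and ask pandas how much memory it actually occupies, since object (string) columns often take several times their on-disk size.

# Pull only the first partition into memory and measure it with pandas.
part = df.get_partition(0).compute()
size_gb = part.memory_usage(deep=True).sum() / 1e9
print("partition 0 uses about %.2f GB in memory" % size_gb)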