The documentation of the Dask package for dataframes says:
> Dask dataframes look and feel like pandas dataframes, but operate on datasets larger than memory using multiple threads.
But later in the same page:
> One dask DataFrame is comprised of several in-memory pandas DataFrames separated along the index.
Does Dask read the DataFrame partitions from disk sequentially and process them one at a time so the computation fits in memory? Does it spill some partitions back to disk when needed? In general, how does Dask manage the memory <-> disk I/O that makes larger-than-memory data analysis possible?
I tried some basic computations (e.g. the mean rating) on the MovieLens 10M dataset, and my laptop (8 GB RAM) started to swap. The snippet below shows roughly what I ran.
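This is a minimal sketch of the computation; the file path and the assumption that the ratings sit in a plain CSV with a `rating` column are illustrative, not the exact files from the MovieLens archive.

```python
import dask.dataframe as dd

# Illustrative path/column name -- assume the MovieLens 10M ratings
# have been converted to a plain CSV with a header row.
ratings = dd.read_csv("ratings.csv")

# A basic aggregation: mean rating over the whole dataset.
# .compute() triggers the lazy task graph and returns a plain float.
print(ratings["rating"].mean().compute())
```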