Problem
I want to calculate a bunch of "easy to gather" statistics using Dask. Speed is my primary concern, so I am looking to throw a wide cluster at the problem. Ideally I would like to finish the described problem in under an hour. I expect to use hundreds to thousands of workers. Currently, for benchmark testing, I am running this on large machines (160 cores, 4 TB RAM), but I plan to move to Kubernetes soon.
Setup
I have some data in a dataframe (pandas, dask, CSV, parquet, etc.). I also have many subsets of that data (with arbitrary column filters) that I would like to calculate statistics for.
DataFrame size: between 5 GB and 5 TB of data (100 million rows, thousands of columns); expecting 50-100 TB in the future.
Statistics size: around 5,000 unique filters, with between 1 and 500 stats per unique filter (5k-5M statistics total).
Toy example below:
requested_statistics = [
    {'filters': [{'col': 'col_1', 'op': 'lt', 'value': 0.8},
                 {'col': 'col_38', 'op': 'lt', 'value': 0.4},
                 {'col': 'col_224', 'op': 'gt', 'value': 0.12333}],
     'output': {'col': 'col_3', 'op': 'sum'},
     'name': 'stat_1'},
    {'filters': [{'col': 'col_551', 'op': 'lt', 'value': 0.8},
                 {'col': 'col_112', 'op': 'gt', 'value': '2018-01-13'},
                 {'col': 'col_1', 'op': 'lt', 'value': 0.8}],
     'output': {'col': 'col_2', 'op': 'avg'},
     'name': 'stat_2'},
]
I can write a simple parser that runs on dask or pandas:
import numpy as np

def filter_index(df, flt):
    # Turn one filter spec into a boolean mask over df.
    filter_ops = {'lt': lambda x, y: x < y,
                  'gt': lambda x, y: x > y,
                  'eq': lambda x, y: x == y}
    return filter_ops[flt['op']](df[flt['col']], flt['value'])

def get_indexer(df, filters):
    # AND all of the filters together into a single boolean mask.
    if len(filters) == 1:
        return filter_index(df, filters[0])
    return np.logical_and(filter_index(df, filters[0]),
                          get_indexer(df, filters[1:]))

def get_statistic(df, statistic):
    indexer = get_indexer(df, statistic['filters'])
    agg_ops = {'sum': np.sum,
               'avg': np.mean,
               'unique_count': lambda x: x.unique().size}
    return agg_ops[statistic['output']['op']](df[statistic['output']['col']][indexer])

all_stats = {stat['name']: get_statistic(df, stat) for stat in requested_statistics}
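For reference, this runs end-to-end on a small random frame (the shape and column count below are placeholders, just to make the toy example self-contained):

import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 600),
                  columns=['col_%d' % i for i in range(600)])
# stat_2 includes a date filter, so only stat_1 makes sense on random floats.
print(get_statistic(df, requested_statistics[0]))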
Some optimizations I have tried:
1) Just rely on dask: future_stats = client.compute(all_stats).
This didn't work because the time to optimize the graph (or just serialize it to the scheduler) takes far too long. In small-scale tests this works fine, but as I scale up npartitions it seems to scale much worse than O(N) in time.
2) Run the calculation on each statistic individually (client.compute(stat, sync=True) or client.compute(stat).result()).
This adds too much overhead talking to the scheduler, and for the ~100,000 statistics I'm trying to calculate it would take far too long.
3) Cache (via persist) intermediate results (indexers), so that I can re-use them.
Given that filters can sometimes share indexers, I have added caching to the filter_index and get_indexer functions. Specifically, I create a hash of the filters and call indexer = client.persist(indexer), returning the persisted indexer on future calls. For get_indexer I have also added a combinations check that looks for any subset of the filters already in the cache. I have also optimized the order in which I compute the statistics so that each next group requires at most one new indexer (i.e., do all of the operations that share the same filters together, then move to the next group).
These have the unfortunate consequence of requiring huge amounts of memory to keep all the boolean masks.
I haven't tried rolling the cache yet (cache.pop(index_id) as the computation runs, once the persisted indexer is no longer required), but that is my next step. A sketch of the cache is below.
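For concreteness, a minimal sketch of that cache (indexer_cache, filter_key, cached_indexer, and release are my own illustrative names, and client is assumed to be a distributed.Client; the real version also does the subset/combinations lookup):

indexer_cache = {}  # filter key -> persisted boolean mask

def filter_key(filters):
    # Order-insensitive, hashable key for a list of filter dicts.
    return frozenset((f['col'], f['op'], str(f['value'])) for f in filters)

def cached_indexer(df, filters):
    key = filter_key(filters)
    if key not in indexer_cache:
        # persist keeps the computed mask in distributed memory for re-use.
        indexer_cache[key] = client.persist(get_indexer(df, filters))
    return indexer_cache[key]

def release(filters):
    # Rolling eviction: drop the mask once no remaining statistic needs it,
    # so the workers can free that memory.
    indexer_cache.pop(filter_key(filters), None)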
Key problems at hand
Solution (3) above is what I have currently implemented, but it's still not performing as well as I'd hoped:
- Memory cost is very high (effectively a full new boolean column for every unique filter).
- Scheduler/graph serialization seems to be expensive.
- Looking at htop shows that most of the time only dask-scheduler is running at 100%, while the workers sit mostly idle.
Questions
1) What are other approaches I could take, or are there any clear misses in the approach listed above?
2) I considered df.query(string), but since this operates on the entire dataframe it seems like it would be inefficient (lots of duplication of data). Is this true, or are there wins from using the built-in syntax parser? (I noticed the dask graphs were smaller for this, but wasn't sure if it would be worth it.)
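For reference, building the query string from the same filter spec would look something like this (to_query and OPS are just illustrative helpers):

OPS = {'lt': '<', 'gt': '>', 'eq': '=='}

def to_query(filters):
    # e.g. "col_1 < 0.8 and col_38 < 0.4 and col_224 > 0.12333"
    return ' and '.join('%s %s %r' % (f['col'], OPS[f['op']], f['value'])
                        for f in filters)

subset = df.query(to_query(requested_statistics[0]['filters']))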
3) The scheduler and the single-threaded (?) dask graph construction seem to be the bottleneck. Is there any clear path to parallelizing these?
4) When I look at the distributed bokeh status dashboard, I have often noticed that it hangs during these computations as well, which makes debugging hard and makes me wonder: does running the webserver actually hurt the scheduler's performance?
5) In the logs I get a lot of Event loop was unresponsive in Worker for Xs. warnings. Is there anything I can do to balance the work, re-write the tasks that get distributed to the workers, or make the scheduler more responsive?
6) From a desire to reduce the complexity of the dask graph, I have used repartition(npartitions=num_workers*2), but I'm not sure if this is a good heuristic or what I should use (see the snippet below).
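Concretely, this is what I am doing now (num_workers is read off the client; the *2 multiplier is a guess on my part):

num_workers = len(client.scheduler_info()['workers'])
df = df.repartition(npartitions=num_workers * 2)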
Here's an example of the tasks that the scheduler is managing (this is for ~25 unique filters, each with ~50 statistics, for a total of ~1,000 statistics being calculated):
https://i.stack.imgur.com/MBEyi.png
Thanks for any help or guiding suggestions on how to approach optimizing this.