Problem
I want to calculate a bunch of "easy to gather" statistics using Dask. Speed is my primary concern, so I am looking to throw a wide cluster at the problem. Ideally I would like to finish the described problem in under an hour. I expect to use hundreds to thousands of workers. Currently, for benchmark testing, I am running this on large machines (160 cores, 4 TB RAM), but I plan to move to Kubernetes soon.
Setup
I have some data in a dataframe (pandas, dask, CSV, parquet, etc.). I also have many subsets of that data (with arbitrary column filters) that I would like to calculate statistics for.
DataFrame size: between 5 GB and 5 TB of data (100 million rows, thousands of columns); expecting 50-100 TB in the future.
Statistics size: around 5,000 unique filters, with between 1 and 500 stats per unique filter (5k-5M statistics total).
Toy example below:
requested_statistics = [
    {'filters': [{'col': 'col_1', 'op': 'lt', 'value': 0.8},
                 {'col': 'col_38', 'op': 'lt', 'value': 0.4},
                 {'col': 'col_224', 'op': 'gt', 'value': 0.12333}],
     'output': {'col': 'col_3', 'op': 'sum'},
     'name': 'stat_1'},
    {'filters': [{'col': 'col_551', 'op': 'lt', 'value': 0.8},
                 {'col': 'col_112', 'op': 'gt', 'value': '2018-01-13'},
                 {'col': 'col_1', 'op': 'lt', 'value': 0.8}],
     'output': {'col': 'col_2', 'op': 'avg'},
     'name': 'stat_2'},
]
I can write a simple parser that runs on dask or pandas:
import numpy as np

def filter_index(df, flt):
    # Turn one filter spec into a boolean mask over df.
    filter_ops = {'lt': lambda x, y: x < y,
                  'gt': lambda x, y: x > y,
                  'eq': lambda x, y: x == y}
    return filter_ops[flt['op']](df[flt['col']], flt['value'])

def get_indexer(df, filters):
    # AND all of the filters together into a single boolean mask.
    if len(filters) == 1:
        return filter_index(df, filters[0])
    return np.logical_and(filter_index(df, filters[0]),
                          get_indexer(df, filters[1:]))

def get_statistic(df, statistic):
    indexer = get_indexer(df, statistic['filters'])
    agg_ops = {'sum': np.sum,
               'avg': np.mean,
               'unique_count': lambda x: x.unique().size}
    return agg_ops[statistic['output']['op']](df[statistic['output']['col']][indexer])

all_stats = {stat['name']: get_statistic(df, stat) for stat in requested_statistics}
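For reference, this runs end-to-end on a small random frame (the shape and column count below are placeholders, just to make the toy example self-contained):

import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 600),
                  columns=['col_%d' % i for i in range(600)])
# stat_2 includes a date filter, so only stat_1 makes sense on random floats.
print(get_statistic(df, requested_statistics[0]))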
Some optimizations I have tried:
1) Just rely on dask: future_stats = client.compute(all_stats).
This didn't work because the time to optimize the graph (or just serialize it to the scheduler) takes far too long. In small-scale tests this works fine, but as I scale up npartitions it seems to scale much worse than O(N) in time.
2) Run the calculation on each statistic individually (client.compute(stat, sync=True) or client.compute(stat).result()).
This adds too much overhead talking to the scheduler, and for the ~100,000 statistics I'm trying to calculate it would take far too long.
3) Cache (via persist) intermediate results (indexers), so that I can re-use them.
Given that filters can sometimes share indexers, I have added caching to the filter_index and get_indexer functions. Specifically, I create a hash of the filters and call indexer = client.persist(indexer), returning the persisted indexer on future calls. For get_indexer I have also added a combinations check that looks for any subset of the filters already in the cache. I have also optimized the order in which I compute the statistics so that each next group requires at most one new indexer (i.e., do all of the operations that share the same filters together, then move to the next group).
These have the unfortunate consequence of requiring huge amounts of memory to keep all the boolean masks.
I haven't tried rolling the cache yet (cache.pop(index_id) as the computation runs, once the persisted indexer is no longer required), but that is my next step. A sketch of the cache is below.
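For concreteness, a minimal sketch of that cache (indexer_cache, filter_key, cached_indexer, and release are my own illustrative names, and client is assumed to be a distributed.Client; the real version also does the subset/combinations lookup):

indexer_cache = {}  # filter key -> persisted boolean mask

def filter_key(filters):
    # Order-insensitive, hashable key for a list of filter dicts.
    return frozenset((f['col'], f['op'], str(f['value'])) for f in filters)

def cached_indexer(df, filters):
    key = filter_key(filters)
    if key not in indexer_cache:
        # persist keeps the computed mask in distributed memory for re-use.
        indexer_cache[key] = client.persist(get_indexer(df, filters))
    return indexer_cache[key]

def release(filters):
    # Rolling eviction: drop the mask once no remaining statistic needs it,
    # so the workers can free that memory.
    indexer_cache.pop(filter_key(filters), None)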
Key problems at hand
Solution (3) above is what I have currently implemented, but it's still not performing as well as I'd hoped:
- Memory cost is very high (effectively a full new boolean column for every unique filter).
- Scheduler/graph serialization seems to be expensive.
- Looking at htop shows that most of the time only dask-scheduler is running at 100%, while the workers sit mostly idle.
Questions
1) What are other approaches I could take, or are there any clear misses in the approach listed above?
2) I considered df.query(string), but since this operates on the entire dataframe it seems like it would be inefficient (lots of duplication of data). Is this true, or are there wins from using the built-in syntax parser? (I noticed the dask graphs were smaller for this, but wasn't sure if it would be worth it.)
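For reference, building the query string from the same filter spec would look something like this (to_query and OPS are just illustrative helpers):

OPS = {'lt': '<', 'gt': '>', 'eq': '=='}

def to_query(filters):
    # e.g. "col_1 < 0.8 and col_38 < 0.4 and col_224 > 0.12333"
    return ' and '.join('%s %s %r' % (f['col'], OPS[f['op']], f['value'])
                        for f in filters)

subset = df.query(to_query(requested_statistics[0]['filters']))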
3) The scheduler and the single-threaded (?) dask graph construction seem to be the bottleneck. Is there any clear path to parallelizing these?
4) When I look at the distributed bokeh status dashboard, I have often noticed that it hangs during these computations as well, which makes debugging hard and makes me wonder: does running the webserver actually hurt the scheduler's performance?
5) In the logs I get a lot of Event loop was unresponsive in Worker for Xs. warnings. Is there anything I can do to balance the work, re-write the tasks that get distributed to the workers, or make the scheduler more responsive?
6) From a desire to reduce the complexity of the dask graph, I have used repartition(npartitions=num_workers*2), but I'm not sure if this is a good heuristic or what I should use (see the snippet below).
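Concretely, this is what I am doing now (num_workers is read off the client; the *2 multiplier is a guess on my part):

num_workers = len(client.scheduler_info()['workers'])
df = df.repartition(npartitions=num_workers * 2)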
Here's an example of the tasks that the scheduler is managing (this is for ~25 unique filters, each with ~50 statistics, for a total of ~1,000 statistics being calculated):
https://i.stack.imgur.com/MBEyi.png
Thanks for any help or guiding suggestions on how to approach optimizing this.