
The Problem

I'm trying to send a 2 GB read-only CPython object (it can be pickled) to Dask distributed workers via apply(). This ends up consuming a lot of memory across processes/threads (14+ GB).

Is there a way to load the object into memory only once and have the workers use it concurrently?

More details about the problem

I have two Dask series, Source_list and Pattern_list, containing 7 million and 3 million strings respectively. I'm trying to find all substring matches in Source_list (7M) from Pattern_list (3M).

To speed up the substring search, I use the pyahocorasick package to create a CPython data structure (a class object) from Pattern_list (the object is picklable).
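
For context, a minimal sketch of how such an automaton is built and queried with pyahocorasick; the example patterns and text below are made up:

import ahocorasick

# Build an automaton from a few illustrative patterns
automaton = ahocorasick.Automaton()
for pattern in ["foo", "bar", "foobar"]:
    automaton.add_word(pattern, pattern)  # store the pattern itself as the value
automaton.make_automaton()

# iter() yields (end_index, value) tuples for every match found in the text
matches = [value for end_index, value in automaton.iter("xxfoobarxx")]
print(sorted(matches))  # ['bar', 'foo', 'foobar']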

Things I've tried

  1. Running with the single-machine dask scheduler takes about 2.5 hours, but it finishes with correct results.
  2. Running with dask distributed normally results in:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 2.85 GB -- Worker memory limit: 3.00 GB
  3. Running with dask distributed with the memory limit increased to 8 GB/16 GB:

    • Threads

      distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 14.5 GB -- Worker memory limit: 16.00 GB
      distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
      
    • Processes: Takes more than 2.5 hours to process and I've never seen it finish (I left it running for 8+ hours before cancelling). It also consumes 10+ GB of memory.

  4. Using the vectorized string operation Source_list.str.find_all(Pattern_list) takes more than 2.5 hours.
  5. Storing the object in a global variable and calling it results in the same errors as in point 3 for both processes and threads.
  6. Using map_partitions + loop/map on Source_list gives the same results as point 3.

Dask Distributed Code

# OS = Windows 10
# RAM = 16 GB
# CPU cores = 8
# dask version 1.1.1

import dask.dataframe as dd
import ahocorasick
from dask.distributed import Client, progress

def create_ahocorasick_trie(pattern_list):
    A = ahocorasick.Automaton()
    for index, item in pattern_list.iteritems():
        A.add_word(item, item)
    A.make_automaton()
    return A

if __name__ == '__main__':
    client = Client(memory_limit="12GB",processes=False)

    # Using threading because the large_object seems to get copied in memory
    # for each process when processes=True

    Source_list = dd.read_parquet("source_list.parquet") 
    Pattern_list = dd.read_parquet("pattern_list.parquet")

    # Note: 'source_list.parquet' and 'pattern_list.parquet' are generated via dask

    large_object = create_ahocorasick_trie(Pattern_list)

    result = Source_list.apply(lambda source_text: list(large_object.iter(source_text)), meta=(None, 'O'))

    # iter() is an ahocorasick CPython method

    progress(result.head(10))

    client.close()




Hyperspace
1 Answer


The short answer is to wrap it in a dask.delayed call:

big = dask.delayed(big)
df.apply(func, extra=big)

Dask will move it around as necessary and treat it as its own piece of data. That said, it will need to exist on every worker, so you should have significantly more RAM per worker than that object takes up (at least 4x or so).
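
Adapted to the code in the question, that would look roughly like the sketch below. The helper name find_matches and the keyword name automaton are placeholders, not part of the answer; the pattern is simply wrapping the automaton in dask.delayed and passing it as a keyword argument to the user-defined function.

import dask

# Hypothetical helper: receives the concrete automaton inside each task
def find_matches(source_text, automaton):
    return [value for _end, value in automaton.iter(source_text)]

# Wrap the large automaton once so dask ships it as a single piece of data
delayed_automaton = dask.delayed(large_object)

result = Source_list.apply(find_matches, automaton=delayed_automaton, meta=(None, 'O'))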

MRocklin
  • Can I use the same technique for map or map_partitions ? I can't see `extra=` argument in API. – spiralarchitect Dec 28 '19 at 17:29
  • Yes, this works for those methods as well. I'm using the term `extra` above as a placeholder for any keyword argument to your user defined function. – MRocklin Dec 29 '19 at 20:31
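
For reference, the same idea with map_partitions might look like the sketch below; match_partition is a made-up helper, and the delayed automaton is again passed as an ordinary keyword argument.

import dask

delayed_automaton = dask.delayed(large_object)

# Hypothetical per-partition helper: partition arrives as a pandas Series
def match_partition(partition, automaton):
    return partition.map(lambda text: [value for _end, value in automaton.iter(text)])

result = Source_list.map_partitions(match_partition, automaton=delayed_automaton, meta=(None, 'O'))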