The Problem
I'm trying to send a 2 GB CPython read-only object (it can be pickled) to dask distributed workers via apply(). This ends up consuming a lot of memory across the processes/threads (14+ GB).
Is there a way to load the object only once into memory and have the workers concurrently use the object?
More details about the problem
I have two Dask Series, Source_list and Pattern_list, containing 7 million and 3 million strings respectively. I'm trying to find all sub-string matches in Source_list (7M) from Pattern_list (3M).
To speed up the sub-string search, I use the pyahocorasick package to build a CPython data structure (a class object) from Pattern_list (the object is pickle-able).
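For illustration, here is a minimal standalone sketch of what that automaton is and why it can be shipped to workers at all (the pattern strings are made up; add_word, make_automaton, and iter are the standard pyahocorasick calls):

import pickle
import ahocorasick

# Build the automaton from a small, made-up pattern list
automaton = ahocorasick.Automaton()
for pattern in ["he", "she", "hers"]:
    automaton.add_word(pattern, pattern)  # store the pattern itself as the value
automaton.make_automaton()

# iter() yields (end_index, value) pairs for every pattern found in the text
print(list(automaton.iter("ushers")))

# The automaton survives a pickle round-trip, which is what "pickle-able" refers to
restored = pickle.loads(pickle.dumps(automaton))
print(list(restored.iter("ushers")))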
Things I've tried
- Running with the plain dask scheduler (no distributed client) takes about 2.5 hours to process, but finishes with correct results.
- Running with dask distributed normally results in:
distributed.worker - WARNING - Memory use is high but worker has no data to
store to disk. Perhaps some other process is leaking memory? Process memory:
2.85 GB -- Worker memory limit: 3.00 GB
- Running with dask distributed and the memory limit raised to 8 GB/16 GB:
Threads:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 14.5 GB -- Worker memory limit: 16.00 GB
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
Processes: takes more than 2.5 hours and I've never seen it finish (I left it running for 8+ hours before cancelling). It also consumes 10+ GB of memory.
- Using the vectorized string operation Source_list.str.find_all(Pattern_list) takes more than 2.5 hours.
- Storing the object in a global variable and calling it results in the same error as in point 3 for both processes and threads.
- Using map_partitions + a loop/map over Source_list gives the same results as point 3 (a rough sketch of this attempt is below).
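For reference, a minimal sketch of what that map_partitions + map attempt could look like, assuming Source_list holds a single string column named 'text' (the column name and the helper name are hypothetical):

def find_matches(partition, automaton):
    # Run the automaton over every string in one pandas partition
    return partition["text"].map(lambda s: list(automaton.iter(s)))

# large_object is passed as an extra argument, so it travels with the task graph to the workers
result = Source_list.map_partitions(find_matches, large_object, meta=("text", "object"))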
Dask Distributed Code
# OS = Windows 10
# RAM = 16 GB
# CPU cores = 8
# dask version 1.1.1

import dask.dataframe as dd
import ahocorasick
from dask.distributed import Client, progress

def create_ahocorasick_trie(pattern_list):
    A = ahocorasick.Automaton()
    for index, item in pattern_list.iteritems():
        A.add_word(item, item)
    A.make_automaton()
    return A

if __name__ == '__main__':
    # Using threading because the large_object seems to get copied in memory
    # for each process when processes=True
    client = Client(memory_limit="12GB", processes=False)

    # Note: 'source_list.parquet' and 'pattern_list.parquet' are generated via dask
    Source_list = dd.read_parquet("source_list.parquet")
    Pattern_list = dd.read_parquet("pattern_list.parquet")

    large_object = create_ahocorasick_trie(Pattern_list)

    # iter() is an ahocorasick CPython method
    result = Source_list.apply(lambda source_text: {large_object.iter(source_text)}, meta=(None, 'O'))

    progress(result.head(10))
    client.close()