I have data at a scale where a DataFrame merge is unlikely to succeed -- previous attempts have resulted in excessive data shuffling, out-of-memory errors on the scheduler, and communication timeouts in the workers, even after tuning indexing, partitioning, worker count, total memory, and so on.
I've had some success merging "manually" by writing data out to small files and reading them back in when lookups are needed. We currently do this in dask.delayed functions. This obviously requires significant disk I/O.
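Roughly, the current approach looks like the sketch below -- a simplified stand-in, not our real code; the paths, the `key` column, and the bucket count are made up:

```python
import dask
import pandas as pd

N_BUCKETS = 256  # assumed bucket count


def write_lookup_buckets(lookup: pd.DataFrame, path: str) -> None:
    """Split the lookup table once into deterministic hash buckets on disk."""
    buckets = pd.util.hash_pandas_object(lookup["key"], index=False) % N_BUCKETS
    for bucket_id, part in lookup.groupby(buckets):
        part.to_parquet(f"{path}/bucket={int(bucket_id)}.parquet")


@dask.delayed
def merge_chunk(chunk: pd.DataFrame, path: str) -> pd.DataFrame:
    """Merge one pandas chunk of the large table against the on-disk lookup."""
    buckets = pd.util.hash_pandas_object(chunk["key"], index=False) % N_BUCKETS
    pieces = []
    for bucket_id, sub in chunk.groupby(buckets):
        # Read back only the lookup bucket this part of the chunk needs.
        lookup_part = pd.read_parquet(f"{path}/bucket={int(bucket_id)}.parquet")
        pieces.append(sub.merge(lookup_part, on="key", how="left"))
    return pd.concat(pieces)
```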
The Dask delayed best practices (https://docs.dask.org/en/latest/delayed-best-practices.html) warn against sending a DataFrame to delayed, against calling delayed from within delayed, and against relying on global state in distributed scenarios. These warnings lead me to believe there isn't a safe way to use a DataFrame from delayed functions -- am I correct in this understanding?
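To make the question concrete, this is the shape of thing I would like to write (a hypothetical sketch -- the path, the `key` column, and the `enrich` function are made up), which as far as I can tell trips those warnings all at once:

```python
import dask
import dask.dataframe as dd

lookup = dd.read_parquet("lookup/")  # hypothetical path and join column "key"


@dask.delayed
def enrich(chunk):
    # A delayed function reaching into a Dask DataFrame held as global state,
    # and calling compute() from inside a task -- the patterns the best
    # practices appear to rule out.
    matches = lookup[lookup["key"].isin(chunk["key"].unique())].compute()
    return chunk.merge(matches, on="key", how="left")
```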
Unfortunately the scale and sensitivity of the data make it difficult to share a working example here, but consider a 20+ GB lookup table (on the small side) joining to a 65+ GB table (on the very small side). Individually they each work in Dask DataFrame distributed memory without a problem. Our processing requires an index on one column, whereas the merge requires an index on a different column, forcing the large shuffle and repartition.
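The problematic merge is essentially the following (again a sketch with hypothetical paths and column names, "record_id" and "lookup_key"):

```python
import dask.dataframe as dd

lookup = dd.read_parquet("lookup/")   # ~20+ GB
facts = dd.read_parquet("facts/")     # ~65+ GB

# Downstream processing wants the table indexed on "record_id" ...
facts = facts.set_index("record_id")

# ... but the join key is a different column, so the merge has to shuffle
# and repartition both sides to align on "lookup_key".
merged = facts.merge(lookup, on="lookup_key", how="left")
```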
Are there different approaches to merging large DataFrames that I may be missing?