
System Info: CentOS, Python 3.5.2, 64 cores, 96 GB RAM

I'm trying to load a large array (50 GB) from an HDF5 file into RAM (96 GB). Each chunk is around 1.5 GB, which is below the per-worker memory limit. It never seems to complete: workers sometimes crash or restart, and I don't see the memory usage on the web dashboard increasing or any tasks being executed.

Should this work or am I missing something obvious here?

import dask.array as da
import h5py

from dask.distributed import LocalCluster, Client
from matplotlib import pyplot as plt

lc = LocalCluster(n_workers=64)
c = Client(lc)

f = h5py.File('50GB.h5', 'r')
data = f['data']
# data.shape = 2000000, 1000
x = da.from_array(data, chunks=(2000000, 100))  # 10 blocks of ~1.5 GB each (for float64)
x = c.persist(x)
  • 50 GB is the size on disk? – mdurant Nov 13 '18 at 19:17
  • Have you tried to load a single chunk and check (using `x.nbytes`) how much memory it uses? – rpanai Nov 13 '18 at 20:14
  • I think this is just a misunderstanding on my part: I thought each worker would get one chunk of the Dask array, but it seems to try to load the entire array on a single worker, which triggers the memory limit and restarts that worker. – dead_zero Nov 14 '18 at 11:52
  • @dead_zero that is exactly what it is trying to do. If your data is nicely partitioned for the calculation you want to perform, you can use the `dask.array` equivalent of `map_partitions` from `dask.dataframe`, or use a distributed loop (see the sketch after these comments). – rpanai Nov 14 '18 at 14:40
  • Ok I'm going to mark this as answered – dead_zero Nov 14 '18 at 16:42
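
The comment above refers to the `dask.array` counterpart of `dask.dataframe`'s `map_partitions`, which is `map_blocks`. Below is a minimal sketch of that chunk-by-chunk approach, assuming the file and dataset names from the question; the per-column mean is only a placeholder reduction. It also prints `x.nbytes`, as suggested in the second comment.

import dask.array as da
import h5py

f = h5py.File('50GB.h5', 'r')
data = f['data']                                  # shape (2000000, 1000)
x = da.from_array(data, chunks=(2000000, 100))    # 10 blocks, ~1.5 GB each

print(x.nbytes / 1e9)   # in-memory size of the whole array, in GB

# Each block is handed to the function independently, so no single worker
# ever needs to hold the full 50 GB array at once.
col_means = x.map_blocks(
    lambda block: block.mean(axis=0, keepdims=True),
    chunks=(1, 100),
).compute()
print(col_means.shape)   # (1, 1000)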

1 Answer


This was a misunderstanding of how chunks and workers interact. Specifically, changing the way the LocalCluster is initialised fixes the issue described in the question:

lc = LocalCluster(n_workers=1)  # this way the single worker has ~90 GB of memory, so the array can be persisted
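
For completeness, a minimal sketch of the full fixed workflow. The `threads_per_worker` and `memory_limit` arguments are assumptions here (only `n_workers` needs to change), so adjust them to your machine:

import dask.array as da
import h5py
from dask.distributed import LocalCluster, Client

# A single worker gets (almost) all of the machine's RAM, so the whole
# ~50 GB array fits in memory once persisted.
lc = LocalCluster(n_workers=1, threads_per_worker=64, memory_limit='90GB')
c = Client(lc)

f = h5py.File('50GB.h5', 'r')
data = f['data']                                  # shape (2000000, 1000)
x = da.from_array(data, chunks=(2000000, 100))    # 10 blocks, ~1.5 GB each
x = c.persist(x)                                  # load the array into the worker's RAM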