Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.
Questions tagged [dask-distributed]
1090 questions
3 votes, 1 answer
Dask: Continue with others task if one fails
I have a simple (but large) task graph in Dask. Here is a code example:
results = []
for params in SomeIterable:
    a = dask.delayed(my_function)(**params)
    b = dask.delayed(my_other_function)(a)
…

Andrex
3 votes, 1 answer
Unaccountable Dask memory usage
I am digging into Dask and (mostly) feel comfortable with it. However, I cannot understand what is going on in the following scenario. TBH, I'm sure a question like this has been asked in the past, but after searching for a while I can't seem to find…

Severin
3 votes, 0 answers
How to restart dask worker subprocess after task is done?
In django-q we have recycle, which is "the number of tasks a worker will process before recycling. Useful to release memory resources on a regular basis."
When I start dask-worker with --nprocs 2, I get two worker subprocesses.
I would like to recycle…

nurettin
3 votes, 0 answers
Disable xarray's automatic use of dask within a dask task
Background
I'm using dask to manage tens, sometimes hundreds of thousands of jobs, each of which involves reading in zarr data, transforming the data in some way, and writing out output (one output per job). I'm using a pangeo/daskhub-style…

Michael Delgado
3 votes, 1 answer
Read single large zipped csv (too large for memory) using Dask
I have a use case where I have an S3 bucket containing a list of a few hundred gzipped files. Each individual file, when unzipped and loaded into a dataframe, occupies more than the available memory. I'd like to read these files and perform some…

David Moye
3 votes, 1 answer
Can dask dashboard be used on SageMaker (Labs 1.2.*)?
I don't have browser access to the lab environment, and the available dask extension for lab hasn't worked for me so far.
I want to be able to see the progress and performance data for my dask projects, but no luck so far.
compute() sometimes takes hours…

Alejandro
3 votes, 0 answers
Dask distributed.core - ERROR - 'tuple' object does not support item assignment
I am using Dask and Cython in my project, where I invoke Cython code after registering it with the client and collect the result from the Cython code in my Dask Python code. When I make a cluster with processes=True, it works fine. But, as soon…

Rahul
3 votes, 1 answer
Dask: How to return a tuple of futures in client.submit
I need to return a tuple from a task which has to be unpacked in the main process because each element of the tuple will go to different dask tasks. I would like to avoid unnecessary communication so I think that the tuple elements should be…

z4m0
3 votes, 2 answers
Dask - how to assign task to the specific CPU
I'm using Dask to process research batches, which are quite heavy (from a few minutes to a few hours). There's no communication between the tasks and they produce only side results. I'm using a machine which already virtualizes resources beneath it (~…

Piotr Rarus
3 votes, 0 answers
Make Dask-Yarn More Robust to Node Failures
We're using Dask to distribute compute work across an EMR cluster. We're using Dask-Yarn. We've noticed that when we experience node failures sometimes those failures will take out the container running the Scheduler and our jobs fail. I was going…

gallamine
3 votes, 1 answer
resample and groupby on big dask array with xarray - using map_blocks?
I have a custom workflow that requires using resample to get to a higher temporal frequency, applying a ufunc, and then groupby + mean to compute the final result.
I would like to apply this to a big xarray dataset, which is backed by a chunked dask…

Val
3 votes, 0 answers
How to View Dask Dashboard in Dask Gateway when using a private IP address/VPC?
We deployed Dask Gateway on Kubernetes on Google Cloud Platform. We are currently using an internal TCP load balancer to expose the traefik proxy for security purposes. Our users are able to create a client connection to the cluster generated…

Riley Hun
3 votes, 0 answers
Dask - how to efficiently execute the right number of tasks
I am trying to mask and then apply a unique operation on one column. A simplified version of the code I am using is shown below:
import numpy as np
import pandas as pd
import dask.dataframe as dd
data = np.random.randint(0,100,(1000,2))
ddf =…

Guido Muscioni
3 votes, 2 answers
dask distributed: How to increase timeout for worker connections? connect() didn't finish in time
OSError: Timed out trying to connect to 'tcp://127.0.0.1:40475' after 10 s: Timed out trying to connect to 'tcp:// 8.56.11:40475' after 10 s: connect() didn't finish in time
Having some huge operations running, I would like to increase the timeout…

gies0r
3 votes, 0 answers
Dask Groupby Multi-index Level
I want to group a Dask multi-index DataFrame by its level. I want to do the equivalent of the following pandas code in Dask:
df.groupby(level=0)['TARGET']\
.apply(lambda x: x.shift().rolling(min_periods=1, window=7).sum()).fillna(0)\
…

Krishnang K Dalal