Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.
Questions tagged [dask-distributed]
1090 questions
3
votes
1 answer
File Not Found Error in Dask program run on cluster
I have 4 machines: M1, M2, M3, and M4. The scheduler, client, and one worker run on M1; the rest of the machines are workers. I've put a CSV file on M1.
When I run the program with read_csv in Dask, it gives me a file-not-found error.

Dhruv Kumar
- 399
- 2
- 13
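A likely cause: dd.read_csv runs on the workers, so the path must be visible on every machine, not just M1. A minimal sketch of two common workarounds, with placeholder addresses and paths (use a shared filesystem, or load locally and hand the data to Dask):

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client("tcp://M1:8786")  # placeholder scheduler address

# Option 1: keep the CSV on a shared filesystem visible to every worker.
# df = dd.read_csv("/shared/data.csv")

# Option 2: read locally on the client (which sits on M1, next to the file)
# and convert the in-memory frame into a Dask dataframe on the cluster.
local_df = pd.read_csv("/home/user/data.csv")  # placeholder path
df = dd.from_pandas(local_df, npartitions=4)
df = df.persist()  # push the partitions out to the workers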
3
votes
1 answer
Dask: set multiprocessing method from Python
Is there a way to set the multiprocessing method from Python? I do not see a method in the Client() API docs of Dask.distributed that indicates how to set this property.
Update:
For example, is there:
client =…

ericmjl
- 13,541
- 12
- 51
- 80
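One hedged possibility, assuming the configuration key distributed.worker.multiprocessing-method exists in your version of distributed (check dask.config.config to confirm): set it via dask.config before starting workers.

import dask
from dask.distributed import Client, LocalCluster

# Assumption: this config key is present in your distributed version.
dask.config.set({"distributed.worker.multiprocessing-method": "spawn"})

# The setting applies to workers started afterwards, e.g. a LocalCluster.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=True)
client = Client(cluster)
print(client)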
3
votes
0 answers
Dask scheduler behavior while reading/retrieving large datasets
This is a follow-up to this question.
I'm experiencing problems with persisting a large dataset in distributed memory. I have a scheduler running on one machine and 8 workers, each running on its own machine, connected by 40-gigabit Ethernet and a…

A.C.
- 53
- 4
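For reference, a minimal sketch of the persist-and-wait pattern the question revolves around (placeholder scheduler address and data path):

import dask.dataframe as dd
from dask.distributed import Client, wait

client = Client("tcp://scheduler:8786")  # placeholder address

df = dd.read_parquet("/shared/large_dataset/")  # placeholder path
df = client.persist(df)   # pin the partitions in distributed memory

wait(df)                  # block until every partition is actually loaded
print(client.scheduler_info()["workers"].keys())  # inspect the workers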
3
votes
1 answer
Dask scatter broadcast a list
What is the appropriate way to scatter/broadcast a list using Dask distributed?
case 1 - wrapping the list:
[future_list] = client.scatter([my_list], broadcast=True)
case 2 - not wrapping the list:
future_list = client.scatter(my_list,…

Thomas Moerman
- 882
- 8
- 16
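A small sketch contrasting the two calls (local cluster and toy list purely for illustration):

from dask.distributed import Client

client = Client()          # local cluster for illustration
my_list = [1, 2, 3, 4]

# Case 1: wrapping the list treats it as ONE object, so a single future
# pointing at the whole list is replicated to every worker.
[future_list] = client.scatter([my_list], broadcast=True)

# Case 2: not wrapping it scatters each element separately, so you get a
# list of futures, one per element.
futures = client.scatter(my_list, broadcast=True)

print(client.gather(future_list))   # [1, 2, 3, 4]
print(client.gather(futures))       # [1, 2, 3, 4], element by element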
3
votes
1 answer
How to map a dask Series with a large dict
I'm trying to figure out the best way to map a dask Series with a large mapping. The straightforward series.map(large_mapping) issues UserWarning: Large object of size MB detected in task graph and suggests using client.scatter and client.submit…

gsakkis
- 1,569
- 1
- 15
- 24
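One hedged pattern, roughly what the warning itself suggests: scatter the dict once and apply it per partition. Data and names here are illustrative, and passing a future into map_partitions assumes the distributed scheduler resolves it, which recent versions do:

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client()

s = dd.from_pandas(pd.Series(range(1000)), npartitions=4)
large_mapping = {i: i * 10 for i in range(1000)}  # illustrative dict

# Send the dict to the workers once instead of embedding it in the graph.
mapping_future = client.scatter(large_mapping, broadcast=True)

# Each partition is a pandas Series, so plain .map applies the dict.
result = s.map_partitions(lambda part, m: part.map(m), mapping_future,
                          meta=("x", "int64"))
print(result.head())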
3
votes
1 answer
Can we create a Dask cluster with both multiple CPU machines and multiple GPU machines?
Can we create a Dask cluster with some CPU and some GPU machines together? If yes, how can we control that a certain task must run only on a CPU machine, or that some other type of task should run only on a GPU machine, and, if not specified, it should pick whichever…

TheCodeCache
- 820
- 1
- 7
- 27
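Yes, mixed clusters are possible; the usual mechanism is worker resources. A sketch assuming the GPU workers were started with a resource tag (e.g. dask-worker scheduler:8786 --resources "GPU=1") and a placeholder scheduler address:

from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder address

def train_on_gpu(data):
    ...  # GPU-only work

def preprocess(data):
    ...  # CPU work

# Restrict this task to workers that advertise a GPU resource.
gpu_future = client.submit(train_on_gpu, "data", resources={"GPU": 1})

# No constraint: the scheduler places it on any available worker.
cpu_future = client.submit(preprocess, "data")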
3
votes
1 answer
How to put a dataset on a gcloud kubernetes cluster?
I have a gcloud Kubernetes cluster initialized, and I'm using a Dask Client on my local machine to connect to the cluster, but I can't seem to find any documentation on how to upload my dataset to the cluster.
I originally tried to just run Dask…

Brendan Martin
- 561
- 6
- 17
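One common approach, sketched under the assumption that the data can be copied to a GCS bucket (hypothetical name) and that gcsfs is installed in the worker images, is to let the workers read the data directly rather than pushing it through the local client:

import dask.dataframe as dd
from dask.distributed import Client

# Placeholder address from the Kubernetes service / port-forward.
client = Client("tcp://<scheduler-ip>:8786")

# Workers read straight from GCS, so nothing is uploaded from the laptop.
df = dd.read_csv("gcs://my-bucket/data/*.csv")  # hypothetical bucket
df = df.persist()
print(df.head())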
3
votes
1 answer
Directly running a task on a dedicated dask worker
A simple code snippet is as follows; the comments preceded by ### are important.
from dask.distributed import Client
### this code-piece will get executed on a dask worker.
def task_to_perform():
    print("task in progress.")
    ## do something…

TheCodeCache
- 820
- 1
- 7
- 27
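For a dedicated worker, client.submit accepts a workers= constraint; a minimal sketch with placeholder addresses:

from dask.distributed import Client

def task_to_perform():
    print("task in progress.")
    return "done"

client = Client("tcp://scheduler:8786")  # placeholder address

# Pin the task to one specific worker; allow_other_workers=False forbids
# the scheduler from moving it elsewhere.
future = client.submit(task_to_perform,
                       workers=["tcp://10.0.0.5:45831"],   # placeholder
                       allow_other_workers=False)
print(future.result())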
3
votes
0 answers
How to combine dask and classes?
I am trying to rewrite an entire project that has been developed with classes. Little by little, the heaviest computational chunks should be parallelized; clearly we have a lot of independent sequential loops. An example with classes that mimics…

Sergio Lucero
- 862
- 1
- 12
- 21
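One hedged pattern for class-heavy code: keep the classes, but have each task construct and run an instance so only small arguments travel through the graph (the class below is purely illustrative):

from dask.distributed import Client

class Simulation:                      # stand-in for the project's classes
    def __init__(self, parameter):
        self.parameter = parameter

    def run(self):
        return self.parameter ** 2

def run_simulation(parameter):
    # Build the object inside the task so the instance never has to be
    # serialized into the task graph.
    return Simulation(parameter).run()

client = Client()
futures = client.map(run_simulation, range(10))   # the independent loop
print(client.gather(futures))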
3
votes
1 answer
Iterate sequentially over a dask bag
I need to submit the elements of a very large dask.bag to a non-threadsafe store, i.e. I need something like
for x in dbag:
    store.add(x)
I cannot use compute since the bag is too large to fit in memory.
I need something more like…

Daniel Mahler
- 7,653
- 5
- 51
- 90
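A sketch of one way to do this, computing a single partition at a time via to_delayed so only one partition needs to fit in memory (the bag and store are illustrative stand-ins):

import dask.bag as db
from dask.distributed import Client

client = Client()

bag = db.from_sequence(range(10_000), npartitions=100)  # illustrative bag

class Store:                 # stand-in for the non-threadsafe store
    def add(self, x):
        pass

store = Store()

# Materialize one partition at a time and feed it to the store sequentially.
for partition in bag.to_delayed():
    for x in partition.compute():
        store.add(x)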
3
votes
1 answer
With dask-distributed, how to generate futures from long-running tasks fed by queues
I'm using a dask-distributed long-running task along the lines of this example http://matthewrocklin.com/blog/work/2017/02/11/dask-tensorflow where a long-running worker task gets its inputs from a queue, as in the tensorflow example, and delivers…

Bruce Church
- 33
- 4
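A rough sketch of the queue-fed pattern from that post, using dask.distributed.Queue; this assumes queues referenced by name inside a task connect to the same scheduler-managed queue, and all names are illustrative:

from dask.distributed import Client, Queue

client = Client()
input_q = Queue("inputs")
output_q = Queue("outputs")

def long_running_worker(input_name, output_name):
    # Runs on a worker: pull items until a sentinel arrives, push results.
    in_q, out_q = Queue(input_name), Queue(output_name)
    while True:
        item = in_q.get()
        if item is None:
            break
        out_q.put(item * 2)

worker_future = client.submit(long_running_worker, "inputs", "outputs")

for i in range(5):
    input_q.put(i)
input_q.put(None)                     # sentinel to stop the loop

print([output_q.get() for _ in range(5)])   # [0, 2, 4, 6, 8]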
3
votes
1 answer
Saving dataframe divisions to parquet with dask
I am currently trying to save information from Dask to parquet files and read it back. But when I save a dataframe with Dask's to_parquet and afterwards load it again with read_parquet, it seems like the division information gets…

lennart
- 33
- 3
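A small round-trip sketch; note that the keyword asking read_parquet to rebuild divisions from the parquet statistics is an assumption that depends on the dask version (calculate_divisions in recent releases, differently named flags in older ones):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"value": range(100)},
                   index=pd.RangeIndex(100, name="idx"))
df = dd.from_pandas(pdf, npartitions=4)
print(df.divisions)                       # known, sorted divisions

df.to_parquet("/tmp/example.parquet", write_index=True)

# Assumption: this keyword exists in your dask version.
df2 = dd.read_parquet("/tmp/example.parquet", calculate_divisions=True)
print(df2.divisions)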
3
votes
2 answers
Automatically adding a dataset to Dask scheduler on startup
TL;DR
I want to pre-load a dataset into the Dask Distributed scheduler when it's starting up.
Background
I'm using Dask in a realtime query fashion with a smaller-than-memory dataset. Because it's realtime, it's important that the workers can trust…

Niklas B
- 1,839
- 18
- 36
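One hedged way to do this is a small script run once after the scheduler starts: load the dataset, persist it, and publish it under a well-known name so the realtime clients can fetch it later (addresses, paths, and the dataset name are placeholders):

# preload_data.py -- hypothetical one-off script run after cluster startup
import dask.dataframe as dd
from dask.distributed import Client

client = Client("tcp://scheduler:8786")              # placeholder address

df = dd.read_parquet("/shared/dataset/").persist()   # placeholder path
client.publish_dataset(querydata=df)                 # well-known name

# Later, in each realtime query process:
#   query_client = Client("tcp://scheduler:8786")
#   df = query_client.get_dataset("querydata")
#   result = df[df.user_id == 123].compute()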
3
votes
1 answer
Lazy repartitioning of dask dataframe
After several stages of lazy dataframe processing, I need to repartition my dataframe before saving it. However, the .repartition() method requires me to know the number of partitions (as opposed to the size of partitions), and that depends on the size of…

evilkonrex
- 255
- 2
- 10
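For reference, newer dask versions let you repartition by target size rather than count; a sketch under that assumption (note that sizing the partitions may itself trigger some computation, so it is not entirely lazy):

import pandas as pd
import dask.dataframe as dd

# Illustrative frame standing in for the result of the lazy pipeline.
df = dd.from_pandas(pd.DataFrame({"x": range(1_000_000)}), npartitions=50)

# Assumption: your dask version supports repartition(partition_size=...).
df = df.repartition(partition_size="100MB")

df.to_parquet("/tmp/output.parquet")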
3
votes
0 answers
Reduce i/o by storing data into a dictionary shared between workers on node using dask.distributed
I am using the dask.distributed scheduler and workers to process some large microscopy images on a cluster. I run multiple workers per node (1 core = 1 worker); the cores in a node share 200 GB of RAM.
Issue
I would like to decrease the writing…

s1mc0d3
- 523
- 2
- 15
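One hedged workaround is per-worker caching via get_worker(): each worker process loads the shared data at most once and reuses it across its tasks. The attribute name and loader are illustrative, and true sharing across separate worker processes on one node would need something like memory-mapped files or a threads-based worker layout:

from dask.distributed import Client, get_worker

def load_reference_data(path):
    # Illustrative loader standing in for reading the large shared data.
    return {"path": path, "data": list(range(1000))}

def process(image_id, path):
    worker = get_worker()
    # Cache on the worker process itself (attribute name is arbitrary).
    if not hasattr(worker, "_ref_cache"):
        worker._ref_cache = load_reference_data(path)
    return image_id, len(worker._ref_cache["data"])

client = Client()
futures = client.map(process, range(20), path="/shared/reference.bin")
print(client.gather(futures))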