Questions tagged [dask-distributed]

Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.

1090 questions
3
votes
1 answer

File Not Found Error in Dask program run on cluster

I have 4 machines: M1, M2, M3, and M4. The scheduler, client, and a worker run on M1, and I've put a CSV file on M1; the rest of the machines are workers. When I run the program with read_csv in dask, it gives me an error: file not found.
Dhruv Kumar
  • 399
  • 2
  • 13
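A likely cause is that dd.read_csv is executed on the workers, so the path must be readable from every machine, not just M1. A minimal sketch of two common workarounds, assuming a hypothetical scheduler address and file paths:

```python
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client("M1:8786")  # hypothetical scheduler address

# Option 1: put the CSV somewhere every worker can reach
# (an NFS mount, S3/GCS bucket, ...) and read it from there.
df = dd.read_csv("/shared/data.csv")  # hypothetical shared path

# Option 2: read the file locally on the client and hand the data
# to the cluster, so the workers never need the local path.
local_df = pd.read_csv("data.csv")          # exists only on M1
ddf = dd.from_pandas(local_df, npartitions=8)
ddf = client.persist(ddf)
```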
3
votes
1 answer

Dask: set multiprocessing method from Python

Is there a way to set the multiprocessing method from Python? I do not see a method in the Client() API docs of Dask.distributed that indicates how to set this property. Update: For example, is there: client =…
ericmjl
  • 13,541
  • 12
  • 51
  • 80
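There is no Client() argument for this, but distributed exposes a configuration key for the worker start method; a hedged sketch, assuming the distributed.worker.multiprocessing-method key is available in your version:

```python
import dask
from dask.distributed import Client, LocalCluster

# Assumption: this config key controls how nanny/worker processes are
# started ("spawn", "fork" or "forkserver"); set it before the cluster
# is created.
dask.config.set({"distributed.worker.multiprocessing-method": "forkserver"})

cluster = LocalCluster(n_workers=4, processes=True)
client = Client(cluster)
print(client)
```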
3
votes
0 answers

Dask scheduler behavior while reading/retrieving large datasets

This is a follow-up to this question. I'm experiencing problems with persisting a large dataset in distributed memory. I have a scheduler running on one machine and 8 workers each running on their own machines connected by 40 gigabit ethernet and a…
A.C.
  • 53
  • 4
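For context, the usual way to pin a large dataset in distributed memory is persist followed by wait; a minimal sketch, with a hypothetical scheduler address and data path:

```python
import dask.dataframe as dd
from dask.distributed import Client, wait

client = Client("scheduler:8786")          # hypothetical address

# Describe the dataset lazily, then ask the scheduler to materialize
# it across the workers' memory and block until that has finished.
df = dd.read_csv("/shared/large-*.csv")    # hypothetical path
df = client.persist(df)
wait(df)

# Inspect how the partitions were spread over the workers.
print(client.who_has())
```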
3
votes
1 answer

Dask scatter broadcast a list

What is the appropriate way to scatter broadcast a list using Dask distributed? case 1 - wrapping the list: [future_list] = client.scatter([my_list], broadcast=True) case 2 - not wrapping the list: future_list = client.scatter(my_list,…
Thomas Moerman
  • 882
  • 8
  • 16
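The two calls do different things: wrapping the list ships it as a single broadcast object, while scattering the bare list turns each element into its own future. A small sketch, using a local cluster for illustration:

```python
from dask.distributed import Client

client = Client()            # local cluster just for illustration
my_list = [1, 2, 3, 4]

# Case 1: the wrapped list is one object; every worker gets a replica
# and we get back a single future for the whole list.
[future_list] = client.scatter([my_list], broadcast=True)

# Case 2: each element is scattered separately, so this returns a
# list of futures, one per element.
element_futures = client.scatter(my_list, broadcast=True)

print(client.submit(len, future_list).result())   # 4
print(len(element_futures))                       # 4
```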
3
votes
1 answer

How to map a dask Series with a large dict

I'm trying to figure out the best way to map a dask Series with a large mapping. The straightforward series.map(large_mapping) issues UserWarning: Large object of size MB detected in task graph and suggests using client.scatter and client.submit…
gsakkis
  • 1,569
  • 1
  • 15
  • 24
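One way to avoid the warning is to scatter the mapping once and pass the resulting future into map_partitions; a sketch, assuming the distributed scheduler resolves futures passed as task arguments (names are made up):

```python
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client()

s = dd.from_pandas(pd.Series(range(1000), name="key"), npartitions=4)
large_mapping = {i: f"value-{i}" for i in range(1000)}   # stand-in for the big dict

# Ship the mapping to the workers once instead of embedding a copy
# of it in every task of the graph.
mapping_future = client.scatter(large_mapping, broadcast=True)

def map_with(partition, mapping):
    return partition.map(mapping)

result = s.map_partitions(map_with, mapping_future, meta=(s.name, "object"))
print(result.head())
```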
3
votes
1 answer

Can we create a Dask cluster having both multiple CPU machines and multiple GPU machines?

Can we create a dask cluster with some CPU and some GPU machines together? If yes, how do we control that a certain task must run only on a CPU machine, or that some other type of task runs only on a GPU machine, and if not specified, it should pick whichever…
TheCodeCache
  • 820
  • 1
  • 7
  • 27
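Mixed clusters are typically handled with worker resources: the GPU machines advertise a GPU resource when they start, and tasks declare what they need. A sketch, assuming a scheduler is already running at a hypothetical address:

```python
from dask.distributed import Client

# GPU machines would be started with a resource tag, e.g.:
#   dask-worker scheduler:8786 --resources "GPU=1"
# CPU-only machines are started without it.

client = Client("scheduler:8786")        # hypothetical address

def train_on_gpu(x):
    return x                             # placeholder for GPU work

def preprocess(x):
    return x                             # placeholder for CPU work

# Only runs on workers that advertised a GPU resource.
gpu_future = client.submit(train_on_gpu, 1, resources={"GPU": 1})

# No constraint: the scheduler picks any free worker.
cpu_future = client.submit(preprocess, 2)

print(gpu_future.result(), cpu_future.result())
```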
3
votes
1 answer

How to put a dataset on a gcloud kubernetes cluster?

I have a gcloud Kubernetes cluster initialized, and I'm using a Dask Client on my local machine to connect to the cluster, but I can't seem to find any documentation on how to upload my dataset to the cluster. I originally tried to just run Dask…
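Rather than pushing data through the local client, the usual approach is to put the dataset in object storage that the Kubernetes workers can read directly (e.g. GCS via gcsfs); a sketch with a hypothetical bucket:

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client("scheduler:8786")            # hypothetical address

# The workers pull the data straight from Google Cloud Storage,
# so nothing has to be uploaded from the local machine.
df = dd.read_csv(
    "gcs://my-bucket/data/*.csv",            # hypothetical bucket/path
    storage_options={"token": "cloud"},      # assumption: use node credentials
)
df = client.persist(df)
print(df.head())
```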
3
votes
1 answer

Directly running a task on a dedicated dask worker

A simple code snippet is as follows (the comments marked with ### are important): from dask.distributed import Client ### this code piece will get executed on a dask worker. def task_to_perform(): print("task in progress.") ## do something…
TheCodeCache
  • 820
  • 1
  • 7
  • 27
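Pinning a task to a specific worker is done with the workers= argument of Client.submit; a minimal sketch reusing the snippet's task_to_perform and a hypothetical worker address:

```python
from dask.distributed import Client

client = Client("scheduler:8786")            # hypothetical address

def task_to_perform():
    print("task in progress.")
    return "done"

# Run only on the named worker; allow_other_workers=False makes the
# restriction strict rather than just a preference.
future = client.submit(
    task_to_perform,
    workers=["tcp://10.0.0.5:39337"],        # hypothetical worker address
    allow_other_workers=False,
)
print(future.result())
```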
3
votes
0 answers

how to combine dask and classes?

I am trying to rewrite an entire project that has been developed with classes. Little by little, the heaviest computational chunks should be parallelized; clearly we have a lot of independent sequential loops. An example with classes that mimics…
Sergio Lucero
  • 862
  • 1
  • 12
  • 21
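One low-intrusion pattern is to keep the classes unchanged and submit the independent method calls as tasks; a sketch with a made-up Simulation class:

```python
from dask.distributed import Client

class Simulation:
    """Stand-in for one of the project's existing classes."""
    def __init__(self, seed):
        self.seed = seed

    def run(self):
        return self.seed ** 2        # placeholder for heavy, independent work

client = Client()

# Each loop iteration is independent, so submit the method calls
# instead of running them sequentially.
sims = [Simulation(i) for i in range(10)]
futures = [client.submit(Simulation.run, sim) for sim in sims]
print(client.gather(futures))
```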
3
votes
1 answer

Iterate sequentially over a dask bag

I need to submit the elements of a very large dask.bag to a non-threadsafe store, i.e. I need something like for x in dbag: store.add(x) I cannot use compute since the bag is too large to fit in memory. I need something more like…
Daniel Mahler
  • 7,653
  • 5
  • 51
  • 90
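A common workaround is to turn the bag into per-partition delayed objects and compute them one at a time, so only a single partition is ever in memory; a sketch with a hypothetical store:

```python
import dask.bag as db
from dask.distributed import Client

class Store:                         # hypothetical non-threadsafe store
    def add(self, x):
        pass

client = Client()
dbag = db.from_sequence(range(1_000_000), npartitions=100)
store = Store()

# Compute one partition at a time and feed its elements to the store
# sequentially, so the whole bag never sits in memory at once.
for partition in dbag.to_delayed():
    for x in partition.compute():
        store.add(x)
```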
3
votes
1 answer

With dask-distributed, how to generate futures from long-running tasks fed by queues

I'm using a dask-distributed long-running task along the lines of this example http://matthewrocklin.com/blog/work/2017/02/11/dask-tensorflow where a long-running worker task gets its inputs from a queue as in the tensorflow example and delivers…
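The pattern from that post can be reduced to a long-running task that pulls work from a distributed Queue and pushes results to another one; a rough sketch (not the post's full TensorFlow setup):

```python
from dask.distributed import Client, Queue

client = Client()
inputs = Queue("inputs")
outputs = Queue("outputs")

def long_running_task(in_q, out_q):
    # Keep consuming items until a sentinel arrives.
    while True:
        item = in_q.get()
        if item is None:
            break
        out_q.put(item * 2)          # stand-in for the real work

# The task occupies a worker for its whole lifetime.
runner = client.submit(long_running_task, inputs, outputs)

for i in range(5):
    inputs.put(i)
inputs.put(None)                     # ask the task to stop

print([outputs.get() for _ in range(5)])
runner.result()                      # surface any error from the task
```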
3
votes
1 answer

Saving dataframe divisions to parquet with dask

I am currently trying to save and read information from dask to parquet files. But when saving a dataframe with dask "to_parquet" and loading it again afterwards with "read_parquet", it seems like the division information gets…
lennart
  • 33
  • 3
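Divisions come from the index, so they only survive the round trip if the index statistics are read back. A hedged sketch, assuming a recent dask where read_parquet accepts calculate_divisions:

```python
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(
    pd.DataFrame({"ts": pd.date_range("2023-01-01", periods=100, freq="h"),
                  "value": range(100)}),
    npartitions=4,
).set_index("ts")

df.to_parquet("out.parquet")               # the index and its statistics are written

# Assumption: calculate_divisions=True rebuilds known divisions from the
# parquet row-group statistics on the index column.
df2 = dd.read_parquet("out.parquet", calculate_divisions=True)
print(df2.known_divisions)                 # True when divisions were recovered
```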
3
votes
2 answers

Automatically adding a dataset to Dask scheduler on startup

TL;DR: I want to pre-load a dataset into the Dask Distributed scheduler when it's starting up. Background: I'm using Dask in a real-time query fashion with a smaller-than-memory dataset. Because it's real-time, it's important that the workers can trust…
Niklas B
  • 1,839
  • 18
  • 36
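One way to make a dataset available cluster-wide is to persist it once and register it under a name with publish_dataset, so later clients can fetch it without reloading; a sketch (the one-off step could live in a startup script):

```python
import dask.dataframe as dd
from dask.distributed import Client

# --- run once, e.g. right after the cluster starts ---
client = Client("scheduler:8786")             # hypothetical address
df = dd.read_parquet("/shared/dataset")       # hypothetical path
df = client.persist(df)
client.publish_dataset(main_data=df)          # register under a well-known name

# --- any later client can now reuse the in-memory dataset ---
query_client = Client("scheduler:8786")
df = query_client.get_dataset("main_data")
print(df.head())
```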
3
votes
1 answer

Lazy repartitioning of dask dataframe

After several stages of lazy dataframe processing, I need to repartition my dataframe before saving it. However, the .repartition() method requires me to know the number of partitions (as opposed to the size of partitions), and that depends on the size of…
evilkonrex
  • 255
  • 2
  • 10
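For reference, repartition can also be given a target partition size instead of a partition count, though evaluating the sizes may trigger some computation; a hedged sketch:

```python
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({"x": range(1_000_000)}), npartitions=100)

# ... lazy processing stages would go here ...

# Assumption: partition_size lets dask choose the number of partitions
# for a target size, at the cost of not being entirely lazy.
df = df.repartition(partition_size="100MB")
df.to_parquet("out.parquet")
```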
3
votes
0 answers

Reduce I/O by storing data into a dictionary shared between workers on a node using dask.distributed

I am using the dask.distributed scheduler and workers to process some large microscopy images on a cluster. I run multiple workers per node (1 core = 1 worker). Each core in the node shares 200 GB of RAM. Issue: I would like to decrease the writing…
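A common trick for reusing data across tasks that land on the same worker is to cache it on the worker object via get_worker; note this is per worker process, so sharing across separate worker processes on one node would still need shared memory or a threads-based worker layout. A sketch with hypothetical names:

```python
from dask.distributed import Client, get_worker

def load_reference_image(path):
    return path.upper()                 # stand-in for an expensive read from disk

def process_tile(tile_id, ref_path):
    worker = get_worker()
    # Cache the expensive object on this worker process so later tasks
    # running here reuse it instead of re-reading it from disk.
    cache = getattr(worker, "_ref_cache", None)
    if cache is None:
        cache = {}
        worker._ref_cache = cache
    if ref_path not in cache:
        cache[ref_path] = load_reference_image(ref_path)
    return tile_id, cache[ref_path]

client = Client()
futures = [client.submit(process_tile, i, "flatfield.tif") for i in range(4)]
print(client.gather(futures))
```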