Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.
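
A minimal sketch of the futures-style API it adds, assuming only a local machine:

from dask.distributed import Client

client = Client()                        # starts a local scheduler and workers
future = client.submit(sum, [1, 2, 3])   # schedule a single task on the cluster
print(future.result())                   # 6
client.close()
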
Questions tagged [dask-distributed]
1090 questions
5 votes · 1 answer
Limit Dask CPU and Memory Usage (Single Node)
I am running Dask on a single computer, where calling .compute() to perform the computations on a huge Parquet file causes Dask to use up all the CPU cores on the system.
import dask.dataframe as dd
df = dd.read_parquet(parquet_file) # very large…
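
One common way to bound this (a sketch, not taken from the question) is to start an explicit LocalCluster with capped workers, threads, and per-worker memory:

from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

# hypothetical limits: 2 worker processes, 1 thread each, 4 GB per worker
cluster = LocalCluster(n_workers=2, threads_per_worker=1, memory_limit='4GB')
client = Client(cluster)

df = dd.read_parquet(parquet_file)   # parquet_file as in the question
result = df.compute()                # now constrained by the cluster limits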

Nyxynyx · 61,411 · 155 · 482 · 830

5 votes · 1 answer
Memory clean up of Dask workers
I am running multiple parallel tasks on a multi-node distributed Dask cluster. However, once the tasks are finished, the workers still hold a lot of memory and the cluster soon fills up.
I have tried client.restart() after every task and…
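
Two things that often help here, sketched under the assumption that results are held as futures: drop the references so the scheduler can release the data, and ask every worker to run garbage collection (process, inputs and the scheduler address are placeholders):

import gc
from dask.distributed import Client

client = Client('tcp://scheduler:8786')    # hypothetical scheduler address

futures = client.map(process, inputs)      # placeholder task batch
results = client.gather(futures)

del futures                # releasing the last references lets workers free the data
client.run(gc.collect)     # force a collection pass on every worker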

spiralarchitect · 880 · 7 · 19

5 votes · 3 answers
Forcing Locality on Dask Dataframe Subsets
I'm trying to distribute a large Dask Dataframe across multiple machines for (later) distributed computations on the dataframe. I'm using dask-distributed for this.
All the dask-distributed examples/docs I see are populating the initial data load…
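
A sketch of pinning data and follow-up work to specific workers with the workers= keyword (the scheduler and worker addresses, partition_df and process_partition are placeholders):

from dask.distributed import Client

client = Client('tcp://scheduler:8786')

# place one subset on a chosen worker and keep it there
[subset] = client.scatter([partition_df], workers=['tcp://10.0.0.5:40123'])

# run the later computation on the same worker that holds the data
result = client.submit(process_partition, subset, workers=['tcp://10.0.0.5:40123'])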

CoderOfTheNight · 944 · 2 · 8 · 21

5 votes · 1 answer
Trigger Dask workers to release memory
I'm distributing the computation of some functions using Dask. My general layout looks like this:
from dask.distributed import Client, LocalCluster, as_completed
cluster = LocalCluster(processes=config.use_dask_local_processes,
…
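
A sketch of explicitly releasing results once they have been consumed, assuming they are tracked as futures (my_function, work_items and handle are placeholders; Client, cluster and as_completed come from the snippet above):

client = Client(cluster)
futures = client.map(my_function, work_items)
for future in as_completed(futures):
    handle(future.result())    # consume each result as it arrives

client.cancel(futures)   # drop the scheduler's references to the results
del futures              # drop the local references so workers can free memory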

gallamine · 865 · 2 · 12 · 26

5 votes · 1 answer
Many distributed dask workers idle after one evaluation, or never receive any work, when there are more tasks
We’re using Dask to optimize deep-learner (DL) architectures by generating designs and then sending them to Dask workers that, in turn, use PyTorch for training. We observe that some of the workers do not appear to start, and those that do complete…
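
One frequent cause of this pattern (an assumption here, not confirmed by the question) is that identical submissions are deduplicated into a single task key, so only one worker ever runs them; pure=False gives every call its own key (train_model and designs are placeholders):

# each training run gets a distinct task even if the arguments look identical
futures = [client.submit(train_model, design, pure=False) for design in designs]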

Mark Coletti · 195 · 11

5 votes · 1 answer
How to use client.scatter correctly and when in Dask
When executing a "large" number of tasks I am receiving this error:
Consider scattering large objects ahead of time with client.scatter to
reduce scheduler burden and keep data on workers
And I also am getting a bunch of messages like…
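
A minimal sketch of the pattern the warning is asking for: scatter the large object once and pass the resulting future into each task, instead of embedding the object in every task graph (big_object, process and items are placeholders):

big_future = client.scatter(big_object, broadcast=True)

futures = [client.submit(process, big_future, item) for item in items]
results = client.gather(futures)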

muammar · 951 · 2 · 13 · 32

5 votes · 2 answers
Dask Distributed: Reading .csv from HDFS
I'm performance testing Dask using "Distributed Pandas on a Cluster with Dask DataFrames" as a guide.
In Matthew's example, he has a 20GB file and 64 workers (8 physical nodes).
In my case, I have an 82GB file and 288 workers (12 physical nodes;…
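
For reference, a sketch of reading straight from HDFS (the path, scheduler address and grouping column are placeholders; an HDFS driver such as pyarrow must be installed on the workers):

import dask.dataframe as dd
from dask.distributed import Client

client = Client('tcp://scheduler:8786')
df = dd.read_csv('hdfs:///data/some-files-*.csv', blocksize=64e6)
print(df.groupby('key').size().compute())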

jonathan · 91 · 7

5 votes · 1 answer
Dask Client can't connect to dask-scheduler
I'm on dask 1.1.1 (latest version) and I have started a dask scheduler at the commandline with this command:
$ dask-scheduler --port 9796 --bokeh-port 9797 --bokeh-prefix my_project
distributed.scheduler - INFO -…
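
A sketch of pointing the client at the non-default port chosen above (the host name is a placeholder):

from dask.distributed import Client

# the scheduler above listens on 9796 rather than the default 8786
client = Client('tcp://scheduler-host:9796')
print(client)   # should report the connected scheduler and its workers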

MetaStack · 3,266 · 4 · 30 · 67

5 votes · 1 answer
Why does Dask fill in "foo" and 1 in my Dataframe
I've read in around 15 csv files:
df = dd.read_csv("gs://project/*.csv", blocksize=25e6,
                 storage_options={'token': fs.session.credentials})
Then I persisted the Dataframe (it uses 7.33 GB of memory):
df = df.persist()
I set a new…
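
The "foo" and 1 values are Dask's metadata placeholders ("foo" for object columns, 1 for integer columns), which show up when Dask has to guess the schema. A sketch of making the metadata explicit instead (the column names, dtypes and my_transform are hypothetical):

import dask.dataframe as dd

df = dd.read_csv("gs://project/*.csv", blocksize=25e6,
                 dtype={'user_id': 'int64', 'name': 'object'})

# when applying a custom function, describe its output explicitly
df2 = df.map_partitions(my_transform, meta={'user_id': 'int64', 'score': 'float64'})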

Stanko · 4,275 · 3 · 23 · 51

5 votes · 3 answers
Loading data from S3 to dask dataframe
I can load the data only if I change the "anon" parameter to True after making the file public.
df = dd.read_csv('s3://mybucket/some-big.csv', storage_options = {'anon':False})
This is not recommended for obvious reasons. How do I load the data…
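
A sketch of passing credentials explicitly instead of making the bucket public (the key values are placeholders for your own credentials; with no storage_options at all, s3fs falls back to the usual AWS credential chain):

import dask.dataframe as dd

df = dd.read_csv('s3://mybucket/some-big.csv',
                 storage_options={'key': '<AWS_ACCESS_KEY_ID>',
                                  'secret': '<AWS_SECRET_ACCESS_KEY>'})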

shantanuo · 31,689 · 78 · 245 · 403

5 votes · 1 answer
Dask compute is very slow
I have a dataframe that consists of 5 million records. I am trying to process it using the code below, leveraging Dask dataframes in Python:
import dask.dataframe as dd
dask_df = dd.read_csv(fullPath)
…
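
A sketch of two settings that often matter on a single machine (an assumption about the cause, not a diagnosis): run the work through a local distributed Client so partitions are processed in parallel, and persist the frame if it is reused:

import dask.dataframe as dd
from dask.distributed import Client

client = Client()                                 # local workers + diagnostic dashboard
dask_df = dd.read_csv(fullPath, blocksize=64e6)   # fullPath as in the question
dask_df = dask_df.persist()                       # materialize once, reuse cheaply
print(len(dask_df))                               # row count, computed in parallel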

Neno M. · 123 · 1 · 6

5 votes · 1 answer
memory usage when indexing a large dask dataframe on a single multicore machine
I am trying to turn the Wikipedia CirrusSearch dump into a Parquet-backed Dask dataframe indexed by title on a 450G 16-core GCP instance.
CirrusSearch dumps come as a single JSON-lines formatted file.
The English Wikipedia dumps contain 5M records and…
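
A sketch of one possible pipeline, assuming the dump has been decompressed to a JSON-lines file (the paths and blocksize are placeholders; the 'title' column follows the question):

import dask.dataframe as dd

df = dd.read_json('cirrussearch-content.json', lines=True, blocksize=2**28)
df = df.set_index('title')            # the shuffle here is the memory-hungry step
df.to_parquet('enwiki-parquet/')      # one Parquet file per partition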

Daniel Mahler · 7,653 · 5 · 51 · 90

5 votes · 2 answers
How to replicate data when it is faster to compute than transfer in dask distributed?
I have a largish object (150 MB) that I need to broadcast to all dask distributed workers so it can be used in future tasks. I've tried a couple of approaches:
Client.scatter(broadcast=True): This required sending all the data from one machine…
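
A sketch of a third option: build the object once on a worker and let Client.replicate copy it peer-to-peer to the rest of the cluster (make_large_object, use_object and the scheduler address are placeholders):

from dask.distributed import Client

client = Client('tcp://scheduler:8786')

# construct the object on one worker instead of uploading it from the client
obj = client.submit(make_large_object, pure=False)
client.replicate([obj])                      # copy it to every worker, worker to worker

futures = client.map(use_object, range(100), big=obj)   # tasks reuse the local copy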

Stan Seibert · 53 · 3

5 votes · 0 answers
Dask groupby with multiple columns issue
I have the following dataframe, created using the dataframe.from_delayed method, which has the following columns:
_id, hour_timestamp, http_method, total_hits, username, hour, weekday.
Some details on the source…
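
For context, a sketch of a multi-column groupby over the columns listed above (the particular aggregation is an assumption):

# aggregate hits per user and HTTP method; column names are from the question
hits = (df.groupby(['username', 'http_method'])
          .total_hits.sum()
          .compute())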

Apostolos · 7,763 · 17 · 80 · 150

5 votes · 0 answers
Dask assigning to dataframe column by index throws ValueError
I have a pipeline of transformations on a grouped-by dataframe. All functions get a DataFrameGroupBy and compute some features. Those features are then stored in a Dataframe. The index of the dataframe is the same since all features are derived by…
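
One common workaround, sketched under the assumption that each feature table is itself a Dask dataframe sharing the same index: join the pieces on the index instead of assigning columns positionally (features_a and features_b are placeholders):

import dask.dataframe as dd

combined = dd.merge(features_a, features_b,
                    left_index=True, right_index=True, how='outer')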

Apostolos · 7,763 · 17 · 80 · 150