Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.
Questions tagged [dask-distributed]
1090 questions
3
votes
1 answer
Using Dask from script
Is it possible to run dask from a Python script?
In an interactive session I can just write
from dask.distributed import Client
client = Client()
as described in all the tutorials. If, however, I put these lines in a script.py file and execute it with python…

DerWeh
- 1,721
- 1
- 15
- 26
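The usual fix here is to guard cluster startup with if __name__ == '__main__':, because Client() spawns worker processes that re-import the script. A minimal sketch:

from dask.distributed import Client

def main():
    client = Client()  # starts a local cluster of worker processes
    total = client.submit(sum, [1, 2, 3]).result()
    print(total)
    client.close()

if __name__ == '__main__':
    # Without this guard, spawned worker processes re-execute the
    # module top level and try to start clusters of their own.
    main()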
3
votes
1 answer
Dask - Quickest way to get row length of each partition in a Dask dataframe
I'd like to get the length of each partition in a number of dataframes. At present I fetch each partition and then take the size of its index, which is very, very slow. Is there a better way?
Here's a simplified snippet of…

dan
- 183
- 13
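For reference, the idiomatic answer is usually map_partitions(len), which counts rows in one task per partition instead of pulling partitions back to the client. A sketch with toy data:

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'x': range(1000)}), npartitions=8)

# One task per partition, each returning that partition's row count.
lengths = df.map_partitions(len).compute()
print(lengths.tolist())  # e.g. [125, 125, ...]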
3
votes
1 answer
reading a Dask DataFrame from CSVs in a deep S3 path hierarchy
I am trying to read a set of CSVs from S3 into a Dask DataFrame.
The bucket has a deep hierarchy and also contains some metadata files.
The call looks like
dd.read_csv('s3://mybucket/dataset/*/*/*/*/*/*.csv')
This causes Dask to hang. The real…

Daniel Mahler
- 7,653
- 5
- 51
- 90
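One workaround for slow or hanging globs is to expand the pattern yourself with s3fs and hand read_csv an explicit list, skipping the metadata files. A sketch, with the bucket layout assumed from the question:

import s3fs
import dask.dataframe as dd

fs = s3fs.S3FileSystem()
# Expand the glob once, client-side, keeping only the CSV keys.
keys = fs.glob('mybucket/dataset/*/*/*/*/*/*.csv')
df = dd.read_csv(['s3://' + key for key in keys])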
3
votes
1 answer
Dask Distributed - sharing persisted data across multiple clients
We are trying Dask Distributed for some heavy computation and visualization for a frontend.
At the moment we have one gunicorn worker that connects to an existing Dask Distributed cluster; it currently uploads the data with read_csv and persist…

CValenzu
- 31
- 2
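The usual pattern for sharing one persisted collection across processes is published datasets: one client persists and publishes under a name, and every other client attached to the same scheduler looks it up. A sketch with placeholder addresses and paths:

from dask.distributed import Client
import dask.dataframe as dd

# In the gunicorn worker that loads the data:
client = Client('scheduler-host:8786')          # placeholder address
df = dd.read_csv('s3://bucket/data-*.csv').persist()
client.publish_dataset(frontend_data=df)        # the name is arbitrary

# In any other client on the same cluster:
other = Client('scheduler-host:8786')
df_shared = other.get_dataset('frontend_data')  # same persisted data, no reload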
3
votes
1 answer
How can I get result of Dask compute on a different machine than the one that submitted it?
I am using Dask behind a Django server and the basic setup I have is summarised here: https://github.com/MoonVision/django-dask-demo/ where the Dask client can be found here:…

Matt Nicolls
- 173
- 1
- 7
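One way to hand a result between machines is a named distributed Variable holding the future; any client on the same scheduler can collect it later. A sketch, with placeholder addresses:

from dask.distributed import Client, Variable

# Machine A submits work and parks the future under a well-known name.
client_a = Client('scheduler-host:8786')
future = client_a.submit(pow, 2, 10)
Variable('job-result', client=client_a).set(future)

# Machine B, possibly much later, picks the future back up.
client_b = Client('scheduler-host:8786')
result = Variable('job-result', client=client_b).get().result()  # 1024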
3
votes
1 answer
Actors and dask-workers
from dask.distributed import Client

client = Client('127.0.0.1:8786', direct_to_workers=True)
future1 = client.submit(Counter, workers='ninja', actor=True)  # Counter is an actor class
counter1 = future1.result()  # the actor handle
print(counter1)
All is well but what if the client gets restarted? How do I…

chak
- 31
- 2
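I have not verified this against actors specifically, but the general way to keep a future alive across client restarts is to publish it, since the scheduler then holds the reference rather than the client. A sketch:

from dask.distributed import Client

client = Client('127.0.0.1:8786', direct_to_workers=True)
future1 = client.submit(Counter, actor=True)   # Counter as in the question
client.publish_dataset(counter=future1)        # scheduler now keeps it alive

# After a restart, a fresh client can reclaim the handle:
client2 = Client('127.0.0.1:8786', direct_to_workers=True)
counter1 = client2.get_dataset('counter').result()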
3
votes
1 answer
How do I share a large read-only object across Dask distributed workers
The Problem
I'm trying to send a 2 GB read-only CPython object (it can be pickled) to dask distributed workers via apply(). This ends up consuming a lot of memory in the processes/threads (14+ GB).
Is there a way to load the object only once into memory…

Hyperspace
- 65
- 1
- 8
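The standard answer is Client.scatter with broadcast=True: the object is serialized once, copied to each worker, and tasks receive it by reference instead of re-pickling 2 GB per task. A sketch, where load_big_object and do_something are hypothetical stand-ins:

from dask.distributed import Client

client = Client('scheduler-host:8786')            # placeholder address

big = load_big_object()                           # hypothetical ~2 GB object
big_future = client.scatter(big, broadcast=True)  # one copy per worker

def work(item, shared):
    return do_something(item, shared)             # hypothetical per-item work

futures = client.map(work, range(1000), shared=big_future)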
3
votes
0 answers
Worker crashes during simple aggregation
I am trying to aggregate various columns on a 450 million row data set. When I use Dask's built-in aggregations like 'min', 'max', 'std', and 'mean', they keep crashing a worker in the process.
The file that I am using can be found here:…

DannyK
- 103
- 2
- 10
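Without the file it is hard to be sure, but worker crashes on simple aggregations are usually memory pressure. Two levers worth trying: smaller partitions at read time, and computing all the aggregates in one shared pass. A sketch with assumed file and column names:

import dask
import dask.dataframe as dd

# Smaller blocks mean smaller per-task memory footprints.
df = dd.read_csv('rows-*.csv', blocksize='64MB')   # assumed filenames

# One pass over the data for all four statistics.
lo, hi, mean, std = dask.compute(
    df['value'].min(), df['value'].max(),
    df['value'].mean(), df['value'].std(),
)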
3
votes
1 answer
Scheduler closing stream warning
I have a periodic batch job running on my laptop. The code looks like this:
import datetime
from dask.distributed import Client

client = Client()
print(client.scheduler_info())
topic = 'raw_data'
start = datetime.datetime.now()
delta = datetime.timedelta(minutes=2)
while True:
    end = start + delta
    …

Apostolos
- 7,763
- 17
- 80
- 150
3
votes
0 answers
split bigquery dataframe into chunks using dask
I have searched for and tested different ways to split a BigQuery dataframe into chunks of 75 rows, but couldn't find one. Here is the scenario:
I got a very large BigQuery dataframe (millions of rows) using Python and GCP…

MT467
- 668
- 2
- 15
- 31
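Assuming the BigQuery result is first materialized as a pandas DataFrame, dd.from_pandas with chunksize gives fixed-size partitions directly. A sketch:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'x': range(1000)})   # stand-in for the BigQuery result

# chunksize is rows per partition, so each partition holds at most 75 rows.
df = dd.from_pandas(pdf, chunksize=75)

for delayed_chunk in df.to_delayed():
    chunk = delayed_chunk.compute()      # a pandas DataFrame of <= 75 rows
    ...                                  # process the chunk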
3
votes
0 answers
Writing Dask/XArray to NetCDF - Parallel IO
I am using Dask/Xarray with a ~150 GB dataset on a distributed cluster on an HPC system. The computation component is complete and takes about 30 minutes. I want to save the final result to a NetCDF4 file, but writing the data to a NetCDF…

Rowan_Gaffney
- 452
- 5
- 17
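One pattern that avoids a single eager serial write is to build the output lazily with compute=False and let the cluster execute it; whether the write is truly parallel then depends on the NetCDF backend (e.g. netCDF4 vs h5netcdf) and the filesystem. A sketch, with assumed inputs and a stand-in computation:

import xarray as xr
from dask.distributed import Client

client = Client()                        # or the existing HPC cluster

ds = xr.open_mfdataset('chunks-*.nc', parallel=True)  # assumed inputs
result = ds.mean('time')                 # stand-in for the 30-minute computation

# compute=False returns a dask delayed object instead of writing eagerly.
write = result.to_netcdf('result.nc', compute=False)
write.compute()                          # executes on the cluster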
3
votes
1 answer
jupyter lab open an iframe on a tab for monitoring dask scheduler
I am developing with dask distributed, and this package provides a very useful debugging view as a Bokeh application.
I want to have this application next to my notebook in a JupyterLab tab.
I have managed to do so by opening the jupyter lab…

redoules
- 33
- 4
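The polished route is the dask-labextension package, which adds dashboard panes as native JupyterLab tabs; short of that, an IFrame inside a notebook cell also works. A sketch pointing at the default dashboard port:

from IPython.display import IFrame

# Embed the scheduler's Bokeh dashboard (default port 8787) in a cell.
IFrame('http://localhost:8787/status', width=900, height=500)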
3
votes
1 answer
How do I get adaptive dask workers to run some code on startup?
I'm creating a dask scheduler using dask-kubernetes and putting it into adaptive mode.
from dask_kubernetes import KubeCluster

cluster = KubeCluster()
cluster.adapt(minimum=0, maximum=40)  # scale between 0 and 40 workers
I need each worker to run some setup code when they are…

Jacob Tomlinson
- 3,341
- 2
- 31
- 62
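One mechanism that covers adaptively-added workers is Client.register_worker_callbacks: the callback runs on every current worker and on each worker that joins later. A sketch with a hypothetical setup function:

from dask.distributed import Client

def setup_worker():
    # Hypothetical per-worker initialization: warm caches, set env vars, etc.
    import os
    os.environ['MY_FLAG'] = '1'

client = Client(cluster)                  # cluster from KubeCluster() above
client.register_worker_callbacks(setup_worker)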
3
votes
1 answer
How to reliably clean up dask scheduler/worker
I'm starting up a dask cluster in an automated way by ssh-ing into a bunch of machines and running dask-worker. I noticed that I sometimes run into problems when processes from a previous experiment are still running. What's the best way to clean up…

John
- 935
- 6
- 17
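Two things that help: launch workers with a death timeout so they exit on their own when the scheduler goes away (dask-worker --death-timeout 60), and shut the whole cluster down from the client at the end of each run. A sketch of the latter:

from dask.distributed import Client

client = Client('scheduler-host:8786')    # placeholder address
# ... run the experiment ...

# Tells the scheduler to close itself and all connected workers,
# so nothing lingers for the next ssh-launched run.
client.shutdown()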
3
votes
1 answer
Tensorflow + joblib: limited to 8 processes?
I created a statistical estimator using TensorFlow. I followed sklearn's estimator interface, so I have a class that packages everything, including importing TensorFlow and starting TF's session (if I import TF outside the class nothing works in parallel at…

Luk17
- 222
- 1
- 11
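If the 8-process ceiling comes from joblib's default local backend, pointing joblib at the Dask cluster sidesteps it; the parallelism then follows the worker count. A sketch, where train_one and configs are hypothetical:

import joblib
from dask.distributed import Client

client = Client('scheduler-host:8786')    # placeholder address

with joblib.parallel_backend('dask'):
    # Tasks now run on the Dask workers, not a local 8-process pool.
    results = joblib.Parallel()(
        joblib.delayed(train_one)(cfg) for cfg in configs  # hypothetical
    )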