Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.
Questions tagged [dask-distributed]
1090 questions
4
votes
1 answer
How to check if dask dataframe is empty if lazily evaluated?
I am aware of this question, but check the code (minimal working example) below:
import dask.dataframe as dd
import pandas as pd
# initialise data of lists.
data = {'Name': ['Tom', 'nick', 'krish', 'jack'], 'Age': [20, 21, 19, 18]}
# Create…
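A minimal sketch of one way to answer this without computing the whole frame: pull at most a single row and test for it (the dataframe below mirrors the question's example).
import pandas as pd
import dask.dataframe as dd

data = {'Name': ['Tom', 'nick', 'krish', 'jack'], 'Age': [20, 21, 19, 18]}
ddf = dd.from_pandas(pd.DataFrame(data), npartitions=2)

# Any emptiness check on a lazy dataframe forces some computation;
# fetching at most one row is usually far cheaper than len(ddf).
is_empty = len(ddf.head(1, npartitions=-1)) == 0
print(is_empty)  # False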

MehmedB
- 1,059
- 1
- 16
- 42
4
votes
0 answers
dask 100GB dataframe sorting / set_index on new column out of memory issues
I have a dask dataframe of around 100GB and 4 columns that does not fit into memory. My machine is an 8-core Xeon with 64GB of RAM, running a local Dask cluster.
I repartitioned the dataframe into 150 partitions (700MB each). However,
my simple set_index()…
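One hedged suggestion for this situation: on a single machine, a disk-based shuffle keeps the sort out of RAM at the cost of extra I/O. The input path and column name below are hypothetical, and the keyword is version-dependent.
import dask.dataframe as dd

ddf = dd.read_parquet('data/*.parquet')  # hypothetical 100GB input

# Older dask releases spell this `shuffle='disk'`; newer ones use
# `shuffle_method='disk'`. Either way the shuffle spills to disk
# instead of holding everything in memory at once.
ddf = ddf.set_index('new_column', shuffle='disk')
ddf.to_parquet('sorted/')  # write the sorted result back out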

user670186
- 2,588
- 6
- 37
- 55
4
votes
2 answers
How to use Dask on Databricks
I want to use Dask on Databricks. It should be possible (I cannot see why not). If I import it, one of two things happens: either I get an ImportError, but when I install distributed to solve this, Databricks just says Cancelled without throwing any…
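As a sketch of the simplest thing to try, assuming dask and distributed have been installed on the Databricks cluster (for example via %pip install "dask[distributed]" in a notebook cell), a local cluster confined to the driver node avoids any interaction with Databricks' own executors:
from dask.distributed import Client, LocalCluster

# In-process (threaded) workers sidestep the process-spawning issues
# some hosted notebook environments have.
cluster = LocalCluster(processes=False)
client = Client(cluster)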

SARose
- 3,558
- 5
- 39
- 49
4
votes
0 answers
Pycharm debugger throws Bad file descriptor error when using dask distributed
I am using the most lightweight/simple dask multiprocessing setup, which is the non-cluster local Client:
from distributed import Client
client = Client()
Even so, the first invocation of dask.bag.compute() results in the following:
Connected to…
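A common debugging workaround, sketched under the assumption that the error comes from the process-based workers rather than the user code: keep everything in a single process while stepping through.
import dask
from distributed import Client

client = Client(processes=False)          # in-process workers only
# ...or bypass distributed entirely while debugging:
dask.config.set(scheduler='synchronous')  # single-threaded, debugger-friendly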

WestCoastProjects
- 58,982
- 91
- 316
- 560
4
votes
1 answer
Initializing state on dask-distributed workers
I am trying to do something like
resource = MyResource()

def fn(x):
    something = dosomething(x, resource)
    return something
client = Client()
results = client.map(fn, data)
The issue is that resource is not serializable and is expensive to…
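One documented pattern that fits this: build the resource once per worker with client.run and look it up inside each task. MyResource, dosomething, and data are the question's placeholders.
from distributed import Client, get_worker

def init_resource(dask_worker):
    # client.run passes the Worker object to any function that takes
    # an argument named dask_worker; stash the expensive object on it.
    dask_worker.my_resource = MyResource()

def fn(x):
    resource = get_worker().my_resource  # retrieved inside the task
    return dosomething(x, resource)

client = Client()
client.run(init_resource)  # executes once on every current worker
results = client.gather(client.map(fn, data))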

Daniel Mahler
- 7,653
- 5
- 51
- 90
4
votes
1 answer
dask read_csv timeout on Amazon s3 with big files
dask read_csv timeout on s3 for big files
s3fs.S3FileSystem.read_timeout = 5184000 # 60 days
s3fs.S3FileSystem.connect_timeout = 5184000 # 60 days
client = Client('a_remote_scheduler_ip_here:8786')
df =…
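Setting class attributes on s3fs.S3FileSystem does not propagate to remote workers; a sketch of the usual fix is to pass the timeouts through storage_options, so every worker builds its own filesystem with them (the bucket path is hypothetical):
import dask.dataframe as dd

df = dd.read_csv(
    's3://bucket/big-*.csv',
    storage_options={
        # forwarded to botocore's Config on every worker
        'config_kwargs': {
            'read_timeout': 86400,     # one day, in seconds
            'connect_timeout': 86400,
        }
    },
)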

Võ Trường Duy
- 121
- 1
- 7
4
votes
1 answer
How do I use dask to efficiently calculate many simple statistics
Problem
I want to calculate a bunch of "easy to gather" statistics using Dask.
Speed is my primary concern, so I am looking to throw a wide cluster at the problem.
Ideally I would like to finish the described problem in less than…
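One relevant trick, sketched on a hypothetical input: build all the lazy statistics first and hand them to a single dask.compute call, so dask merges the graphs and scans the data once rather than once per statistic.
import dask
import dask.dataframe as dd

ddf = dd.read_parquet('data/*.parquet')  # hypothetical input

stats = {
    'mean_age': ddf['Age'].mean(),
    'max_age': ddf['Age'].max(),
    'n_rows': ddf['Age'].count(),
}
# One pass over the data computes every entry in the dict.
(results,) = dask.compute(stats)
print(results)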

bluecoconut
- 63
- 1
- 5
4
votes
0 answers
Dask Distributed client takes too long to initialize in JupyterLab
Trying to initialize a client with a local cluster in JupyterLab, but it hangs. This behaviour happens with Python 3.5 and JupyterLab 0.35.
import dask.dataframe as dd
from dask import delayed
from distributed import Client
from distributed import…
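One way to narrow this down, as a sketch: construct the cluster explicitly instead of relying on the Client() defaults, so you can tell whether process spawning or the dashboard is what hangs.
from distributed import Client, LocalCluster

# Threaded workers and no dashboard: if this comes up instantly,
# the hang is in process creation or the dashboard port.
cluster = LocalCluster(processes=False, dashboard_address=None)
client = Client(cluster)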

Apostolos
- 7,763
- 17
- 80
- 150
4
votes
1 answer
Tornado unexpected exception in Future after timeout
I have set up a dask cluster. I can access the web dashboard, but when I try to connect to the scheduler:
from dask.distributed import Client
client = Client('192.168.0.10:8786')
I get the following error:
tornado.application - ERROR - Exception…
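Mismatched dask/distributed versions between the client and the scheduler are a frequent cause of errors like this; a sketch of a quick check (the address is the question's own):
from dask.distributed import Client

client = Client('tcp://192.168.0.10:8786', timeout=30)  # longer handshake timeout
# Raises if client, scheduler, and workers disagree on versions.
print(client.get_versions(check=True))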

Vladyslav Moisieienkov
- 4,118
- 4
- 25
- 32
4
votes
1 answer
How to assign tasks to specific worker within Dask.Distributed
I am interested in using Dask Distributed as a task executor.
In Celery it is possible to assign a task to a specific worker. How is this possible using Dask Distributed?
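submit and map accept a workers= constraint naming the worker address (or alias) allowed to run the task; a minimal sketch, with a hypothetical scheduler address, worker address, function, and argument:
from dask.distributed import Client

client = Client('tcp://scheduler:8786')  # hypothetical address

future = client.submit(
    process, item,                      # hypothetical function and input
    workers=['tcp://10.0.0.5:45123'],   # only this worker may run it
    allow_other_workers=False,          # hard constraint, not a preference
)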

Sklavit
- 2,225
- 23
- 29
4
votes
1 answer
dask how to define a custom (time fold) function that operates in parallel and returns a dataframe with a different shape
I am trying to implement a time fold function to be mapped over various partitions of a dask dataframe, which in turn changes the shape of the dataframe in question (or alternatively produces a new dataframe with the altered shape). This is how far I…
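The usual tool for this is map_partitions with an explicit meta describing the output; the reshaping below is a hypothetical stand-in for the question's time fold:
import pandas as pd
import dask.dataframe as dd

def fold(partition: pd.DataFrame) -> pd.DataFrame:
    # hypothetical shape change: one aggregate row per partition
    return partition.agg(['mean']).reset_index(drop=True)

ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=2)

# `meta` declares the output columns/dtypes, so the altered shape
# is fine as long as the real output matches the declaration.
out = ddf.map_partitions(fold, meta={'x': 'f8'})
print(out.compute())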

PhaKuDi
- 141
- 8
4
votes
1 answer
Running shell commands in parallel using dask distributed
I have a folder with a lot of .sh scripts. How can I use an already set up dask distributed cluster to run them in parallel?
Currently, I am doing the following:
import dask, distributed, os
# list with shell commands that I want to run
commands =…
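A minimal sketch of one way to finish this, assuming the cluster is already running at a known address and the command list stands in for the real scripts:
import subprocess
from distributed import Client

client = Client('tcp://scheduler:8786')   # hypothetical address
commands = ['./a.sh', './b.sh']           # stand-in for the real list

def run(cmd):
    # each task shells out and reports the exit code
    return subprocess.run(cmd, shell=True).returncode

# pure=False: these tasks exist for their side effects
futures = client.map(run, commands, pure=False)
print(client.gather(futures))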

Arco Bast
- 3,595
- 2
- 26
- 53
4
votes
0 answers
Optimal approach to create dask dataframe from parquet files(HDFS) in different directories
I am trying to create a dask dataframe from a large number of parquet files stored in different HDFS directories. I have tried two approaches, but both of them seem to take a very long time.
Approach 1: call the read_parquet API with a glob path.…
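One hedged alternative to globbing each directory separately: read_parquet accepts a list of paths, and per-directory frames can also be concatenated lazily (the HDFS paths below are hypothetical).
import dask.dataframe as dd

paths = ['hdfs:///data/2019-01', 'hdfs:///data/2019-02']  # hypothetical

# Single call over all directories...
ddf = dd.read_parquet(paths)

# ...or one frame per directory, concatenated without computing.
ddf = dd.concat([dd.read_parquet(p) for p in paths])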

Santosh Kumar
- 761
- 5
- 28
4
votes
0 answers
Dask restart worker(s) using client
Is there a way, using the dask client, to restart a worker or a provided list of workers? I need a way to bounce a worker after a task executes, to reset process state that the execution may have changed.
Client.restart() restarts the entire…
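A sketch under the assumption of a recent distributed release and nanny-managed workers: Client.restart_workers restarts only the named workers, which resets their process state (the addresses are hypothetical).
from distributed import Client

client = Client('tcp://scheduler:8786')  # hypothetical address

# Each named worker is restarted through its nanny, giving a fresh
# process without touching the rest of the cluster.
client.restart_workers(workers=['tcp://10.0.0.5:34567'])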

Ameet Shah
- 61
- 1
- 4
4
votes
1 answer
Redistribute dask tasks among the cluster
I am abusing dask as a task scheduler for long-running tasks with map(..., pure=False). So I am not interested in the dask graph; I just use dask as a way to distribute unix commands.
Let's say I have 1000 tasks and they run for a week on a cluster of…
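Worth noting in a sketch: queued tasks are redistributed automatically by the scheduler's work stealing, while rebalance() only moves results already held in worker memory, which can still help after new workers join (the address is hypothetical).
from distributed import Client

client = Client('tcp://scheduler:8786')  # hypothetical address

# Spreads *completed* results evenly across workers; pending tasks
# are rebalanced by work stealing without any explicit call.
client.rebalance()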

MaxBenChrist
- 547
- 3
- 9