Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.
Questions tagged [dask-distributed]
1090 questions
0 votes, 1 answer
Design computation graph in dask
Until now, I've used dask with get and a dictionary to define the dependency graph of my tasks. But that means I have to define the whole graph up front, and now I want to add new tasks from time to time (with dependencies on old…

user1769471
- 29
- 3
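A minimal sketch of both patterns mentioned above, assuming a local cluster; all task bodies are placeholders. The dict-and-get graph has to be defined up front, while Client.submit lets new tasks be attached to earlier futures later on.

import dask
from dask.distributed import Client

# Static dict-style graph: every task must be present before calling get.
graph = {
    "a": 1,
    "b": 2,
    "total": (sum, ["a", "b"]),
    "double": (lambda x: 2 * x, "total"),
}
print(dask.get(graph, "double"))      # 6

# Incremental alternative: futures returned by the distributed Client can be
# used as inputs to later submits, so the graph grows over time.
client = Client()                     # local cluster just for illustration
a = client.submit(lambda: 1)
b = client.submit(lambda: 2)
total = client.submit(lambda x, y: x + y, a, b)
# ...later, a brand-new task that depends on the old result:
double = client.submit(lambda x: 2 * x, total)
print(double.result())                # 6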
0 votes, 1 answer
Can't find dependencies/Dependent not found error
I am trying to run this benchmark on a small dask cluster made up of two nodes. The remote worker is deployed with the dask-worker command and appears correctly in the client's output in the benchmark. I've also tried to run some simple…

Aratz
- 430
- 5
- 16
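A minimal connectivity check for the setup the excerpt describes, assuming the scheduler was started with dask-scheduler and the remote worker with dask-worker tcp://scheduler-host:8786 (the address is a placeholder). If a plain task like this succeeds while the benchmark's tasks raise the dependency error, the workers are most likely missing the benchmark's Python packages.

from dask.distributed import Client

client = Client("tcp://scheduler-host:8786")   # placeholder scheduler address
print(client.scheduler_info()["workers"])      # both workers should appear here

# A task with no third-party imports; it only needs Python on the worker.
future = client.submit(lambda x: x + 1, 41)
print(future.result())                         # 42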
0 votes, 1 answer
Dask Memory Error Grouping DF From Parquet Data
I created a parquet dataset by reading data into a pandas df, using get_dummies() on the data, and writing it to a parquet file:
df = pd.read_sql(query, engine)
encoded = pd.get_dummies(df,…

OverflowingTheGlass
- 2,324
- 1
- 27
- 75
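A hedged sketch of how the grouping step might be pushed to dask instead of pandas, assuming the one-hot-encoded frame was written to encoded.parquet and is grouped by an id column (both names are placeholders).

import dask.dataframe as dd

ddf = dd.read_parquet("encoded.parquet")       # placeholder path

# split_out spreads the aggregation result over several partitions instead of
# concentrating it in one, which is a common fix for groupby memory errors.
grouped = ddf.groupby("id").sum(split_out=8)
grouped.to_parquet("grouped.parquet")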
0 votes, 1 answer
Dask: Same tasks are not running in parallel on a cluster of Ubuntu machines
I have 3 Ubuntu machines (CPU). My dask scheduler and client are both on the same machine, while the two dask workers run on the other two machines. When I launch the first task, it gets scheduled on the first worker, but then upon launching…

TheCodeCache
- 820
- 1
- 7
- 27
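One possible explanation, sketched below: client.submit treats calls as pure by default, so submitting the exact same call twice yields a single task on a single worker. Passing pure=False (or distinct arguments) gives each submit its own task, which the scheduler is then free to place on different workers. The address is a placeholder.

import time
from dask.distributed import Client

client = Client("tcp://scheduler-host:8786")   # placeholder scheduler address

def work(seconds):
    time.sleep(seconds)
    return seconds

# pure=False forces one task per submit, even though the calls are identical.
futures = [client.submit(work, 5, pure=False) for _ in range(4)]
print(client.gather(futures))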
0 votes, 1 answer
Is there any way to know whether a dask-worker is running on a CPU device or a GPU device?
Suppose a dask cluster has some CPU devices as well as some GPU devices. Each device runs a single dask-worker. Now, the question is how do I find out whether the underlying device of a dask-worker is a CPU or a GPU.
For example: if the dask-worker is running…

TheCodeCache
- 820
- 1
- 7
- 27
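A hedged sketch of one way to probe this, assuming "has a GPU" can be approximated by whether nvidia-smi is available on the worker's machine; client.run executes the check on every worker and returns a dict keyed by worker address.

import shutil
from dask.distributed import Client

def has_gpu():
    # Rough proxy: the NVIDIA driver tools are installed on this machine.
    return shutil.which("nvidia-smi") is not None

client = Client("tcp://scheduler-host:8786")   # placeholder scheduler address
print(client.run(has_gpu))                     # {worker address: True/False, ...}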
0 votes, 1 answer
Simplest way to create a complex dask graph
There is a complex system of calculations over some objects.
The difficulty is that some calculations are group calculations.
This can be demonstrated by the following example:
from dask.distributed import Client
def load_data_from_db(id):
# load…

Vladimir
- 145
- 2
- 9
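A hedged sketch of per-object plus group calculations with dask.delayed; load_data_from_db and both calculation bodies are placeholders standing in for the real ones.

import dask
from dask import delayed

@delayed
def load_data_from_db(obj_id):
    return {"id": obj_id, "value": obj_id * 10}

@delayed
def per_object_calc(data):
    return data["value"] + 1

@delayed
def group_calc(results):
    # A group calculation that needs every per-object result at once.
    return sum(results)

objects = [load_data_from_db(i) for i in range(5)]
per_obj = [per_object_calc(o) for o in objects]
total = group_calc(per_obj)          # depends on all per-object tasks
print(total.compute())

Because delayed traverses lists in its arguments, passing the whole per_obj list into group_calc wires up all of the group dependencies at once.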
0 votes, 0 answers
dask jobs hang indefinitely and inconsistently
I am running multiple concurrent dask jobs using the dask client submit API. I have come across this issue multiple times.
A thread dump of the specific worker shows the information below.
Can someone guide me on this problem?
ts_data =…

Santosh Kumar
- 761
- 5
- 28
0 votes, 1 answer
dask distributed.utils - ERROR - state is not a dictionary
I recently upgraded dask-0.15.3 to dask-0.16.0 and distributed-1.19.1 to distributed-1.20.2. After the upgrade, all dask jobs fail with the exception: _pickle.UnpicklingError: state is not a dictionary
Please let me know if I am missing any…

Santosh Kumar
- 761
- 5
- 28
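Unpickling errors right after an upgrade often mean the client, scheduler and workers are no longer on matching versions; a hedged check, with a placeholder address:

from dask.distributed import Client

client = Client("tcp://scheduler-host:8786")   # placeholder scheduler address

# check=True raises if the client, scheduler and workers report different
# package versions; otherwise the collected version info is returned.
versions = client.get_versions(check=True)
print(versions)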
0 votes, 1 answer
How much time does it take for dask-ec2 to set up instances?
I am new to dask.distributed. I am trying to set up a small cluster for distributed jobs using dask-ec2. When I run the command with the required args, it gets stuck at the installing worker task. I killed it after 30 minutes. I am using port…

Naresh Kumar
- 3
- 2
0 votes, 1 answer
Dask DataFrame.map_partitions() to write to a db table
I have a dask dataframe that contains some data after some transformations. I want to write that data back to a MySQL table. I have implemented a function that takes a dataframe and a db URL and writes the dataframe back to the database. Because I need…

Apostolos
- 7,763
- 17
- 80
- 150
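A hedged sketch of the map_partitions approach, writing each partition with pandas.to_sql over SQLAlchemy; the connection URL, table name and input path are placeholders, and each partition creates its own engine because engines do not serialize across workers.

import dask.dataframe as dd
import pandas as pd
from sqlalchemy import create_engine

DB_URL = "mysql+pymysql://user:password@db-host/dbname"   # placeholder URL

def write_partition(part):
    engine = create_engine(DB_URL)
    part.to_sql("target_table", engine, if_exists="append", index=False)
    return pd.DataFrame({"rows": [len(part)]})

ddf = dd.read_csv("transformed-*.csv")                    # placeholder input
written = ddf.map_partitions(write_partition, meta={"rows": "int64"})
print(written.compute()["rows"].sum())                    # total rows written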
0 votes, 1 answer
Python + Distributed - Is it possible using Dask to utilize a set of workers to apply a function to separate files from a folder concurrently
I want to write a program that calculates the time it takes to read in a folder of .py files and calculate the cyclomatic complexity of each file. I have Radon installed to calculate the complexity, but I also want to be able to implement a…

J.Doe
- 21
- 3
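A hedged sketch using client.map to spread one complexity calculation per file across the workers, assuming Radon's cc_visit for the score and that the folder is visible on every worker's filesystem; the path and address are placeholders.

import glob
from dask.distributed import Client
from radon.complexity import cc_visit

def file_complexity(path):
    with open(path) as f:
        source = f.read()
    blocks = cc_visit(source)                       # one block per function/method
    return path, sum(b.complexity for b in blocks)

client = Client("tcp://scheduler-host:8786")        # placeholder scheduler address
files = glob.glob("project/**/*.py", recursive=True)
futures = client.map(file_complexity, files)
for path, score in client.gather(futures):
    print(path, score)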
0 votes, 1 answer
I have a collection of futures which are the result of persist on a dask dataframe. How do I do a delayed operation on them?
I have set up a scheduler and 4 worker nodes to do some processing on a csv. The size of the csv is just 300 MB.
df = dd.read_csv('/Downloads/tmpcrnin5ta',assume_missing=True)
df = df.groupby(['col_1','col_2']).agg('mean').reset_index()
df =…

Naresh Kumar
- 3
- 2
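A hedged sketch of combining persist with delayed work: persist keeps the aggregated frame on the cluster, and to_delayed() exposes its partitions as delayed objects that further delayed functions can consume. The path and scheduler address are placeholders; the groupby columns come from the excerpt.

import dask
import dask.dataframe as dd
from dask import delayed
from dask.distributed import Client

client = Client("tcp://scheduler-host:8786")          # placeholder scheduler address

df = dd.read_csv("input.csv", assume_missing=True)    # placeholder path
df = df.groupby(["col_1", "col_2"]).agg("mean").reset_index()
df = df.persist()                                     # futures now back the collection

@delayed
def summarize(partition):
    return len(partition)                             # placeholder delayed operation

parts = df.to_delayed()                               # one delayed object per partition
print(dask.compute(*[summarize(p) for p in parts]))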
0 votes, 1 answer
Read dask dataframe from parallel txt files
I have two (or more) parallel text files stored in S3 - i.e. line 1 in the first file corresponds to line 1 in the second file, etc. I want to read these files as columns into a single dask dataframe. What would be the best/easiest/fastest way to do it?
PS.…

evilkonrex
- 255
- 2
- 10
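A hedged sketch: read each aligned pair of files with pandas inside a delayed task, glue the columns together, and build one dask dataframe from the pieces. Bucket and key names are placeholders, s3fs is assumed to be installed, and each line is assumed to hold a single value with no embedded delimiters.

import pandas as pd
import dask.dataframe as dd
from dask import delayed

pairs = [
    ("s3://bucket/a_part1.txt", "s3://bucket/b_part1.txt"),   # placeholder keys
    ("s3://bucket/a_part2.txt", "s3://bucket/b_part2.txt"),
]

@delayed
def read_pair(path_a, path_b):
    col_a = pd.read_csv(path_a, header=None, names=["a"])
    col_b = pd.read_csv(path_b, header=None, names=["b"])
    return pd.concat([col_a, col_b], axis=1)   # line i of each file becomes row i

meta = pd.DataFrame({"a": pd.Series(dtype=object), "b": pd.Series(dtype=object)})
ddf = dd.from_delayed([read_pair(a, b) for a, b in pairs], meta=meta)
print(ddf.head())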
0 votes, 0 answers
Dask dataframe error while reading from HDFS
Here is the code that I am using to connect to HDFS and create a dask dataframe.
Client(scheduler_host + ":" + scheduler_port)
df = dd.read_csv("hdfs://hdfs_host/")
Error:
AttributeError: /usr/lib/libhdfs3.so: undefined symbol:…

Santosh Kumar
- 761
- 5
- 28
0 votes, 1 answer
subselection of columns in dask (from pandas) by computed boolean indexer
I'm new to dask (imported as dd) and am trying to convert some pandas (imported as pd) code.
The goal of the following lines is to slice the data to those columns whose values fulfill the calculated requirement, in dask.
There is a given table in…

Bastian Ebeling
- 1,138
- 11
- 38
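A hedged sketch of the column subselection, with a made-up frame and a made-up condition (column maximum greater than zero); the per-column mask is tiny, so it is computed eagerly and then used to pick column names.

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": [1, 2], "b": [-1, -2], "c": [0, 3]})
ddf = dd.from_pandas(pdf, npartitions=2)

mask = (ddf.max() > 0).compute()            # pandas Series indexed by column name
selected = ddf[mask[mask].index.tolist()]   # keep only the columns where the mask is True
print(selected.compute())                   # columns "a" and "c"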