Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.
Questions tagged [dask-distributed]
1090 questions
0
votes
1 answer
How to split a csv into multiple csv files using Dask
How to split a csv file into multiple files using Dask?
The code below seems to write to a single file, which takes a long time. I believe writing to multiple files in parallel would be faster.
import dask.dataframe as ddf
import…

mongotop
- 7,114
- 14
- 51
- 76
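A minimal sketch of the usual approach: read with a blocksize so the data lands in many partitions, then let to_csv write one file per partition in parallel (the file names here are assumptions):

import dask.dataframe as ddf

# blocksize controls partition size, and hence the number of output files
df = ddf.read_csv("data.csv", blocksize="64MB")
# a "*" in the target path writes one CSV per partition, in parallel
df.to_csv("out/part-*.csv", index=False)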
0
votes
2 answers
What could be the explanation of this "pyarrow.lib.ArrowIOError: HDFS file does not exist" error when trying to read files in HDFS using Dask?
I'm using Dask Distributed and I'm trying to create a dataframe from a CSV stored in HDFS.
I suppose the connection to HDFS is successful as I'm able to print the dataframe columns' names.
However, I get the following error when I'm trying to use…

Sevy
- 15
- 2
- 6
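One common cause is a path that resolves differently on the workers than on the client, which is consistent with the column names (read locally from metadata) working while the real read fails. A hedged sketch with a fully qualified URL (namenode host, port and path are assumptions):

import dask.dataframe as ddf

# fully qualified URL so the client and every worker resolve the same file
df = ddf.read_csv("hdfs://namenode:8020/user/me/data.csv")
print(df.columns)  # metadata only, read on the client
print(df.head())   # triggers real reads on the workers, where the error surfaces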
0
votes
1 answer
HighLevelGraph with (local/multiprocessing) distributed
How should I use dask.highlevelgraph.HighLevelGraph in a local distributed setting?
Sequential computation e.g.
result = dask.get(some_high_level_graph, [some_targets])
works.
import dask
from dask.highlevelgraph import HighLevelGraph as CG
#…

stustd
- 303
- 1
- 10
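A minimal sketch of the distributed counterpart: client.get plays the role that dask.get plays sequentially (the two-layer graph is illustrative, not from the question):

from dask.distributed import Client
from dask.highlevelgraph import HighLevelGraph

layers = {
    "layer-a": {"a": (sum, [1, 2, 3])},
    "layer-b": {"b": (max, "a", 10)},  # "a" refers to the key defined above
}
dependencies = {"layer-a": set(), "layer-b": {"layer-a"}}
graph = HighLevelGraph(layers, dependencies)

client = Client()                 # local cluster, multiprocessing by default
result = client.get(graph, "b")   # distributed equivalent of dask.get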
0
votes
1 answer
Problem parallelizing dask code on a single machine
Parallelizing with dask is slower than the sequential code.
I have nested for loops that I am trying to parallelize on a local cluster but can't find the right way.
I want to parallelize the inner loop.
I have 2 big numpy matrices which I am trying to…

netfr
- 1
- 4
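A frequent culprit is re-serializing the big matrices into every task; a sketch that scatters them once and maps only the inner loop (the per-row work is a placeholder):

import numpy as np
from dask.distributed import Client

client = Client()
A = np.random.rand(1000, 1000)  # stand-ins for the two big matrices
B = np.random.rand(1000, 1000)

# scatter once so each task gets a cheap reference, not a fresh copy
A_f, B_f = client.scatter([A, B], broadcast=True)

def inner(i, A, B):
    return float(A[i] @ B[:, i])  # placeholder per-iteration work

futures = client.map(inner, range(1000), A=A_f, B=B_f)
results = client.gather(futures)

If each inner iteration is only microseconds of work, batching many iterations into one task matters more than anything else, since per-task overhead is on the order of a millisecond.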
0
votes
1 answer
Right way to set memory parameters for LocalCluster in dask
I tried the code below,
from dask.distributed import Client, LocalCluster
worker_kwargs = {
    'memory_limit': '2G',
    'memory_target_fraction': 0.6,
    'memory_spill_fraction': 0.7,
    'memory_pause_fraction': 0.8,
…

zyxue
- 7,904
- 5
- 48
- 74
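In recent distributed releases the memory fractions live in dask's configuration rather than in per-worker keyword arguments; a sketch mirroring the values from the question:

import dask
from dask.distributed import Client, LocalCluster

dask.config.set({
    "distributed.worker.memory.target": 0.6,
    "distributed.worker.memory.spill": 0.7,
    "distributed.worker.memory.pause": 0.8,
})
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="2GB")
client = Client(cluster)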
0
votes
1 answer
How to specify dask client via environment variable
How can I instruct dask to use a distributed Client as the scheduler, externally from the code, e.g. via an environment variable?
The motivation is to take advantage of one of the key features of dask - namely the transparency of going from a single…

stav
- 1,497
- 2
- 15
- 40
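distributed reads the scheduler address from dask's config, which can be populated from the environment: with DASK_SCHEDULER_ADDRESS set, a bare Client() connects to the existing scheduler instead of starting a local cluster, and registers itself as the default scheduler for collections. A sketch (the address is an assumption):

import os
# normally exported in the shell before the script starts
os.environ["DASK_SCHEDULER_ADDRESS"] = "tcp://10.0.0.1:8786"

import dask  # reads DASK_* environment variables into its config on import
from dask.distributed import Client

client = Client()  # picks up scheduler-address from the config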
0
votes
1 answer
dask dataframe: merging two dataframes, imputing missing values and writing to csv uses only part of each CPU (~20%)
I want to merge two dask dataframes, impute missing values with column median and export the merged dataframe to csv files.
The problem: my current code cannot utilize all 8 CPUs (each runs at only ~20%).
I am not sure which part limits the CPU…

Jin Wang
- 1
- 1
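Thread-based workers running GIL-holding pandas code are the usual reason for ~20% per core; a sketch with process-based workers and a computed median (the inputs and join column are assumptions):

import dask.dataframe as ddf
from dask.distributed import Client

# one thread per process sidesteps the GIL that throttles pandas work
client = Client(n_workers=8, threads_per_worker=1)

left = ddf.read_csv("left-*.csv")
right = ddf.read_csv("right-*.csv")
merged = left.merge(right, on="id", how="left")

medians = merged.quantile(0.5).compute()  # approximate column medians
merged.fillna(medians).to_csv("merged-*.csv", index=False)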
0
votes
1 answer
Reshape, concatenate and aggregate multiple pandas DataFrames
I have five pandas DataFrames holding results of calculations on the same data with the same number of samples; all the arrays are identical in shape (5x10).
df shape for each data set (recording channels):
0 1 2 3 4 5 6 7 8…

abhishake
- 131
- 1
- 12
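One sketch in plain pandas: concatenate the five frames under a run key, then aggregate across runs (random data stands in for the real results):

import numpy as np
import pandas as pd

frames = [pd.DataFrame(np.random.rand(5, 10)) for _ in range(5)]  # five 5x10 runs

stacked = pd.concat(frames, keys=range(5), names=["run", "sample"])
mean_over_runs = stacked.groupby(level="sample").mean()  # aggregated 5x10 frame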
0
votes
1 answer
How to get results of tasks when they finish and not after all have finished in Dask?
I have a dask dataframe and want to compute some independent tasks. Some tasks are faster than others, but I only get each task's result after the longer tasks have completed.
I created a local Client and use client.compute() to send…

Diego Rodriguez
- 5
- 1
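as_completed is the standard answer: it yields futures in completion order rather than submission order. A minimal sketch:

import time
from dask.distributed import Client, as_completed

client = Client()

def work(x):
    time.sleep(x)  # tasks of varying duration
    return x

futures = client.map(work, [3, 1, 2])
for future in as_completed(futures):
    print(future.result())  # prints 1, then 2, then 3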
0
votes
1 answer
How to get task result in dask scheduler plugin
I want to forward the result of a task with a scheduler plugin in dask. I have a class that is registered, and when I log inside the transition function it shows:
transition: key=, start=processing, finish=memory, *args=(), **kwargs={'worker':…

Matt Nicolls
- 173
- 1
- 7
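The scheduler tracks metadata only; task values stay on the workers, so a plugin can observe that a key reached memory (and on which worker) but cannot read the result itself. A sketch of the transition hook:

from distributed.diagnostics.plugin import SchedulerPlugin

class DoneLogger(SchedulerPlugin):
    def transition(self, key, start, finish, *args, **kwargs):
        if start == "processing" and finish == "memory":
            # kwargs carries metadata such as the worker address,
            # not the task's value
            print(f"{key} finished on {kwargs.get('worker')}")

# registration, e.g. from a scheduler preload script:
# scheduler.add_plugin(DoneLogger())

Fetching the actual value would need a client connected to the cluster, for example from the code that submitted the work.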
0
votes
1 answer
How do I ignore a worker whose tasks have failed and redistribute its tasks to other workers?
I was running a function on a pool of N single-threaded workers (on N machines) with client.map and one of the workers failed. I was wondering if there is a way to automatically handle exceptions raised by a worker, to redistribute its failed tasks…

billiam
- 132
- 1
- 15
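If a worker process dies outright, the scheduler already reschedules its tasks elsewhere; for tasks that raise exceptions, retries= requests automatic resubmission. A sketch (the function, inputs and address are placeholders):

from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # address is an assumption

def process(x):
    return x * 2  # placeholder for the real work

# each failing task is retried up to 3 times, possibly on other workers
futures = client.map(process, range(100), retries=3)
results = client.gather(futures)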
0
votes
1 answer
Can I retrieve a distributed.client instance if I know its id?
With dask there is an id associated with each instance of distributed.Client. Calling .id on a client will show its id. Can I retrieve a client instance if I know its id?

billiam
- 132
- 1
- 15
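To my knowledge there is no public registry mapping an id string back to an instance; the handles that do exist return the current or default client. A sketch:

from dask.distributed import Client, get_client

c = Client()
print(c.id)  # e.g. "Client-4f3f...", informational only

assert Client.current() is c  # most recently created client in this process
# inside a task running on a worker, get_client() returns a usable client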
0
votes
1 answer
Dask on single OSX machine - is it parallel by default?
I have installed Dask on OSX Mojave. Does it execute computations in parallel by default? Or do I need to change some settings?
I am using the DataFrame API. Does that make a difference to the answer?
I installed it with pip. Does that make a…

power
- 1,680
- 3
- 18
- 30
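Short answer: yes; dask.dataframe uses the local threaded scheduler by default, and neither the OS nor installing via pip changes that. A sketch for checking and overriding the scheduler (the file name is a stand-in):

import dask
import dask.dataframe as ddf

print(dask.config.get("scheduler", None))  # None means the collection default

df = ddf.read_csv("data.csv")
df.sum().compute(scheduler="threads")    # the dataframe default
df.sum().compute(scheduler="processes")  # explicit process pool instead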
0
votes
1 answer
How to parallelize a nested loop with dask.distributed?
I am trying to parallelize a nested loop using dask.distributed that looks like this:
@dask.delayed
def delayed_a(e):
    a = do_something_with(e)
    return something

@dask.delayed
def delayed_b(element):
    computations = []
    for e in element:
…

muammar
- 951
- 2
- 13
- 32
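The usual shape of this pattern: build all the inner delayed calls first, aggregate them with one more delayed call, and compute once at the end so everything runs in parallel. A sketch with placeholder work:

import dask
from dask.distributed import Client

client = Client()  # once created, delayed.compute() runs on it

@dask.delayed
def delayed_a(e):
    return e * 2  # placeholder for do_something_with(e)

@dask.delayed
def combine(results):
    return sum(results)

computations = [delayed_a(e) for e in range(10)]
total = combine(computations).compute()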
0
votes
2 answers
Process pool on DASK
I am new to Dask.
I can submit 10 tasks using client.map(funct_name, iterator), where the iterator is a list containing the 10 elements.
Now, I want to submit the next task (say, the 11th) as soon as any of the earlier 10 tasks is…

Mahendra Gaur
- 380
- 2
- 11
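as_completed with .add gives exactly this keep-10-in-flight behavior: submit 10, then top the pool up each time a future finishes. A sketch with a placeholder function:

from dask.distributed import Client, as_completed

client = Client()

def funct_name(x):
    return x * x  # placeholder work

items = iter(range(100))

# prime the pool with 10 in-flight tasks
pool = as_completed([client.submit(funct_name, next(items)) for _ in range(10)])
for future in pool:
    print(future.result())
    try:
        pool.add(client.submit(funct_name, next(items)))  # submit the next task
    except StopIteration:
        pass  # inputs exhausted; drain the remaining futures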