Questions tagged [dask-distributed]

Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.

1090 questions
0
votes
1 answer

How to split a csv into multiple csv files using Dask

How to split a csv file into multiple files using Dask? The below code seems to write to only one file, which takes a long time to write the full thing. I believe writing to multiple files would be faster. import dask.dataframe as ddf import…
mongotop
  • 7,114
  • 14
  • 51
  • 76
0
votes
2 answers

What could be the explanation of this "pyarrow.lib.ArrowIOError: HDFS file does not exist" error when trying to read files in hdfs using Dask?

I'm using Dask Distributed and I'm trying to create a dataframe from a CSV stored in HDFS. I suppose the connection to HDFS is successful as I'm able to print the dataframe columns' names. However, I get the following error when I'm trying to use…
Sevy
  • 15
  • 2
  • 6
0
votes
1 answer

HighLevelGraph with (local/multiprocessing) distributed

How should I use dask.highlevelgraph.HighLevelGraph in a local distributed setting? Sequential computation, e.g. result = dask.get(some_high_level_graph, [some_targets]), works. import dask from dask.highlevelgraph import HighLevelGraph as CG #…
stustd
  • 303
  • 1
  • 10
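For context, a HighLevelGraph is just a layered collection of low-level task graphs, and anything implementing the scheduler `get` interface can execute it. A sketch with made-up layer names and keys:

```python
import dask
from dask.highlevelgraph import HighLevelGraph

# Two layers: layer2's task consumes keys defined in layer1.
layers = {
    "layer1": {"x": 1, "y": 2},
    "layer2": {"z": (sum, ["x", "y"])},
}
dependencies = {"layer1": set(), "layer2": {"layer1"}}
hlg = HighLevelGraph(layers, dependencies)

# HighLevelGraph is a Mapping, so the synchronous scheduler runs it directly:
print(dask.get(hlg, "z"))  # 3

# A distributed Client implements the same get interface (sketch; requires
# a dask.distributed installation):
#   from dask.distributed import Client
#   client = Client()        # local cluster
#   client.get(hlg, "z")
```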
0
votes
1 answer

Problem parallelizing dask code on a single machine

Parallelizing with dask is slower than sequential code. I have nested for loops which I am trying to parallelize on a local cluster but can't find the right way. I want to parallelize the inner loop. I have 2 big numpy matrices which I am trying to…
netfr
  • 1
  • 4
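For large numpy matrices, a common fix is to let dask.array chunk the arrays so block operations run in parallel; chunks that are too small make scheduling overhead dominate, which is the usual reason "parallel" dask code ends up slower than plain numpy. A sketch with made-up shapes:

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(0)
a = rng.random((400, 200))
b = rng.random((200, 400))

# Moderately large chunks: each block matmul is real work, so the
# per-task scheduling cost is amortized.
da_a = da.from_array(a, chunks=(100, 200))
da_b = da.from_array(b, chunks=(200, 100))

result = (da_a @ da_b).compute()
print(result.shape)  # (400, 400)
```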
0
votes
1 answer

Right way to set memory parameters for LocalCluster in dask

I tried the code below, from dask.distributed import Client, LocalCluster worker_kwargs = { 'memory_limit': '2G', 'memory_target_fraction': 0.6, 'memory_spill_fraction': 0.7, 'memory_pause_fraction': 0.8, …
zyxue
  • 7,904
  • 5
  • 48
  • 74
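A hedged sketch of the current arrangement: `memory_limit` is a per-worker `LocalCluster` keyword, but in recent versions the target/spill/pause fractions are read from the dask config (`distributed.worker.memory.*`) rather than passed as worker kwargs:

```python
import dask
from dask.distributed import Client, LocalCluster

# The fractions live in the config, not in worker kwargs (assumed for
# recent dask/distributed versions):
dask.config.set({
    "distributed.worker.memory.target": 0.6,
    "distributed.worker.memory.spill": 0.7,
    "distributed.worker.memory.pause": 0.8,
})

# memory_limit stays a per-worker keyword; processes=False keeps the
# workers in-process so this sketch starts quickly.
cluster = LocalCluster(n_workers=2, threads_per_worker=1,
                       memory_limit="2GB", processes=False)
client = Client(cluster)
n_workers_started = len(cluster.workers)
print(n_workers_started)  # 2

client.close()
cluster.close()
```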
0
votes
1 answer

How to specify dask client via environment variable

How can I instruct dask to use a distributed Client as the scheduler, externally from the code, e.g. via an environment variable? The motivation is to take advantage of one of the key features of dask - namely the transparency of going from a single…
stav
  • 1,497
  • 2
  • 15
  • 40
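A sketch of the environment-variable route: dask maps `DASK_*` environment variables into its config (double underscores become nested keys), and a `Client()` created with no arguments consults the configured scheduler address. The address below is made up:

```python
import os
import dask

# Set the scheduler address from the environment, as you would outside
# the code (e.g. in a shell or container definition).
os.environ["DASK_SCHEDULER_ADDRESS"] = "tcp://127.0.0.1:8786"
dask.config.refresh()  # re-read environment variables into the config

print(dask.config.get("scheduler_address"))
# With this set, Client() with no arguments connects to that scheduler,
# so the same code runs locally or distributed without modification.
```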
0
votes
1 answer

dask dataframe: merging two dataframes, imputing missing values and writing to csv only uses part of each CPU (~20% per CPU)

I want to merge two dask dataframes, impute missing values with the column median, and export the merged dataframe to csv files. I have one problem: my current code cannot utilize all 8 CPUs (~20% of each CPU). I am not sure which part limits the CPU…
Jin Wang
  • 1
  • 1
0
votes
1 answer

Reshape, concatenate and aggregate multiple pandas DataFrames

I have five different pandas data frames showing results of calculations done on the same data with the same number of samples; all the arrays are identical in shape (5x10). df shape for each data set: (recording channels) 0 1 2 3 4 5 6 7 8…
abhishake
  • 131
  • 1
  • 12
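For same-shaped frames, one common approach is to stack them with a named key level and then aggregate across that level. A sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Five same-shaped (5 x 10) frames, e.g. one per recording run.
frames = [pd.DataFrame(np.arange(50).reshape(5, 10) + i) for i in range(5)]

# concat with keys adds a "run" index level; grouping on the "row" level
# then aggregates across runs, keeping the original 5 x 10 shape.
stacked = pd.concat(frames, keys=range(5), names=["run", "row"])
mean_across_runs = stacked.groupby(level="row").mean()
print(mean_across_runs.shape)  # (5, 10)
```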
0
votes
1 answer

How to get results of tasks when they finish and not after all have finished in Dask?

I have a dask dataframe and want to compute some tasks that are independent. Some tasks are faster than others, but I'm only getting the result of each task after the longer tasks have completed. I created a local Client and use client.compute() to send…
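This is what `as_completed` is for: it yields futures in the order they finish rather than the order they were submitted. A minimal sketch:

```python
import time
from dask.distributed import Client, as_completed

def work(delay):
    time.sleep(delay)
    return delay

client = Client(processes=False)  # in-process cluster for the sketch
futures = client.map(work, [0.3, 0.1, 0.2])

# Fast tasks come back immediately; no waiting on the slow ones.
done_order = [f.result() for f in as_completed(futures)]
print(done_order)
client.close()
```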
0
votes
1 answer

How to get task result in dask scheduler plugin

I want to forward the result of a task with a scheduler plugin in dask. I have a class that is registered, and when I log inside the transition function it shows: transition: key=, start=processing, finish=memory, *args=(), **kwargs={'worker':…
Matt Nicolls
  • 173
  • 1
  • 7
0
votes
1 answer

How do I ignore a worker whose tasks have failed and redistribute its tasks to other workers?

I was running a function on a pool of N single-threaded workers (on N machines) with client.map and one of the workers failed. I was wondering if there is a way to automatically handle exceptions raised by a worker, to redistribute its failed tasks…
billiam
  • 132
  • 1
  • 15
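Two relevant mechanisms, sketched below: `retries=` on `client.map`/`submit` asks the scheduler to re-run a failed task (on another worker if one is available) before marking it errored, and errored futures can be inspected individually via their `status` so the rest of the results are still usable:

```python
from dask.distributed import Client, as_completed

def work(x):
    if x == 2:                       # simulate a task that always fails
        raise ValueError("bad input")
    return x * 10

client = Client(processes=False)

# retries=1 re-runs a failed task once before giving up; here the failure
# is deterministic, so that one future still ends up in the "error" state.
futures = client.map(work, [1, 2, 3], retries=1)

ok, failed = [], []
for f in as_completed(futures):
    (ok if f.status == "finished" else failed).append(f)

ok_values = sorted(f.result() for f in ok)
print(ok_values, len(failed))
client.close()
```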
0
votes
1 answer

Can I retrieve a distributed.client instance if I know its id?

With dask there is an id associated with each instance of distributed.client. Calling .id on a client will show its id. Can I retrieve a client instance if I know its id?
billiam
  • 132
  • 1
  • 15
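As far as I know there is no public registry mapping ids back to instances, but the current client in a process (or inside a task) is recoverable without the id via `get_client()`. A sketch:

```python
from dask.distributed import Client, get_client

client = Client(processes=False)
print(client.id)  # e.g. 'Client-1a2b…'

# get_client() returns the active client in this process (and, inside a
# running task, the client of the worker executing it).
same = get_client()
print(same is client)
client.close()
```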
0
votes
1 answer

Dask on single OSX machine - is it parallel by default?

I have installed Dask on OSX Mojave. Does it execute computations in parallel by default? Or do I need to change some settings? I am using the DataFrame API. Does that make a difference to the answer? I installed it with pip. Does that make a…
power
  • 1,680
  • 3
  • 18
  • 30
0
votes
1 answer

How to parallelize a nested loop with dask.distributed?

I am trying to parallelize a nested loop using dask.distributed that looks like this: @dask.delayed def delayed_a(e): a = do_something_with(e) return something @dask.delayed def delayed_b(element): computations = [] for e in element: …
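The usual shape of the answer, sketched with made-up function names: decorate only the leaf work with `@delayed`, turn the inner loop into a list of lazy tasks, and execute everything with a single `dask.compute` call:

```python
import dask
from dask import delayed

@delayed
def square(e):
    # stand-in for the real per-element work
    return e * e

def process(element):
    # the inner loop builds lazy tasks instead of running eagerly
    return [square(e) for e in element]

elements = [[1, 2], [3, 4, 5]]
tasks = [t for element in elements for t in process(element)]

# one compute call runs all tasks in parallel on the local scheduler
results = dask.compute(*tasks)
print(results)  # (1, 4, 9, 16, 25)
```

Decorating the outer function with `@delayed` as well tends to serialize the work, because the inner tasks are then built inside a single outer task.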
0
votes
2 answers

Process pool on DASK

I am new to Dask. I can submit 10 tasks using client.map(funct_name, iterator), where the iterator is a list containing the 10 elements. Now I want to submit the next task, say the 11th, when any one of the earlier 10 submitted tasks is…
Mahendra Gaur
  • 380
  • 2
  • 11
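A sketch of the sliding-window pattern: seed `as_completed` with the first batch, then `add` a new future each time one finishes, keeping a fixed number of tasks in flight (batch sizes and the work function are made up):

```python
from dask.distributed import Client, as_completed

def work(x):
    return x * 2

client = Client(processes=False)

data = list(range(25))
first, rest = data[:10], data[10:]

# Start with 10 in-flight tasks; as each finishes, submit the next one.
ac = as_completed(client.map(work, first))
results = []
for fut in ac:
    results.append(fut.result())
    if rest:
        ac.add(client.submit(work, rest.pop(0)))

print(sorted(results))
client.close()
```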