Questions tagged [dask-distributed]

Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.

1090 questions
3
votes
1 answer

File Not Found Error in Dask program run on cluster

I have 4 machines: M1, M2, M3, and M4. The scheduler, client, and a worker run on M1, and I've put a CSV file on M1; the rest of the machines are workers. When I run the program with read_csv in dask, it gives me an error: file not found.
Dhruv Kumar
  • 399
  • 2
  • 13
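A likely cause is that dd.read_csv is executed on the workers, so the path must be readable from every machine, not just M1. A minimal sketch of two common workarounds, assuming a hypothetical scheduler address and file paths:

```python
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client("M1:8786")  # hypothetical scheduler address

# Option 1: put the CSV somewhere every worker can reach
# (an NFS mount, S3/GCS bucket, ...) and read it from there.
df = dd.read_csv("/shared/data.csv")  # hypothetical shared path

# Option 2: read the file locally on the client and hand the data
# to the cluster, so the workers never need the local path.
local_df = pd.read_csv("data.csv")          # exists only on M1
ddf = dd.from_pandas(local_df, npartitions=8)
ddf = client.persist(ddf)
```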
3
votes
1 answer

Dask: set multiprocessing method from Python

Is there a way to set the multiprocessing method from Python? I do not see a method in the Client() API docs of Dask.distributed that indicates how to set this property. Update: For example, is there: client =…
ericmjl
  • 13,541
  • 12
  • 51
  • 80
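There is no Client() argument for this, but distributed exposes a configuration key for the worker start method; a hedged sketch, assuming the distributed.worker.multiprocessing-method key is available in your version:

```python
import dask
from dask.distributed import Client, LocalCluster

# Assumption: this config key controls how nanny/worker processes are
# started ("spawn", "fork" or "forkserver"); set it before the cluster
# is created.
dask.config.set({"distributed.worker.multiprocessing-method": "forkserver"})

cluster = LocalCluster(n_workers=4, processes=True)
client = Client(cluster)
print(client)
```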
3
votes
0 answers

Dask scheduler behavior while reading/retrieving large datasets

This is a follow-up to this question. I'm experiencing problems with persisting a large dataset in distributed memory. I have a scheduler running on one machine and 8 workers each running on their own machines connected by 40 gigabit ethernet and a…
A.C.
  • 53
  • 4
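For context, the usual way to pin a large dataset in distributed memory is persist followed by wait; a minimal sketch, with a hypothetical scheduler address and data path:

```python
import dask.dataframe as dd
from dask.distributed import Client, wait

client = Client("scheduler:8786")          # hypothetical address

# Describe the dataset lazily, then ask the scheduler to materialize
# it across the workers' memory and block until that has finished.
df = dd.read_csv("/shared/large-*.csv")    # hypothetical path
df = client.persist(df)
wait(df)

# Inspect how the partitions were spread over the workers.
print(client.who_has())
```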
3
votes
1 answer

Dask scatter broadcast a list

What is the appropriate way to scatter broadcast a list using Dask distributed? case 1 - wrapping the list: [future_list] = client.scatter([my_list], broadcast=True) case 2 - not wrapping the list: future_list = client.scatter(my_list,…
Thomas Moerman
  • 882
  • 8
  • 16
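The two calls do different things: wrapping the list ships it as a single broadcast object, while scattering the bare list turns each element into its own future. A small sketch, using a local cluster for illustration:

```python
from dask.distributed import Client

client = Client()            # local cluster just for illustration
my_list = [1, 2, 3, 4]

# Case 1: the wrapped list is one object; every worker gets a replica
# and we get back a single future for the whole list.
[future_list] = client.scatter([my_list], broadcast=True)

# Case 2: each element is scattered separately, so this returns a
# list of futures, one per element.
element_futures = client.scatter(my_list, broadcast=True)

print(client.submit(len, future_list).result())   # 4
print(len(element_futures))                       # 4
```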
3
votes
1 answer

How to map a dask Series with a large dict

I'm trying to figure out the best way to map a dask Series with a large mapping. The straightforward series.map(large_mapping) issues UserWarning: Large object of size MB detected in task graph and suggests using client.scatter and client.submit…
gsakkis
  • 1,569
  • 1
  • 15
  • 24
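One way to avoid the warning is to scatter the mapping once and pass the resulting future into map_partitions; a sketch, assuming the distributed scheduler resolves futures passed as task arguments (names are made up):

```python
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client()

s = dd.from_pandas(pd.Series(range(1000), name="key"), npartitions=4)
large_mapping = {i: f"value-{i}" for i in range(1000)}   # stand-in for the big dict

# Ship the mapping to the workers once instead of embedding a copy
# of it in every task of the graph.
mapping_future = client.scatter(large_mapping, broadcast=True)

def map_with(partition, mapping):
    return partition.map(mapping)

result = s.map_partitions(map_with, mapping_future, meta=(s.name, "object"))
print(result.head())
```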
3
votes
1 answer

Can we create a Dask cluster having both multiple CPU machines and multiple GPU machines?

Can we create a dask cluster with some CPU and some GPU machines together? If yes, how do we control that a certain task must run only on a CPU machine, or that some other type of task runs only on a GPU machine, and if not specified, it should pick whichever…
TheCodeCache
  • 820
  • 1
  • 7
  • 27
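Mixed clusters are typically handled with worker resources: the GPU machines advertise a GPU resource when they start, and tasks declare what they need. A sketch, assuming a scheduler is already running at a hypothetical address:

```python
from dask.distributed import Client

# GPU machines would be started with a resource tag, e.g.:
#   dask-worker scheduler:8786 --resources "GPU=1"
# CPU-only machines are started without it.

client = Client("scheduler:8786")        # hypothetical address

def train_on_gpu(x):
    return x                             # placeholder for GPU work

def preprocess(x):
    return x                             # placeholder for CPU work

# Only runs on workers that advertised a GPU resource.
gpu_future = client.submit(train_on_gpu, 1, resources={"GPU": 1})

# No constraint: the scheduler picks any free worker.
cpu_future = client.submit(preprocess, 2)

print(gpu_future.result(), cpu_future.result())
```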
3
votes
1 answer

How to put a dataset on a gcloud kubernetes cluster?

I have a gcloud Kubernetes cluster initialized, and I'm using a Dask Client on my local machine to connect to the cluster, but I can't seem to find any documentation on how to upload my dataset to the cluster. I originally tried to just run Dask…
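Rather than pushing data through the local client, the usual approach is to put the dataset in object storage that the Kubernetes workers can read directly (e.g. GCS via gcsfs); a sketch with a hypothetical bucket:

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client("scheduler:8786")            # hypothetical address

# The workers pull the data straight from Google Cloud Storage,
# so nothing has to be uploaded from the local machine.
df = dd.read_csv(
    "gcs://my-bucket/data/*.csv",            # hypothetical bucket/path
    storage_options={"token": "cloud"},      # assumption: use node credentials
)
df = client.persist(df)
print(df.head())
```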
3
votes
1 answer

Directly running a task on a dedicated dask worker

A simple code snippet is as follows (the comments marked with ### are important): from dask.distributed import Client ### this code piece will get executed on a dask worker. def task_to_perform(): print("task in progress.") ## do something…
TheCodeCache
  • 820
  • 1
  • 7
  • 27
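Pinning a task to a specific worker is done with the workers= argument of Client.submit; a minimal sketch reusing the snippet's task_to_perform and a hypothetical worker address:

```python
from dask.distributed import Client

client = Client("scheduler:8786")            # hypothetical address

def task_to_perform():
    print("task in progress.")
    return "done"

# Run only on the named worker; allow_other_workers=False makes the
# restriction strict rather than just a preference.
future = client.submit(
    task_to_perform,
    workers=["tcp://10.0.0.5:39337"],        # hypothetical worker address
    allow_other_workers=False,
)
print(future.result())
```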
3
votes
0 answers

how to combine dask and classes?

I am trying to rewrite an entire project that has been developed with classes. Little by little, the heaviest computational chunks should be parallelized; clearly we have a lot of independent sequential loops. An example with classes that mimics…
Sergio Lucero
  • 862
  • 1
  • 12
  • 21
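One low-intrusion pattern is to keep the classes unchanged and submit the independent method calls as tasks; a sketch with a made-up Simulation class:

```python
from dask.distributed import Client

class Simulation:
    """Stand-in for one of the project's existing classes."""
    def __init__(self, seed):
        self.seed = seed

    def run(self):
        return self.seed ** 2        # placeholder for heavy, independent work

client = Client()

# Each loop iteration is independent, so submit the method calls
# instead of running them sequentially.
sims = [Simulation(i) for i in range(10)]
futures = [client.submit(Simulation.run, sim) for sim in sims]
print(client.gather(futures))
```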
3
votes
1 answer

Iterate sequentially over a dask bag

I need to submit the elements of a very large dask.bag to a non-threadsafe store, i.e. I need something like for x in dbag: store.add(x) I cannot use compute since the bag is too large to fit in memory. I need something more like…
Daniel Mahler
  • 7,653
  • 5
  • 51
  • 90
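A common workaround is to turn the bag into per-partition delayed objects and compute them one at a time, so only a single partition is ever in memory; a sketch with a hypothetical store:

```python
import dask.bag as db
from dask.distributed import Client

class Store:                         # hypothetical non-threadsafe store
    def add(self, x):
        pass

client = Client()
dbag = db.from_sequence(range(1_000_000), npartitions=100)
store = Store()

# Compute one partition at a time and feed its elements to the store
# sequentially, so the whole bag never sits in memory at once.
for partition in dbag.to_delayed():
    for x in partition.compute():
        store.add(x)
```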
3
votes
1 answer

With dask-distributed, how to generate futures from long-running tasks fed by queues

I'm using a dask-distributed long-running task along the lines of this example http://matthewrocklin.com/blog/work/2017/02/11/dask-tensorflow where a long-running worker task gets its inputs from a queue as in the tensorflow example and delivers…
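The pattern from that post can be reduced to a long-running task that pulls work from a distributed Queue and pushes results to another one; a rough sketch (not the post's full TensorFlow setup):

```python
from dask.distributed import Client, Queue

client = Client()
inputs = Queue("inputs")
outputs = Queue("outputs")

def long_running_task(in_q, out_q):
    # Keep consuming items until a sentinel arrives.
    while True:
        item = in_q.get()
        if item is None:
            break
        out_q.put(item * 2)          # stand-in for the real work

# The task occupies a worker for its whole lifetime.
runner = client.submit(long_running_task, inputs, outputs)

for i in range(5):
    inputs.put(i)
inputs.put(None)                     # ask the task to stop

print([outputs.get() for _ in range(5)])
runner.result()                      # surface any error from the task
```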
3
votes
1 answer

Saving dataframe divisions to parquet with dask

I am currently trying to save and read information from dask to parquet files. But when saving a dataframe with dask "to_parquet" and loading it again afterwards with "read_parquet", it seems like the division information gets…
lennart
  • 33
  • 3
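Divisions come from the index, so they only survive the round trip if the index statistics are read back. A hedged sketch, assuming a recent dask where read_parquet accepts calculate_divisions:

```python
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(
    pd.DataFrame({"ts": pd.date_range("2023-01-01", periods=100, freq="h"),
                  "value": range(100)}),
    npartitions=4,
).set_index("ts")

df.to_parquet("out.parquet")               # the index and its statistics are written

# Assumption: calculate_divisions=True rebuilds known divisions from the
# parquet row-group statistics on the index column.
df2 = dd.read_parquet("out.parquet", calculate_divisions=True)
print(df2.known_divisions)                 # True when divisions were recovered
```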
3
votes
2 answers

Automatically adding a dataset to Dask scheduler on startup

TL;DR: I want to pre-load a dataset into the Dask Distributed scheduler when it's starting up. Background: I'm using Dask in a real-time query fashion with a smaller-than-memory dataset. Because it's real-time, it's important that the workers can trust…
Niklas B
  • 1,839
  • 18
  • 36
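One way to make a dataset available cluster-wide is to persist it once and register it under a name with publish_dataset, so later clients can fetch it without reloading; a sketch (the one-off step could live in a startup script):

```python
import dask.dataframe as dd
from dask.distributed import Client

# --- run once, e.g. right after the cluster starts ---
client = Client("scheduler:8786")             # hypothetical address
df = dd.read_parquet("/shared/dataset")       # hypothetical path
df = client.persist(df)
client.publish_dataset(main_data=df)          # register under a well-known name

# --- any later client can now reuse the in-memory dataset ---
query_client = Client("scheduler:8786")
df = query_client.get_dataset("main_data")
print(df.head())
```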
3
votes
1 answer

Lazy repartitioning of dask dataframe

After several stages of lazy dataframe processing, I need to repartition my dataframe before saving it. However, the .repartition() method requires me to know the number of partitions (as opposed to the size of partitions), and that depends on the size of…
evilkonrex
  • 255
  • 2
  • 10
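For reference, repartition can also be given a target partition size instead of a partition count, though evaluating the sizes may trigger some computation; a hedged sketch:

```python
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({"x": range(1_000_000)}), npartitions=100)

# ... lazy processing stages would go here ...

# Assumption: partition_size lets dask choose the number of partitions
# for a target size, at the cost of not being entirely lazy.
df = df.repartition(partition_size="100MB")
df.to_parquet("out.parquet")
```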
3
votes
0 answers

Reduce I/O by storing data into a dictionary shared between workers on a node using dask.distributed

I am using the dask.distributed scheduler and workers to process some large microscopy images on a cluster. I run multiple workers per node (1 core = 1 worker). Each core in the node shares 200 GB of RAM. Issue: I would like to decrease the writing…
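A common trick for reusing data across tasks that land on the same worker is to cache it on the worker object via get_worker; note this is per worker process, so sharing across separate worker processes on one node would still need shared memory or a threads-based worker layout. A sketch with hypothetical names:

```python
from dask.distributed import Client, get_worker

def load_reference_image(path):
    return path.upper()                 # stand-in for an expensive read from disk

def process_tile(tile_id, ref_path):
    worker = get_worker()
    # Cache the expensive object on this worker process so later tasks
    # running here reuse it instead of re-reading it from disk.
    cache = getattr(worker, "_ref_cache", None)
    if cache is None:
        cache = {}
        worker._ref_cache = cache
    if ref_path not in cache:
        cache[ref_path] = load_reference_image(ref_path)
    return tile_id, cache[ref_path]

client = Client()
futures = [client.submit(process_tile, i, "flatfield.tif") for i in range(4)]
print(client.gather(futures))
```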