Questions tagged [dask-distributed]

Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.

1090 questions
5 votes, 1 answer

Limit Dask CPU and Memory Usage (Single Node)

I am running Dask on a single computer, where calling .compute() to perform the computations on a huge parquet file causes Dask to use up all the CPU cores on the system. import dask.dataframe as dd df = dd.read_parquet(parquet_file) # very large…
Nyxynyx • 61,411
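A minimal sketch of the usual answer: create an explicit LocalCluster so the core count and per-worker memory are capped (the n_workers, threads_per_worker, and memory_limit values here are purely illustrative):

    from dask.distributed import Client, LocalCluster
    import dask.dataframe as dd

    # Illustrative limits: 2 worker processes, 1 thread each, 2 GB per worker.
    cluster = LocalCluster(n_workers=2, threads_per_worker=1, memory_limit="2GB")
    client = Client(cluster)

    df = dd.read_parquet("data.parquet")   # hypothetical path
    result = df.sum().compute()            # now bounded by the cluster limits

With a Client attached, .compute() routes through the distributed scheduler, so the limits set on the cluster govern the computation.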
5 votes, 1 answer

Memory clean up of Dask workers

I am running multiple parallel tasks on a multi-node distributed Dask cluster. However, once the tasks are finished, the workers still hold large amounts of memory and the cluster fills up quickly. I have tried client.restart() after every task and…
spiralarchitect • 880
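A common mitigation (a generic sketch, not taken from the thread): drop references to finished futures so the scheduler can release their results, then force a garbage-collection pass on every worker:

    import gc
    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")      # hypothetical address

    futures = client.map(process_chunk, chunks)  # placeholder function/data
    results = client.gather(futures)

    del futures              # releasing the futures lets the scheduler free the data
    client.run(gc.collect)   # run gc.collect() on every worker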
5 votes, 3 answers

Forcing Locality on Dask Dataframe Subsets

I'm trying to distribute a large Dask Dataframe across multiple machines for (later) distributed computations on the dataframe. I'm using dask-distributed for this. All the dask-distributed examples/docs I see are populating the initial data load…
CoderOfTheNight • 944
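dask.distributed does expose locality controls; a minimal sketch, assuming the worker addresses are known (the addresses below are made up):

    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")    # hypothetical address

    # Pin one partition's data to a specific worker...
    [part] = client.scatter([partition_df], workers=["tcp://worker-1:45000"])

    # ...and ask that dependent tasks run where the data lives.
    result = client.submit(len, part, workers=["tcp://worker-1:45000"])

Here partition_df stands in for whatever subset of the dataframe should live on that machine.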
5 votes, 1 answer

Trigger Dask workers to release memory

I'm distributing the computation of some functions using Dask. My general layout looks like this: from dask.distributed import Client, LocalCluster, as_completed cluster = LocalCluster(processes=config.use_dask_local_processes, …
gallamine • 865
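The levers usually suggested for this (a sketch; work, inputs, and the futures variable are placeholders for whatever the layout above produces):

    import gc

    # client is the Client attached to the LocalCluster from the layout above.
    futures = client.map(work, inputs)
    client.gather(futures)

    client.cancel(futures)   # release the keys the scheduler is holding
    del futures
    client.run(gc.collect)   # encourage workers to return memory

    # Heavier hammer: restart all worker processes, wiping their memory.
    client.restart()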
5 votes, 1 answer

Many distributed dask workers idle after one evaluation, or never receive any work, when there are more tasks…

We’re using dask to optimize deep-learner (DL) architectures by generating designs and then sending them to dask workers that, in turn, use pytorch for training. We observe that some of the workers do not appear to start, and those that do complete…
Mark Coletti • 195
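One pattern that keeps a worker pool saturated in this kind of search (a generic sketch; train_design, initial_designs, and propose are placeholders): submit a seed batch, then feed a new task to the pool each time a result arrives:

    from dask.distributed import Client, as_completed

    client = Client("tcp://scheduler:8786")    # hypothetical address

    seed = [client.submit(train_design, d) for d in initial_designs]
    pool = as_completed(seed)
    for finished in pool:
        score = finished.result()
        pool.add(client.submit(train_design, propose(score)))  # keep workers busy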
5 votes, 1 answer

How to use client.scatter correctly and when in Dask

When executing a "large" number of tasks, I receive this error: Consider scattering large objects ahead of time with client.scatter to reduce scheduler burden and keep data on workers and I am also getting a bunch of messages like…
muammar • 951
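What the warning asks for (a minimal sketch; big_object, load_data, process, and inputs stand in for any large value reused across many tasks):

    from dask.distributed import Client

    client = Client()            # local cluster for illustration
    big_object = load_data()     # hypothetical large object

    # Ship the object to the workers once and pass the resulting Future,
    # instead of re-serializing the object with every submit call.
    big_future = client.scatter(big_object, broadcast=True)
    futures = [client.submit(process, big_future, x) for x in inputs]

The point is that each task receives the Future, not the raw object, so the scheduler never has to carry the payload itself.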
5 votes, 2 answers

Dask Distributed: Reading .csv from HDFS

I'm performance testing Dask using "Distributed Pandas on a Cluster with Dask DataFrames" as a guide. In Matthew's example, he has a 20GB file and 64 workers (8 physical nodes). In my case, I have an 82GB file and 288 workers (12 physical nodes;…
jonathan • 91
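For reference, reading CSVs from HDFS follows the same pattern as local reads, just with an hdfs:// URL (a sketch; the path and blocksize are illustrative, and an HDFS backend such as pyarrow must be installed on all nodes):

    import dask.dataframe as dd

    df = dd.read_csv(
        "hdfs:///data/big-file-*.csv",   # hypothetical path
        blocksize=256e6,                 # ~256 MB partitions
    )
    print(df.npartitions)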
5 votes, 1 answer

Dask Client can't connect to dask-scheduler

I'm on dask 1.1.1 (latest version) and I have started a dask scheduler at the commandline with this command: $ dask-scheduler --port 9796 --bokeh-port 9797 --bokeh-prefix my_project distributed.scheduler - INFO -…
MetaStack • 3,266
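Connecting a client to a scheduler started on a non-default port looks like this (a sketch; the host name is made up):

    from dask.distributed import Client

    # The address must use the scheduler's host and the --port value (9796),
    # not the dashboard's --bokeh-port (9797).
    client = Client("tcp://scheduler-host:9796")
    print(client)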
5 votes, 1 answer

Why does Dask fill in "foo" and 1 in my Dataframe

I've read in around 15 csv files: df = dd.read_csv("gs://project/*.csv", blocksize=25e6, storage_options={'token': fs.session.credentials}) Then I persisted the Dataframe (it uses 7.33 GB memory): df = df.persist() I set a new…
Stanko • 4,275
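The "foo" and 1 values are Dask's dummy metadata: when Dask cannot infer an operation's output schema, it builds a tiny placeholder frame whose object columns contain "foo" and whose integer columns contain 1, and that placeholder can leak into results when the inference goes wrong. Passing meta explicitly avoids the guesswork (a sketch; the column names are made up):

    import pandas as pd
    import dask.dataframe as dd

    df = dd.read_csv("gs://project/*.csv", blocksize=25e6)  # as in the question,
                                                            # credentials omitted

    # Declare the exact output schema instead of letting Dask guess it
    # from placeholder values like "foo" and 1.
    meta = pd.DataFrame({"name": pd.Series(dtype="object"),
                         "count": pd.Series(dtype="int64")})
    out = df.map_partitions(lambda part: part[["name", "count"]], meta=meta)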
5 votes, 3 answers

Loading data from S3 to dask dataframe

I can load the data only if I change the "anon" parameter to True after making the file public. df = dd.read_csv('s3://mybucket/some-big.csv', storage_options = {'anon':False}) This is not recommended for obvious reasons. How do I load the data…
shantanuo • 31,689
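Credentials can be passed through storage_options instead of making the bucket public (a sketch; key and secret are s3fs parameters, and the values are obvious placeholders). With no storage_options at all, s3fs also falls back to the usual AWS credential chain (environment variables, ~/.aws/credentials, instance roles):

    import dask.dataframe as dd

    df = dd.read_csv(
        "s3://mybucket/some-big.csv",
        storage_options={
            "key": "AKIA...",    # placeholder access key ID
            "secret": "...",     # placeholder secret key
        },
    )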
5 votes, 1 answer

Dask compute is very slow

I have a dataframe that consists of 5 million records. I am trying to process it using the code below, leveraging dask dataframes in python: import dask.dataframe as dd dask_df = dd.read_csv(fullPath) …
Neno M. • 123
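When compute() is slow, the usual first diagnostics are attaching a Client (which provides the dashboard) and checking the partitioning (a generic sketch, not the poster's fix):

    from dask.distributed import Client
    import dask.dataframe as dd

    client = Client()                  # dashboard URL: client.dashboard_link
    df = dd.read_csv("data.csv")       # hypothetical path
    print(df.npartitions)              # too few partitions means little parallelism

    df = df.repartition(npartitions=32)   # illustrative count
    df = df.persist()                     # keep intermediates in cluster memory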
5 votes, 1 answer

memory usage when indexing a large dask dataframe on a single multicore machine

I am trying to turn the Wikipedia CirrusSearch dump into a Parquet-backed dask dataframe indexed by title on a 450G 16-core GCP instance. CirrusSearch dumps come as a single JSON-lines formatted file. The English Wikipedia dumps contain 5M records and…
Daniel Mahler • 7,653
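Setting an index is the memory-hungry step, since it shuffles the whole dataset. A sketch of the usual route (the file name and blocksize are illustrative):

    import dask.dataframe as dd

    df = dd.read_json("cirrus-dump.json", lines=True, blocksize=256e6)
    df = df.set_index("title")        # full shuffle; this is the expensive part
    df.to_parquet("wiki-parquet/")    # persist so the shuffle only runs once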
5 votes, 2 answers

How to replicate data when it is faster to compute than transfer in dask distributed?

I have a largish object (150 MB) that I need to broadcast to all dask distributed workers so it can be used in future tasks. I've tried a couple of approaches: Client.scatter(broadcast=True): This required sending all the data from one machine…
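The API aimed at exactly this is Client.replicate, which copies worker-held data between workers peer-to-peer; and when computing is cheaper than transferring, running the constructor on every worker avoids the network entirely (a sketch; build_object is a placeholder):

    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")   # hypothetical address

    # Option 1: compute once, then fan copies out between workers.
    future = client.submit(build_object)
    client.replicate([future])                # copy the result to every worker

    # Option 2: rebuild locally on each worker instead of shipping 150 MB.
    per_worker = client.run(build_object)     # runs build_object on every worker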
5 votes, 0 answers

Dask groupby with multiple columns issue

I have the following dataframe, created using the dataframe.from_delayed method, that has the following columns: _id, hour_timestamp, http_method, total_hits, username, hour, weekday. Some details on the source…
Apostolos • 7,763
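For reference, a multi-column groupby in dask mirrors the pandas pattern (a sketch using the question's column names):

    # df is the dataframe built via from_delayed in the question.
    hits = (
        df.groupby(["username", "http_method"])["total_hits"]
          .sum()
          .compute()
    )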
5 votes, 0 answers

Dask assigning to dataframe column by index throws ValueError

I have a pipeline of transformations on a grouped-by dataframe. All functions get a DataFrameGroupBy and compute some features. Those features are then stored in a DataFrame. The index of the dataframe is the same, since all features are derived by…
Apostolos • 7,763
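Dask dataframes support far less in-place mutation than pandas, and a ValueError on index-based assignment often comes down to mismatched divisions. The supported route is whole-column assignment between frames that share an index (a sketch; df and feature_series are placeholders):

    # Both df and feature_series must have the same divisions,
    # e.g. because they derive from the same source dataframe.
    df = df.assign(new_feature=feature_series)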