Questions tagged [dask-distributed]

Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.

1090 questions
5 votes, 1 answer

Limit Dask CPU and Memory Usage (Single Node)

I am running Dask on a single computer, where calling .compute() to perform the computations on a huge parquet file causes Dask to use up all the CPU cores on the system. import dask.dataframe as dd df = dd.read_parquet(parquet_file) # very large…
Nyxynyx • 61,411
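A minimal sketch of the usual answer: create an explicit LocalCluster so the core count and per-worker memory are capped (the n_workers, threads_per_worker, and memory_limit values here are purely illustrative):

    from dask.distributed import Client, LocalCluster
    import dask.dataframe as dd

    # Illustrative limits: 2 worker processes, 1 thread each, 2 GB per worker.
    cluster = LocalCluster(n_workers=2, threads_per_worker=1, memory_limit="2GB")
    client = Client(cluster)

    df = dd.read_parquet("data.parquet")   # hypothetical path
    result = df.sum().compute()            # now bounded by the cluster limits

With a Client attached, .compute() routes through the distributed scheduler, so the limits set on the cluster govern the computation.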
5 votes, 1 answer

Memory clean up of Dask workers

I am running multiple parallel tasks on a multi-node distributed Dask cluster. However, once the tasks are finished, the workers still hold large amounts of memory and the cluster fills up quickly. I have tried client.restart() after every task and…
spiralarchitect • 880
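A common mitigation (a generic sketch, not taken from the thread): drop references to finished futures so the scheduler can release their results, then force a garbage-collection pass on every worker:

    import gc
    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")      # hypothetical address

    futures = client.map(process_chunk, chunks)  # placeholder function/data
    results = client.gather(futures)

    del futures              # releasing the futures lets the scheduler free the data
    client.run(gc.collect)   # run gc.collect() on every worker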
5 votes, 3 answers

Forcing Locality on Dask Dataframe Subsets

I'm trying to distribute a large Dask Dataframe across multiple machines for (later) distributed computations on the dataframe. I'm using dask-distributed for this. All the dask-distributed examples/docs I see are populating the initial data load…
CoderOfTheNight • 944
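dask.distributed does expose locality controls; a minimal sketch, assuming the worker addresses are known (the addresses below are made up):

    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")    # hypothetical address

    # Pin one partition's data to a specific worker...
    [part] = client.scatter([partition_df], workers=["tcp://worker-1:45000"])

    # ...and ask that dependent tasks run where the data lives.
    result = client.submit(len, part, workers=["tcp://worker-1:45000"])

Here partition_df stands in for whatever subset of the dataframe should live on that machine.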
5 votes, 1 answer

Trigger Dask workers to release memory

I'm distributing the computation of some functions using Dask. My general layout looks like this: from dask.distributed import Client, LocalCluster, as_completed cluster = LocalCluster(processes=config.use_dask_local_processes, …
gallamine • 865
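The levers usually suggested for this (a sketch; work, inputs, and the futures variable are placeholders for whatever the layout above produces):

    import gc

    # client is the Client attached to the LocalCluster from the layout above.
    futures = client.map(work, inputs)
    client.gather(futures)

    client.cancel(futures)   # release the keys the scheduler is holding
    del futures
    client.run(gc.collect)   # encourage workers to return memory

    # Heavier hammer: restart all worker processes, wiping their memory.
    client.restart()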
5 votes, 1 answer

Many distributed dask workers idle after one evaluation, or never receive any work, when there are more tasks…

We’re using dask to optimize deep-learner (DL) architectures by generating designs and then sending them to dask workers that, in turn, use pytorch for training. We observe that some of the workers do not appear to start, and those that do complete…
Mark Coletti • 195
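One pattern that keeps a worker pool saturated in this kind of search (a generic sketch; train_design, initial_designs, and propose are placeholders): submit a seed batch, then feed a new task to the pool each time a result arrives:

    from dask.distributed import Client, as_completed

    client = Client("tcp://scheduler:8786")    # hypothetical address

    seed = [client.submit(train_design, d) for d in initial_designs]
    pool = as_completed(seed)
    for finished in pool:
        score = finished.result()
        pool.add(client.submit(train_design, propose(score)))  # keep workers busy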
5 votes, 1 answer

How to use client.scatter correctly and when in Dask

When executing a "large" number of tasks, I receive this error: Consider scattering large objects ahead of time with client.scatter to reduce scheduler burden and keep data on workers and I am also getting a bunch of messages like…
muammar • 951
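What the warning asks for (a minimal sketch; big_object, load_data, process, and inputs stand in for any large value reused across many tasks):

    from dask.distributed import Client

    client = Client()            # local cluster for illustration
    big_object = load_data()     # hypothetical large object

    # Ship the object to the workers once and pass the resulting Future,
    # instead of re-serializing the object with every submit call.
    big_future = client.scatter(big_object, broadcast=True)
    futures = [client.submit(process, big_future, x) for x in inputs]

The point is that each task receives the Future, not the raw object, so the scheduler never has to carry the payload itself.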
5 votes, 2 answers

Dask Distributed: Reading .csv from HDFS

I'm performance testing Dask using "Distributed Pandas on a Cluster with Dask DataFrames" as a guide. In Matthew's example, he has a 20GB file and 64 workers (8 physical nodes). In my case, I have an 82GB file and 288 workers (12 physical nodes;…
jonathan • 91
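For reference, reading CSVs from HDFS follows the same pattern as local reads, just with an hdfs:// URL (a sketch; the path and blocksize are illustrative, and an HDFS backend such as pyarrow must be installed on all nodes):

    import dask.dataframe as dd

    df = dd.read_csv(
        "hdfs:///data/big-file-*.csv",   # hypothetical path
        blocksize=256e6,                 # ~256 MB partitions
    )
    print(df.npartitions)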
5 votes, 1 answer

Dask Client can't connect to dask-scheduler

I'm on dask 1.1.1 (latest version) and I have started a dask scheduler at the commandline with this command: $ dask-scheduler --port 9796 --bokeh-port 9797 --bokeh-prefix my_project distributed.scheduler - INFO -…
MetaStack • 3,266
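Connecting a client to a scheduler started on a non-default port looks like this (a sketch; the host name is made up):

    from dask.distributed import Client

    # The address must use the scheduler's host and the --port value (9796),
    # not the dashboard's --bokeh-port (9797).
    client = Client("tcp://scheduler-host:9796")
    print(client)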
5 votes, 1 answer

Why does Dask fill in "foo" and 1 in my Dataframe

I've read in around 15 csv files: df = dd.read_csv("gs://project/*.csv", blocksize=25e6, storage_options={'token': fs.session.credentials}) Then I persisted the Dataframe (it uses 7.33 GB memory): df = df.persist() I set a new…
Stanko • 4,275
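The "foo" and 1 values are Dask's dummy metadata: when Dask cannot infer an operation's output schema, it builds a tiny placeholder frame whose object columns contain "foo" and whose integer columns contain 1, and that placeholder can leak into results when the inference goes wrong. Passing meta explicitly avoids the guesswork (a sketch; the column names are made up):

    import pandas as pd
    import dask.dataframe as dd

    df = dd.read_csv("gs://project/*.csv", blocksize=25e6)  # as in the question,
                                                            # credentials omitted

    # Declare the exact output schema instead of letting Dask guess it
    # from placeholder values like "foo" and 1.
    meta = pd.DataFrame({"name": pd.Series(dtype="object"),
                         "count": pd.Series(dtype="int64")})
    out = df.map_partitions(lambda part: part[["name", "count"]], meta=meta)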
5 votes, 3 answers

Loading data from S3 to dask dataframe

I can load the data only if I change the "anon" parameter to True after making the file public. df = dd.read_csv('s3://mybucket/some-big.csv', storage_options = {'anon':False}) This is not recommended for obvious reasons. How do I load the data…
shantanuo • 31,689
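Credentials can be passed through storage_options instead of making the bucket public (a sketch; key and secret are s3fs parameters, and the values are obvious placeholders). With no storage_options at all, s3fs also falls back to the usual AWS credential chain (environment variables, ~/.aws/credentials, instance roles):

    import dask.dataframe as dd

    df = dd.read_csv(
        "s3://mybucket/some-big.csv",
        storage_options={
            "key": "AKIA...",    # placeholder access key ID
            "secret": "...",     # placeholder secret key
        },
    )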
5 votes, 1 answer

Dask compute is very slow

I have a dataframe that consists of 5 million records. I am trying to process it using the code below, leveraging dask dataframes in python: import dask.dataframe as dd dask_df = dd.read_csv(fullPath) …
Neno M. • 123
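When compute() is slow, the usual first diagnostics are attaching a Client (which provides the dashboard) and checking the partitioning (a generic sketch, not the poster's fix):

    from dask.distributed import Client
    import dask.dataframe as dd

    client = Client()                  # dashboard URL: client.dashboard_link
    df = dd.read_csv("data.csv")       # hypothetical path
    print(df.npartitions)              # too few partitions means little parallelism

    df = df.repartition(npartitions=32)   # illustrative count
    df = df.persist()                     # keep intermediates in cluster memory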
5 votes, 1 answer

memory usage when indexing a large dask dataframe on a single multicore machine

I am trying to turn the Wikipedia CirrusSearch dump into a Parquet-backed dask dataframe indexed by title on a 450G 16-core GCP instance. CirrusSearch dumps come as a single JSON-lines formatted file. The English Wikipedia dumps contain 5M records and…
Daniel Mahler • 7,653
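Setting an index is the memory-hungry step, since it shuffles the whole dataset. A sketch of the usual route (the file name and blocksize are illustrative):

    import dask.dataframe as dd

    df = dd.read_json("cirrus-dump.json", lines=True, blocksize=256e6)
    df = df.set_index("title")        # full shuffle; this is the expensive part
    df.to_parquet("wiki-parquet/")    # persist so the shuffle only runs once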
5 votes, 2 answers

How to replicate data when it is faster to compute than transfer in dask distributed?

I have a largish object (150 MB) that I need to broadcast to all dask distributed workers so it can be used in future tasks. I've tried a couple of approaches: Client.scatter(broadcast=True): This required sending all the data from one machine…
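The API aimed at exactly this is Client.replicate, which copies worker-held data between workers peer-to-peer; and when computing is cheaper than transferring, running the constructor on every worker avoids the network entirely (a sketch; build_object is a placeholder):

    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")   # hypothetical address

    # Option 1: compute once, then fan copies out between workers.
    future = client.submit(build_object)
    client.replicate([future])                # copy the result to every worker

    # Option 2: rebuild locally on each worker instead of shipping 150 MB.
    per_worker = client.run(build_object)     # runs build_object on every worker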
5 votes, 0 answers

Dask groupby with multiple columns issue

I have the following dataframe, created using the dataframe.from_delayed method, that has the following columns: _id, hour_timestamp, http_method, total_hits, username, hour, weekday. Some details on the source…
Apostolos • 7,763
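For reference, a multi-column groupby in dask mirrors the pandas pattern (a sketch using the question's column names):

    # df is the dataframe built via from_delayed in the question.
    hits = (
        df.groupby(["username", "http_method"])["total_hits"]
          .sum()
          .compute()
    )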
5 votes, 0 answers

Dask assigning to dataframe column by index throws ValueError

I have a pipeline of transformations on a grouped-by dataframe. All functions get a DataFrameGroupBy and compute some features. Those features are then stored in a DataFrame. The index of the dataframe is the same, since all features are derived by…
Apostolos • 7,763
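Dask dataframes support far less in-place mutation than pandas, and a ValueError on index-based assignment often comes down to mismatched divisions. The supported route is whole-column assignment between frames that share an index (a sketch; df and feature_series are placeholders):

    # Both df and feature_series must have the same divisions,
    # e.g. because they derive from the same source dataframe.
    df = df.assign(new_feature=feature_series)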