Questions tagged [dask-distributed]

Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.
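
As a rough illustration of the futures-style interface mentioned above, here is a minimal sketch that uses only the standard-library `concurrent.futures` module, so it runs without a cluster. The names `square` and `run_all` are hypothetical and exist only for this example. With dask.distributed, a `Client` offers a similar `submit()` call that returns futures, though details such as scheduling and serialization differ.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def square(x):
    """Hypothetical example task: fails on negative input."""
    if x < 0:
        raise ValueError(f"negative input: {x}")
    return x * x


def run_all(inputs):
    """Submit one task per input and collect results and errors separately."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Map each future back to the input that produced it.
        futures = {executor.submit(square, x): x for x in inputs}
        for future in as_completed(futures):
            x = futures[future]
            try:
                # result() re-raises any exception raised inside the task.
                results[x] = future.result()
            except ValueError as exc:
                errors[x] = str(exc)
    return results, errors


results, errors = run_all([1, 2, -3, 4])
```

Catching the exception around each `result()` call keeps one failing task from discarding the results of the others, which is the usual pattern for embarrassingly parallel work.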

1090 questions
0 votes, 1 answer

How to prevent dask client from dying on worker exception?

I don't understand the resiliency model in dask distributed. Problem: An exception raised by a worker kills an embarrassingly parallel dask operation. All workers and clients die if any worker encounters an exception. Expected behavior: Reading here:…
bw4sz
0 votes, 1 answer

What does one enter on the command line to run spark in a bokeh serve app? Do I simply separate the two command line entries by &&?

My effort does not work: `/usr/local/spark/spark-2.3.2-bin-hadoop2.7/bin/spark-submit --driver-memory 6g --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.2 runspark.py && bokeh serve --show bokeh_app` runspark.py contains the…
0 votes, 1 answer

parallel execution of dask `DataFrame.set_index()`

I am trying to create an index on a large dask dataframe. No matter which scheduler I use, I am unable to utilize more than the equivalent of one core for the operation. The code is: (ddf.read_parquet(pq_in).set_index('title', drop=True,…
0 votes, 1 answer

Limitations to using LocalCluster? Crashing persisting 50GB of data to 90GB of memory

System info: CentOS, Python 3.5.2, 64 cores, 96 GB RAM. I'm trying to load a large array (50GB) from an HDF file into RAM (96GB). Each chunk is around 1.5GB, less than the worker memory limit. It never seems to complete, sometimes crashing or…
dead_zero
0 votes, 1 answer

Cannot start dask cluster over SSH

I'm trying to start a dask cluster over SSH, but I am encountering strange errors like these: Exception in thread Thread-6: Traceback (most recent call last): File "/home/localuser/miniconda3/lib/python3.6/threading.py", line 916, in…
suvayu
0 votes, 1 answer

dask jobqueue worker failure at startup 'Resource temporarily unavailable'

I'm running dask over slurm via jobqueue and I have been getting 3 errors pretty consistently... Basically my question is what could be causing these failures? At first glance the problem is that too many workers are writing to disk at once, or my…
Mr. Buttons
0 votes, 1 answer

Route to dask worker debug pages

The docs say: Debug Worker pages for each worker at http://worker-address:8789. These pages have detailed diagnostic information about the worker. Like the diagnostic scheduler pages they are of more utility to developers or to people looking to…
0 votes, 0 answers

Is there a way to store and display dask distributed history

Is there a way to store and display (over Bokeh) dask distributed history? I would like to analyse and compare old dask distributed runs.
sami
0 votes, 1 answer

How to composite tasks in dask-distributed

I am trying to run a joblib parallel loop inside of a threaded dask-distributed cluster (see below for the reason), but I can't get any speedup because of the GIL. Here's an example: def task(x): """ Sample single-process task that takes between 2 and…
A32167
0 votes, 1 answer

Analyzing data flow of Dask dataframes

I have a dataset stored in a tab-separated text file. The file looks as follows: date time temperature 2010-01-01 12:00:00 10.0000 ... where the temperature column contains values in degrees Celsius (°C). I compute the daily average…
Giorgio
0 votes, 1 answer

Unable to catch KeyboardInterrupt exception after starting dask.distributed Client/LocalClient

I'm trying to use Ctrl+C to gracefully stop my running code, including a local dask.distributed Client. The code below is an example of my setup. When I use Ctrl+C, the stop() method is called properly, however the dask Client seems to be improperly…
0 votes, 1 answer

Dask distributed perform computations without returning data

I have a dynamic Dask Kubernetes cluster. I want to load 35 parquet files (about 1.2GB) from Gcloud storage into a Dask DataFrame, process it with apply(), and then save the result as a parquet file to Gcloud. During loading files from Gcloud…
Vladyslav Moisieienkov
0 votes, 2 answers

Dask Distributed with Asynchronous Real-time Parallelism

I'm reading the documentation on dask.distributed and it looks like I could submit functions to the distributed cluster via client.submit(). I have an existing function some_func that is grabbing individual documents (say, a text file)…
slaw
0 votes, 0 answers

YarnCluster constructor hangs in dask-yarn

I'm using dask-yarn version 0.3.1, following the basic example on https://dask-yarn.readthedocs.io/en/latest/. from dask_yarn import YarnCluster from dask.distributed import Client # Create a cluster where each worker has two cores and eight GB of…
0 votes, 0 answers

Dask client scatter is taking a long time for size of file dict in memory

I'm new to Dask and have recently made my foray into parallel computing with this wonderful package. However, in my implementation, I've been struggling to understand why it takes 6 minutes to scatter a Python dict in my scheduler…