Questions tagged [dask-distributed]

Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.

1090 questions
3
votes
0 answers

Dask tasks failing because they timed out trying to connect

I am trying to perform some calculations on xarray data. The data has lat, lon and time coordinates, and multiple data variables. My calculation is performed on a single timestep. In an attempt to parallelize this I am using the dask distributed…
phrasper
  • 41
  • 4
3
votes
1 answer

Growing memory usage (leak?) in Dask Distributed profiler

I have a longish running task that I submit to a Dask cluster (worker is running 1 process and 1 thread) and I use tracemalloc to track memory usage. The task can run long enough that memory usage builds up and has caused all sorts of problems. …
Alex P
  • 71
  • 5
3
votes
1 answer

Dask: Update published dataset periodically and pull data from other clients

I would like to append data to a published Dask dataset from a queue (like Redis). Other Python programs would then be able to fetch the latest data (e.g. once per second/minute) and do some further operations. Would that be possible? Which append…
gies0r
  • 4,723
  • 4
  • 39
  • 50
3
votes
1 answer

"404 Not found" when trying to connect to dask dashboard

I'm using dask.distributed on a remote machine accessible via SSH, and attempting to connect to the Dask dashboard. I remember it's worked before (in other virtual environments) when I was doing my first steps with Dask, but now any time I try to…
David
  • 437
  • 1
  • 4
  • 15
3
votes
1 answer

Parallelizing a Dask aggregation

Building off of this post, I implemented the custom mode formula, but have found issues with performance on this function. Essentially, when I enter into this aggregation, my cluster only uses one of my threads, which is not great for performance. I…
3
votes
1 answer

dask.delayed KeyError with distributed scheduler

I have a function interpolate_to_particles written in C and wrapped with ctypes. I want to use dask.delayed to make a series of calls to this function. The code runs successfully without dask: # Interpolate w/o dask result =…
elltrain
  • 82
  • 4
3
votes
1 answer

Computing dask array chunks asynchronously (Dask + FastAPI)

I am building a FastAPI application that will serve chunks of a Dask Array. I would like to leverage FastAPI's asynchronous functionality alongside Dask-distributed's ability to operate asynchronously. Below is an MCVE that demonstrates what I'm…
jhamman
  • 5,867
  • 19
  • 39
3
votes
0 answers

How to pick proper number of threads, workers, processes for Dask when running in an ephemeral environment as single machine and cluster

Our company is currently leveraging prefect.io for data workflows (ELT, report generation, ML, etc). We have just started adding the ability to do parallel task execution, which is powered by Dask. Our flows are executed using ephemeral AWS Fargate…
braunk
  • 31
  • 2
3
votes
1 answer

Dask - How to cancel and resubmit stalled tasks?

Frequently, I encounter an issue where Dask randomly stalls on a couple tasks, usually tied to a read of data from a different node on my network (more details about this below). This can happen after several hours of running the script with no…
dan
  • 183
  • 13
3
votes
1 answer

client.upload_file() for nested modules

I have a project structured as follows: - topmodule/ - childmodule1/ - my_func1.py - childmodule2/ - my_func2.py - common.py - __init__.py From my Jupyter notebook on an edge-node of a Dask cluster, I am doing the…
Jenna Kwon
  • 1,212
  • 1
  • 12
  • 22
3
votes
1 answer

How to use group by describe with unstack operation in python dask?

I am trying to use the describe() and unstack() functions in dask to get the summary statistics of the data. However, I get an error as shown below: import dask.dataframe as dd df =…
The Great
  • 7,215
  • 7
  • 40
  • 128
3
votes
2 answers

Is it possible to read a .tiff file from a remote service with dask?

I'm storing .tiff files on Google Cloud Storage. I'd like to manipulate them using a distributed Dask cluster installed with Helm on Kubernetes. Based on the dask-image repo, the Dask documentation on remote data services, and the use of…
skeller88
  • 4,276
  • 1
  • 32
  • 34
3
votes
0 answers

Dask workers time out shortly after starting

Good Afternoon SO, I am trying to deploy a WRF post-processing solution in Python using Dask and wrf-python that is run on a cluster, however I am encountering an issue with the interactivity between the dask scheduler and the worker instances. In…
Phantom139
  • 143
  • 9
3
votes
0 answers

Dask: iterate over dataframe groups (implement a state machine given event stream)

Given an event stream for each key, I would like to maintain some internal state, and emit a state history for each event. A naive implementation would simply chunk the data by key, iterate over the events in order, maintain some internal state in…
Alexander David
  • 769
  • 2
  • 8
  • 19
3
votes
1 answer

How should I load a memory-intensive helper object per-worker in dask distributed?

I am currently trying to parse a very large number of text documents using dask + spaCy. SpaCy requires that I load a relatively large Language object, and I would like to load this once per worker. I have a couple of mapping functions that I would…
JSybrandt
  • 108
  • 1
  • 7