Questions tagged [dask-distributed]

Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.

1090 questions
3
votes
1 answer

Using Dask from script

Is it possible to run dask from a python script? In an interactive session I can just write from dask.distributed import Client; client = Client() as described in all tutorials. However, if I write these lines in a script.py file and execute it with python…
DerWeh
  • 1,721
  • 1
  • 15
  • 26
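A minimal sketch of the usual answer: wrap cluster startup in an `if __name__ == "__main__":` guard, because process-based workers re-import the script on spawn. This sketch uses `processes=False` (in-process workers) purely for illustration.

```python
from dask.distributed import Client

def main():
    # processes=False keeps workers in-process for this sketch; with the
    # default process-based workers, Client() must not run at import time,
    # which is what the __main__ guard below ensures.
    client = Client(processes=False, n_workers=1, threads_per_worker=2)
    result = client.submit(sum, [1, 2, 3]).result()
    client.close()
    return result

if __name__ == "__main__":
    print(main())
```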
3
votes
1 answer

Dask - Quickest way to get row length of each partition in a Dask dataframe

I'd like to get the length of each partition in a number of dataframes. I'm presently getting each partition and then getting the size of the index for each partition. This is very, very slow. Is there a better way? Here's a simplified snippet of…
dan
  • 183
  • 13
3
votes
1 answer

reading a Dask DataFrame from CSVs in a deep S3 path hierarchy

I am trying to read a set of CSVs in S3 into a Dask DataFrame. The bucket has a deep hierarchy and contains some metadata files as well. The call looks like dd.read_csv('s3://mybucket/dataset/*/*/*/*/*/*.csv'). This causes Dask to hang. The real…
Daniel Mahler
  • 7,653
  • 5
  • 51
  • 90
3
votes
1 answer

Dask Distributed - Same persist data multiple clients

We are trying Dask Distributed to run some heavy computations and visualizations for a frontend. Right now we have one gunicorn worker that connects to an existing Dask Distributed cluster; the worker currently uploads the data with read_csv and persist…
CValenzu
  • 31
  • 2
3
votes
1 answer

How can I get result of Dask compute on a different machine than the one that submitted it?

I am using Dask behind a Django server and the basic setup I have is summarised here: https://github.com/MoonVision/django-dask-demo/ where the Dask client can be found here:…
Matt Nicolls
  • 173
  • 1
  • 7
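One approach, sketched here with two clients sharing one local scheduler: stash the future under a well-known name in a distributed `Variable`, and let the other machine's client look it up and collect the result. The name "job-result" is invented for the example.

```python
from dask.distributed import Client, LocalCluster, Variable

cluster = LocalCluster(processes=False, n_workers=1)
submitter = Client(cluster)

# the submitting side stores its future under an agreed-upon name
future = submitter.submit(sum, [1, 2, 3])
Variable("job-result", client=submitter).set(future)

# the collecting side (another machine, in reality) connects to the same
# scheduler address and fetches the future by name
collector = Client(cluster.scheduler_address)
result = Variable("job-result", client=collector).get().result()

collector.close()
submitter.close()
cluster.close()
```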
3
votes
1 answer

Actors and dask-workers

client = Client('127.0.0.1:8786', direct_to_workers=True); future1 = client.submit(Counter, workers='ninja', actor=True); counter1 = future1.result(); print(counter1). All is well, but what if the client gets restarted? How do I…
chak
  • 31
  • 2
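For context, a minimal runnable baseline of the actor pattern the question starts from, on an in-process cluster; the `Counter` class stands in for the one in the question.

```python
from dask.distributed import Client

class Counter:
    # toy stateful actor; its state lives on one worker
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

client = Client(processes=False, n_workers=1)

# actor=True pins a single Counter instance to a worker; method calls
# return ActorFutures that resolve with .result()
counter = client.submit(Counter, actor=True).result()
first = counter.increment().result()
second = counter.increment().result()
client.close()
```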
3
votes
1 answer

How do I share a large read-only object across Dask distributed workers

The Problem I'm trying to send a 2 GB CPython read-only object (can be pickled) to dask distributed workers via apply(). This ends up consuming a lot of memory for processes/threads (14+ GB). Is there a way to load the object only once into memory…
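One commonly suggested technique, sketched here on a small list standing in for the 2 GB object: scatter it once with `broadcast=True`, then pass the resulting future into tasks so each task uses the worker-local copy instead of shipping its own serialized copy.

```python
from dask.distributed import Client

client = Client(processes=False, n_workers=2, threads_per_worker=1)

# ship the object to the cluster once; broadcast=True replicates it to
# every worker up front
big = list(range(100_000))  # stand-in for the 2 GB read-only object
big_future = client.scatter(big, broadcast=True)

# tasks that receive the future are handed the worker-local copy
result = client.submit(len, big_future).result()
client.close()
print(result)
```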
3
votes
0 answers

Worker crashes during simple aggregation

I am trying to aggregate various columns on a 450 million row data set. When I use Dask's built-in aggregations like 'min', 'max', 'std', and 'mean', they keep crashing a worker in the process. The file that I am using can be found here:…
DannyK
  • 103
  • 2
  • 10
3
votes
1 answer

Scheduler closing stream warning

I have a periodic batch job running on my laptop. The code looks like this: client = Client(); print(client.scheduler_info()); topic = 'raw_data'; start = datetime.datetime.now(); delta = datetime.timedelta(minutes=2); while True: end = start + delta …
Apostolos
  • 7,763
  • 17
  • 80
  • 150
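If the warning stems from connections being opened and dropped each iteration, one sketch of an alternative is a single long-lived client reused for the whole loop (three iterations stand in for `while True`):

```python
import datetime

from dask.distributed import Client

# one long-lived client for the whole batch loop; the context manager
# closes it cleanly at the end
with Client(processes=False, n_workers=1) as client:
    start = datetime.datetime.now()
    delta = datetime.timedelta(minutes=2)
    totals = []
    for _ in range(3):  # stand-in for `while True`
        end = start + delta
        totals.append(client.submit(sum, [1, 2, 3]).result())
        start = end
```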
3
votes
0 answers

split bigquery dataframe into chunks using dask

I searched and tested different ways to split a BigQuery dataframe into chunks of 75 rows, but couldn't find a way to do so. Here is the scenario: I got a very large BigQuery dataframe (millions of rows) using Python and GCP…
MT467
  • 668
  • 2
  • 15
  • 31
3
votes
0 answers

Writing Dask/XArray to NetCDF - Parallel IO

I am using Dask/Xarray with a ~150 GB dataset on a distributed cluster on an HPC system. I have the computation component complete, which takes about ~30 minutes. I want to save the final result to a NetCDF4 file, but writing the data to a NetCDF…
Rowan_Gaffney
  • 452
  • 5
  • 17
3
votes
1 answer

jupyter lab open an iframe on a tab for monitoring dask scheduler

I am developing with dask distributed, and this package provides a very useful debugging view as a Bokeh application. I want to have this application next to my notebook in a JupyterLab tab. I have managed to do so by opening the jupyter lab…
3
votes
1 answer

How do I get adaptive dask workers to run some code on startup?

I'm creating a dask scheduler using dask-kubernetes and putting it into adaptive mode. from dask_kubernetes import KubeCluster; cluster = KubeCluster(); cluster.adapt(minimum=0, maximum=40) I need each worker to run some setup code when they are…
Jacob Tomlinson
  • 3,341
  • 2
  • 31
  • 62
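One documented route is a `WorkerPlugin`: its `setup` hook runs on every worker that joins, including workers an adaptive cluster scales up later. A sketch on an in-process cluster (the `initialized` attribute is made up for the example):

```python
from dask.distributed import Client, WorkerPlugin, get_worker

class SetupPlugin(WorkerPlugin):
    # setup() runs on each worker when it joins the cluster
    def setup(self, worker):
        worker.initialized = True  # stand-in for real setup work

def check():
    # runs as a task on a worker; reports whether setup() happened there
    return getattr(get_worker(), "initialized", False)

client = Client(processes=False, n_workers=1)
client.register_worker_plugin(SetupPlugin())

ok = client.submit(check).result()
client.close()
print(ok)
```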
3
votes
1 answer

How to reliably clean up dask scheduler/worker

I'm starting up a dask cluster in an automated way by ssh-ing into a bunch of machines and running dask-worker. I noticed that I sometimes run into problems when processes from a previous experiment are still running. What's the best way to clean up…
John
  • 935
  • 6
  • 17
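From the Python side, `Client.shutdown()` (unlike `Client.close()`) asks the scheduler to terminate itself and its workers, which helps avoid leftovers between experiments; on the CLI side, `dask-worker --death-timeout <seconds>` makes workers exit if they lose the scheduler. A sketch:

```python
from dask.distributed import Client

client = Client(processes=False, n_workers=1)

# shutdown() closes the scheduler and all workers, not just this client's
# connection, so nothing lingers for the next run
client.shutdown()
print(client.status)
```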
3
votes
1 answer

Tensorflow + joblib: limited to 8 processes?

I created a statistical estimator using TensorFlow. I followed sklearn's estimators, so I have a class that packages everything, including importing TensorFlow and starting TF's session (if I import TF outside the class, nothing works in parallel at…