Questions tagged [dask-distributed]

Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.

1090 questions
3
votes
1 answer

Using Dask from script

Is it possible to run dask from a python script? In an interactive session I can just write from dask.distributed import Client; client = Client() as described in all tutorials. However, if I write these lines in a script.py file and execute it with python…
DerWeh
  • 1,721
  • 1
  • 15
  • 26
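A minimal sketch of the usual answer: wrap cluster startup in an `if __name__ == "__main__":` guard, because process-based workers re-import the script on spawn. This sketch uses `processes=False` (in-process workers) purely for illustration.

```python
from dask.distributed import Client

def main():
    # processes=False keeps workers in-process for this sketch; with the
    # default process-based workers, Client() must not run at import time,
    # which is what the __main__ guard below ensures.
    client = Client(processes=False, n_workers=1, threads_per_worker=2)
    result = client.submit(sum, [1, 2, 3]).result()
    client.close()
    return result

if __name__ == "__main__":
    print(main())
```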
3
votes
1 answer

Dask - Quickest way to get row length of each partition in a Dask dataframe

I'd like to get the length of each partition in a number of dataframes. I'm presently getting each partition and then getting the size of the index for each partition. This is very, very slow. Is there a better way? Here's a simplified snippet of…
dan
  • 183
  • 13
3
votes
1 answer

reading a Dask DataFrame from CSVs in a deep S3 path hierarchy

I am trying to read a set of CSVs in S3 into a Dask DataFrame. The bucket has a deep hierarchy and contains some metadata files as well. The call looks like dd.read_csv('s3://mybucket/dataset/*/*/*/*/*/*.csv'). This causes Dask to hang. The real…
Daniel Mahler
  • 7,653
  • 5
  • 51
  • 90
3
votes
1 answer

Dask Distributed - Same persist data multiple clients

We are trying Dask Distributed to run some heavy computations and visualizations for a frontend. Right now we have one gunicorn worker that connects to an existing Dask Distributed cluster; the worker currently uploads the data with read_csv and persist…
CValenzu
  • 31
  • 2
3
votes
1 answer

How can I get result of Dask compute on a different machine than the one that submitted it?

I am using Dask behind a Django server and the basic setup I have is summarised here: https://github.com/MoonVision/django-dask-demo/ where the Dask client can be found here:…
Matt Nicolls
  • 173
  • 1
  • 7
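One approach, sketched here with two clients sharing one local scheduler: stash the future under a well-known name in a distributed `Variable`, and let the other machine's client look it up and collect the result. The name "job-result" is invented for the example.

```python
from dask.distributed import Client, LocalCluster, Variable

cluster = LocalCluster(processes=False, n_workers=1)
submitter = Client(cluster)

# the submitting side stores its future under an agreed-upon name
future = submitter.submit(sum, [1, 2, 3])
Variable("job-result", client=submitter).set(future)

# the collecting side (another machine, in reality) connects to the same
# scheduler address and fetches the future by name
collector = Client(cluster.scheduler_address)
result = Variable("job-result", client=collector).get().result()

collector.close()
submitter.close()
cluster.close()
```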
3
votes
1 answer

Actors and dask-workers

client = Client('127.0.0.1:8786', direct_to_workers=True); future1 = client.submit(Counter, workers='ninja', actor=True); counter1 = future1.result(); print(counter1). All is well, but what if the client gets restarted? How do I…
chak
  • 31
  • 2
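For context, a minimal runnable baseline of the actor pattern the question starts from, on an in-process cluster; the `Counter` class stands in for the one in the question.

```python
from dask.distributed import Client

class Counter:
    # toy stateful actor; its state lives on one worker
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

client = Client(processes=False, n_workers=1)

# actor=True pins a single Counter instance to a worker; method calls
# return ActorFutures that resolve with .result()
counter = client.submit(Counter, actor=True).result()
first = counter.increment().result()
second = counter.increment().result()
client.close()
```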
3
votes
1 answer

How do I share a large read-only object across Dask distributed workers

The Problem I'm trying to send a 2 GB CPython read-only object (can be pickled) to dask distributed workers via apply(). This ends up consuming a lot of memory for processes/threads (14+ GB). Is there a way to load the object only once into memory…
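One commonly suggested technique, sketched here on a small list standing in for the 2 GB object: scatter it once with `broadcast=True`, then pass the resulting future into tasks so each task uses the worker-local copy instead of shipping its own serialized copy.

```python
from dask.distributed import Client

client = Client(processes=False, n_workers=2, threads_per_worker=1)

# ship the object to the cluster once; broadcast=True replicates it to
# every worker up front
big = list(range(100_000))  # stand-in for the 2 GB read-only object
big_future = client.scatter(big, broadcast=True)

# tasks that receive the future are handed the worker-local copy
result = client.submit(len, big_future).result()
client.close()
print(result)
```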
3
votes
0 answers

Worker crashes during simple aggregation

I am trying to aggregate various columns on a 450 million row data set. When I use Dask's built-in aggregations like 'min', 'max', 'std', and 'mean', they keep crashing a worker in the process. The file that I am using can be found here:…
DannyK
  • 103
  • 2
  • 10
3
votes
1 answer

Scheduler closing stream warning

I have a periodic batch job running on my laptop. The code looks like this: client = Client(); print(client.scheduler_info()); topic = 'raw_data'; start = datetime.datetime.now(); delta = datetime.timedelta(minutes=2); while True: end = start + delta …
Apostolos
  • 7,763
  • 17
  • 80
  • 150
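If the warning stems from connections being opened and dropped each iteration, one sketch of an alternative is a single long-lived client reused for the whole loop (three iterations stand in for `while True`):

```python
import datetime

from dask.distributed import Client

# one long-lived client for the whole batch loop; the context manager
# closes it cleanly at the end
with Client(processes=False, n_workers=1) as client:
    start = datetime.datetime.now()
    delta = datetime.timedelta(minutes=2)
    totals = []
    for _ in range(3):  # stand-in for `while True`
        end = start + delta
        totals.append(client.submit(sum, [1, 2, 3]).result())
        start = end
```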
3
votes
0 answers

split bigquery dataframe into chunks using dask

I searched and tested different ways to split a BigQuery dataframe into chunks of 75 rows, but couldn't find a way to do so. Here is the scenario: I got a very large BigQuery dataframe (millions of rows) using Python and GCP…
MT467
  • 668
  • 2
  • 15
  • 31
3
votes
0 answers

Writing Dask/XArray to NetCDF - Parallel IO

I am using Dask/Xarray with a ~150 GB dataset on a distributed cluster on an HPC system. I have the computation component complete, which takes about ~30 minutes. I want to save the final result to a NetCDF4 file, but writing the data to a NetCDF…
Rowan_Gaffney
  • 452
  • 5
  • 17
3
votes
1 answer

jupyter lab open an iframe on a tab for monitoring dask scheduler

I am developing with dask distributed, and this package provides a very useful debugging view as a Bokeh application. I want to have this application next to my notebook in a JupyterLab tab. I have managed to do so by opening the jupyter lab…
3
votes
1 answer

How do I get adaptive dask workers to run some code on startup?

I'm creating a dask scheduler using dask-kubernetes and putting it into adaptive mode. from dask_kubernetes import KubeCluster; cluster = KubeCluster(); cluster.adapt(minimum=0, maximum=40) I need each worker to run some setup code when they are…
Jacob Tomlinson
  • 3,341
  • 2
  • 31
  • 62
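One documented route is a `WorkerPlugin`: its `setup` hook runs on every worker that joins, including workers an adaptive cluster scales up later. A sketch on an in-process cluster (the `initialized` attribute is made up for the example):

```python
from dask.distributed import Client, WorkerPlugin, get_worker

class SetupPlugin(WorkerPlugin):
    # setup() runs on each worker when it joins the cluster
    def setup(self, worker):
        worker.initialized = True  # stand-in for real setup work

def check():
    # runs as a task on a worker; reports whether setup() happened there
    return getattr(get_worker(), "initialized", False)

client = Client(processes=False, n_workers=1)
client.register_worker_plugin(SetupPlugin())

ok = client.submit(check).result()
client.close()
print(ok)
```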
3
votes
1 answer

How to reliably clean up dask scheduler/worker

I'm starting up a dask cluster in an automated way by ssh-ing into a bunch of machines and running dask-worker. I noticed that I sometimes run into problems when processes from a previous experiment are still running. What's the best way to clean up…
John
  • 935
  • 6
  • 17
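From the Python side, `Client.shutdown()` (unlike `Client.close()`) asks the scheduler to terminate itself and its workers, which helps avoid leftovers between experiments; on the CLI side, `dask-worker --death-timeout <seconds>` makes workers exit if they lose the scheduler. A sketch:

```python
from dask.distributed import Client

client = Client(processes=False, n_workers=1)

# shutdown() closes the scheduler and all workers, not just this client's
# connection, so nothing lingers for the next run
client.shutdown()
print(client.status)
```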
3
votes
1 answer

Tensorflow + joblib: limited to 8 processes?

I created a statistical estimator using TensorFlow. I followed sklearn's estimators, so I have a class that packages everything, including importing TensorFlow and starting TF's session (if I import TF outside the class, nothing works in parallel at…