Questions tagged [dask-distributed]

Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.

1090 questions
4 votes · 2 answers

How do I capture dask-worker console logs in a file?

In the below, I want to capture "dask_client_log_msg" and other client-side logs in one file, and "dask_worker_log_msg" and other worker-side logs in a separate file, since the client obviously runs in a separate process from the worker. So I need one…
TheCodeCache
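Separating per-process logs like this does not need anything dask-specific: give each process its own `logging.FileHandler`. A minimal stdlib sketch, assuming the client and worker each run this setup in their own process (the logger names and file paths here are hypothetical):

```python
import logging

def setup_file_logger(name, path):
    """Attach a FileHandler so this named logger writes to its own file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(message)s"))
    logger.addHandler(handler)
    return logger

# In the client process:
client_log = setup_file_logger("client", "client.log")
client_log.info("dask_client_log_msg")

# In each worker process:
worker_log = setup_file_logger("worker", "worker.log")
worker_log.info("dask_worker_log_msg")
```

Because each process configures its own handler, client and worker messages end up in different files even though both sides use the standard logging module.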
4 votes · 2 answers

How to explicitly stop a running/live task through dask?

I have a simple task which is scheduled by dask-scheduler and is running on a worker node. My requirement is that I want to be able to stop the task on demand, whenever the user asks…
TheCodeCache
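Since dask.distributed's Client mirrors the concurrent.futures API, the usual answer is `future.cancel()` (or `client.cancel(future)`). A stdlib sketch of the same idea; note that cancel only reliably stops work that has not started yet:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_task(x):
    time.sleep(0.5)
    return x * 2

with ThreadPoolExecutor(max_workers=1) as pool:
    running = pool.submit(slow_task, 1)   # starts immediately on the lone worker
    pending = pool.submit(slow_task, 2)   # queued behind it
    stopped = pending.cancel()            # True: it never started running
    print(stopped, running.result())      # → True 2
```

On a real cluster the same caveat applies: a task that is already executing on a worker cannot always be interrupted mid-flight; cancellation guarantees the scheduler releases pending work.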
4 votes · 1 answer

Dask Event loop was unresponsive - work not parallelized

This is a follow-up to this question. I'm now trying to run Dask on multiple EC2 nodes on AWS. I'm able to start up the scheduler on the first machine, and I then start up workers on several other machines. From the other machines I'm able to access…
user554481
4 votes · 1 answer

Local Dask worker unable to connect to local scheduler

While running Dask 0.16.0 on OSX 10.12.6, I'm unable to connect a local dask-worker to a local dask-scheduler. I simply want to follow the official Dask tutorial. Steps to reproduce: Step 1: run dask-scheduler. Step 2: run dask-worker…
user554481
4 votes · 1 answer

Dask performance: workflow doubts

I'm confused about how to get the best from dask. The problem: I have a dataframe which contains several time series (each with its own key), and I need to run a function my_fun on each of them. One way to solve it with pandas involves df =…
rpanai
4 votes · 1 answer

dask distributed 1.19 client logging?

The following code used to emit logs at some point, but no longer seems to do so. Shouldn't configuration of the logging mechanism in each worker permit logs to appear on stdout? If not, what am I overlooking? import logging from distributed import…
lebedov
4 votes · 1 answer

Mean of multiple images: dask.delayed vs. dask.array

Background: I have a list with the paths of a thousand image stacks (3D numpy arrays), preprocessed and saved as .npy binaries. Case study: I would like to calculate the mean of all the images, and to speed up the analysis I thought I would parallelise…
s1mc0d3
4 votes · 0 answers

How to load dataframe on all dask workers

I have a few thousand CSV files in S3, and I want to load them, concatenate them together into a single pandas dataframe, and share that entire dataframe with all dask workers on a cluster. All of the files are approximately the same size (~1MB). …
Peter Lubans
4 votes · 1 answer

dask dataframe set_index throws error

I have a dask dataframe created from a parquet file on HDFS. When setting the index using the set_index API, it fails with the error below. File "/ebs/d1/agent/conda/envs/py361/lib/python3.6/site-packages/dask/dataframe/shuffle.py", line 64, in…
Santosh Kumar
4 votes · 1 answer

Pickle error when submitting task using dask

I am trying to execute a simple task (an instance method) using the dask (async) framework, but it fails with a serialization error. Can someone point me in the right direction? Here is the code that I am running: from dask.distributed import Client,…
Santosh Kumar
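The usual cause of such errors is that the instance bound to the method holds something that cannot be pickled (an open connection, a thread lock, the Client itself). A stdlib-only reproduction of the same failure, using a hypothetical `Task` class:

```python
import pickle
import threading

class Task:
    def __init__(self):
        self.lock = threading.Lock()   # an unpicklable attribute

    def run(self):
        return 42

task = Task()
try:
    # Pickling the bound method drags the whole instance (self) with it,
    # so the lock attribute makes serialization fail.
    pickle.dumps(task.run)
except TypeError as exc:
    print("serialization failed:", exc)
```

The common fix is to submit a plain module-level function and pass the data it needs as arguments, rather than submitting a method bound to a stateful object.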
4 votes · 0 answers

workers connect, but computation fails

I get dask-worker to connect to dask-scheduler, but my problem occurs after issuing tasks. It looks to me (in the task stream) that the workers do perform the computation. The error log from the dask worker is very long and I don't understand it: it says…
pletnes
3 votes · 1 answer

Dask scatter with broadcast=True extremely slow

I have created a single (remote) scheduler and ten workers on different machines on the same network, and I am trying to distribute a dataframe from a client. My problem is that the scatter takes 30 minutes. from dask.distributed import Client df =…
Philipp -
3 votes · 1 answer

Submit worker functions in dask distributed without waiting for the functions to end

I have this Python code that uses the apscheduler library to submit processes, and it works fine: from apscheduler.schedulers.background import BackgroundScheduler scheduler = BackgroundScheduler() array = [1, 3, 5, 7] for elem in array: …
ps0604
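In dask.distributed the idiomatic tool for this is `fire_and_forget(future)`, which tells the cluster to keep a task alive without the client holding on to its result. The underlying non-blocking pattern, sketched here with the stdlib executor that dask's API extends (not the dask call itself):

```python
from concurrent.futures import ThreadPoolExecutor

results = []

def background(elem):
    results.append(elem * 10)

pool = ThreadPoolExecutor()
for elem in [1, 3, 5, 7]:
    pool.submit(background, elem)   # returns immediately; no .result() call
# The main thread continues at once; the work finishes in the background.
pool.shutdown(wait=True)            # waiting here only to observe the results
print(sorted(results))              # → [10, 30, 50, 70]
```

With a dask Client the equivalent is `fire_and_forget(client.submit(fn, arg))`; the client never blocks on the future.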
3 votes · 0 answers

How do I configure Dask distributed logging levels with an environment variable?

It feels like I should be able to read between the lines of https://docs.dask.org/en/latest/how-to/debug.html and https://docs.dask.org/en/latest/configuration.html to craft an environment variable name and value, but none…
Duncan McGregor
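Dask's configuration system maps environment variables onto config keys using a `DASK_` prefix and double underscores for nesting. Assuming the relevant key is `logging.distributed` in distributed's config (an assumption worth checking against your installed version), a plausible setting would be:

```shell
# Nested config key logging.distributed: debug becomes:
export DASK_LOGGING__DISTRIBUTED=debug
```

You can confirm how a variable was interpreted by inspecting `dask.config.config` in the target process after startup.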
3 votes · 1 answer

limit number of CPUs used by dask compute

The code below takes approximately 1 second to execute on an 8-CPU system. How can I manually configure the number of CPUs used by dask.compute, e.g. to 4, so that the code takes approximately 2 seconds even on an 8-CPU system? import dask from time import sleep def…
Russell Burdt
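The local schedulers accept a `num_workers` argument, e.g. `dask.compute(*tasks, scheduler='threads', num_workers=4)` (hedged: check this against your dask version). The effect is the same concurrency cap as the stdlib pool shown here: 8 one-unit tasks on 4 workers need two waves, roughly doubling the wall time:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def work(x):
    time.sleep(0.1)   # stand-in for a task taking one time unit
    return x

start = time.time()
# Cap the pool at 4 workers even though the machine may have 8 CPUs.
with ThreadPoolExecutor(max_workers=4) as pool:
    out = list(pool.map(work, range(8)))
elapsed = time.time() - start
print(out)   # 8 tasks in 2 waves of 4 → roughly 0.2 s total
```

The same reasoning predicts the asker's 1 s → 2 s change when halving the worker count on an 8-CPU system.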