Questions tagged [dask-distributed]

Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.

1090 questions
9 votes • 2 answers

distributed.worker Memory use is high but worker has no data to store to disk

distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.91 GB -- Worker memory limit: 2.00 GB
distributed.worker - WARNING - Worker is at 41% memory…
AHassett • 91 • 2 • 3
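
A common first mitigation for this warning is to give each worker an explicit memory budget and tune the fractions at which distributed spills, pauses, and terminates. A minimal sketch, assuming a local cluster and the standard distributed config keys:

    import dask
    from dask.distributed import Client, LocalCluster

    # Fractions of the memory limit at which a worker spills to disk,
    # pauses new work, and is terminated by the nanny.
    dask.config.set({
        "distributed.worker.memory.target": 0.60,
        "distributed.worker.memory.spill": 0.70,
        "distributed.worker.memory.pause": 0.80,
        "distributed.worker.memory.terminate": 0.95,
    })

    # An explicit per-worker budget instead of the 2.00 GB from the warning.
    cluster = LocalCluster(n_workers=4, memory_limit="4GB")
    client = Client(cluster)
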
7 votes • 2 answers

How to properly use dask's upload_file() to pass local code to workers

I have functions in a local_code.py file that I would like to pass to workers through dask. I've seen answers to questions on here saying that this can be done using the upload_file() function, but I can't seem to get it working because I'm still…
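
For reference, upload_file() copies a local file to every connected worker; importing the module inside the task then resolves it on the worker rather than the client. A minimal sketch, where the scheduler address and the process() function are hypothetical:

    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")  # hypothetical address
    client.upload_file("local_code.py")      # ship the module to all workers

    def run(x):
        # Import inside the task so it is resolved on the worker.
        from local_code import process       # hypothetical function
        return process(x)

    futures = client.map(run, range(10))
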
7 votes • 2 answers

Dask Equivalent of pd.to_numeric

I am trying to read multiple CSV files, each around 15 GB, using dask's read_csv. While doing so, dask interprets a particular column as float; however, the column has a few string values, and it later fails when I try to…
Karrtik Iyer • 131 • 1 • 6
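
One workaround is to read the ambiguous column as object and convert it per partition with pandas' to_numeric. A minimal sketch, where the column name 'price' is hypothetical:

    import dask.dataframe as dd
    import pandas as pd

    # Keep the mixed column as strings so read_csv does not guess float.
    ddf = dd.read_csv("data-*.csv", dtype={"price": "object"})

    # errors='coerce' turns unparseable strings into NaN.
    ddf["price"] = ddf["price"].map_partitions(
        pd.to_numeric, errors="coerce", meta=("price", "f8"))
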
7 votes • 2 answers

How to pass multiple arguments to dask.distributed.Client().map?

    import dask.distributed

    def f(x, y):
        return x, y

    client = dask.distributed.Client()
    client.map(f, [(1, 2), (2, 3)])

Does not work. [,
mathtick • 6,487 • 13 • 56 • 101
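
Client.map follows the convention of the built-in map: one iterable per positional parameter, zipped element-wise, rather than an iterable of argument tuples. A minimal sketch:

    import dask.distributed

    def f(x, y):
        return x, y

    client = dask.distributed.Client()

    # Calls f(1, 2) and f(2, 3).
    futures = client.map(f, [1, 2], [2, 3])
    print(client.gather(futures))  # [(1, 2), (2, 3)]
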
7 votes • 1 answer

Get ID of Dask worker from within a task

Is there a worker ID, or some unique identifier that a dask worker can access programmatically from within a task?
MRocklin • 55,641 • 23 • 163 • 235
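
Inside a running task, get_worker() returns the current worker object, whose address serves as a unique identifier. A minimal sketch:

    from dask.distributed import Client, get_worker

    def which_worker(x):
        # get_worker() is only valid inside a task executing on a worker.
        return get_worker().address  # e.g. 'tcp://127.0.0.1:36789'

    client = Client()
    futures = client.map(which_worker, range(8))
    print(set(client.gather(futures)))
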
7 votes • 1 answer

Convert spark dataframe to dask dataframe

Is there a way to directly convert a Spark dataframe to a Dask dataframe? I currently use Spark's .toPandas() to convert it into a pandas dataframe and then into a dask dataframe. I believe this is an inefficient operation and is not…
vva • 133 • 4 • 11
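
There is no direct converter; a route that avoids collecting everything through the driver is to write Parquet from Spark and read the same files with Dask. A minimal sketch, assuming spark_df is an existing Spark DataFrame and the path is on shared storage:

    # Spark side: persist to a shared location.
    spark_df.write.parquet("/shared/data.parquet")

    # Dask side: read the same files lazily.
    import dask.dataframe as dd
    ddf = dd.read_parquet("/shared/data.parquet")
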
7 votes • 1 answer

Local use of dask: to Client() or not to Client()?

I am trying to understand the usage patterns for Dask on a local machine. Specifically, I have a dataset that fits in memory, and I'd like to do some pandas operations (groupby, date parsing, etc.). Pandas performs these operations on a single core, and…
Jonathan • 1,287 • 14 • 17
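
For data that fits in memory a Client is optional: dask.dataframe runs on the default local scheduler by itself, while creating a Client() switches to the distributed scheduler (with its dashboard) even on one machine. A minimal sketch, with hypothetical column names and an in-memory pandas frame df:

    import dask.dataframe as dd

    ddf = dd.from_pandas(df, npartitions=8)  # df: an existing pandas frame

    # Without a Client: the default local scheduler.
    result = ddf.groupby("key")["value"].mean().compute()

    # With a local Client: multiple processes plus the dashboard.
    from dask.distributed import Client
    client = Client(processes=True)
    result = ddf.groupby("key")["value"].mean().compute()
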
7 votes • 1 answer

How do I check if there is an already running dask scheduler?

I want to start a local cluster from Python with a specific number of workers, and then connect a client to it:

    cluster = LocalCluster(n_workers=8, ip='127.0.0.1')
    client = Client(cluster)

But first, I want to check whether there is an existing local…
medRa • 73 • 1 • 4
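
One approach is to try connecting to the default scheduler address with a short timeout and fall back to starting a fresh cluster. A minimal sketch, assuming the default port 8786:

    from dask.distributed import Client, LocalCluster

    try:
        # Succeeds if a scheduler is already listening.
        client = Client("tcp://127.0.0.1:8786", timeout="2s")
    except OSError:
        # Otherwise start our own local cluster.
        cluster = LocalCluster(n_workers=8, ip="127.0.0.1")
        client = Client(cluster)
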
7 votes • 3 answers

Semaphores in dask.distributed?

I have a dask cluster with n workers and want the workers to run queries against a database. But the database can only handle m queries in parallel, where m < n. How can I model that in dask.distributed? Only m workers should work on such a…
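
distributed ships a Semaphore that lives on the scheduler, so at most m leases are held cluster-wide no matter which workers run the tasks. A minimal sketch, where run_query and queries are hypothetical:

    from dask.distributed import Client, Semaphore

    client = Client()
    sem = Semaphore(max_leases=3, name="database")  # m = 3 concurrent queries

    def query_db(q):
        # Blocks until one of the m leases is free.
        with sem:
            return run_query(q)  # hypothetical query function

    futures = client.map(query_db, queries)  # queries: hypothetical
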
7 votes • 2 answers

What is the default directory where Dask workers store results or files?

    [mapr@impetus-i0057 latest_code_deepak]$ dask-worker 172.26.32.37:8786
    distributed.nanny - INFO - Start Nanny at: 'tcp://172.26.32.36:50930'
    distributed.diskutils - WARNING - Found stale lock file and directory…
TheCodeCache • 820 • 1 • 7 • 27
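
By default a worker keeps its spill files in a dask-worker-space directory under the directory it was started from; this can be redirected via configuration or a CLI flag. A minimal sketch:

    import dask

    # Redirect worker scratch space away from the current working directory.
    dask.config.set({"temporary-directory": "/tmp/dask-scratch"})

    # Command-line equivalent:
    #   dask-worker 172.26.32.37:8786 --local-directory /tmp/dask-scratch
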
6 votes • 1 answer

Dask map method on a function with multiple arguments

I want to apply the Client.map method to a function that takes multiple arguments, as the Pool.starmap method of multiprocessing does. Here is an example:

    from contextlib import contextmanager
    from dask.distributed import Client

    @contextmanager
    def…
Andrex • 602 • 1 • 7 • 22
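
If the arguments are already packed as tuples, Pool.starmap behaviour can be recovered with a small unpacking wrapper (otherwise pass one iterable per parameter, as noted above). A minimal sketch:

    from dask.distributed import Client

    def f(x, y):
        return x + y

    client = Client()
    pairs = [(1, 2), (3, 4)]

    # starmap-style: unpack each tuple inside the task.
    futures = client.map(lambda args: f(*args), pairs)
    print(client.gather(futures))  # [3, 7]
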
6 votes • 2 answers

Reload Dask worker containers automatically on code change

I have the Dask code below that submits tasks to N workers, where each worker runs in a Docker container:

    default_sums = client.map(process_asset_defaults, build_worker_args(req, numWorkers))
    future_total_sum = client.submit(sum,…
ps0604 • 1,227 • 23 • 133 • 330
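
There is no built-in file watcher, but a cluster can be bounced from the client after a code change: restart the workers, then re-ship the updated module. A minimal sketch, with a hypothetical scheduler address and module name:

    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")  # hypothetical address

    client.restart()                      # drop stale worker state and modules
    client.upload_file("worker_code.py")  # hypothetical updated module
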
6 votes • 3 answers

Deploying a cluster of containers in Azure

I have a Docker application that works fine on my laptop on Windows, using Compose and starting multiple instances of a container as a Dask cluster. The name of the service is "worker" and I start two container instances like so: docker compose up…
ps0604 • 1,227 • 23 • 133 • 330
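
For the scaling part, Compose can start N containers of one service without duplicating its definition; the same file then works wherever Compose runs. A minimal sketch, assuming the service is named worker as in the question:

    docker compose up --scale worker=2 -d
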
6 votes • 2 answers

Dask distributed.scheduler - ERROR - Couldn't gather keys

    import joblib
    from sklearn.externals.joblib import parallel_backend

    with joblib.parallel_backend('dask'):
        from dask_ml.model_selection import GridSearchCV
        import xgboost
        from xgboost import XGBRegressor
        grid_search =…
praveen pravii • 193 • 2 • 9
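
Two things worth separating here: the 'dask' joblib backend needs a Client created beforehand, and dask_ml's GridSearchCV already talks to the cluster directly, so it does not need the joblib backend at all. A minimal sketch of the joblib pattern with scikit-learn's GridSearchCV, where X and y are hypothetical training data:

    import joblib
    from dask.distributed import Client
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBRegressor

    client = Client()  # must exist before entering the backend

    grid_search = GridSearchCV(
        XGBRegressor(), param_grid={"max_depth": [2, 4]}, n_jobs=-1)

    with joblib.parallel_backend("dask"):
        grid_search.fit(X, y)  # X, y: hypothetical training data
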
6 votes • 2 answers

Dask Memory leakage issue with json and requests

This is a minimal test that reproduces a memory-leak issue in a remote Dask Kubernetes cluster:

    def load_geojson(pid):
        import requests
        import io
        r =…
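
When a third-party library holds memory between tasks, two mitigations that need no code changes are a cluster-wide garbage-collection pass and workers with a bounded lifetime. A minimal sketch, with a hypothetical scheduler address:

    import gc
    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")  # hypothetical address

    # Run gc.collect() on every worker.
    client.run(gc.collect)

    # Or start workers that restart themselves periodically:
    #   dask-worker scheduler:8786 --lifetime 1h --lifetime-stagger 5m
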