Questions tagged [dask-distributed]

Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.

1090 questions
4 votes, 1 answer

How to check if a dask dataframe is empty when lazily evaluated?

I am aware of this question. But check the code (minimal working example) below: import dask.dataframe as dd import pandas as pd # initialise data of lists. data = {'Name': ['Tom', 'nick', 'krish', 'jack'], 'Age': [20, 21, 19, 18]} # Create…
MehmedB • 1,059 • 1 • 16 • 42
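
A minimal sketch of one common answer, using the question's own sample data: count rows per partition and sum the counts, which triggers computation of the counts only, never materialising the whole frame in one place.

```python
import dask.dataframe as dd
import pandas as pd

# Sample data mirroring the question's excerpt.
pdf = pd.DataFrame({'Name': ['Tom', 'nick', 'krish', 'jack'],
                    'Age': [20, 21, 19, 18]})
ddf = dd.from_pandas(pdf, npartitions=2)

# A lazily filtered frame whose emptiness is unknown until computed.
filtered = ddf[ddf.Age > 100]

# Count rows per partition and sum: computes only the counts.
is_empty = filtered.map_partitions(len).sum().compute() == 0
print(is_empty)  # True
```
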
4 votes, 0 answers

dask 100 GB dataframe sorting / set_index on new column: out-of-memory issues

I have a dask dataframe of around 100GB and 4 columns that does not fit into memory. My machine is an 8-core Xeon with 64GB of RAM with a local Dask cluster. I converted the dataframe to 150 partitions (700MB each). However, my simple set_index()…
user670186 • 2,588 • 6 • 37 • 55
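
A hedged sketch of one way to keep the shuffle's memory use bounded; the parquet path and column name are stand-ins, and on dask versions of that era the keyword is shuffle= (newer releases spell it shuffle_method=).

```python
import dask.dataframe as dd

# Hypothetical source; substitute the real 100 GB dataset.
ddf = dd.read_parquet('data/*.parquet')

# set_index on an unsorted column forces a full shuffle. Routing the
# shuffle through disk bounds peak memory at the cost of extra I/O.
ddf = ddf.set_index('new_column', shuffle='disk')

# Writing the result back out means the expensive sort happens once.
ddf.to_parquet('data_sorted/')
```
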
4 votes, 2 answers

How to use Dask on Databricks

I want to use Dask on Databricks. It should be possible (I cannot see why not). If I import it, one of two things happens: either I get an ImportError, or, when I install distributed to solve this, Databricks just says Cancelled without throwing any…
SARose • 3,558 • 5 • 39 • 49
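
A minimal sketch of the usual workaround, assuming the problem is worker-process start-up inside the notebook; whether this fits the asker's Databricks runtime is untested here.

```python
# Install inside the notebook first (hypothetical, depends on the runtime):
# %pip install dask distributed

from dask.distributed import Client

# Threads-only, in-process client on the driver node:
# nothing to fork, nothing for the platform to cancel.
client = Client(processes=False)
print(client)
```
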
4 votes, 0 answers

PyCharm debugger throws a Bad file descriptor error when using dask distributed

I am using the most lightweight/simple dask multiprocessing setup, which is the non-cluster local Client: from distributed import Client client = Client() Even so, the first invocation of dask.bag.compute() results in the following: Connected to…
WestCoastProjects • 58,982 • 91 • 316 • 560
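
One commonly suggested workaround, assuming the debugger is tripping over the forked workers' pipes and file descriptors: run everything in-process while debugging.

```python
from distributed import Client
import dask.bag as db

# In-process (threaded) client: no child processes, so the debugger
# has no inherited file descriptors to trip over.
client = Client(processes=False)

result = db.from_sequence(range(10)).map(lambda x: x * 2).compute()
print(result)
```
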
4 votes, 1 answer

Initializing state on dask-distributed workers

I am trying to do something like resource = MyResource() def fn(x): something = dosomething(x, resource) return something client = Client() results = client.map(fn, data) The issue is that resource is not serializable and is expensive to…
Daniel Mahler • 7,653 • 5 • 51 • 90
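
A sketch of the standard pattern: build the resource once per worker and cache it on the worker object, so it is never serialized or rebuilt per task. MyResource is a stand-in for the asker's class; Client.register_worker_callbacks is an alternative hook.

```python
from distributed import Client, get_worker

class MyResource:            # stand-in for the asker's expensive class
    def query(self, x):
        return x * 2

def get_resource():
    # Build once per worker and cache on the worker object.
    worker = get_worker()
    if not hasattr(worker, '_resource'):
        worker._resource = MyResource()
    return worker._resource

def fn(x):
    return get_resource().query(x)

client = Client()
futures = client.map(fn, range(10))
print(client.gather(futures))
```
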
4 votes, 1 answer

dask read_csv timeout on Amazon S3 with big files

dask read_csv timeout on s3 for big files s3fs.S3FileSystem.read_timeout = 5184000 # one day s3fs.S3FileSystem.connect_timeout = 5184000 # one day client = Client('a_remote_scheduler_ip_here:8786') df =…
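
A sketch of the usual fix: pass the timeouts through storage_options so each worker's s3fs session picks them up; patching the class attribute only affects the client process, not the workers that actually read the file. The bucket path is a stand-in.

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client('a_remote_scheduler_ip_here:8786')

# storage_options is forwarded to s3fs.S3FileSystem on every worker;
# config_kwargs feeds botocore's Config (read_timeout/connect_timeout).
df = dd.read_csv(
    's3://some-bucket/big-file-*.csv',  # hypothetical path
    storage_options={'config_kwargs': {
        'read_timeout': 5184000,
        'connect_timeout': 5184000,
    }},
)
```
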
4 votes, 1 answer

How do I use dask to efficiently calculate many simple statistics

Problem: I want to calculate a bunch of "easy to gather" statistics using Dask. Speed is my primary concern, so I am looking to throw a wide cluster at the problem. Ideally, I would like to finish the described problem in less than…
bluecoconut • 63 • 1 • 5
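
A minimal sketch of the key idea: build every statistic lazily and compute them in a single call, so the scheduler shares one pass over the data across all of the aggregations. The path and column name are stand-ins.

```python
import dask
import dask.dataframe as dd

ddf = dd.read_parquet('data/*.parquet')  # hypothetical source

# Lazy expressions only; nothing runs yet.
stats = {
    'mean': ddf.x.mean(),    # column name 'x' is an assumption
    'std': ddf.x.std(),
    'count': ddf.x.count(),
    'max': ddf.x.max(),
}

# One compute() call lets dask merge the graphs and read each
# partition once for all four statistics.
(results,) = dask.compute(stats)
print(results)
```
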
4 votes, 0 answers

Dask Distributed client takes too long to initialize in JupyterLab

Trying to initialize a client with a local cluster in JupyterLab, but it hangs. This behaviour happens with Python 3.5 and JupyterLab 0.35. import dask.dataframe as dd from dask import delayed from distributed import Client from distributed import…
Apostolos • 7,763 • 17 • 80 • 150
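
A hedged sketch that often makes such hangs diagnosable: build the LocalCluster explicitly and in-process, so start-up errors surface instead of blocking while worker processes spawn.

```python
from distributed import Client, LocalCluster

# Explicit cluster construction: failures raise here rather than
# hanging inside Client(); processes=False avoids spawning workers,
# the usual culprit in restricted notebook environments.
cluster = LocalCluster(processes=False)
client = Client(cluster)
print(client)
```
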
4 votes, 1 answer

Tornado unexpected exception in Future after timeout

I have set up a dask cluster. I can access a web dashboard, but when I'm trying to connect to the scheduler: from dask.distributed import Client client = Client('192.168.0.10:8786') I get the following error: tornado.application - ERROR - Exception…
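
A frequent cause of this error is mismatched dask/distributed/tornado versions between client, scheduler, and workers. If the connection itself succeeds, this built-in check raises on any mismatch.

```python
from dask.distributed import Client

client = Client('192.168.0.10:8786')

# Compares package versions across client, scheduler, and all
# workers; raises if they differ.
client.get_versions(check=True)
```
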
4 votes, 1 answer

How to assign tasks to a specific worker within Dask.Distributed

I am interested in using Dask Distributed as a task executor. In Celery it is possible to assign a task to a specific worker. How can this be done with Dask Distributed?
Sklavit • 2,225 • 23 • 29
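
A minimal sketch of the documented mechanism: the workers= argument of submit/map pins a task to named worker addresses. The scheduler address, worker address, and task are stand-ins.

```python
from dask.distributed import Client

client = Client('scheduler-address:8786')  # hypothetical address

def process(x):                            # hypothetical task
    return x * 2

# workers= restricts placement to the listed worker address(es);
# allow_other_workers=False makes the constraint strict.
future = client.submit(process, 42,
                       workers=['tcp://10.0.0.5:12345'],
                       allow_other_workers=False)
print(future.result())
```
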
4 votes, 1 answer

dask: how to define a custom (time fold) function that operates in parallel and returns a dataframe with a different shape

I am trying to implement a time fold function to be mapped over various partitions of a dask dataframe, which in turn changes the shape of the dataframe in question (or alternatively produces a new dataframe with the altered shape). This is how far I…
PhaKuDi • 141 • 8
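
A self-contained sketch of the usual mechanism: map_partitions with an explicit meta describing the output schema, so the result may have a different shape than the input. The hourly data and daily aggregation are stand-ins for the asker's fold.

```python
import pandas as pd
import dask.dataframe as dd

# Hypothetical time-indexed frame: 4 days of hourly data, one day
# per partition.
idx = pd.date_range('2019-01-01', periods=96, freq='H')
ddf = dd.from_pandas(pd.DataFrame({'value': range(96)}, index=idx),
                     npartitions=4)

def time_fold(part: pd.DataFrame) -> pd.DataFrame:
    # Returns one row per day: a different shape than the input.
    return part.resample('1D').mean()

# meta declares the output schema, so dask accepts a function whose
# result differs in shape from the input partitions.
meta = pd.DataFrame({'value': pd.Series(dtype='float64')},
                    index=pd.DatetimeIndex([]))
folded = ddf.map_partitions(time_fold, meta=meta)
print(folded.compute())
```
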
4 votes, 1 answer

Running shell commands in parallel using dask distributed

I have a folder with a lot of .sh scripts. How can I use an already set up dask distributed cluster to run them in parallel? Currently, I am doing the following: import dask, distributed, os # list with shell commands that I want to run commands =…
Arco Bast • 3,595 • 2 • 26 • 53
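
A sketch of one straightforward approach: wrap each command in subprocess.run and fan it out with client.map. The scheduler address and script names are stand-ins; pure=False matches the question's side-effecting tasks.

```python
import subprocess
from distributed import Client

client = Client('scheduler-address:8786')  # hypothetical address

def run_script(cmd):
    # Run one shell command on a worker; return code and output
    # travel back to the client for inspection.
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return proc.returncode, proc.stdout

commands = ['./a.sh', './b.sh']            # stand-ins for the folder's scripts
futures = client.map(run_script, commands, pure=False)  # rerun on each call
results = client.gather(futures)
```
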
4 votes, 0 answers

Optimal approach to create a dask dataframe from parquet files (HDFS) in different directories

I am trying to create a dask dataframe from a large number of parquet files stored in different HDFS directories. I have tried two approaches, but both of them seem to take a very long time. Approach 1: call the API read_parquet with a glob path.…
Santosh Kumar • 761 • 5 • 28
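
A hedged sketch of the usual mitigation: pass an explicit list of paths instead of one giant recursive glob, and skip per-file statistics so not every footer is opened up front. The paths are stand-ins, and gather_statistics= is the keyword on dask versions of that era (newer releases renamed it).

```python
import dask.dataframe as dd

# Hypothetical directory layout; substitute the real HDFS paths.
paths = [f'hdfs:///data/dir{i}/*.parquet' for i in range(20)]

# Explicit path list + no statistics gathering keeps graph
# construction cheap at the cost of unknown divisions.
ddf = dd.read_parquet(paths, engine='pyarrow', gather_statistics=False)
```
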
4 votes, 0 answers

Dask: restart worker(s) using the client

Is there a way, using the dask client, to restart a worker or a provided list of workers? I need a way to bounce a worker after a task is executed, to reset process state that may have been changed by the execution. Client.restart() restarts the entire…
Ameet Shah • 61 • 1 • 4
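
As far as I know there is no per-worker restart on the classic client API; a hedged sketch of the closest built-in, which drains and removes the named workers so an external supervisor (or the nanny process) can start fresh replacements.

```python
from distributed import Client

client = Client('scheduler-address:8786')  # hypothetical address

# Gracefully moves data off the named workers and retires them;
# a supervisor restarting dask-worker brings clean state back.
client.retire_workers(workers=['tcp://10.0.0.5:12345'])
```
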
4 votes, 1 answer

Redistribute dask tasks among the cluster

I am abusing dask as a task scheduler for long-running tasks with map(…, pure=False). I am not interested in the dask graph; I just use dask as a way to distribute unix commands. Let's say I have 1000 tasks and they run for a week on a cluster of…
MaxBenChrist • 547 • 3 • 9
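
A hedged note in code form: rebalance() moves data, not tasks; queued tasks are redistributed by the scheduler's work stealing once workers fall idle, so long-running clusters usually self-balance. The address is a stand-in.

```python
from distributed import Client

client = Client('scheduler-address:8786')  # hypothetical address

# Evens out *stored results* across workers; task placement itself
# is handled continuously by the scheduler's work stealing.
client.rebalance()
```
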