Questions tagged [dask-distributed]

Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.

1090 questions
0
votes
1 answer

Design computation graph in dask

Until now, I've used dask with get and a dictionary to define the dependency graph of my tasks. But that means I have to define the whole graph up front, and now I want to add new tasks from time to time (with dependencies on old…
0
votes
1 answer

Can't find dependencies/Dependent not found error

I am trying to run this benchmark on a small dask cluster made of two nodes. The remote worker is simply deployed with the dask-worker command and it appears correctly in the output of client in the benchmark. I've also tried to run some simple…
Aratz
  • 430
  • 5
  • 16
0
votes
1 answer

Dask Memory Error Grouping DF From Parquet Data

I created a parquet dataset by reading data into a pandas df, using get_dummies() on the data, and writing it to a parquet file: df = pd.read_sql(query, engine) encoded = pd.get_dummies(df,…
OverflowingTheGlass
  • 2,324
  • 1
  • 27
  • 75
0
votes
1 answer

Dask- Same tasks are not running in parallel on cluster of Ubuntu machines

I have 3 Ubuntu machines (CPU). My dask scheduler and client are both on the same machine, whereas the two dask workers are running on the other two machines. When I launch the first task, it gets scheduled on the first worker, but then upon launching…
TheCodeCache
  • 820
  • 1
  • 7
  • 27
0
votes
1 answer

Is there any way to know whether a dask-worker is running on a CPU device or a GPU device?

Suppose a dask cluster has some CPU devices as well as some GPU devices. Each device runs a single dask-worker. Now, the question is: how do I find out whether the underlying device of a dask-worker is a CPU or a GPU? For example: if the dask-worker is running…
TheCodeCache
  • 820
  • 1
  • 7
  • 27
0
votes
1 answer

Simplest way to create a complex dask graph

There is a complex system of calculations over some objects. The difficulty is that some calculations are group calculations. This can be demonstrated by the following example: from dask.distributed import Client def load_data_from_db(id): # load…
Vladimir
  • 145
  • 2
  • 9
0
votes
0 answers

dask jobs hang indefinitely and inconsistently

I am running multiple concurrent dask jobs using the dask client submit API. I have come across this issue multiple times. A thread dump of the specific worker shows the information below. Can someone guide me on this problem? ts_data =…
Santosh Kumar
  • 761
  • 5
  • 28
0
votes
1 answer

dask distributed.utils - ERROR - state is not a dictionary

I recently upgraded from dask-0.15.3 to dask-0.16.0 and from distributed-1.19.1 to distributed-1.20.2. After the upgrade, all dask jobs are failing with the exception: _pickle.UnpicklingError: state is not a dictionary Please let me know if I am missing any…
Santosh Kumar
  • 761
  • 5
  • 28
0
votes
1 answer

How long does it take for dask-ec2 to set up instances?

I am new to dask.distributed. I am trying to set up a small cluster for a distributed job, and I am using dask-ec2 to set it up. When I run the command with the required arguments, it gets stuck at the worker installation task. I killed it after 30 minutes. I am using port…
0
votes
1 answer

Dask DataFrame.map_partitions() to write to db table

I have a dask dataframe that contains some data after some transformations. I want to write that data back to a MySQL table. I have implemented a function that takes a dataframe and a db url and writes the dataframe back to the database. Because I need…
Apostolos
  • 7,763
  • 17
  • 80
  • 150
0
votes
1 answer

Python + Distributed - Is it possible using Dask to utilize a set of workers to apply a function to separate files from a folder concurrently?

I want to write a program that calculates the time it takes to read in a folder of .py files and calculate the cyclomatic complexity of each of the files. I have Radon installed to calculate the complexity, but I also want to be able to implement a…
0
votes
1 answer

I have a collection of futures that are the result of persist on a dask dataframe. How do I do a delayed operation on them?

I have set up a scheduler and 4 worker nodes to do some processing on a CSV. The size of the CSV is just 300 MB. df = dd.read_csv('/Downloads/tmpcrnin5ta',assume_missing=True) df = df.groupby(['col_1','col_2']).agg('mean').reset_index() df =…
0
votes
1 answer

Read dask dataframe from parallel txt files

I have two (or more) parallel text files stored in S3, i.e. line 1 in the first file corresponds to line 1 in the second file, etc. I want to read these files as columns into a single dask dataframe. What would be the best/easiest/fastest way to do it? PS.…
evilkonrex
  • 255
  • 2
  • 10
0
votes
0 answers

Dask dataframe error while reading from HDFS

Here is the code that I am using to connect to HDFS and create a dask dataframe. Client(scheduler_host+":"+scheduler_port) df=dd.read_csv("hdfs://hdfs_host/") Error: AttributeError: /usr/lib/libhdfs3.so: undefined symbol:…
Santosh Kumar
  • 761
  • 5
  • 28
0
votes
1 answer

subselection of columns in dask (from pandas) by computed boolean indexer

I'm new to dask (imported as dd) and am trying to convert some pandas (imported as pd) code. The goal of the following lines is to slice the data to those columns whose values fulfill the calculated requirement in dask. There is a given table in…
Bastian Ebeling
  • 1,138
  • 11
  • 38