Questions tagged [dask-distributed]

Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.
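
For readers unfamiliar with the interface mentioned in the description, here is a minimal sketch of the standard-library concurrent.futures pattern (submit a task, get a future, ask for its result). It deliberately uses only the standard library's ThreadPoolExecutor, not Dask itself; the helper names `square` and `run_squares` are hypothetical. The `Client` class in dask.distributed offers `submit` and `map` methods of a similar shape, though details differ.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Toy task (hypothetical); stands in for any function sent to a worker."""
    return x * x


def run_squares(values):
    """Submit one task per value and collect the results in submission order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # submit() returns immediately with a Future for each task
        futures = [pool.submit(square, v) for v in values]
        # result() blocks until the corresponding task has finished
        return [f.result() for f in futures]


if __name__ == "__main__":
    print(run_squares([1, 2, 3]))
```

The futures are gathered in the order they were submitted, so the output order matches the input order regardless of which task finishes first.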

1090 questions
0 votes, 0 answers

Dask distributed with numba giving error

I am trying to use Numba with Dask for a simple groupby operation on a dataset. It works fine on a single system, but when I apply it on a distributed one, it gives an error that I am unable to get past. Please help. Thanks…
Sweta
0 votes, 1 answer

Error: No module named 'Custom Class' while passing a Client object in the custom class's constructor in dask

I have been trying to write custom classes for Preprocessing followed by Feature selection and Machine Learning algorithms as well. I cracked this (preprocessing only) using @delayed. But when I read from the tutorials that the same can be achieved…
Asif Ali
0 votes, 1 answer

Dask client runs out of memory loading from S3

I have an S3 bucket with a lot of small files, over 100K, that add up to about 700 GB. When loading the objects from a data bag and then persisting, the client always runs out of memory, consuming gigabytes very quickly. Limiting the scope to a few hundred…
Kevin McGrath
0 votes, 1 answer

dask-jobqueue does not start any worker on a Slurm cluster

I am trying to run Dask on a research cluster managed by Slurm. Launching a job with a classical sbatch script works. But when I do: from dask_jobqueue import SLURMCluster cluster = SLURMCluster(cores=12, memory='24 GB', processes=1,…
LCT
0 votes, 1 answer

How to implement `iloc` function for dask dataframe?

I have a huge file, around 35 GB, stored as HDF5. I have to do certain calculations on some specific columns and want to insert the results as new columns. I know I can assign new columns directly as df['new_column'] = 0 (or some other…
Urvish
0 votes, 1 answer

dask distributed: adding up a collection of vectors residing on different workers

I have a large set of vectors that were computed on different data, thus they reside on different workers. Is the following code the most efficient? grads = [client.submit(compute_grad, x) for x in xs] # list of futures gradsum_future =…
John
0 votes, 1 answer

difference between client and executor in dask

Executor is the primary entry point for users of distributed. Similarly, Client is the primary entry point for users of dask.distributed. So the two seem identical. In Dask, can both be used interchangeably? If yes, what is the use case to use…
Sweta
0 votes, 1 answer

compute() in dask not working

I am trying a simple parallel computation in Dask. This is my code. import time import dask as dask import dask.distributed as distributed import dask.dataframe as dd import dask.delayed as delayed from dask.distributed import…
Sweta
0 votes, 1 answer

Parallelization on cluster dask

I'm looking for the best way to parallelize the following problem on a cluster. I have several files folder/file001.csv folder/file002.csv : folder/file100.csv They are disjoint with respect to the key I want to use for the groupby, that is if a set…
rpanai
0 votes, 1 answer

processes=False in local distributed in Dask

I read the Dask documentation. It says there, for the local distributed form, that client = Client(processes=False). I would like to know why processes is set to False.
Sweta
0 votes, 1 answer

How is dask implemented on multiple systems?

I am new to the Dask library. I wanted to know: if we implement parallel computation using Dask on two systems, is the data frame on which we apply the computation stored on both systems? How does the parallel computation actually take place, it…
Sweta
0 votes, 0 answers

Custom search in Dask

I have 1000 regex patterns that I have to search for in each of 9000 strings. The normal brute-force method using a pandas list took 25 min for the same task. I have used the delayed function of Dask to parallelize the entire function. It took 9 min to…
ANKIT JHA
0 votes, 2 answers

Confusion regarding cluster scheduler and single machine distributed scheduler

In the code below, why is dd.read_csv running on the cluster? client.read_csv should run on the cluster. import dask.dataframe as dd from dask.distributed import Client client=Client('10.31.32.34:8786') dd.read_csv('file.csv',blocksize=10e7) dd.compute() Is…
Dhruv Kumar
0 votes, 1 answer

Another UI for Dask besides Bokeh

Is there another Dask UI besides Bokeh? I have a problem with Bokeh: it does not show the graph and UI when running on an EC2 instance.
Dhruv Kumar
0 votes, 0 answers

AttributeError: 'S3File' object has no attribute 'getvalue', while running to_csv

I'm running the to_csv command as follows, writing to an output file on an S3 bucket with ServerSideEncryption enabled: to_csv("s3://mys3bucket/result.csv", storage_option={'s3_additional_kwargs': {'ServerSideEncryption': 'AES256'}}) I'm getting…
Dhruv Kumar