Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers (see the sketch after this list).
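
A minimal sketch of both components, assuming only a standard local install of dask (plus pandas):

    import dask
    import dask.dataframe as dd
    import pandas as pd

    # Dynamic task scheduling: build a small task graph lazily with dask.delayed.
    @dask.delayed
    def inc(x):
        return x + 1

    total = dask.delayed(sum)([inc(i) for i in range(10)])
    print(total.compute())  # the scheduler runs the graph in parallel: 55

    # "Big Data" collections: a dask DataFrame partitions a pandas DataFrame
    # and runs familiar pandas operations per partition on the same schedulers.
    pdf = pd.DataFrame({"x": range(100)})
    ddf = dd.from_pandas(pdf, npartitions=4)
    print(ddf.x.mean().compute())  # 49.5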

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions

7 votes · 1 answer

Retries in dask.compute() are unclear

From the documentation: "Number of allowed automatic retries if computing a result fails." Does "result" refer to each individual task or the entire compute() call? If it refers to the entire call, how to implement retries for each task in…
asked by Michał Zawadzki (695)
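
With the distributed scheduler, retries can be given per task via Client.map or Client.submit; a minimal sketch (flaky is a made-up stand-in that fails at random):

    from dask.distributed import Client
    import random

    def flaky(x):
        # Made-up function that fails randomly to exercise the retry machinery.
        if random.random() < 0.3:
            raise RuntimeError("transient failure")
        return x * 2

    client = Client()  # local cluster
    futures = client.map(flaky, range(10), retries=3)  # each task retried up to 3 times
    print(client.gather(futures))
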
7 votes · 2 answers

How to properly use dask's upload_file() to pass local code to workers

I have functions in a local_code.py file that I would like to pass to workers through dask. I've seen answers to questions on here saying that this can be done using the upload_file() function, but I can't seem to get it working because I'm still…
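
A sketch of the usual pattern, assuming local_code.py sits in the working directory and defines a function my_func (a hypothetical name):

    from dask.distributed import Client

    client = Client()
    client.upload_file("local_code.py")  # ships the module to every connected worker

    def run(x):
        # Import inside the task so each worker resolves the uploaded module.
        import local_code
        return local_code.my_func(x)

    results = client.gather(client.map(run, range(4)))

Note that the file is sent to workers connected at upload time; workers that join later will not have it.
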
7 votes · 2 answers

Dask Equivalent of pd.to_numeric

I am trying to read multiple CSV files, each around 15 GB, using dask's read_csv. While performing this task, dask interprets a particular column as float; however, it has a few values of string type, and it later fails when I try to…
asked by Karrtik Iyer (131)
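
One possible approach, sketched here: read the ambiguous column as object and convert it per partition with pandas' to_numeric, coercing bad strings to NaN ("colname" is a stand-in):

    import dask.dataframe as dd
    import pandas as pd

    df = dd.read_csv("data-*.csv", dtype={"colname": "object"})
    df["colname"] = df["colname"].map_partitions(
        pd.to_numeric, errors="coerce", meta=("colname", "f8")
    )
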
7 votes · 2 answers

Parallelized bootstrapping with replacement with xarray/dask

I want to perform N=1000 bootstraps with replacement on gridded data. One computation takes about 0.5 s. I have access to an exclusive 48-core node on a supercomputer. Because the resamples are independent of each other, I naively hope to…
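
A sketch of one way to parallelize the resamples with dask.delayed, assuming the data is a NumPy array and statistic() stands in for the real per-sample computation:

    import numpy as np
    import dask

    data = np.random.default_rng(0).normal(size=10_000)

    def statistic(sample):
        return sample.mean()  # stand-in for the real 0.5 s computation

    @dask.delayed
    def one_bootstrap(seed):
        rng = np.random.default_rng(seed)
        resample = rng.choice(data, size=data.size, replace=True)
        return statistic(resample)

    tasks = [one_bootstrap(seed) for seed in range(1000)]
    results = dask.compute(*tasks, scheduler="processes")  # spread over the 48 cores
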
7 votes · 3 answers

Faster alternatives to Pandas pivot_table

I'm using Pandas' pivot_table function on a large dataset (10 million rows, 6 columns). As execution time is paramount, I am trying to speed up the process. Currently it takes around 8 s to process the whole dataset, which is way too slow, and I hope to…
asked by pythoneer (403)
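
A common speed-up worth sketching: when only a single aggregation is needed, groupby + unstack computes the same table as pivot_table and is often faster (column names here are stand-ins):

    import numpy as np
    import pandas as pd

    n = 10_000_000
    df = pd.DataFrame({
        "row": np.random.randint(0, 1000, n),
        "col": np.random.randint(0, 6, n),
        "val": np.random.rand(n),
    })

    # Same result as pd.pivot_table(df, index="row", columns="col",
    # values="val", aggfunc="mean"):
    out = df.groupby(["row", "col"])["val"].mean().unstack("col")
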
7 votes · 2 answers

How to pass multiple arguments to dask.distributed.Client().map?

    import dask.distributed

    def f(x, y):
        return x, y

    client = dask.distributed.Client()
    client.map(f, [(1, 2), (2, 3)])

Does not work. …
asked by mathtick (6,487)
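
For reference, a sketch of the calling convention: Client.map mirrors the builtin map, taking one iterable per positional argument rather than an iterable of tuples:

    from dask.distributed import Client

    def f(x, y):
        return x, y

    client = Client()
    futures = client.map(f, [1, 2], [2, 3])  # calls f(1, 2) and f(2, 3)
    print(client.gather(futures))            # [(1, 2), (2, 3)]
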
7 votes · 1 answer

Get ID of Dask worker from within a task

Is there a worker ID, or some unique identifier that a dask worker can access programmatically from within a task?
asked by MRocklin (55,641)
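
A sketch using distributed's get_worker(), which is only valid inside a running task:

    from dask.distributed import Client, get_worker

    def whoami():
        w = get_worker()   # raises if called outside a task
        return w.address   # unique per worker; w.name is also available

    client = Client()
    futures = [client.submit(whoami, pure=False) for _ in range(4)]
    print(client.gather(futures))
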
7 votes · 4 answers

Renaming columns in dask dataframe

I have two questions about dask. First: The documentation for dask clearly states that you can rename columns with the same syntax as pandas. I am using dask 1.0.0. Any reason why I am getting these errors below? df =…
asked by Matt Elgazar (707)
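
A sketch of the non-mutating idiom: dask's rename is lazy and returns a new dataframe, so reassign the result; pandas-style inplace=True is the usual source of errors here:

    import dask.dataframe as dd
    import pandas as pd

    df = dd.from_pandas(pd.DataFrame({"a": [1, 2], "b": [3, 4]}), npartitions=1)
    df = df.rename(columns={"a": "x", "b": "y"})  # not df.rename(..., inplace=True)
    print(df.columns)
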
7 votes · 2 answers

Where is the pydata BLAZE project heading?

I find the blaze ecosystem* amazing because it covers most of the data engineering use cases. There was definitely a lot of interest in these projects during 2015-2016, but of late they seem to have been ignored. I say this looking at the commits on…
asked by human (2,250)

7 votes · 1 answer

Convert spark dataframe to dask dataframe

Is there a way to directly convert a Spark dataframe to a Dask dataframe? I currently use Spark's .toPandas() to convert it into a pandas dataframe and then into a dask dataframe. I believe this is an inefficient operation and is not…
asked by vva (133)
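
A sketch of the usual indirect route, assuming both frameworks can see a shared filesystem (the path is a stand-in): have Spark write Parquet and read the same files with dask, skipping the single-machine pandas hop:

    import dask.dataframe as dd

    # On the Spark side:
    #   spark_df.write.parquet("/shared/path/table.parquet")

    # On the dask side:
    ddf = dd.read_parquet("/shared/path/table.parquet")
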
7 votes · 1 answer

Dask For Loop In Parallel

I am trying to find the correct syntax for using a for loop with dask delayed. I have found several tutorials and other questions, but none fits my situation, which is extremely basic. First, is this the correct way to run a for-loop in…
asked by B_Miner (1,840)
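
A minimal sketch of the delayed for-loop pattern:

    import dask
    from dask import delayed

    @delayed
    def process(i):
        return i ** 2  # stand-in for the real loop body

    tasks = [process(i) for i in range(10)]  # builds the graph; nothing runs yet
    results = dask.compute(*tasks)           # executes the iterations in parallel
    print(results)  # (0, 1, 4, ..., 81)
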
7 votes · 1 answer

Local use of dask: to Client() or not to Client()?

I am trying to understand the usage patterns for Dask on a local machine. Specifically, I have a dataset that fits in memory, and I'd like to do some pandas operations (groupby, date parsing, etc.). Pandas performs these operations via a single core and…
asked by Jonathan (1,287)
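
A sketch of the two modes side by side: without a Client, dask falls back to its default local scheduler; creating one adds worker processes and the diagnostic dashboard:

    import dask.dataframe as dd
    import pandas as pd
    from dask.distributed import Client

    ddf = dd.from_pandas(
        pd.DataFrame({"key": list("abcd") * 25, "val": range(100)}),
        npartitions=4,
    )

    # No Client: uses the default threaded scheduler.
    print(ddf.groupby("key").val.mean().compute())

    # With a Client: the same code now runs on local worker processes,
    # with a dashboard for profiling.
    client = Client()
    print(ddf.groupby("key").val.mean().compute())
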
7 votes · 2 answers

dask dataframe head() returns empty df

I have a dask dataframe with an index on one of the columns. The issue is that df.head() always returns an empty df, whereas df.tail always returns the correct df. I checked that df.head always checks for the first n entries in the first…
asked by pranav kohli (123)
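
A sketch of the likely cause and fix: head() reads only the first partition by default, which can be empty after filtering or re-indexing; npartitions=-1 asks it to keep scanning:

    import dask.dataframe as dd
    import pandas as pd

    df = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=5)
    df = df[df.x > 7]                  # earlier partitions become empty

    print(df.head(5))                  # reads partition 0 only; may come back empty
    print(df.head(5, npartitions=-1))  # scans as many partitions as needed
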
7 votes · 1 answer

How do I check if there is an already running dask scheduler?

I want to start a local cluster from Python with a specific number of workers, and then connect a client to it:

    cluster = LocalCluster(n_workers=8, ip='127.0.0.1')
    client = Client(cluster)

But first, I want to check whether there is an existing local…
asked by medRa (73)
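
A sketch of one pragmatic check, assuming a scheduler would be listening on the default port 8786: try to connect with a short timeout and only fall back to starting a LocalCluster (mirroring the question's arguments) when that fails:

    from dask.distributed import Client, LocalCluster

    try:
        # Connect to an already-running local scheduler, if any.
        client = Client("tcp://127.0.0.1:8786", timeout="2s")
    except OSError:
        # None found: start our own.
        cluster = LocalCluster(n_workers=8, ip="127.0.0.1")
        client = Client(cluster)
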
7 votes · 2 answers

Dask dataframes: reading multiple files & storing filename in column

I regularly use dask.dataframe to read multiple files, like so:

    import dask.dataframe as dd
    df = dd.read_csv('*.csv')

However, the origin of each row, i.e. which file the data was read from, seems to be forever lost. Is there a way to add this as a…
asked by jpp (159,742)
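
A sketch using read_csv's include_path_column option (available in recent dask releases), which records the source file of each row in a column:

    import dask.dataframe as dd

    df = dd.read_csv("*.csv", include_path_column="filename")
    print(df["filename"].unique().compute())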