Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers (see the sketch after this list).
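
A minimal sketch of both components, assuming only a standard local install of dask (plus pandas):

    import dask
    import dask.dataframe as dd
    import pandas as pd

    # Dynamic task scheduling: build a small task graph lazily with dask.delayed.
    @dask.delayed
    def inc(x):
        return x + 1

    total = dask.delayed(sum)([inc(i) for i in range(10)])
    print(total.compute())  # the scheduler runs the graph in parallel: 55

    # "Big Data" collections: a dask DataFrame partitions a pandas DataFrame
    # and runs familiar pandas operations per partition on the same schedulers.
    pdf = pd.DataFrame({"x": range(100)})
    ddf = dd.from_pandas(pdf, npartitions=4)
    print(ddf.x.mean().compute())  # 49.5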

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions

7 votes · 1 answer

Retries in dask.compute() are unclear

From the documentation: "Number of allowed automatic retries if computing a result fails." Does "result" refer to each individual task or the entire compute() call? If it refers to the entire call, how to implement retries for each task in…
asked by Michał Zawadzki (695)
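
With the distributed scheduler, retries can be given per task via Client.map or Client.submit; a minimal sketch (flaky is a made-up stand-in that fails at random):

    from dask.distributed import Client
    import random

    def flaky(x):
        # Made-up function that fails randomly to exercise the retry machinery.
        if random.random() < 0.3:
            raise RuntimeError("transient failure")
        return x * 2

    client = Client()  # local cluster
    futures = client.map(flaky, range(10), retries=3)  # each task retried up to 3 times
    print(client.gather(futures))
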
7 votes · 2 answers

How to properly use dask's upload_file() to pass local code to workers

I have functions in a local_code.py file that I would like to pass to workers through dask. I've seen answers to questions on here saying that this can be done using the upload_file() function, but I can't seem to get it working because I'm still…
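
A sketch of the usual pattern, assuming local_code.py sits in the working directory and defines a function my_func (a hypothetical name):

    from dask.distributed import Client

    client = Client()
    client.upload_file("local_code.py")  # ships the module to every connected worker

    def run(x):
        # Import inside the task so each worker resolves the uploaded module.
        import local_code
        return local_code.my_func(x)

    results = client.gather(client.map(run, range(4)))

Note that the file is sent to workers connected at upload time; workers that join later will not have it.
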
7 votes · 2 answers

Dask Equivalent of pd.to_numeric

I am trying to read multiple CSV files, each around 15 GB, using dask's read_csv. While performing this task, dask interprets a particular column as float; however, it has a few values of string type, and it later fails when I try to…
asked by Karrtik Iyer (131)
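
One possible approach, sketched here: read the ambiguous column as object and convert it per partition with pandas' to_numeric, coercing bad strings to NaN ("colname" is a stand-in):

    import dask.dataframe as dd
    import pandas as pd

    df = dd.read_csv("data-*.csv", dtype={"colname": "object"})
    df["colname"] = df["colname"].map_partitions(
        pd.to_numeric, errors="coerce", meta=("colname", "f8")
    )
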
7 votes · 2 answers

Parallelized bootstrapping with replacement with xarray/dask

I want to perform N=1000 bootstraps with replacement on gridded data. One computation takes about 0.5 s. I have access to an exclusive 48-core node on a supercomputer. Because the resamples are independent of each other, I naively hope to…
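
A sketch of one way to parallelize the resamples with dask.delayed, assuming the data is a NumPy array and statistic() stands in for the real per-sample computation:

    import numpy as np
    import dask

    data = np.random.default_rng(0).normal(size=10_000)

    def statistic(sample):
        return sample.mean()  # stand-in for the real 0.5 s computation

    @dask.delayed
    def one_bootstrap(seed):
        rng = np.random.default_rng(seed)
        resample = rng.choice(data, size=data.size, replace=True)
        return statistic(resample)

    tasks = [one_bootstrap(seed) for seed in range(1000)]
    results = dask.compute(*tasks, scheduler="processes")  # spread over the 48 cores
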
7 votes · 3 answers

Faster alternatives to Pandas pivot_table

I'm using Pandas' pivot_table function on a large dataset (10 million rows, 6 columns). As execution time is paramount, I am trying to speed up the process. Currently it takes around 8 s to process the whole dataset, which is way too slow, and I hope to…
asked by pythoneer (403)
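
A common speed-up worth sketching: when only a single aggregation is needed, groupby + unstack computes the same table as pivot_table and is often faster (column names here are stand-ins):

    import numpy as np
    import pandas as pd

    n = 10_000_000
    df = pd.DataFrame({
        "row": np.random.randint(0, 1000, n),
        "col": np.random.randint(0, 6, n),
        "val": np.random.rand(n),
    })

    # Same result as pd.pivot_table(df, index="row", columns="col",
    # values="val", aggfunc="mean"):
    out = df.groupby(["row", "col"])["val"].mean().unstack("col")
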
7 votes · 2 answers

How to pass multiple arguments to dask.distributed.Client().map?

    import dask.distributed

    def f(x, y):
        return x, y

    client = dask.distributed.Client()
    client.map(f, [(1, 2), (2, 3)])

Does not work. …
asked by mathtick (6,487)
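
For reference, a sketch of the calling convention: Client.map mirrors the builtin map, taking one iterable per positional argument rather than an iterable of tuples:

    from dask.distributed import Client

    def f(x, y):
        return x, y

    client = Client()
    futures = client.map(f, [1, 2], [2, 3])  # calls f(1, 2) and f(2, 3)
    print(client.gather(futures))            # [(1, 2), (2, 3)]
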
7 votes · 1 answer

Get ID of Dask worker from within a task

Is there a worker ID, or some unique identifier that a dask worker can access programmatically from within a task?
asked by MRocklin (55,641)
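
A sketch using distributed's get_worker(), which is only valid inside a running task:

    from dask.distributed import Client, get_worker

    def whoami():
        w = get_worker()   # raises if called outside a task
        return w.address   # unique per worker; w.name is also available

    client = Client()
    futures = [client.submit(whoami, pure=False) for _ in range(4)]
    print(client.gather(futures))
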
7 votes · 4 answers

Renaming columns in dask dataframe

I have two questions about dask. First: The documentation for dask clearly states that you can rename columns with the same syntax as pandas. I am using dask 1.0.0. Any reason why I am getting these errors below? df =…
asked by Matt Elgazar (707)
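
A sketch of the non-mutating idiom: dask's rename is lazy and returns a new dataframe, so reassign the result; pandas-style inplace=True is the usual source of errors here:

    import dask.dataframe as dd
    import pandas as pd

    df = dd.from_pandas(pd.DataFrame({"a": [1, 2], "b": [3, 4]}), npartitions=1)
    df = df.rename(columns={"a": "x", "b": "y"})  # not df.rename(..., inplace=True)
    print(df.columns)
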
7 votes · 2 answers

Where is the pydata BLAZE project heading?

I find the blaze ecosystem* amazing because it covers most of the data engineering use cases. There was definitely a lot of interest in these projects during 2015-2016, but of late they seem to have been ignored. I say this looking at the commits on…
asked by human (2,250)

7 votes · 1 answer

Convert spark dataframe to dask dataframe

Is there a way to directly convert a Spark dataframe to a Dask dataframe? I currently use Spark's .toPandas() to convert it into a pandas dataframe and then into a dask dataframe. I believe this is an inefficient operation and is not…
asked by vva (133)
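
A sketch of the usual indirect route, assuming both frameworks can see a shared filesystem (the path is a stand-in): have Spark write Parquet and read the same files with dask, skipping the single-machine pandas hop:

    import dask.dataframe as dd

    # On the Spark side:
    #   spark_df.write.parquet("/shared/path/table.parquet")

    # On the dask side:
    ddf = dd.read_parquet("/shared/path/table.parquet")
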
7 votes · 1 answer

Dask For Loop In Parallel

I am trying to find the correct syntax for using a for loop with dask delayed. I have found several tutorials and other questions, but none fits my situation, which is extremely basic. First, is this the correct way to run a for-loop in…
asked by B_Miner (1,840)
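
A minimal sketch of the delayed for-loop pattern:

    import dask
    from dask import delayed

    @delayed
    def process(i):
        return i ** 2  # stand-in for the real loop body

    tasks = [process(i) for i in range(10)]  # builds the graph; nothing runs yet
    results = dask.compute(*tasks)           # executes the iterations in parallel
    print(results)  # (0, 1, 4, ..., 81)
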
7 votes · 1 answer

Local use of dask: to Client() or not to Client()?

I am trying to understand the usage patterns for Dask on a local machine. Specifically, I have a dataset that fits in memory, and I'd like to do some pandas operations (groupby, date parsing, etc.). Pandas performs these operations via a single core and…
asked by Jonathan (1,287)
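
A sketch of the two modes side by side: without a Client, dask falls back to its default local scheduler; creating one adds worker processes and the diagnostic dashboard:

    import dask.dataframe as dd
    import pandas as pd
    from dask.distributed import Client

    ddf = dd.from_pandas(
        pd.DataFrame({"key": list("abcd") * 25, "val": range(100)}),
        npartitions=4,
    )

    # No Client: uses the default threaded scheduler.
    print(ddf.groupby("key").val.mean().compute())

    # With a Client: the same code now runs on local worker processes,
    # with a dashboard for profiling.
    client = Client()
    print(ddf.groupby("key").val.mean().compute())
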
7 votes · 2 answers

dask dataframe head() returns empty df

I have a dask dataframe with an index on one of the columns. The issue is that df.head() always returns an empty df, whereas df.tail always returns the correct df. I checked that df.head always checks for the first n entries in the first…
asked by pranav kohli (123)
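
A sketch of the likely cause and fix: head() reads only the first partition by default, which can be empty after filtering or re-indexing; npartitions=-1 asks it to keep scanning:

    import dask.dataframe as dd
    import pandas as pd

    df = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=5)
    df = df[df.x > 7]                  # earlier partitions become empty

    print(df.head(5))                  # reads partition 0 only; may come back empty
    print(df.head(5, npartitions=-1))  # scans as many partitions as needed
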
7 votes · 1 answer

How do I check if there is an already running dask scheduler?

I want to start a local cluster from Python with a specific number of workers, and then connect a client to it:

    cluster = LocalCluster(n_workers=8, ip='127.0.0.1')
    client = Client(cluster)

But first, I want to check whether there is an existing local…
asked by medRa (73)
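
A sketch of one pragmatic check, assuming a scheduler would be listening on the default port 8786: try to connect with a short timeout and only fall back to starting a LocalCluster (mirroring the question's arguments) when that fails:

    from dask.distributed import Client, LocalCluster

    try:
        # Connect to an already-running local scheduler, if any.
        client = Client("tcp://127.0.0.1:8786", timeout="2s")
    except OSError:
        # None found: start our own.
        cluster = LocalCluster(n_workers=8, ip="127.0.0.1")
        client = Client(cluster)
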
7 votes · 2 answers

Dask dataframes: reading multiple files & storing filename in column

I regularly use dask.dataframe to read multiple files, like so:

    import dask.dataframe as dd
    df = dd.read_csv('*.csv')

However, the origin of each row, i.e. which file the data was read from, seems to be forever lost. Is there a way to add this as a…
asked by jpp (159,742)
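
A sketch using read_csv's include_path_column option (available in recent dask releases), which records the source file of each row in a column:

    import dask.dataframe as dd

    df = dd.read_csv("*.csv", include_path_column="filename")
    print(df["filename"].unique().compute())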