Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
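
As a rough illustration of the two components listed above, here is a minimal sketch. The helper functions `increment` and `add` are hypothetical names chosen for this example, and the `"synchronous"` scheduler is picked only to keep the run single-threaded and easy to follow; other schedulers would work too.

```python
import dask
import dask.array as da
from dask import delayed


@delayed
def increment(x):
    # Hypothetical helper: wrapped so that calling it only records a task.
    return x + 1


@delayed
def add(x, y):
    # Hypothetical helper: combines the results of two upstream tasks.
    return x + y


# Component 1: dynamic task scheduling. This builds a small task graph
# lazily; nothing is executed at this point.
total = add(increment(1), increment(2))

# Component 2: a "big data" collection. A chunked array that mimics the
# NumPy interface; operations on it also just extend a task graph.
arr = da.arange(10, chunks=5)
arr_sum = (arr * 2).sum()

# dask.compute() hands both graphs to a scheduler and returns the results.
total_value, arr_sum_value = dask.compute(
    total, arr_sum, scheduler="synchronous"
)
```

Both `total` and `arr_sum` stay lazy until `dask.compute()` is called, which is the sense in which the collections run on top of the task schedulers.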

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
12
votes
2 answers

Drop column using Dask dataframe

This should work: raw_data.drop('some_great_column', axis=1).compute() But the column is not dropped. In pandas I use: raw_data.drop(['some_great_column'], axis=1, inplace=True) But inplace does not exist in Dask. Any ideas?
cs0815
  • 16,751
  • 45
  • 136
  • 299
12
votes
1 answer

What causes dask job failure with CancelledError exception

I have been seeing the error message below for quite some time now but could not figure out what leads to the failure. Error: concurrent.futures._base.CancelledError: ('sort_index-f23b0553686b95f2d91d4a3fda85f229', 7) On restart of dask cluster it runs…
Santosh Kumar
  • 761
  • 5
  • 28
12
votes
1 answer

dask.multiprocessing or pandas + multiprocessing.pool: what's the difference?

I'm developing a model for financial purposes. I have the entire S&P500 components inside a folder, stored as many .hdf files. Each .hdf file has its own multi-index (year-week-minute). An example of the sequential code (non-parallelized): import…
ilpomo
  • 657
  • 2
  • 5
  • 19
12
votes
1 answer

Unpacking result of delayed function

While converting my program using delayed, I stumbled upon a commonly used programming pattern that doesn't work with delayed. Example: from dask import delayed @delayed def myFunction(): return 1,2 a, b = myFunction() a.compute() Raises:…
Henk
  • 145
  • 1
  • 6
12
votes
1 answer

How do I run a dask.distributed cluster in a single thread?

How can I run a complete Dask.distributed cluster in a single thread? I want to use this for debugging or profiling. Note: this is a frequently asked question. I'm adding the question and answer here to Stack Overflow just for future reuse.
MRocklin
  • 55,641
  • 23
  • 163
  • 235
12
votes
3 answers

How to call unique() on dask DataFrame

How do I call unique on a dask DataFrame? I get the following error if I try to call it the same way as for a regular pandas dataframe: In [27]: len(np.unique(ddf[['col1','col2']].values)) AttributeError Traceback (most…
femibyte
  • 3,317
  • 7
  • 34
  • 59
12
votes
3 answers

Create an if-else condition column in dask dataframe

I need to create a column based on some condition in a dask dataframe. In pandas it is fairly straightforward: ddf['TEST_VAR'] = ['THIS' if x == 200607 else 'NOT THIS' if x == 200608 else 'THAT' if x == 200609…
Puneet Tripathi
  • 412
  • 3
  • 15
12
votes
1 answer

Dask DataFrame Groupby Partitions

I have some fairly large CSV files (~10 GB) and would like to take advantage of dask for analysis. However, depending on the number of partitions I set the dask object to read in with, my groupby results change. My understanding was that dask took…
Bhage
  • 121
  • 1
  • 5
11
votes
2 answers

Nested numpy arrays in dask and pandas dataframes

A common use case in machine/deep learning code that works on image and audio is to load and manipulate large datasets of images or audio segments. Almost always, the entries in these datasets are represented by an image/audio segment and metadata…
stav
  • 1,497
  • 2
  • 15
  • 40
11
votes
1 answer

Loading hdf5 files into python xarrays

The Python module xarray has great support for loading/mapping netCDF files, even lazily with dask. The data I have to work with consists of thousands of hdf5 files, with lots of groups, datasets, and attributes, all created with h5py. The question is: How…
fmfreeze
  • 197
  • 1
  • 11
11
votes
2 answers

dask: specify number of processes

I am trying to use dask to do some embarrassingly parallel processing. For some reason I have to use dask, but the task could be easily achieved using multiprocessing.Pool(5).map. For example: import dask from dask import compute, delayed def…
piokuc
  • 25,594
  • 11
  • 72
  • 102
11
votes
1 answer

Sorting in Dask

I want to find an alternative to the pandas.DataFrame.sort_values function in dask. I came across set_index, but it sorts on a single column only. How can I sort a Dask dataframe on multiple columns?
Dhruv Kumar
  • 399
  • 2
  • 13
11
votes
1 answer

How do I find the length of a dataframe in dask?

How do I find the length of a dataframe using dask? For example in pandas, I can do: import pandas as pd import numpy as np df = pd.DataFrame(np.random.normal(0, 1, (5, 2)), columns=["A", "B"]) print df['A'].count() print df Output: 5 A …
C. L.
  • 143
  • 1
  • 1
  • 6
11
votes
2 answers

add a dask.array column to a dask.dataframe

I have a dask dataframe and a dask array with the same number of rows in the same logical order. The dataframe rows are indexed by strings. I am trying to add one of the array columns to the dataframe. I have tried several ways all of which failed…
Daniel Mahler
  • 7,653
  • 5
  • 51
  • 90
11
votes
2 answers

Ways to handle exceptions in Dask distributed

I'm having a lot of success using Dask and Distributed to develop data analysis pipelines. One thing that I'm still looking forward to improving, however, is the way I handle exceptions. Right now, if I write the following def my_function (value): …
ajmazurie
  • 509
  • 4
  • 8