Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components:

Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
“Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
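As a rough illustration of the two components described above, the sketch below first builds a small task graph with `dask.delayed` and then sums a chunked `dask.array`. The function names `inc` and `add`, the input values, and the array shape are made up for this example.

```python
import dask
import dask.array as da


# Component 1: dynamic task scheduling.
# Decorated functions are not run when called; each call records a task.
@dask.delayed
def inc(x):
    return x + 1


@dask.delayed
def add(a, b):
    return a + b


total = add(inc(1), inc(2))   # lazy task graph, nothing computed yet
result = total.compute()      # the scheduler executes the graph

# Component 2: a "big data" collection with a NumPy-like interface.
# The array is split into 250x250 chunks that are processed as separate tasks.
x = da.ones((1000, 1000), chunks=(250, 250))
total_sum = x.sum().compute()
```

Both halves use the same schedulers underneath; the array sum is turned into a task graph in the same way as the hand-written `delayed` calls.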

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions

2 answers

Drop column using Dask dataframe

This should work: raw_data.drop('some_great_column', axis=1).compute() But the column is not dropped. In pandas I use: raw_data.drop(['some_great_column'], axis=1, inplace=True) But inplace does not exist in Dask. Any ideas?

asked Aug 09 '18 at 14:29

cs0815

16,751
45
136
299

1 answer

What causes dask job failure with CancelledError exception

I have been seeing the error message below for quite some time now but could not figure out what leads to the failure. Error: concurrent.futures._base.CancelledError: ('sort_index-f23b0553686b95f2d91d4a3fda85f229', 7) On restart of dask cluster it runs…

dask dask-distributed

asked Oct 19 '17 at 19:25

Santosh Kumar

1 answer

dask.multiprocessing or pandas + multiprocessing.pool: what's the difference?

I'm developing a model for financial purposes. I have the entire S&P500 components inside a folder, stored as many .hdf files. Each .hdf file has its own multi-index (year-week-minute). An example of the sequential code (non-parallelized): import…

python multithreading pandas multiprocessing dask

asked Oct 15 '17 at 11:27

ilpomo

1 answer

Unpacking result of delayed function

While converting my program using delayed, I stumbled upon a commonly used programming pattern that doesn't work with delayed. Example: from dask import delayed @delayed def myFunction(): return 1,2 a, b = myFunction() a.compute() Raises:…

python dask dask-delayed

asked Jun 17 '17 at 09:10

Henk
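For the tuple-unpacking question above, one approach that may help is the `nout` keyword of `dask.delayed`, which tells Dask how many outputs the call produces so the lazy result can be unpacked. A minimal sketch, reusing the function from the question under a renamed identifier:

```python
from dask import delayed


# nout=2 declares that the function returns two values,
# so the Delayed result can be iterated into two Delayed objects.
@delayed(nout=2)
def my_function():
    return 1, 2


a, b = my_function()   # a and b are still lazy
a_value = a.compute()
b_value = b.compute()
```

Without `nout`, Dask cannot know the length of the result before running the function, which is why plain unpacking fails.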

1 answer

How do I run a dask.distributed cluster in a single thread?

How can I run a complete Dask.distributed cluster in a single thread? I want to use this for debugging or profiling. Note: this is a frequently asked question. I'm adding the question and answer here to Stack Overflow just for future reuse.

python dask

asked May 26 '17 at 05:04

MRocklin

55,641
23
163
235

3 answers

How to call unique() on dask DataFrame

How do I call unique on a dask DataFrame ? I get the following error if I try to call it the same way as for a regular pandas dataframe: In [27]: len(np.unique(ddf[['col1','col2']].values)) AttributeError Traceback (most…

pandas dask

asked Nov 28 '16 at 15:54

femibyte

3,317
7
34
59

3 answers

Create an if-else condition column in dask dataframe

I need to create a column which is based on some condition on a dask dataframe. In pandas it is fairly straightforward: ddf['TEST_VAR'] = ['THIS' if x == 200607 else 'NOT THIS' if x == 200608 else 'THAT' if x == 200609…

python pandas dask

asked Jul 27 '16 at 09:03

Puneet Tripathi

1 answer

Dask DataFrame Groupby Partitions

I have some fairly large csv files (~10gb) and would like to take advantage of dask for analysis. However, depending on the number of partitions I set the dask object to read in with, my groupby results change. My understanding was that dask took…

python pandas dask

asked Feb 06 '16 at 00:06

Bhage

2 answers

Nested numpy arrays in dask and pandas dataframes

A common use case in machine/deep learning code that works on image and audio is to load and manipulate large datasets of images or audio segments. Almost always, the entries in these datasets are represented by an image/audio segment and metadata…

python pandas numpy dask

asked Mar 23 '19 at 15:36

stav

1,497
2
15
40

1 answer

Loading hdf5 files into python xarrays

The python module xarray greatly supports loading/mapping netCDF files, even lazily with dask. The data source I have to work with are thousands of hdf5 files, with lots of groups, datasets, attributes - all created with h5py. The Question is: How…

python hdf5 dask h5py python-xarray

asked Feb 11 '19 at 11:15

fmfreeze

2 answers

dask: specify number of processes

I am trying to use dask to do some embarrassingly parallel processing. For some reason I have to use dask but the task could be easily achieved using multiprocessing.Pool(5).map. For example: import dask from dask import compute, delayed def…

python dask

asked Jul 11 '18 at 12:24

piokuc

25,594
11
72
102

1 answer

Sorting in Dask

I want to find an alternative to the pandas.DataFrame.sort_values function in dask. I came across set_index, but it sorts on a single column only. How can I sort a Dask dataframe by multiple columns?

sorting dask dask-distributed dask-delayed

asked Jun 12 '18 at 04:54

Dhruv Kumar

1 answer

How do I find the length of a dataframe in dask?

How do I find the length of a dataframe using dask? For example in pandas, I can do: import pandas as pd import numpy as np df = pd.DataFrame(np.random.normal(0, 1, (5, 2)), columns=["A", "B"]) print df['A'].count() print df Output: 5 A …

python pandas dask

asked May 28 '18 at 15:02

C. L.

2 answers

add a dask.array column to a dask.dataframe

I have a dask dataframe and a dask array with the same number of rows in the same logical order. The dataframe rows are indexed by strings. I am trying to add one of the array columns to the dataframe. I have tried several ways all of which failed…

python dataframe dask

asked Jan 08 '18 at 21:24

Daniel Mahler

7,653
5
51
90

2 answers

Ways to handle exceptions in Dask distributed

I'm having a lot of success using Dask and Distributed to develop data analysis pipelines. One thing that I'm still looking forward to improving, however, is the way I handle exceptions. Right now, if I write the following def my_function (value): …

python dask

asked Feb 28 '17 at 22:31

ajmazurie
