Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
6 votes, 1 answer

How do I convert a Dask Dataframe into a Dask Array?

I have a dask dataframe object but would like to have a dask array. How do I accomplish this?
MRocklin
6 votes, 2 answers

With xarray, how to parallelize 1D operations on a multidimensional Dataset?

I have a 4D xarray Dataset. I want to carry out a linear regression between two variables on a specific dimension (here time), and keep the regression parameters in a 3D array (the remaining dimensions). I managed to get the results I want by using…
LCT
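One common pattern for this kind of problem is `xr.apply_ufunc` with the regression dimension declared as a core dimension. A hedged sketch with toy data (variable names `u`/`v` are illustrative; with dask-backed variables you would also pass `dask="parallelized"` and `output_dtypes`):

```python
import numpy as np
import xarray as xr

# toy dataset: two variables on (time, x, y)
ds = xr.Dataset(
    {
        "u": (("time", "x", "y"), np.random.rand(10, 3, 4)),
        "v": (("time", "x", "y"), np.random.rand(10, 3, 4)),
    }
)

def slope(a, b):
    # 1-D core function: linear-regression slope of b against a
    return np.polyfit(a, b, 1)[0]

# vectorize=True loops slope() over the non-core (x, y) dims;
# "time" is consumed as the core dimension of both inputs
result = xr.apply_ufunc(
    slope, ds["u"], ds["v"],
    input_core_dims=[["time"], ["time"]],
    vectorize=True,
)
print(result.dims)  # ('x', 'y')
```

The output keeps only the remaining dimensions, i.e. a 2-D (here) or 3-D (in the 4D case from the question) array of regression parameters.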
6 votes, 0 answers

streamz exception on dask gather

I am trying to use streamz to manage an image-processing pipeline. I am simulating a camera using the class Camera and have set up the pipeline below. import time from skimage.io import imread from skimage.color import rgb2gray class…
Michael Hansen
6 votes, 1 answer

Converting Dask Scalar to integer value (or save it to text file)

I have calculated a sum using dask: from dask import dataframe; all_data = dataframe.read_csv(path); total_sum = all_data.account_balance.sum(). The csv file has a column named account_balance. The total_sum is a dd.Scalar object, which seems to be…
user9885031
6 votes, 1 answer

Why does Dask array throw memory error when Numpy doesn't on dot product calculation?

I am working on comparing the calculation speed of Dask and Numpy for different data sizes. I understand that Dask can perform computations of data in parallel, and it splits up the data into chunks so that the data size can be larger than RAM. When…
dtretiak
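A frequent cause of this behavior is chunking: a dask matrix product builds one task per pair of chunks, and with too many (or too large) chunks the intermediate chunk products can exceed what a single dense NumPy computation needs. A hedged sketch of a chunked product that stays lazy until computed (sizes are illustrative):

```python
import dask.array as da

# two chunked matrices; each 250x250 chunk is small even when the
# full arrays would be uncomfortable to hold at once
x = da.random.random((1000, 1000), chunks=(250, 250))
y = da.random.random((1000, 1000), chunks=(250, 250))

z = (x @ y).sum()   # lazy task graph, nothing computed yet
print(z.compute())  # peak memory depends on chunk sizes, not array size
```

Tuning `chunks` (and watching how many intermediate chunk products are alive at once) is usually the lever for dot-product memory errors.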
6 votes, 3 answers

Dask DataFrame - Prediction of Keras Model

I am working for the first time with dask and trying to run predict() from a trained keras model. If I don't use dask, the function works fine (i.e. pd.DataFrame() versus dd.DataFrame()). With Dask the error is below. Is this not a common use case…
B_Miner
6 votes, 1 answer

Dask prints warning to use client.scatter although I'm using the suggested approach

In dask distributed I get the following warning, which I would not expect: /home/miniconda3/lib/python3.6/site-packages/distributed/worker.py:739: UserWarning: Large object of size 1.95 MB detected in task graph: …
dennis-w
6 votes, 2 answers

dask dataframes - time series partitions

I have a timeseries pandas dataframe that I want to partition by month and year. My thought was to get a list of datetimes that would serve as the index, but the break doesn't happen at the start 0:00 at the first of the…
user3757265
6 votes, 3 answers

Appending new column to dask dataframe

This is a follow up question to Shuffling data in dask. I have an existing dask dataframe df where I wish to do the following: df['rand_index'] = np.random.permutation(len(df)) However, this gives the error, Column assignment doesn't support type…
sachinruk
6
votes
3 answers

How to read multiple parquet files (with same schema) from multiple directories with dask/fastparquet

I need to use dask to load multiple parquet files with identical schema into a single dataframe. This works when they are all in the same directory, but not when they're in separate directories. For example: import fastparquet pfile =…
Tim Morton
  • 240
  • 1
  • 3
  • 11
6
votes
1 answer

Constructing Mode and Corresponding Count Functions Using Custom Aggregation Functions for GroupBy in Dask

So dask has now been updated to support custom aggregation functions for groupby. (Thanks to the dev team and @chmp for working on this!). I am currently trying to construct a mode function and corresponding count function. Basically what I envision…
user48944
  • 311
  • 1
  • 14
6
votes
2 answers

Dask Distributed - how to run one task per worker, making that task running on all cores available into the worker?

I'm very new at using distributed python library. I have 4 workers and i have successfully launched some parallel runs using 14 cores (among the 16 available) for each worker, resulting in 4*14=56 tasks running in parallel. But how to proceed if I…
Youcef
  • 223
  • 2
  • 7
6
votes
1 answer

Do xarray or dask really support memory-mapping?

In my experimentation so far, I've tried: xr.open_dataset with chunks arg, and it loads the data into memory. Set up a NetCDF4DataStore, and call ds['field'].values and it loads the data into memory. Set up a ScipyDataStore with mmap='r', and…
6
votes
2 answers

How to rename the index of a Dask Dataframe

How would I go about renaming the index on a dask dataframe? I tried it like so df.index.name = 'foo' but rechecking df.index.name shows it still being whatever it was previously.
Samantha Hughes
  • 593
  • 1
  • 6
  • 13
6
votes
1 answer

Subset dask dataframe by column position

Once I have a dask dataframe, how can I selectively pull columns into an in-memory pandas DataFrame? Say I have an N x M dataframe. How can I create an N x m dataframe where m << M and is arbitrary. from sklearn.datasets import load_iris import…
Zelazny7
  • 39,946
  • 18
  • 70
  • 84