Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
6 votes, 1 answer

How do I convert a Dask Dataframe into a Dask Array?

I have a dask dataframe object but would like to have a dask array. How do I accomplish this?
MRocklin
6 votes, 2 answers

With xarray, how to parallelize 1D operations on a multidimensional Dataset?

I have a 4D xarray Dataset. I want to carry out a linear regression between two variables on a specific dimension (here time), and keep the regression parameters in a 3D array (the remaining dimensions). I managed to get the results I want by using…
LCT
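One common pattern for this kind of problem is `xr.apply_ufunc` with the regression dimension declared as a core dimension. A hedged sketch with toy data (variable names `u`/`v` are illustrative; with dask-backed variables you would also pass `dask="parallelized"` and `output_dtypes`):

```python
import numpy as np
import xarray as xr

# toy dataset: two variables on (time, x, y)
ds = xr.Dataset(
    {
        "u": (("time", "x", "y"), np.random.rand(10, 3, 4)),
        "v": (("time", "x", "y"), np.random.rand(10, 3, 4)),
    }
)

def slope(a, b):
    # 1-D core function: linear-regression slope of b against a
    return np.polyfit(a, b, 1)[0]

# vectorize=True loops slope() over the non-core (x, y) dims;
# "time" is consumed as the core dimension of both inputs
result = xr.apply_ufunc(
    slope, ds["u"], ds["v"],
    input_core_dims=[["time"], ["time"]],
    vectorize=True,
)
print(result.dims)  # ('x', 'y')
```

The output keeps only the remaining dimensions, i.e. a 2-D (here) or 3-D (in the 4D case from the question) array of regression parameters.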
6 votes, 0 answers

streamz exception on dask gather

I am trying to use streamz to manage an image-processing pipeline. I am simulating a camera using the class Camera and have set up the pipeline below. import time from skimage.io import imread from skimage.color import rgb2gray class…
Michael Hansen
6 votes, 1 answer

Converting Dask Scalar to integer value (or save it to text file)

I have calculated a sum using dask: from dask import dataframe; all_data = dataframe.read_csv(path); total_sum = all_data.account_balance.sum(). The csv file has a column named account_balance. The total_sum is a dd.Scalar object, which seems to be…
user9885031
6 votes, 1 answer

Why does Dask array throw memory error when Numpy doesn't on dot product calculation?

I am working on comparing the calculation speed of Dask and Numpy for different data sizes. I understand that Dask can perform computations of data in parallel, and it splits up the data into chunks so that the data size can be larger than RAM. When…
dtretiak
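A frequent cause of this behavior is chunking: a dask matrix product builds one task per pair of chunks, and with too many (or too large) chunks the intermediate chunk products can exceed what a single dense NumPy computation needs. A hedged sketch of a chunked product that stays lazy until computed (sizes are illustrative):

```python
import dask.array as da

# two chunked matrices; each 250x250 chunk is small even when the
# full arrays would be uncomfortable to hold at once
x = da.random.random((1000, 1000), chunks=(250, 250))
y = da.random.random((1000, 1000), chunks=(250, 250))

z = (x @ y).sum()   # lazy task graph, nothing computed yet
print(z.compute())  # peak memory depends on chunk sizes, not array size
```

Tuning `chunks` (and watching how many intermediate chunk products are alive at once) is usually the lever for dot-product memory errors.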
6 votes, 3 answers

Dask DataFrame - Prediction of Keras Model

I am working for the first time with dask and trying to run predict() from a trained keras model. If I don't use dask, the function works fine (i.e. pd.DataFrame() versus dd.DataFrame()). With Dask the error is below. Is this not a common use case…
B_Miner
6 votes, 1 answer

Dask prints warning to use client.scatter although I'm using the suggested approach

In dask distributed I get the following warning, which I would not expect: /home/miniconda3/lib/python3.6/site-packages/distributed/worker.py:739: UserWarning: Large object of size 1.95 MB detected in task graph: …
dennis-w
6 votes, 2 answers

dask dataframes - time series partitions

I have a timeseries pandas dataframe that I want to partition by month and year. My thought was to get a list of datetimes that would serve as the index, but the break doesn't happen at the start 0:00 at the first of the…
user3757265
6 votes, 3 answers

Appending new column to dask dataframe

This is a follow up question to Shuffling data in dask. I have an existing dask dataframe df where I wish to do the following: df['rand_index'] = np.random.permutation(len(df)) However, this gives the error, Column assignment doesn't support type…
sachinruk
6
votes
3 answers

How to read multiple parquet files (with same schema) from multiple directories with dask/fastparquet

I need to use dask to load multiple parquet files with identical schema into a single dataframe. This works when they are all in the same directory, but not when they're in separate directories. For example: import fastparquet pfile =…
Tim Morton
  • 240
  • 1
  • 3
  • 11
6
votes
1 answer

Constructing Mode and Corresponding Count Functions Using Custom Aggregation Functions for GroupBy in Dask

So dask has now been updated to support custom aggregation functions for groupby. (Thanks to the dev team and @chmp for working on this!). I am currently trying to construct a mode function and corresponding count function. Basically what I envision…
user48944
  • 311
  • 1
  • 14
6
votes
2 answers

Dask Distributed - how to run one task per worker, making that task running on all cores available into the worker?

I'm very new at using distributed python library. I have 4 workers and i have successfully launched some parallel runs using 14 cores (among the 16 available) for each worker, resulting in 4*14=56 tasks running in parallel. But how to proceed if I…
Youcef
  • 223
  • 2
  • 7
6
votes
1 answer

Do xarray or dask really support memory-mapping?

In my experimentation so far, I've tried: xr.open_dataset with chunks arg, and it loads the data into memory. Set up a NetCDF4DataStore, and call ds['field'].values and it loads the data into memory. Set up a ScipyDataStore with mmap='r', and…
6
votes
2 answers

How to rename the index of a Dask Dataframe

How would I go about renaming the index on a dask dataframe? I tried it like so df.index.name = 'foo' but rechecking df.index.name shows it still being whatever it was previously.
Samantha Hughes
  • 593
  • 1
  • 6
  • 13
6
votes
1 answer

Subset dask dataframe by column position

Once I have a dask dataframe, how can I selectively pull columns into an in-memory pandas DataFrame? Say I have an N x M dataframe. How can I create an N x m dataframe where m << M and is arbitrary. from sklearn.datasets import load_iris import…
Zelazny7
  • 39,946
  • 18
  • 70
  • 84