Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components, sketched briefly below the list:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
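
A minimal sketch showing both pieces together (the array sizes and the toy delayed functions below are illustrative, not taken from the Dask docs):

    import dask
    import dask.array as da

    # "Big Data" collection: a parallel array with a NumPy-like interface
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
    col_means = x.mean(axis=0)          # builds a lazy task graph

    # Dynamic task scheduling: dask.delayed turns plain functions into graph nodes
    @dask.delayed
    def double(n):
        return 2 * n

    total = dask.delayed(sum)([double(i) for i in range(5)])

    # Nothing runs until compute() hands the graphs to a scheduler
    print(col_means.compute().shape, total.compute())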

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
10 votes · 3 answers

Dask Array from DataFrame

Is there a way to easily convert a DataFrame of numeric values into an Array? Similar to .values with a pandas DataFrame. I can't seem to find any way to do this with the provided API, but I'd assume it's a common operation.
Paul English
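
For reference, a hedged sketch of the usual conversion (to_dask_array exists in reasonably recent Dask releases; .values also works but yields unknown chunk sizes):

    import pandas as pd
    import dask.dataframe as dd

    # Toy numeric DataFrame for illustration
    pdf = pd.DataFrame({"a": range(10), "b": range(10, 20)})
    df = dd.from_pandas(pdf, npartitions=2)

    # lengths=True computes partition lengths so the array has known chunks
    arr = df.to_dask_array(lengths=True)
    print(arr.compute())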
10 votes · 2 answers

Applying a function along an axis of a dask array

I'm analyzing ocean temperature data from a climate model simulation where the 4D data arrays (time, depth, latitude, longitude; denoted dask_array below) typically have a shape of (6000, 31, 189, 192) and a size of ~25GB (hence my desire to use…
Damien Irving
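
A hedged sketch using dask.array.apply_along_axis, which mirrors the NumPy function of the same name (the shape and the reduction below are scaled-down stand-ins for the question's data):

    import numpy as np
    import dask.array as da

    # Scaled-down stand-in for the (time, depth, lat, lon) array in the question
    dask_array = da.random.random((600, 31, 19, 19), chunks=(100, 31, 19, 19))

    # Apply a 1-D reduction along the time axis (axis=0) of every profile
    trend = da.apply_along_axis(np.ptp, 0, dask_array)
    print(trend.compute().shape)   # (31, 19, 19)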
9 votes · 1 answer

Why is performance so much better with zarr than parquet when using dask?

When I run essentially the same calculations with dask against zarr data and parquet data, the zarr-based calculations are significantly faster. Why? Is it maybe because I did something wrong when I created the parquet files? I've replicated the…
Christine
9 votes · 1 answer

Dask fails with freeze_support bug

I'm trying to run a very simple Dask program like the following: # myfile.py from dask.distributed import Client client = Client() But when I run this program, I get this odd error: An attempt has been made to start a new process before the …
MRocklin
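
This error usually comes from starting worker processes at import time; a minimal sketch of the customary fix, guarding cluster startup with a __main__ check:

    # myfile.py
    from dask.distributed import Client

    if __name__ == "__main__":
        # Client() spawns worker processes; keeping it inside this guard stops
        # the spawned children from re-running the startup when they
        # re-import this module.
        client = Client()
        print(client)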
9 votes · 2 answers

distributed.worker Memory use is high but worker has no data to store to disk

distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.91 GB -- Worker memory limit: 2.00 GB distributed.worker - WARNING - Worker is at 41% memory…
AHassett
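
For context, the per-worker limit quoted in the warning is set when the cluster is created; a hedged local-cluster sketch with illustrative numbers:

    from dask.distributed import Client, LocalCluster

    if __name__ == "__main__":
        # Give each worker process an explicit memory budget; dask spills data
        # to disk and eventually pauses or restarts workers near this limit.
        cluster = LocalCluster(n_workers=2, threads_per_worker=2,
                               memory_limit="4GB")
        client = Client(cluster)
        print(client)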
9 votes · 1 answer

How to parallelize groupby() in dask?

I tried: df.groupby('name').agg('count').compute(num_workers=1) df.groupby('name').agg('count').compute(num_workers=4) They take the same time; why does num_workers make no difference? Thanks
Robin1988
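
num_workers only has an effect on schedulers that actually use multiple workers; a hedged sketch contrasting the threaded default with the process-based scheduler (the toy data is illustrative):

    import pandas as pd
    import dask.dataframe as dd

    if __name__ == "__main__":
        pdf = pd.DataFrame({"name": list("abcd") * 250_000,
                            "x": range(1_000_000)})
        df = dd.from_pandas(pdf, npartitions=8)
        counts = df.groupby("name").agg("count")

        # Threaded scheduler (the dataframe default): extra workers may not
        # help for GIL-bound work, so timings can look identical.
        counts.compute(scheduler="threads", num_workers=4)

        # Process-based scheduler: num_workers sets the number of processes.
        counts.compute(scheduler="processes", num_workers=4)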
9 votes · 3 answers

Dask: Drop NAs on columns?

I have tried to apply a filter to remove columns with too many NAs from my dask dataframe: df.dropna(axis=1, how='all', thresh=round(len(df) * .8)) Unfortunately it seems that the dask dropna API is slightly different from that of pandas and does not…
Robert T. Tusk
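
Dask's dropna is row-oriented, so one hedged workaround is to measure per-column completeness first and then select columns (the 80% threshold mirrors the question):

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"a": [1.0, np.nan, np.nan, np.nan],
                        "b": [1.0, 2.0, 3.0, np.nan]})
    df = dd.from_pandas(pdf, npartitions=2)

    # Fraction of non-NA values per column, computed once
    non_na = df.notnull().mean().compute()

    # Keep columns that are at least 80% populated
    keep = non_na[non_na >= 0.8].index.tolist()
    print(df[keep].compute())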
9 votes · 1 answer

Repartition Dask DataFrame to get even partitions

I have a Dask DataFrame whose index (client_id) is not unique. Repartitioning and resetting the index ends up with very uneven partitions - some contain only a few rows, some thousands. For instance the following code: for p in…
Szymon
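
A hedged sketch of two common remedies: supplying explicit divisions when setting the index, or asking dask for size-based partitions (partition_size needs a reasonably recent release):

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"client_id": [1] * 5000 + [2] * 5 + [3] * 3000,
                        "value": range(8005)})
    df = dd.from_pandas(pdf, npartitions=4)

    # Option 1: pick the index divisions yourself so rows spread more evenly
    df = df.set_index("client_id", divisions=[1, 2, 3, 3])

    # Option 2: target an approximate partition size in bytes instead
    df = df.repartition(partition_size="10MB")

    print(df.map_partitions(len).compute())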
9 votes · 1 answer

xarray/dask - limiting the number of threads/cpus

I'm fairly new to xarray and I'm currently trying to leverage it to subset some NetCDFs. I'm running this on a shared server and would like to know how best to limit the processing power used by xarray so that it plays nicely with others. I've read…
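
One hedged way to cap CPU use is to attach a small local distributed cluster before opening the data; the worker counts and file name below are illustrative:

    import xarray as xr
    from dask.distributed import Client, LocalCluster

    if __name__ == "__main__":
        # Limit dask to 2 worker processes with 2 threads each (4 cores total)
        cluster = LocalCluster(n_workers=2, threads_per_worker=2)
        client = Client(cluster)

        # xarray's chunked operations run on whatever dask scheduler is active
        ds = xr.open_dataset("ocean.nc", chunks={"time": 100})  # hypothetical file
        subset = ds.isel(time=slice(0, 10)).load()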
9 votes · 1 answer

Is it possible to shutdown a dask.distributed cluster given a Client instance?

If I have a distributed.Client instance, can I use it to shut down the remote cluster, i.e. to kill all workers and also shut down the scheduler? If that can't be done using the Client instance, is there another way other than manually killing each…
Dave Hirschfeld
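
With current distributed releases, Client.shutdown asks the scheduler to close its workers and then itself; a minimal sketch (the address is hypothetical):

    from dask.distributed import Client

    if __name__ == "__main__":
        client = Client("tcp://scheduler-host:8786")  # hypothetical address

        # ... submit and gather work ...

        # Close all workers, the scheduler, and this client's connection
        client.shutdown()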
9 votes · 1 answer

Summarize categorical data in Dask DataFrame

By default, the describe method of a Dask DataFrame summarizes only numerical columns. According to the docs I should be able to get descriptions of categorical columns by providing the include parameter.…
grześ
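
If the installed version's describe(include=...) does not cooperate, a hedged fallback is to summarize the categorical columns directly:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"color": ["red", "blue", "red", "green"],
                        "size": [1, 2, 3, 4]})
    df = dd.from_pandas(pdf, npartitions=2)
    df = df.categorize(columns=["color"])

    # Frequency table for a categorical column, computed lazily
    print(df["color"].value_counts().compute())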
9 votes · 1 answer

Shutdown dask workers from client or scheduler

In the API, there is a way to restart all workers and to shut down the client completely, but I see no way to stop all workers while keeping the client unchanged. Is there a way to do this that I cannot find, or is it a feature that doesn't exist?
ffarquet
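
A hedged sketch using Client.retire_workers, which closes workers gracefully while leaving the scheduler and the client connection intact:

    from dask.distributed import Client

    if __name__ == "__main__":
        client = Client()  # local cluster purely for illustration

        # Retire every currently connected worker; the scheduler and this
        # client stay up, so new workers can join later.
        addresses = list(client.scheduler_info()["workers"])
        client.retire_workers(workers=addresses)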
9 votes · 1 answer

dask apply: AttributeError: 'DataFrame' object has no attribute 'name'

I have a dataframe of params and apply a function to each row. This function is essentially a couple of SQL queries plus simple calculations on the result. I am trying to leverage Dask's multiprocessing while keeping the structure and ~ interface. The…
Philipp_Kats
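
Errors in this family often disappear once apply is told what it returns via meta; a hedged row-wise example with a stand-in for the question's query function:

    import pandas as pd
    import dask.dataframe as dd

    def run_queries(row):
        # Stand-in for the per-row SQL queries and simple calculations
        return row["a"] * 2 + row["b"]

    pdf = pd.DataFrame({"a": range(6), "b": range(6)})
    df = dd.from_pandas(pdf, npartitions=2)

    # meta describes the output (a float Series named "result"), so dask does
    # not have to guess by calling the function on dummy data.
    result = df.apply(run_queries, axis=1, meta=("result", "f8"))
    print(result.compute())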
9 votes · 1 answer

Is it possible to wait until `.persist()` finishes caching in dask?

Since .persist() caches data in the background, I'm wondering whether it is possible to wait until it finishes caching and then do the following things. In addition, is there a way to have a progress bar for the caching process? Thank you very much
user3716774
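
With the distributed scheduler, wait() blocks until persisted results are in memory and progress() draws a bar for the background work; a minimal sketch:

    import dask.array as da
    from dask.distributed import Client, progress, wait

    if __name__ == "__main__":
        client = Client()

        x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
        x = x.persist()     # starts computing and caching in the background

        progress(x)         # progress bar for the persisting work
        wait(x)             # block until every chunk is held in memory

        print(x.sum().compute())  # runs against the cached chunks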
9 votes · 1 answer

Killed/MemoryError when creating a large dask.dataframe from delayed collection

I am trying to create a dask.dataframe from a bunch of large CSV files (currently 12 files, 8-10 million lines and 50 columns each). A few of them might fit together into my system memory but all of them at once definitely will not, hence the use of…
Dirk
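
One hedged alternative that never materializes a whole file at once is to let dask.dataframe read the CSVs itself in blocks; the glob pattern and blocksize are illustrative:

    import dask.dataframe as dd

    # Each ~64 MB block of each file becomes its own partition, so no single
    # CSV ever has to fit in memory in full.
    df = dd.read_csv("data/part-*.csv", blocksize="64MB")

    print(len(df))  # row count, computed block by block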