Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
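The relationship between the two components can be sketched with a toy task graph. This is a minimal pure-Python illustration, not Dask's implementation: the `graph` dictionary, the `inc` helper, and the `get` function are hypothetical names written for this example. The layout (keys mapped to either a value or a tuple of a function and its arguments) is modeled on the dictionary-based graph format Dask uses, but the evaluator here is single-threaded and only shows the idea of resolving dependencies before running a task.

```python
from operator import add, mul


def inc(x):
    """Hypothetical helper task: add one to its input."""
    return x + 1


# Toy task graph (illustrative only): each key maps to a literal value
# or to a tuple of (function, *arguments). String arguments that match
# a key in the graph are treated as dependencies on that key.
graph = {
    "a": 1,
    "b": (inc, "a"),
    "c": (mul, "b", 10),
    "total": (add, "b", "c"),
}


def get(graph, key, cache=None):
    """Evaluate `key`, computing every task it depends on at most once."""
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # Resolve dependencies first; pass other arguments through unchanged.
        resolved = [
            get(graph, arg, cache) if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        value = func(*resolved)
    else:
        value = task
    cache[key] = value
    return value


result = get(graph, "total")
print(result)
```

A real scheduler would additionally decide which independent tasks (here, none beyond the chain through `b`) can run in parallel and on which worker; the collections in the second bullet generate graphs of this general shape automatically.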

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3
votes
2 answers

Automatically adding a dataset to Dask scheduler on startup

TL;DR I want to pre-load a dataset into the Dask Distributed scheduler when it's starting up. Background I'm using Dask in a realtime query fashion with a smaller-than-memory dataset. Because it's realtime it's important that the workers can trust…
Niklas B
  • 1,839
  • 18
  • 36
3
votes
1 answer

Dask and fbprophet

I'm trying to use dask and the fbprophet library together and I'm either doing something wrong or having unexpected performance problems. import dask.dataframe as dd import datetime as dt import multiprocessing as mp import numpy as np import…
rpanai
  • 12,515
  • 2
  • 42
  • 64
3
votes
1 answer

Redshift to dask DataFrame

Does anyone have a nice neat and stable way to achieve the equivalent of: pandas.read_sql(sql, con, chunksize=None) and/or pandas.read_sql_table(table_name, con, schema=None, chunksize=None) connected to redshift with SQLAlchemy & psycopg2,…
Leonard Aukea
  • 402
  • 6
  • 12
3
votes
1 answer

Lazy repartitioning of dask dataframe

After several stages of lazy dataframe processing, I need to repartition my dataframe before saving it. However, the .repartition() method requires me to know the number of partitions (as opposed to size of partitions) and that depends on size of…
evilkonrex
  • 255
  • 2
  • 10
3
votes
0 answers

Reduce i/o by storing data into a dictionary shared between workers on node using dask.distributed

I am using the dask.distributed scheduler and workers to process some large microscopy images on a cluster. I run multiple workers per node (1 core = 1 worker). The cores in each node share 200 GB of RAM. Issue: I would like to decrease the writing…
3
votes
1 answer

Is there a dask api to get current number of tasks in dask cluster

I have come across an issue where the dask scheduler gets killed (though the workers keep running) with a memory error if a large number of tasks are submitted in a short period of time. If it's possible to get the current number of tasks on the cluster, then it's easy…
Santosh Kumar
  • 761
  • 5
  • 28
3
votes
1 answer

Apply function on coordinate pair along particular axis using multiple variables in Xarray

My xarray Dataset has three dimensions, x, y, t, and two variables, foo and bar. I would like to apply a function baz() to the time series t of every (x, y) coordinate pair. baz() will accept an array of foo-s and an array of bar-s for a given (x, y). I'm…
Conic
  • 998
  • 1
  • 11
  • 26
3
votes
0 answers

How to enable proper work stealing in dask.distributed when using task restrictions / worker resources?

Context I'm using dask.distributed to parallelise computations across machines. I therefore have dask-workers running on the different machines which connect to a dask-scheduler, to which I can then submit my custom graphs together with the…
malbert
  • 308
  • 1
  • 7
3
votes
2 answers

How do you drop infs from dask dataframe/series?

I have a dask Series from which I need to drop both infs and nans. .dropna() only drops the nans. In numpy/pandas, I would do something like result = result[np.isfinite(result)]. What's the recommended equivalent in dask-land? Indexing the dask…
Tim Morton
  • 240
  • 1
  • 3
  • 11
3
votes
1 answer

using dask distributed computing via jupyter notebook

I am seeing strange behavior from dask when using it from jupyter notebook. So I am initiating a local client and giving it a list of jobs to do. My real code is a bit complex so I am putting a simple example for you here: from dask.distributed…
3
votes
1 answer

How would I do a Spark explode in Dask?

I'm new to dask so bear with me. I have a JSON file where each row has the following schema: { 'id': 2, 'version': 7.3, 'participants': range(10) } participants is a nested field. input_file = 'data.json' df =…
louis_guitton
  • 5,105
  • 1
  • 31
  • 33
3
votes
1 answer

Storing dask collection to files/CSV asynchronously

I'm implementing various kinds of data processing pipelines using dask.distributed. Usually the original data is read from S3, and in the end the processed (large) collection is written to CSV on S3 as well. I can run the processing asynchronously…
evilkonrex
  • 255
  • 2
  • 10
3
votes
1 answer

Reshaping a dask.array in Fortran-contiguous order

I would like to ask whether there is a way to reshape a dask array in Fortran-contiguous (column-major) order, since the parallelized version of the np.reshape function is not supported yet (see here).
Ales
  • 495
  • 3
  • 11
3
votes
1 answer

Generating parquet files - differences between R and Python

We have generated a parquet file in Dask (Python) and with Drill (R, using the sergeant package). We have noticed a few issues: The Dask (i.e. fastparquet) output has _metadata and _common_metadata files, while the parquet file in R \…
skibee
  • 1,279
  • 1
  • 17
  • 37
3
votes
0 answers

Computing in-place with dask

Short version I have a dask array whose graph is ultimately based on a bunch of numpy arrays at the bottom, and which applies elementwise operations to them. Is it safe to use da.store to compute the array and store the results back into the…
Bruce Merry
  • 751
  • 3
  • 11