Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
“Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
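To make the first component more concrete, here is a minimal standard-library sketch of what "task scheduling" means: a graph of named tasks is run in dependency order, and tasks with no unmet dependencies run in parallel. This is not Dask's scheduler or API. The `graph` layout is only loosely modeled on a dict of `(function, *arguments)` tuples, and `run`, `inc`, and `add` are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # Python 3.9+


def inc(x):
    return x + 1


def add(x, y):
    return x + y


# Hypothetical task graph: key -> (function, *arguments).
# A string argument that matches a key refers to that task's result.
graph = {
    "a": (inc, 1),
    "b": (inc, 2),
    "c": (add, "a", "b"),
}


def run(graph):
    """Execute every task in the graph, respecting dependencies."""

    def is_key(arg):
        return isinstance(arg, str) and arg in graph

    # Map each task to the set of tasks it depends on.
    deps = {key: {a for a in task[1:] if is_key(a)} for key, task in graph.items()}
    sorter = TopologicalSorter(deps)
    sorter.prepare()

    results = {}
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            # Submit all tasks whose dependencies are already computed.
            futures = {}
            for key in sorter.get_ready():
                func, *args = graph[key]
                resolved = [results[a] if is_key(a) else a for a in args]
                futures[key] = pool.submit(func, *resolved)
            # Collect results and unlock dependent tasks.
            for key, future in futures.items():
                results[key] = future.result()
                sorter.done(key)
    return results


results = run(graph)
print(results["c"])
```

Here `"a"` and `"b"` are independent and can run concurrently, while `"c"` waits for both. The parallel collections described above are built on top of graphs of this general shape.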

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions

2 answers

Automatically adding a dataset to Dask scheduler on startup

TL;DR: I want to pre-load a dataset into the Dask Distributed scheduler when it starts up. Background: I'm using Dask in a realtime query fashion with a smaller-than-memory dataset. Because it's realtime, it's important that the workers can trust…

python dask dask-distributed

asked Sep 28 '17 at 13:37

Niklas B

1 answer

Dask and fbprophet

I'm trying to use dask and the fbprophet library together and I'm either doing something wrong or having unexpected performance problems. import dask.dataframe as dd import datetime as dt import multiprocessing as mp import numpy as np import…

python pandas dask facebook-prophet

asked Sep 27 '17 at 20:11

rpanai

1 answer

Redshift to dask DataFrame

Does anyone have a nice neat and stable way to achieve the equivalent of: pandas.read_sql(sql, con, chunksize=None) and/or pandas.read_sql_table(table_name, con, schema=None, chunksize=None) connected to redshift with SQLAlchemy & psycopg2,…

amazon-redshift dask

asked Sep 27 '17 at 07:42

Leonard Aukea

1 answer

Lazy repartitioning of dask dataframe

After several stages of lazy dataframe processing, I need to repartition my dataframe before saving it. However, the .repartition() method requires me to know the number of partitions (as opposed to the size of partitions), and that depends on the size of…

dask dask-distributed

asked Sep 22 '17 at 08:44

evilkonrex
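The relationship between partition size and partition count mentioned in the question above is simple arithmetic. The sketch below is a hypothetical helper (`npartitions_for` is not a Dask function) that converts a target partition size into a partition count. It assumes the total size of the data is already known or estimated, so it does not by itself remove the dependency on the intermediate result that the asker describes.

```python
import math


def npartitions_for(total_bytes, target_partition_bytes):
    """Return how many partitions are needed so that each one is at most
    target_partition_bytes in size (always at least one partition)."""
    if total_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be non-negative and the target must be positive")
    return max(1, math.ceil(total_bytes / target_partition_bytes))


# Example: roughly 10 GiB of data split into partitions of about 100 MiB.
n = npartitions_for(10 * 1024**3, 100 * 1024**2)
print(n)
```

The resulting count could then be passed to a repartitioning call that takes a number of partitions.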

0 answers

Reduce i/o by storing data into a dictionary shared between workers on node using dask.distributed

I am using the dask.distributed scheduler and workers to process some large microscopy images on a cluster. I run multiple workers per node (1 core = 1 worker). The cores in each node share 200 GB of RAM. Issue: I would like to decrease the writing…

python python-3.x parallel-processing dask dask-distributed

asked Sep 20 '17 at 17:54

s1mc0d3

1 answer

Is there a dask api to get current number of tasks in dask cluster

I have come across an issue where the dask scheduler gets killed (though the workers keep running) with a memory error if a large number of tasks are submitted in a short period of time. If it's possible to get the current number of tasks on the cluster, then it's easy…

dask dask-distributed

asked Sep 16 '17 at 22:01

Santosh Kumar
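The excerpt above is truncated, so the following assumes the goal is to throttle submission so that only a limited number of tasks are outstanding at once. It is a generic sketch using the standard library's `concurrent.futures`, not a Dask API for counting tasks. `submit_bounded` and `max_in_flight` are hypothetical names, and how well the pattern carries over to a Dask cluster is an assumption rather than something the question confirms.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def submit_bounded(executor, func, items, max_in_flight=4):
    """Submit func(item) for each item, keeping at most max_in_flight
    futures outstanding. Results are returned in completion order."""
    pending = set()
    results = []
    for item in items:
        if len(pending) >= max_in_flight:
            # Block until at least one outstanding task finishes.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            results.extend(f.result() for f in done)
        pending.add(executor.submit(func, item))
    # Drain whatever is still running.
    done, _ = wait(pending)
    results.extend(f.result() for f in done)
    return results


with ThreadPoolExecutor(max_workers=2) as pool:
    squares = sorted(submit_bounded(pool, lambda x: x * x, range(10)))
print(squares)
```

Because the caller tracks its own outstanding futures, this approach does not need to ask the scheduler how many tasks it currently holds.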

1 answer

Apply function on coordinate pair along particular axis using multiple variables in Xarray

My xarray Dataset has three dimensions, x, y, t, and 2 variables, foo and bar. I would like to apply function baz() on every x, y coordinate pair's time series t. baz() will accept an array of foo-s and an array of bar-s for a given (x, y). I'm…

python-3.x numpy dask python-xarray

asked Sep 13 '17 at 17:39

Conic

0 answers

How to enable proper work stealing in dask.distributed when using task restrictions / worker resources?

Context: I'm using dask.distributed to parallelise computations across machines. I therefore have dask-workers running on the different machines which connect to a dask-scheduler, to which I can then submit my custom graphs together with the…

dask dask-distributed

asked Sep 12 '17 at 15:54

malbert

2 answers

How do you drop infs from dask dataframe/series?

I have a dask Series from which I need to drop both infs and nans. .dropna() only drops the nans. In numpy/pandas, I would do something like result = result[np.isfinite(result)]. What's the recommended equivalent in dask-land? Indexing the dask…

dask

asked Sep 12 '17 at 14:48

Tim Morton

1 answer

using dask distributed computing via jupyter notebook

I am seeing strange behavior from dask when using it from a Jupyter notebook. I am initiating a local client and giving it a list of jobs to do. My real code is a bit complex, so here is a simple example: from dask.distributed…

python-3.x dask dask-distributed

asked Sep 11 '17 at 19:27

Samaneh Navabpour

1 answer

How would I do a Spark explode in Dask?

I'm new to dask so bear with me. I have a JSON file where each row has the following schema: { 'id': 2, 'version': 7.3, 'participants': range(10) } participants is a nested field. input_file = 'data.json' df =…

python json pyspark dask

asked Sep 11 '17 at 18:05

louis_guitton

1 answer

Storing dask collection to files/CSV asynchronously

I'm implementing various kinds of data processing pipelines using dask.distributed. Usually the original data is read from S3, and in the end the processed (large) collection is written to CSV on S3 as well. I can run the processing asynchronously…

dask dask-distributed

asked Aug 24 '17 at 08:41

evilkonrex

1 answer

Reshaping a dask.array in Fortran-contiguous order

I would like to ask if there is a way to reshape a dask array in Fortran-contiguous (column-major) order, since the parallelized version of the np.reshape function is not supported yet (see here).

python arrays numpy reshape dask

asked Aug 03 '17 at 08:51

Ales

1 answer

Generating parquet files - differences between R and Python

We have generated a parquet file in Dask (Python) and with Drill (R, using the sergeant package). We have noticed a few issues: The Dask (i.e. fastparquet) output has _metadata and _common_metadata files, while the parquet file in R…

r parquet dask apache-drill fastparquet

asked Jul 31 '17 at 12:21

skibee

0 answers

Computing in-place with dask

Short version: I have a dask array whose graph is ultimately based on a bunch of numpy arrays at the bottom, and which applies elementwise operations to them. Is it safe to use da.store to compute the array and store the results back into the…

dask

asked Jul 27 '17 at 15:37

Bruce Merry
