Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and scikit-learn.

Dask is composed of two components:

Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
“Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
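The two components above can be sketched in a few lines. This is a minimal illustration rather than a canonical example: the helper functions `inc` and `add` and the array shape and chunk sizes are made up for the demonstration, and the code assumes `dask` and `numpy` are installed.

```python
import dask
import dask.array as da


# 1) Dynamic task scheduling: dask.delayed records calls lazily as a task graph.
#    `inc` and `add` are hypothetical helpers used only for this sketch.
@dask.delayed
def inc(x):
    return x + 1


@dask.delayed
def add(x, y):
    return x + y


total = add(inc(1), inc(2))  # nothing has run yet; this is a graph of three tasks
result = total.compute()     # the scheduler executes the graph

# 2) A "big data" collection: a chunked array with a NumPy-like interface.
x = da.ones((1000, 1000), chunks=(250, 250))  # split into 16 blocks of 250 x 250
array_sum = x.sum().compute()                 # per-block sums are combined by the scheduler

print(result, array_sum)
```

Both parts go through the same scheduler: the array operations are translated into a task graph just like the explicit `delayed` calls.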

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions

votes

1 answer

How to impute column values on Dask Dataframe?

I would like to impute negative values in a Dask DataFrame. With pandas I use this code: df.loc[(df.column_name < 0), 'column_name'] = 0

pandas dataframe dask

asked Mar 25 '18 at 15:16

ambigus9

1,417
3
19
37

votes

0 answers

Dask assigning to dataframe column by index throws ValueError

I have a pipeline of transformations on a grouped-by dataframe. All functions get a DataframeGroupBy and compute some features. Those features are then stored in a Dataframe. The index of the dataframe is the same since all features are derived by…

python pandas dataframe dask dask-distributed

asked Mar 22 '18 at 09:04

Apostolos

7,763
17
80
150

votes

2 answers

How to efficiently parallelize time series forecasting using dask?

I'm trying to parallelize time series forecasting in python using dask. The format of the data is that each time series is a column and they have a common index of monthly dates. I have a custom forecasting function that returns a time series object…

python parallel-processing time-series forecasting dask

asked Mar 21 '18 at 21:41

Davis

votes

1 answer

Why am I getting dask warnings when running a pandas operation?

I have a notebook with both pandas and dask operations. When I have not started the client, everything is as expected. But once I start the dask.distributed client, I get warnings in cells where I'm running pandas operations e.g.…

dask dask-distributed

asked Mar 10 '18 at 07:08

birdsarah

1,165
8
20

votes

2 answers

Slicing out a few rows from a `dask.DataFrame`

Often, when working with a large dask.DataFrame, it would be useful to grab only a few rows on which to test all subsequent operations. Currently, according to Slicing a Dask Dataframe, this is unsupported. I was hoping to then use head to…

dask

asked Mar 06 '18 at 20:23

Stefan van der Walt

7,165
1
32
41

votes

1 answer

Dask: Groupby and 'First'/ 'Last' in agg

I want to groupby a single column, and then use agg with mean for a couple of columns, but just select first or last for the remaining columns. This is possible in pandas, but isn't currently supported in Dask. How to do this? Thanks. aggs = {'B':…

python pandas-groupby dask

asked Feb 24 '18 at 09:26

morganics

1,209
13
27

votes

0 answers

How to use xarray/dask/pandas/deepgraph for parallel pairwise correlation matrix in Python 3?

I'm trying to follow the tutorial on xarray's documentation: http://xarray.pydata.org/en/stable/dask.html#automatic-parallelization My ultimate goal is to get a pairwise spearman correlation matrix from a dataset that has ~100,000 attributes which…

python parallel-processing correlation dask pairwise

asked Feb 15 '18 at 19:23

O.rka

29,847
68
194
309

votes

1 answer

How to write a Dask dataframe containing a column of arrays to a parquet file

I have a Dask dataframe, one column of which contains a numpy array of floats: import dask.dataframe as dd import pandas as pd import numpy as np df = dd.from_pandas( pd.DataFrame( { 'id':range(1, 6), …

python dask fastparquet

asked Feb 14 '18 at 19:05

junichiro

5,282
3
18
26

votes

1 answer

parallel dask for loop slower than regular loop?

If I try to parallelize a for loop with dask, it ends up executing slower than the regular version. Basically, I just follow the introductory example from the dask tutorial, but for some reason it's failing on my end. What am I doing wrong? In [1]:…

python numpy parallel-processing dask

asked Feb 12 '18 at 15:23

mistakeNot

votes

1 answer

dask: how to groupby, aggregate without losing column used for groupby

How does one get a SQL-style grouped output when grouping the following data: item frequency A 5 A 9 B 2 B 4 C 6 df.groupby(by = ["item"]).sum() results in this: item frequency A 14 B …

python group-by dask

asked Feb 11 '18 at 14:42

Omley

votes

1 answer

Dask Dataframe groupby has no len()

If you have a groupby object based on a dask dataframe why does len() return an error? (bug or feature)

python dataframe dask

asked Feb 10 '18 at 23:24

Back2Basics

7,406
2
32
45

votes

2 answers

Override dask scheduler to concurrently load data on multiple workers

I want to run graphs/futures on my distributed cluster which all have a 'load data' root task and then a bunch of training tasks that run on that data. A simplified version would look like this: from dask.distributed import Client client =…

dask dask-distributed

asked Jan 17 '18 at 10:54

user8871302

votes

1 answer

Memory leak and/or data persistence in dask distributed

I'm using dask/distributed to submit 100+ evaluations of a function to the multi-node cluster. Each eval is very costly, about 90 sec of CPU time. I've noticed though that there seems to be a memory leak and all workers over time grow in size,…

multiprocessing distributed dask

asked Oct 31 '17 at 17:10

marioba

votes

1 answer

dask set_index from large unordered csv file

At the risk of being a bit off-topic, I want to show a simple solution for loading large csv files into a dask dataframe where the option sorted=True can be applied, saving significant processing time. I found the option of doing set_index…

python csv sorting indexing dask

asked Oct 27 '17 at 08:59

Julian C

votes

2 answers

Progress reporting on dask's set_index

I am trying to wrap a progress indicator around the entire script. However, set_index(..., compute=False) does still run tasks on the scheduler, observable in the web interface. How do I report on the progress of the set_index step? import…

dask dask-distributed

asked Oct 25 '17 at 06:06

kadrach
