Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
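As a rough illustration of the first component, the sketch below runs a small task graph in plain Python. It is not Dask's scheduler and does not import Dask; the dict-of-tuples layout is only loosely modelled on the graph format Dask documents, and `run_graph` is a hypothetical helper name chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from operator import add, mul


def run_graph(graph, target, max_workers=4):
    """Evaluate `target` from a dict graph (illustrative sketch, not Dask's API).

    Values that are tuples are tasks of the form (callable, *args);
    any other value is treated as a literal. A string argument that
    names a graph key is replaced by that key's computed result.
    """
    results = {k: v for k, v in graph.items() if not isinstance(v, tuple)}
    pending = {k: v for k, v in graph.items() if isinstance(v, tuple)}

    def is_key(arg):
        return isinstance(arg, str) and arg in graph

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while target not in results:
            # A task is ready once all of its key arguments are computed.
            ready = [
                key
                for key, (func, *args) in pending.items()
                if all(a in results for a in args if is_key(a))
            ]
            if not ready:
                raise ValueError("cycle or missing dependency in graph")
            # Submit every ready task at once so independent tasks overlap.
            futures = {}
            for key in ready:
                func, *args = pending.pop(key)
                resolved = [results[a] if is_key(a) else a for a in args]
                futures[key] = pool.submit(func, *resolved)
            for key, future in futures.items():
                results[key] = future.result()
    return results[target]


graph = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),   # z = x + y
    "w": (mul, "z", 10),    # w = z * 10
}
print(run_graph(graph, "w"))
```

The parallel collections in the second component can be thought of as generating much larger graphs of this kind automatically, one task per chunk of an array or partition of a dataframe.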

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
5 votes, 1 answer

How to impute column values on Dask Dataframe?

I would like to impute negative values in a Dask DataFrame. With pandas I use this code: df.loc[(df.column_name < 0), 'column_name'] = 0
ambigus9
5 votes, 0 answers

Dask assigning to dataframe column by index throws ValueError

I have a pipeline of transformations on a grouped dataframe. All functions get a DataFrameGroupBy and compute some features. Those features are then stored in a DataFrame. The index of the dataframe is the same since all features are derived by…
Apostolos
5 votes, 2 answers

How to efficiently parallelize time series forecasting using dask?

I'm trying to parallelize time series forecasting in python using dask. The format of the data is that each time series is a column and they have a common index of monthly dates. I have a custom forecasting function that returns a time series object…
Davis
5 votes, 1 answer

Why am I getting dask warnings when running a pandas operation?

I have a notebook with both pandas and dask operations. When I have not started the client, everything is as expected. But once I start the dask.distributed client, I get warnings in cells where I'm running pandas operations e.g.…
birdsarah
5 votes, 2 answers

Slicing out a few rows from a `dask.DataFrame`

Often, when working with a large dask.DataFrame, it would be useful to grab only a few rows on which to test all subsequent operations. Currently, according to Slicing a Dask Dataframe, this is unsupported. I was hoping to then use head to…
Stefan van der Walt
5 votes, 1 answer

Dask: Groupby and 'First'/ 'Last' in agg

I want to groupby a single column, and then use agg with mean for a couple of columns, but just select first or last for the remaining columns. This is possible in pandas, but isn't currently supported in Dask. How to do this? Thanks. aggs = {'B':…
morganics
5 votes, 0 answers

How to use xarray/dask/pandas/deepgraph for parallel pairwise correlation matrix in Python 3?

I'm trying to follow the tutorial on xarray's documentation: http://xarray.pydata.org/en/stable/dask.html#automatic-parallelization My ultimate goal is to get a pairwise spearman correlation matrix from a dataset that has ~100,000 attributes which…
O.rka
5 votes, 1 answer

How to write a Dask dataframe containing a column of arrays to a parquet file

I have a Dask dataframe, one column of which contains a numpy array of floats: import dask.dataframe as dd import pandas as pd import numpy as np df = dd.from_pandas( pd.DataFrame( { 'id':range(1, 6), …
junichiro
5 votes, 1 answer

parallel dask for loop slower than regular loop?

If I try to parallelize a for loop with dask, it ends up executing slower than the regular version. Basically, I just follow the introductory example from the dask tutorial, but for some reason it's failing on my end. What am I doing wrong? In [1]:…
mistakeNot
5 votes, 1 answer

dask: how to groupby, aggregate without losing column used for groupby

How does one get SQL-style grouped output when grouping the following data: item frequency A 5 A 9 B 2 B 4 C 6 df.groupby(by = ["item"]).sum() results in this: item frequency A 14 B …
Omley
5 votes, 1 answer

Dask Dataframe groupby has no len()

If you have a groupby object based on a Dask dataframe, why does len() raise an error? Is this a bug or a feature?
Back2Basics
5 votes, 2 answers

Override dask scheduler to concurrently load data on multiple workers

I want to run graphs/futures on my distributed cluster which all have a 'load data' root task and then a bunch of training tasks that run on that data. A simplified version would look like this: from dask.distributed import Client client =…
user8871302
5 votes, 1 answer

Memory leak and/or data persistence in dask distributed

I'm using dask/distributed to submit 100+ evaluations of a function to the multi-node cluster. Each eval is very costly, about 90 sec of CPU time. I've noticed though that there seems to be a memory leak and all workers over time grow in size,…
marioba
5 votes, 1 answer

dask set_index from large unordered csv file

At the risk of being a bit off-topic, I want to show a simple solution for loading large CSV files into a Dask dataframe where the option sorted=True can be applied, saving significant processing time. I found the option of doing set_index…
Julian C
5 votes, 2 answers

Progress reporting on dask's set_index

I am trying to wrap a progress indicator around the entire script. However, set_index(..., compute=False) does still run tasks on the scheduler, observable in the web interface. How do I report on the progress of the set_index step? import…
kadrach