Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components (a short sketch follows the list):

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
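
A minimal sketch of the two pieces working together, using dask.array (shapes and chunk sizes here are arbitrary):

    import dask.array as da

    # the collection: a large array split into many smaller chunks
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

    # operations only build a task graph; nothing has run yet
    y = (x + x.T).mean(axis=0)

    # the dynamic scheduler executes the graph in parallel on compute()
    result = y.compute()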

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
14 votes, 1 answer

How to concat multiple pandas dataframes into one dask dataframe larger than memory?

I am parsing tab-delimited data to create tabular data, which I would like to store in an HDF5 file. My problem is that I have to aggregate the data into one format and then dump it into HDF5. This is ~1 TB of data, so naturally I cannot fit it into RAM…
ShanZhengYang • 16,511 • 49 • 132 • 234
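
One sketch of the usual pattern (parse_file and the paths are placeholders): wrap each parsing step in dask.delayed, assemble the parts with dd.from_delayed, and let to_hdf write partition by partition so the full dataset never has to fit in RAM:

    import dask
    import dask.dataframe as dd
    import pandas as pd

    @dask.delayed
    def parse_file(path):
        # placeholder: parse one tab-delimited file into a pandas DataFrame
        return pd.read_csv(path, sep="\t")

    parts = [parse_file(p) for p in ["a.tsv", "b.tsv"]]  # hypothetical paths
    ddf = dd.from_delayed(parts)

    # written partition by partition; '*' spreads output across files
    ddf.to_hdf("out-*.h5", "/data")
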
13 votes, 2 answers

Writing xarray multiindex data in chunks

I am trying to efficiently restructure a large multidimensional dataset. Let's assume I have a number of remotely sensed images over time with a number of bands, with coordinates x y for pixel location, time for time of image acquisition, and band for…
mmann1123 • 5,031 • 7 • 41 • 49
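
A hedged sketch, assuming the data is held in an xarray Dataset: open it with Dask chunks and write chunk by chunk to Zarr; a pandas MultiIndex usually has to be flattened with reset_index before it can be serialized (file names and the dimension name are placeholders):

    import xarray as xr

    # hypothetical input; chunks=... backs every variable with dask arrays
    ds = xr.open_dataset("images.nc", chunks={"time": 1})

    # flatten any MultiIndex coordinate first (hypothetical dimension name)
    # ds = ds.reset_index("stacked")

    # each dask chunk is written to the store independently
    ds.to_zarr("restructured.zarr")
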
13 votes, 3 answers

Difference between dask.distributed LocalCluster with threads vs. processes

What is the difference between the following LocalCluster configurations for dask.distributed: Client(n_workers=4, processes=False, threads_per_worker=1) versus Client(n_workers=1, processes=True, threads_per_worker=4)? They both have four threads…
jrinker • 2,010 • 2 • 14 • 17
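
In short: with processes=False all workers live in the client process as threads, so data is shared without serialization but pure-Python code contends on the GIL; with processes=True workers run in separate processes, which avoids that contention between workers at the cost of serializing data moved between them. A sketch of the two setups from the question:

    from dask.distributed import Client

    # four workers as threads in the client process: memory is shared,
    # but pure-Python code contends on the GIL
    threaded = Client(n_workers=4, processes=False, threads_per_worker=1)
    threaded.close()

    # one separate worker process holding four threads: communication
    # with the client is serialized, and work is isolated from the client
    procs = Client(n_workers=1, processes=True, threads_per_worker=4)
    procs.close()
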
13 votes, 1 answer

Dask: delayed vs futures and task graph generation

I have a few basic questions on Dask: Is it correct that I have to use Futures when I want to use dask for distributed computations (i.e. on a cluster)? In that case, i.e. when working with futures, are task graphs still the way to reason about…
clog14 • 1,549 • 1 • 16 • 32
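
Short answer as a sketch: both delayed and futures feed the same schedulers with task graphs; delayed builds the graph lazily and runs it on compute(), while client.submit returns futures that begin executing immediately:

    from dask import delayed
    from dask.distributed import Client

    client = Client()  # local cluster here; the same code works remotely

    def inc(x):
        return x + 1

    # delayed is lazy: a graph is built, and runs only on compute()
    lazy = delayed(inc)(delayed(inc)(1))
    print(lazy.compute())   # 3

    # futures are eager: execution starts as soon as submit returns
    fut = client.submit(inc, client.submit(inc, 1))
    print(fut.result())     # 3
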
13 votes, 1 answer

How to read parquet file from s3 using dask with specific AWS profile

How do I read a Parquet file on S3 using Dask and a specific AWS profile (stored in a credentials file)? Dask uses s3fs, which uses boto. This is what I have tried: >>> import os >>> import s3fs >>> import boto3 >>> import dask.dataframe as…
muon • 12,821 • 11 • 69 • 88
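
One sketch, assuming the profile is defined in ~/.aws/credentials: pass it through storage_options, which Dask forwards to s3fs (bucket and path are placeholders; older s3fs releases spelled the keyword profile_name):

    import dask.dataframe as dd

    df = dd.read_parquet(
        "s3://my-bucket/path/data.parquet",          # hypothetical location
        storage_options={"profile": "my-profile"},   # handed through to s3fs
    )
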
13 votes, 4 answers

Shuffling data in dask

This is a follow-on question from Subsetting Dask DataFrames. I wish to shuffle data from a dask dataframe before sending it in batches to an ML algorithm. The answer in that question was to do the following: for part in…
sachinruk • 9,571 • 12 • 55 • 86
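
One common approach, as a hedged sketch: attach a random key to each row, then let set_index perform a full shuffle across partitions by sorting on it:

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"x": range(10)})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # random sort key per row; set_index then reshuffles every partition
    keyed = ddf.map_partitions(
        lambda d: d.assign(_rand=np.random.rand(len(d)))
    )
    shuffled = keyed.set_index("_rand")
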
13 votes, 1 answer

Understanding memory behavior of Dask distributed

Similar to this question, I'm running into memory issues with Dask distributed. However, in my case the explanation is not that the client is trying to collect a large amount of data. The problem can be illustrated based on a very simple task graph:…
bluenote10 • 23,414 • 14 • 122 • 178
13 votes, 1 answer

How to use pandas.cut() (or equivalent) in dask efficiently?

Is there an equivalent to pandas.cut() in Dask? I am trying to bin and group a large dataset in Python. It is a list of measured electrons with the properties (positionX, positionY, energy, time). I need to group it along positionX, positionY and do…
Y. Ac • 133 • 1 • 5
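
Dask has no direct pandas.cut equivalent, but binning is embarrassingly parallel when the edges are fixed up front, so one sketch applies pd.cut per partition with map_partitions (the column name and edges are placeholders):

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"positionX": np.random.rand(100)})
    ddf = dd.from_pandas(pdf, npartitions=4)

    # fixed edges keep the bins consistent across all partitions
    bins = np.linspace(0.0, 1.0, 11)  # hypothetical edges
    ddf["xbin"] = ddf["positionX"].map_partitions(pd.cut, bins)

    counts = ddf.groupby("xbin").size().compute()
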
13 votes, 4 answers

Dask "no module named xxxx" error

Using dask.distributed, I try to submit a function that is located in another file named worker.py. On the workers I get the following error: No module named 'worker'. However, I'm unable to figure out what I'm doing wrong here... Here is a sample of…
Bertrand • 994 • 9 • 23
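
The usual cause is that worker.py exists on the client machine but is not importable on the workers. One sketch of a fix, shipping the module to every worker (the scheduler address and function name are placeholders):

    from dask.distributed import Client

    client = Client("scheduler-address:8786")  # hypothetical address

    # copy worker.py to every worker so it can be imported there
    client.upload_file("worker.py")

    from worker import my_function  # hypothetical function name
    future = client.submit(my_function, 42)
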
13 votes, 2 answers

iterate over GroupBy object in dask

Is it possible to iterate over a dask GroupBy object to get access to the underlying dataframes? I tried: import dask.dataframe as dd import pandas as pd pdf = pd.DataFrame({'A':[1,2,3,4,5], 'B':['1','1','a','a','a']}) ddf = dd.from_pandas(pdf,…
Arco Bast • 3,595 • 2 • 26 • 53
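
Dask's GroupBy does not support iteration the way pandas does; a common workaround (a sketch) is to compute the distinct keys first and then select each group explicitly:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": ["1", "1", "a", "a", "a"]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # materialize the distinct keys, then pull one group at a time
    for key in ddf["B"].unique().compute():
        group = ddf[ddf["B"] == key]   # still a lazy dask DataFrame
        print(key, len(group))
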
13 votes, 2 answers

how to throttle a large number of tasks without using all workers

Imagine I have a dask grid with 10 workers and 40 cores total. This is a shared grid, so I don't want to fully saturate it with my work. I have 1000 tasks to do, and I want to submit (and have actively running) a maximum of 20 tasks at a time. To be…
Jeff • 125,376 • 21 • 220 • 187
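
A standard pattern (a sketch; the scheduler address and task function are placeholders) keeps a fixed window of futures in flight with as_completed, submitting a replacement each time one finishes:

    from dask.distributed import Client, as_completed

    client = Client("scheduler-address:8786")  # hypothetical address

    def work(i):
        return i * i

    tasks = iter(range(1000))

    # seed a window of 20 in-flight tasks
    window = as_completed([client.submit(work, next(tasks)) for _ in range(20)])

    for finished in window:
        finished.result()
        try:
            # replace each finished task to hold the window at 20
            window.add(client.submit(work, next(tasks)))
        except StopIteration:
            pass
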
12 votes, 3 answers

Applying Python function to Pandas grouped DataFrame - what's the most efficient approach to speed up the computations?

I'm dealing with quite a large pandas DataFrame; my dataset resembles the following df setup: import pandas as pd import numpy as np #--------------------------------------------- SIZING PARAMETERS : R1 = 20 # .repeat(…
Kuba_ • 886 • 6 • 22
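
If the per-group function can't be vectorized, one sketch is to parallelize the groupby-apply with Dask, supplying meta so the output schema is known without running the function (the function and columns here are placeholders):

    import pandas as pd
    import dask.dataframe as dd

    def func(g: pd.DataFrame) -> pd.DataFrame:
        # placeholder per-group computation
        return g.assign(z=g["y"] - g["y"].mean())

    pdf = pd.DataFrame({"key": [1, 1, 2, 2], "y": [1.0, 2.0, 3.0, 4.0]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    out = (
        ddf.groupby("key")
           .apply(func, meta={"key": "i8", "y": "f8", "z": "f8"})
           .compute()
    )
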
12 votes, 2 answers

Concatenating a dask dataframe and a pandas dataframe

I have a dask dataframe (df) with around 250 million rows (from a 10Gb CSV file). I have another pandas dataframe (ndf) of 25,000 rows. I would like to add the first column of the pandas dataframe to the dask dataframe by repeating every item 10,000…
najeem • 1,841 • 13 • 29
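
For a straightforward row-wise concatenation, a sketch is to promote the pandas frame to Dask first; the column-wise, repeated alignment described in the question generally needs matching divisions or an explicit join key instead:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({"a": range(6)}), npartitions=3)
    ndf = pd.DataFrame({"a": range(3)})

    # promote the pandas frame, then concatenate along the rows
    combined = dd.concat(
        [ddf, dd.from_pandas(ndf, npartitions=1)],
        interleave_partitions=True,  # the index ranges overlap here
    )
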
12 votes, 2 answers

Dask item assignment. Cannot use loc for item assignment

I have a folder of parquet files that I can't fit in memory, so I am using dask to perform the data cleansing operations. I have a function where I want to perform item assignment, but I can't seem to find any solutions online that qualify as…
Matt Elgazar • 707 • 1 • 8 • 21
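
Dask dataframes don't support .loc item assignment directly; one sketch pushes ordinary pandas assignment into each partition via map_partitions (the column name and condition are placeholders):

    import pandas as pd
    import dask.dataframe as dd

    def clean(pdf: pd.DataFrame) -> pd.DataFrame:
        pdf = pdf.copy()
        # ordinary pandas item assignment, applied per partition
        pdf.loc[pdf["x"] < 0, "x"] = 0
        return pdf

    ddf = dd.from_pandas(pd.DataFrame({"x": [-1, 2, -3]}), npartitions=2)
    ddf = ddf.map_partitions(clean)

For simple conditional replacement, Dask also implements the pandas mask/where methods on dataframes and series.
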
12 votes, 2 answers

Create sql table from dask dataframe using map_partitions and pd.df.to_sql

Dask doesn't have a df.to_sql() like pandas, so I am trying to replicate the functionality and create a SQL table using the map_partitions method. Here is my code: import dask.dataframe as dd import pandas as pd import sqlalchemy_utils…
Ludo • 2,307 • 2 • 27 • 58
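
A sketch of the partition-wise write (connection string and table name are placeholders). The engine must be created inside the task, since database connections can't be serialized to workers; the to_delayed route below sidesteps map_partitions' meta bookkeeping, and newer Dask releases also ship a built-in DataFrame.to_sql:

    import pandas as pd
    import dask
    import dask.dataframe as dd
    from sqlalchemy import create_engine

    URI = "sqlite:///out.db"  # hypothetical connection string

    def write_partition(pdf: pd.DataFrame) -> None:
        # create the engine inside the task: connections don't serialize
        engine = create_engine(URI)
        pdf.to_sql("mytable", engine, if_exists="append", index=False)

    ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)

    # one delayed write per partition, executed in parallel
    dask.compute(*[dask.delayed(write_partition)(p) for p in ddf.to_delayed()])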