Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components (a short sketch follows the list):

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
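
A minimal sketch of the two pieces working together, using dask.array (shapes and chunk sizes here are arbitrary):

    import dask.array as da

    # the collection: a large array split into many smaller chunks
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

    # operations only build a task graph; nothing has run yet
    y = (x + x.T).mean(axis=0)

    # the dynamic scheduler executes the graph in parallel on compute()
    result = y.compute()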

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
14 votes, 1 answer

How to concat multiple pandas dataframes into one dask dataframe larger than memory?

I am parsing tab-delimited data to create tabular data, which I would like to store in an HDF5 file. My problem is that I have to aggregate the data into one format and then dump it into HDF5. This is ~1 TB of data, so naturally I cannot fit it into RAM…
ShanZhengYang • 16,511 • 49 • 132 • 234
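
One sketch of the usual pattern (parse_file and the paths are placeholders): wrap each parsing step in dask.delayed, assemble the parts with dd.from_delayed, and let to_hdf write partition by partition so the full dataset never has to fit in RAM:

    import dask
    import dask.dataframe as dd
    import pandas as pd

    @dask.delayed
    def parse_file(path):
        # placeholder: parse one tab-delimited file into a pandas DataFrame
        return pd.read_csv(path, sep="\t")

    parts = [parse_file(p) for p in ["a.tsv", "b.tsv"]]  # hypothetical paths
    ddf = dd.from_delayed(parts)

    # written partition by partition; '*' spreads output across files
    ddf.to_hdf("out-*.h5", "/data")
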
13 votes, 2 answers

Writing xarray multiindex data in chunks

I am trying to efficiently restructure a large multidimensional dataset. Let's assume I have a number of remotely sensed images over time with a number of bands, with coordinates x y for pixel location, time for time of image acquisition, and band for…
mmann1123 • 5,031 • 7 • 41 • 49
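
A hedged sketch, assuming the data is held in an xarray Dataset: open it with Dask chunks and write chunk by chunk to Zarr; a pandas MultiIndex usually has to be flattened with reset_index before it can be serialized (file names and the dimension name are placeholders):

    import xarray as xr

    # hypothetical input; chunks=... backs every variable with dask arrays
    ds = xr.open_dataset("images.nc", chunks={"time": 1})

    # flatten any MultiIndex coordinate first (hypothetical dimension name)
    # ds = ds.reset_index("stacked")

    # each dask chunk is written to the store independently
    ds.to_zarr("restructured.zarr")
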
13 votes, 3 answers

Difference between dask.distributed LocalCluster with threads vs. processes

What is the difference between the following LocalCluster configurations for dask.distributed: Client(n_workers=4, processes=False, threads_per_worker=1) versus Client(n_workers=1, processes=True, threads_per_worker=4)? They both have four threads…
jrinker • 2,010 • 2 • 14 • 17
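
In short: with processes=False all workers live in the client process as threads, so data is shared without serialization but pure-Python code contends on the GIL; with processes=True workers run in separate processes, which avoids that contention between workers at the cost of serializing data moved between them. A sketch of the two setups from the question:

    from dask.distributed import Client

    # four workers as threads in the client process: memory is shared,
    # but pure-Python code contends on the GIL
    threaded = Client(n_workers=4, processes=False, threads_per_worker=1)
    threaded.close()

    # one separate worker process holding four threads: communication
    # with the client is serialized, and work is isolated from the client
    procs = Client(n_workers=1, processes=True, threads_per_worker=4)
    procs.close()
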
13 votes, 1 answer

Dask: delayed vs futures and task graph generation

I have a few basic questions on Dask: Is it correct that I have to use Futures when I want to use dask for distributed computations (i.e. on a cluster)? In that case, i.e. when working with futures, are task graphs still the way to reason about…
clog14 • 1,549 • 1 • 16 • 32
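
Short answer as a sketch: both delayed and futures feed the same schedulers with task graphs; delayed builds the graph lazily and runs it on compute(), while client.submit returns futures that begin executing immediately:

    from dask import delayed
    from dask.distributed import Client

    client = Client()  # local cluster here; the same code works remotely

    def inc(x):
        return x + 1

    # delayed is lazy: a graph is built, and runs only on compute()
    lazy = delayed(inc)(delayed(inc)(1))
    print(lazy.compute())   # 3

    # futures are eager: execution starts as soon as submit returns
    fut = client.submit(inc, client.submit(inc, 1))
    print(fut.result())     # 3
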
13 votes, 1 answer

How to read parquet file from s3 using dask with specific AWS profile

How do I read a Parquet file on S3 using Dask and a specific AWS profile (stored in a credentials file)? Dask uses s3fs, which uses boto. This is what I have tried: >>> import os >>> import s3fs >>> import boto3 >>> import dask.dataframe as…
muon • 12,821 • 11 • 69 • 88
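
One sketch, assuming the profile is defined in ~/.aws/credentials: pass it through storage_options, which Dask forwards to s3fs (bucket and path are placeholders; older s3fs releases spelled the keyword profile_name):

    import dask.dataframe as dd

    df = dd.read_parquet(
        "s3://my-bucket/path/data.parquet",          # hypothetical location
        storage_options={"profile": "my-profile"},   # handed through to s3fs
    )
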
13 votes, 4 answers

Shuffling data in dask

This is a follow-on question from Subsetting Dask DataFrames. I wish to shuffle data from a dask dataframe before sending it in batches to an ML algorithm. The answer in that question was to do the following: for part in…
sachinruk • 9,571 • 12 • 55 • 86
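
One common approach, as a hedged sketch: attach a random key to each row, then let set_index perform a full shuffle across partitions by sorting on it:

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"x": range(10)})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # random sort key per row; set_index then reshuffles every partition
    keyed = ddf.map_partitions(
        lambda d: d.assign(_rand=np.random.rand(len(d)))
    )
    shuffled = keyed.set_index("_rand")
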
13 votes, 1 answer

Understanding memory behavior of Dask distributed

Similar to this question, I'm running into memory issues with Dask distributed. However, in my case the explanation is not that the client is trying to collect a large amount of data. The problem can be illustrated based on a very simple task graph:…
bluenote10 • 23,414 • 14 • 122 • 178
13 votes, 1 answer

How to use pandas.cut() (or equivalent) in dask efficiently?

Is there an equivalent to pandas.cut() in Dask? I am trying to bin and group a large dataset in Python. It is a list of measured electrons with the properties (positionX, positionY, energy, time). I need to group it along positionX, positionY and do…
Y. Ac • 133 • 1 • 5
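
Dask has no direct pandas.cut equivalent, but binning is embarrassingly parallel when the edges are fixed up front, so one sketch applies pd.cut per partition with map_partitions (the column name and edges are placeholders):

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"positionX": np.random.rand(100)})
    ddf = dd.from_pandas(pdf, npartitions=4)

    # fixed edges keep the bins consistent across all partitions
    bins = np.linspace(0.0, 1.0, 11)  # hypothetical edges
    ddf["xbin"] = ddf["positionX"].map_partitions(pd.cut, bins)

    counts = ddf.groupby("xbin").size().compute()
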
13 votes, 4 answers

Dask "no module named xxxx" error

Using dask.distributed, I try to submit a function that is located in another file named worker.py. On the workers I get the following error: No module named 'worker'. However, I'm unable to figure out what I'm doing wrong here... Here is a sample of…
Bertrand • 994 • 9 • 23
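
The usual cause is that worker.py exists on the client machine but is not importable on the workers. One sketch of a fix, shipping the module to every worker (the scheduler address and function name are placeholders):

    from dask.distributed import Client

    client = Client("scheduler-address:8786")  # hypothetical address

    # copy worker.py to every worker so it can be imported there
    client.upload_file("worker.py")

    from worker import my_function  # hypothetical function name
    future = client.submit(my_function, 42)
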
13 votes, 2 answers

iterate over GroupBy object in dask

Is it possible to iterate over a dask GroupBy object to get access to the underlying dataframes? I tried: import dask.dataframe as dd import pandas as pd pdf = pd.DataFrame({'A':[1,2,3,4,5], 'B':['1','1','a','a','a']}) ddf = dd.from_pandas(pdf,…
Arco Bast • 3,595 • 2 • 26 • 53
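
Dask's GroupBy does not support iteration the way pandas does; a common workaround (a sketch) is to compute the distinct keys first and then select each group explicitly:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": ["1", "1", "a", "a", "a"]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # materialize the distinct keys, then pull one group at a time
    for key in ddf["B"].unique().compute():
        group = ddf[ddf["B"] == key]   # still a lazy dask DataFrame
        print(key, len(group))
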
13 votes, 2 answers

how to throttle a large number of tasks without using all workers

Imagine I have a dask grid with 10 workers and 40 cores total. This is a shared grid, so I don't want to fully saturate it with my work. I have 1000 tasks to do, and I want to submit (and have actively running) a maximum of 20 tasks at a time. To be…
Jeff • 125,376 • 21 • 220 • 187
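
A standard pattern (a sketch; the scheduler address and task function are placeholders) keeps a fixed window of futures in flight with as_completed, submitting a replacement each time one finishes:

    from dask.distributed import Client, as_completed

    client = Client("scheduler-address:8786")  # hypothetical address

    def work(i):
        return i * i

    tasks = iter(range(1000))

    # seed a window of 20 in-flight tasks
    window = as_completed([client.submit(work, next(tasks)) for _ in range(20)])

    for finished in window:
        finished.result()
        try:
            # replace each finished task to hold the window at 20
            window.add(client.submit(work, next(tasks)))
        except StopIteration:
            pass
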
12 votes, 3 answers

Applying Python function to Pandas grouped DataFrame - what's the most efficient approach to speed up the computations?

I'm dealing with quite a large pandas DataFrame; my dataset resembles the following df setup: import pandas as pd import numpy as np #--------------------------------------------- SIZING PARAMETERS : R1 = 20 # .repeat(…
Kuba_ • 886 • 6 • 22
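
If the per-group function can't be vectorized, one sketch is to parallelize the groupby-apply with Dask, supplying meta so the output schema is known without running the function (the function and columns here are placeholders):

    import pandas as pd
    import dask.dataframe as dd

    def func(g: pd.DataFrame) -> pd.DataFrame:
        # placeholder per-group computation
        return g.assign(z=g["y"] - g["y"].mean())

    pdf = pd.DataFrame({"key": [1, 1, 2, 2], "y": [1.0, 2.0, 3.0, 4.0]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    out = (
        ddf.groupby("key")
           .apply(func, meta={"key": "i8", "y": "f8", "z": "f8"})
           .compute()
    )
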
12 votes, 2 answers

Concatenating a dask dataframe and a pandas dataframe

I have a dask dataframe (df) with around 250 million rows (from a 10Gb CSV file). I have another pandas dataframe (ndf) of 25,000 rows. I would like to add the first column of the pandas dataframe to the dask dataframe by repeating every item 10,000…
najeem • 1,841 • 13 • 29
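
For a straightforward row-wise concatenation, a sketch is to promote the pandas frame to Dask first; the column-wise, repeated alignment described in the question generally needs matching divisions or an explicit join key instead:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({"a": range(6)}), npartitions=3)
    ndf = pd.DataFrame({"a": range(3)})

    # promote the pandas frame, then concatenate along the rows
    combined = dd.concat(
        [ddf, dd.from_pandas(ndf, npartitions=1)],
        interleave_partitions=True,  # the index ranges overlap here
    )
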
12 votes, 2 answers

Dask item assignment. Cannot use loc for item assignment

I have a folder of parquet files that I can't fit in memory, so I am using dask to perform the data cleansing operations. I have a function where I want to perform item assignment, but I can't seem to find any solutions online that qualify as…
Matt Elgazar • 707 • 1 • 8 • 21
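
Dask dataframes don't support .loc item assignment directly; one sketch pushes ordinary pandas assignment into each partition via map_partitions (the column name and condition are placeholders):

    import pandas as pd
    import dask.dataframe as dd

    def clean(pdf: pd.DataFrame) -> pd.DataFrame:
        pdf = pdf.copy()
        # ordinary pandas item assignment, applied per partition
        pdf.loc[pdf["x"] < 0, "x"] = 0
        return pdf

    ddf = dd.from_pandas(pd.DataFrame({"x": [-1, 2, -3]}), npartitions=2)
    ddf = ddf.map_partitions(clean)

For simple conditional replacement, Dask also implements the pandas mask/where methods on dataframes and series.
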
12 votes, 2 answers

Create sql table from dask dataframe using map_partitions and pd.df.to_sql

Dask doesn't have a df.to_sql() like pandas, so I am trying to replicate the functionality and create a SQL table using the map_partitions method. Here is my code: import dask.dataframe as dd import pandas as pd import sqlalchemy_utils…
Ludo • 2,307 • 2 • 27 • 58
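
A sketch of the partition-wise write (connection string and table name are placeholders). The engine must be created inside the task, since database connections can't be serialized to workers; the to_delayed route below sidesteps map_partitions' meta bookkeeping, and newer Dask releases also ship a built-in DataFrame.to_sql:

    import pandas as pd
    import dask
    import dask.dataframe as dd
    from sqlalchemy import create_engine

    URI = "sqlite:///out.db"  # hypothetical connection string

    def write_partition(pdf: pd.DataFrame) -> None:
        # create the engine inside the task: connections don't serialize
        engine = create_engine(URI)
        pdf.to_sql("mytable", engine, if_exists="append", index=False)

    ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)

    # one delayed write per partition, executed in parallel
    dask.compute(*[dask.delayed(write_partition)(p) for p in ddf.to_delayed()])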