Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
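
To make the split concrete, here is a minimal sketch that exercises both pieces. It assumes only that dask and pandas are installed; the data and functions are toy placeholders, not part of the tag description.

    import dask
    import dask.dataframe as dd
    import pandas as pd

    # Component 1: dynamic task scheduling via dask.delayed.
    @dask.delayed
    def inc(x):
        return x + 1

    @dask.delayed
    def add(x, y):
        return x + y

    total = add(inc(1), inc(2))   # builds a task graph; nothing has run yet
    print(total.compute())        # -> 5, executed by the scheduler

    # Component 2: a "big data" collection that mirrors the pandas API.
    pdf = pd.DataFrame({"x": range(10), "y": range(10)})
    ddf = dd.from_pandas(pdf, npartitions=2)   # parallel dataframe in 2 partitions
    print(ddf.x.sum().compute())               # also runs as a task graph underneath

The collections on top generate task graphs, and the schedulers underneath execute them, which is why the two components are listed separately.
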

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
5 votes, 1 answer

How would I use Dask to perform parallel operations on slices of NumPy arrays?

I have a numpy array of coordinates of size n_slice x 2048 x 3, where n_slice is in the tens of thousands. I want to apply the following operation on each 2048 x 3 slice separately import numpy as np from scipy.spatial.distance import pdist # load…
asked by Steven C. Howell
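
For per-slice workloads like the one above, one hedged sketch (not the accepted answer) is to wrap the slice computation in dask.delayed; the array is scaled down here and slice_distances is a stand-in for the real per-slice operation:

    import numpy as np
    import dask
    from scipy.spatial.distance import pdist

    # Small stand-in for the n_slice x 2048 x 3 coordinate array.
    coords = np.random.random((100, 64, 3))

    @dask.delayed
    def slice_distances(block):
        # pdist returns the condensed pairwise-distance vector for one slice.
        return pdist(block)

    # One lazy task per slice; nothing runs until compute() is called.
    tasks = [slice_distances(coords[i]) for i in range(coords.shape[0])]
    results = dask.compute(*tasks)   # pass scheduler="processes" if the work holds the GIL
    print(len(results), results[0].shape)

Chunking the array with dask.array and map_blocks is another route when the per-slice result keeps a regular shape.
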
5 votes, 1 answer

Does dask distributed use Tornado coroutines for workers tasks?

I've read at the dask distributed documentation that: Worker and Scheduler nodes operate concurrently. They serve several overlapping requests and perform several overlapping computations at the same time without blocking. I've always thought…
asked by dukebody
5 votes, 1 answer

Lazily create dask dataframe from generator

I want to lazily create a Dask dataframe from a generator, which looks something like: [parser.read(local_file_name) for local_file_name in repo.download_files())] Where both parser.read and repo.download_files return generators (using yield).…
asked by morganics
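
The usual pattern behind this kind of question is dd.from_delayed: wrap each lazy read in dask.delayed and let dask assemble the dataframe. A hedged sketch in which parse_one and the file names are invented for illustration:

    import dask
    import dask.dataframe as dd
    import pandas as pd

    def parse_one(name):
        # Placeholder for parser.read(local_file_name); must return a pandas DataFrame.
        return pd.DataFrame({"file": [name], "value": [len(name)]})

    file_names = ["a.csv", "b.csv", "c.csv"]   # stand-in for repo.download_files()

    # Nothing is read yet: each element is a lazy pandas DataFrame.
    parts = [dask.delayed(parse_one)(name) for name in file_names]

    # meta tells dask the schema without computing anything.
    meta = pd.DataFrame({"file": pd.Series(dtype=str), "value": pd.Series(dtype=int)})
    ddf = dd.from_delayed(parts, meta=meta)
    print(ddf.value.sum().compute())
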
5 votes, 1 answer

Dask: is it safe to pickle a dataframe for later use?

I have a database-like object containing many dask dataframes. I would like to work with the data, save it and reload it on the next day to continue the analysis. Therefore, I tried saving dask dataframes (not computation results, just the "plan of…
asked by Arco Bast
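
Whether pickling the lazy graph survives across dask versions is exactly what the question probes; a more conservative save-and-resume sketch (assuming pyarrow or fastparquet is installed, with a placeholder path) writes the partitioned data to Parquet and reloads it later:

    import dask.dataframe as dd
    import pandas as pd

    ddf = dd.from_pandas(pd.DataFrame({"x": range(100), "y": range(100)}),
                         npartitions=4)

    # Day 1: materialize the current state of the data to disk.
    ddf.to_parquet("analysis_checkpoint.parquet")

    # Day 2: pick the analysis back up from the stored partitions.
    restored = dd.read_parquet("analysis_checkpoint.parquet")
    print(restored.x.mean().compute())

Pickling the dataframe object only stores the plan, so the original source files must still be readable when it is reloaded.
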
5 votes, 1 answer

How do you transpose a dask dataframe (convert columns to rows) to approach tidy data principles

TLDR: I created a dask dataframe from a dask bag. The dask dataframe treats every observation (event) as a column. So, instead of having rows of data for each event, I have a column for each event. The goal is to transpose the columns to rows in…
asked by Linwoodc3
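
A full transpose would pull the whole frame into memory, so the tidy-data reshaping is usually expressed as a melt instead; a hedged sketch with invented column names:

    import dask.dataframe as dd
    import pandas as pd

    wide = pd.DataFrame({"id": [1, 2],
                         "event_a": [10, 20],
                         "event_b": [30, 40]})
    ddf = dd.from_pandas(wide, npartitions=1)

    # Wide -> long: each (id, event) pair becomes its own row.
    tidy = dd.melt(ddf, id_vars="id",
                   value_vars=["event_a", "event_b"],
                   var_name="event", value_name="value")
    print(tidy.compute())
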
5 votes, 1 answer

using spot instances with dask.distributed

Does dask.distributed support using EC2 spot instances with dask-ec2? I didn't see an option for that at http://distributed.readthedocs.io/en/latest/ec2.html
asked by JRR
5 votes, 1 answer

Dask: very low CPU usage and multiple threads? is this expected?

I am using dask as in how to parallelize many (fuzzy) string comparisons using apply in Pandas? Basically I do some computations (without writing anything to disk) that invoke Pandas and Fuzzywuzzy (that may not be releasing the GIL apparently, if…
asked by ℕʘʘḆḽḘ
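
When the user code holds the GIL, the default threaded scheduler can only keep one core busy; a hedged sketch of routing the same kind of work through the process-based scheduler, with score as a trivial stand-in for the fuzzywuzzy comparison:

    import pandas as pd
    import dask.dataframe as dd

    def score(s):
        # Stand-in for a pure-Python, GIL-holding comparison such as fuzzywuzzy.
        return sum(ord(c) for c in s) % 100

    if __name__ == "__main__":
        pdf = pd.DataFrame({"name": ["alpha", "beta", "gamma", "delta"] * 1000})
        ddf = dd.from_pandas(pdf, npartitions=4)

        scores = ddf["name"].apply(score, meta=("name", "int64"))

        # Threads share one GIL; separate processes sidestep it, at a pickling cost.
        print(scores.sum().compute(scheduler="processes"))
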
5 votes, 1 answer

Name columns when importing csv to dataframe in dask

I would like to name columns when I import a csv to a dataframe with dask in Python. The code I use looks like this: for i in range(1, files + 1): filename = str(i) + 'GlobalActorsHeatMap.csv' runs[i] = dd.read_csv(filename,…
asked by Jim Caton
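
dd.read_csv forwards most keyword arguments to pandas.read_csv, so names and header behave as they do in pandas. A hedged, self-contained sketch that first writes a tiny headerless file (the file name mirrors the question, but the column names are invented):

    import dask.dataframe as dd

    # Create a small headerless CSV so the example runs standalone.
    with open("1GlobalActorsHeatMap.csv", "w") as f:
        f.write("1,US,3.5\n2,FR,2.1\n")

    ddf = dd.read_csv("1GlobalActorsHeatMap.csv",
                      header=None,                       # the file has no header row
                      names=["id", "country", "score"])  # assign column names
    print(ddf.columns.tolist())
    print(ddf.head())
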
5 votes, 1 answer

Can dask work with an endless streaming input

I understand that dask work well in batch mode like this def load(filename): ... def clean(data): ... def analyze(sequence_of_data): ... def store(result): with open(..., 'w') as f: f.write(result) dsk = {'load-1':…
asked by sami
5 votes, 2 answers

Dask DataFrame: Resample over groupby object with multiple rows

I have the following dask dataframe created from Castra: import dask.dataframe as dd df = dd.from_castra('data.castra', columns=['user_id','ts','text']) Yielding: user_id / ts / text ts 2015-08-08 01:10:00 …
asked by zanbri
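
One hedged way to express a per-group resample is groupby().apply() with ordinary pandas code inside, since each group arrives as a plain pandas DataFrame; the data below is a toy version of the user_id/ts/text frame from the question:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({
        "user_id": [1, 1, 2, 2],
        "ts": pd.to_datetime(["2015-08-08 01:10", "2015-08-08 01:50",
                              "2015-08-08 02:05", "2015-08-08 02:20"]),
        "text": ["a", "b", "c", "d"],
    })
    ddf = dd.from_pandas(pdf, npartitions=2)

    def per_user(group):
        # Each group is a pandas DataFrame, so pandas resampling works here.
        return group.set_index("ts").resample("30min").count()[["text"]]

    # groupby-apply shuffles the data so each user lands in a single partition.
    out = ddf.groupby("user_id").apply(per_user,
                                       meta=pd.DataFrame({"text": pd.Series(dtype="int64")}))
    print(out.compute())
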
5 votes, 2 answers

Parallelize loop over numpy rows

I need to apply the same function onto every row in a numpy array and store the result again in a numpy array. # states will contain results of function applied to a row in array states = np.empty_like(array) for i, ar in enumerate(array): …
asked by Max Linke
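
A hedged sketch of one way to do this with dask.array: chunk along the rows so each block contains whole rows, then map a block-level wrapper over the chunks. row_function is a placeholder for the real per-row computation:

    import numpy as np
    import dask.array as da

    def row_function(row):
        # Stand-in for the per-row computation; returns a row of the same shape.
        return np.sort(row)

    array = np.random.random((10000, 8))

    # Chunk along the rows only, so every block holds complete rows.
    darr = da.from_array(array, chunks=(1000, 8))

    # Each block is processed independently; apply_along_axis handles the rows inside a block.
    states = darr.map_blocks(lambda block: np.apply_along_axis(row_function, 1, block),
                             dtype=array.dtype)
    print(states.compute().shape)
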
5 votes, 2 answers

How to deal with modifying large pandas dataframes

I have a largish pandas dataframe (1.5gig .csv on disk). I can load it into memory and query it. I want to create a new column that is combined value of two other columns, and I tried this: def combined(row): row['combined'] =…
asked by Christopher
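
For a derived column, a vectorized column expression on the dask dataframe is usually much faster than a Python-level row function; a hedged sketch with invented column names:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"first": ["a", "b"], "second": ["x", "y"]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # Column-wise (vectorized) operations avoid a slow per-row Python loop.
    ddf["combined"] = ddf["first"] + "_" + ddf["second"]

    print(ddf.compute())
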
4 votes, 2 answers

Python pandas group by, transform multiple columns with custom conditions

I have dataframe containing 500k+ records and I would like to group-by multiple columns (data type of string and date) and later pick only few records inside each group based on custom condition. Basically, I need to group the records (by…
asked by Govind
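
A hedged sketch of the group-then-filter pattern in plain pandas (dask.dataframe offers the same groupby().apply()); the columns and the "custom condition" here are invented for illustration:

    import pandas as pd

    df = pd.DataFrame({
        "key": ["a", "a", "a", "b", "b"],
        "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03",
                                "2021-01-01", "2021-01-02"]),
        "value": [5, 9, 1, 7, 3],
    })

    def pick(group):
        # Example per-group condition: keep the two most recent rows
        # whose value exceeds the group's mean.
        keep = group[group["value"] > group["value"].mean()]
        return keep.sort_values("date").tail(2)

    result = df.groupby("key", group_keys=False).apply(pick)
    print(result)
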
4 votes, 1 answer

Setting maximum number of workers in Dask map function

I have a Dask process that triggers 100 workers with a map function: worker_args = .... # array with 100 elements with worker parameters futures = client.map(function_in_worker, worker_args) worker_responses = client.gather(futures) I use docker…
asked by ps0604
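
One hedged workaround for capping concurrency is to keep a bounded window of in-flight tasks with as_completed; LocalCluster stands in for the docker-based cluster in the question, and max_in_flight is an invented knob rather than a dask setting:

    from dask.distributed import Client, LocalCluster, as_completed

    def function_in_worker(x):
        return x * x

    def main():
        worker_args = list(range(100))
        max_in_flight = 8   # invented cap on how many tasks may run at once

        # Local stand-in for the docker-based cluster in the question.
        cluster = LocalCluster(n_workers=4, threads_per_worker=1)
        client = Client(cluster)

        # Submit an initial window, then top it up as tasks finish.
        seq = as_completed(client.map(function_in_worker, worker_args[:max_in_flight]))
        pending = iter(worker_args[max_in_flight:])
        results = []
        for future in seq:
            results.append(future.result())
            nxt = next(pending, None)
            if nxt is not None:
                seq.add(client.submit(function_in_worker, nxt, pure=False))

        print(len(results), sum(results))
        client.close()
        cluster.close()

    if __name__ == "__main__":
        main()
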
4 votes, 2 answers

Is there a way to traverse through a dask dataframe backwards?

I want to read_parquet but read backwards from where you start (assuming a sorted index). I don't want to read the entire parquet into memory because that defeats the whole point of using it. Is there a nice way to do this?
asked by Anina Hitt