Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
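
To make the split concrete, here is a minimal sketch that exercises both pieces. It assumes only that dask and pandas are installed; the data and functions are toy placeholders, not part of the tag description.

    import dask
    import dask.dataframe as dd
    import pandas as pd

    # Component 1: dynamic task scheduling via dask.delayed.
    @dask.delayed
    def inc(x):
        return x + 1

    @dask.delayed
    def add(x, y):
        return x + y

    total = add(inc(1), inc(2))   # builds a task graph; nothing has run yet
    print(total.compute())        # -> 5, executed by the scheduler

    # Component 2: a "big data" collection that mirrors the pandas API.
    pdf = pd.DataFrame({"x": range(10), "y": range(10)})
    ddf = dd.from_pandas(pdf, npartitions=2)   # parallel dataframe in 2 partitions
    print(ddf.x.sum().compute())               # also runs as a task graph underneath

The collections on top generate task graphs, and the schedulers underneath execute them, which is why the two components are listed separately.
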

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
5 votes, 1 answer

How would I use Dask to perform parallel operations on slices of NumPy arrays?

I have a numpy array of coordinates of size n_slice x 2048 x 3, where n_slice is in the tens of thousands. I want to apply the following operation on each 2048 x 3 slice separately import numpy as np from scipy.spatial.distance import pdist # load…
asked by Steven C. Howell
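
For per-slice workloads like the one above, one hedged sketch (not the accepted answer) is to wrap the slice computation in dask.delayed; the array is scaled down here and slice_distances is a stand-in for the real per-slice operation:

    import numpy as np
    import dask
    from scipy.spatial.distance import pdist

    # Small stand-in for the n_slice x 2048 x 3 coordinate array.
    coords = np.random.random((100, 64, 3))

    @dask.delayed
    def slice_distances(block):
        # pdist returns the condensed pairwise-distance vector for one slice.
        return pdist(block)

    # One lazy task per slice; nothing runs until compute() is called.
    tasks = [slice_distances(coords[i]) for i in range(coords.shape[0])]
    results = dask.compute(*tasks)   # pass scheduler="processes" if the work holds the GIL
    print(len(results), results[0].shape)

Chunking the array with dask.array and map_blocks is another route when the per-slice result keeps a regular shape.
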
5 votes, 1 answer

Does dask distributed use Tornado coroutines for workers tasks?

I've read at the dask distributed documentation that: Worker and Scheduler nodes operate concurrently. They serve several overlapping requests and perform several overlapping computations at the same time without blocking. I've always thought…
asked by dukebody
5 votes, 1 answer

Lazily create dask dataframe from generator

I want to lazily create a Dask dataframe from a generator, which looks something like: [parser.read(local_file_name) for local_file_name in repo.download_files())] Where both parser.read and repo.download_files return generators (using yield).…
asked by morganics
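
The usual pattern behind this kind of question is dd.from_delayed: wrap each lazy read in dask.delayed and let dask assemble the dataframe. A hedged sketch in which parse_one and the file names are invented for illustration:

    import dask
    import dask.dataframe as dd
    import pandas as pd

    def parse_one(name):
        # Placeholder for parser.read(local_file_name); must return a pandas DataFrame.
        return pd.DataFrame({"file": [name], "value": [len(name)]})

    file_names = ["a.csv", "b.csv", "c.csv"]   # stand-in for repo.download_files()

    # Nothing is read yet: each element is a lazy pandas DataFrame.
    parts = [dask.delayed(parse_one)(name) for name in file_names]

    # meta tells dask the schema without computing anything.
    meta = pd.DataFrame({"file": pd.Series(dtype=str), "value": pd.Series(dtype=int)})
    ddf = dd.from_delayed(parts, meta=meta)
    print(ddf.value.sum().compute())
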
5 votes, 1 answer

Dask: is it safe to pickle a dataframe for later use?

I have a database-like object containing many dask dataframes. I would like to work with the data, save it and reload it on the next day to continue the analysis. Therefore, I tried saving dask dataframes (not computation results, just the "plan of…
asked by Arco Bast
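
Whether pickling the lazy graph survives across dask versions is exactly what the question probes; a more conservative save-and-resume sketch (assuming pyarrow or fastparquet is installed, with a placeholder path) writes the partitioned data to Parquet and reloads it later:

    import dask.dataframe as dd
    import pandas as pd

    ddf = dd.from_pandas(pd.DataFrame({"x": range(100), "y": range(100)}),
                         npartitions=4)

    # Day 1: materialize the current state of the data to disk.
    ddf.to_parquet("analysis_checkpoint.parquet")

    # Day 2: pick the analysis back up from the stored partitions.
    restored = dd.read_parquet("analysis_checkpoint.parquet")
    print(restored.x.mean().compute())

Pickling the dataframe object only stores the plan, so the original source files must still be readable when it is reloaded.
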
5 votes, 1 answer

How do you transpose a dask dataframe (convert columns to rows) to approach tidy data principles

TLDR: I created a dask dataframe from a dask bag. The dask dataframe treats every observation (event) as a column. So, instead of having rows of data for each event, I have a column for each event. The goal is to transpose the columns to rows in…
asked by Linwoodc3
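
A full transpose would pull the whole frame into memory, so the tidy-data reshaping is usually expressed as a melt instead; a hedged sketch with invented column names:

    import dask.dataframe as dd
    import pandas as pd

    wide = pd.DataFrame({"id": [1, 2],
                         "event_a": [10, 20],
                         "event_b": [30, 40]})
    ddf = dd.from_pandas(wide, npartitions=1)

    # Wide -> long: each (id, event) pair becomes its own row.
    tidy = dd.melt(ddf, id_vars="id",
                   value_vars=["event_a", "event_b"],
                   var_name="event", value_name="value")
    print(tidy.compute())
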
5 votes, 1 answer

using spot instances with dask.distributed

Does dask.distributed support using EC2 spot instances with dask-ec2? I didn't see an option for that at http://distributed.readthedocs.io/en/latest/ec2.html
asked by JRR
5 votes, 1 answer

Dask: very low CPU usage and multiple threads? is this expected?

I am using dask as in how to parallelize many (fuzzy) string comparisons using apply in Pandas? Basically I do some computations (without writing anything to disk) that invoke Pandas and Fuzzywuzzy (that may not be releasing the GIL apparently, if…
asked by ℕʘʘḆḽḘ
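
When the user code holds the GIL, the default threaded scheduler can only keep one core busy; a hedged sketch of routing the same kind of work through the process-based scheduler, with score as a trivial stand-in for the fuzzywuzzy comparison:

    import pandas as pd
    import dask.dataframe as dd

    def score(s):
        # Stand-in for a pure-Python, GIL-holding comparison such as fuzzywuzzy.
        return sum(ord(c) for c in s) % 100

    if __name__ == "__main__":
        pdf = pd.DataFrame({"name": ["alpha", "beta", "gamma", "delta"] * 1000})
        ddf = dd.from_pandas(pdf, npartitions=4)

        scores = ddf["name"].apply(score, meta=("name", "int64"))

        # Threads share one GIL; separate processes sidestep it, at a pickling cost.
        print(scores.sum().compute(scheduler="processes"))
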
5 votes, 1 answer

Name columns when importing csv to dataframe in dask

I would like to name columns when I import a csv to a dataframe with dask in Python. The code I use looks like this: for i in range(1, files + 1): filename = str(i) + 'GlobalActorsHeatMap.csv' runs[i] = dd.read_csv(filename,…
asked by Jim Caton
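
dd.read_csv forwards most keyword arguments to pandas.read_csv, so names and header behave as they do in pandas. A hedged, self-contained sketch that first writes a tiny headerless file (the file name mirrors the question, but the column names are invented):

    import dask.dataframe as dd

    # Create a small headerless CSV so the example runs standalone.
    with open("1GlobalActorsHeatMap.csv", "w") as f:
        f.write("1,US,3.5\n2,FR,2.1\n")

    ddf = dd.read_csv("1GlobalActorsHeatMap.csv",
                      header=None,                       # the file has no header row
                      names=["id", "country", "score"])  # assign column names
    print(ddf.columns.tolist())
    print(ddf.head())
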
5 votes, 1 answer

Can dask work with an endless streaming input

I understand that dask work well in batch mode like this def load(filename): ... def clean(data): ... def analyze(sequence_of_data): ... def store(result): with open(..., 'w') as f: f.write(result) dsk = {'load-1':…
asked by sami
5 votes, 2 answers

Dask DataFrame: Resample over groupby object with multiple rows

I have the following dask dataframe created from Castra: import dask.dataframe as dd df = dd.from_castra('data.castra', columns=['user_id','ts','text']) Yielding: user_id / ts / text ts 2015-08-08 01:10:00 …
asked by zanbri
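
One hedged way to express a per-group resample is groupby().apply() with ordinary pandas code inside, since each group arrives as a plain pandas DataFrame; the data below is a toy version of the user_id/ts/text frame from the question:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({
        "user_id": [1, 1, 2, 2],
        "ts": pd.to_datetime(["2015-08-08 01:10", "2015-08-08 01:50",
                              "2015-08-08 02:05", "2015-08-08 02:20"]),
        "text": ["a", "b", "c", "d"],
    })
    ddf = dd.from_pandas(pdf, npartitions=2)

    def per_user(group):
        # Each group is a pandas DataFrame, so pandas resampling works here.
        return group.set_index("ts").resample("30min").count()[["text"]]

    # groupby-apply shuffles the data so each user lands in a single partition.
    out = ddf.groupby("user_id").apply(per_user,
                                       meta=pd.DataFrame({"text": pd.Series(dtype="int64")}))
    print(out.compute())
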
5 votes, 2 answers

Parallelize loop over numpy rows

I need to apply the same function onto every row in a numpy array and store the result again in a numpy array. # states will contain results of function applied to a row in array states = np.empty_like(array) for i, ar in enumerate(array): …
asked by Max Linke
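
A hedged sketch of one way to do this with dask.array: chunk along the rows so each block contains whole rows, then map a block-level wrapper over the chunks. row_function is a placeholder for the real per-row computation:

    import numpy as np
    import dask.array as da

    def row_function(row):
        # Stand-in for the per-row computation; returns a row of the same shape.
        return np.sort(row)

    array = np.random.random((10000, 8))

    # Chunk along the rows only, so every block holds complete rows.
    darr = da.from_array(array, chunks=(1000, 8))

    # Each block is processed independently; apply_along_axis handles the rows inside a block.
    states = darr.map_blocks(lambda block: np.apply_along_axis(row_function, 1, block),
                             dtype=array.dtype)
    print(states.compute().shape)
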
5 votes, 2 answers

How to deal with modifying large pandas dataframes

I have a largish pandas dataframe (1.5gig .csv on disk). I can load it into memory and query it. I want to create a new column that is combined value of two other columns, and I tried this: def combined(row): row['combined'] =…
asked by Christopher
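
For a derived column, a vectorized column expression on the dask dataframe is usually much faster than a Python-level row function; a hedged sketch with invented column names:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"first": ["a", "b"], "second": ["x", "y"]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # Column-wise (vectorized) operations avoid a slow per-row Python loop.
    ddf["combined"] = ddf["first"] + "_" + ddf["second"]

    print(ddf.compute())
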
4 votes, 2 answers

Python pandas group by, transform multiple columns with custom conditions

I have dataframe containing 500k+ records and I would like to group-by multiple columns (data type of string and date) and later pick only few records inside each group based on custom condition. Basically, I need to group the records (by…
asked by Govind
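
A hedged sketch of the group-then-filter pattern in plain pandas (dask.dataframe offers the same groupby().apply()); the columns and the "custom condition" here are invented for illustration:

    import pandas as pd

    df = pd.DataFrame({
        "key": ["a", "a", "a", "b", "b"],
        "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03",
                                "2021-01-01", "2021-01-02"]),
        "value": [5, 9, 1, 7, 3],
    })

    def pick(group):
        # Example per-group condition: keep the two most recent rows
        # whose value exceeds the group's mean.
        keep = group[group["value"] > group["value"].mean()]
        return keep.sort_values("date").tail(2)

    result = df.groupby("key", group_keys=False).apply(pick)
    print(result)
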
4 votes, 1 answer

Setting maximum number of workers in Dask map function

I have a Dask process that triggers 100 workers with a map function: worker_args = .... # array with 100 elements with worker parameters futures = client.map(function_in_worker, worker_args) worker_responses = client.gather(futures) I use docker…
asked by ps0604
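
One hedged workaround for capping concurrency is to keep a bounded window of in-flight tasks with as_completed; LocalCluster stands in for the docker-based cluster in the question, and max_in_flight is an invented knob rather than a dask setting:

    from dask.distributed import Client, LocalCluster, as_completed

    def function_in_worker(x):
        return x * x

    def main():
        worker_args = list(range(100))
        max_in_flight = 8   # invented cap on how many tasks may run at once

        # Local stand-in for the docker-based cluster in the question.
        cluster = LocalCluster(n_workers=4, threads_per_worker=1)
        client = Client(cluster)

        # Submit an initial window, then top it up as tasks finish.
        seq = as_completed(client.map(function_in_worker, worker_args[:max_in_flight]))
        pending = iter(worker_args[max_in_flight:])
        results = []
        for future in seq:
            results.append(future.result())
            nxt = next(pending, None)
            if nxt is not None:
                seq.add(client.submit(function_in_worker, nxt, pure=False))

        print(len(results), sum(results))
        client.close()
        cluster.close()

    if __name__ == "__main__":
        main()
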
4 votes, 2 answers

Is there a way to traverse through a dask dataframe backwards?

I want to read_parquet but read backwards from where you start (assuming a sorted index). I don't want to read the entire parquet into memory because that defeats the whole point of using it. Is there a nice way to do this?
asked by Anina Hitt