Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components:

Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
“Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions

4 answers

How to copy a dask dataframe?

Given a pandas df, one can copy it before doing anything via df.copy(). How can I do this with a dask dataframe object?

python dask

asked Aug 03 '16 at 11:39

Michael

1 answer

Does Dask support functions with multiple outputs in Custom Graphs?

The Custom Graphs API of Dask seems to support only functions returning one output key/value. For example, the following dependency could not be easily represented as a Dask graph:

    B -> D
   /      \
A-         -> F
   \      /
    C -> E

This…

python dask

asked Jul 15 '16 at 22:30

Petr Wolf
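
A minimal sketch of one way a multi-output function might be expressed in a Dask-style graph, assuming the usual convention that a graph is a dict mapping keys to tasks and a task is a tuple whose first element is a callable. The evaluator `get` below is a toy written for illustration, not Dask's scheduler, and `split_a` and all key names are hypothetical. The idea is to store the tuple of outputs under one key and pick the pieces out with `operator.getitem`.

```python
from operator import getitem


def split_a(a):
    # Hypothetical multi-output task: returns two values at once.
    return a + 1, a * 2


def get(graph, key, cache=None):
    """Toy recursive evaluator for a dask-style graph (not Dask's scheduler)."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # String arguments that name a graph key are dependencies;
        # everything else is passed through as a literal.
        resolved = [
            get(graph, arg, cache) if isinstance(arg, str) and arg in graph else arg
            for arg in args
        ]
        result = func(*resolved)
    else:
        result = task  # a literal value stored directly in the graph
    cache[key] = result
    return result


graph = {
    "A": 10,
    "BC": (split_a, "A"),     # one task producing both outputs
    "B": (getitem, "BC", 0),  # select the first output
    "C": (getitem, "BC", 1),  # select the second output
    "D": (abs, "B"),
    "E": (abs, "C"),
    "F": (max, "D", "E"),
}

result = get(graph, "F")
print(result)
```

Because the graph is a plain dict, it could in principle be handed to a real Dask scheduler instead of the toy `get`, though that is not verified here.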

1 answer

How do I actually get dask to compute a list of delayed or dask-container-based results?

I have a trivially parallelizable task of computing results independently for many tables split across many files. I can construct delayed or dask.dataframe lists (and have also tried with, e.g., a dict), but I cannot get all of the results to…

python dask

asked May 24 '16 at 00:56

Dav Clark

2 answers

How do Dask dataframes handle larger-than-memory datasets?

The documentation of the Dask package for dataframes says: Dask dataframes look and feel like pandas dataframes, but operate on datasets larger than memory using multiple threads. But later in the same page: One dask DataFrame is comprised of…

python dask bigdata

asked Mar 28 '16 at 19:17

dukebody

1 answer

Correct choice of chunks-specification for dask array

According to the dask documentation, it's possible to specify the chunks in one of three ways: a blocksize like 1000; a blockshape like (1000, 1000); or explicit sizes of all blocks along all dimensions, like ((1000, 1000, 500), (400, 400)). Your chunks…

python dask

asked Jan 20 '16 at 09:14

istern
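
To make the three chunk forms concrete, here is a small pure-Python sketch that expands each of them into the explicit per-dimension form. The helper `expand_chunks` and the array shape (2500, 800) are assumptions made for illustration; this is not Dask's own normalization code, only a model of what the three spellings mean.

```python
def expand_chunks(chunks, shape):
    """Expand a chunks spec into explicit block sizes for every dimension."""
    if isinstance(chunks, int):
        # A single blocksize applies to every dimension.
        chunks = (chunks,) * len(shape)
    if len(chunks) != len(shape):
        raise ValueError("chunks and shape must have the same number of dimensions")

    explicit = []
    for spec, dim in zip(chunks, shape):
        if isinstance(spec, int):
            # Full blocks of size `spec`, plus one smaller trailing block if needed.
            full, rest = divmod(dim, spec)
            sizes = (spec,) * full + ((rest,) if rest else ())
        else:
            # Already explicit: just check that the sizes cover the dimension.
            sizes = tuple(spec)
            if sum(sizes) != dim:
                raise ValueError("explicit chunk sizes must sum to the dimension length")
        explicit.append(sizes)
    return tuple(explicit)


shape = (2500, 800)  # assumed array shape
from_blocksize = expand_chunks(1000, shape)
from_blockshape = expand_chunks((1000, 400), shape)
from_explicit = expand_chunks(((1000, 1000, 500), (400, 400)), shape)
print(from_blocksize, from_blockshape, from_explicit)
```

For this shape, the blockshape (1000, 400) and the explicit form describe the same blocks, while the blocksize 1000 leaves the second dimension as a single block of 800.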

1 answer

Creating a dask dataframe by reading a pickle file in the dask module of Python

When I try to create a dask dataframe by reading a pickle file, I get an error: import dask.dataframe as dd; ds_df = dd.read_pickle("D:\test.pickle") raises AttributeError: 'module' object has no attribute 'read_pickle', but it works fine with…

python dask

asked Dec 14 '15 at 09:15

Satya

1 answer

Choosing a framework for larger than memory data analysis with python

I'm solving a problem with a dataset that is larger than memory. The original dataset is a .csv file. One of the columns is for track IDs from the musicbrainz service. What I already did: I read the .csv file with dask and converted it to castra…

python hdf5 blaze dask

asked Oct 14 '15 at 15:42

Nagasaki45

2 answers

Can't drop columns or slice dataframe using dask?

I am trying to use dask instead of pandas since I have a 2.6 GB csv file. I load it and I want to drop a column, but it seems that neither the drop method df.drop('column') nor slicing df[ : , :-1] is implemented yet. Is this the case or am I just…

dask

asked Aug 07 '15 at 00:47

chrisfs

1 answer

Computing a norm in a loop slows down the computation with Dask

I was trying to implement a conjugate gradient algorithm using Dask (for didactic purposes) when I realized that the performance was far worse than that of a simple numpy implementation. After a few experiments, I have been able to reduce the problem to…

python numpy dask

asked Apr 22 '23 at 12:44

SteP

3 answers

Dask dataframe: Can `set_index` put a single index into multiple partitions?

Empirically it seems that whenever you set_index on a Dask dataframe, Dask will always put rows with equal indexes into a single partition, even if it results in wildly imbalanced partitions. Here is a demonstration: import pandas as pd import…

python dataframe indexing dask

asked Oct 14 '21 at 12:27

Dahn

1 answer

Dask map method in function with multiple arguments

I want to apply the Client.map method to a function that uses multiple arguments, as the Pool.starmap method of multiprocessing does. Here is an example: from contextlib import contextmanager from dask.distributed import Client @contextmanager def…

python dask dask-distributed

asked Aug 18 '21 at 15:07

Andrex
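
One common way to get starmap-like behaviour from a map that takes several iterables is to transpose the argument tuples with `zip(*...)`. The sketch below shows the idea with the standard library's `ThreadPoolExecutor.map`; it assumes that `Client.map` accepts multiple iterables in the same way, in which case the call would look like `client.map(power, *columns)` and return futures to gather. The function `power` and the argument list are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def power(base, exponent):
    # Hypothetical two-argument function to be mapped.
    return base ** exponent


# One tuple per call, as Pool.starmap would expect.
arg_tuples = [(2, 3), (3, 2), (10, 0)]

# Transpose into one iterable per positional argument:
# [(2, 3, 10), (3, 2, 0)]
columns = list(zip(*arg_tuples))

with ThreadPoolExecutor(max_workers=2) as pool:
    # map pairs the iterables element-wise, like the builtin map.
    results = list(pool.map(power, *columns))

print(results)
```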

2 answers

Reload Dask worker containers automatically on code change

I have the Dask code below that submits N workers, where each worker is implemented in a Docker container: default_sums = client.map(process_asset_defaults, build_worker_args(req, numWorkers)) future_total_sum = client.submit(sum,…

python docker dask dask-distributed

asked Jul 02 '21 at 19:56

ps0604

1 answer

ALS algorithm in Dask optimization

I am trying to implement the ALS algorithm in Dask, but I am having trouble figuring out how to compute latent features in one step. I followed formulas on this stackoverflow thread and came up with this code: Items =…

python classification sparse-matrix dask matrix-factorization

asked May 22 '21 at 14:56

user4952634

3 answers

Deploying a cluster of containers in Azure

I have a Docker application that works fine on my laptop on Windows using compose and starting multiple instances of a container as a Dask cluster. The name of the service is "worker" and I start two container instances like so: docker compose up…

azure docker docker-compose dask dask-distributed

asked Apr 20 '21 at 17:31

ps0604

1 answer

Read group of rows from Parquet file in Python Pandas / Dask?

I have a Pandas dataframe that looks similar to this:

datetime                 data1  data2
2021-01-23 00:00:31.140  a1     a2
2021-01-23 00:00:31.140  b1     b2
2021-01-23 00:00:31.140  c1     c2
2021-01-23 00:01:29.021  d1     …

python pandas dask parquet dask-dataframe

asked Mar 06 '21 at 03:47

Mike
