Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
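The first component can be pictured as a graph of tasks whose keys depend on one another. The following is a minimal standard-library sketch of that idea, not Dask's scheduler. It assumes a dictionary convention in which each key maps either to a literal value or to a tuple of a callable followed by its arguments; the names `graph` and `get` are illustrative only.

```python
from operator import add, mul

# Hypothetical task graph: each key maps to a literal value or to a
# tuple of (callable, argument, ...), where arguments may name other keys.
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "product": (mul, "sum", 10),
}


def get(graph, key, cache=None):
    """Evaluate `key` by recursively evaluating the keys it depends on."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # Arguments that name another key are computed first.
        resolved = [
            get(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        result = func(*resolved)
    else:
        # Anything that is not a task tuple is treated as a literal value.
        result = task
    cache[key] = result
    return result


result = get(graph, "product")
print(result)
```

This sketch evaluates tasks serially. A real scheduler would also run independent tasks in parallel, which is the part the parallel collections build on.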

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
7
votes
4 answers

How to copy a dask dataframe?

Given a pandas df, one can copy it before doing anything via df.copy(). How can I do this with a dask dataframe object?
Michael
  • 347
  • 5
  • 13
7
votes
1 answer

Does Dask support functions with multiple outputs in Custom Graphs?

The Custom Graphs API of Dask seems to support only functions returning one output key/value. For example, the following dependency could not be easily represented as a Dask graph: B -> D / \ A- -> F \ / C -> E This…
Petr Wolf
  • 127
  • 7
7
votes
1 answer

How do I actually get dask to compute a list of delayed or dask-container-based results?

I have a trivially parallelizable task of computing results independently for many tables split across many files. I can construct delayed or dask.dataframe lists (and have also tried with, e.g. a dict), and I cannot get all of the results to…
Dav Clark
  • 1,430
  • 1
  • 13
  • 26
7
votes
2 answers

How do Dask dataframes handle larger-than-memory datasets?

The documentation of the Dask package for dataframes says: Dask dataframes look and feel like pandas dataframes, but operate on datasets larger than memory using multiple threads. But later in the same page: One dask DataFrame is comprised of…
dukebody
  • 7,025
  • 3
  • 36
  • 61
7
votes
1 answer

Correct choice of chunks-specification for dask array

According to the dask documentation it's possible to specify the chunks in one of three ways: a blocksize like 1000; a blockshape like (1000, 1000); or explicit sizes of all blocks along all dimensions, like ((1000, 1000, 500), (400, 400)). Your chunks…
istern
  • 363
  • 1
  • 4
  • 13
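To make the three chunk forms from the excerpt above concrete, here is a small standard-library sketch that expands each form into explicit block sizes per dimension. The helper `expand_chunks` is hypothetical and is not a Dask function, and the array shape (2500, 800) is an assumption chosen so that the explicit example from the excerpt adds up.

```python
def expand_chunks(chunks, shape):
    """Return explicit block sizes per dimension for an array of `shape`."""
    if isinstance(chunks, int):
        # A single block size applies to every dimension.
        chunks = (chunks,) * len(shape)
    if len(chunks) != len(shape):
        raise ValueError("chunks and shape must have the same number of dimensions")
    expanded = []
    for spec, dim in zip(chunks, shape):
        if isinstance(spec, int):
            # Split the dimension into full blocks plus one shorter remainder.
            full, rest = divmod(dim, spec)
            sizes = (spec,) * full + ((rest,) if rest else ())
        else:
            sizes = tuple(spec)
            if sum(sizes) != dim:
                raise ValueError("explicit block sizes must sum to the dimension length")
        expanded.append(sizes)
    return tuple(expanded)


shape = (2500, 800)  # assumed shape, for illustration only
by_size = expand_chunks(1000, shape)
by_shape = expand_chunks((1000, 400), shape)
explicit = expand_chunks(((1000, 1000, 500), (400, 400)), shape)

print(by_size)
print(by_shape)
print(explicit)
```

Under these assumptions the blockshape and the explicit form describe the same blocks, while the single blocksize leaves the shorter dimension as one block.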
7
votes
1 answer

creating dask dataframe by reading a pickle file in dask module of Python

When I try to create a dask dataframe by reading a pickle file, I get an error: import dask.dataframe as dd ds_df = dd.read_pickle("D:\test.pickle") AttributeError: 'module' object has no attribute 'read_pickle' but it works fine with…
Satya
  • 5,470
  • 17
  • 47
  • 72
7
votes
1 answer

Choosing a framework for larger than memory data analysis with python

I'm solving a problem with a dataset that is larger than memory. The original dataset is a .csv file. One of the columns is for track IDs from the musicbrainz service. What I already did: I read the .csv file with dask and converted it to castra…
Nagasaki45
  • 2,634
  • 1
  • 22
  • 27
7
votes
2 answers

Can't drop columns or slice dataframe using dask?

I am trying to use dask instead of pandas since I have a 2.6 GB csv file. I load it and I want to drop a column, but it seems that neither the drop method df.drop('column') nor slicing df[ : , :-1] is implemented yet. Is this the case or am I just…
chrisfs
  • 6,182
  • 6
  • 29
  • 35
6
votes
1 answer

Computing a norm in a loop slows down the computation with Dask

I was trying to implement a conjugate gradient algorithm using Dask (for didactic purposes) when I realized that the performance was far worse than that of a simple numpy implementation. After a few experiments, I have been able to reduce the problem to…
SteP
  • 262
  • 1
  • 2
  • 9
6
votes
3 answers

Dask dataframe: Can `set_index` put a single index into multiple partitions?

Empirically it seems that whenever you set_index on a Dask dataframe, Dask will always put rows with equal indexes into a single partition, even if it results in wildly imbalanced partitions. Here is a demonstration: import pandas as pd import…
Dahn
  • 1,397
  • 1
  • 10
  • 29
6
votes
1 answer

Dask map method in function with multiple arguments

I want to apply the Client.map method to a function that uses multiple arguments, as the Pool.starmap method of multiprocessing does. Here is an example: from contextlib import contextmanager from dask.distributed import Client @contextmanager def…
Andrex
  • 602
  • 1
  • 7
  • 22
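One common way to get starmap-like behaviour from a map that takes one iterable per parameter is to transpose the argument tuples with zip. The sketch below uses the standard library's ThreadPoolExecutor as a stand-in so that it runs without a cluster; it assumes that Client.map accepts several iterables in the same way Executor.map does. The function `power` and its inputs are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def power(base, exponent):
    """Hypothetical two-argument function to be mapped."""
    return base ** exponent


# Argument tuples in the form Pool.starmap expects.
arg_tuples = [(2, 3), (3, 2), (10, 0)]

# zip(*...) transposes the tuples into one iterable per parameter.
bases, exponents = zip(*arg_tuples)

with ThreadPoolExecutor(max_workers=2) as pool:
    # Executor.map pairs up the iterables element by element.
    results = list(pool.map(power, bases, exponents))

print(results)
```

With a Dask client the mapped call would return futures rather than values, so the results would still need to be collected afterwards.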
6
votes
2 answers

Reload Dask worker containers automatically on code change

I have the Dask code below that submits N workers, where each worker is implemented in a Docker container: default_sums = client.map(process_asset_defaults, build_worker_args(req, numWorkers)) future_total_sum = client.submit(sum,…
ps0604
  • 1,227
  • 23
  • 133
  • 330
6
votes
1 answer

ALS algorithm in Dask optimization

I am trying to implement the ALS algorithm in Dask, but I am having trouble figuring out how to compute latent features in one step. I followed the formulas on this Stack Overflow thread and came up with this code: Items =…
user4952634
6
votes
3 answers

Deploying a cluster of containers in Azure

I have a Docker application that works fine on my laptop on Windows using compose and starting multiple instances of a container as a Dask cluster. The name of the service is "worker" and I start two container instances like so: docker compose up…
ps0604
  • 1,227
  • 23
  • 133
  • 330
6
votes
1 answer

Read group of rows from Parquet file in Python Pandas / Dask?

I have a Pandas dataframe that looks similar to this: datetime data1 data2 2021-01-23 00:00:31.140 a1 a2 2021-01-23 00:00:31.140 b1 b2 2021-01-23 00:00:31.140 c1 c2 2021-01-23 00:01:29.021 d1 …
Mike
  • 155
  • 2
  • 8