Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
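A minimal sketch of the two components working together, assuming a standard Dask install: a bag (one of the parallel collections) only builds a task graph, and the dynamic task scheduler executes that graph when `.compute()` is called.

```python
import dask.bag as db

# A small "big data" collection: a bag split into two partitions.
b = db.from_sequence(range(10), npartitions=2)

# map/sum only build a task graph; nothing runs yet.
total = b.map(lambda x: x * 2).sum()

# The dynamic task scheduler executes the graph on compute().
print(total.compute())  # prints 90
```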

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
2 votes, 1 answer

How to discretize large dataframe by columns with variable bins in Pandas/Dask

I am able to discretize a Pandas dataframe by columns with this code: import numpy as np import pandas as pd def discretize(X, n_scale=1): for c in X.columns: loc = X[c].median() # median absolute deviation of the column …
gc5
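The truncated snippet above bins each column around its median using the median absolute deviation (MAD). The helper below is a hedged reconstruction of that approach with pandas, not the asker's exact code; the three-bin layout is an assumption.

```python
import numpy as np
import pandas as pd

def discretize(X, n_scale=1):
    # Hypothetical reconstruction: for each column, bin values relative
    # to the median, using the MAD as the bin width on either side.
    # Assumes a nonzero MAD per column (pd.cut rejects duplicate edges).
    out = X.copy()
    for c in X.columns:
        loc = X[c].median()
        scale = (X[c] - loc).abs().median()  # median absolute deviation
        bins = [-np.inf, loc - n_scale * scale, loc + n_scale * scale, np.inf]
        out[c] = pd.cut(X[c], bins=bins, labels=False)
    return out
```

With dask, the same per-column logic can be applied via `map_partitions` when the bin edges are computed up front.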
2 votes, 1 answer

Dask worker persistent variables

Is there a way with dask to have a variable that can be retrieved from one task to another? I mean a variable that I could keep in the worker and then retrieve in the same worker when I execute another task.
Bertrand
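One common pattern (a sketch, not the only option) is to stash state on the worker object itself via `distributed.get_worker()`. The attribute name below is made up, and a later task only sees the value if it lands on the same worker.

```python
from distributed import Client, get_worker

def set_state(value):
    # Stash the value on the worker object; it persists for the
    # lifetime of that worker. "_task_state" is a made-up name.
    get_worker()._task_state = value

def get_state():
    # A later task running on the same worker can read it back.
    return getattr(get_worker(), "_task_state", None)

# One in-process worker, so both tasks run on the same worker.
with Client(processes=False, n_workers=1) as client:
    client.submit(set_state, 42).result()
    print(client.submit(get_state).result())  # prints 42
```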
2 votes, 1 answer

dask distributed memory error

I got the following error on the scheduler while running Dask on a distributed job: distributed.core - ERROR - Traceback (most recent call last): File "/usr/local/lib/python3.4/dist-packages/distributed/core.py", line 269, in write frames =…
JRR
2 votes, 1 answer

Result of .join in dask dataframes seems to depend on the way the dask dataframe was generated

I got unexpected results when applying join to dask dataframes which were generated by the from_delayed method. I want to demonstrate this with the following example, which consists of three parts. Generate a dask dataframe via from_delayed…
Arco Bast
2 votes, 1 answer

When using Bag.to_textfiles with dask, I get the error "AttributeError: 'dict' object has no attribute 'endswith'"

Title says most of it but the object in question is: >>> import dask.bag as db >>> b = db.from_sequence([{'name': 'Alice', 'balance': 100}, ... {'name': 'Bob', 'balance': 200}, ... {'name':…
JMann
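`Bag.to_textfiles` writes lines of text, so a bag of dicts has to be serialized to strings first; mapping `json.dumps` over the bag before writing is the usual fix. A sketch (the output path is illustrative):

```python
import json
import dask.bag as db

b = db.from_sequence([{'name': 'Alice', 'balance': 100},
                      {'name': 'Bob', 'balance': 200}])

# Serialize each record to a JSON string before writing text files.
lines = b.map(json.dumps)
# lines.to_textfiles('accounts.*.json')  # one file per partition
print(lines.compute())
```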
2 votes, 1 answer

How do you use dask + distributed for NFS files?

Working from Matthew Rocklin's post on distributed data frames with Dask, I'm trying to distribute some summary statistics calculations across my cluster. Setting up the cluster with dcluster ... works fine. Inside a notebook, import dask.dataframe…
DGrady
2 votes, 2 answers

Array operations on dask arrays

I have two dask arrays, a and b. I get the dot product of a and b as below: >>> z2 = da.from_array(a.dot(b), chunks=1) >>> z2 dask.array But when I do sigmoid(z2) the shell stops working. I…
Kavan
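A sigmoid written with plain element-wise operators and `da.exp` stays lazy on dask arrays, so nothing is materialized until `.compute()`. Also note that `chunks=1` in the excerpt creates one task per element, which can overwhelm the scheduler on its own. A minimal sketch:

```python
import numpy as np
import dask.array as da

def sigmoid(z):
    # Element-wise ops and da.exp build a lazy dask graph.
    return 1.0 / (1.0 + da.exp(-z))

x = da.from_array(np.array([-2.0, 0.0, 2.0]), chunks=2)
print(sigmoid(x).compute())
```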
2 votes, 1 answer

When are generators converted to lists in Dask?

In Dask, when do generators get converted to lists, or are they generally consumed lazily? For example, with the code: from collections import Counter import numpy as np import dask.bag as db def foo(n): for _ in range(n): yield…
AJ Friend
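Nothing is consumed at graph-construction time; in the dask.bag versions I've seen, iterator results are reified into lists during execution. A sketch using the synchronous scheduler so the side effect is observable in-process (with other schedulers the list lives in another thread or process):

```python
import dask.bag as db

consumed = []

def foo(n):
    # A generator with an observable side effect per yielded item.
    for i in range(n):
        consumed.append(i)
        yield i

b = db.from_sequence([3, 2], npartitions=1).map(foo).flatten()
assert consumed == []  # graph built, generators not consumed yet

result = b.compute(scheduler='synchronous')
print(result)  # prints [0, 1, 2, 0, 1]
```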
2 votes, 2 answers

Combination of parallel processing and dask arrays to process multiple image stacks

I have a directory containing n h5 files, each of which has m image stacks to filter. For each image, I will run the filtering (Gaussian and Laplacian) using dask parallel arrays in order to speed up the processing (ref. to Dask). I will use the dask…
s1mc0d3
2 votes, 1 answer

Collecting attributes from dask dataframe providers

TL;DR: How can I collect metadata (errors during parsing) from distributed reads into a dask dataframe collection? I currently have a proprietary file format I'm using to feed into dask.DataFrame. I have a function that accepts a file path and…
NirIzr
2 votes, 2 answers

Python: Why does copying a Dask slice to a Numpy array result in a row count mismatch

I am getting an error while copying a slice of a dask array to an nparray; the number of rows doesn't match. store = h5py.File(s_file_path + '.hdf5', 'r') dset = store['data_matrix'] data_matrix = da.from_array(dset, chunks=dset.chunks) test_set =…
user1946989
2 votes, 3 answers

How can one use dask.dataframe with custom dsk graphs

I'll try to rephrase my question: how do I combine a dask.dataframe with a function like zip? Assume we have a file named "accounts.0.csv" with the following data: id,names,amount 352,Dan,4837 387,Tim,208 42,Jerry,21 129,Patricia,284 I wrote…
sami
2 votes, 0 answers

assign() to variable column name in dask DataFrames

I have code that works in pandas, but I'm having trouble converting it to use dask. There is a partial solution here, but it does not allow me to use a variable as the name of the column I am assigning to. Here's the working pandas…
kaz
2 votes, 1 answer

Individual dask array boundaries with ghosting

I'm playing with Dask trying to set up a few simple PDE solves using finite differences and I'm wondering if there's a way to specify boundary conditions per-boundary. Docs here The current ghost.ghost function allows specifying a few different…
Gil Forsyth
1 vote, 3 answers

How to read and store vector (List[float]) in Dask DataFrame?

I am trying to have a "vector" column in a Dask DataFrame, built from a large np.array of vectors (at this point it is a 500k * 1536 array). With a Pandas DataFrame the code would look something like this: import pandas as pd import numpy as np vectors =…
Mike Chaliy