Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
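A minimal sketch of the two components working together, assuming a standard Dask install: a bag (one of the parallel collections) only builds a task graph, and the dynamic task scheduler executes that graph when `.compute()` is called.

```python
import dask.bag as db

# A small "big data" collection: a bag split into two partitions.
b = db.from_sequence(range(10), npartitions=2)

# map/sum only build a task graph; nothing runs yet.
total = b.map(lambda x: x * 2).sum()

# The dynamic task scheduler executes the graph on compute().
print(total.compute())  # prints 90
```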

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
2 votes, 1 answer

How to discretize large dataframe by columns with variable bins in Pandas/Dask

I am able to discretize a Pandas dataframe by columns with this code: import numpy as np import pandas as pd def discretize(X, n_scale=1): for c in X.columns: loc = X[c].median() # median absolute deviation of the column …
gc5
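The truncated snippet above bins each column around its median using the median absolute deviation (MAD). The helper below is a hedged reconstruction of that approach with pandas, not the asker's exact code; the three-bin layout is an assumption.

```python
import numpy as np
import pandas as pd

def discretize(X, n_scale=1):
    # Hypothetical reconstruction: for each column, bin values relative
    # to the median, using the MAD as the bin width on either side.
    # Assumes a nonzero MAD per column (pd.cut rejects duplicate edges).
    out = X.copy()
    for c in X.columns:
        loc = X[c].median()
        scale = (X[c] - loc).abs().median()  # median absolute deviation
        bins = [-np.inf, loc - n_scale * scale, loc + n_scale * scale, np.inf]
        out[c] = pd.cut(X[c], bins=bins, labels=False)
    return out
```

With dask, the same per-column logic can be applied via `map_partitions` when the bin edges are computed up front.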
2 votes, 1 answer

Dask worker persistent variables

Is there a way with dask to have a variable that can be retrieved from one task to another? I mean a variable that I could keep in the worker and then retrieve in the same worker when I execute another task.
Bertrand
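One common pattern (a sketch, not the only option) is to stash state on the worker object itself via `distributed.get_worker()`. The attribute name below is made up, and a later task only sees the value if it lands on the same worker.

```python
from distributed import Client, get_worker

def set_state(value):
    # Stash the value on the worker object; it persists for the
    # lifetime of that worker. "_task_state" is a made-up name.
    get_worker()._task_state = value

def get_state():
    # A later task running on the same worker can read it back.
    return getattr(get_worker(), "_task_state", None)

# One in-process worker, so both tasks run on the same worker.
with Client(processes=False, n_workers=1) as client:
    client.submit(set_state, 42).result()
    print(client.submit(get_state).result())  # prints 42
```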
2 votes, 1 answer

dask distributed memory error

I got the following error on the scheduler while running Dask on a distributed job: distributed.core - ERROR - Traceback (most recent call last): File "/usr/local/lib/python3.4/dist-packages/distributed/core.py", line 269, in write frames =…
JRR
2 votes, 1 answer

Result of .join in dask dataframes seems to depend on the way the dask dataframe was generated

I got unexpected results when applying join to dask dataframes which were generated by the from_delayed method. I want to demonstrate this with the following example, which consists of three parts. Generate a dask dataframe via from_delayed…
Arco Bast
2 votes, 1 answer

When using Bag.to_textfiles with dask, I get the error "AttributeError: 'dict' object has no attribute 'endswith'"

Title says most of it but the object in question is: >>> import dask.bag as db >>> b = db.from_sequence([{'name': 'Alice', 'balance': 100}, ... {'name': 'Bob', 'balance': 200}, ... {'name':…
JMann
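`Bag.to_textfiles` writes lines of text, so a bag of dicts has to be serialized to strings first; mapping `json.dumps` over the bag before writing is the usual fix. A sketch (the output path is illustrative):

```python
import json
import dask.bag as db

b = db.from_sequence([{'name': 'Alice', 'balance': 100},
                      {'name': 'Bob', 'balance': 200}])

# Serialize each record to a JSON string before writing text files.
lines = b.map(json.dumps)
# lines.to_textfiles('accounts.*.json')  # one file per partition
print(lines.compute())
```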
2 votes, 1 answer

How do you use dask + distributed for NFS files?

Working from Matthew Rocklin's post on distributed data frames with Dask, I'm trying to distribute some summary statistics calculations across my cluster. Setting up the cluster with dcluster ... works fine. Inside a notebook, import dask.dataframe…
DGrady
2 votes, 2 answers

Array operations on dask arrays

I have two dask arrays, a and b. I get the dot product of a and b as below: >>> z2 = da.from_array(a.dot(b), chunks=1) >>> z2 dask.array But when I do sigmoid(z2) the shell stops working. I…
Kavan
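A sigmoid written with plain element-wise operators and `da.exp` stays lazy on dask arrays, so nothing is materialized until `.compute()`. Also note that `chunks=1` in the excerpt creates one task per element, which can overwhelm the scheduler on its own. A minimal sketch:

```python
import numpy as np
import dask.array as da

def sigmoid(z):
    # Element-wise ops and da.exp build a lazy dask graph.
    return 1.0 / (1.0 + da.exp(-z))

x = da.from_array(np.array([-2.0, 0.0, 2.0]), chunks=2)
print(sigmoid(x).compute())
```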
2 votes, 1 answer

When are generators converted to lists in Dask?

In Dask, when do generators get converted to lists, or are they generally consumed lazily? For example, with the code: from collections import Counter import numpy as np import dask.bag as db def foo(n): for _ in range(n): yield…
AJ Friend
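Nothing is consumed at graph-construction time; in the dask.bag versions I've seen, iterator results are reified into lists during execution. A sketch using the synchronous scheduler so the side effect is observable in-process (with other schedulers the list lives in another thread or process):

```python
import dask.bag as db

consumed = []

def foo(n):
    # A generator with an observable side effect per yielded item.
    for i in range(n):
        consumed.append(i)
        yield i

b = db.from_sequence([3, 2], npartitions=1).map(foo).flatten()
assert consumed == []  # graph built, generators not consumed yet

result = b.compute(scheduler='synchronous')
print(result)  # prints [0, 1, 2, 0, 1]
```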
2 votes, 2 answers

Combination of parallel processing and dask arrays to process multiple image stacks

I have a directory containing n h5 files, each of which has m image stacks to filter. For each image, I will run the filtering (Gaussian and Laplacian) using dask parallel arrays in order to speed up the processing (ref. to Dask). I will use the dask…
s1mc0d3
2 votes, 1 answer

Collecting attributes from dask dataframe providers

TL;DR: How can I collect metadata (errors during parsing) from distributed reads into a dask dataframe collection? I currently have a proprietary file format I'm using to feed into dask.DataFrame. I have a function that accepts a file path and…
NirIzr
2 votes, 2 answers

Python: Why does copying a Dask slice to a Numpy array result in a row count mismatch

I am getting an error while copying a slice of a dask array to an nparray; the number of rows doesn't match. store = h5py.File(s_file_path + '.hdf5', 'r') dset = store['data_matrix'] data_matrix = da.from_array(dset, chunks=dset.chunks) test_set =…
user1946989
2 votes, 3 answers

How can one use dask.dataframe with custom dsk graphs

I'll try to rephrase my question: how do I combine a dask.dataframe with a function like zip? Assume we have a file named "accounts.0.csv" with the following data: id,names,amount 352,Dan,4837 387,Tim,208 42,Jerry,21 129,Patricia,284 I wrote…
sami
2 votes, 0 answers

assign() to variable column name in dask DataFrames

I have code that works in pandas, but I'm having trouble converting it to use dask. There is a partial solution here, but it does not allow me to use a variable as the name of the column I am assigning to. Here's the working pandas…
kaz
2 votes, 1 answer

Individual dask array boundaries with ghosting

I'm playing with Dask trying to set up a few simple PDE solves using finite differences and I'm wondering if there's a way to specify boundary conditions per-boundary. Docs here The current ghost.ghost function allows specifying a few different…
Gil Forsyth
1 vote, 3 answers

How to read and store vector (List[float]) in Dask DataFrame?

I am trying to have a "vector" column in a Dask DataFrame, built from a large np.array of vectors (at this point it is a 500k * 1536 array). With a Pandas DataFrame the code would look something like this: import pandas as pd import numpy as np vectors =…
Mike Chaliy