Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
6 votes • 2 answers

Shifting all rows in dask dataframe

In Pandas, there is a method DataFrame.shift(n) which shifts the contents of an array by n rows, relative to the index, similarly to np.roll(a, n). I can't seem to find a way to get a similar behaviour working with Dask. I realise things like…
TroyHurts • 337
5 votes • 1 answer

Apply a function over the columns of a Dask array

What is the most efficient way to apply a function to each column of a Dask array? As documented below, I've tried a number of things but I still suspect that my use of Dask is rather amateurish. I have a quite wide and quite long array, in the…
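One option (a sketch, shown on a tiny array) is `dask.array.apply_along_axis`, which rechunks so each column slice sits in a single block and then applies the NumPy function per column:

```python
import numpy as np
import dask.array as da

x = da.from_array(np.arange(12).reshape(3, 4), chunks=(3, 2))

# axis=0 means the function sees one column at a time; np.ptp
# (max - min) is applied to each of the four columns here.
col_range = da.apply_along_axis(np.ptp, 0, x).compute()
```

For very wide arrays, rechunking to one-column blocks can be expensive; a blockwise reduction via `map_blocks` may scale better when the function decomposes over row chunks.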
5 votes • 1 answer

Implement Equal-Width Intervals feature engineering in Dask

In equal-width discretization, the variable values are assigned to intervals of the same width. The number of intervals is user-defined and the width is determined by the minimum/maximum values and the number of intervals. For example, given the…
ps0604 • 1,227
5 votes • 2 answers

Why is Dask not respecting the memory limits for LocalCluster?

I'm running the code pasted below in a machine with 16GB of RAM (purposely). import dask.array as da import dask.delayed from sklearn.datasets import make_blobs import numpy as np from dask_ml.cluster import KMeans from dask.distributed import…
jcfaracco • 853
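Worth noting when debugging this: `memory_limit` applies per worker, not per cluster, and it is enforced by the nanny process, so purely threaded workers are not policed. A sketch (worker counts and limits are illustrative):

```python
from dask.distributed import Client, LocalCluster

# memory_limit is per *worker*: 4 workers with memory_limit="4GB" may
# together use up to 16 GB. Enforcement is done by the nanny, so it
# only bites when processes=True (the default); the in-process
# threaded workers below are not policed at all.
cluster = LocalCluster(n_workers=2, threads_per_worker=1,
                       processes=False, memory_limit="1GB")
client = Client(cluster)

n_workers = len(client.scheduler_info()["workers"])

client.close()
cluster.close()
```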
5 votes • 1 answer

Dask Gateway, set worker resources

I am trying to set the resources for workers as per the docs here, but on a set up that uses Dask Gateway. Specifically, I'd like to be able to follow the answer to this question, but using Dask Gateway. I haven't been able to find a reference to…
bill_e • 930
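One mechanism that needs no scheduler changes is Dask's configuration system: the environment variable `DASK_DISTRIBUTED__WORKER__RESOURCES__<NAME>` maps onto the config key `distributed.worker.resources.<name>`, so a Gateway deployment that lets you set worker environment variables (whether yours does depends on the cluster options the server exposes) can tag workers this way. A local sketch of just the mapping:

```python
import os
import dask

# The double underscore separates nesting levels; the key is
# lower-cased and the value parsed as a Python literal.
os.environ["DASK_DISTRIBUTED__WORKER__RESOURCES__FOO"] = "2"
dask.config.refresh()  # re-collect config, including env vars

resources = dask.config.get("distributed.worker.resources")
```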
5 votes • 1 answer

Can I read parquet from HTTP(s) octet-stream?

Some backend-endpoint returns parquet-file in octet-stream. In pandas I can do something like this: result = requests.get("https://..../file.parquet") df = pd.read_parquet(io.BytesIO(result.content)) Can I do it in Dask somehow? This…
bc30138 • 93
5 votes • 0 answers

Storing Dask Array using Zarr Consumes Too Much Memory

I have a long list of .zarr arrays, that I would like to merge into a single array and write to disk. My code approximately looks as follows: import dask.array import zarr import os local_paths = ['parts/X_00000000.zarr', 'parts/X_00000001.zarr', …
r0f1 • 2,717
5 votes • 1 answer

Dask computations slow down with time

I'm having the following issue with Dask. I noticed that the same computations take longer and longer as time passes. After I restart scheduler, the computations are fast again, and just keep slowing down. The figure below shows the time consumed by…
rafgonsi • 83
5 votes • 0 answers

Dask distributed KeyError

I am trying to learn Dask using a small example. Basically I read in a file and calculate row means. from dask_jobqueue import SLURMCluster cluster = SLURMCluster(cores=4, memory='24 GB') cluster.scale(4) from dask.distributed import Client client…
5 votes • 1 answer

Why does running Sklearn machine learning with Dask not result in parallelism?

I want to perform Machine Learning algorithms from Sklearn library on all my cores using Dask and joblib libraries. My code for the joblib.parallel_backend with Dask: #Fire up the Joblib backend with Dask: with joblib.parallel_backend('dask'): …
Jakub Szlaur • 1,852
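A detail that often explains this: the Dask joblib backend only distributes work that the estimator itself hands to joblib, so the estimator must parallelise internally and `n_jobs` must allow it; a sequential estimator stays sequential under any backend. A sketch with an in-process cluster and toy data (both assumed for illustration):

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

client = Client(processes=False)  # importing distributed registers the backend
X, y = make_classification(n_samples=200, random_state=0)

# RandomForest fits its trees through joblib, so n_jobs=-1 gives the
# dask backend something to farm out; an estimator without internal
# joblib calls would ignore the backend entirely.
clf = RandomForestClassifier(n_estimators=20, n_jobs=-1, random_state=0)
with joblib.parallel_backend("dask"):
    clf.fit(X, y)

client.close()
```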
5 votes • 1 answer

dask read parquet and specify schema

Is there a dask equivalent of spark's ability to specify a schema when reading in a parquet file? Possibly using kwargs passed to pyarrow? I have a bunch of parquet files in a bucket but some of the fields have slightly inconsistent names. I could…
Ray Bell • 1,508
5 votes • 1 answer

Compute co-occurrences in pandas dataframe for column values grouped by another column's values

Question I am using Pandas on Python 3.7.7. I would like to compute the mutual information between categorical values of a variable x grouped by another variable's values y. My data looks like the following table: +-----+-----+ | x | y …
Davide • 103
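In plain pandas this reduces to a contingency table: `pd.crosstab` normalised over all cells gives the joint distribution p(x, y), from which mutual information follows directly. A sketch with toy data (chosen with no zero cells, so the log is safe):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": ["a", "b", "a", "a", "b", "b"],
                   "y": [1, 1, 1, 2, 2, 2]})

# Joint distribution p(x, y) from co-occurrence counts.
joint = pd.crosstab(df["x"], df["y"], normalize="all")
px = joint.sum(axis=1)  # marginal p(x)
py = joint.sum(axis=0)  # marginal p(y)

# MI = sum_xy p(x,y) * log( p(x,y) / (p(x) * p(y)) )
mi = (joint * np.log(joint / np.outer(px, py))).sum().sum()
```

With zero cells, the convention 0·log 0 = 0 applies; mask or add a replace-NaN step before summing.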
5 votes • 0 answers

Why is multiprocessing slower than single-core? Would using joblib or dask make a difference?

The issue I am trying to optimise some calculations which lend themselves to so-called embarrassingly parallel calculations, but I am finding that using python's multiprocessing package actually slows things down. My question is: am I doing…
Pythonista anonymous • 8,140
5 votes • 1 answer

dask.delayed memory management when a single task can consume a lot of memory outside of python

I have some calculations calling the pardiso() solver from python. The solver allocates its own memory in a way that is opaque to python, but the pointers used to access that memory are stored in python. If I were to try and run these calculations…
cbf123 • 51
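One common pattern for this situation is an abstract worker resource (the name "MEMORY" below is arbitrary): tasks annotated as consuming it can never over-subscribe a worker, which throttles tasks that allocate large buffers outside Python's heap, as an external solver does. A sketch with a trivial stand-in task:

```python
import dask
from dask.distributed import Client, LocalCluster

# The worker advertises one unit of "MEMORY"; each task below claims
# the whole unit, so tasks run one at a time even though the worker
# has four threads available.
cluster = LocalCluster(n_workers=1, threads_per_worker=4,
                       processes=False, resources={"MEMORY": 1})
client = Client(cluster)

@dask.delayed
def solve(i):
    return i * 2  # stand-in for the external solver call

tasks = [solve(i) for i in range(4)]
futures = client.compute(tasks, resources={"MEMORY": 1})
results = client.gather(futures)

client.close()
cluster.close()
```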
5 votes • 1 answer

Jupyter Lab - dask-labextension not working

Jupyter Lab dask-labextension not working. Installed both from: A.) JupyterLab sidebar / Extension manager / search for it / click install; B.) Command line, following the Anaconda installation guide: conda install jupyterlab nodejs conda install -c…
sogu • 2,738
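For context, the extension's packaging changed across JupyterLab versions, which is behind many of these failures; a sketch of the two install paths (conda channel as in the question):

```shell
# JupyterLab 3+: the frontend extension ships inside the package,
# so installing it is the whole job -- no separate build step.
pip install dask-labextension
# or: conda install -c conda-forge dask-labextension

# JupyterLab 2.x needed nodejs plus an explicit extension build:
#   conda install jupyterlab nodejs
#   jupyter labextension install dask-labextension
```

Mixing the two paths (a sidebar install plus a command-line build) on mismatched JupyterLab versions is a common way to end up with a broken extension list.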