Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
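
As a rough illustration of the two components above, here is a minimal sketch, assuming Dask and NumPy are installed. The helper names `inc` and `add` are made up for this example; `dask.delayed` stands in for the task-scheduling side and `dask.array` for the collections side.

```python
import dask
import dask.array as da


# Component 1: dynamic task scheduling.
# dask.delayed wraps plain functions so that calls build a task graph
# instead of running immediately. (inc/add are hypothetical helpers.)
@dask.delayed
def inc(x):
    return x + 1


@dask.delayed
def add(x, y):
    return x + y


total = add(inc(1), inc(2))     # nothing has executed yet
total_value = total.compute()   # the scheduler runs the graph now

# Component 2: a parallel collection.
# A Dask array is split into chunks; operations on it become tasks
# that run on the same schedulers as above.
x = da.ones((1000, 1000), chunks=(250, 250))
array_sum = x.sum().compute()

print(total_value, array_sum)
```

The array here is small enough to fit in memory; the chunked layout is what would let the same code scale to larger inputs.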

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3 votes, 1 answer

Core dimension error when running numba ufunc on dask array

I'm trying to run custom numba vectorized/ufunc functions in a lazy dask pipeline. When I run the code below I get a ValueError: Core dimension 'm' consists of multiple chunks. I don't understand why m is considered a core dimension. Any idea how I…
Loïc Dutrieux
  • 381
  • 5
  • 16
3 votes, 1 answer

Efficiency in using pandas and parquet

People talk a lot about using parquet and pandas. And I am trying hard to understand if we can utilize the entire features of parquet files when used with pandas. For instance say I have a big parquet file (partitioned on year) with 30 columns…
Xion
  • 319
  • 2
  • 11
3 votes, 1 answer

Submit worker functions in dask distributed without waiting for the functions to end

I have this python code that uses the apscheduler library to submit processes, it works fine: from apscheduler.schedulers.background import BackgroundScheduler scheduler = BackgroundScheduler() array = [ 1, 3, 5, 7] for elem in array: …
ps0604
  • 1,227
  • 23
  • 133
  • 330
3 votes, 0 answers

How do I configure Dask distributed logging levels with an environment variable?

It feels like I should be able to read between the lines of https://docs.dask.org/en/latest/how-to/debug.html and https://docs.dask.org/en/latest/configuration.html to craft an environment variable name and value, but none…
Duncan McGregor
  • 17,665
  • 12
  • 64
  • 118
3 votes, 1 answer

Dask read CSV files recursively from directories

For the following directory structure Folder Sub-Folder1 File1.csv File2.csv File3.csv File4.csv Sub-Folder2 File1.csv File2.csv Sub-Folder3 File1.csv …
S_S
  • 1,276
  • 4
  • 24
  • 47
3 votes, 2 answers

visualize DASK task graphs

I am following this tutorial and created a graph like so: from dask.threaded import get from operator import add dsk = { 'x': 1, 'y': 2, 'z': (add, 'x', 'y'), 'w': (sum, ['x', 'y', 'z']) } get(dsk, "w") That works and I get the…
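
For reference, the graph quoted in the excerpt can be written out as a runnable snippet. This is only a sketch of the part that is visible above; the commented-out `dask.visualize` call is an assumption about how one might render the graph, and it needs the optional graphviz dependency.

```python
from operator import add

from dask.threaded import get

# A raw Dask graph: keys map either to values or to task tuples of the
# form (function, arg1, arg2, ...). String arguments that match a key
# are replaced by that key's result.
dsk = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),
    "w": (sum, ["x", "y", "z"]),
}

result = get(dsk, "w")   # runs the graph with the threaded scheduler
print(result)

# Rendering the graph (assumes graphviz is installed; not run here):
# import dask
# dask.visualize(dsk, filename="graph.png")
```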
3 votes, 1 answer

How to compute pandas dataframe of pairwise string-similarities in parallel using dask?

I have a list of strings, and I want to build a dataframe which gives the Jaro-Winkler normalized similarity between each pair of strings. There is a function in the package textdistance to compute it. Loosely, similar strings have a score close to…
hwong557
  • 1,309
  • 1
  • 10
  • 15
3 votes, 2 answers

Extracting latest values in a Dask dataframe with non-unique index column dates

I'm quite familiar with pandas dataframes but I'm very new to Dask so I'm still trying to wrap my head around parallelizing my code. I've obtained my desired results using pandas and pandarallel already so what I'm trying to figure out is if I can…
Kafkaesque
  • 37
  • 3
3 votes, 0 answers

What does set_index(col, compute=True) do in Dask?

This is now a Github issue What does the parameter compute in Dask dataframe's set index do? df.set_index(col, compute=True) The documentation says compute: bool, default False Whether or not to trigger an immediate computation. Defaults to…
Dahn
  • 1,397
  • 1
  • 10
  • 29
3 votes, 1 answer

Python: How to write large netcdf with xarray

I am loading in the following data using xr.mfdataset. There is 16GB of data, across many files. import xarray as xr from datetime import datetime from pathlib import Path from dask.diagnostics import ProgressBar def add_time_dim(xda: xr.Dataset)…
Tommy Lees
  • 1,293
  • 3
  • 14
  • 34
3 votes, 2 answers

Convert column of categoricals to additional columns

I have a large dataset in the form of the following dataframe that I previously loaded from avro files timestamp id category value 2021-01-01 00:00:00+00:00 a d g 2021-01-01 00:10:00+00:00 a d h 2021-01-01…
sobek
  • 1,386
  • 10
  • 28
3 votes, 1 answer

limit number of CPUs used by dask compute

Below code uses appx 1 sec to execute on an 8-CPU system. How to manually configure number of CPUs used by dask.compute eg to 4 CPUs so the below code will use appx 2 sec to execute even on an 8-CPU system? import dask from time import sleep def…
Russell Burdt
  • 2,391
  • 2
  • 19
  • 30
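
A minimal sketch of one way to cap parallelism for the question above, assuming the local threaded scheduler is in use: pass `num_workers` through `dask.compute`. The function `slow_double` and the sleep duration are hypothetical, since the question's own code is cut off.

```python
from time import perf_counter, sleep

import dask


@dask.delayed
def slow_double(x):
    # Hypothetical stand-in for the question's truncated function.
    sleep(0.2)
    return 2 * x


tasks = [slow_double(i) for i in range(8)]

start = perf_counter()
# num_workers caps the size of the thread pool used by the local scheduler,
# so at most 4 of the 8 tasks run at the same time.
results = dask.compute(*tasks, scheduler="threads", num_workers=4)
elapsed = perf_counter() - start

print(results, round(elapsed, 2))
```

With 8 tasks and 4 workers the sleeps run in roughly two rounds, so the wall time should be about twice that of a single task. The same keyword can also be set globally with `dask.config.set(num_workers=4)`.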
3 votes, 3 answers

Expand a list-like column in dask DF across several columns

This is similar to previous questions about how to expand a list-based column across several columns, but the solutions I'm seeing don't seem to work for Dask. Note, that the true DFs I'm working with are too large to hold in memory, so converting…
Drivebyluna
  • 344
  • 2
  • 14
3 votes, 1 answer

Parallelizing list filtering

I have a list of items that I need to filter based on some conditions. I'm wondering whether Dask could do this filtering in parallel, as the list is very long (a few dozen million records). Basically, what I need to do is this: items = [ …
Victor
  • 1,163
  • 4
  • 25
  • 45
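
A sketch of how such a filter could be expressed with `dask.bag`, under assumed data: the record layout and the predicate `keep` are hypothetical, because the question's list is elided. The threaded scheduler is used here only to keep the snippet simple to run; for pure-Python predicates, the process-based scheduler is usually what gives real parallel speedup.

```python
import dask.bag as db

# Hypothetical records standing in for the elided list in the question.
items = [{"id": i, "value": i * 3} for i in range(1000)]


def keep(record):
    # Hypothetical condition: keep records whose value is even.
    return record["value"] % 2 == 0


# Split the list into partitions; each partition is filtered as its own task.
bag = db.from_sequence(items, npartitions=4)
kept = bag.filter(keep).compute(scheduler="threads")

print(len(kept))
```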
3 votes, 0 answers

Using Dask throws ImportError when run inside SageMath

Recently, I have been trying to parallelize some Sage (Sage 9.4 on a MacBook Pro running OSX 11.2.3) code using Dask. The problem I run into is that while I can run Dask inside Sage, it will break whenever I include any code that isn't "pure…
Sam Ballas
  • 53
  • 5