Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
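As a rough illustration of the two components listed above, the sketch below first builds a small task graph with dask.delayed and then creates a chunked array with dask.array. The functions inc and add, the array shape, and the chunk sizes are made up for this example, and it assumes both dask and numpy are installed.

```python
import dask
import dask.array as da

# Component 1: task scheduling.
# dask.delayed wraps plain functions so that calling them records a task
# in a graph instead of running it immediately.
@dask.delayed
def inc(x):  # hypothetical example function
    return x + 1

@dask.delayed
def add(x, y):  # hypothetical example function
    return x + y

total = add(inc(1), inc(2))  # lazy: no work has been done yet
result = total.compute()     # the scheduler now runs the three tasks

# Component 2: a parallel collection.
# A dask array is split into chunks (here a 4 x 4 grid of 250 x 250 blocks);
# operations on it become tasks that run on the same schedulers.
x = da.ones((1000, 1000), chunks=(250, 250))
array_sum = x.sum().compute()

print(result, array_sum)
```

Both halves stay lazy until .compute() is called, which is why the same scheduler can serve custom task graphs and the collection interfaces alike.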

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
29
votes
1 answer

dask: difference between client.persist and client.compute

I am confused about the difference between client.persist() and client.compute(). Both seem (in some cases) to start my calculations and both return asynchronous objects, but not in my simple example: In this example from…
johnbaltis
  • 1,413
  • 4
  • 14
  • 26
29
votes
3 answers

how to parallelize many (fuzzy) string comparisons using apply in Pandas?

I have the following problem: I have a dataframe master that contains sentences, such as master Out[8]: original 0 this is a nice sentence 1 this is another one 2 stackoverflow is nice. For every row in Master, I lookup…
ℕʘʘḆḽḘ
  • 18,566
  • 34
  • 128
  • 235
28
votes
1 answer

Best practices in setting number of dask workers

I am a bit confused by the different terms used in dask and dask.distributed when setting up workers on a cluster. The terms I came across are: thread, process, processor, node, worker, scheduler. My question is how to set the number of each, and if…
kristofarkas
  • 395
  • 3
  • 9
28
votes
1 answer

how do we choose --nthreads and --nprocs per worker in dask distributed?

How do we choose --nthreads and --nprocs per worker in Dask distributed? I have 3 workers, with 4 cores and one thread per core on 2 workers and 8 cores on 1 worker (according to the output of lscpu Linux command on each worker).
Harish Rajula
  • 699
  • 6
  • 11
28
votes
1 answer

dask dataframe apply meta

I want to do a frequency count on a single column of a dask dataframe. The code works, but I get a warning complaining that meta is not defined. If I try to define meta I get an error AttributeError: 'DataFrame' object has no attribute 'name'.…
Matti Lyra
  • 12,828
  • 8
  • 49
  • 67
27
votes
6 answers

How should I get the shape of a dask dataframe?

Performing .shape is giving me the following error. AttributeError: 'DataFrame' object has no attribute 'shape' How should I get the shape instead?
user1559897
  • 1,454
  • 2
  • 14
  • 27
26
votes
3 answers

Merge a large Dask dataframe with a small Pandas dataframe

Following the example here: YouTube: Dask-Pandas Dataframe Join, I am attempting to merge a ~70GB Dask dataframe with a ~24MB one that I loaded as a Pandas dataframe. The merge is on two columns A and B, and I did not set any as indices: import…
dleal
  • 2,244
  • 6
  • 27
  • 49
24
votes
3 answers

Airflow + celery or dask. For what, when?

I read in the official Airflow documentation the following: What does this mean exactly? What do the authors mean by scaling out? That is, when is it not enough to use Airflow or when would anyone use Airflow in combination with something like…
Amelio Vazquez-Reina
  • 91,494
  • 132
  • 359
  • 564
24
votes
6 answers

Default pip installation of Dask gives "ImportError: No module named toolz"

I installed Dask using pip like this: pip install dask and when I try to do import dask.dataframe as dd I get the following error message: >>> import dask.dataframe as dd Traceback (most recent call last): File "<stdin>", line 1, in <module>
TheDudeAbides
  • 1,821
  • 1
  • 21
  • 29
23
votes
4 answers

Sampling n= 2000 from a Dask Dataframe of len 18000 generates error Cannot take a larger sample than population when 'replace=False'

I have a dask dataframe created from a csv file and len(daskdf) returns 18000, but when I run ddSample = daskdf.sample(2000) I get the error ValueError: Cannot take a larger sample than population when 'replace=False'. Can I sample without replacement if…
mobcdi
  • 1,532
  • 2
  • 28
  • 49
22
votes
1 answer

How to efficiently submit tasks with large arguments in Dask distributed?

I want to submit functions with Dask that have large (gigabyte scale) arguments. What is the best way to do this? I want to run this function many times with different (small) parameters. Example (bad) This uses the concurrent.futures interface. …
MRocklin
  • 55,641
  • 23
  • 163
  • 235
21
votes
2 answers

Apply function to grouped data frame in Dask: How do you specify the grouped Dataframe as argument in the function?

I have a dask dataframe grouped by the index (first_name). import pandas as pd import numpy as np from multiprocessing import cpu_count from dask import dataframe as dd from dask.multiprocessing import get from dask.distributed import…
nanounanue
  • 7,942
  • 7
  • 41
  • 73
21
votes
4 answers

Convert string to dict, then access key:values??? How to access data in a for Python?

I am having issues accessing data inside a dictionary. Sys: Macbook 2012 Python: Python 3.5.1 :: Continuum Analytics, Inc. I am working with a dask.dataframe created from a csv. Edit Question How I got to this point Assume I start out with…
Linwoodc3
  • 1,037
  • 2
  • 11
  • 14
20
votes
1 answer

What do KilledWorker exceptions mean in Dask?

My tasks are returning with KilledWorker exceptions when using Dask with the dask.distributed scheduler. What do these errors mean?
MRocklin
  • 55,641
  • 23
  • 163
  • 235
19
votes
1 answer

Managing worker memory on a dask localcluster

I am trying to load a dataset with dask but when it is time to compute my dataset I keep getting problems like this: WARNING - Worker exceeded 95% memory budget. Restarting. I am just working on my local machine, initiating dask as follows: if…
Jones
  • 333
  • 1
  • 2
  • 5