Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

1. Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
2. “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
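
To make the first component more concrete, here is a minimal sketch of the task-graph idea in plain Python. It is not Dask's scheduler: `toy_get` is a hypothetical, single-threaded helper written only for illustration, and the key names are made up. The dict-of-tuples shape (a key mapped either to a value or to a tuple of a callable and its arguments) follows the general form described in Dask's graph specification.

```python
from operator import add, mul


def toy_get(graph, key, cache=None):
    """Evaluate `key` in a dict-based task graph (toy, single-threaded).

    Hypothetical helper for illustration only; a real scheduler would
    also run independent tasks in parallel and manage memory.
    """
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    value = graph[key]
    # A task is a tuple whose first element is callable.
    if isinstance(value, tuple) and value and callable(value[0]):
        func, *args = value
        # String arguments that name other keys are dependencies;
        # resolve them first, and pass everything else through as data.
        resolved = [
            toy_get(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        value = func(*resolved)
    cache[key] = value
    return value


graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),      # x + y
    "result": (mul, "sum", 10),  # (x + y) * 10
}

result = toy_get(graph, "result")
print(result)
```

The parallel collections in the second component can be thought of as code that builds much larger graphs of this kind (one task per chunk or partition) and hands them to a scheduler.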

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions

1 answer

dask: difference between client.persist and client.compute

I am confused about what the difference is between client.persist() and client.compute(). Both seem (in some cases) to start my calculations and both return asynchronous objects, however not in my simple example: In this example from…

python dask

asked Jan 23 '17 at 12:51

johnbaltis

3 answers

how to parallelize many (fuzzy) string comparisons using apply in Pandas?

I have the following problem: I have a dataframe master that contains sentences, such as master Out[8]: original 0 this is a nice sentence 1 this is another one 2 stackoverflow is nice For every row in Master, I lookup…

python pandas parallel-processing dask fuzzywuzzy

asked Jun 22 '16 at 22:17

ℕʘʘḆḽḘ

1 answer

Best practices in setting number of dask workers

I am a bit confused by the different terms used in dask and dask.distributed when setting up workers on a cluster. The terms I came across are: thread, process, processor, node, worker, scheduler. My question is how to set the number of each, and if…

dask dask-distributed

asked Jun 29 '18 at 10:28

kristofarkas

1 answer

how do we choose --nthreads and --nprocs per worker in dask distributed?

How do we choose --nthreads and --nprocs per worker in Dask distributed? I have 3 workers, with 4 cores and one thread per core on 2 workers and 8 cores on 1 worker (according to the output of lscpu Linux command on each worker).

distributed-computing dask dask-distributed

asked Mar 21 '18 at 12:59

Harish Rajula

1 answer

dask dataframe apply meta

I want to do a frequency count on a single column of a dask dataframe. The code works, but I get a warning complaining that meta is not defined. If I try to define meta I get an error AttributeError: 'DataFrame' object has no attribute 'name'.…

python pandas dask

asked Jun 08 '17 at 10:13

Matti Lyra

6 answers

How should I get the shape of a dask dataframe?

Performing .shape is giving me the following error. AttributeError: 'DataFrame' object has no attribute 'shape' How should I get the shape instead?

python dask

asked May 15 '18 at 16:57

user1559897

3 answers

Merge a large Dask dataframe with a small Pandas dataframe

Following the example here: YouTube: Dask-Pandas Dataframe Join, I am attempting to merge a ~70GB Dask data frame with a ~24MB one that I loaded as a Pandas dataframe. The merge is on two columns A and B, and I did not set any as indices: import…

python pandas dask

asked Sep 13 '16 at 12:38

dleal

3 answers

Airflow + celery or dask. For what, when?

I read in the official Airflow documentation the following: What does this mean exactly? What do the authors mean by scaling out? That is, when is it not enough to use Airflow or when would anyone use Airflow in combination with something like…

celery dask airflow

asked Mar 15 '18 at 22:17

Amelio Vazquez-Reina

6 answers

Default pip installation of Dask gives "ImportError: No module named toolz"

I installed Dask using pip like this: pip install dask and when I try to do import dask.dataframe as dd I get the following error message: >>> import dask.dataframe as dd Traceback (most recent call last): File "<stdin>", line 1, in …

python installation pip importerror dask

asked Jan 03 '17 at 22:38

TheDudeAbides

4 answers

Sampling n= 2000 from a Dask Dataframe of len 18000 generates error Cannot take a larger sample than population when 'replace=False'

I have a dask dataframe created from a csv file and len(daskdf) returns 18000, but when I run ddSample = daskdf.sample(2000) I get the error ValueError: Cannot take a larger sample than population when 'replace=False'. Can I sample without replacement if…

python dask

asked Aug 26 '16 at 23:33

mobcdi

1 answer

How to efficiently submit tasks with large arguments in Dask distributed?

I want to submit functions with Dask that have large (gigabyte scale) arguments. What is the best way to do this? I want to run this function many times with different (small) parameters. Example (bad) This uses the concurrent.futures interface. …

python dask

asked Jan 04 '17 at 18:54

MRocklin

2 answers

Apply function to grouped data frame in Dask: How do you specify the grouped Dataframe as argument in the function?

I have a dask dataframe grouped by the index (first_name). import pandas as pd import numpy as np from multiprocessing import cpu_count from dask import dataframe as dd from dask.multiprocessing import get from dask.distributed import…

python pandas dask

asked Mar 19 '18 at 06:24

nanounanue

4 answers

Convert string to dict, then access key:values??? How to access data in a for Python?

I am having issues accessing data inside a dictionary. Sys: Macbook 2012 Python: Python 3.5.1 :: Continuum Analytics, Inc. I am working with a dask.dataframe created from a csv. Edit Question How I got to this point Assume I start out with…

python pandas dictionary data-manipulation dask

asked Aug 26 '16 at 15:25

Linwoodc3

1 answer

What do KilledWorker exceptions mean in Dask?

My tasks are returning with KilledWorker exceptions when using Dask with the dask.distributed scheduler. What do these errors mean?

dask

asked Oct 11 '17 at 15:04

MRocklin

1 answer

Managing worker memory on a dask localcluster

I am trying to load a dataset with dask but when it is time to compute my dataset I keep getting problems like this: WARNING - Worker exceeded 95% memory budget. Restarting. I am just working on my local machine, initiating dask as follows: if…

python pandas dask

asked Dec 26 '18 at 19:20

Jones
