Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
11
votes
1 answer

How do I change rows and columns in a dask dataframe?

There are a few issues I am having with Dask DataFrames. Let's say I have a dataframe with 2 columns ['a','b']. If I want a new column c = a + b, in pandas I would do: df['c'] = df['a'] + df['b']. In Dask I am doing the same operation as follows: df =…
Sam
  • 111
  • 1
  • 3
10
votes
2 answers

Only a column name can be used for the key in a dtype mappings argument

I've successfully brought in one table using dask read_sql_table from an Oracle database. However, when I try to bring in another table I get this error: KeyError: 'Only a column name can be used for the key in a dtype mappings argument.' I've…
Pete
  • 107
  • 1
  • 1
  • 6
10
votes
1 answer

TypeError: can't pickle _thread._local objects when using dask on pandas DataFrame

I have a huge DataFrame which I want to process using dask in order to save time. The problem is that I get stuck on this error as soon as it starts running: TypeError: can't pickle _thread._local objects. Can someone help me? I have written a…
FrancescoLS
  • 376
  • 1
  • 6
  • 17
10
votes
2 answers

dask-worker memory kept between tasks

Intro: I am parallelising some code using dask.distributed (embarrassingly parallel task). I have a list of Paths pointing to different images that I scatter to workers. Each worker loads an image (3D stack) and runs some filtering. 3D…
s1mc0d3
  • 523
  • 2
  • 15
10
votes
1 answer

How to check if dask dataframe is empty

Is there a dask equivalent of the pandas empty attribute? I want to check if a dask dataframe is empty, but df.empty returns AttributeError: 'DataFrame' object has no attribute 'empty'
user308827
  • 21,227
  • 87
  • 254
  • 417
10
votes
1 answer

How to use Dask Pivot_table?

I'm trying to use pivot_table in Dask with the following dataframe: date store_nbr item_nbr unit_sales year month 0 2013-01-01 25 103665 7.0 2013 1 1 2013-01-01 25 105574 1.0 2013 …
ambigus9
  • 1,417
  • 3
  • 19
  • 37
10
votes
1 answer

Dask delayed object of unspecified length not iterable error when combining dictionaries

I'm trying to construct a dictionary in parallel using dask, but I'm running into a TypeError: Delayed objects of unspecified length are not iterable. I'm trying to compute add, subtract, and multiply at the same time so the dictionary is…
blahblahblah
  • 2,299
  • 8
  • 45
  • 60
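The error quoted above typically appears when code tries to iterate over or unpack a single Delayed object. The asker's code is truncated, so this is only a sketch of one way to build such a dictionary: keep the Delayed values inside a plain dict and pass the dict to `dask.compute`, which traverses built-in containers. The function names and inputs are assumptions based on the operations named in the question.

```python
import dask


@dask.delayed
def add(x, y):
    return x + y


@dask.delayed
def subtract(x, y):
    return x - y


@dask.delayed
def multiply(x, y):
    return x * y


# A plain dict whose values are lazy Delayed objects.
lazy = {
    "add": add(5, 2),
    "subtract": subtract(5, 2),
    "multiply": multiply(5, 2),
}

# dask.compute returns a tuple with one entry per argument, so unpack
# the single dict; the three tasks are scheduled together.
(results,) = dask.compute(lazy)
```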
10
votes
4 answers

Remove empty partitions in Dask

When loading data from CSV, some CSVs cannot be loaded, resulting in an empty partition. I would like to remove all empty partitions, as some methods seem to not work well with empty partitions. I have tried to repartition, where (for example)…
morganics
  • 1,209
  • 13
  • 27
10
votes
2 answers

simple dask map_partitions example

I read the following SO thread and now am trying to understand it. Here is my example: import dask.dataframe as dd import pandas as pd from dask.multiprocessing import get import random df = pd.DataFrame({'col_1':random.sample(range(10000), 10000),…
user1700890
  • 7,144
  • 18
  • 87
  • 183
10
votes
3 answers

Can I set the index column when reading a CSV using Python dask?

When using Python Pandas to read a CSV it is possible to specify the index column. Is this possible using Python Dask when reading the file, as opposed to setting the index afterwards? For example, using pandas: df = pandas.read_csv(filename,…
Jaydog
  • 552
  • 2
  • 6
  • 22
10
votes
3 answers

Dask: nunique method on Dataframe groupBy

I would like to know if it is possible to have the number of unique items from a given column after a groupBy aggregation with Dask. I don't see anything like this in the documentation. It is available on pandas dataframe and really useful. I've…
Guillaume EB
  • 317
  • 2
  • 12
10
votes
1 answer

Dask read_csv fails where pandas doesn't

Trying to use dask's read_csv on a file that pandas's read_csv can read, like this: dd.read_csv('data/ecommerce-new.csv'), fails with the following error: pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 2 The file…
nikitautiu
  • 951
  • 1
  • 14
  • 28
10
votes
1 answer

ValueError: Not all divisions are known, can't align partitions error on dask dataframe

I have a pandas dataframe with the following columns: user_id, user_agent_id, requests. All columns contain integers. I want to perform some operations on them and run them using dask dataframe. This is what I do. user_profile =…
Apostolos
  • 7,763
  • 17
  • 80
  • 150
10
votes
2 answers

how to store worker-local variables in dask/distributed

Using dask 0.15.0, distributed 1.17.1. I want to memoize some things per worker, like a client to access Google Cloud Storage, because instantiating it is expensive. I'd rather store this in some kind of worker attribute. What is the canonical way…
10
votes
1 answer

How do I use an InfiniBand network with Dask?

I have a cluster with a high performance network (InfiniBand). However, when I set up my Dask scheduler and workers, performance doesn't seem to be as fast as I would expect. How can I tell Dask to use this network? Disclaimer: I'm just asking this…
MRocklin
  • 55,641
  • 23
  • 163
  • 235