Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
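
A minimal sketch of both pieces together, using only public Dask APIs (the data here is a placeholder):

```python
import dask
import dask.dataframe as dd
import pandas as pd

# Collection side: a Dask DataFrame mirrors the pandas API but is split into
# partitions and evaluated lazily.
pdf = pd.DataFrame({"x": range(1_000)})
ddf = dd.from_pandas(pdf, npartitions=4)
print(ddf["x"].mean().compute())

# Scheduling side: dask.delayed turns plain functions into lazy tasks whose
# dependency graph the scheduler executes in parallel.
@dask.delayed
def inc(i):
    return i + 1

total = dask.delayed(sum)([inc(i) for i in range(5)])
print(total.compute())  # 15
```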

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
4 votes, 3 answers

Merging pandas Data frames uses way too much memory

I'm working on this Kaggle competition as the final project for the course I'm taking, and for that, I was trying to replicate this notebook but there is a function he uses to get the lagged features that is just using way too much memory for me.…
João Areias
  • 1,192
  • 11
  • 41
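
Not the notebook's own code, but a small sketch of the usual workaround, with made-up column names: keep the join lazy with dask.dataframe so the merge runs partition by partition instead of materialising everything at once.

```python
import pandas as pd
import dask.dataframe as dd

# Hypothetical stand-ins for the competition tables.
sales = pd.DataFrame({"item_id": [1, 2, 3, 1], "qty": [10, 20, 30, 5]})
items = pd.DataFrame({"item_id": [1, 2, 3], "category": ["a", "b", "a"]})

# Wrap both frames so the merge is planned lazily and executed per partition.
sales_dd = dd.from_pandas(sales, npartitions=4)
items_dd = dd.from_pandas(items, npartitions=1)

lagged_base = sales_dd.merge(items_dd, on="item_id", how="left")
print(lagged_base.compute())  # work (and memory use) only happens here
```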
4 votes, 1 answer

Truth of Delayed objects is not Supported

I'm using dask to delay computation of some functions that return series in my code-base. Most operations seem to behave as expected so far - apart from my use of np.average. The function I have returns a pd.Series which I then want to compute a…
freebie
  • 2,161
  • 2
  • 19
  • 36
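
A sketch of one way around the error, assuming the series comes back from a delayed function: defer the np.average call itself rather than handing NumPy a live Delayed object.

```python
import numpy as np
import pandas as pd
import dask

@dask.delayed
def load_series():
    # stand-in for the real function that returns a pd.Series
    return pd.Series([1.0, 2.0, 3.0])

s = load_series()

# np.average(s) would inspect the Delayed eagerly and can raise
# "Truth of Delayed objects is not supported"; wrapping the call keeps it lazy.
avg = dask.delayed(np.average)(s)
print(avg.compute())  # 2.0
```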
4 votes, 1 answer

Can't convert dataframe column data types

After processing a big data set using Pandas/Dask, I saved the resulting data frame to a csv file. When I try to read the output CSV using Dask, the data types are all objects by default. Whenever I try to convert them using conventional methods…
GRoutar
  • 1,311
  • 1
  • 15
  • 38
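
A minimal sketch, with invented file and column names: declaring dtypes (and date columns) at read time usually works better than converting object columns after the fact.

```python
import dask.dataframe as dd

df = dd.read_csv(
    "output.csv",                                    # hypothetical path
    dtype={"user_id": "int64", "score": "float64"},  # assumed columns
    parse_dates=["created_at"],                      # assumed datetime column
)

# Any remaining object columns can still be converted lazily afterwards.
df["score"] = df["score"].astype("float32")
print(df.dtypes)
```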
4 votes, 1 answer

Tornado unexpected exception in Future after timeout

I have set up a dask cluster. I can access the web dashboard, but when I try to connect to the scheduler: from dask.distributed import Client client = Client('192.168.0.10:8786') I get the following error: tornado.application - ERROR - Exception…
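
A sketch of the usual first checks when the dashboard is reachable but Client() still fails: give the connection more time and confirm that client, scheduler, and workers run matching versions (the address is the one from the question).

```python
from dask.distributed import Client

# A longer timeout separates "slow to respond" from "unreachable", and
# get_versions(check=True) raises if the installed versions disagree, which
# is a common cause of tornado errors right after connecting.
client = Client("tcp://192.168.0.10:8786", timeout="30s")
print(client.get_versions(check=True))
```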
4 votes, 1 answer

Efficient way to stack Dask Arrays generated from Xarray

So I am trying to read a large number of relatively large NetCDF files containing hydrologic data. The NetCDF files all look like this: Dimensions: (feature_id: 2729077, reference_time: 1, time: 1) Coordinates: * time …
pythonweb
  • 1,024
  • 2
  • 11
  • 26
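
Rather than stacking per-file arrays by hand, one common approach is to let xarray open and concatenate all the NetCDF files lazily as dask-backed arrays; the filename pattern, variable name, and chunk sizes below are assumptions.

```python
import xarray as xr

ds = xr.open_mfdataset(
    "nwm_output_*.nc",              # hypothetical file pattern
    combine="nested",
    concat_dim="time",              # stack the per-file time slices
    chunks={"feature_id": 500_000}  # keep each dask chunk a manageable size
)
streamflow = ds["streamflow"]       # assumed variable name; backed by a dask array
print(streamflow)
```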
4 votes, 3 answers

Importing a Dask dataframe gives an error cannot import name 'is_datetime64tz_dtype'

I installed Dask in my Jupyter notebook using the command below: !pip install "dask[complete]" After this, when I run the import command import dask.dataframe as dd I get the error below. ImportError Traceback (most…
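
That ImportError usually points to a dask/pandas version mismatch rather than a broken install; a sketch of the usual fix is to upgrade both together in the same environment and restart the kernel.

```python
# In a Jupyter cell, upgrade both packages together (note the straight quotes):
# !pip install --upgrade "dask[complete]" pandas
# ...then restart the kernel before importing again.

import dask
import pandas as pd
print(dask.__version__, pd.__version__)

import dask.dataframe as dd  # should now import without the ImportError
```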
4 votes, 2 answers

Efficient pairwise comparison of rows in pandas DataFrame

I am currently working with a smallish dataset (about 9 million rows). Unfortunately, most of the entries are strings, and even with coercion to categories, the frame sits at a few GB in memory. What I would like to do is compare each row with other…
Fred Byrd
  • 233
  • 2
  • 6
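
A sketch of one way to spread the pairwise work out with dask.bag; the compare function here is a made-up placeholder for whatever row similarity the question actually needs.

```python
import itertools
import pandas as pd
import dask.bag as db

df = pd.DataFrame({"a": ["x", "y", "x", "z"], "b": ["p", "p", "q", "q"]})

def compare(pair):
    i, j = pair
    # placeholder metric: number of columns where the two rows agree
    return i, j, int((df.iloc[i] == df.iloc[j]).sum())

# Partition the index pairs so comparisons run in parallel across threads
# (threads avoid copying the frame to worker processes).
pairs = db.from_sequence(itertools.combinations(range(len(df)), 2), npartitions=4)
print(pairs.map(compare).compute(scheduler="threads")[:5])
```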
4 votes, 1 answer

write dask dataframe to one file

I can write a massive dask data frame to disk like so: raw_data.to_csv(r'C:\Bla\SubFolder\*.csv') This produces chunked data of the original (massaged) dataset in the subfolder: C:\Bla\SubFolder\ Just wondering, can I force dask to write the data…
cs0815
  • 16,751
  • 45
  • 136
  • 299
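
A sketch of two options, assuming the result is still wanted as a single CSV: newer Dask versions can write one file directly, and if the frame fits in memory, computing to pandas and writing once also works.

```python
import pandas as pd
import dask.dataframe as dd

raw_data = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=3)

# Option 1: supported in recent Dask releases.
raw_data.to_csv("combined.csv", single_file=True, index=False)

# Option 2: only if the result comfortably fits in memory.
raw_data.compute().to_csv("combined_pandas.csv", index=False)
```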
4 votes, 1 answer

Dask DataFrame .head() very slow after indexing

Not reproducible, but can someone fill in why a .head() call is greatly slowed after indexing? import dask.dataframe as dd df = dd.read_parquet("Filepath") df.head() # takes 10 seconds df = df.set_index('id') df.head() # takes 10 minutes +
AZhao
  • 13,617
  • 7
  • 31
  • 54
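
A sketch of why this happens and two ways around it: set_index normally implies a full shuffle, and head() has to finish that shuffle before it can return anything. The path and column name are taken from the question.

```python
import dask.dataframe as dd

df = dd.read_parquet("Filepath")

# If 'id' is already sorted on disk, say so and the shuffle is skipped.
df = df.set_index("id", sorted=True)

# Otherwise, pay for the shuffle once and keep the result in memory:
# df = df.set_index("id").persist()

print(df.head())
```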
4 votes, 1 answer

Switching from multiprocess to multithreaded Dask.DataFrame

I have a question about using dask to parallelize my code. I have a pandas dataframe and an 8-core CPU, so I want to apply some function row-wise. Here is an example: import dask.dataframe as dd from dask.multiprocessing import get # o - is pandas…
zhc
  • 53
  • 1
  • 5
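
A sketch of the current way to switch schedulers, with a placeholder row function: the old dask.multiprocessing.get import is no longer needed, since the scheduler is picked at compute time.

```python
import pandas as pd
import dask.dataframe as dd

o = pd.DataFrame({"a": range(100)})      # stand-in for the pandas frame
ddf = dd.from_pandas(o, npartitions=8)   # roughly one partition per core

def row_fn(row):
    return row["a"] * 2                  # placeholder row-wise function

result = ddf.apply(row_fn, axis=1, meta=("a", "int64"))

# Threads share memory (good when the work releases the GIL, e.g. NumPy);
# processes sidestep the GIL at the cost of serialising data between workers.
print(result.compute(scheduler="threads").head())
# print(result.compute(scheduler="processes").head())
```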
4 votes, 1 answer

dask how to define a custom (time fold) function that operates in parallel and returns a dataframe with a different shape

I am trying to implement a time fold function to be 'map'ed to various partitions of a dask dataframe which in turn changes the shape of the dataframe in question (or alternatively produces a new dataframe with the altered shape). This is how far I…
PhaKuDi
  • 141
  • 8
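
A sketch of the map_partitions pattern with a made-up "fold" that changes the shape: returning a differently shaped frame is fine as long as meta describes the new columns and dtypes.

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"t": pd.date_range("2024-01-01", periods=96, freq="h"),
                    "value": range(96)})
ddf = dd.from_pandas(pdf, npartitions=4)

def daily_fold(part):
    # collapse each partition down to one row per calendar day
    return (part.groupby(part["t"].dt.date)["value"]
                .sum()
                .reset_index(name="total"))

meta = pd.DataFrame({"t": pd.Series(dtype="object"),
                     "total": pd.Series(dtype="int64")})
folded = ddf.map_partitions(daily_fold, meta=meta)
print(folded.compute())
```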
4 votes, 3 answers

dask read_parquet with pyarrow memory blow up

I am using dask to write and read parquet. I am writing using the fastparquet engine and reading using the pyarrow engine. My worker has 1 GB of memory. With fastparquet the memory usage is fine, but when I switch to pyarrow, it just blows up and causes the…
pranav kohli
  • 123
  • 2
  • 6
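
A sketch of the usual knobs for keeping the pyarrow read small, with an invented path and column subset: read only the columns actually needed and split large row groups into more partitions.

```python
import dask.dataframe as dd

df = dd.read_parquet(
    "data.parquet",               # hypothetical path
    engine="pyarrow",
    columns=["item_id", "qty"],   # assumed subset of columns actually needed
    split_row_groups=True,        # available in recent Dask versions
)
print(df.memory_usage(deep=True).compute())
```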
4 votes, 1 answer

Creating a dask bag from a generator

I would like to create a dask.Bag (or dask.Array) from a list of generators. The gotcha is that the generators (when evaluated) are too large for memory. delayed_array = [delayed(generator) for generator in list_of_generators] my_bag =…
danodonovan
  • 19,636
  • 10
  • 70
  • 78
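
A sketch of the usual pattern: a Delayed has to return a concrete partition (e.g. a list), not a live generator, so wrap a small function that exhausts one generator at a time and feed those to dask.bag.from_delayed.

```python
import dask
import dask.bag as db

def make_generator(n):
    # stand-in for one of the too-large-for-memory generators
    return (i * i for i in range(n))

@dask.delayed
def one_partition(n):
    return list(make_generator(n))  # materialised only when the task runs

partitions = [one_partition(n) for n in (10, 20, 30)]
my_bag = db.from_delayed(partitions)
print(my_bag.take(5))
```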
4 votes, 2 answers

Including keyword arguments (kwargs) in custom Dask graphs

I'm building a custom graph for one operation with Dask. I'm familiar with how to pass arguments to a function in a Dask graph and have read up on the docs. However, I still seem to be missing something. One of the functions used in the Dask graph takes…
jakirkham
  • 685
  • 5
  • 18
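
A sketch of one portable way to get keyword arguments into a hand-written graph, using functools.partial so the task tuple stays positional (the function and key names are made up); recent Dask versions also document an apply helper for the same purpose.

```python
from functools import partial
from dask.threaded import get

def scale(x, factor=1):
    return x * factor

# Bake the kwarg into the callable; the graph itself only sees positional args.
dsk = {
    "a": 10,
    "b": (partial(scale, factor=3), "a"),
}
print(get(dsk, "b"))  # 30
```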
4 votes, 3 answers

parallelize conversion of a single 16M row csv to Parquet with dask

The following operation works, but takes nearly 2h: from dask import dataframe as ddf ddf.read_csv('data.csv').to_parquet('data.pq') Is there a way to parallelize this? The file data.csv is ~2G uncompressed with 16 million rows by 22 columns.
Daniel Mahler
  • 7,653
  • 5
  • 51
  • 90
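
A sketch of the usual lever, assuming the slowness comes from the CSV being parsed as a single partition: a smaller blocksize splits it into many partitions so parsing and writing run on all cores (the paths are the ones from the question).

```python
from dask import dataframe as ddf
from dask.distributed import Client

if __name__ == "__main__":
    client = Client()  # local cluster, roughly one worker per core

    # ~2 GB / 64 MB blocks -> a few dozen partitions to parse in parallel.
    df = ddf.read_csv("data.csv", blocksize="64MB")
    df.to_parquet("data.pq", engine="pyarrow", write_index=False)
```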