Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
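A minimal sketch of both components, assuming a local environment with dask installed: `dask.delayed` builds a task graph that a scheduler executes lazily, and `dask.array` is a "big data" collection built on top of that scheduler.

```python
import dask
import dask.array as da

# Dynamic task scheduling: delayed functions build a graph, nothing runs yet.
@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(x, y):
    return x + y

total = add(inc(1), inc(2))   # builds a small task graph
print(total.compute())        # the scheduler executes it: 5

# A parallel collection with the NumPy interface, split into 100x100 chunks.
x = da.ones((1000, 1000), chunks=(100, 100))
print(float(x.sum().compute()))  # 1000000.0
```

Operations on the array are themselves recorded as tasks in the same graph machinery, which is what lets the collections scale to larger-than-memory or distributed data.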

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3 votes, 1 answer

Best way to parallelize computation over dask blocks that do not return np arrays?

I'd like to return a dask dataframe from an overlapping dask array computation, where each block's computation returns a pandas dataframe. The example below shows one way to do this, simplified for demonstration purposes. I've found a combination…
HoosierDaddy
3 votes, 2 answers

How to create a database connection engine in each Dask subprocess to parallelize thousands of SQL queries, without recreating the engine for every query

I need to run an embarrassingly parallel fetch job for thousands of SQL queries against a database. Here is a simplified example. ##Env info: python=3.7 postgresql=10 dask=latest ##generate the example db table. from sqlalchemy import create_engine import…
WilsonF
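For the question above, one common pattern is to cache the engine per worker process (e.g. with `functools.lru_cache`) so each process builds it at most once, however many queries it runs. This sketch uses a hypothetical stand-in factory and connection string in place of SQLAlchemy's `create_engine`, to show only the caching behavior:

```python
from functools import lru_cache

created = []  # tracks how many engines were actually built

@lru_cache(maxsize=None)
def get_engine(conn_str):
    # Stand-in for sqlalchemy.create_engine(conn_str); cached so that each
    # worker process constructs its engine at most once.
    engine = object()
    created.append(engine)
    return engine

for _ in range(1000):                  # simulate many queries in one worker
    get_engine("postgresql://example") # hypothetical connection string
print(len(created))                    # 1 - built once, reused 999 times
```

Because the cache is a module-level object, each Dask worker process gets its own copy, which is exactly the "one engine per subprocess" behavior the question asks for.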
3 votes, 1 answer

How can I speed up reading a CSV/Parquet file from adl:// with fsspec+adlfs?

I have a several-gigabyte CSV file residing in Azure Data Lake. Using Dask, I can read this file in under a minute as follows: >>> import dask.dataframe as dd >>> adl_path = 'adl://...' >>> df = dd.read_csv(adl_path, storage_options={...}) >>>…
user655321
3 votes, 1 answer

Read a list of files using Dask

I found that Dask can read several csv files this way: import dask.dataframe as dd df = dd.read_csv('myfiles.*.csv') # doctest: +SKIP But what if I want to load not all but some of them: my_files = ['file1.csv', 'file3.csv','file7.csv'] df =…
Mikhail_Sam
3 votes, 1 answer

Computing dask array chunks asynchronously (Dask + FastAPI)

I am building a FastAPI application that will serve chunks of a Dask Array. I would like to leverage FastAPI's asynchronous functionality alongside Dask-distributed's ability to operate asynchronously. Below is an MCVE that demonstrates what I'm…
jhamman
3 votes, 1 answer

Loading a large zipped data set using dask

I am trying to load a large zipped data set into python with the following structure: year.zip year month a lot of .csv files So far I have used the ZipFile library to iterate through each of the CSV files and load them using pandas. zf =…
Vlad
3 votes, 1 answer

Split a parquet file into smaller chunks using dask

I am trying to split a parquet file using Dask with the following piece of code: import dask.dataframe as pd df = pd.read_parquet(dataset_path, chunksize="100MB") df.repartition(partition_size="100MB") pd.to_parquet(df,output_path) I have only one…
Serge
3 votes, 0 answers

How to pick the proper number of threads, workers, and processes for Dask when running in an ephemeral environment as a single machine and a cluster

Our company is currently leveraging prefect.io for data workflows (ELT, report generation, ML, etc.). We have just started adding the ability to do parallel task execution, which is powered by Dask. Our flows are executed using ephemeral AWS Fargate…
braunk
3 votes, 1 answer

Dask running out of memory even with chunks

I'm working with big CSV files and I need to make a Cartesian product (merge operation). I've tried to tackle the problem with Pandas (you can check the Pandas code and a data format example for the same problem here) without success due to memory…
Genarito
3 votes, 1 answer

What is the relationship between BlazingSQL and dask?

I'm trying to understand if BlazingSQL is a competitor or complementary to dask. I have some medium-sized data (10-50GB) saved as parquet files on Azure blob storage. IIUC I can query, join, aggregate, groupby with BlazingSQL using SQL syntax, but I…
Dave Hirschfeld
3 votes, 3 answers

Unable to install Dask[complete] in an alpine 3.9 docker image

As part of one of my requirements, I am trying to create a docker image from alpine 3.9 and install python 3.8 in it, which works fine. But when I try to install dask[complete] in it, it fails with the following error: ERROR: Command errored…
Aman Saurav
3 votes, 1 answer

Pivoting a dask dataframe using multiple columns as index

I have a Dask DataFrame of the following format: date hour device param value 20190701 21 dev_01 att_1 0.000000 20190718 22 dev_01 att_2 20.000000 20190718 22 dev_01 att_3 18.611111 20190701 21 dev_01 att_4 …
Arnab Biswas
3 votes, 3 answers

Convert Dask Bag of Pandas DataFrames to a single Dask DataFrame

Summary of Problem Short Version How do I go from a Dask Bag of Pandas DataFrames to a single Dask DataFrame? Long Version I have a number of files that are not readable by any of dask.dataframe's various read functions (e.g. dd.read_csv or…
natemcintosh
3 votes, 1 answer

Asynchronous Xarray writing to Zarr

I'm using a Dask Distributed cluster to write Zarr+Dask-backed Xarray Datasets inside of a loop, and dataset.to_zarr is blocking. This can really slow things down when there are straggler chunks that block the continuation of the loop. …
jkmacc
3 votes, 3 answers

AttributeError: module 'dask' has no attribute 'delayed'

Using PyCharm Community 2018.1.4, Python 3.6, Dask 2.8.1. Trying to implement dask.delayed on some of my methods and getting the error AttributeError: module 'dask' has no attribute 'delayed'. This is obviously not true, so I am wondering what I am…
Sherry