Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
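A minimal sketch of both components, assuming a local environment with dask installed: `dask.delayed` builds a task graph that a scheduler executes lazily, and `dask.array` is a "big data" collection built on top of that scheduler.

```python
import dask
import dask.array as da

# Dynamic task scheduling: delayed functions build a graph, nothing runs yet.
@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(x, y):
    return x + y

total = add(inc(1), inc(2))   # builds a small task graph
print(total.compute())        # the scheduler executes it: 5

# A parallel collection with the NumPy interface, split into 100x100 chunks.
x = da.ones((1000, 1000), chunks=(100, 100))
print(float(x.sum().compute()))  # 1000000.0
```

Operations on the array are themselves recorded as tasks in the same graph machinery, which is what lets the collections scale to larger-than-memory or distributed data.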

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3 votes, 1 answer

Best way to parallelize computation over dask blocks that do not return np arrays?

I'd like to return a dask dataframe from an overlapping dask array computation, where each block's computation returns a pandas dataframe. The example below shows one way to do this, simplified for demonstration purposes. I've found a combination…
HoosierDaddy
3 votes, 2 answers

How to create a database connection engine in each Dask subprocess to parallelize thousands of SQL queries, without recreating the engine for every query

I need to run an embarrassingly parallel fetch job for thousands of SQL queries against a database. Here is a simplified example. ##Env info: python=3.7 postgresql=10 dask=latest ##generate the example db table. from sqlalchemy import create_engine import…
WilsonF
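For the question above, one common pattern is to cache the engine per worker process (e.g. with `functools.lru_cache`) so each process builds it at most once, however many queries it runs. This sketch uses a hypothetical stand-in factory and connection string in place of SQLAlchemy's `create_engine`, to show only the caching behavior:

```python
from functools import lru_cache

created = []  # tracks how many engines were actually built

@lru_cache(maxsize=None)
def get_engine(conn_str):
    # Stand-in for sqlalchemy.create_engine(conn_str); cached so that each
    # worker process constructs its engine at most once.
    engine = object()
    created.append(engine)
    return engine

for _ in range(1000):                  # simulate many queries in one worker
    get_engine("postgresql://example") # hypothetical connection string
print(len(created))                    # 1 - built once, reused 999 times
```

Because the cache is a module-level object, each Dask worker process gets its own copy, which is exactly the "one engine per subprocess" behavior the question asks for.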
3 votes, 1 answer

How can I speed up reading a CSV/Parquet file from adl:// with fsspec+adlfs?

I have a several-gigabyte CSV file residing in Azure Data Lake. Using Dask, I can read this file in under a minute as follows: >>> import dask.dataframe as dd >>> adl_path = 'adl://...' >>> df = dd.read_csv(adl_path, storage_options={...}) >>>…
user655321
3 votes, 1 answer

Read a list of files using Dask

I found that Dask can read several csv files this way: import dask.dataframe as dd df = dd.read_csv('myfiles.*.csv') # doctest: +SKIP But what if I want to load not all but some of them: my_files = ['file1.csv', 'file3.csv','file7.csv'] df =…
Mikhail_Sam
3 votes, 1 answer

Computing dask array chunks asynchronously (Dask + FastAPI)

I am building a FastAPI application that will serve chunks of a Dask Array. I would like to leverage FastAPI's asynchronous functionality alongside Dask-distributed's ability to operate asynchronously. Below is an MCVE that demonstrates what I'm…
jhamman
3 votes, 1 answer

Loading a large zipped data set using dask

I am trying to load a large zipped data set into python with the following structure: year.zip year month a lot of .csv files So far I have used the ZipFile library to iterate through each of the CSV files and load them using pandas. zf =…
Vlad
3 votes, 1 answer

Split a parquet file into smaller chunks using dask

I am trying to split a parquet file using Dask with the following piece of code: import dask.dataframe as pd df = pd.read_parquet(dataset_path, chunksize="100MB") df.repartition(partition_size="100MB") pd.to_parquet(df,output_path) I have only one…
Serge
3 votes, 0 answers

How to pick the proper number of threads, workers, and processes for Dask when running in an ephemeral environment as a single machine and a cluster

Our company is currently leveraging prefect.io for data workflows (ELT, report generation, ML, etc.). We have just started adding the ability to do parallel task execution, which is powered by Dask. Our flows are executed using ephemeral AWS Fargate…
braunk
3 votes, 1 answer

Dask running out of memory even with chunks

I'm working with big CSV files and I need to make a Cartesian product (merge operation). I've tried to tackle the problem with Pandas (you can check the Pandas code and a data format example for the same problem here) without success due to memory…
Genarito
3 votes, 1 answer

What is the relationship between BlazingSQL and dask?

I'm trying to understand if BlazingSQL is a competitor or complementary to dask. I have some medium-sized data (10-50GB) saved as parquet files on Azure blob storage. IIUC I can query, join, aggregate, groupby with BlazingSQL using SQL syntax, but I…
Dave Hirschfeld
3 votes, 3 answers

Unable to install Dask[complete] in an alpine 3.9 docker image

As part of one of my requirements, I am trying to create a docker image from alpine 3.9 and install python 3.8 in it, which works fine. But when I try to install dask[complete] in it, it fails with the following error: ERROR: Command errored…
Aman Saurav
3 votes, 1 answer

Pivoting a dask dataframe using multiple columns as index

I have a Dask DataFrame of the following format: date hour device param value 20190701 21 dev_01 att_1 0.000000 20190718 22 dev_01 att_2 20.000000 20190718 22 dev_01 att_3 18.611111 20190701 21 dev_01 att_4 …
Arnab Biswas
3 votes, 3 answers

Convert Dask Bag of Pandas DataFrames to a single Dask DataFrame

Summary of Problem Short Version How do I go from a Dask Bag of Pandas DataFrames to a single Dask DataFrame? Long Version I have a number of files that are not readable by any of dask.dataframe's various read functions (e.g. dd.read_csv or…
natemcintosh
3 votes, 1 answer

Asynchronous Xarray writing to Zarr

I'm using a Dask Distributed cluster to write Zarr+Dask-backed Xarray Datasets inside of a loop, and dataset.to_zarr is blocking. This can really slow things down when there are straggler chunks that block the continuation of the loop. …
jkmacc
3 votes, 3 answers

AttributeError: module 'dask' has no attribute 'delayed'

Using PyCharm Community 2018.1.4, Python 3.6, Dask 2.8.1. Trying to implement dask.delayed on some of my methods and getting the error AttributeError: module 'dask' has no attribute 'delayed'. This is obviously not true, so I am wondering what I am…
Sherry