Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
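
As a quick illustration of both components, here is a minimal, self-contained sketch: dask.delayed drives the task scheduler directly, and dask.dataframe is one of the parallel collections built on top of it.

    import dask
    import dask.dataframe as dd
    import pandas as pd

    # Component 1: dynamic task scheduling with dask.delayed.
    @dask.delayed
    def inc(x):
        return x + 1

    total = dask.delayed(sum)([inc(i) for i in range(5)])
    print(total.compute())  # the five inc() tasks run in parallel; prints 15

    # Component 2: a "big data" collection that mirrors the pandas API.
    pdf = pd.DataFrame({"x": range(10)})
    ddf = dd.from_pandas(pdf, npartitions=2)  # two parallel partitions
    print(ddf.x.mean().compute())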

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
5 votes · 1 answer

read_sql_table in Dask returns NoSuchTableError

I have a read_sql using pandas and it works fine. However, when I tried to re-create the same dataframe under Dask using the same logic, it gives me NoSuchTableError. I know for sure the table exists in my SQL database. pandas #works: import…
asked by Xwnola
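
For context, dd.read_sql_table reflects the table through SQLAlchemy rather than executing a raw query the way pandas.read_sql can, so a missing schema= argument is a common cause of NoSuchTableError. A minimal sketch, with a hypothetical table, URI, and index column:

    import dask.dataframe as dd

    ddf = dd.read_sql_table(
        "my_table",                         # hypothetical table name
        "postgresql://user:pass@host/db",   # hypothetical SQLAlchemy URI
        index_col="id",                     # column Dask partitions on
        schema="public",                    # name the schema explicitly
        npartitions=8,
    )
    print(ddf.head())
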
5 votes · 2 answers

Dask: How to Add Security (TLS/SSL) to Dask Cluster?

I'm trying to figure out how to add a security layer to my Dask cluster, deployed using Helm on GKE on GCP, that would force a user to pass the certificate and key files into the Security object, as explained in this documentation [1].…
asked by Riley Hun
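
For reference, the client side of that documentation boils down to constructing a Security object and connecting over tls://. A minimal sketch; the certificate paths and scheduler address are placeholders:

    from dask.distributed import Client, Security

    # CA certificate plus the client's own cert/key pair;
    # require_encryption rejects plain-TCP connections.
    security = Security(
        tls_ca_file="ca.pem",
        tls_client_cert="client-cert.pem",
        tls_client_key="client-key.pem",
        require_encryption=True,
    )

    client = Client("tls://scheduler.example.com:8786", security=security)
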
5 votes · 1 answer

local dask cluster using docker-compose

I want to create a docker-compose.yml containing our company analysis toolchain. For this purpose, I add dask. The docker-compose.yml looks like this: docker-compose.yml version: '3' services: jupyter: build: docker/jupyter/. ports: -…
asked by user2757652
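
A minimal docker-compose sketch of such a local cluster, assuming the official daskdev/dask image; the service names and port mappings are illustrative:

    version: '3'
    services:
      scheduler:
        image: daskdev/dask
        command: dask-scheduler
        ports:
          - "8786:8786"   # scheduler endpoint
          - "8787:8787"   # diagnostics dashboard
      worker:
        image: daskdev/dask
        command: dask-worker scheduler:8786
        depends_on:
          - scheduler

A Jupyter service on the same Compose network can then reach the cluster with Client("tcp://scheduler:8786").
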
5 votes · 1 answer

Dask: Submit continuously, work on all submitted data

Having 500 continuously growing DataFrames, I would like to submit operations on the (per-DataFrame independent) data to dask. My main question is: can dask hold the continuously submitted data, so I can submit a function on all the submitted…
asked by gies0r
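
One common pattern for this with dask.distributed, sketched below: hold on to the futures returned by client.submit so the cluster keeps the results, then submit an aggregate over the whole list (futures passed inside arguments are resolved automatically). The processing functions and the DataFrame source are hypothetical:

    import pandas as pd
    from dask.distributed import Client

    def incoming_dataframes():            # placeholder source of DataFrames
        for i in range(3):
            yield pd.DataFrame({"x": range(i + 1)})

    def process(df):                      # hypothetical per-DataFrame step
        return df["x"].sum()

    def combine(results):                 # hypothetical aggregate
        return sum(results)

    if __name__ == "__main__":
        client = Client()
        futures = [client.submit(process, df) for df in incoming_dataframes()]
        total = client.submit(combine, futures).result()
        print(total)
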
5 votes · 1 answer

Groupby and shift a dask dataframe

I would like to scale some operations I do on a pandas dataframe using dask 2.14. For example, I would like to apply a shift on a column of a dataframe: import dask.dataframe as dd data =…
asked by Luca Monno
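
A hedged sketch of the usual workaround: since shift is a per-group pandas operation, groupby(...).apply with an explicit meta shuffles each group into one partition and applies the pandas code there. The column names and dtypes are illustrative:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"key": ["a", "a", "b", "b"], "x": [1, 2, 3, 4]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    shifted = ddf.groupby("key").apply(
        lambda g: g.assign(x_shifted=g["x"].shift()),
        meta={"key": "object", "x": "int64", "x_shifted": "float64"},
    )
    print(shifted.compute())
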
5 votes · 1 answer

Efficiently read big csv file by parts using Dask

Now I'm reading a big csv file using Dask and doing some postprocessing on it (for example, doing some math, then predicting with some ML model and writing results to a database). To avoid loading all the data in memory, I want to read it in chunks of a given size: read the first…
asked by Mikhail_Sam
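
One way to get exactly this behavior, sketched under the assumption that the pipeline fits a partition-at-a-time model: blocksize controls the chunk size at read time, and to_delayed() exposes the partitions so they can be computed sequentially. The path, block size, and postprocess function are placeholders:

    import dask.dataframe as dd

    def postprocess(chunk):              # hypothetical math/model/DB step
        print(len(chunk))

    # Each ~64 MB block of the file becomes one partition.
    ddf = dd.read_csv("big.csv", blocksize="64MB")

    # Compute partitions one at a time to bound memory usage.
    for part in ddf.to_delayed():
        chunk = part.compute()           # one pandas DataFrame at a time
        postprocess(chunk)
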
5 votes · 1 answer

Streamz/Dask: gather does not wait for all results of buffer

Imports: from dask.distributed import Client import streamz import time Simulated workload: def increment(x): time.sleep(0.5) return x + 1 Let's suppose I'd like to process some workload on a local Dask client: if __name__ == "__main__": …
asked by daniel451
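
For context, the canonical streamz-on-Dask pipeline looks roughly like the sketch below: scatter moves elements to the cluster, buffer bounds how many futures are in flight, and gather brings completed results back to the local process. The buffer size and input range are illustrative:

    import time
    from dask.distributed import Client
    from streamz import Stream

    def increment(x):
        time.sleep(0.5)
        return x + 1

    if __name__ == "__main__":
        client = Client()   # local Dask cluster
        source = Stream()
        source.scatter().map(increment).buffer(8).gather().sink(print)
        for i in range(10):
            source.emit(i)
        time.sleep(5)       # crude wait so results can arrive before exit
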
5 votes · 2 answers

How can I read each Parquet row group into a separate partition?

I have a parquet file with 10 row groups: In [30]: print(pyarrow.parquet.ParquetFile("/tmp/test2.parquet").num_row_groups) 10 But when I load it using Dask Dataframe, it is read into a single partition: In [31]:…
asked by gerrit
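
A hedged sketch, assuming a reasonably recent Dask with the pyarrow engine: read_parquet's split_row_groups flag maps each row group to its own partition instead of collapsing the file into one:

    import dask.dataframe as dd

    ddf = dd.read_parquet(
        "/tmp/test2.parquet",
        engine="pyarrow",
        split_row_groups=True,   # one partition per row group
    )
    print(ddf.npartitions)       # expected: 10 for the file above
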
5 votes · 1 answer

Limit Dask CPU and Memory Usage (Single Node)

I am running Dask on a single computer, where running .compute() to perform the computations on a huge parquet file causes dask to use up all the CPU cores on the system. import dask.dataframe as dd df = dd.read_parquet(parquet_file) # very large…
asked by Nyxynyx
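
One standard way to bound both is to create the local cluster explicitly instead of letting .compute() start a default one; a minimal sketch, with the worker, thread, and memory numbers as placeholders:

    from dask.distributed import Client, LocalCluster

    # 2 workers x 1 thread caps CPU use at roughly two cores; each
    # worker is limited to 4 GB and spills or pauses beyond that.
    cluster = LocalCluster(n_workers=2, threads_per_worker=1,
                           memory_limit="4GB")
    client = Client(cluster)

    # .compute() calls issued after this run on the bounded cluster.
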
5 votes · 1 answer

Memory clean up of Dask workers

I am running multiple parallel tasks on a multi-node distributed Dask cluster. However, once the tasks are finished, the workers still hold a lot of memory and the cluster soon fills up. I have tried client.restart() after every task and…
asked by spiralarchitect
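
A hedged sketch of the usual cleanup steps besides client.restart(): cancel and drop the futures so the scheduler releases the results, then ask every worker to run a garbage-collection pass. The workload here is a stand-in:

    import gc
    from dask.distributed import Client

    client = Client()                        # or the cluster's address
    futures = client.map(lambda x: x ** 2, range(1000))
    results = client.gather(futures)         # ... use the results ...

    client.cancel(futures)   # let the scheduler release the results
    del futures              # drop our own references too
    client.run(gc.collect)   # force a GC pass on every worker
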
5 votes · 1 answer

create a dask dataframe from a dictionary

I have a dictionary like this: d = {'Caps': 'cap_list', 'Term': 'unique_tokens', 'LocalFreq': 'local_freq_list','CorpusFreq': 'corpus_freq_list'} I want to create a dask dataframe from it. How do I do it? Normally, in Pandas, it can be easily…
asked by user1717931
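
Dask has no direct dict constructor analogous to pd.DataFrame(d), so the usual route is pandas first, then from_pandas; a minimal sketch with small placeholder lists standing in for the question's variables:

    import pandas as pd
    import dask.dataframe as dd

    d = {"Caps": [True, False], "Term": ["foo", "bar"],
         "LocalFreq": [3, 5], "CorpusFreq": [10, 20]}   # placeholder data

    ddf = dd.from_pandas(pd.DataFrame(d), npartitions=2)
    print(ddf.compute())
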
5 votes · 1 answer

Is there a faster way to export data from Dask DataFrame to CSV?

I am reading a CSV file (10 GB) using Dask. Then, after performing some operations, I am exporting the file in CSV format using to_csv. But the problem is that exporting this file takes around 27 minutes (according to the ProgressBar diagnostics). CSV file…
asked by Pritesh K.
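
Two common speed-ups, sketched with placeholder paths: let to_csv write one file per partition in parallel (writing a single file serializes the output), or switch to a binary columnar format such as Parquet:

    import dask.dataframe as dd

    ddf = dd.read_csv("input.csv")    # placeholder 10 GB input
    result = ddf                      # ... operations go here ...

    # One CSV per partition, written by all workers in parallel;
    # the * expands to the partition number.
    result.to_csv("out/part-*.csv")

    # Often faster still: compressed, columnar output.
    result.to_parquet("out.parquet")
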
5 votes · 2 answers

Adding new Xarray DataArray to an existing Zarr store without re-writing the whole dataset?

How do I add a new DataArray to an existing Dataset without overwriting the whole thing? The new DataArray shares some coordinates with the existing one, but also has new ones. In my current implementation, the Dataset gets completely overwritten,…
asked by jkmacc
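
For context, xarray's to_zarr accepts mode="a", which writes new variables into an existing store without rewriting the ones already there; a minimal sketch with made-up variable names and a shared time coordinate:

    import numpy as np
    import xarray as xr

    # Existing store with one variable.
    ds = xr.Dataset({"temp": ("time", np.arange(4.0))},
                    coords={"time": np.arange(4)})
    ds.to_zarr("store.zarr", mode="w")

    # New variable sharing the time coordinate; mode="a" adds just
    # this variable to the store.
    new = xr.Dataset({"salt": ("time", np.ones(4))},
                     coords={"time": np.arange(4)})
    new.to_zarr("store.zarr", mode="a")
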
5 votes · 2 answers

Reading and writing out of core files sequentially multi-threaded with Python

Overall goal: I want to train a pytorch model on a data set that does not fit into memory. Now forget that I spoke about pytorch; what it boils down to is reading and writing a large file out of core or memory-mapped. I found a lot of libraries, but…
asked by dreamflasher
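
As one out-of-core baseline from the standard NumPy toolbox, a memory-mapped array lets sequential readers and writers touch only the pages they use; the file name, shape, and batch size below are arbitrary:

    import numpy as np

    # File-backed array; it can be far larger than RAM because only
    # the touched pages are loaded.
    arr = np.memmap("data.bin", dtype="float32",
                    mode="w+", shape=(1_000_000, 128))

    # Sequential chunked writes...
    for start in range(0, arr.shape[0], 10_000):
        arr[start:start + 10_000] = np.random.rand(10_000, 128)
    arr.flush()

    # ...and chunked reads, e.g. one training batch at a time.
    batch = np.array(arr[0:256])   # copies one slice into RAM
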
5 votes · 1 answer

Load many feather files in a folder into dask

Given a folder with many .feather files, I would like to load all of them into dask in Python. So far, I have tried the following, sourced from a similar question on GitHub https://github.com/dask/dask/issues/1277 files = [...] dfs =…
asked by ZeroStack
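
A hedged sketch of the pattern that GitHub issue points at: wrap each per-file pandas read in dask.delayed and stitch the pieces together with from_delayed, giving one partition per file. The glob pattern is a placeholder:

    import glob
    import pandas as pd
    import dask.dataframe as dd
    from dask import delayed

    files = sorted(glob.glob("folder/*.feather"))        # placeholder pattern

    parts = [delayed(pd.read_feather)(f) for f in files]  # lazy reads
    ddf = dd.from_delayed(parts)                          # 1 file = 1 partition
    print(ddf.head())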