Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
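
As a quick illustration of both components, here is a minimal, self-contained sketch: dask.delayed drives the task scheduler directly, and dask.dataframe is one of the parallel collections built on top of it.

    import dask
    import dask.dataframe as dd
    import pandas as pd

    # Component 1: dynamic task scheduling with dask.delayed.
    @dask.delayed
    def inc(x):
        return x + 1

    total = dask.delayed(sum)([inc(i) for i in range(5)])
    print(total.compute())  # the five inc() tasks run in parallel; prints 15

    # Component 2: a "big data" collection that mirrors the pandas API.
    pdf = pd.DataFrame({"x": range(10)})
    ddf = dd.from_pandas(pdf, npartitions=2)  # two parallel partitions
    print(ddf.x.mean().compute())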

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
5 votes · 1 answer

read_sql_table in Dask returns NoSuchTableError

I have a read_sql using pandas and it works fine. However, when I tried to re-create the same dataframe under Dask using the same logic, it gives me NoSuchTableError. I know for sure the table exists in my SQL database. pandas #works: import…
asked by Xwnola
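
For context, dd.read_sql_table reflects the table through SQLAlchemy rather than executing a raw query the way pandas.read_sql can, so a missing schema= argument is a common cause of NoSuchTableError. A minimal sketch, with a hypothetical table, URI, and index column:

    import dask.dataframe as dd

    ddf = dd.read_sql_table(
        "my_table",                         # hypothetical table name
        "postgresql://user:pass@host/db",   # hypothetical SQLAlchemy URI
        index_col="id",                     # column Dask partitions on
        schema="public",                    # name the schema explicitly
        npartitions=8,
    )
    print(ddf.head())
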
5 votes · 2 answers

Dask: How to Add Security (TLS/SSL) to Dask Cluster?

I'm trying to figure out how to add a security layer to my Dask cluster, deployed using Helm on GKE on GCP, that would force a user to pass the certificate and key files into the Security object, as explained in this documentation [1].…
asked by Riley Hun
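
For reference, the client side of that documentation boils down to constructing a Security object and connecting over tls://. A minimal sketch; the certificate paths and scheduler address are placeholders:

    from dask.distributed import Client, Security

    # CA certificate plus the client's own cert/key pair;
    # require_encryption rejects plain-TCP connections.
    security = Security(
        tls_ca_file="ca.pem",
        tls_client_cert="client-cert.pem",
        tls_client_key="client-key.pem",
        require_encryption=True,
    )

    client = Client("tls://scheduler.example.com:8786", security=security)
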
5 votes · 1 answer

local dask cluster using docker-compose

I want to create a docker-compose.yml containing our company analysis toolchain. For this purpose, I add dask. The docker-compose.yml looks like this: docker-compose.yml version: '3' services: jupyter: build: docker/jupyter/. ports: -…
asked by user2757652
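
A minimal docker-compose sketch of such a local cluster, assuming the official daskdev/dask image; the service names and port mappings are illustrative:

    version: '3'
    services:
      scheduler:
        image: daskdev/dask
        command: dask-scheduler
        ports:
          - "8786:8786"   # scheduler endpoint
          - "8787:8787"   # diagnostics dashboard
      worker:
        image: daskdev/dask
        command: dask-worker scheduler:8786
        depends_on:
          - scheduler

A Jupyter service on the same Compose network can then reach the cluster with Client("tcp://scheduler:8786").
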
5 votes · 1 answer

Dask: Submit continuously, work on all submitted data

Having 500 continuously growing DataFrames, I would like to submit operations on the (per-DataFrame independent) data to dask. My main question is: can dask hold the continuously submitted data, so I can submit a function on all the submitted…
asked by gies0r
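
One common pattern for this with dask.distributed, sketched below: hold on to the futures returned by client.submit so the cluster keeps the results, then submit an aggregate over the whole list (futures passed inside arguments are resolved automatically). The processing functions and the DataFrame source are hypothetical:

    import pandas as pd
    from dask.distributed import Client

    def incoming_dataframes():            # placeholder source of DataFrames
        for i in range(3):
            yield pd.DataFrame({"x": range(i + 1)})

    def process(df):                      # hypothetical per-DataFrame step
        return df["x"].sum()

    def combine(results):                 # hypothetical aggregate
        return sum(results)

    if __name__ == "__main__":
        client = Client()
        futures = [client.submit(process, df) for df in incoming_dataframes()]
        total = client.submit(combine, futures).result()
        print(total)
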
5 votes · 1 answer

Groupby and shift a dask dataframe

I would like to scale some operations I do on a pandas dataframe using dask 2.14. For example, I would like to apply a shift on a column of a dataframe: import dask.dataframe as dd data =…
asked by Luca Monno
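
A hedged sketch of the usual workaround: since shift is a per-group pandas operation, groupby(...).apply with an explicit meta shuffles each group into one partition and applies the pandas code there. The column names and dtypes are illustrative:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"key": ["a", "a", "b", "b"], "x": [1, 2, 3, 4]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    shifted = ddf.groupby("key").apply(
        lambda g: g.assign(x_shifted=g["x"].shift()),
        meta={"key": "object", "x": "int64", "x_shifted": "float64"},
    )
    print(shifted.compute())
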
5 votes · 1 answer

Efficiently read big csv file by parts using Dask

Now I'm reading a big csv file using Dask and doing some postprocessing on it (for example, doing some math, then predicting with some ML model and writing results to a database). To avoid loading all the data in memory, I want to read it in chunks of a given size: read the first…
asked by Mikhail_Sam
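
One way to get exactly this behavior, sketched under the assumption that the pipeline fits a partition-at-a-time model: blocksize controls the chunk size at read time, and to_delayed() exposes the partitions so they can be computed sequentially. The path, block size, and postprocess function are placeholders:

    import dask.dataframe as dd

    def postprocess(chunk):              # hypothetical math/model/DB step
        print(len(chunk))

    # Each ~64 MB block of the file becomes one partition.
    ddf = dd.read_csv("big.csv", blocksize="64MB")

    # Compute partitions one at a time to bound memory usage.
    for part in ddf.to_delayed():
        chunk = part.compute()           # one pandas DataFrame at a time
        postprocess(chunk)
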
5 votes · 1 answer

Streamz/Dask: gather does not wait for all results of buffer

Imports: from dask.distributed import Client import streamz import time Simulated workload: def increment(x): time.sleep(0.5) return x + 1 Let's suppose I'd like to process some workload on a local Dask client: if __name__ == "__main__": …
asked by daniel451
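
For context, the canonical streamz-on-Dask pipeline looks roughly like the sketch below: scatter moves elements to the cluster, buffer bounds how many futures are in flight, and gather brings completed results back to the local process. The buffer size and input range are illustrative:

    import time
    from dask.distributed import Client
    from streamz import Stream

    def increment(x):
        time.sleep(0.5)
        return x + 1

    if __name__ == "__main__":
        client = Client()   # local Dask cluster
        source = Stream()
        source.scatter().map(increment).buffer(8).gather().sink(print)
        for i in range(10):
            source.emit(i)
        time.sleep(5)       # crude wait so results can arrive before exit
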
5 votes · 2 answers

How can I read each Parquet row group into a separate partition?

I have a parquet file with 10 row groups: In [30]: print(pyarrow.parquet.ParquetFile("/tmp/test2.parquet").num_row_groups) 10 But when I load it using Dask Dataframe, it is read into a single partition: In [31]:…
asked by gerrit
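
A hedged sketch, assuming a reasonably recent Dask with the pyarrow engine: read_parquet's split_row_groups flag maps each row group to its own partition instead of collapsing the file into one:

    import dask.dataframe as dd

    ddf = dd.read_parquet(
        "/tmp/test2.parquet",
        engine="pyarrow",
        split_row_groups=True,   # one partition per row group
    )
    print(ddf.npartitions)       # expected: 10 for the file above
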
5 votes · 1 answer

Limit Dask CPU and Memory Usage (Single Node)

I am running Dask on a single computer, where running .compute() to perform the computations on a huge parquet file causes dask to use up all the CPU cores on the system. import dask.dataframe as dd df = dd.read_parquet(parquet_file) # very large…
asked by Nyxynyx
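
One standard way to bound both is to create the local cluster explicitly instead of letting .compute() start a default one; a minimal sketch, with the worker, thread, and memory numbers as placeholders:

    from dask.distributed import Client, LocalCluster

    # 2 workers x 1 thread caps CPU use at roughly two cores; each
    # worker is limited to 4 GB and spills or pauses beyond that.
    cluster = LocalCluster(n_workers=2, threads_per_worker=1,
                           memory_limit="4GB")
    client = Client(cluster)

    # .compute() calls issued after this run on the bounded cluster.
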
5 votes · 1 answer

Memory clean up of Dask workers

I am running multiple parallel tasks on a multi-node distributed Dask cluster. However, once the tasks are finished, the workers still hold a lot of memory and the cluster soon fills up. I have tried client.restart() after every task and…
asked by spiralarchitect
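
A hedged sketch of the usual cleanup steps besides client.restart(): cancel and drop the futures so the scheduler releases the results, then ask every worker to run a garbage-collection pass. The workload here is a stand-in:

    import gc
    from dask.distributed import Client

    client = Client()                        # or the cluster's address
    futures = client.map(lambda x: x ** 2, range(1000))
    results = client.gather(futures)         # ... use the results ...

    client.cancel(futures)   # let the scheduler release the results
    del futures              # drop our own references too
    client.run(gc.collect)   # force a GC pass on every worker
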
5 votes · 1 answer

create a dask dataframe from a dictionary

I have a dictionary like this: d = {'Caps': 'cap_list', 'Term': 'unique_tokens', 'LocalFreq': 'local_freq_list','CorpusFreq': 'corpus_freq_list'} I want to create a dask dataframe from it. How do I do it? Normally, in Pandas, it can be easily…
asked by user1717931
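
Dask has no direct dict constructor analogous to pd.DataFrame(d), so the usual route is pandas first, then from_pandas; a minimal sketch with small placeholder lists standing in for the question's variables:

    import pandas as pd
    import dask.dataframe as dd

    d = {"Caps": [True, False], "Term": ["foo", "bar"],
         "LocalFreq": [3, 5], "CorpusFreq": [10, 20]}   # placeholder data

    ddf = dd.from_pandas(pd.DataFrame(d), npartitions=2)
    print(ddf.compute())
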
5 votes · 1 answer

Is there a faster way to export data from Dask DataFrame to CSV?

I am reading a CSV file (10 GB) using Dask. Then, after performing some operations, I am exporting the file in CSV format using to_csv. But the problem is that exporting this file takes around 27 minutes (according to the ProgressBar diagnostics). CSV file…
asked by Pritesh K.
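
Two common speed-ups, sketched with placeholder paths: let to_csv write one file per partition in parallel (writing a single file serializes the output), or switch to a binary columnar format such as Parquet:

    import dask.dataframe as dd

    ddf = dd.read_csv("input.csv")    # placeholder 10 GB input
    result = ddf                      # ... operations go here ...

    # One CSV per partition, written by all workers in parallel;
    # the * expands to the partition number.
    result.to_csv("out/part-*.csv")

    # Often faster still: compressed, columnar output.
    result.to_parquet("out.parquet")
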
5 votes · 2 answers

Adding new Xarray DataArray to an existing Zarr store without re-writing the whole dataset?

How do I add a new DataArray to an existing Dataset without overwriting the whole thing? The new DataArray shares some coordinates with the existing one, but also has new ones. In my current implementation, the Dataset gets completely overwritten,…
asked by jkmacc
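
For context, xarray's to_zarr accepts mode="a", which writes new variables into an existing store without rewriting the ones already there; a minimal sketch with made-up variable names and a shared time coordinate:

    import numpy as np
    import xarray as xr

    # Existing store with one variable.
    ds = xr.Dataset({"temp": ("time", np.arange(4.0))},
                    coords={"time": np.arange(4)})
    ds.to_zarr("store.zarr", mode="w")

    # New variable sharing the time coordinate; mode="a" adds just
    # this variable to the store.
    new = xr.Dataset({"salt": ("time", np.ones(4))},
                     coords={"time": np.arange(4)})
    new.to_zarr("store.zarr", mode="a")
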
5 votes · 2 answers

Reading and writing out of core files sequentially multi-threaded with Python

Overall goal: I want to train a pytorch model on a data set that does not fit into memory. Now forget that I spoke about pytorch; what it boils down to is reading and writing a large file out of core or memory-mapped. I found a lot of libraries, but…
asked by dreamflasher
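
As one out-of-core baseline from the standard NumPy toolbox, a memory-mapped array lets sequential readers and writers touch only the pages they use; the file name, shape, and batch size below are arbitrary:

    import numpy as np

    # File-backed array; it can be far larger than RAM because only
    # the touched pages are loaded.
    arr = np.memmap("data.bin", dtype="float32",
                    mode="w+", shape=(1_000_000, 128))

    # Sequential chunked writes...
    for start in range(0, arr.shape[0], 10_000):
        arr[start:start + 10_000] = np.random.rand(10_000, 128)
    arr.flush()

    # ...and chunked reads, e.g. one training batch at a time.
    batch = np.array(arr[0:256])   # copies one slice into RAM
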
5 votes · 1 answer

Load many feather files in a folder into dask

Given a folder with many .feather files, I would like to load all of them into dask in Python. So far, I have tried the following, sourced from a similar question on GitHub https://github.com/dask/dask/issues/1277 files = [...] dfs =…
asked by ZeroStack
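
A hedged sketch of the pattern that GitHub issue points at: wrap each per-file pandas read in dask.delayed and stitch the pieces together with from_delayed, giving one partition per file. The glob pattern is a placeholder:

    import glob
    import pandas as pd
    import dask.dataframe as dd
    from dask import delayed

    files = sorted(glob.glob("folder/*.feather"))        # placeholder pattern

    parts = [delayed(pd.read_feather)(f) for f in files]  # lazy reads
    ddf = dd.from_delayed(parts)                          # 1 file = 1 partition
    print(ddf.head())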