Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers, as the short sketch below illustrates.
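
A minimal sketch of the two halves working together, assuming only dask and pandas are installed:

```python
import dask
import dask.dataframe as dd
import pandas as pd

# Collection half: a Dask DataFrame splits a pandas DataFrame into
# partitions and evaluates operations lazily until .compute().
pdf = pd.DataFrame({"x": range(10)})
ddf = dd.from_pandas(pdf, npartitions=2)
print(ddf.x.sum().compute())  # 45

# Scheduling half: dask.delayed turns plain functions into lazy
# tasks that the scheduler can run in parallel.
@dask.delayed
def double(n):
    return 2 * n

total = dask.delayed(sum)([double(i) for i in range(5)])
print(total.compute())  # 0 + 2 + 4 + 6 + 8 = 20
```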

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3
votes
2 answers

Why is computing the shape on an indexed Parquet file so slow in dask?

I have created a Parquet file from multiple Parquet files located in the same folder. Each file corresponds to a partition. Parquet files are created in different processes (using Python concurrent.futures). Here is an example of the code I run in…
hadim
  • 636
  • 1
  • 7
  • 16
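
One hedged explanation, not specific to the code in the question above: computing `df.shape[0]` on a Dask DataFrame scans every partition, while the Parquet footer already stores per-row-group row counts. A sketch, assuming pyarrow is installed and a hypothetical local file named data.parquet:

```python
import pyarrow.parquet as pq

# The Parquet footer records the row count of every row group, so
# the total length is available without reading any column data.
meta = pq.ParquetFile("data.parquet").metadata
print(meta.num_rows, meta.num_columns)
```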
3
votes
2 answers

Estimate pandas dataframe size without loading into memory

Is there a way to estimate the size a dataframe would be without loading it into memory? I already know that I do not have enough memory for the dataframe that I am trying to create but I do not know how much more memory would be required to fully…
alws_cnfsd
  • 105
  • 6
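
A hedged sketch of one common approach, not a definitive answer: read a small sample with pandas, measure its in-memory footprint, and extrapolate from the on-disk size. The path big.csv is hypothetical:

```python
import os
import pandas as pd

path = "big.csv"  # hypothetical file

# Measure the in-memory bytes per row on a small sample.
sample = pd.read_csv(path, nrows=10_000)
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)

# Estimate the total row count from the average encoded line length.
with open(path, "rb") as f:
    head = f.read(1_000_000)
avg_line = len(head) / max(head.count(b"\n"), 1)
est_rows = os.path.getsize(path) / avg_line

print(f"~{bytes_per_row * est_rows / 1e9:.1f} GB in memory")
```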
3
votes
1 answer

Dask - How to cancel and resubmit stalled tasks?

Frequently, I encounter an issue where Dask randomly stalls on a couple of tasks, usually tied to a read of data from a different node on my network (more details about this below). This can happen after several hours of running the script with no…
dan
  • 183
  • 13
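
One possible pattern, sketched with the public distributed API (the scheduler address is hypothetical): track inputs alongside futures, cancel those still pending after a timeout, and resubmit with pure=False so Dask assigns fresh task keys:

```python
import time
from dask.distributed import Client

def work(x):
    return x * x

client = Client("tcp://scheduler:8786")  # hypothetical address
inputs = list(range(100))
futures = dict(zip(inputs, client.map(work, inputs)))

time.sleep(30)  # stand-in for a real progress/stall check

# Cancel futures that are still pending and submit them again;
# pure=False forces a new task key so the work actually re-runs.
for x, fut in list(futures.items()):
    if not fut.done():
        client.cancel(fut)
        futures[x] = client.submit(work, x, pure=False)

results = client.gather(list(futures.values()))
```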
3
votes
1 answer

client.upload_file() for nested modules

I have a project structured as follows; - topmodule/ - childmodule1/ - my_func1.py - childmodule2/ - my_func2.py - common.py - __init__.py From my Jupyter notebook on an edge-node of a Dask cluster, I am doing the…
Jenna Kwon
  • 1,212
  • 1
  • 12
  • 22
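
A hedged workaround often suggested for this: client.upload_file ships single files (.py, .egg, .zip), so a nested package can be zipped first, and workers append the archive to their sys.path. The scheduler address is hypothetical:

```python
import shutil
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # hypothetical address

# Zip the whole package tree so the nested modules travel together.
shutil.make_archive("topmodule", "zip", root_dir=".", base_dir="topmodule")
client.upload_file("topmodule.zip")

def uses_package():
    from topmodule import common  # importable on the worker now
    return common.__name__

print(client.submit(uses_package).result())
```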
3
votes
2 answers

using dask read_csv to read filename as a column name

I am importing 4000+ CSV files, all with the same columns, columns=['Date', 'Datapint']. Importing the CSVs into dask is pretty straightforward and is working fine for me. file_paths = '/root/data/daily/' df = dd.read_csv(file_paths+'*.csv', …
blonc
  • 193
  • 2
  • 14
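
A minimal sketch using read_csv's include_path_column option, which records each row's source file in a new column (the option takes either a boolean or a column name):

```python
import dask.dataframe as dd

# include_path_column adds each row's source file path as a column;
# passing a string names the new column.
df = dd.read_csv(
    "/root/data/daily/*.csv",
    include_path_column="filename",
)
print(df.columns)  # ['Date', 'Datapint', 'filename']
```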
3
votes
1 answer

What is the way to add an index column in Dask when reading from a CSV?

I'm trying to process a fairly large dataset that doesn't fit into memory when loaded at once with Pandas, so I'm using Dask. However, I'm having difficulty adding a unique ID column to the dataset once it is read with the read_csv method. I…
ShockDoctor
  • 653
  • 3
  • 9
  • 21
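
One commonly cited workaround, sketched here with a hypothetical file pattern: Dask cannot cheaply assign a global RangeIndex at read time, but a monotonically increasing ID can be built from a cumulative sum over a column of ones:

```python
import dask.dataframe as dd

df = dd.read_csv("data-*.csv")  # hypothetical file pattern

# A column of ones, cumulatively summed across all partitions,
# yields a unique, monotonically increasing ID (starting at 1).
df["unique_id"] = 1
df["unique_id"] = df["unique_id"].cumsum()
```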
3
votes
2 answers

Do HyperbandCV and other incremental search algorithms work for models without partial_fit and for pipelines?

I have been deep diving on the github pages and reading the documentation, but I am not fully understanding whether HyperbandCV will be useful to speed up hyperparameter optimization in my case. I am using SKLearn's pipeline functionality. And I am…
Ife A
  • 43
  • 4
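
For context, a minimal HyperbandSearchCV sketch: Hyperband trains in small increments via partial_fit, so the estimator itself must implement it (plain sklearn Pipelines do not). The parameters and sizes below are illustrative only:

```python
from scipy.stats import loguniform
from sklearn.linear_model import SGDClassifier
from dask_ml.datasets import make_classification
from dask_ml.model_selection import HyperbandSearchCV

# An estimator with partial_fit, trained incrementally by Hyperband.
X, y = make_classification(n_samples=10_000, chunks=1_000)
model = SGDClassifier(tol=1e-3)
params = {"alpha": loguniform(1e-5, 1e-1)}

search = HyperbandSearchCV(model, params, max_iter=81)
search.fit(X, y, classes=[0, 1])  # classes is forwarded to partial_fit
```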
3
votes
1 answer

How to read a single large parquet file into multiple partitions using dask/dask-cudf?

I am trying to read a single large parquet file (size > gpu_size) using dask_cudf/dask, but it is currently reading it into a single partition, which I am guessing is the expected behavior inferring from the doc-string:…
Vibhu Jawa
  • 88
  • 9
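
A hedged sketch of one option: newer dask versions accept split_row_groups in read_parquet, turning each row group into its own partition. This only helps if the file was written with multiple row groups, and availability depends on the dask/engine version:

```python
import dask.dataframe as dd

# One partition per Parquet row group instead of one per file.
df = dd.read_parquet("large.parquet", split_row_groups=True)
print(df.npartitions)
```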
3
votes
1 answer

How to use group by describe with unstack operation in python dask?

I am trying to use the describe() and unstack() functions in dask to get the summary statistics of the data. However, I get an error as shown below. import dask.dataframe as dd df =…
The Great
  • 7,215
  • 7
  • 40
  • 128
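
If groupby().describe() raises in a given dask version, an aggregation list reproduces most of its statistics; a minimal sketch with toy data:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 2.0, 3.0]})
df = dd.from_pandas(pdf, npartitions=2)

# An explicit aggregation list stands in for describe(); compute()
# returns an ordinary pandas frame, which unstack-style reshaping
# can then operate on locally.
stats = (
    df.groupby("group")["value"]
    .agg(["count", "mean", "std", "min", "max"])
    .compute()
)
print(stats)
```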
3
votes
2 answers

Is it possible to read a .tiff file from a remote service with dask?

I'm storing .tiff files on Google Cloud Storage. I'd like to manipulate them using a distributed Dask cluster installed with Helm on Kubernetes. Based on the dask-image repo, the Dask documentation on remote data services, and the use of…
skeller88
  • 4,276
  • 1
  • 32
  • 34
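
One hedged approach, assuming default GCS credentials and a hypothetical bucket path: open the remote file through gcsfs, hand the file object to tifffile, then wrap the array for Dask. Note this reads the whole image on one machine before chunking:

```python
import gcsfs
import tifffile
import dask.array as da

# gcsfs exposes the bucket as a filesystem; tifffile accepts the
# resulting file-like object directly.
fs = gcsfs.GCSFileSystem()
with fs.open("my-bucket/image.tiff", "rb") as f:  # hypothetical path
    arr = tifffile.imread(f)

darr = da.from_array(arr, chunks="auto")
```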
3
votes
0 answers

Dask workers time out shortly after starting

Good afternoon SO, I am trying to deploy a WRF post-processing solution in Python using Dask and wrf-python, run on a cluster; however, I am encountering an issue with the interactivity between the dask scheduler and the worker instances. In…
Phantom139
  • 143
  • 9
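
One knob worth checking in such cases, sketched with dask's config API (the scheduler address is hypothetical): the comm timeouts that decide when a worker is presumed dead can be raised before connecting; dask-worker's --death-timeout flag is the CLI-side counterpart:

```python
import dask

# Raise comm timeouts before creating the client; workers that miss
# these windows on a slow network are otherwise treated as dead.
dask.config.set({
    "distributed.comm.timeouts.connect": "60s",
    "distributed.comm.timeouts.tcp": "60s",
})

from dask.distributed import Client
client = Client("tcp://scheduler:8786")  # hypothetical address
```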
3
votes
0 answers

dd.read_csv - FileNotFoundError: [WinError 3] - UNC Path

Minimal reproducible example: This code: file_path = r"\\myserver\e\somedir\mycsv.csv" my_df = dd.read_csv(file_path, dtype="str") Results in: FileNotFoundError: [WinError 3] The system cannot find the path specified:…
healthDog
  • 31
  • 1
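
A hedged fallback if dd.read_csv cannot resolve the UNC path: pandas on Windows generally handles UNC paths, so the read can be wrapped in a delayed task and converted back to a Dask DataFrame:

```python
import pandas as pd
import dask.dataframe as dd
from dask import delayed

# Let pandas resolve the UNC path inside a delayed task, then
# rebuild a (single-partition) Dask DataFrame from it.
path = r"\\myserver\e\somedir\mycsv.csv"
parts = [delayed(pd.read_csv)(path, dtype="str")]
my_df = dd.from_delayed(parts)
```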
3
votes
0 answers

Dask: iterate over dataframe groups (implement a state machine given event stream)

Given an event stream for each key, I would like to maintain some internal state, and emit a state history for each event. A naive implementation would simply chunk the data by key, iterate over the events in order, maintain some internal state in…
Alexander David
  • 769
  • 2
  • 8
  • 19
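
A sketch of the usual pattern: groupby().apply runs each whole group on one worker, so per-key ordering and state can be handled inside the function. The state logic below is a stand-in:

```python
import pandas as pd
import dask.dataframe as dd

def replay(events: pd.DataFrame) -> pd.DataFrame:
    # Replay the key's events in timestamp order, carrying running
    # state; cumsum here is a stand-in for real state-machine logic.
    events = events.sort_values("ts")
    events["state"] = events["value"].cumsum()
    return events

pdf = pd.DataFrame({"key": ["a", "a", "b"], "ts": [1, 2, 1], "value": [1, 2, 3]})
df = dd.from_pandas(pdf, npartitions=2)

# meta describes the output frame so dask can build the graph lazily.
out = df.groupby("key").apply(
    replay,
    meta={"key": "object", "ts": "int64", "value": "int64", "state": "int64"},
)
print(out.compute())
```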
3
votes
3 answers

Why does Dask perform so much slower while multiprocessing performs so much faster?

To get a better understanding of parallelism, I am comparing a set of different pieces of code. Here is the basic one (code_piece_1). for loop import time # setup problem_size = 1e7 items = range(9) # serial def counter(num=0): junk = 0 …
user10449636
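
A hedged illustration of the usual explanation: dask.delayed defaults to the threaded scheduler, which the GIL serializes for CPU-bound pure-Python loops, whereas scheduler="processes" behaves like a multiprocessing.Pool:

```python
import dask

def counter(num):
    junk = 0
    for _ in range(10_000_000):
        junk += 1  # CPU-bound pure-Python work holds the GIL
    return junk

if __name__ == "__main__":
    tasks = [dask.delayed(counter)(i) for i in range(9)]
    # The process-based scheduler sidesteps the GIL, mirroring what
    # a multiprocessing.Pool does for this workload.
    results = dask.compute(*tasks, scheduler="processes")
    print(results)
```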
3
votes
1 answer

How should I load a memory-intensive helper object per-worker in dask distributed?

I am currently trying to parse a very large number of text documents using dask + spaCy. SpaCy requires that I load a relatively large Language object, and I would like to load this once per worker. I have a couple of mapping functions that I would…
JSybrandt
  • 108
  • 1
  • 7
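
One common pattern, sketched with the public distributed API (the scheduler address and model name are hypothetical): cache the expensive object as an attribute on the worker via get_worker(), so each worker loads it once rather than once per task:

```python
from dask.distributed import Client, get_worker

def get_nlp():
    # Stash the Language object on the worker the first time a task
    # asks for it; later tasks on the same worker reuse it.
    worker = get_worker()
    if not hasattr(worker, "nlp"):
        import spacy
        worker.nlp = spacy.load("en_core_web_sm")  # hypothetical model
    return worker.nlp

def parse(text):
    nlp = get_nlp()
    return [tok.text for tok in nlp(text)]

client = Client("tcp://scheduler:8786")  # hypothetical address
futures = client.map(parse, ["one document", "another document"])
print(client.gather(futures))
```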