Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers, as the short sketch below illustrates.
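
A minimal sketch of the two halves working together, assuming only dask and pandas are installed:

```python
import dask
import dask.dataframe as dd
import pandas as pd

# Collection half: a Dask DataFrame splits a pandas DataFrame into
# partitions and evaluates operations lazily until .compute().
pdf = pd.DataFrame({"x": range(10)})
ddf = dd.from_pandas(pdf, npartitions=2)
print(ddf.x.sum().compute())  # 45

# Scheduling half: dask.delayed turns plain functions into lazy
# tasks that the scheduler can run in parallel.
@dask.delayed
def double(n):
    return 2 * n

total = dask.delayed(sum)([double(i) for i in range(5)])
print(total.compute())  # 0 + 2 + 4 + 6 + 8 = 20
```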

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3
votes
2 answers

Why is computing the shape on an indexed Parquet file so slow in dask?

I have created a Parquet file from multiple Parquet files located in the same folder. Each file corresponds to a partition. Parquet files are created in different processes (using Python concurrent.futures). Here is an example of the code I run in…
hadim
  • 636
  • 1
  • 7
  • 16
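
One hedged explanation, not specific to the code in the question above: computing `df.shape[0]` on a Dask DataFrame scans every partition, while the Parquet footer already stores per-row-group row counts. A sketch, assuming pyarrow is installed and a hypothetical local file named data.parquet:

```python
import pyarrow.parquet as pq

# The Parquet footer records the row count of every row group, so
# the total length is available without reading any column data.
meta = pq.ParquetFile("data.parquet").metadata
print(meta.num_rows, meta.num_columns)
```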
3
votes
2 answers

Estimate pandas dataframe size without loading into memory

Is there a way to estimate the size a dataframe would be without loading it into memory? I already know that I do not have enough memory for the dataframe that I am trying to create but I do not know how much more memory would be required to fully…
alws_cnfsd
  • 105
  • 6
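
A hedged sketch of one common approach, not a definitive answer: read a small sample with pandas, measure its in-memory footprint, and extrapolate from the on-disk size. The path big.csv is hypothetical:

```python
import os
import pandas as pd

path = "big.csv"  # hypothetical file

# Measure the in-memory bytes per row on a small sample.
sample = pd.read_csv(path, nrows=10_000)
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)

# Estimate the total row count from the average encoded line length.
with open(path, "rb") as f:
    head = f.read(1_000_000)
avg_line = len(head) / max(head.count(b"\n"), 1)
est_rows = os.path.getsize(path) / avg_line

print(f"~{bytes_per_row * est_rows / 1e9:.1f} GB in memory")
```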
3
votes
1 answer

Dask - How to cancel and resubmit stalled tasks?

Frequently, I encounter an issue where Dask randomly stalls on a couple of tasks, usually tied to a read of data from a different node on my network (more details about this below). This can happen after several hours of running the script with no…
dan
  • 183
  • 13
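
One possible pattern, sketched with the public distributed API (the scheduler address is hypothetical): track inputs alongside futures, cancel those still pending after a timeout, and resubmit with pure=False so Dask assigns fresh task keys:

```python
import time
from dask.distributed import Client

def work(x):
    return x * x

client = Client("tcp://scheduler:8786")  # hypothetical address
inputs = list(range(100))
futures = dict(zip(inputs, client.map(work, inputs)))

time.sleep(30)  # stand-in for a real progress/stall check

# Cancel futures that are still pending and submit them again;
# pure=False forces a new task key so the work actually re-runs.
for x, fut in list(futures.items()):
    if not fut.done():
        client.cancel(fut)
        futures[x] = client.submit(work, x, pure=False)

results = client.gather(list(futures.values()))
```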
3
votes
1 answer

client.upload_file() for nested modules

I have a project structured as follows; - topmodule/ - childmodule1/ - my_func1.py - childmodule2/ - my_func2.py - common.py - __init__.py From my Jupyter notebook on an edge-node of a Dask cluster, I am doing the…
Jenna Kwon
  • 1,212
  • 1
  • 12
  • 22
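
A hedged workaround often suggested for this: client.upload_file ships single files (.py, .egg, .zip), so a nested package can be zipped first, and workers append the archive to their sys.path. The scheduler address is hypothetical:

```python
import shutil
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # hypothetical address

# Zip the whole package tree so the nested modules travel together.
shutil.make_archive("topmodule", "zip", root_dir=".", base_dir="topmodule")
client.upload_file("topmodule.zip")

def uses_package():
    from topmodule import common  # importable on the worker now
    return common.__name__

print(client.submit(uses_package).result())
```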
3
votes
2 answers

using dask read_csv to read filename as a column name

I am importing 4000+ CSV files, all with the same columns, columns=['Date', 'Datapint']. Importing the CSVs into dask is pretty straightforward and is working fine for me. file_paths = '/root/data/daily/' df = dd.read_csv(file_paths+'*.csv', …
blonc
  • 193
  • 2
  • 14
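
A minimal sketch using read_csv's include_path_column option, which records each row's source file in a new column (the option takes either a boolean or a column name):

```python
import dask.dataframe as dd

# include_path_column adds each row's source file path as a column;
# passing a string names the new column.
df = dd.read_csv(
    "/root/data/daily/*.csv",
    include_path_column="filename",
)
print(df.columns)  # ['Date', 'Datapint', 'filename']
```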
3
votes
1 answer

What is the way to add an index column in Dask when reading from a CSV?

I'm trying to process a fairly large dataset that doesn't fit into memory when loaded at once with Pandas, so I'm using Dask. However, I'm having difficulty adding a unique ID column to the dataset once it is read with the read_csv method. I…
ShockDoctor
  • 653
  • 3
  • 9
  • 21
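
One commonly cited workaround, sketched here with a hypothetical file pattern: Dask cannot cheaply assign a global RangeIndex at read time, but a monotonically increasing ID can be built from a cumulative sum over a column of ones:

```python
import dask.dataframe as dd

df = dd.read_csv("data-*.csv")  # hypothetical file pattern

# A column of ones, cumulatively summed across all partitions,
# yields a unique, monotonically increasing ID (starting at 1).
df["unique_id"] = 1
df["unique_id"] = df["unique_id"].cumsum()
```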
3
votes
2 answers

Do HyperbandCV and other incremental search algorithms work for models without partial_fit and for pipelines?

I have been deep diving on the github pages and reading the documentation, but I am not fully understanding whether HyperbandCV will be useful to speed up hyperparameter optimization in my case. I am using SKLearn's pipeline functionality. And I am…
Ife A
  • 43
  • 4
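
For context, a minimal HyperbandSearchCV sketch: Hyperband trains in small increments via partial_fit, so the estimator itself must implement it (plain sklearn Pipelines do not). The parameters and sizes below are illustrative only:

```python
from scipy.stats import loguniform
from sklearn.linear_model import SGDClassifier
from dask_ml.datasets import make_classification
from dask_ml.model_selection import HyperbandSearchCV

# An estimator with partial_fit, trained incrementally by Hyperband.
X, y = make_classification(n_samples=10_000, chunks=1_000)
model = SGDClassifier(tol=1e-3)
params = {"alpha": loguniform(1e-5, 1e-1)}

search = HyperbandSearchCV(model, params, max_iter=81)
search.fit(X, y, classes=[0, 1])  # classes is forwarded to partial_fit
```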
3
votes
1 answer

How to read a single large parquet file into multiple partitions using dask/dask-cudf?

I am trying to read a single large parquet file (size > gpu_size) using dask_cudf/dask, but it is currently reading it into a single partition, which I am guessing is the expected behavior inferring from the doc-string:…
Vibhu Jawa
  • 88
  • 9
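
A hedged sketch of one option: newer dask versions accept split_row_groups in read_parquet, turning each row group into its own partition. This only helps if the file was written with multiple row groups, and availability depends on the dask/engine version:

```python
import dask.dataframe as dd

# One partition per Parquet row group instead of one per file.
df = dd.read_parquet("large.parquet", split_row_groups=True)
print(df.npartitions)
```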
3
votes
1 answer

How to use group by describe with unstack operation in python dask?

I am trying to use the describe() and unstack() functions in dask to get the summary statistics of the data. However, I get an error as shown below. import dask.dataframe as dd df =…
The Great
  • 7,215
  • 7
  • 40
  • 128
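
If groupby().describe() raises in a given dask version, an aggregation list reproduces most of its statistics; a minimal sketch with toy data:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 2.0, 3.0]})
df = dd.from_pandas(pdf, npartitions=2)

# An explicit aggregation list stands in for describe(); compute()
# returns an ordinary pandas frame, which unstack-style reshaping
# can then operate on locally.
stats = (
    df.groupby("group")["value"]
    .agg(["count", "mean", "std", "min", "max"])
    .compute()
)
print(stats)
```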
3
votes
2 answers

Is it possible to read a .tiff file from a remote service with dask?

I'm storing .tiff files on Google Cloud Storage. I'd like to manipulate them using a distributed Dask cluster installed with Helm on Kubernetes. Based on the dask-image repo, the Dask documentation on remote data services, and the use of…
skeller88
  • 4,276
  • 1
  • 32
  • 34
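
One hedged approach, assuming default GCS credentials and a hypothetical bucket path: open the remote file through gcsfs, hand the file object to tifffile, then wrap the array for Dask. Note this reads the whole image on one machine before chunking:

```python
import gcsfs
import tifffile
import dask.array as da

# gcsfs exposes the bucket as a filesystem; tifffile accepts the
# resulting file-like object directly.
fs = gcsfs.GCSFileSystem()
with fs.open("my-bucket/image.tiff", "rb") as f:  # hypothetical path
    arr = tifffile.imread(f)

darr = da.from_array(arr, chunks="auto")
```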
3
votes
0 answers

Dask workers time out shortly after starting

Good afternoon SO, I am trying to deploy a WRF post-processing solution in Python using Dask and wrf-python, run on a cluster; however, I am encountering an issue with the interactivity between the dask scheduler and the worker instances. In…
Phantom139
  • 143
  • 9
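
One knob worth checking in such cases, sketched with dask's config API (the scheduler address is hypothetical): the comm timeouts that decide when a worker is presumed dead can be raised before connecting; dask-worker's --death-timeout flag is the CLI-side counterpart:

```python
import dask

# Raise comm timeouts before creating the client; workers that miss
# these windows on a slow network are otherwise treated as dead.
dask.config.set({
    "distributed.comm.timeouts.connect": "60s",
    "distributed.comm.timeouts.tcp": "60s",
})

from dask.distributed import Client
client = Client("tcp://scheduler:8786")  # hypothetical address
```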
3
votes
0 answers

dd.read_csv - FileNotFoundError: [WinError 3] - UNC Path

Minimal reproducible example: This code: file_path = r"\\myserver\e\somedir\mycsv.csv" my_df = dd.read_csv(file_path, dtype="str") Results in: FileNotFoundError: [WinError 3] The system cannot find the path specified:…
healthDog
  • 31
  • 1
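
A hedged fallback if dd.read_csv cannot resolve the UNC path: pandas on Windows generally handles UNC paths, so the read can be wrapped in a delayed task and converted back to a Dask DataFrame:

```python
import pandas as pd
import dask.dataframe as dd
from dask import delayed

# Let pandas resolve the UNC path inside a delayed task, then
# rebuild a (single-partition) Dask DataFrame from it.
path = r"\\myserver\e\somedir\mycsv.csv"
parts = [delayed(pd.read_csv)(path, dtype="str")]
my_df = dd.from_delayed(parts)
```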
3
votes
0 answers

Dask: iterate over dataframe groups (implement a state machine given event stream)

Given an event stream for each key, I would like to maintain some internal state, and emit a state history for each event. A naive implementation would simply chunk the data by key, iterate over the events in order, maintain some internal state in…
Alexander David
  • 769
  • 2
  • 8
  • 19
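
A sketch of the usual pattern: groupby().apply runs each whole group on one worker, so per-key ordering and state can be handled inside the function. The state logic below is a stand-in:

```python
import pandas as pd
import dask.dataframe as dd

def replay(events: pd.DataFrame) -> pd.DataFrame:
    # Replay the key's events in timestamp order, carrying running
    # state; cumsum here is a stand-in for real state-machine logic.
    events = events.sort_values("ts")
    events["state"] = events["value"].cumsum()
    return events

pdf = pd.DataFrame({"key": ["a", "a", "b"], "ts": [1, 2, 1], "value": [1, 2, 3]})
df = dd.from_pandas(pdf, npartitions=2)

# meta describes the output frame so dask can build the graph lazily.
out = df.groupby("key").apply(
    replay,
    meta={"key": "object", "ts": "int64", "value": "int64", "state": "int64"},
)
print(out.compute())
```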
3
votes
3 answers

Why does Dask perform so much slower while multiprocessing performs so much faster?

To get a better understanding of parallelism, I am comparing a set of different pieces of code. Here is the basic one (code_piece_1). for loop import time # setup problem_size = 1e7 items = range(9) # serial def counter(num=0): junk = 0 …
user10449636
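
A hedged illustration of the usual explanation: dask.delayed defaults to the threaded scheduler, which the GIL serializes for CPU-bound pure-Python loops, whereas scheduler="processes" behaves like a multiprocessing.Pool:

```python
import dask

def counter(num):
    junk = 0
    for _ in range(10_000_000):
        junk += 1  # CPU-bound pure-Python work holds the GIL
    return junk

if __name__ == "__main__":
    tasks = [dask.delayed(counter)(i) for i in range(9)]
    # The process-based scheduler sidesteps the GIL, mirroring what
    # a multiprocessing.Pool does for this workload.
    results = dask.compute(*tasks, scheduler="processes")
    print(results)
```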
3
votes
1 answer

How should I load a memory-intensive helper object per-worker in dask distributed?

I am currently trying to parse a very large number of text documents using dask + spaCy. SpaCy requires that I load a relatively large Language object, and I would like to load this once per worker. I have a couple of mapping functions that I would…
JSybrandt
  • 108
  • 1
  • 7
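
One common pattern, sketched with the public distributed API (the scheduler address and model name are hypothetical): cache the expensive object as an attribute on the worker via get_worker(), so each worker loads it once rather than once per task:

```python
from dask.distributed import Client, get_worker

def get_nlp():
    # Stash the Language object on the worker the first time a task
    # asks for it; later tasks on the same worker reuse it.
    worker = get_worker()
    if not hasattr(worker, "nlp"):
        import spacy
        worker.nlp = spacy.load("en_core_web_sm")  # hypothetical model
    return worker.nlp

def parse(text):
    nlp = get_nlp()
    return [tok.text for tok in nlp(text)]

client = Client("tcp://scheduler:8786")  # hypothetical address
futures = client.map(parse, ["one document", "another document"])
print(client.gather(futures))
```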