Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers; a minimal example follows this list.
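
A minimal sketch of how the two pieces fit together, using the array collection (illustrative, not from the tag wiki itself):

    import dask.array as da

    # a larger-than-memory-style array, split into 1000x1000 chunks
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

    # operations only build a task graph; nothing runs yet
    y = (x + x.T).mean(axis=0)

    # the dynamic task scheduler executes the graph in parallel
    result = y.compute()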

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub (tutorial): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3 votes, 0 answers

`concurrent.futures._base.CancelledError` running Dask with pymoo

I'm solving a multi-objective optimization problem with pymoo, integrated with Dask. In pymoo you have to define a Problem class to pass to the optimizer; I started with this class: class ChungProblemWeightedDask(ChungProblemDask): def…
Andrex
  • 602
  • 1
  • 7
  • 22
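
For the question above, pymoo's documented pattern is to submit one Dask task per candidate solution inside _evaluate. A hedged sketch (class and objective names are made up; assumes pymoo >= 0.5 and a running cluster); a common cause of CancelledError is futures outliving or losing their Client, so the Client here stays alive for the whole optimization:

    import numpy as np
    from dask.distributed import Client
    from pymoo.core.problem import Problem

    client = Client()  # assumes a local or already-running cluster

    def evaluate_candidate(x):
        # stand-in for the real, expensive objective functions
        return [np.sum(x ** 2), np.sum((x - 1) ** 2)]

    class DaskProblem(Problem):
        def __init__(self):
            super().__init__(n_var=10, n_obj=2, xl=0.0, xu=1.0)

        def _evaluate(self, X, out, *args, **kwargs):
            # one task per row of X; gather blocks until all finish
            futures = [client.submit(evaluate_candidate, x) for x in X]
            out["F"] = np.array(client.gather(futures))
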
3 votes, 0 answers

Getting MemoryError While Clustering

I am trying to cluster my dataset in Python, but I am getting a MemoryError. The dataset has 442,000 rows and 12 columns (all float). The code is below: import psycopg2 as pg import dask.dataframe as dd from sklearn.cluster import…
mcsahin
  • 63
  • 1
  • 7
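
One way to avoid a MemoryError here is to stop materializing the whole dataset and train incrementally, one Dask partition at a time. A sketch with scikit-learn's MiniBatchKMeans (the CSV path is a placeholder; the question actually loads from PostgreSQL):

    import dask.dataframe as dd
    from sklearn.cluster import MiniBatchKMeans

    ddf = dd.read_csv("data.csv")  # placeholder source

    model = MiniBatchKMeans(n_clusters=8)
    for i in range(ddf.npartitions):
        # only one partition is in memory at any time
        part = ddf.get_partition(i).compute()
        model.partial_fit(part.values)
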
3 votes, 1 answer

Read single large zipped csv (too large for memory) using Dask

I have a use case where I have an S3 bucket containing a few hundred gzipped files. Each individual file, when unzipped and loaded into a dataframe, occupies more than the available memory. I'd like to read these files and perform some…
David Moye
  • 701
  • 4
  • 13
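
Gzip is not a splittable compression, so Dask cannot chunk inside a .gz file; the usual recipe is to read with blocksize=None (one partition per file) and repartition afterwards. A sketch (bucket and pattern are placeholders), with the caveat that a single worker still has to decompress each file, so sufficient worker memory or spilling is needed:

    import dask.dataframe as dd

    # blocksize=None is required for gzip: one file -> one partition
    df = dd.read_csv(
        "s3://my-bucket/data/*.csv.gz",  # placeholder path
        compression="gzip",
        blocksize=None,
    )

    # split the oversized partitions once the data is readable
    df = df.repartition(partition_size="100MB")
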
3 votes, 1 answer

Can dask dashboard be used on SageMaker (Labs 1.2.*)?

I don't have browser access to the Lab environment, and the available Dask extension for JupyterLab hasn't worked for me so far. I want to be able to see progress and performance data for my Dask projects; no luck for now. compute() sometimes takes hours…
Alejandro
  • 519
  • 1
  • 6
  • 32
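
When the dashboard port can't be reached, a notebook/text progress readout still works without any browser access to port 8787. A sketch using distributed's progress function (the workload is a stand-in); on JupyterLab setups with jupyter-server-proxy installed, the dashboard is often also reachable under /proxy/8787/status:

    from dask.distributed import Client, progress
    import dask.array as da

    client = Client()

    x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
    result = x.mean().persist()   # start computing in the background
    progress(result)              # live progress bar in the notebook cell
    print(result.compute())
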
3 votes, 0 answers

Dask distributed.core - ERROR - 'tuple' object does not support item assignment

I am using Dask and Cython in my project: I invoke Cython code after registering with the client, and collect the result from the Cython code back into my Dask Python code. When I make a cluster with processes=True, it works fine. But as soon…
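
Whatever the Dask/Cython specifics, the error in the title is plain Python tuple immutability; a minimal illustration of the failure and the usual fix:

    result = (1, 2, 3)
    # result[0] = 10   # TypeError: 'tuple' object does not support item assignment

    fixed = list(result)   # copy into a mutable list first
    fixed[0] = 10
    result = tuple(fixed)
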
3 votes, 1 answer

Dask: How to return a tuple of futures in client.submit

I need to return a tuple from a task, which has to be unpacked in the main process because each element of the tuple will go to a different Dask task. I would like to avoid unnecessary communication, so I think that the tuple elements should be…
z4m0
  • 33
  • 4
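
A common recipe for this is to submit the task once and then create one lightweight future per element with operator.getitem, so each element stays on the workers instead of round-tripping through the client. A sketch with placeholder functions:

    import operator
    from dask.distributed import Client

    client = Client()

    def make_pair(x):
        return x + 1, x * 2

    pair = client.submit(make_pair, 10)                # future for the whole tuple
    first = client.submit(operator.getitem, pair, 0)   # future for element 0
    second = client.submit(operator.getitem, pair, 1)  # future for element 1

    # downstream tasks can depend on `first`/`second` directly
    total = client.submit(lambda a, b: a + b, first, second)

With the delayed API, dask.delayed(make_pair, nout=2) achieves the same unpacking.
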
3 votes, 1 answer

Saving to Parquet throws an error in Dask.dataframe

When performing the operation dask.dataframe.to_parquet(data): if data was read via Dask with a given number of partitions and you try to save it in Parquet format after having removed some columns, it fails with e.g. the following…
GMc
  • 189
  • 1
  • 9
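
Without the full traceback it's hard to be definitive, but here is a sketch of the described workflow with two commonly suggested tweaks: writing through the dataframe's own method after the drop, and resetting the index so the index metadata matches what gets written (source and column names are placeholders):

    import dask.dataframe as dd

    df = dd.read_csv("data/*.csv")                  # placeholder source
    df = df.drop(columns=["unused_a", "unused_b"])  # placeholder columns

    # re-derive the index metadata before writing
    df = df.reset_index(drop=True)
    df.to_parquet("out/", engine="pyarrow")
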
3 votes, 0 answers

Using Dask to load data from Azure Data Lake Gen2 with SAS Token

I'm looking for a way to load data from Azure Data Lake Gen2 using Dask. The contents of the container are only Parquet files, but I only have the account name, account endpoint, and a SAS token. When I use the Azure SDK for a file system, I can…
user4923
  • 31
  • 1
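
With the adlfs filesystem package installed, Dask can usually authenticate to ADLS Gen2 with just the account name and SAS token passed via storage_options (account, container, and path below are placeholders):

    import dask.dataframe as dd

    storage_options = {
        "account_name": "myaccount",      # placeholder
        "sas_token": "<your SAS token>",
    }

    df = dd.read_parquet(
        "abfs://mycontainer/path/to/data",  # adlfs protocol; placeholder path
        storage_options=storage_options,
    )
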
3 votes, 0 answers

Dask Dataframe: Resample partitioned data loaded from multiple parquet files

I am loading multiple Parquet files containing time-series data together. But the loaded Dask dataframe has unknown divisions, because of which I can't apply various time-series operations on it. df = dd.read_parquet('/path/to/*.parquet',…
Milan Jain
  • 459
  • 7
  • 17
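
Resampling needs known divisions on a datetime index; after a multi-file read they are typically unknown, so setting the index restores them (pass sorted=True only if the data really is globally sorted across files). A sketch, with 'timestamp' as a placeholder column name:

    import dask.dataframe as dd

    df = dd.read_parquet("/path/to/*.parquet")

    # tell Dask where each partition begins and ends
    df = df.set_index("timestamp", sorted=True)

    hourly = df.resample("1h").mean()
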
3 votes, 2 answers

produce vector output from a dask array

I have a large Dask array (labeled_arr) that is actually a labeled raster image (dtype is int64). I want to use rasterio to turn the labeled regions into polygons and combine them into a single list of polygons (or a GeoSeries with just a geometry…
Jessica
  • 505
  • 1
  • 3
  • 11
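
One pattern for vector output is to drop down to dask.delayed, running rasterio.features.shapes on each chunk and concatenating the lists afterwards. A sketch using the question's labeled_arr (note that shapes() does not accept int64, hence the cast, and regions that straddle chunk boundaries will come back split):

    import dask
    import numpy as np
    import rasterio.features

    @dask.delayed
    def vectorize_block(block):
        # shapes() yields (geojson_geometry, value) pairs per region
        return [geom for geom, value
                in rasterio.features.shapes(block.astype(np.int32))
                if value != 0]

    blocks = labeled_arr.to_delayed().ravel()   # one delayed object per chunk
    per_block = dask.compute(*[vectorize_block(b) for b in blocks])
    polygons = [geom for chunk in per_block for geom in chunk]
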
3 votes, 2 answers

Dask - how to assign task to the specific CPU

I'm using Dask to process research batches, which are quite heavy (from a few minutes to a few hours). There's no communication between the tasks, and they produce only side results. I'm using a machine which already virtualizes resources beneath it (~…
Piotr Rarus
  • 884
  • 8
  • 16
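
Dask schedules onto workers and threads rather than specific CPU cores, so the closest built-in control is worker resources: start each worker with an artificial resource and have every heavy task claim it, which guarantees one batch per worker. Pinning to a physical core would additionally need OS tooling (e.g. psutil or taskset) inside the task itself. A sketch with placeholder names and addresses:

    from dask.distributed import Client

    def run_batch(batch):
        return sum(batch)  # stand-in for the heavy computation

    # assumes workers were started like:
    #   dask-worker tcp://scheduler:8786 --nthreads 1 --resources "slot=1"
    client = Client("tcp://scheduler:8786")  # placeholder address

    batches = [[1, 2], [3, 4], [5, 6]]
    futures = [
        client.submit(run_batch, b, resources={"slot": 1})  # one slot per task
        for b in batches
    ]
    results = client.gather(futures)
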
3 votes, 0 answers

Make Dask-Yarn More Robust to Node Failures

We're using Dask, via Dask-Yarn, to distribute compute work across an EMR cluster. We've noticed that when we experience node failures, sometimes those failures take out the container running the scheduler, and our jobs fail. I was going…
gallamine
  • 865
  • 2
  • 12
  • 26
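
One lever dask-yarn exposes for this is deploy_mode: with deploy_mode='local' the scheduler runs in the client process (e.g. on the EMR master node) instead of inside a YARN container, so losing a worker node cannot take the scheduler down with it. A sketch (environment path and sizes are placeholders):

    from dask_yarn import YarnCluster
    from dask.distributed import Client

    cluster = YarnCluster(
        environment="environment.tar.gz",  # placeholder packaged env
        deploy_mode="local",               # scheduler lives with the client
        worker_vcores=2,
        worker_memory="4GiB",
        n_workers=8,
    )
    client = Client(cluster)
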
3 votes, 0 answers

Loading feather files from s3 with dask delayed

I have an S3 folder with multiple .feather files; I would like to load these into Dask using Python, as described here: Load many feather files in a folder into dask. I have tried two ways; both give me different errors: import pandas as pd import…
Dean
  • 105
  • 1
  • 6
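
A sketch of the delayed-loading pattern the linked answer describes, opening each file through s3fs so pandas gets a seekable file object (bucket and prefix are placeholders; assumes AWS credentials are already configured):

    import pandas as pd
    import dask.dataframe as dd
    import s3fs
    from dask import delayed

    fs = s3fs.S3FileSystem()
    paths = fs.glob("my-bucket/feathers/*.feather")  # placeholder location

    @delayed
    def load(path):
        with fs.open(path, "rb") as f:
            return pd.read_feather(f)

    df = dd.from_delayed([load(p) for p in paths])
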
3 votes, 3 answers

DASK - AttributeError: 'DataFrame' object has no attribute 'sort_values'

I am just trying to order a Dask dataframe by a specific column. CODE 1 - if I call it, it shows as indeed a ddf: my_ddf OUTPUT 1: npartitions=1 headers ..... CODE 2: my_ddf.sort_values('id', ascending=False) OUTPUT 2: AttributeError …
sogu
  • 2,738
  • 5
  • 31
  • 90
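
Recent Dask releases do ship DataFrame.sort_values, so upgrading may be enough; on older versions the usual substitutes are setting the index (a full shuffle-sort on that column) or nlargest when only the top rows matter. A sketch:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(
        pd.DataFrame({"id": [3, 1, 2], "val": ["a", "b", "c"]}),
        npartitions=1,
    )

    by_id = ddf.set_index("id")            # full sort by 'id' (ascending)
    top = ddf.nlargest(2, "id").compute()  # descending top-k, no full sort
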
3 votes, 1 answer

Python Dask: Searching for a value in a column and get the value of a different column

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(), 'B': 'one one two three two two one three'.split(), 'C': np.arange(8), 'D': np.arange(8) * 2}) Just imagine this dataframe. Now, with pandas it…
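
The same boolean-mask selection works in Dask as in pandas, just lazily. Using the question's own frame (wrapped here with a hypothetical two partitions):

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({
        "A": "foo bar foo bar foo bar foo foo".split(),
        "B": "one one two three two two one three".split(),
        "C": np.arange(8),
        "D": np.arange(8) * 2,
    })
    ddf = dd.from_pandas(pdf, npartitions=2)

    # rows where A == 'foo', returning column C
    values = ddf.loc[ddf["A"] == "foo", "C"].compute()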