Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers; a minimal example follows this list.
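
A minimal sketch of how the two pieces fit together, using the array collection (illustrative, not from the tag wiki itself):

    import dask.array as da

    # a larger-than-memory-style array, split into 1000x1000 chunks
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

    # operations only build a task graph; nothing runs yet
    y = (x + x.T).mean(axis=0)

    # the dynamic task scheduler executes the graph in parallel
    result = y.compute()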

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub (tutorial): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3 votes, 0 answers

`concurrent.futures._base.CancelledError` running Dask with pymoo

I'm solving a multi-objective optimization problem with pymoo, integrated with Dask. In pymoo you have to define a Problem class to pass to the optimizer; I started with this class: class ChungProblemWeightedDask(ChungProblemDask): def…
Andrex
  • 602
  • 1
  • 7
  • 22
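
For the question above, pymoo's documented pattern is to submit one Dask task per candidate solution inside _evaluate. A hedged sketch (class and objective names are made up; assumes pymoo >= 0.5 and a running cluster); a common cause of CancelledError is futures outliving or losing their Client, so the Client here stays alive for the whole optimization:

    import numpy as np
    from dask.distributed import Client
    from pymoo.core.problem import Problem

    client = Client()  # assumes a local or already-running cluster

    def evaluate_candidate(x):
        # stand-in for the real, expensive objective functions
        return [np.sum(x ** 2), np.sum((x - 1) ** 2)]

    class DaskProblem(Problem):
        def __init__(self):
            super().__init__(n_var=10, n_obj=2, xl=0.0, xu=1.0)

        def _evaluate(self, X, out, *args, **kwargs):
            # one task per row of X; gather blocks until all finish
            futures = [client.submit(evaluate_candidate, x) for x in X]
            out["F"] = np.array(client.gather(futures))
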
3 votes, 0 answers

Getting MemoryError While Clustering

I am trying to cluster my dataset in Python, but I am getting a MemoryError. The dataset has 442,000 rows and 12 columns (all float). The code is below: import psycopg2 as pg import dask.dataframe as dd from sklearn.cluster import…
mcsahin
  • 63
  • 1
  • 7
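
One way to avoid a MemoryError here is to stop materializing the whole dataset and train incrementally, one Dask partition at a time. A sketch with scikit-learn's MiniBatchKMeans (the CSV path is a placeholder; the question actually loads from PostgreSQL):

    import dask.dataframe as dd
    from sklearn.cluster import MiniBatchKMeans

    ddf = dd.read_csv("data.csv")  # placeholder source

    model = MiniBatchKMeans(n_clusters=8)
    for i in range(ddf.npartitions):
        # only one partition is in memory at any time
        part = ddf.get_partition(i).compute()
        model.partial_fit(part.values)
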
3 votes, 1 answer

Read single large zipped csv (too large for memory) using Dask

I have a use case where I have an S3 bucket containing a few hundred gzipped files. Each individual file, when unzipped and loaded into a dataframe, occupies more than the available memory. I'd like to read these files and perform some…
David Moye
  • 701
  • 4
  • 13
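
Gzip is not a splittable compression, so Dask cannot chunk inside a .gz file; the usual recipe is to read with blocksize=None (one partition per file) and repartition afterwards. A sketch (bucket and pattern are placeholders), with the caveat that a single worker still has to decompress each file, so sufficient worker memory or spilling is needed:

    import dask.dataframe as dd

    # blocksize=None is required for gzip: one file -> one partition
    df = dd.read_csv(
        "s3://my-bucket/data/*.csv.gz",  # placeholder path
        compression="gzip",
        blocksize=None,
    )

    # split the oversized partitions once the data is readable
    df = df.repartition(partition_size="100MB")
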
3 votes, 1 answer

Can dask dashboard be used on SageMaker (Labs 1.2.*)?

I don't have browser access to the Lab environment, and the available Dask extension for JupyterLab hasn't worked for me so far. I want to be able to see progress and performance data for my Dask projects; no luck for now. compute() sometimes takes hours…
Alejandro
  • 519
  • 1
  • 6
  • 32
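
When the dashboard port can't be reached, a notebook/text progress readout still works without any browser access to port 8787. A sketch using distributed's progress function (the workload is a stand-in); on JupyterLab setups with jupyter-server-proxy installed, the dashboard is often also reachable under /proxy/8787/status:

    from dask.distributed import Client, progress
    import dask.array as da

    client = Client()

    x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
    result = x.mean().persist()   # start computing in the background
    progress(result)              # live progress bar in the notebook cell
    print(result.compute())
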
3 votes, 0 answers

Dask distributed.core - ERROR - 'tuple' object does not support item assignment

I am using Dask and Cython in my project: I invoke Cython code after registering with the client, and collect the result from the Cython code back into my Dask Python code. When I make a cluster with processes=True, it works fine. But as soon…
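
Whatever the Dask/Cython specifics, the error in the title is plain Python tuple immutability; a minimal illustration of the failure and the usual fix:

    result = (1, 2, 3)
    # result[0] = 10   # TypeError: 'tuple' object does not support item assignment

    fixed = list(result)   # copy into a mutable list first
    fixed[0] = 10
    result = tuple(fixed)
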
3 votes, 1 answer

Dask: How to return a tuple of futures in client.submit

I need to return a tuple from a task, which has to be unpacked in the main process because each element of the tuple will go to a different Dask task. I would like to avoid unnecessary communication, so I think that the tuple elements should be…
z4m0
  • 33
  • 4
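
A common recipe for this is to submit the task once and then create one lightweight future per element with operator.getitem, so each element stays on the workers instead of round-tripping through the client. A sketch with placeholder functions:

    import operator
    from dask.distributed import Client

    client = Client()

    def make_pair(x):
        return x + 1, x * 2

    pair = client.submit(make_pair, 10)                # future for the whole tuple
    first = client.submit(operator.getitem, pair, 0)   # future for element 0
    second = client.submit(operator.getitem, pair, 1)  # future for element 1

    # downstream tasks can depend on `first`/`second` directly
    total = client.submit(lambda a, b: a + b, first, second)

With the delayed API, dask.delayed(make_pair, nout=2) achieves the same unpacking.
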
3 votes, 1 answer

Saving to Parquet throws an error in Dask.dataframe

When performing the operation dask.dataframe.to_parquet(data): if data was read via Dask with a given number of partitions and you try to save it in Parquet format after having removed some columns, it fails with e.g. the following…
GMc
  • 189
  • 1
  • 9
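
Without the full traceback it's hard to be definitive, but here is a sketch of the described workflow with two commonly suggested tweaks: writing through the dataframe's own method after the drop, and resetting the index so the index metadata matches what gets written (source and column names are placeholders):

    import dask.dataframe as dd

    df = dd.read_csv("data/*.csv")                  # placeholder source
    df = df.drop(columns=["unused_a", "unused_b"])  # placeholder columns

    # re-derive the index metadata before writing
    df = df.reset_index(drop=True)
    df.to_parquet("out/", engine="pyarrow")
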
3 votes, 0 answers

Using Dask to load data from Azure Data Lake Gen2 with SAS Token

I'm looking for a way to load data from Azure Data Lake Gen2 using Dask. The contents of the container are only Parquet files, but I only have the account name, account endpoint, and a SAS token. When I use the Azure SDK for a file system, I can…
user4923
  • 31
  • 1
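
With the adlfs filesystem package installed, Dask can usually authenticate to ADLS Gen2 with just the account name and SAS token passed via storage_options (account, container, and path below are placeholders):

    import dask.dataframe as dd

    storage_options = {
        "account_name": "myaccount",      # placeholder
        "sas_token": "<your SAS token>",
    }

    df = dd.read_parquet(
        "abfs://mycontainer/path/to/data",  # adlfs protocol; placeholder path
        storage_options=storage_options,
    )
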
3 votes, 0 answers

Dask Dataframe: Resample partitioned data loaded from multiple parquet files

I am loading multiple Parquet files containing time-series data together. But the loaded Dask dataframe has unknown divisions, because of which I can't apply various time-series operations on it. df = dd.read_parquet('/path/to/*.parquet',…
Milan Jain
  • 459
  • 7
  • 17
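
Resampling needs known divisions on a datetime index; after a multi-file read they are typically unknown, so setting the index restores them (pass sorted=True only if the data really is globally sorted across files). A sketch, with 'timestamp' as a placeholder column name:

    import dask.dataframe as dd

    df = dd.read_parquet("/path/to/*.parquet")

    # tell Dask where each partition begins and ends
    df = df.set_index("timestamp", sorted=True)

    hourly = df.resample("1h").mean()
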
3 votes, 2 answers

produce vector output from a dask array

I have a large Dask array (labeled_arr) that is actually a labeled raster image (dtype is int64). I want to use rasterio to turn the labeled regions into polygons and combine them into a single list of polygons (or a GeoSeries with just a geometry…
Jessica
  • 505
  • 1
  • 3
  • 11
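
One pattern for vector output is to drop down to dask.delayed, running rasterio.features.shapes on each chunk and concatenating the lists afterwards. A sketch using the question's labeled_arr (note that shapes() does not accept int64, hence the cast, and regions that straddle chunk boundaries will come back split):

    import dask
    import numpy as np
    import rasterio.features

    @dask.delayed
    def vectorize_block(block):
        # shapes() yields (geojson_geometry, value) pairs per region
        return [geom for geom, value
                in rasterio.features.shapes(block.astype(np.int32))
                if value != 0]

    blocks = labeled_arr.to_delayed().ravel()   # one delayed object per chunk
    per_block = dask.compute(*[vectorize_block(b) for b in blocks])
    polygons = [geom for chunk in per_block for geom in chunk]
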
3 votes, 2 answers

Dask - how to assign task to the specific CPU

I'm using Dask to process research batches, which are quite heavy (from a few minutes to a few hours). There's no communication between the tasks, and they produce only side results. I'm using a machine which already virtualizes resources beneath it (~…
Piotr Rarus
  • 884
  • 8
  • 16
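
Dask schedules onto workers and threads rather than specific CPU cores, so the closest built-in control is worker resources: start each worker with an artificial resource and have every heavy task claim it, which guarantees one batch per worker. Pinning to a physical core would additionally need OS tooling (e.g. psutil or taskset) inside the task itself. A sketch with placeholder names and addresses:

    from dask.distributed import Client

    def run_batch(batch):
        return sum(batch)  # stand-in for the heavy computation

    # assumes workers were started like:
    #   dask-worker tcp://scheduler:8786 --nthreads 1 --resources "slot=1"
    client = Client("tcp://scheduler:8786")  # placeholder address

    batches = [[1, 2], [3, 4], [5, 6]]
    futures = [
        client.submit(run_batch, b, resources={"slot": 1})  # one slot per task
        for b in batches
    ]
    results = client.gather(futures)
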
3 votes, 0 answers

Make Dask-Yarn More Robust to Node Failures

We're using Dask, via Dask-Yarn, to distribute compute work across an EMR cluster. We've noticed that when we experience node failures, sometimes those failures take out the container running the scheduler, and our jobs fail. I was going…
gallamine
  • 865
  • 2
  • 12
  • 26
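
One lever dask-yarn exposes for this is deploy_mode: with deploy_mode='local' the scheduler runs in the client process (e.g. on the EMR master node) instead of inside a YARN container, so losing a worker node cannot take the scheduler down with it. A sketch (environment path and sizes are placeholders):

    from dask_yarn import YarnCluster
    from dask.distributed import Client

    cluster = YarnCluster(
        environment="environment.tar.gz",  # placeholder packaged env
        deploy_mode="local",               # scheduler lives with the client
        worker_vcores=2,
        worker_memory="4GiB",
        n_workers=8,
    )
    client = Client(cluster)
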
3 votes, 0 answers

Loading feather files from s3 with dask delayed

I have an S3 folder with multiple .feather files; I would like to load these into Dask using Python, as described here: Load many feather files in a folder into dask. I have tried two ways; both give me different errors: import pandas as pd import…
Dean
  • 105
  • 1
  • 6
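
A sketch of the delayed-loading pattern the linked answer describes, opening each file through s3fs so pandas gets a seekable file object (bucket and prefix are placeholders; assumes AWS credentials are already configured):

    import pandas as pd
    import dask.dataframe as dd
    import s3fs
    from dask import delayed

    fs = s3fs.S3FileSystem()
    paths = fs.glob("my-bucket/feathers/*.feather")  # placeholder location

    @delayed
    def load(path):
        with fs.open(path, "rb") as f:
            return pd.read_feather(f)

    df = dd.from_delayed([load(p) for p in paths])
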
3 votes, 3 answers

DASK - AttributeError: 'DataFrame' object has no attribute 'sort_values'

I am just trying to order a Dask dataframe by a specific column. CODE 1 - if I call it, it shows as indeed a ddf: my_ddf OUTPUT 1: npartitions=1 headers ..... CODE 2: my_ddf.sort_values('id', ascending=False) OUTPUT 2: AttributeError …
sogu
  • 2,738
  • 5
  • 31
  • 90
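
Recent Dask releases do ship DataFrame.sort_values, so upgrading may be enough; on older versions the usual substitutes are setting the index (a full shuffle-sort on that column) or nlargest when only the top rows matter. A sketch:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(
        pd.DataFrame({"id": [3, 1, 2], "val": ["a", "b", "c"]}),
        npartitions=1,
    )

    by_id = ddf.set_index("id")            # full sort by 'id' (ascending)
    top = ddf.nlargest(2, "id").compute()  # descending top-k, no full sort
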
3 votes, 1 answer

Python Dask: Searching for a value in a column and get the value of a different column

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(), 'B': 'one one two three two two one three'.split(), 'C': np.arange(8), 'D': np.arange(8) * 2}) Just imagine this dataframe. Now, with pandas it…
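
The same boolean-mask selection works in Dask as in pandas, just lazily. Using the question's own frame (wrapped here with a hypothetical two partitions):

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({
        "A": "foo bar foo bar foo bar foo foo".split(),
        "B": "one one two three two two one three".split(),
        "C": np.arange(8),
        "D": np.arange(8) * 2,
    })
    ddf = dd.from_pandas(pdf, npartitions=2)

    # rows where A == 'foo', returning column C
    values = ddf.loc[ddf["A"] == "foo", "C"].compute()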