Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
4
votes
1 answer

How to efficiently convert npy to xarray / zarr

I have a 37 GB .npy file that I would like to convert to Zarr store so that I can include coordinate labels. I have code that does this in theory, but I keep running out of memory. I want to use Dask in-between to facilitate doing this in chunks,…
thomaskeefe
  • 1,900
  • 18
  • 19
4
votes
1 answer

Dask multi-stage resource setup causes Failed to Serialize Error

Using the exact code from Dask's documentation at https://jobqueue.dask.org/en/latest/examples.html In case the page changes, this is the code: from dask_jobqueue import SLURMCluster from distributed import Client from dask import delayed cluster =…
michaelgbj
  • 290
  • 1
  • 10
4
votes
2 answers

Dask Kubernetes strange behavior of adapt method

I have a Dask cluster on AKS and I want to run a function f in parallel, but have this function run in a single process allocated in a single pod. According to the documentation on Worker Resources I should start each worker with dask-worker…
Andrex
  • 602
  • 1
  • 7
  • 22
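One way to express "this task needs a whole worker process" is a resource annotation, sketched below with local stand-ins. On a real cluster the constraint is only honoured when workers are started with a matching resource (e.g. `dask-worker --resources "process=1"`); local schedulers ignore annotations, so this sketch runs anywhere:

```python
import dask
from dask import delayed

def f(x):
    # Stand-in for the function that must run alone in one pod/process.
    return x * 2

# On a distributed cluster whose workers declare --resources "process=1",
# each annotated task is confined to one such worker slot at a time.
with dask.annotate(resources={"process": 1}):
    tasks = [delayed(f)(i) for i in range(4)]

results = dask.compute(*tasks, scheduler="threads")
```

The resource name `"process"` is arbitrary; it just has to match between the worker startup flags and the annotation.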
4
votes
2 answers

Running two Tensorflow trainings in parallel using joblib and dask

I have the following code that runs two TensorFlow trainings in parallel using Dask workers implemented in Docker containers. I need to launch two processes, using the same dask client, where each will train their respective models with N…
ps0604
  • 1,227
  • 23
  • 133
  • 330
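The shape of the problem can be sketched with `dask.delayed` and stand-in training functions; the real TensorFlow fits would replace `train_model`, and for actual TF workloads process-based workers are usually preferable to the thread scheduler used here for portability:

```python
import dask
from dask import delayed

def train_model(name, n_steps):
    # Stand-in for one TensorFlow training routine.
    return {"model": name, "steps": n_steps}

# Build both tasks lazily, then run them in parallel on one scheduler.
tasks = [delayed(train_model)("model_a", 10),
         delayed(train_model)("model_b", 20)]
results = dask.compute(*tasks, scheduler="threads")
```

With a `distributed.Client` the same two delayed tasks would be scheduled across the Docker-based workers instead of local threads.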
4
votes
1 answer

Why does Dask seem to store Parquet inefficiently

When I save the same table using Pandas and Dask into Parquet, Pandas creates a 4k file, whereas Dask creates a 39M file. Create the dataframe import pandas as pd import pyarrow as pa import pyarrow.parquet as pq import dask.dataframe as dd n =…
Dahn
  • 1,397
  • 1
  • 10
  • 29
4
votes
1 answer

Get column value after searching for row in dask

I have a pandas dataframe that I converted to a dask dataframe using the from_pandas function of dask. It has three columns, namely col1, col2 and col3. Now I am searching for a specific row using daskdf[(daskdf.col1 == v1) & (daskdf.col2 == v2)] where…
Tanmay Bhatnagar
  • 2,330
  • 4
  • 30
  • 50
4
votes
0 answers

Convert a multi-dimension (3D) dask array to a dask dataframe

I have a tf.keras Model having LSTM as its first layer (3D tensor as input). I need to convert a dask array (3-D) into a dask dataframe (mandatory requirement for the module responsible for fitting the model) with 1 column (each cell is a 3d…
4
votes
1 answer

Merging on columns with dask

I have a simple script currently written with pandas that I want to convert to dask dataframes. In this script, I am executing a merge on two dataframes on user-specified columns and I am trying to convert it into dask. def merge_dfs(df1, df2,…
Eliran Turgeman
  • 1,526
  • 2
  • 16
  • 34
4
votes
1 answer

Dask Cluster: AttributeError: 'DataFrame' object has no attribute '_data'

I'm working with a Dask Cluster on GCP. I'm using this code to deploy it: from dask_cloudprovider.gcp import GCPCluster from dask.distributed import Client enviroment_vars = { 'EXTRA_PIP_PACKAGES': '"gcsfs"' } cluster = GCPCluster( …
4
votes
2 answers

How to read in a csv to a DASK dataframe so it will not have an “Unnamed: 0” column?

Goal I want to read in a csv to a DASK dataframe without getting an “Unnamed: 0” column. CODE mydtype = {'col1': 'object', 'col2': 'object', 'col3': 'object', 'col4': 'float32',} do =…
sogu
  • 2,738
  • 5
  • 31
  • 90
4
votes
1 answer

Dask: handling unresponsive workers

When using Dask with SGE or PBS clusters I sometimes have workers becoming unresponsive. These workers are highlighted in red in the dashboard Info section with their "Last seen" number constantly increasing. I know this can happen if submitted…
Thomas
  • 81
  • 7
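One knob worth knowing for unresponsive workers is the scheduler's worker time-to-live: workers that miss heartbeats for longer than the TTL are removed and their tasks rescheduled. A sketch of tightening it via `dask.config` (values are illustrative, and the settings only take effect on clusters started after they are set):

```python
import dask

# Tighten liveness settings so the scheduler drops unresponsive workers
# sooner and retries their tasks elsewhere; values here are illustrative.
dask.config.set({
    "distributed.scheduler.worker-ttl": "2 minutes",
    "distributed.comm.timeouts.connect": "30s",
})
ttl = dask.config.get("distributed.scheduler.worker-ttl")
```

The same keys can live in `~/.config/dask/distributed.yaml` so jobqueue-launched schedulers pick them up.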
4
votes
1 answer

Dask aws cluster error when initializing: User data is limited to 16384 bytes

I'm following the guide here: https://cloudprovider.dask.org/en/latest/packer.html#ec2cluster-with-rapids In particular I set up my instance with packer, and am now trying to run the final piece of code: cluster = EC2Cluster( …
ZirconCode
  • 805
  • 2
  • 10
  • 24
4
votes
1 answer

Dask crashing when saving to file?

I'm trying to one-hot encode a dataset, then group by a specific column so I can get one row for each item in that column with an aggregated view of which one-hot columns are true for that specific row. It seems to be working on small data and using…
Lostsoul
  • 25,013
  • 48
  • 144
  • 239
4
votes
1 answer

Is there a way of using dask jobqueue over ssh

Dask jobqueue seems to be a very nice solution for distributing jobs to PBS/Slurm managed clusters. However, if I'm understanding its use correctly, you must create an instance of "PBSCluster/SLURMCluster" on the head/login node. Then you can on the same…
4
votes
0 answers

Dask Dataframe from parquet files: OSError: Couldn't deserialize thrift: TProtocolException: Invalid data

I'm generating a Dask dataframe to be used downstream in a clustering algorithm supplied by dask-ml. In a previous step in my pipeline I read a dataframe from disk using the dask.dataframe.read_parquet, apply a transformation to add columns using…
Michael Wheeler
  • 849
  • 1
  • 10
  • 29