Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
4
votes
1 answer

How to efficiently convert npy to xarray / zarr

I have a 37 GB .npy file that I would like to convert to Zarr store so that I can include coordinate labels. I have code that does this in theory, but I keep running out of memory. I want to use Dask in-between to facilitate doing this in chunks,…
thomaskeefe
  • 1,900
  • 18
  • 19
4
votes
1 answer

Dask multi-stage resource setup causes Failed to Serialize Error

Using the exact code from Dask's documentation at https://jobqueue.dask.org/en/latest/examples.html In case the page changes, this is the code: from dask_jobqueue import SLURMCluster from distributed import Client from dask import delayed cluster =…
michaelgbj
  • 290
  • 1
  • 10
4
votes
2 answers

Dask Kubernetes strange behavior of adapt method

I have a Dask cluster on AKS and I want to run a function f in parallel, but have this function run in a single process allocated in a single pod. According to the documentation on Worker Resources I should start each worker with dask-worker…
Andrex
  • 602
  • 1
  • 7
  • 22
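One way to express "this task needs a whole worker process" is a resource annotation, sketched below with local stand-ins. On a real cluster the constraint is only honoured when workers are started with a matching resource (e.g. `dask-worker --resources "process=1"`); local schedulers ignore annotations, so this sketch runs anywhere:

```python
import dask
from dask import delayed

def f(x):
    # Stand-in for the function that must run alone in one pod/process.
    return x * 2

# On a distributed cluster whose workers declare --resources "process=1",
# each annotated task is confined to one such worker slot at a time.
with dask.annotate(resources={"process": 1}):
    tasks = [delayed(f)(i) for i in range(4)]

results = dask.compute(*tasks, scheduler="threads")
```

The resource name `"process"` is arbitrary; it just has to match between the worker startup flags and the annotation.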
4
votes
2 answers

Running two Tensorflow trainings in parallel using joblib and dask

I have the following code that runs two TensorFlow trainings in parallel using Dask workers implemented in Docker containers. I need to launch two processes, using the same dask client, where each will train their respective models with N…
ps0604
  • 1,227
  • 23
  • 133
  • 330
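The shape of the problem can be sketched with `dask.delayed` and stand-in training functions; the real TensorFlow fits would replace `train_model`, and for actual TF workloads process-based workers are usually preferable to the thread scheduler used here for portability:

```python
import dask
from dask import delayed

def train_model(name, n_steps):
    # Stand-in for one TensorFlow training routine.
    return {"model": name, "steps": n_steps}

# Build both tasks lazily, then run them in parallel on one scheduler.
tasks = [delayed(train_model)("model_a", 10),
         delayed(train_model)("model_b", 20)]
results = dask.compute(*tasks, scheduler="threads")
```

With a `distributed.Client` the same two delayed tasks would be scheduled across the Docker-based workers instead of local threads.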
4
votes
1 answer

Why does Dask seem to store Parquet inefficiently

When I save the same table using Pandas and Dask into Parquet, Pandas creates a 4k file, whereas Dask creates a 39M file. Create the dataframe import pandas as pd import pyarrow as pa import pyarrow.parquet as pq import dask.dataframe as dd n =…
Dahn
  • 1,397
  • 1
  • 10
  • 29
4
votes
1 answer

Get column value after searching for row in dask

I have a pandas dataframe that I converted to a dask dataframe using the from_pandas function of dask. It has three columns, namely col1, col2 and col3. Now I am searching for a specific row using daskdf[(daskdf.col1 == v1) & (daskdf.col2 == v2)] where…
Tanmay Bhatnagar
  • 2,330
  • 4
  • 30
  • 50
4
votes
0 answers

Convert a multi-dimension (3D) dask array to a dask dataframe

I have a tf.keras Model having LSTM as its first layer (3D tensor as input). I need to convert a dask array (3-D) into a dask dataframe (mandatory requirement for the module responsible for fitting the model) with 1 column (each cell is a 3d…
4
votes
1 answer

Merging on columns with dask

I have a simple script currently written with pandas that I want to convert to dask dataframes. In this script, I am executing a merge on two dataframes on user-specified columns and I am trying to convert it into dask. def merge_dfs(df1, df2,…
Eliran Turgeman
  • 1,526
  • 2
  • 16
  • 34
4
votes
1 answer

Dask Cluster: AttributeError: 'DataFrame' object has no attribute '_data'

I'm working with a Dask Cluster on GCP. I'm using this code to deploy it: from dask_cloudprovider.gcp import GCPCluster from dask.distributed import Client enviroment_vars = { 'EXTRA_PIP_PACKAGES': '"gcsfs"' } cluster = GCPCluster( …
4
votes
2 answers

How to read in a csv to a DASK dataframe so it will not have an “Unnamed: 0” column?

Goal I want to read in a csv to a DASK dataframe without getting an “Unnamed: 0” column. CODE mydtype = {'col1': 'object', 'col2': 'object', 'col3': 'object', 'col4': 'float32',} do =…
sogu
  • 2,738
  • 5
  • 31
  • 90
4
votes
1 answer

Dask: handling unresponsive workers

When using Dask with SGE or PBS clusters I sometimes have workers becoming unresponsive. These workers are highlighted in red in the dashboard Info section with their "Last seen" number constantly increasing. I know this can happen if submitted…
Thomas
  • 81
  • 7
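One knob worth knowing for unresponsive workers is the scheduler's worker time-to-live: workers that miss heartbeats for longer than the TTL are removed and their tasks rescheduled. A sketch of tightening it via `dask.config` (values are illustrative, and the settings only take effect on clusters started after they are set):

```python
import dask

# Tighten liveness settings so the scheduler drops unresponsive workers
# sooner and retries their tasks elsewhere; values here are illustrative.
dask.config.set({
    "distributed.scheduler.worker-ttl": "2 minutes",
    "distributed.comm.timeouts.connect": "30s",
})
ttl = dask.config.get("distributed.scheduler.worker-ttl")
```

The same keys can live in `~/.config/dask/distributed.yaml` so jobqueue-launched schedulers pick them up.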
4
votes
1 answer

Dask aws cluster error when initializing: User data is limited to 16384 bytes

I'm following the guide here: https://cloudprovider.dask.org/en/latest/packer.html#ec2cluster-with-rapids In particular I set up my instance with packer, and am now trying to run the final piece of code: cluster = EC2Cluster( …
ZirconCode
  • 805
  • 2
  • 10
  • 24
4
votes
1 answer

Dask crashing when saving to file?

I'm trying to one-hot encode a dataset, then group by a specific column so I can get one row for each item in that column with an aggregated view of which one-hot columns are true for that specific row. It seems to be working on small data and using…
Lostsoul
  • 25,013
  • 48
  • 144
  • 239
4
votes
1 answer

Is there a way of using dask jobqueue over ssh

Dask jobqueue seems to be a very nice solution for distributing jobs to PBS/Slurm managed clusters. However, if I'm understanding its use correctly, you must create an instance of "PBSCluster/SLURMCluster" on the head/login node. Then you can on the same…
4
votes
0 answers

Dask Dataframe from parquet files: OSError: Couldn't deserialize thrift: TProtocolException: Invalid data

I'm generating a Dask dataframe to be used downstream in a clustering algorithm supplied by dask-ml. In a previous step in my pipeline I read a dataframe from disk using the dask.dataframe.read_parquet, apply a transformation to add columns using…
Michael Wheeler
  • 849
  • 1
  • 10
  • 29