Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
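
The split between collections and schedulers is easiest to see in code. Below is a minimal sketch (not from the tag wiki): a Dask DataFrame builds a lazy task graph, and the scheduler only runs it when compute() is called.

```python
import dask.dataframe as dd
import pandas as pd

# A small pandas frame wrapped as a Dask "big data" collection with 2 partitions.
pdf = pd.DataFrame({"x": range(10), "y": range(10)})
ddf = dd.from_pandas(pdf, npartitions=2)

# Collection operations only build a task graph; nothing executes yet.
total = (ddf.x + ddf.y).sum()

# The dynamic task scheduler runs the graph on compute().
print(total.compute(scheduler="threads"))
```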

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub (tutorial): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4,440 questions

4 votes, 1 answer

Column Name Shift using read_csv in Dask

I'm attempting to use Intake to catalog a csv dataset. It uses the Dask implementation of read_csv which in turn uses the pandas implementation. The issue I'm seeing is that the csv files I'm loading don't have an index column so Dask is…
asked by Brenton (85)

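A hedged sketch of the usual workaround (the path and column names below are hypothetical): spell out the header handling and column names so pandas does not shift names onto the first data column; if I recall correctly, Intake's CSV driver forwards these keyword arguments to dask.dataframe.read_csv via its csv_kwargs option.

```python
import dask.dataframe as dd

ddf = dd.read_csv(
    "catalog_data/*.csv",            # hypothetical path
    header=None,                     # assumption: the files carry no header row
    names=["timestamp", "value"],    # hypothetical column names, given explicitly
)
print(ddf.columns)
```
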
4 votes, 1 answer

Avoiding memory overflow while using xarray dask apply_ufunc

I need to apply a function along the time dimension of an xarray dask array of this shape: dask.array…

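A minimal sketch of the chunk-friendly pattern, with made-up shapes and a stand-in function: keep the core dimension ("time") in a single chunk, chunk the spatial dimensions, and let dask="parallelized" apply the function block by block so the whole array is never loaded at once.

```python
import dask.array as dsa
import numpy as np
import xarray as xr

# Hypothetical dask-backed DataArray: the full "time" axis fits in one chunk,
# the spatial axes are split, so each task only sees a small block.
data = dsa.random.random((100, 600, 600), chunks=(100, 150, 150))
arr = xr.DataArray(data, dims=("time", "lat", "lon"))

def reduce_over_time(block):
    # stand-in for the real per-pixel function; "time" is the last axis here
    return block.mean(axis=-1)

result = xr.apply_ufunc(
    reduce_over_time,
    arr,
    input_core_dims=[["time"]],   # apply_ufunc moves core dims to the end
    dask="parallelized",
    output_dtypes=[np.float64],
)
print(result.compute().shape)     # (lat, lon)
```
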
4 votes, 2 answers

ERROR: Could not find a version that satisfies the requirement dask-cudf (from versions: none)

Describe the bug: When I am trying to import dask_cudf I get the following ERROR: ModuleNotFoundError Traceback (most recent call…
asked by sogu (2,738)

4 votes, 2 answers

How to speed up the 'for' loop in a python function?

I have a function var. I want to know the best possible way to run the for loop (for multiple coordinates: xs and ys) within this function quickly by multiprocessing/parallel processing by utilizing all the processors, cores, and RAM memory the…
asked by Gun (556)

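One common Dask answer, sketched with a dummy function and coordinates: wrap the per-coordinate call in dask.delayed so each iteration becomes a task, then compute them all in parallel.

```python
import dask
from dask import delayed

@delayed
def var(x, y):
    # stand-in for the real per-coordinate computation
    return x * x + y * y

xs = range(100)
ys = range(100)

# One lazy task per (x, y) pair; nothing runs until dask.compute().
tasks = [var(x, y) for x, y in zip(xs, ys)]

# "threads" is fine for NumPy-heavy work; pure-Python, CPU-bound code usually
# benefits more from scheduler="processes" or a distributed Client.
results = dask.compute(*tasks, scheduler="threads")
```
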
4 votes, 1 answer

Use a Dask Cluster in a PythonScriptStep

Is it possible to have a multi-node Dask cluster be the compute for a PythonScriptStep with AML Pipelines? We have a PythonScriptStep that uses featuretools's deep feature synthesis (dfs) (docs). ft.dfs() has a param, n_jobs which allows for…
asked by Anders Swanson (3,637)

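Whatever launches the cluster, the script inside the step only needs the scheduler's address. A minimal, hypothetical sketch (the address is an assumption):

```python
from dask.distributed import Client

# Connect to a scheduler that the pipeline has already started on another node.
client = Client("tcp://10.0.0.4:8786")   # hypothetical scheduler address
print(client)                            # work submitted here runs on the cluster
```
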
4 votes, 1 answer

Unexpected behaviour when chunking with multiple netcdf files in xarray/dask

I'm working with a set of 468 netcdf files summing up to 12GB in total. Each file has only one global snapshot of a geophysical variable, i.e. for each file the data shape is (1, 1801, 3600) corresponding to dimensions ('time', 'latitude',…
asked by susopeiz (673)

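For reference, a sketch of opening such a stack with explicit chunking (the glob is hypothetical): passing chunks to open_mfdataset controls the dask chunking instead of leaving it to the per-file defaults.

```python
import xarray as xr

ds = xr.open_mfdataset(
    "data/*.nc",            # hypothetical glob over the 468 files
    combine="by_coords",
    chunks={"time": 1},     # one chunk per time step, i.e. per file
    parallel=True,          # open the files in parallel with dask
)
print(ds.chunks)            # inspect the resulting chunking before computing
```
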
4 votes, 2 answers

merge parquet files with different schema using pandas and dask

I have a parquet directory with around 1000 files and the schemas are different. I wanted to merge all those files into an optimal number of files with file repartition. I am using pandas with pyarrow to read each partition file from the directory and…
asked by Learnis (526)

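A hedged Dask-only sketch (paths and the target partition count are assumptions): read the files individually, concatenate with column alignment so missing columns become NaN, then repartition and rewrite.

```python
import glob
import dask.dataframe as dd

files = sorted(glob.glob("input_parquet/*.parquet"))   # hypothetical path
parts = [dd.read_parquet(f, engine="pyarrow") for f in files]

# Like pandas.concat, dd.concat outer-joins the columns across schemas.
ddf = dd.concat(parts)

ddf = ddf.repartition(npartitions=50)                  # target number of output files
ddf.to_parquet("merged_parquet/", engine="pyarrow", write_index=False)
```
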
4 votes, 2 answers

How To Do Model Predict Using Distributed Dask With a Pre-Trained Keras Model?

I am loading my pre-trained keras model and then trying to parallelize a large number of input data using dask. Unfortunately, I'm running into some issues with this relating to how I'm creating my dask array. Any guidance would be greatly…
asked by Riley Hun (2,541)

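One pattern that sidesteps serializing the model from the client, sketched with a hypothetical model file and made-up data: load the model inside the task and map it over the blocks of a dask array.

```python
import dask.array as da
from dask.distributed import Client

client = Client()  # local cluster, for illustration

def predict_block(block, model_path="model.h5"):
    # Loading inside the task means the model is never pickled across the wire
    # (at the cost of re-loading it once per block). Assumes a single-output model.
    from tensorflow import keras
    model = keras.models.load_model(model_path)      # hypothetical model file
    return model.predict(block).ravel()              # one prediction per row

# Hypothetical inputs: rows are samples, columns are features.
X = da.random.random((100_000, 20), chunks=(10_000, 20))
preds = X.map_blocks(predict_block, dtype="float32", drop_axis=1)
print(preds.compute()[:5])
```
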
4 votes, 1 answer

Kubernetes and Dask and Scheduler

My code looks something like this def myfunc(param): # expensive stuff that takes 2-3h mylist = [...] client = Client(...) mgr = DeploymentMgr() # ... setup stateful set ... futures = client.map(myfunc, mylist, ..., resources={mgr.hash.upper():…
asked by r0f1 (2,717)

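For context on the resources= keyword in that excerpt, a small sketch with placeholder names: workers have to be started with a matching resource tag, and client.map then restricts the tasks to those workers.

```python
from dask.distributed import Client

def myfunc(param):
    return param * 2          # stand-in for the expensive 2-3h computation

mylist = list(range(8))

# Hypothetical scheduler address; the workers must be launched with a matching
# tag, e.g.:  dask-worker tcp://scheduler:8786 --resources "SLOT=1"
client = Client("tcp://scheduler:8786")
futures = client.map(myfunc, mylist, resources={"SLOT": 1})
results = client.gather(futures)
```
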
4 votes, 1 answer

Timeout OSError while running dask on local cluster

I am trying to run the following code on a Power PC with config: Operating System: Red Hat Enterprise Linux Server 7.6 (Maipo) CPE OS Name: cpe:/o:redhat:enterprise_linux:7.6:GA:server Kernel: Linux 3.10.0-957.21.3.el7.ppc64le …
asked by Coddy (549)

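A hedged first thing to try rather than a root-cause fix: raise the distributed connection timeouts before creating the local cluster, since slow or heavily loaded machines can exceed the defaults.

```python
import dask
from dask.distributed import Client, LocalCluster

# Lengthen the comm timeouts (defaults are much shorter) before start-up.
dask.config.set({
    "distributed.comm.timeouts.connect": "60s",
    "distributed.comm.timeouts.tcp": "60s",
})

cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)
print(client)
```
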
4 votes, 1 answer

How to reset index on concatenated dataframe in Dask

I'm new to Dask and thought this would be a simple task. I want to load data from multiple csv files and combine it into one Dask dataframe. In this example, there are 5 csv files with 10,000 rows of data in each. Obviously I want to give the…
asked by Bill (10,323)

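A sketch with a hypothetical file pattern. Worth noting: reset_index(drop=True) restarts the count inside each partition, so the result is not globally unique; a cumulative-sum column is a common way to build a monotonically increasing index.

```python
import dask.dataframe as dd

# Reading a glob concatenates the files into one Dask DataFrame.
ddf = dd.read_csv("data_*.csv")          # hypothetical pattern over the 5 files

# Per-partition 0..n-1 index (repeats across partitions).
ddf = ddf.reset_index(drop=True)

# Optional: a global 0..N-1 index via a cumulative sum over a column of ones.
ddf["idx"] = 1
ddf["idx"] = ddf["idx"].cumsum() - 1
ddf = ddf.set_index("idx", sorted=True)  # sorted=True avoids a full shuffle
```
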
4 votes, 3 answers

Dask: convert a dask.DataFrame to an xarray.Dataset

This is possible in pandas. I would like to do it with dask. Edit: raised on dask here FYI you can go from an xarray.Dataset to a Dask.DataFrame Pandas solution using .to_xarray: import pandas as pd import numpy as np df = pd.DataFrame([('falcon',…
asked by Ray Bell (1,508)

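For the other direction, xarray offers Dataset.to_dask_dataframe(), but as far as I know there is no lazy dask.DataFrame-to-Dataset converter; the straightforward route is to materialise to pandas and reuse its to_xarray(), as in this sketch.

```python
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"animal": ["falcon", "parrot"], "speed": [389.0, 24.0]})
ddf = dd.from_pandas(pdf, npartitions=1)

# Materialise to pandas, then reuse pandas' DataFrame.to_xarray().
ds = ddf.compute().to_xarray()
print(ds)
```
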
4 votes, 1 answer

AttributeError: module 'dask' has no attribute 'set_options'

I'm a rookie using Dask and I installed the new version 2.12.0 on my MacBook, macOS High Sierra 10.13.6. When I try to start the distributed mode with the code below: from dask.distributed import Client c = Client() I got the following…

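For anyone hitting this: dask.set_options was removed in newer Dask releases; dask.config.set is the replacement, usable globally or as a context manager, as in this sketch.

```python
import dask

# Global setting (replaces the old dask.set_options(...)).
dask.config.set(scheduler="threads")

# Or temporarily, as a context manager.
with dask.config.set(scheduler="synchronous"):
    pass  # code that should run single-threaded, e.g. for debugging
```
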
4 votes, 2 answers

MultiGPU Kmeans clustering with RAPIDs freezes

I am new to Python and Rapids.AI and I am trying to recreate SKLearn KMeans in a multi-node GPU setup (I have 2 GPUs) using Dask and RAPIDs (I am using rapids with its docker, which mounts a Jupyter Notebook too). The code I show below (also I show an…
asked by JuMoGar (1,740)

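A heavily hedged sketch of the multi-GPU pattern from memory; the cuML import path and the accepted input types vary between RAPIDS releases, and the data here is made up.

```python
import cupy as cp
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.dask.cluster import KMeans   # assumption: cuML's multi-GPU KMeans

# One Dask worker per GPU.
cluster = LocalCUDACluster(n_workers=2)
client = Client(cluster)

# Hypothetical data: CuPy-backed dask array, one chunk per worker.
X = da.random.random((1_000_000, 10), chunks=(500_000, 10)).map_blocks(cp.asarray)

km = KMeans(n_clusters=8)
km.fit(X)
print(km.cluster_centers_)
```
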
4 votes, 1 answer

dask.distributed SLURM cluster Nanny Timeout

I am trying to use the dask.distributed.SLURMCluster to submit batch jobs to a SLURM job scheduler on a supercomputing cluster. The jobs all submit as expected, but throw an error after 1 minute of running: asyncio.exceptions.TimeoutError: Nanny…
asked by Ovec8hkin (65)

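A hedged sketch of the usual suspects when nannies time out after about a minute (all values below are assumptions): make sure the worker jobs can actually reach the scheduler by pinning the network interface, and raise the connect timeout.

```python
import dask
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Give slow-to-start jobs more time to connect back to the scheduler.
dask.config.set({"distributed.comm.timeouts.connect": "120s"})

cluster = SLURMCluster(
    queue="normal",          # hypothetical SLURM partition
    cores=16,
    memory="64GB",
    walltime="01:00:00",
    interface="ib0",         # hypothetical high-speed interface reachable by workers
)
cluster.scale(jobs=4)        # submit 4 worker jobs
client = Client(cluster)
```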