Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
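
The split between collections and schedulers is easiest to see in code. Below is a minimal sketch (not from the tag wiki): a Dask DataFrame builds a lazy task graph, and the scheduler only runs it when compute() is called.

```python
import dask.dataframe as dd
import pandas as pd

# A small pandas frame wrapped as a Dask "big data" collection with 2 partitions.
pdf = pd.DataFrame({"x": range(10), "y": range(10)})
ddf = dd.from_pandas(pdf, npartitions=2)

# Collection operations only build a task graph; nothing executes yet.
total = (ddf.x + ddf.y).sum()

# The dynamic task scheduler runs the graph on compute().
print(total.compute(scheduler="threads"))
```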

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub (tutorial): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4,440 questions

4 votes, 1 answer

Column Name Shift using read_csv in Dask

I'm attempting to use Intake to catalog a csv dataset. It uses the Dask implementation of read_csv which in turn uses the pandas implementation. The issue I'm seeing is that the csv files I'm loading don't have an index column so Dask is…
asked by Brenton (85)

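A hedged sketch of the usual workaround (the path and column names below are hypothetical): spell out the header handling and column names so pandas does not shift names onto the first data column; if I recall correctly, Intake's CSV driver forwards these keyword arguments to dask.dataframe.read_csv via its csv_kwargs option.

```python
import dask.dataframe as dd

ddf = dd.read_csv(
    "catalog_data/*.csv",            # hypothetical path
    header=None,                     # assumption: the files carry no header row
    names=["timestamp", "value"],    # hypothetical column names, given explicitly
)
print(ddf.columns)
```
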
4 votes, 1 answer

Avoiding memory overflow while using xarray dask apply_ufunc

I need to apply a function along the time dimension of an xarray dask array of this shape: dask.array…

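A minimal sketch of the chunk-friendly pattern, with made-up shapes and a stand-in function: keep the core dimension ("time") in a single chunk, chunk the spatial dimensions, and let dask="parallelized" apply the function block by block so the whole array is never loaded at once.

```python
import dask.array as dsa
import numpy as np
import xarray as xr

# Hypothetical dask-backed DataArray: the full "time" axis fits in one chunk,
# the spatial axes are split, so each task only sees a small block.
data = dsa.random.random((100, 600, 600), chunks=(100, 150, 150))
arr = xr.DataArray(data, dims=("time", "lat", "lon"))

def reduce_over_time(block):
    # stand-in for the real per-pixel function; "time" is the last axis here
    return block.mean(axis=-1)

result = xr.apply_ufunc(
    reduce_over_time,
    arr,
    input_core_dims=[["time"]],   # apply_ufunc moves core dims to the end
    dask="parallelized",
    output_dtypes=[np.float64],
)
print(result.compute().shape)     # (lat, lon)
```
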
4 votes, 2 answers

ERROR: Could not find a version that satisfies the requirement dask-cudf (from versions: none)

Describe the bug: When I am trying to import dask_cudf I get the following ERROR: ModuleNotFoundError Traceback (most recent call…
asked by sogu (2,738)

4 votes, 2 answers

How to speed up the 'for' loop in a python function?

I have a function var. I want to know the best possible way to run the for loop (for multiple coordinates: xs and ys) within this function quickly by multiprocessing/parallel processing by utilizing all the processors, cores, and RAM memory the…
asked by Gun (556)

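One common Dask answer, sketched with a dummy function and coordinates: wrap the per-coordinate call in dask.delayed so each iteration becomes a task, then compute them all in parallel.

```python
import dask
from dask import delayed

@delayed
def var(x, y):
    # stand-in for the real per-coordinate computation
    return x * x + y * y

xs = range(100)
ys = range(100)

# One lazy task per (x, y) pair; nothing runs until dask.compute().
tasks = [var(x, y) for x, y in zip(xs, ys)]

# "threads" is fine for NumPy-heavy work; pure-Python, CPU-bound code usually
# benefits more from scheduler="processes" or a distributed Client.
results = dask.compute(*tasks, scheduler="threads")
```
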
4 votes, 1 answer

Use a Dask Cluster in a PythonScriptStep

Is it possible to have a multi-node Dask cluster be the compute for a PythonScriptStep with AML Pipelines? We have a PythonScriptStep that uses featuretools's deep feature synthesis (dfs) (docs). ft.dfs() has a param, n_jobs which allows for…
asked by Anders Swanson (3,637)

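Whatever launches the cluster, the script inside the step only needs the scheduler's address. A minimal, hypothetical sketch (the address is an assumption):

```python
from dask.distributed import Client

# Connect to a scheduler that the pipeline has already started on another node.
client = Client("tcp://10.0.0.4:8786")   # hypothetical scheduler address
print(client)                            # work submitted here runs on the cluster
```
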
4 votes, 1 answer

Unexpected behaviour when chunking with multiple netcdf files in xarray/dask

I'm working with a set of 468 netcdf files summing up to 12GB in total. Each file has only one global snapshot of a geophysical variable, i.e. for each file the data shape is (1, 1801, 3600) corresponding to dimensions ('time', 'latitude',…
asked by susopeiz (673)

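For reference, a sketch of opening such a stack with explicit chunking (the glob is hypothetical): passing chunks to open_mfdataset controls the dask chunking instead of leaving it to the per-file defaults.

```python
import xarray as xr

ds = xr.open_mfdataset(
    "data/*.nc",            # hypothetical glob over the 468 files
    combine="by_coords",
    chunks={"time": 1},     # one chunk per time step, i.e. per file
    parallel=True,          # open the files in parallel with dask
)
print(ds.chunks)            # inspect the resulting chunking before computing
```
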
4 votes, 2 answers

merge parquet files with different schema using pandas and dask

I have a parquet directory with around 1000 files and the schemas are different. I wanted to merge all those files into an optimal number of files with file repartition. I am using pandas with pyarrow to read each partition file from the directory and…
asked by Learnis (526)

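A hedged Dask-only sketch (paths and the target partition count are assumptions): read the files individually, concatenate with column alignment so missing columns become NaN, then repartition and rewrite.

```python
import glob
import dask.dataframe as dd

files = sorted(glob.glob("input_parquet/*.parquet"))   # hypothetical path
parts = [dd.read_parquet(f, engine="pyarrow") for f in files]

# Like pandas.concat, dd.concat outer-joins the columns across schemas.
ddf = dd.concat(parts)

ddf = ddf.repartition(npartitions=50)                  # target number of output files
ddf.to_parquet("merged_parquet/", engine="pyarrow", write_index=False)
```
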
4 votes, 2 answers

How To Do Model Predict Using Distributed Dask With a Pre-Trained Keras Model?

I am loading my pre-trained keras model and then trying to parallelize a large number of input data using dask. Unfortunately, I'm running into some issues with this relating to how I'm creating my dask array. Any guidance would be greatly…
asked by Riley Hun (2,541)

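One pattern that sidesteps serializing the model from the client, sketched with a hypothetical model file and made-up data: load the model inside the task and map it over the blocks of a dask array.

```python
import dask.array as da
from dask.distributed import Client

client = Client()  # local cluster, for illustration

def predict_block(block, model_path="model.h5"):
    # Loading inside the task means the model is never pickled across the wire
    # (at the cost of re-loading it once per block). Assumes a single-output model.
    from tensorflow import keras
    model = keras.models.load_model(model_path)      # hypothetical model file
    return model.predict(block).ravel()              # one prediction per row

# Hypothetical inputs: rows are samples, columns are features.
X = da.random.random((100_000, 20), chunks=(10_000, 20))
preds = X.map_blocks(predict_block, dtype="float32", drop_axis=1)
print(preds.compute()[:5])
```
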
4 votes, 1 answer

Kubernetes and Dask and Scheduler

My code looks something like this def myfunc(param): # expensive stuff that takes 2-3h mylist = [...] client = Client(...) mgr = DeploymentMgr() # ... setup stateful set ... futures = client.map(myfunc, mylist, ..., resources={mgr.hash.upper():…
asked by r0f1 (2,717)

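For context on the resources= keyword in that excerpt, a small sketch with placeholder names: workers have to be started with a matching resource tag, and client.map then restricts the tasks to those workers.

```python
from dask.distributed import Client

def myfunc(param):
    return param * 2          # stand-in for the expensive 2-3h computation

mylist = list(range(8))

# Hypothetical scheduler address; the workers must be launched with a matching
# tag, e.g.:  dask-worker tcp://scheduler:8786 --resources "SLOT=1"
client = Client("tcp://scheduler:8786")
futures = client.map(myfunc, mylist, resources={"SLOT": 1})
results = client.gather(futures)
```
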
4 votes, 1 answer

Timeout OSError while running dask on local cluster

I am trying to run the following code on a Power PC with config: Operating System: Red Hat Enterprise Linux Server 7.6 (Maipo) CPE OS Name: cpe:/o:redhat:enterprise_linux:7.6:GA:server Kernel: Linux 3.10.0-957.21.3.el7.ppc64le …
asked by Coddy (549)

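A hedged first thing to try rather than a root-cause fix: raise the distributed connection timeouts before creating the local cluster, since slow or heavily loaded machines can exceed the defaults.

```python
import dask
from dask.distributed import Client, LocalCluster

# Lengthen the comm timeouts (defaults are much shorter) before start-up.
dask.config.set({
    "distributed.comm.timeouts.connect": "60s",
    "distributed.comm.timeouts.tcp": "60s",
})

cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)
print(client)
```
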
4 votes, 1 answer

How to reset index on concatenated dataframe in Dask

I'm new to Dask and thought this would be a simple task. I want to load data from multiple csv files and combine it into one Dask dataframe. In this example, there are 5 csv files with 10,000 rows of data in each. Obviously I want to give the…
asked by Bill (10,323)

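A sketch with a hypothetical file pattern. Worth noting: reset_index(drop=True) restarts the count inside each partition, so the result is not globally unique; a cumulative-sum column is a common way to build a monotonically increasing index.

```python
import dask.dataframe as dd

# Reading a glob concatenates the files into one Dask DataFrame.
ddf = dd.read_csv("data_*.csv")          # hypothetical pattern over the 5 files

# Per-partition 0..n-1 index (repeats across partitions).
ddf = ddf.reset_index(drop=True)

# Optional: a global 0..N-1 index via a cumulative sum over a column of ones.
ddf["idx"] = 1
ddf["idx"] = ddf["idx"].cumsum() - 1
ddf = ddf.set_index("idx", sorted=True)  # sorted=True avoids a full shuffle
```
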
4 votes, 3 answers

Dask: convert a dask.DataFrame to an xarray.Dataset

This is possible in pandas. I would like to do it with dask. Edit: raised on dask here FYI you can go from an xarray.Dataset to a Dask.DataFrame Pandas solution using .to_xarray: import pandas as pd import numpy as np df = pd.DataFrame([('falcon',…
asked by Ray Bell (1,508)

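For the other direction, xarray offers Dataset.to_dask_dataframe(), but as far as I know there is no lazy dask.DataFrame-to-Dataset converter; the straightforward route is to materialise to pandas and reuse its to_xarray(), as in this sketch.

```python
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"animal": ["falcon", "parrot"], "speed": [389.0, 24.0]})
ddf = dd.from_pandas(pdf, npartitions=1)

# Materialise to pandas, then reuse pandas' DataFrame.to_xarray().
ds = ddf.compute().to_xarray()
print(ds)
```
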
4 votes, 1 answer

AttributeError: module 'dask' has no attribute 'set_options'

I'm a rookie using Dask and I installed the new version 2.12.0 on my MacBook, macOS High Sierra 10.13.6. When I try to start the distributed mode with the code below: from dask.distributed import Client c = Client() I got the following…

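For anyone hitting this: dask.set_options was removed in newer Dask releases; dask.config.set is the replacement, usable globally or as a context manager, as in this sketch.

```python
import dask

# Global setting (replaces the old dask.set_options(...)).
dask.config.set(scheduler="threads")

# Or temporarily, as a context manager.
with dask.config.set(scheduler="synchronous"):
    pass  # code that should run single-threaded, e.g. for debugging
```
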
4 votes, 2 answers

MultiGPU Kmeans clustering with RAPIDs freezes

I am new to Python and Rapids.AI and I am trying to recreate SKLearn KMeans in a multi-node GPU setup (I have 2 GPUs) using Dask and RAPIDs (I am using rapids with its docker, which mounts a Jupyter Notebook too). The code I show below (also I show an…
asked by JuMoGar (1,740)

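A heavily hedged sketch of the multi-GPU pattern from memory; the cuML import path and the accepted input types vary between RAPIDS releases, and the data here is made up.

```python
import cupy as cp
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.dask.cluster import KMeans   # assumption: cuML's multi-GPU KMeans

# One Dask worker per GPU.
cluster = LocalCUDACluster(n_workers=2)
client = Client(cluster)

# Hypothetical data: CuPy-backed dask array, one chunk per worker.
X = da.random.random((1_000_000, 10), chunks=(500_000, 10)).map_blocks(cp.asarray)

km = KMeans(n_clusters=8)
km.fit(X)
print(km.cluster_centers_)
```
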
4 votes, 1 answer

dask.distributed SLURM cluster Nanny Timeout

I am trying to use the dask.distributed.SLURMCluster to submit batch jobs to a SLURM job scheduler on a supercomputing cluster. The jobs all submit as expected, but throw an error after 1 minute of running: asyncio.exceptions.TimeoutError: Nanny…
asked by Ovec8hkin (65)

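A hedged sketch of the usual suspects when nannies time out after about a minute (all values below are assumptions): make sure the worker jobs can actually reach the scheduler by pinning the network interface, and raise the connect timeout.

```python
import dask
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Give slow-to-start jobs more time to connect back to the scheduler.
dask.config.set({"distributed.comm.timeouts.connect": "120s"})

cluster = SLURMCluster(
    queue="normal",          # hypothetical SLURM partition
    cores=16,
    memory="64GB",
    walltime="01:00:00",
    interface="ib0",         # hypothetical high-speed interface reachable by workers
)
cluster.scale(jobs=4)        # submit 4 worker jobs
client = Client(cluster)
```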