Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
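The two components fit together in a minimal sketch: a Dask array (one of the "Big Data" collections) only builds a lazy task graph, and `.compute()` hands that graph to the dynamic task scheduler for parallel execution.

```python
import dask.array as da

# Build a lazy, chunked array -- no computation happens yet,
# only a task graph is constructed.
x = da.ones((1000, 1000), chunks=(100, 100))

# .compute() hands the graph to the dynamic task scheduler,
# which runs the per-chunk operations in parallel.
result = (x + x.T).sum().compute()
print(result)  # 2000000.0
```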

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
6
votes
0 answers

Value type error when using XGBoost with dask distributed

Here is the code that reproduces the error on my machine: import numpy as np import xgboost as xgb import dask.array as da import dask.distributed from dask_cuda import LocalCUDACluster from dask.distributed import Client X =…
lara_toff
  • 413
  • 2
  • 14
6
votes
2 answers

Running two dask-ml imputers simultaneously instead of sequentially

I can impute the mean and most frequent value using dask-ml like so, and this works fine: mean_imputer = impute.SimpleImputer(strategy='mean') most_frequent_imputer = impute.SimpleImputer(strategy='most_frequent') data = [[100, 2, 5], [np.nan, np.nan,…
ps0604
  • 1,227
  • 23
  • 133
  • 330
6
votes
2 answers

Dask distributed.scheduler - ERROR - Couldn't gather keys

import joblib from sklearn.externals.joblib import parallel_backend with joblib.parallel_backend('dask'): from dask_ml.model_selection import GridSearchCV import xgboost from xgboost import XGBRegressor grid_search =…
praveen pravii
  • 193
  • 2
  • 9
6
votes
2 answers

Dask Memory leakage issue with json and requests

This is just a minimal sample test to reproduce the memory leakage issue in a remote Dask kubernetes cluster. def load_geojson(pid): import requests import io r =…
6
votes
0 answers

Speed up selecting elements in combined netCDF files using Xarray and Dask

I am new to Xarray and Dask and trying to access multiple netCDF files that store global ocean current velocities on 3H interval. Each netCDF file covers one time interval of gridded data of 1/4 degree resolution: NetCDF dimension information: Name:…
JobS
  • 61
  • 3
6
votes
2 answers

Dask - How to connect to running cluster scheduler and access 'total_occupancy'?

I use the following to create a local cluster from a Jupyter notebook: from dask.distributed import Client, LocalCluster cluster = LocalCluster(n_workers=24) c = Client(cluster) Is it possible to connect from another notebook when the kernel is…
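One approach is a sketch along these lines: connect a `Client` to the running scheduler (from a second notebook you would pass the scheduler's `tcp://` address instead of the cluster object), then use `run_on_scheduler`, which executes a function inside the scheduler process and injects the `Scheduler` object via the `dask_scheduler` keyword. The `LocalCluster` here stands in for the cluster started elsewhere.

```python
from dask.distributed import Client, LocalCluster

# Stand-in for the cluster started in the first notebook; from a
# second notebook you would connect by address instead:
#   client = Client("tcp://<scheduler-ip>:8786")
cluster = LocalCluster(n_workers=2, processes=False, dashboard_address=None)
client = Client(cluster)

# run_on_scheduler runs the function in the scheduler process;
# distributed passes the Scheduler object as `dask_scheduler`.
occupancy = client.run_on_scheduler(
    lambda dask_scheduler: dask_scheduler.total_occupancy
)
print(occupancy)

client.close()
cluster.close()
```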
DavidK
  • 2,495
  • 3
  • 23
  • 38
6
votes
0 answers

Is it possible to launch dask clusters on hpc (slurm) remotely from local computer?

I am new to Dask. I understand that to start a dask cluster, I would normally have to ssh to my HPC cluster and start SLURMCluster(), and after it's started, run Client('node_ip') on my local computer. I was…
user252046
  • 399
  • 2
  • 11
6
votes
1 answer

How to change the datatype of column in dask dataframe?

I have a column in my dask dataframe whose datatype is integer, and I want to change it to float. How can I do this? fid_price_df.head(3) fid selling_price 0 98101 439.00 1 67022 419.00 2 131142 299.00 In the…
Rahul
  • 325
  • 5
  • 11
6
votes
2 answers

How to separate files using dask groupby on a column

I have a large set of csv files (file_1.csv, file_2.csv), separated by time period, that can't fit into memory. Each file will be in the format mentioned below. | instrument | time | code | val …
RTM
  • 759
  • 2
  • 9
  • 22
6
votes
1 answer

Dask - WARNING - Worker exceeded 95% memory budget

I am getting the error Dask - WARNING - Worker exceeded 95% memory budget. I am working on a local PC with 4 physical and 8 virtual cores. I have tried the following: Per... Managing worker memory on a dask localcluster ...and the documentation…
P. S.R.
  • 129
  • 1
  • 12
6
votes
1 answer

Specify dashboard port for dask

Is there a way to manually specify the port for the dashboard when creating a dask cluster using dask-jobqueue? When 8787 is taken, it randomly picks a different port, which means that one needs to set up a different tunneling every time. from…
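With dask-jobqueue, the dashboard port can be pinned through `scheduler_options`, which is forwarded to the scheduler. A configuration sketch (the core/memory values and port 8799 are arbitrary placeholders):

```python
from dask_jobqueue import SLURMCluster

# scheduler_options is passed through to the Scheduler; pinning
# dashboard_address keeps the dashboard on a fixed port so the
# same SSH tunnel works across restarts.
cluster = SLURMCluster(
    cores=4,            # placeholder values
    memory="8GB",
    scheduler_options={"dashboard_address": ":8799"},
)
```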
tlamadon
  • 970
  • 9
  • 18
6
votes
3 answers

Apply function along time dimension of XArray

I have an image stack stored in an XArray DataArray with dimensions time, x, y on which I'd like to apply a custom function along the time axis of each pixel such that the output is a single image of dimensions x,y. I have tried: apply_ufunc but the…
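A sketch of the `apply_ufunc` route, with random data standing in for the image stack: declaring `time` as an input core dimension moves it to the last axis of the array the function receives, so a reduction over `axis=-1` collapses time and leaves an `(x, y)` image.

```python
import numpy as np
import xarray as xr

# Stand-in image stack: 5 frames of a 3x4 image.
stack = xr.DataArray(np.random.rand(5, 3, 4), dims=("time", "x", "y"))

# input_core_dims moves "time" to the last axis of the array handed
# to the function, so reducing over axis=-1 collapses the time axis.
out = xr.apply_ufunc(
    np.mean,               # any per-pixel reduction works here
    stack,
    input_core_dims=[["time"]],
    kwargs={"axis": -1},
)
print(out.dims)  # ('x', 'y')
```

For dask-backed arrays, `dask="parallelized"` (with `output_dtypes`) can be added to keep the computation lazy.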
System123
  • 523
  • 5
  • 14
6
votes
2 answers

Dask + pyinstaller fails

I am trying to use dask dataframes in an executable packaged with pyinstaller. I just import dask in my executable and package it with pyinstaller scripts.py. When I run it, I get that /some/path/dask.yaml is not found. Does somebody know…
luca pescatore
  • 119
  • 1
  • 8
6
votes
2 answers

How to use Dask to read data from SQL?

There are not enough examples in the documentation on how to read data from SQLAlchemy into a dask dataframe. Some examples I see are in terms of: df = dd.read_sql_table(table='my_table_name', uri=my_sqlalchemy_con_url, index_col='id') But my…
Viv
  • 1,474
  • 5
  • 28
  • 47
6
votes
0 answers

Huge memory use difference between dask and dask.distributed

I am trying to use dask.delayed to compute a large matrix for use in a later calculation. I am only ever running the code on a single local machine. When I use a dask single-machine scheduler it works fine, but is a little slow. To access more…
Nick W.
  • 61
  • 4