Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
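The two components fit together in a minimal sketch: a Dask array (one of the "Big Data" collections) only builds a lazy task graph, and `.compute()` hands that graph to the dynamic task scheduler for parallel execution.

```python
import dask.array as da

# Build a lazy, chunked array -- no computation happens yet,
# only a task graph is constructed.
x = da.ones((1000, 1000), chunks=(100, 100))

# .compute() hands the graph to the dynamic task scheduler,
# which runs the per-chunk operations in parallel.
result = (x + x.T).sum().compute()
print(result)  # 2000000.0
```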

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
6
votes
0 answers

Value type error when using XGBoost with dask distributed

Here is the code that reproduces the error on my machine: import numpy as np import xgboost as xgb import dask.array as da import dask.distributed from dask_cuda import LocalCUDACluster from dask.distributed import Client X =…
lara_toff
  • 413
  • 2
  • 14
6
votes
2 answers

Running two dask-ml imputers simultaneously instead of sequentially

I can impute the mean and most frequent value using dask-ml like so, and this works fine: mean_imputer = impute.SimpleImputer(strategy='mean') most_frequent_imputer = impute.SimpleImputer(strategy='most_frequent') data = [[100, 2, 5], [np.nan, np.nan,…
ps0604
  • 1,227
  • 23
  • 133
  • 330
6
votes
2 answers

Dask distributed.scheduler - ERROR - Couldn't gather keys

import joblib from sklearn.externals.joblib import parallel_backend with joblib.parallel_backend('dask'): from dask_ml.model_selection import GridSearchCV import xgboost from xgboost import XGBRegressor grid_search =…
praveen pravii
  • 193
  • 2
  • 9
6
votes
2 answers

Dask Memory leakage issue with json and requests

This is just a minimal sample test to reproduce the memory leakage issue in a remote Dask kubernetes cluster. def load_geojson(pid): import requests import io r =…
6
votes
0 answers

Speed up selecting elements in combined netCDF files using Xarray and Dask

I am new to Xarray and Dask and trying to access multiple netCDF files that store global ocean current velocities on 3H interval. Each netCDF file covers one time interval of gridded data of 1/4 degree resolution: NetCDF dimension information: Name:…
JobS
  • 61
  • 3
6
votes
2 answers

Dask - How to connect to running cluster scheduler and access 'total_occupancy'?

I use the following to create a local cluster from a Jupyter notebook: from dask.distributed import Client, LocalCluster cluster = LocalCluster(n_workers=24) c = Client(cluster) Is it possible to connect from another notebook when the kernel is…
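One approach is a sketch along these lines: connect a `Client` to the running scheduler (from a second notebook you would pass the scheduler's `tcp://` address instead of the cluster object), then use `run_on_scheduler`, which executes a function inside the scheduler process and injects the `Scheduler` object via the `dask_scheduler` keyword. The `LocalCluster` here stands in for the cluster started elsewhere.

```python
from dask.distributed import Client, LocalCluster

# Stand-in for the cluster started in the first notebook; from a
# second notebook you would connect by address instead:
#   client = Client("tcp://<scheduler-ip>:8786")
cluster = LocalCluster(n_workers=2, processes=False, dashboard_address=None)
client = Client(cluster)

# run_on_scheduler runs the function in the scheduler process;
# distributed passes the Scheduler object as `dask_scheduler`.
occupancy = client.run_on_scheduler(
    lambda dask_scheduler: dask_scheduler.total_occupancy
)
print(occupancy)

client.close()
cluster.close()
```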
DavidK
  • 2,495
  • 3
  • 23
  • 38
6
votes
0 answers

Is it possible to launch dask clusters on hpc (slurm) remotely from local computer?

I am new to Dask. I understand that to start a dask cluster, I would normally have to ssh to my HPC cluster and start SLURMCluster(), and after it's started, run Client('node_ip') on my local computer. I was…
user252046
  • 399
  • 2
  • 11
6
votes
1 answer

How to change the datatype of column in dask dataframe?

I have a column in my dask dataframe whose datatype is integer, and I want to change it to float. How can I do this? fid_price_df.head(3) fid selling_price 0 98101 439.00 1 67022 419.00 2 131142 299.00 In the…
Rahul
  • 325
  • 5
  • 11
6
votes
2 answers

How to separate files using dask groupby on a column

I have a large set of csv files (file_1.csv, file_2.csv), separated by time period, that can't fit into memory. Each file will be in the format mentioned below. | instrument | time | code | val …
RTM
  • 759
  • 2
  • 9
  • 22
6
votes
1 answer

Dask - WARNING - Worker exceeded 95% memory budget

I am getting the error Dask - WARNING - Worker exceeded 95% memory budget. I am working on a local PC with 4 physical and 8 virtual cores. I have tried the following: Per... Managing worker memory on a dask localcluster ...and the documentation…
P. S.R.
  • 129
  • 1
  • 12
6
votes
1 answer

Specify dashboard port for dask

Is there a way to manually specify the port for the dashboard when creating a dask cluster using dask-jobqueue? When 8787 is taken, it randomly picks a different port, which means that one needs to set up a different tunneling every time. from…
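With dask-jobqueue, the dashboard port can be pinned through `scheduler_options`, which is forwarded to the scheduler. A configuration sketch (the core/memory values and port 8799 are arbitrary placeholders):

```python
from dask_jobqueue import SLURMCluster

# scheduler_options is passed through to the Scheduler; pinning
# dashboard_address keeps the dashboard on a fixed port so the
# same SSH tunnel works across restarts.
cluster = SLURMCluster(
    cores=4,            # placeholder values
    memory="8GB",
    scheduler_options={"dashboard_address": ":8799"},
)
```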
tlamadon
  • 970
  • 9
  • 18
6
votes
3 answers

Apply function along time dimension of XArray

I have an image stack stored in an XArray DataArray with dimensions time, x, y on which I'd like to apply a custom function along the time axis of each pixel such that the output is a single image of dimensions x,y. I have tried: apply_ufunc but the…
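A sketch of the `apply_ufunc` route, with random data standing in for the image stack: declaring `time` as an input core dimension moves it to the last axis of the array the function receives, so a reduction over `axis=-1` collapses time and leaves an `(x, y)` image.

```python
import numpy as np
import xarray as xr

# Stand-in image stack: 5 frames of a 3x4 image.
stack = xr.DataArray(np.random.rand(5, 3, 4), dims=("time", "x", "y"))

# input_core_dims moves "time" to the last axis of the array handed
# to the function, so reducing over axis=-1 collapses the time axis.
out = xr.apply_ufunc(
    np.mean,               # any per-pixel reduction works here
    stack,
    input_core_dims=[["time"]],
    kwargs={"axis": -1},
)
print(out.dims)  # ('x', 'y')
```

For dask-backed arrays, `dask="parallelized"` (with `output_dtypes`) can be added to keep the computation lazy.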
System123
  • 523
  • 5
  • 14
6
votes
2 answers

Dask + pyinstaller fails

I am trying to use dask dataframes in an executable packaged with pyinstaller. I just import dask in my executable and package it with pyinstaller scripts.py. When I run it, I get that /some/path/dask.yaml is not found. Does somebody know…
luca pescatore
  • 119
  • 1
  • 8
6
votes
2 answers

How to use Dask to read data from SQL?

There are not enough examples in the documentation on how to read data from SQLAlchemy into a dask dataframe. Some examples I see are in terms of: df = dd.read_sql_table(table='my_table_name', uri=my_sqlalchemy_con_url, index_col='id') But my…
Viv
  • 1,474
  • 5
  • 28
  • 47
6
votes
0 answers

Huge memory use difference between dask and dask.distributed

I am trying to use dask.delayed to compute a large matrix for use in a later calculation. I am only ever running the code on a single local machine. When I use a dask single-machine scheduler it works fine, but is a little slow. To access more…
Nick W.
  • 61
  • 4