Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
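
For instance, a minimal sketch of how the two pieces fit together; the file name and column names below are purely illustrative:

    import dask.dataframe as dd

    # A lazy, pandas-like collection; nothing is read yet
    ddf = dd.read_csv("data.csv")

    # Building the computation only constructs a task graph...
    result = ddf.groupby("key")["value"].mean()

    # ...which the dynamic task scheduler executes on compute()
    print(result.compute())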

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
9 votes · 2 answers

Dask equivalent to Pandas replace?

Something I use regularly in pandas is the .replace operation, and I am struggling to see how to readily perform the same operation on a dask dataframe: df.replace('PASS', '0', inplace=True) df.replace('FAIL', '1', inplace=True)
docross · 93
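
A minimal sketch of one way to do this, assuming a dask DataFrame built from an illustrative pandas frame; note that the result is reassigned rather than modified with inplace=True:

    import pandas as pd
    import dask.dataframe as dd

    # Illustrative frame with the PASS/FAIL values from the question
    pdf = pd.DataFrame({"status": ["PASS", "FAIL", "PASS"]})
    ddf = dd.from_pandas(pdf, npartitions=1)

    # dask.dataframe mirrors pandas' replace; reassign instead of inplace=True
    ddf = ddf.replace({"PASS": "0", "FAIL": "1"})
    print(ddf.compute())
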
9 votes · 3 answers

How to specify the directory that dask uses for temporary files?

Dask seems to write to the /tmp folder. How can I change the folder that dask uses for temporary files?
Arco Bast · 3,595
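
One possible approach (the path below is illustrative) is to point dask's temporary-directory configuration key somewhere else before any work starts; the same key can also be set through the DASK_TEMPORARY_DIRECTORY environment variable or a YAML file under ~/.config/dask/:

    import dask

    # Redirect dask's scratch space away from /tmp (path is illustrative)
    dask.config.set({"temporary-directory": "/scratch/dask-tmp"})
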
9 votes · 2 answers

How can I select data from a dask dataframe by a list of indices?

I want to select rows from a dask dataframe based on a list of indices. How can I do that? Example: Let's say, I have the following dask dataframe. dict_ = {'A':[1,2,3,4,5,6,7], 'B':[2,3,4,5,6,7,8], 'index':['x1', 'a2', 'x3', 'c4', 'x5', 'y6',…
Arco Bast · 3,595
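
A sketch of one way to do this, using stand-in data in the spirit of the truncated example above:

    import pandas as pd
    import dask.dataframe as dd

    # Stand-in frame; values are illustrative, not the question's full data
    pdf = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4],
                        "index": ["x1", "a2", "x3"]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    wanted = ["x1", "x3"]

    # Robust option: boolean masking with isin
    subset = ddf[ddf["index"].isin(wanted)]
    print(subset.compute())

    # Alternative (often faster when divisions are known): make the column
    # the index, then select labels with .loc
    # subset = ddf.set_index("index").loc[wanted]
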
9 votes · 1 answer

Item assignment to Python dask array objects

I've created a Python dask array and I'm trying to modify a slice of the array as follows: import numpy as np import dask.array as da x = np.random.random((20000, 100, 100)) # Create numpy array dx = da.from_array(x, chunks=(x.shape[0], 10, 10)) #…
Lcg3 · 93
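
A sketch of two possibilities, with the array sizes shrunk from the question's for a quick run; direct slice assignment is available only in newer dask releases, while da.where works everywhere:

    import numpy as np
    import dask.array as da

    x = np.random.random((200, 100, 100))        # smaller than the question's array
    dx = da.from_array(x, chunks=(200, 10, 10))

    # Recent dask releases support slice assignment directly; older ones
    # raise NotImplementedError here
    dx[0, :, :] = 0.0

    # Version-independent alternative: build a new array functionally
    dx2 = da.where(dx > 0.5, 0.0, dx)
    print(dx2[0, 0, :5].compute())
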
8 votes · 2 answers

Fastest way to get the minimum value of data array in another paired bin array

I have three 1D arrays: idxs (the index data), weights (the weight of each index in idxs), and bins (the bins used to calculate the minimum weight within each). Here's my current method of using idxs to check which bin each weight falls into, and…
zxdawn · 825
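
The excerpt is truncated, so the exact layout isn't clear, but a generic per-bin minimum can be computed in one vectorized pass with np.minimum.at; the data below is made up:

    import numpy as np

    # Each weight falls into the bin given at the same position in `bins`
    weights = np.array([5.0, 3.0, 7.0, 1.0, 4.0])
    bins = np.array([0, 1, 0, 2, 1])

    min_per_bin = np.full(bins.max() + 1, np.inf)

    # Unbuffered "group minimum": min_per_bin[b] = min(weights where bins == b)
    np.minimum.at(min_per_bin, bins, weights)
    print(min_per_bin)   # [5. 3. 1.]
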
8 votes · 4 answers

How to quickly compare two text files and get unique rows?

I have 2 text files (*.txt) that contain unique strings in the format: udtvbacfbbxfdffzpwsqzxyznecbqxgebuudzgzn:refmfxaawuuilznjrxuogrjqhlmhslkmprdxbascpoxda ltswbjfsnejkaxyzwyjyfggjynndwkivegqdarjg:qyktyzugbgclpovyvmgtkihxqisuawesmcvsjzukcbrzi The…
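
When both files fit in memory, a plain-Python sketch with sets is often fast enough; the file names below are placeholders:

    # Rows that appear in exactly one of the two files
    with open("file1.txt") as f1, open("file2.txt") as f2:
        unique_rows = set(f1).symmetric_difference(set(f2))

    with open("unique.txt", "w") as out:
        out.writelines(sorted(unique_rows))
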
8 votes · 2 answers

Dask vs. RAPIDS: what does RAPIDS provide which Dask doesn't have?

I want to understand the difference between Dask and RAPIDS, and what benefits RAPIDS provides that Dask doesn't have. Does RAPIDS internally use Dask code? If so, why do we have Dask, since even Dask can interact with GPUs?
DjVasu · 113
8 votes · 2 answers

Switch off dask client warnings

The Dask client spams warnings in my Jupyter Notebook output. Is there a way to switch off dask warnings? The warning text looks like this: "distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other…
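
One way to quiet these messages (for in-process workers such as a LocalCluster started in the notebook) is to raise the log level of the relevant distributed loggers; adjust the logger names and level to taste:

    import logging

    # Silence everything below ERROR from the worker and client loggers
    logging.getLogger("distributed.worker").setLevel(logging.ERROR)
    logging.getLogger("distributed.client").setLevel(logging.ERROR)
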
8 votes · 3 answers

Running a Tornado Server within a Jupyter Notebook

Taking the standard Tornado demonstration and pushing the IOLoop into a background thread allows querying of the server within a single script. This is useful when the Tornado server is an interactive object (see Dask or similar). import…
Daniel · 19,179
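
A compact sketch of the pattern being described, assuming Tornado 5+ (each thread needs its own asyncio event loop); the handler and port are illustrative:

    import asyncio
    import threading
    import tornado.ioloop
    import tornado.web

    class MainHandler(tornado.web.RequestHandler):
        def get(self):
            self.write("Hello from the background server")

    def run_server(port=8888):
        # Give this thread its own event loop before touching Tornado
        asyncio.set_event_loop(asyncio.new_event_loop())
        app = tornado.web.Application([(r"/", MainHandler)])
        app.listen(port)
        tornado.ioloop.IOLoop.current().start()

    threading.Thread(target=run_server, daemon=True).start()
    # The same script (or notebook) can now query http://localhost:8888/
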
8 votes · 1 answer

Memory errors using xarray + dask - use groupby or apply_ufunc?

I am using xarray as the basis of my workflow for analysing fluid turbulence data, but I'm having trouble leveraging dask correctly to limit memory usage on my laptop. I have a dataarray n with dimensions ('t', 'x', 'z'), which I've split into…
ThomasNicholas · 1,273
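
A sketch of the apply_ufunc route, with a random chunked DataArray standing in for the turbulence data and an elementwise placeholder for the real computation:

    import numpy as np
    import xarray as xr

    # Chunked stand-in for the ('t', 'x', 'z') data from the question
    n = xr.DataArray(np.random.rand(100, 64, 64),
                     dims=("t", "x", "z")).chunk({"t": 10})

    # dask="parallelized" applies the function one chunk at a time, so peak
    # memory stays bounded by the chunk size
    result = xr.apply_ufunc(
        np.square,                # elementwise placeholder computation
        n,
        dask="parallelized",
        output_dtypes=[n.dtype],
    )
    result.compute()
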
8 votes · 1 answer

Filtering with the dask read_parquet method gives unwanted results

I am trying to read parquet files using the dask read_parquet method and the filters kwarg; however, it sometimes doesn't filter according to the given condition. Example: creating and saving a data frame with a dates column: import pandas as pd import…
moshevi · 4,999
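
The usual explanation is that filters= is applied at the row-group / partition level (predicate pushdown), so a surviving row group can still contain non-matching rows. A sketch of combining it with an exact row-wise mask; the path and column name are illustrative:

    import pandas as pd
    import dask.dataframe as dd

    cutoff = pd.Timestamp("2018-01-01")

    # filters= prunes whole row groups; it does not drop individual rows
    ddf = dd.read_parquet("data.parquet", filters=[("dates", ">", cutoff)])

    # Follow up with a normal mask for exact, row-level filtering
    ddf = ddf[ddf["dates"] > cutoff]
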
8 votes · 1 answer

How to apply a function to a single column of a large dataset using Dask?

If I want to apply a function that calculates the logarithm of a single column of a large dataset using Dask, how can I do that? df_train.apply(lambda x: np.log1p(x), axis=1, meta={'column_name':'float32'}).compute() The dataset is very large (125 million rows), how…
ambigus9 · 1,417
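
A sketch of operating on just that one column instead of apply(..., axis=1) over every row; the frame and column name below are stand-ins:

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    df_train = dd.from_pandas(
        pd.DataFrame({"column_name": [1.0, 2.0, 3.0]}), npartitions=2)

    # NumPy ufuncs work element-wise on a dask Series and stay lazy
    df_train["column_name"] = np.log1p(df_train["column_name"])

    # Equivalent, with an explicit meta:
    # df_train["column_name"] = df_train["column_name"].map_partitions(
    #     np.log1p, meta=("column_name", "float32"))

    print(df_train.compute())
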
8 votes · 1 answer

Why is a computation much slower within a Dask/Distributed worker?

I have a computation which runs much slower within a Dask/Distributed worker compared to running it locally. I can reproduce it without any I/O going on, so I can rule out that it has to do with transferring data. The following code is a minimal…
bluenote10 · 23,414
8 votes · 3 answers

python-xarray: open_mfdataset concat along two dimensions

I have files which are made of 10 ensembles and 35 time files. One of these files looks like: >>> xr.open_dataset('ens1/CCSM4_ens1_07ic_19820701-19820731_NPac_Jul.nc') Dimensions: (ensemble: 1, latitude: 66, longitude: 191, time:…
Ray Bell · 1,508
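
A sketch of the nested-list route in xarray, where the nesting order of the file lists matches the order of concat_dim; the file-name pattern below is hypothetical:

    import xarray as xr

    # One sub-list per ensemble member, each holding that member's time files
    files = [[f"ens{e}/file_{t:02d}.nc" for t in range(35)]
             for e in range(1, 11)]

    ds = xr.open_mfdataset(
        files,
        combine="nested",
        concat_dim=["ensemble", "time"],
    )
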
8 votes · 1 answer

Convert raster time series of multiple GeoTIFF images to NetCDF

I have a raster time series stored in multiple GeoTIFF files (*.tif) that I'd like to convert to a single NetCDF file. The data is uint16. I could probably use gdal_translate to convert each image to netcdf using: gdal_translate -of netcdf -co…
Rich Signell · 14,842
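
Besides gdal_translate, one xarray-based route (assuming rioxarray is installed; the paths and chunk sizes are illustrative) is to open each GeoTIFF lazily, stack along a new time dimension, and write a single NetCDF file:

    import glob
    import xarray as xr
    import rioxarray

    # Sort so the concatenation order matches the time order
    tif_files = sorted(glob.glob("rasters/*.tif"))

    # chunks=... keeps each raster dask-backed instead of loading it eagerly
    arrays = [rioxarray.open_rasterio(f, chunks={"x": 1024, "y": 1024})
              for f in tif_files]

    stacked = xr.concat(arrays, dim="time")
    stacked.to_netcdf("timeseries.nc")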