Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
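
For instance, a minimal sketch of how the two pieces fit together; the file name and column names below are purely illustrative:

    import dask.dataframe as dd

    # A lazy, pandas-like collection; nothing is read yet
    ddf = dd.read_csv("data.csv")

    # Building the computation only constructs a task graph...
    result = ddf.groupby("key")["value"].mean()

    # ...which the dynamic task scheduler executes on compute()
    print(result.compute())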

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
9 votes · 2 answers

Dask equivalent to Pandas replace?

Something I use regularly in pandas is the .replace operation, and I am struggling to see how to readily perform the same operation on a dask dataframe: df.replace('PASS', '0', inplace=True) df.replace('FAIL', '1', inplace=True)
docross · 93
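
A minimal sketch of one way to do this, assuming a dask DataFrame built from an illustrative pandas frame; note that the result is reassigned rather than modified with inplace=True:

    import pandas as pd
    import dask.dataframe as dd

    # Illustrative frame with the PASS/FAIL values from the question
    pdf = pd.DataFrame({"status": ["PASS", "FAIL", "PASS"]})
    ddf = dd.from_pandas(pdf, npartitions=1)

    # dask.dataframe mirrors pandas' replace; reassign instead of inplace=True
    ddf = ddf.replace({"PASS": "0", "FAIL": "1"})
    print(ddf.compute())
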
9 votes · 3 answers

How to specify the directory that dask uses for temporary files?

Dask seems to write to the /tmp folder. How can I change the folder that dask uses for temporary files?
Arco Bast · 3,595
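
One possible approach (the path below is illustrative) is to point dask's temporary-directory configuration key somewhere else before any work starts; the same key can also be set through the DASK_TEMPORARY_DIRECTORY environment variable or a YAML file under ~/.config/dask/:

    import dask

    # Redirect dask's scratch space away from /tmp (path is illustrative)
    dask.config.set({"temporary-directory": "/scratch/dask-tmp"})
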
9 votes · 2 answers

How can I select data from a dask dataframe by a list of indices?

I want to select rows from a dask dataframe based on a list of indices. How can I do that? Example: Let's say, I have the following dask dataframe. dict_ = {'A':[1,2,3,4,5,6,7], 'B':[2,3,4,5,6,7,8], 'index':['x1', 'a2', 'x3', 'c4', 'x5', 'y6',…
Arco Bast · 3,595
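
A sketch of one way to do this, using stand-in data in the spirit of the truncated example above:

    import pandas as pd
    import dask.dataframe as dd

    # Stand-in frame; values are illustrative, not the question's full data
    pdf = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4],
                        "index": ["x1", "a2", "x3"]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    wanted = ["x1", "x3"]

    # Robust option: boolean masking with isin
    subset = ddf[ddf["index"].isin(wanted)]
    print(subset.compute())

    # Alternative (often faster when divisions are known): make the column
    # the index, then select labels with .loc
    # subset = ddf.set_index("index").loc[wanted]
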
9 votes · 1 answer

Item assignment to Python dask array objects

I've created a Python dask array and I'm trying to modify a slice of the array as follows: import numpy as np import dask.array as da x = np.random.random((20000, 100, 100)) # Create numpy array dx = da.from_array(x, chunks=(x.shape[0], 10, 10)) #…
Lcg3 · 93
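
A sketch of two possibilities, with the array sizes shrunk from the question's for a quick run; direct slice assignment is available only in newer dask releases, while da.where works everywhere:

    import numpy as np
    import dask.array as da

    x = np.random.random((200, 100, 100))        # smaller than the question's array
    dx = da.from_array(x, chunks=(200, 10, 10))

    # Recent dask releases support slice assignment directly; older ones
    # raise NotImplementedError here
    dx[0, :, :] = 0.0

    # Version-independent alternative: build a new array functionally
    dx2 = da.where(dx > 0.5, 0.0, dx)
    print(dx2[0, 0, :5].compute())
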
8 votes · 2 answers

Fastest way to get the minimum value of data array in another paired bin array

I have three 1D arrays: idxs (the index data), weights (the weight of each index in idxs), and bins (the bins used to calculate the minimum weight within each). Here's my current method of using idxs to check which bin each weight falls into, and…
zxdawn · 825
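
The excerpt is truncated, so the exact layout isn't clear, but a generic per-bin minimum can be computed in one vectorized pass with np.minimum.at; the data below is made up:

    import numpy as np

    # Each weight falls into the bin given at the same position in `bins`
    weights = np.array([5.0, 3.0, 7.0, 1.0, 4.0])
    bins = np.array([0, 1, 0, 2, 1])

    min_per_bin = np.full(bins.max() + 1, np.inf)

    # Unbuffered "group minimum": min_per_bin[b] = min(weights where bins == b)
    np.minimum.at(min_per_bin, bins, weights)
    print(min_per_bin)   # [5. 3. 1.]
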
8 votes · 4 answers

How to quickly compare two text files and get unique rows?

I have 2 text files (*.txt) that contain unique strings in the format: udtvbacfbbxfdffzpwsqzxyznecbqxgebuudzgzn:refmfxaawuuilznjrxuogrjqhlmhslkmprdxbascpoxda ltswbjfsnejkaxyzwyjyfggjynndwkivegqdarjg:qyktyzugbgclpovyvmgtkihxqisuawesmcvsjzukcbrzi The…
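
When both files fit in memory, a plain-Python sketch with sets is often fast enough; the file names below are placeholders:

    # Rows that appear in exactly one of the two files
    with open("file1.txt") as f1, open("file2.txt") as f2:
        unique_rows = set(f1).symmetric_difference(set(f2))

    with open("unique.txt", "w") as out:
        out.writelines(sorted(unique_rows))
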
8 votes · 2 answers

Dask vs. RAPIDS: what does RAPIDS provide which Dask doesn't have?

I want to understand the difference between Dask and RAPIDS, and what benefits RAPIDS provides that Dask doesn't have. Does RAPIDS internally use Dask code? If so, why do we have Dask, since even Dask can interact with GPUs?
DjVasu · 113
8 votes · 2 answers

Switch off dask client warnings

The Dask client spams warnings in my Jupyter Notebook output. Is there a way to switch off dask warnings? The warning text looks like this: "distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other…
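
One way to quiet these messages (for in-process workers such as a LocalCluster started in the notebook) is to raise the log level of the relevant distributed loggers; adjust the logger names and level to taste:

    import logging

    # Silence everything below ERROR from the worker and client loggers
    logging.getLogger("distributed.worker").setLevel(logging.ERROR)
    logging.getLogger("distributed.client").setLevel(logging.ERROR)
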
8 votes · 3 answers

Running a Tornado Server within a Jupyter Notebook

Taking the standard Tornado demonstration and pushing the IOLoop into a background thread allows querying of the server within a single script. This is useful when the Tornado server is an interactive object (see Dask or similar). import…
Daniel · 19,179
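
A compact sketch of the pattern being described, assuming Tornado 5+ (each thread needs its own asyncio event loop); the handler and port are illustrative:

    import asyncio
    import threading
    import tornado.ioloop
    import tornado.web

    class MainHandler(tornado.web.RequestHandler):
        def get(self):
            self.write("Hello from the background server")

    def run_server(port=8888):
        # Give this thread its own event loop before touching Tornado
        asyncio.set_event_loop(asyncio.new_event_loop())
        app = tornado.web.Application([(r"/", MainHandler)])
        app.listen(port)
        tornado.ioloop.IOLoop.current().start()

    threading.Thread(target=run_server, daemon=True).start()
    # The same script (or notebook) can now query http://localhost:8888/
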
8 votes · 1 answer

Memory errors using xarray + dask - use groupby or apply_ufunc?

I am using xarray as the basis of my workflow for analysing fluid turbulence data, but I'm having trouble leveraging dask correctly to limit memory usage on my laptop. I have a dataarray n with dimensions ('t', 'x', 'z'), which I've split into…
ThomasNicholas · 1,273
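
A sketch of the apply_ufunc route, with a random chunked DataArray standing in for the turbulence data and an elementwise placeholder for the real computation:

    import numpy as np
    import xarray as xr

    # Chunked stand-in for the ('t', 'x', 'z') data from the question
    n = xr.DataArray(np.random.rand(100, 64, 64),
                     dims=("t", "x", "z")).chunk({"t": 10})

    # dask="parallelized" applies the function one chunk at a time, so peak
    # memory stays bounded by the chunk size
    result = xr.apply_ufunc(
        np.square,                # elementwise placeholder computation
        n,
        dask="parallelized",
        output_dtypes=[n.dtype],
    )
    result.compute()
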
8 votes · 1 answer

Filtering with the dask read_parquet method gives unwanted results

I am trying to read parquet files using the dask read_parquet method and the filters kwarg; however, it sometimes doesn't filter according to the given condition. Example: creating and saving a data frame with a dates column: import pandas as pd import…
moshevi · 4,999
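
The usual explanation is that filters= is applied at the row-group / partition level (predicate pushdown), so a surviving row group can still contain non-matching rows. A sketch of combining it with an exact row-wise mask; the path and column name are illustrative:

    import pandas as pd
    import dask.dataframe as dd

    cutoff = pd.Timestamp("2018-01-01")

    # filters= prunes whole row groups; it does not drop individual rows
    ddf = dd.read_parquet("data.parquet", filters=[("dates", ">", cutoff)])

    # Follow up with a normal mask for exact, row-level filtering
    ddf = ddf[ddf["dates"] > cutoff]
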
8 votes · 1 answer

How to apply a function to a single column of a large dataset using Dask?

If I want to apply a function that calculates the logarithm of a single column of a large dataset using Dask, how can I do that? df_train.apply(lambda x: np.log1p(x), axis=1, meta={'column_name':'float32'}).compute() The dataset is very large (125 million rows), how…
ambigus9 · 1,417
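
A sketch of operating on just that one column instead of apply(..., axis=1) over every row; the frame and column name below are stand-ins:

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    df_train = dd.from_pandas(
        pd.DataFrame({"column_name": [1.0, 2.0, 3.0]}), npartitions=2)

    # NumPy ufuncs work element-wise on a dask Series and stay lazy
    df_train["column_name"] = np.log1p(df_train["column_name"])

    # Equivalent, with an explicit meta:
    # df_train["column_name"] = df_train["column_name"].map_partitions(
    #     np.log1p, meta=("column_name", "float32"))

    print(df_train.compute())
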
8 votes · 1 answer

Why is a computation much slower within a Dask/Distributed worker?

I have a computation which runs much slower within a Dask/Distributed worker compared to running it locally. I can reproduce it without any I/O going on, so I can rule out that it has to do with transferring data. The following code is a minimal…
bluenote10 · 23,414
8 votes · 3 answers

python-xarray: open_mfdataset concat along two dimensions

I have files which are made of 10 ensembles and 35 time files. One of these files looks like: >>> xr.open_dataset('ens1/CCSM4_ens1_07ic_19820701-19820731_NPac_Jul.nc') Dimensions: (ensemble: 1, latitude: 66, longitude: 191, time:…
Ray Bell · 1,508
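
A sketch of the nested-list route in xarray, where the nesting order of the file lists matches the order of concat_dim; the file-name pattern below is hypothetical:

    import xarray as xr

    # One sub-list per ensemble member, each holding that member's time files
    files = [[f"ens{e}/file_{t:02d}.nc" for t in range(35)]
             for e in range(1, 11)]

    ds = xr.open_mfdataset(
        files,
        combine="nested",
        concat_dim=["ensemble", "time"],
    )
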
8 votes · 1 answer

Convert raster time series of multiple GeoTIFF images to NetCDF

I have a raster time series stored in multiple GeoTIFF files (*.tif) that I'd like to convert to a single NetCDF file. The data is uint16. I could probably use gdal_translate to convert each image to netcdf using: gdal_translate -of netcdf -co…
Rich Signell · 14,842
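
Besides gdal_translate, one xarray-based route (assuming rioxarray is installed; the paths and chunk sizes are illustrative) is to open each GeoTIFF lazily, stack along a new time dimension, and write a single NetCDF file:

    import glob
    import xarray as xr
    import rioxarray

    # Sort so the concatenation order matches the time order
    tif_files = sorted(glob.glob("rasters/*.tif"))

    # chunks=... keeps each raster dask-backed instead of loading it eagerly
    arrays = [rioxarray.open_rasterio(f, chunks={"x": 1024, "y": 1024})
              for f in tif_files]

    stacked = xr.concat(arrays, dim="time")
    stacked.to_netcdf("timeseries.nc")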