Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask

Main Page: https://dask.org/

4440 questions
4 votes, 2 answers

Reading a CSV with a multi-character separator in Python Dask

I am trying to create a DataFrame by reading a csv file separated by '#####' (5 hashes). The code is: import dask.dataframe as dd df = dd.read_csv('D:\temp.csv',sep='#####',engine='python') res = df.compute() Error is: dask.async.ValueError: Dask…
Satya
4 votes, 1 answer

variable column name in dask assign() or apply()

I have code that works in pandas, but I'm having trouble converting it to use dask. There is a partial solution here, but it does not allow me to use a variable as the name of the column I am creating/assigning to. Here's the working pandas…
kaz
3 votes, 1 answer

TypeError when running compute that includes map_blocks and reduce

I am having difficulty diagnosing the cause of the error. My code involves running a convolution (with map_blocks) over some arrays if they belong to the same group of variables, otherwise it just records the 2-dim array. I then do an argmax operation…
matsuo_basho
3 votes, 1 answer

Dask tutorial failing with distributed.nanny - WARNING - Restarting worker

Interested in the possibilities offered by Dask, I started with the dask tutorial and prepared my laptop by following its instructions: cloning the repo and making a new conda env with: conda env create -f…
Emek
3 votes, 2 answers

How to add a constant to negative values in array

Given the xarray below, I would like to add 10 to all negative values (i.e., -5 becomes 5, -4 becomes 6 ... -1 becomes 9; all other values remain unchanged). a = xr.DataArray(np.arange(25).reshape(5, 5)-5, dims=("x", "y")) I tried: a[a<0]=10+a[a<0], but…
e5k
3 votes, 1 answer

DASK: merge throws error when one side's key is NA whereas pd.merge works

I have these sample dataframes: tdf1 = pd.DataFrame([{"id": 1, "val": 4}, {"id": 2, "val": 5}, {"id": 3, "val": 6}, {"id": pd.NA, "val": 7}, {"id": 4, "val": 8}]) tdf2 = pd.DataFrame([{"some_id": 1, "name": "Josh"}, {"some_id": 3, "name":…
Jorge Cespedes
3 votes, 4 answers

How to efficiently read the array columns in a TSV file into a single npz file per column?

I've a data file that looks like this: 58f0965a62d62099f5c0771d35dbc218 0.868632614612579 [0.028979932889342308, 0.004080114420503378, 0.03757167607545853] [-0.006008833646774292, -0.010409083217382431,…
alvas
3 votes, 2 answers

Using Matplotlib with Dask

Let's say we have a pandas dataframe pd and a dask dataframe dd. When I want to plot the pandas one with matplotlib I can do it easily: fig, ax = plt.subplots() ax.bar(pd["series1"], pd["series2"]) fig.savefig(path) However, when I am trying to do the…
MDDawid1
3 votes, 1 answer

How to remove __null_dask_index from parquet file?

I am writing a df to a Parquet file using Dask: df.to_parquet(file, compression='snappy', write_metadata_file=False, engine='pyarrow', index=None) I need to present the contents of the file in an online parquet viewer, and the…
krx
3 votes, 2 answers

Unable to transpose dask.dataframe - getting UnboundLocalError

I am trying to transpose a very large dataframe. I used Dask due to the size of the file and searched up how to transpose a dask dataframe. import pandas as pd import numpy as np import dask.dataframe as dd genematrix =…
3 votes, 1 answer

Why is Polars called the fastest dataframe library; isn't Dask with cuDF more powerful?

Most of the benchmarks have Dask and cuDF isolated, but I can use them together. Wouldn't Dask with cuDF be faster than Polars? Also, Polars only runs if the data fits in memory, but this isn't the case with Dask. So why is there…
zacko
3 votes, 1 answer

Reading an SQL query into a Dask DataFrame

I'm trying to create a function that takes an SQL SELECT query as a parameter and uses dask to read its results into a dask DataFrame using the dask.read_sql_query function. I am new to dask and to SQLAlchemy. I first tried this: import dask.dataFrame…
mkab
3 votes, 1 answer

Replacing existing column in dask map_partitions gives SettingWithCopyWarning

I'm replacing column id2 in a dask dataframe using map_partitions. The result is that the values are replaced, but with a pandas warning. What is this warning, and how do I apply the .loc suggestion in the example below? pdf = pd.DataFrame({ …
ps0604
3 votes, 2 answers

Applying a function to each timestep in an xarray.Dataset, and returning lazy Dask array outputs

I have an xarray.Dataset with two 1D variables sun_azimuth and sun_elevation with multiple timesteps along the time dimension: import xarray as xr import numpy as np ds = xr.Dataset( data_vars={ "sun_azimuth": ("time", [10, 20, 30, 40,…
3 votes, 1 answer

Dask scatter with broadcast=True extremely slow

I have created a single (remote) scheduler and ten workers on different machines on the same network, and am trying to distribute a dataframe from a client. My problem is that it takes 30 min to do the scatter. from dask.distributed import Client df =…
Philipp -