Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask

Main Page: https://dask.org/

4440 questions
4 votes, 2 answers

Reading a CSV with a multi-character separator in Python Dask

I am trying to create a DataFrame by reading a csv file separated by '#####' (5 hashes). The code is: import dask.dataframe as dd df = dd.read_csv('D:\temp.csv',sep='#####',engine='python') res = df.compute() Error is: dask.async.ValueError: Dask…
Satya
4 votes, 1 answer

variable column name in dask assign() or apply()

I have code that works in pandas, but I'm having trouble converting it to use dask. There is a partial solution here, but it does not allow me to use a variable as the name of the column I am creating/assigning to. Here's the working pandas…
kaz
3 votes, 1 answer

TypeError when running compute that includes map_blocks and reduce

I am having difficulty diagnosing the cause of the error. My code involves running a convolution (with map_blocks) over some arrays if they belong to the same group of variables, otherwise it just records the 2-dim array. I then do an argmax operation…
matsuo_basho
3 votes, 1 answer

Dask tutorial failing with distributed.nanny - WARNING - Restarting worker

Interested in the possibilities offered by Dask, I started with the dask tutorial and prepared my laptop by following its instructions: cloning the repo and making a new conda env with: conda env create -f…
Emek
3 votes, 2 answers

How to add a constant to negative values in array

Given the xarray below, I would like to add 10 to all negative values (i.e., -5 becomes 5, -4 becomes 6 ... -1 becomes 9; all other values remain unchanged). a = xr.DataArray(np.arange(25).reshape(5, 5)-5, dims=("x", "y")) I tried: a[a<0]=10+a[a<0], but…
e5k
3 votes, 1 answer

DASK: merge throws error when one side's key is NA whereas pd.merge works

I have these sample dataframes: tdf1 = pd.DataFrame([{"id": 1, "val": 4}, {"id": 2, "val": 5}, {"id": 3, "val": 6}, {"id": pd.NA, "val": 7}, {"id": 4, "val": 8}]) tdf2 = pd.DataFrame([{"some_id": 1, "name": "Josh"}, {"some_id": 3, "name":…
Jorge Cespedes
3 votes, 4 answers

How to efficiently read the array columns in a TSV file into a single npz file per column?

I've a data file that looks like this: 58f0965a62d62099f5c0771d35dbc218 0.868632614612579 [0.028979932889342308, 0.004080114420503378, 0.03757167607545853] [-0.006008833646774292, -0.010409083217382431,…
alvas
3 votes, 2 answers

Using Matplotlib with Dask

Let's say we have a pandas dataframe pd and a dask dataframe dd. When I want to plot the pandas one with matplotlib I can do it easily: fig, ax = plt.subplots() ax.bar(pd["series1"], pd["series2"]) fig.savefig(path) However, when I am trying to do the…
MDDawid1
3 votes, 1 answer

How to remove __null_dask_index from parquet file?

I am writing a df to a Parquet file using Dask: df.to_parquet(file, compression='snappy', write_metadata_file=False, engine='pyarrow', index=None) I need to present the contents of the file in an online parquet viewer, and the…
krx
3 votes, 2 answers

Unable to transpose dask.dataframe - getting UnboundLocalError

I am trying to transpose a very large dataframe. I used Dask due to the size of the file and searched up how to transpose a dask dataframe. import pandas as pd import numpy as np import dask.dataframe as dd genematrix =…
3 votes, 1 answer

Why is Polars called the fastest dataframe library; isn't Dask with cuDF more powerful?

Most of the benchmarks have Dask and cuDF isolated, but I can use them together. Wouldn't Dask with cuDF be faster than Polars? Also, Polars only runs if the data fits in memory, but this isn't the case with Dask. So why is there…
zacko
3 votes, 1 answer

Reading an SQL query into a Dask DataFrame

I'm trying to create a function that takes an SQL SELECT query as a parameter and uses dask to read its results into a dask DataFrame using the dask.read_sql_query function. I am new to dask and to SQLAlchemy. I first tried this: import dask.dataFrame…
mkab
3 votes, 1 answer

Replacing existing column in dask map_partitions gives SettingWithCopyWarning

I'm replacing column id2 in a dask dataframe using map_partitions. The result is that the values are replaced, but with a pandas warning. What is this warning, and how do I apply the .loc suggestion in the example below? pdf = pd.DataFrame({ …
ps0604
3 votes, 2 answers

Applying a function to each timestep in an xarray.Dataset, and returning lazy Dask array outputs

I have an xarray.Dataset with two 1D variables sun_azimuth and sun_elevation with multiple timesteps along the time dimension: import xarray as xr import numpy as np ds = xr.Dataset( data_vars={ "sun_azimuth": ("time", [10, 20, 30, 40,…
3 votes, 1 answer

Dask scatter with broadcast=True extremely slow

I have created a single (remote) scheduler and ten workers on different machines on the same network, and am trying to distribute a dataframe from a client. My problem is that it takes 30 min to do the scatter. from dask.distributed import Client df =…
Philipp -