Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
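
A minimal sketch of the two components in code; the CSV file name and column name in the second half are hypothetical placeholders:

```python
import dask
import dask.dataframe as dd

# Component 1: dynamic task scheduling. dask.delayed builds a task graph;
# nothing executes until .compute() is called.
@dask.delayed
def double(x):
    return 2 * x

total = dask.delayed(sum)([double(i) for i in range(10)])
print(total.compute())  # 90

# Component 2: a "big data" collection that mirrors the pandas API and
# runs on the same schedulers. "events.csv" and "value" are hypothetical.
df = dd.read_csv("events.csv")
print(df["value"].mean().compute())
```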

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3 votes · 1 answer

How to filter a table stored in S3 with over 1 billion rows using Dask or another Python library?

I have a huge table (df1, ~1.5 billion rows) stored in an S3 bucket, divided into multiple parts or partitions in Parquet format. My goal is to filter it by keeping those rows where the value of a particular column exists in a column (with the…
CSR95 • 121 • 8
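
A hedged sketch of one common approach to the question above: read the Parquet dataset lazily and filter with isin, so only matching rows are ever materialized. The bucket layout, column names, and the second table holding the keys are all assumptions:

```python
import dask.dataframe as dd  # reading s3:// paths also requires s3fs

# Hypothetical dataset layout and column names.
df1 = dd.read_parquet("s3://my-bucket/df1/", engine="pyarrow")

# Pull the (much smaller) key column into memory once.
keys = dd.read_parquet("s3://my-bucket/df2/", columns=["id"])["id"].compute()

# Lazy row filter across all partitions; nothing is read from S3
# until compute() or to_parquet() runs.
filtered = df1[df1["id"].isin(list(keys))]
filtered.to_parquet("s3://my-bucket/df1-filtered/")
```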
3 votes · 2 answers

Updating values for a subset of a subset of a pandas dataframe too slow for large data set

Problem Statement: I'm working with transaction data for all of a hospital's visits and I need to remove every bad debt transaction after the first for each patient. Issue I'm Having: My code works on a small dataset, but the actual data set is…
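
For the "remove every bad-debt transaction after the first per patient" problem above, the usual fix for slow row-by-row updates is a vectorized groupby; the column names and values here are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "tx_type":    ["BAD_DEBT", "BAD_DEBT", "PAYMENT", "BAD_DEBT", "BAD_DEBT"],
})

# Running count of bad-debt rows within each patient, then drop every
# bad-debt occurrence after the first. No Python-level loop over rows.
bad = df["tx_type"].eq("BAD_DEBT")
nth_bad = bad.groupby(df["patient_id"]).cumsum()
result = df[~(bad & (nth_bad > 1))]
print(result)
```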
3 votes · 2 answers

directory globbing with partitioned dask read_parquet directories

I have a directory of partitioned weather station readings that I've written with pandas/pyarrow: c.to_parquet(path=f"data/{filename}.parquet", engine='pyarrow', compression='snappy', partition_cols=['STATION', 'ELEMENT']) When I attempt to read a…
Scott Syms • 31 • 2
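
A sketch of reading such a partitioned dataset back: rather than globbing the STATION=.../ELEMENT=.../ subdirectories by hand, dd.read_parquet can take the dataset root plus filters on the partition columns (the path and filter values below are hypothetical):

```python
import dask.dataframe as dd

# Point at the dataset root, not the STATION=.../ELEMENT=.../ leaves;
# pyarrow rediscovers the partition columns from the directory names.
df = dd.read_parquet(
    "data/weather.parquet",
    engine="pyarrow",
    filters=[("STATION", "==", "USW00014733"), ("ELEMENT", "==", "TMAX")],
)
```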
3 votes · 2 answers

importing large CSV file using Dask

I am importing a very large CSV file (~680 GB) using Dask; however, the output is not what I expect. My aim is to select only some columns (6 of 50), and perhaps filter them (this I am unsure of because there seems to be no data?): import dask.dataframe…
Stackbeans • 273 • 1 • 16
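
A hedged sketch of the usual pattern for the CSV question above: pass usecols so only the needed columns are parsed at all, keep the filter lazy, and only materialize at the end (the file and column names are hypothetical):

```python
import dask.dataframe as dd

# Only 6 of the ~50 columns are ever parsed from disk.
df = dd.read_csv(
    "huge.csv",
    usecols=["id", "timestamp", "price", "qty", "venue", "side"],
    blocksize="256MB",  # size of each partition read from the file
)

subset = df[df["price"] > 0]  # lazy: nothing has been read yet
print(subset.head())          # triggers computation on the first block only
```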
3 votes · 1 answer

Dask: Continue with other tasks if one fails

I have a simple (but large) task graph in Dask. This is a code example: results = [] for params in SomeIterable: a = dask.delayed(my_function)(**params) b = dask.delayed(my_other_function)(a) …
Andrex • 602 • 1 • 7 • 22
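
One common workaround for the failure-propagation question above is to make failure an ordinary return value, so one bad task cannot poison its downstream tasks. A minimal self-contained sketch; the wrapper and the stand-in function are hypothetical:

```python
import dask

def safe(func):
    """Turn exceptions into return values so one failure doesn't stop the rest."""
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            return exc
    return wrapper

def my_function(x):  # stand-in for the question's function
    if x == 3:
        raise ValueError("boom")
    return x * 2

tasks = [dask.delayed(safe(my_function))(x) for x in range(6)]
results = dask.compute(*tasks)
ok = [r for r in results if not isinstance(r, Exception)]
print(ok)  # [0, 2, 4, 8, 10] -- x == 3 failed but the others completed
```

On a distributed cluster, submitting futures and gathering with client.gather(futures, errors="skip") is another option.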
3 votes · 1 answer

Filter Dask DataFrame rows by specific values of index

Is there an effective solution to select specific rows in a Dask DataFrame? I would like to get only those rows whose index is in a given set (using the isin function is not efficient enough for me). Are there any other effective solutions than …
mlech • 31 • 3
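
If the index is sorted and the divisions are known, .loc can prune whole partitions instead of scanning every row, which is usually the win over a plain isin. A minimal sketch on toy data, assuming scalar .loc lookups (supported when divisions are known):

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"v": range(10)}, index=list("abcdefghij"))
df = dd.from_pandas(pdf, npartitions=3)  # sorted index -> known divisions
print(df.known_divisions)                # True

# .loc on a known-divisions index only touches the partitions that
# can contain the requested label.
wanted = ["b", "e", "j"]
subset = dd.concat([df.loc[k] for k in wanted])
print(subset.compute())
```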
3 votes · 1 answer

Short Time Fourier Transform (spectrum analysis) in parallel & lazily using Dask (and / or xarray)

Question: I am trying to do spectrum analysis on long time series data (see example for data structure; it is basically 1d data with a time index). To save time and memory, etc., I want to do this in parallel and lazily (using xarray and / or dask). What…
n4321d • 1,059 • 2 • 12 • 31
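
A hedged sketch of one way to keep the spectrum analysis lazy and parallel: split the long series into segments with dask.delayed and run scipy.signal.stft per segment. The sampling rate, segment sizes, and the synthetic signal are all assumptions:

```python
import numpy as np
import dask
import scipy.signal as sig

fs = 1_000                                           # assumed sampling rate (Hz)
x = np.random.default_rng(0).normal(size=fs * 600)   # 10 minutes of fake data

seg_len = fs * 60                                    # one minute per task
segments = [x[i:i + seg_len] for i in range(0, len(x), seg_len)]

@dask.delayed
def spectrum(seg):
    f, t, Zxx = sig.stft(seg, fs=fs, nperseg=1024)
    return np.abs(Zxx)                               # magnitude spectrogram

# Nothing is computed until here; segments run in parallel.
spectrograms = dask.compute(*[spectrum(s) for s in segments])
```

Segment boundaries lose one window of overlap; dask.array.map_overlap is the usual tool if that matters.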
3 votes · 1 answer

Annotations for custom graphs in dask

How can I specify resources per task in a custom graph, as you can with the dask.annotate context manager for Dask collections?
Wox • 155 • 7
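
For reference, the collection-level pattern the question contrasts with looks like this; the resource name assumes workers declare a matching resource, and whether annotations can attach to a hand-built graph dict depends on the Dask version:

```python
import dask
from dask.distributed import Client

# Local stand-in; a real cluster would start workers with --resources "GPU=1".
client = Client(n_workers=1, resources={"GPU": 1})

@dask.delayed
def train(x):
    return x + 1

# Tasks created inside the context carry the annotation and will only be
# scheduled on workers that advertise the GPU resource.
with dask.annotate(resources={"GPU": 1}):
    y = train(10)

print(y.compute())
```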
3 votes · 4 answers

TypeError: __dask_distributed_pack__() takes 3 positional arguments but 4 were given

I have some code where I convert a pandas dataframe into a dask dataframe and apply some operations on the rows. The code used to work just fine, but it seems to crash now due to some internal error caused by dask. Does anyone know what the issue…
Patrick • 141 • 4
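
This particular signature error is most often a dask/distributed version skew: __dask_distributed_pack__ changed signature across releases, so the two packages (and the client, scheduler, and worker environments) must match. A hedged first diagnostic step:

```python
import dask
import distributed

# If these disagree, or differ between client and cluster, upgrading
# both packages in lockstep usually resolves the TypeError.
print("dask:       ", dask.__version__)
print("distributed:", distributed.__version__)
```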
3 votes · 1 answer

How to shuffle elements in a Dask bag

I have a dataset in which a few elements, which are close to each other and generally end up in the same partition, cause more computation than others, because they have quadratic complexity. I want to randomly reshuffle them so that the workload…
della • 151 • 3
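
A hedged sketch of one way to randomly reshuffle a bag: tag each element with a random partition key and groupby on it, which forces a full shuffle, then drop the keys. The sizes below are arbitrary:

```python
import random
import dask.bag as db

b = db.from_sequence(range(1_000), npartitions=8)

shuffled = (
    b.groupby(lambda _: random.randrange(8), npartitions=8)  # random key -> shuffle
     .map(lambda kv: kv[1])   # drop the key, keep the grouped elements
     .flatten()
)
print(shuffled.take(5))
```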
3 votes · 0 answers

Dask chunk masking

I have an application where I am loading raster data into a Dask array and then I only need to process the chunks which overlap with some region of interest. I know that I can create a Dask masked array, but I am looking for a way to prevent certain…
System123 • 523 • 5 • 14
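
One hedged way to skip work outside a region of interest is map_blocks with block_info, which tells each task where its chunk sits in the full array. The ROI coordinates and the doubling placeholder are hypothetical:

```python
import dask.array as da

x = da.random.random((8_192, 8_192), chunks=(1_024, 1_024))

# Hypothetical region of interest in array coordinates.
roi_rows, roi_cols = (0, 3_000), (2_000, 5_000)

def maybe_process(block, block_info=None):
    # block_info[0]["array-location"] gives this chunk's (start, stop) per axis.
    (r0, r1), (c0, c1) = block_info[0]["array-location"]
    outside = (r1 <= roi_rows[0] or r0 >= roi_rows[1]
               or c1 <= roi_cols[0] or c0 >= roi_cols[1])
    if outside:
        return block      # skip the expensive work for non-overlapping chunks
    return block * 2      # placeholder for the real per-chunk processing

y = x.map_blocks(maybe_process, dtype=x.dtype)
```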
3 votes · 1 answer

Unaccountable Dask memory usage

I am digging into Dask and (mostly) feel comfortable with it. However, I cannot understand what is going on in the following scenario. TBH, I'm sure a question like this has been asked in the past, but after searching for a while I can't seem to find…
Severin • 281 • 2 • 8
3 votes · 0 answers

How to restart a dask worker subprocess after a task is done?

In django-q we have recycle, which is "the number of tasks a worker will process before recycling"; useful for releasing memory resources on a regular basis. When I start dask-worker with --nprocs 2, I get two worker subprocesses. I would like to recycle…
nurettin • 11,090 • 5 • 65 • 85
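
The closest Dask analogues to django-q's recycle are a bounded worker lifetime (dask-worker --lifetime 1h --lifetime-restart on the command line) or an explicit restart from the client, sketched below with a hypothetical scheduler address; as far as I know there is no built-in per-N-tasks counter:

```python
from dask.distributed import Client

client = Client("tcp://scheduler-host:8786")  # hypothetical address

# ... submit work, gather results ...

# Restart every worker process, releasing whatever memory they held.
client.restart()
```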
3 votes · 1 answer

How can I systematically reuse the results of delayed functions in Dask?

I am working on building a computation graph with Dask. Some of the intermediate values will be used multiple times, but I would like those calculations to only run once. I must be making a trivial mistake, because that's not what happens. Here is a…
poldpold • 53 • 6
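
The usual culprit in the reuse question above is calling .compute() separately on each output, which rebuilds and re-executes the shared subgraph; a single dask.compute call shares intermediate nodes. A minimal sketch:

```python
import dask

@dask.delayed
def expensive(x):
    print("computing", x)   # visible side effect to count executions
    return x * 2

shared = expensive(1)                     # one delayed node...
a = dask.delayed(sum)([shared, shared])   # ...referenced several times
b = shared + 10                           # Delayed supports lazy arithmetic

# One compute() call: "computing 1" prints once, because both outputs
# share the same graph node. a.compute(); b.compute() would run it twice.
res_a, res_b = dask.compute(a, b)
print(res_a, res_b)  # 4 12
```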
3 votes · 1 answer

dask - CountVectorizer returns "ValueError('Cannot infer dataframe metadata with a `dask.delayed` argument')"

I have a Dask Dataframe with the following content: X_trn y_trn 0 java repeat task every random seconds p m alre... LQ_CLOSE 1 are java optionals immutable p d like to under... HQ 2 text…
mendy • 191 • 1 • 12
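
One hedged sketch, assuming dask-ml is installed: its CountVectorizer consumes a dask Bag of raw documents, which sidesteps the metadata-inference error that dask.delayed arguments trigger in dask.dataframe code paths. The toy corpus mirrors the question's data:

```python
import dask.bag as db
from dask_ml.feature_extraction.text import CountVectorizer  # requires dask-ml

corpus = db.from_sequence(
    [
        "java repeat task every random seconds",
        "are java optionals immutable",
    ],
    npartitions=2,
)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # lazy dask array backed by sparse blocks
print(sorted(vectorizer.vocabulary_))
```

An existing dask Series of texts can be fed in via something like ddf["X_trn"].to_bag(), though check the dask-ml docs for the exact accepted inputs.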