Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
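
A minimal sketch of the two components in code; the CSV file name and column name in the second half are hypothetical placeholders:

```python
import dask
import dask.dataframe as dd

# Component 1: dynamic task scheduling. dask.delayed builds a task graph;
# nothing executes until .compute() is called.
@dask.delayed
def double(x):
    return 2 * x

total = dask.delayed(sum)([double(i) for i in range(10)])
print(total.compute())  # 90

# Component 2: a "big data" collection that mirrors the pandas API and
# runs on the same schedulers. "events.csv" and "value" are hypothetical.
df = dd.read_csv("events.csv")
print(df["value"].mean().compute())
```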

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3 votes · 1 answer

How to filter a table stored in S3 with over 1 billion rows using Dask or another Python library?

I have a huge table (df1, ~1.5 billion rows) stored in an S3 bucket, divided into multiple parts or partitions in Parquet format. My goal is to filter it by keeping those rows where the value of a particular column exists in a column (with the…
CSR95 • 121 • 8
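
A hedged sketch of one common approach to the question above: read the Parquet dataset lazily and filter with isin, so only matching rows are ever materialized. The bucket layout, column names, and the second table holding the keys are all assumptions:

```python
import dask.dataframe as dd  # reading s3:// paths also requires s3fs

# Hypothetical dataset layout and column names.
df1 = dd.read_parquet("s3://my-bucket/df1/", engine="pyarrow")

# Pull the (much smaller) key column into memory once.
keys = dd.read_parquet("s3://my-bucket/df2/", columns=["id"])["id"].compute()

# Lazy row filter across all partitions; nothing is read from S3
# until compute() or to_parquet() runs.
filtered = df1[df1["id"].isin(list(keys))]
filtered.to_parquet("s3://my-bucket/df1-filtered/")
```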
3 votes · 2 answers

Updating values for a subset of a subset of a pandas dataframe too slow for large data set

Problem Statement: I'm working with transaction data for all of a hospital's visits and I need to remove every bad debt transaction after the first for each patient. Issue I'm Having: My code works on a small dataset, but the actual data set is…
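
For the "remove every bad-debt transaction after the first per patient" problem above, the usual fix for slow row-by-row updates is a vectorized groupby; the column names and values here are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "tx_type":    ["BAD_DEBT", "BAD_DEBT", "PAYMENT", "BAD_DEBT", "BAD_DEBT"],
})

# Running count of bad-debt rows within each patient, then drop every
# bad-debt occurrence after the first. No Python-level loop over rows.
bad = df["tx_type"].eq("BAD_DEBT")
nth_bad = bad.groupby(df["patient_id"]).cumsum()
result = df[~(bad & (nth_bad > 1))]
print(result)
```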
3 votes · 2 answers

directory globbing with partitioned dask read_parquet directories

I have a directory of partitioned weather station readings that I've written with pandas/pyarrow: c.to_parquet(path=f"data/{filename}.parquet", engine='pyarrow', compression='snappy', partition_cols=['STATION', 'ELEMENT']) When I attempt to read a…
Scott Syms • 31 • 2
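
A sketch of reading such a partitioned dataset back: rather than globbing the STATION=.../ELEMENT=.../ subdirectories by hand, dd.read_parquet can take the dataset root plus filters on the partition columns (the path and filter values below are hypothetical):

```python
import dask.dataframe as dd

# Point at the dataset root, not the STATION=.../ELEMENT=.../ leaves;
# pyarrow rediscovers the partition columns from the directory names.
df = dd.read_parquet(
    "data/weather.parquet",
    engine="pyarrow",
    filters=[("STATION", "==", "USW00014733"), ("ELEMENT", "==", "TMAX")],
)
```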
3 votes · 2 answers

importing large CSV file using Dask

I am importing a very large CSV file (~680 GB) using Dask; however, the output is not what I expect. My aim is to select only some columns (6 of 50), and perhaps filter them (this I am unsure of because there seems to be no data?): import dask.dataframe…
Stackbeans • 273 • 1 • 16
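
A hedged sketch of the usual pattern for the CSV question above: pass usecols so only the needed columns are parsed at all, keep the filter lazy, and only materialize at the end (the file and column names are hypothetical):

```python
import dask.dataframe as dd

# Only 6 of the ~50 columns are ever parsed from disk.
df = dd.read_csv(
    "huge.csv",
    usecols=["id", "timestamp", "price", "qty", "venue", "side"],
    blocksize="256MB",  # size of each partition read from the file
)

subset = df[df["price"] > 0]  # lazy: nothing has been read yet
print(subset.head())          # triggers computation on the first block only
```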
3 votes · 1 answer

Dask: Continue with other tasks if one fails

I have a simple (but large) task graph in Dask. This is a code example: results = [] for params in SomeIterable: a = dask.delayed(my_function)(**params) b = dask.delayed(my_other_function)(a) …
Andrex • 602 • 1 • 7 • 22
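
One common workaround for the failure-propagation question above is to make failure an ordinary return value, so one bad task cannot poison its downstream tasks. A minimal self-contained sketch; the wrapper and the stand-in function are hypothetical:

```python
import dask

def safe(func):
    """Turn exceptions into return values so one failure doesn't stop the rest."""
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            return exc
    return wrapper

def my_function(x):  # stand-in for the question's function
    if x == 3:
        raise ValueError("boom")
    return x * 2

tasks = [dask.delayed(safe(my_function))(x) for x in range(6)]
results = dask.compute(*tasks)
ok = [r for r in results if not isinstance(r, Exception)]
print(ok)  # [0, 2, 4, 8, 10] -- x == 3 failed but the others completed
```

On a distributed cluster, submitting futures and gathering with client.gather(futures, errors="skip") is another option.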
3 votes · 1 answer

Filter Dask DataFrame rows by specific values of index

Is there an effective solution to select specific rows in a Dask DataFrame? I would like to get only those rows whose index is in a given set (using the isin function is not efficient enough for me). Are there any other effective solutions than …
mlech • 31 • 3
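
If the index is sorted and the divisions are known, .loc can prune whole partitions instead of scanning every row, which is usually the win over a plain isin. A minimal sketch on toy data, assuming scalar .loc lookups (supported when divisions are known):

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"v": range(10)}, index=list("abcdefghij"))
df = dd.from_pandas(pdf, npartitions=3)  # sorted index -> known divisions
print(df.known_divisions)                # True

# .loc on a known-divisions index only touches the partitions that
# can contain the requested label.
wanted = ["b", "e", "j"]
subset = dd.concat([df.loc[k] for k in wanted])
print(subset.compute())
```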
3 votes · 1 answer

Short Time Fourier Transform (spectrum analysis) in parallel & lazily using Dask (and / or xarray)

Question: I am trying to do spectrum analysis on long time series data (see example for data structure; it is basically 1d data with a time index). To save time and memory, etc., I want to do this in parallel and lazily (using xarray and / or dask). What…
n4321d • 1,059 • 2 • 12 • 31
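
A hedged sketch of one way to keep the spectrum analysis lazy and parallel: split the long series into segments with dask.delayed and run scipy.signal.stft per segment. The sampling rate, segment sizes, and the synthetic signal are all assumptions:

```python
import numpy as np
import dask
import scipy.signal as sig

fs = 1_000                                           # assumed sampling rate (Hz)
x = np.random.default_rng(0).normal(size=fs * 600)   # 10 minutes of fake data

seg_len = fs * 60                                    # one minute per task
segments = [x[i:i + seg_len] for i in range(0, len(x), seg_len)]

@dask.delayed
def spectrum(seg):
    f, t, Zxx = sig.stft(seg, fs=fs, nperseg=1024)
    return np.abs(Zxx)                               # magnitude spectrogram

# Nothing is computed until here; segments run in parallel.
spectrograms = dask.compute(*[spectrum(s) for s in segments])
```

Segment boundaries lose one window of overlap; dask.array.map_overlap is the usual tool if that matters.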
3 votes · 1 answer

Annotations for custom graphs in dask

How can I specify resources per task in a custom graph, as you can with the dask.annotate context manager for Dask collections?
Wox • 155 • 7
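
For reference, the collection-level pattern the question contrasts with looks like this; the resource name assumes workers declare a matching resource, and whether annotations can attach to a hand-built graph dict depends on the Dask version:

```python
import dask
from dask.distributed import Client

# Local stand-in; a real cluster would start workers with --resources "GPU=1".
client = Client(n_workers=1, resources={"GPU": 1})

@dask.delayed
def train(x):
    return x + 1

# Tasks created inside the context carry the annotation and will only be
# scheduled on workers that advertise the GPU resource.
with dask.annotate(resources={"GPU": 1}):
    y = train(10)

print(y.compute())
```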
3 votes · 4 answers

TypeError: __dask_distributed_pack__() takes 3 positional arguments but 4 were given

I have some code where I convert a pandas dataframe into a dask dataframe and apply some operations on the rows. The code used to work just fine, but it seems to crash now due to some internal error caused by dask. Does anyone know what the issue…
Patrick • 141 • 4
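
This particular signature error is most often a dask/distributed version skew: __dask_distributed_pack__ changed signature across releases, so the two packages (and the client, scheduler, and worker environments) must match. A hedged first diagnostic step:

```python
import dask
import distributed

# If these disagree, or differ between client and cluster, upgrading
# both packages in lockstep usually resolves the TypeError.
print("dask:       ", dask.__version__)
print("distributed:", distributed.__version__)
```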
3 votes · 1 answer

How to shuffle elements in a Dask bag

I have a dataset in which a few elements, which are close to each other and generally end up in the same partition, cause more computation than others, because they have quadratic complexity. I want to randomly reshuffle them so that the workload…
della • 151 • 3
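
A hedged sketch of one way to randomly reshuffle a bag: tag each element with a random partition key and groupby on it, which forces a full shuffle, then drop the keys. The sizes below are arbitrary:

```python
import random
import dask.bag as db

b = db.from_sequence(range(1_000), npartitions=8)

shuffled = (
    b.groupby(lambda _: random.randrange(8), npartitions=8)  # random key -> shuffle
     .map(lambda kv: kv[1])   # drop the key, keep the grouped elements
     .flatten()
)
print(shuffled.take(5))
```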
3 votes · 0 answers

Dask chunk masking

I have an application where I am loading raster data into a Dask array and then I only need to process the chunks which overlap with some region of interest. I know that I can create a Dask masked array, but I am looking for a way to prevent certain…
System123 • 523 • 5 • 14
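
One hedged way to skip work outside a region of interest is map_blocks with block_info, which tells each task where its chunk sits in the full array. The ROI coordinates and the doubling placeholder are hypothetical:

```python
import dask.array as da

x = da.random.random((8_192, 8_192), chunks=(1_024, 1_024))

# Hypothetical region of interest in array coordinates.
roi_rows, roi_cols = (0, 3_000), (2_000, 5_000)

def maybe_process(block, block_info=None):
    # block_info[0]["array-location"] gives this chunk's (start, stop) per axis.
    (r0, r1), (c0, c1) = block_info[0]["array-location"]
    outside = (r1 <= roi_rows[0] or r0 >= roi_rows[1]
               or c1 <= roi_cols[0] or c0 >= roi_cols[1])
    if outside:
        return block      # skip the expensive work for non-overlapping chunks
    return block * 2      # placeholder for the real per-chunk processing

y = x.map_blocks(maybe_process, dtype=x.dtype)
```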
3 votes · 1 answer

Unaccountable Dask memory usage

I am digging into Dask and (mostly) feel comfortable with it. However, I cannot understand what is going on in the following scenario. TBH, I'm sure a question like this has been asked in the past, but after searching for a while I can't seem to find…
Severin • 281 • 2 • 8
3 votes · 0 answers

How to restart a dask worker subprocess after a task is done?

In django-q we have recycle, which is "the number of tasks a worker will process before recycling"; useful for releasing memory resources on a regular basis. When I start dask-worker with --nprocs 2, I get two worker subprocesses. I would like to recycle…
nurettin • 11,090 • 5 • 65 • 85
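
The closest Dask analogues to django-q's recycle are a bounded worker lifetime (dask-worker --lifetime 1h --lifetime-restart on the command line) or an explicit restart from the client, sketched below with a hypothetical scheduler address; as far as I know there is no built-in per-N-tasks counter:

```python
from dask.distributed import Client

client = Client("tcp://scheduler-host:8786")  # hypothetical address

# ... submit work, gather results ...

# Restart every worker process, releasing whatever memory they held.
client.restart()
```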
3 votes · 1 answer

How can I systematically reuse the results of delayed functions in Dask?

I am working on building a computation graph with Dask. Some of the intermediate values will be used multiple times, but I would like those calculations to only run once. I must be making a trivial mistake, because that's not what happens. Here is a…
poldpold • 53 • 6
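
The usual culprit in the reuse question above is calling .compute() separately on each output, which rebuilds and re-executes the shared subgraph; a single dask.compute call shares intermediate nodes. A minimal sketch:

```python
import dask

@dask.delayed
def expensive(x):
    print("computing", x)   # visible side effect to count executions
    return x * 2

shared = expensive(1)                     # one delayed node...
a = dask.delayed(sum)([shared, shared])   # ...referenced several times
b = shared + 10                           # Delayed supports lazy arithmetic

# One compute() call: "computing 1" prints once, because both outputs
# share the same graph node. a.compute(); b.compute() would run it twice.
res_a, res_b = dask.compute(a, b)
print(res_a, res_b)  # 4 12
```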
3 votes · 1 answer

dask - CountVectorizer returns "ValueError('Cannot infer dataframe metadata with a `dask.delayed` argument')"

I have a Dask Dataframe with the following content: X_trn y_trn 0 java repeat task every random seconds p m alre... LQ_CLOSE 1 are java optionals immutable p d like to under... HQ 2 text…
mendy • 191 • 1 • 12
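
One hedged sketch, assuming dask-ml is installed: its CountVectorizer consumes a dask Bag of raw documents, which sidesteps the metadata-inference error that dask.delayed arguments trigger in dask.dataframe code paths. The toy corpus mirrors the question's data:

```python
import dask.bag as db
from dask_ml.feature_extraction.text import CountVectorizer  # requires dask-ml

corpus = db.from_sequence(
    [
        "java repeat task every random seconds",
        "are java optionals immutable",
    ],
    npartitions=2,
)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # lazy dask array backed by sparse blocks
print(sorted(vectorizer.vocabulary_))
```

An existing dask Series of texts can be fed in via something like ddf["X_trn"].to_bag(), though check the dask-ml docs for the exact accepted inputs.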