Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub (tutorial): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3 votes, 1 answer

dask : How to read CSV files into a DataFrame from Microsoft Azure Blob

S3Fs is a Pythonic file interface to S3; does Dask have any Pythonic interface to Azure Storage Blob? Python SDKs for Azure Storage Blob provide ways to read and write blobs, but the interface requires the file to be downloaded to the local…
3 votes, 1 answer

dask groupby apply then merge back to dataframe

How would I go about creating a new column that is the result of a groupby-and-apply on another column, while keeping the order of the dataframe (or at least being able to sort it back)? Example: I want to normalize a signal column by group. import…
AlexFC
3 votes, 1 answer

Reading LAZ to Dask dataframe using delayed loading

Action: reading multiple LAZ point-cloud files into a Dask DataFrame. Problem: unzipping LAZ (compressed) to LAS (uncompressed) requires a lot of memory; varying file sizes and the multiple processes created by Dask result in MemoryErrors. Attempts: I tried…
Tom Hemmes
3 votes, 2 answers

Dask Bag of dicts to Dask array

I need to convert a dask.Bag of {'imgs': np.array(img_list), 'labels': np.array(label_list)} into two separate dask.Array objects. Why did I create a Bag instead of going directly to an Array? Because I'm processing that Bag multiple times through map(); didn't…
w00dy
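One possible route, sketched on toy records with assumed per-partition shapes: stack the field inside each partition, then wrap each partition with `da.from_delayed` and concatenate, so partition boundaries become chunk boundaries.

```python
import numpy as np
import dask.bag as db
import dask.array as da

records = [{"imgs": np.ones((2, 2)), "labels": np.array([1])} for _ in range(4)]
bag = db.from_sequence(records, npartitions=2)

def imgs_of(partition):
    # Stack one field of every record in this partition into one array.
    return np.stack([rec["imgs"] for rec in partition])

parts = bag.map_partitions(imgs_of).to_delayed()
# Chunk shapes must be declared up front; (2, 2, 2) assumes 2 records of
# 2x2 images per partition. Repeat the same pattern for "labels".
imgs = da.concatenate(
    [da.from_delayed(p, shape=(2, 2, 2), dtype="f8") for p in parts], axis=0)
shape = imgs.compute().shape
```

If partition sizes are not known in advance, they have to be computed once first, since dask.array needs concrete chunk shapes.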
3 votes, 2 answers

dask: return None or empty from delayed task

I would like to return an empty dataframe/None from a set of delayed tasks where parsing fails, e.g.: import dask.dataframe as dd import dask.delayed def _read(self, filename): try: df = pd.read_csv(filename, sep=';', decimal=',',…
morganics
3 votes, 1 answer

Dask.delayed doesn't .compute() inside class

I have a folder containing 497 pandas dataframes stored as .parquet files. The folder's total size is 7.6 GB. I'm trying to develop a simple trading system, so I created 2 different classes; the main one is the Portfolio one, and this class then…
ilpomo
3 votes, 1 answer

Saving dataframe divisions to parquet with dask

I am currently trying to save and read information from dask to parquet files. But when saving a dataframe with dask's "to_parquet" and loading it again afterwards with "read_parquet", it seems like the division information gets…
lennart
3 votes, 2 answers

Dask: subset (or drop) rows from Dataframe by index

I'd like to take a subset of rows of a Dask dataframe based on a set of index keys. (Specifically, I want to find rows of ddf1 whose index is not in the index of ddf2.) Both cache.drop([overlap_list]) and diff = cache[should_keep_bool_array] either…
terry87
3 votes, 4 answers

Compiling Executable with dask or joblib multiprocessing with cython results in errors

I'm converting some serially processed Python jobs to multiprocessing with dask or joblib. Sadly, I need to work on Windows. When running from within IPython, or from the command line invoking the py-file with python, everything runs fine. When…
Bastian Ebeling
3 votes, 1 answer

mapping a function of variable execution time over a large collection with Dask

I have a large collection of entries E and a function f: E --> pd.DataFrame. The execution time of function f can vary drastically for different inputs. Finally all DataFrames should be concatenated into a single DataFrame. The situation I'd like to…
Thomas Moerman
3 votes, 1 answer

Dask agg functions pickle error

I have the following dask dataframe @timestamp datetime64[ns] @version object dst object dst_port object host …
Apostolos
3 votes, 2 answers

Filtering grouped df in Dask

Related to this similar question for Pandas: filtering grouped df in pandas. Action: eliminate groups based on an expression applied to a different column than the groupby column. Problem: filter is not implemented for grouped…
Tom Hemmes
3 votes, 1 answer

Merging Dask dataframes imported from csv files

I need to import large datasets and merge them. I know there are other questions similar to this, but I could not find an answer specific to my problem. It appears that with dask I was able to read the large datasets into a dataframe, but I could not…
jax
3 votes, 0 answers

Dask get_dummies splitting

I'm in the process of migrating my pandas operations to dask. When I was using pandas, the following line worked successfully: triggers = df.triggers.str.get_dummies(','). It split the string at the commas before taking them to be dummy variables. For…
sachinruk
3 votes, 1 answer

Python Dask Running Bag operations in parallel

I am trying to run a series of operations on a JSON file using Dask and read_text, but I find that when I check the Linux system monitor, only one core is ever used, at 100%. How do I know if the operations I am performing on a Dask Bag are able to be…
Billiam
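A small sketch of forcing parallelism, with `from_sequence` standing in for `read_text` (which often yields one partition per input file): make sure there are at least as many partitions as cores, and pin the processes scheduler so pure-Python work is not serialized by the GIL.

```python
import dask.bag as db

# Bags use the multiprocessing scheduler by default, but a single
# partition still serializes everything onto one core. More partitions
# than workers lets every core pick up work.
bag = db.from_sequence(range(1000), npartitions=8)
total = bag.map(lambda x: x * x).sum().compute(
    scheduler="processes", num_workers=2)
```

With a real `db.read_text(...)` source, `blocksize=` or an explicit `.repartition(...)` achieves the same partition count.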