Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub (tutorial): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3 votes, 1 answer

dask : How to read CSV files into a DataFrame from Microsoft Azure Blob

S3Fs is a Pythonic file interface to S3; does Dask have any Pythonic interface to Azure Storage Blob? Python SDKs for Azure Storage Blob provide ways to read and write blobs, but the interface requires the file to be downloaded to the local…
3 votes, 1 answer

dask groupby apply then merge back to dataframe

How would I go about creating a new column that is the result of a groupby-and-apply on another column, while keeping the order of the dataframe (or at least being able to sort it back)? Example: I want to normalize a signal column by group. import…
AlexFC
3 votes, 1 answer

Reading LAZ to Dask dataframe using delayed loading

Action: reading multiple LAZ point-cloud files into a Dask DataFrame. Problem: unzipping LAZ (compressed) to LAS (uncompressed) requires a lot of memory; varying file sizes and the multiple processes created by Dask result in MemoryErrors. Attempts: I tried…
Tom Hemmes
3 votes, 2 answers

Dask Bag of dicts to Dask array

I need to convert a dask.Bag of {'imgs': np.array(img_list), 'labels': np.array(label_list)} into two separate dask.Array objects. Why did I create a Bag instead of going directly to an Array? Because I'm processing that Bag multiple times through map(); didn't…
w00dy
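One possible route, sketched on toy records with assumed per-partition shapes: stack the field inside each partition, then wrap each partition with `da.from_delayed` and concatenate, so partition boundaries become chunk boundaries.

```python
import numpy as np
import dask.bag as db
import dask.array as da

records = [{"imgs": np.ones((2, 2)), "labels": np.array([1])} for _ in range(4)]
bag = db.from_sequence(records, npartitions=2)

def imgs_of(partition):
    # Stack one field of every record in this partition into one array.
    return np.stack([rec["imgs"] for rec in partition])

parts = bag.map_partitions(imgs_of).to_delayed()
# Chunk shapes must be declared up front; (2, 2, 2) assumes 2 records of
# 2x2 images per partition. Repeat the same pattern for "labels".
imgs = da.concatenate(
    [da.from_delayed(p, shape=(2, 2, 2), dtype="f8") for p in parts], axis=0)
shape = imgs.compute().shape
```

If partition sizes are not known in advance, they have to be computed once first, since dask.array needs concrete chunk shapes.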
3 votes, 2 answers

dask: return None or empty from delayed task

I would like to return an empty dataframe/None from a set of delayed tasks where parsing fails, e.g.: import dask.dataframe as dd import dask.delayed def _read(self, filename): try: df = pd.read_csv(filename, sep=';', decimal=',',…
morganics
3 votes, 1 answer

Dask.delayed doesn't .compute() inside class

I have a folder containing 497 pandas dataframes stored as .parquet files. The folder's total size is 7.6 GB. I'm trying to develop a simple trading system, so I created 2 different classes; the main one is the Portfolio one, and this class then…
ilpomo
3 votes, 1 answer

Saving dataframe divisions to parquet with dask

I am currently trying to save and read information from dask to parquet files. But when saving a dataframe with dask's "to_parquet" and loading it again afterwards with "read_parquet", it seems like the division information gets…
lennart
3 votes, 2 answers

Dask: subset (or drop) rows from Dataframe by index

I'd like to take a subset of rows of a Dask dataframe based on a set of index keys. (Specifically, I want to find rows of ddf1 whose index is not in the index of ddf2.) Both cache.drop([overlap_list]) and diff = cache[should_keep_bool_array] either…
terry87
3 votes, 4 answers

Compiling Executable with dask or joblib multiprocessing with cython results in errors

I'm converting some serially processed Python jobs to multiprocessing with dask or joblib. Sadly, I need to work on Windows. When running from within IPython, or from the command line invoking the py-file with python, everything runs fine. When…
Bastian Ebeling
3 votes, 1 answer

mapping a function of variable execution time over a large collection with Dask

I have a large collection of entries E and a function f: E --> pd.DataFrame. The execution time of function f can vary drastically for different inputs. Finally all DataFrames should be concatenated into a single DataFrame. The situation I'd like to…
Thomas Moerman
3 votes, 1 answer

Dask agg functions pickle error

I have the following dask dataframe @timestamp datetime64[ns] @version object dst object dst_port object host …
Apostolos
3 votes, 2 answers

Filtering grouped df in Dask

Related to this similar question for Pandas: filtering grouped df in pandas. Action: eliminate groups based on an expression applied to a different column than the groupby column. Problem: filter is not implemented for grouped…
Tom Hemmes
3 votes, 1 answer

Merging Dask dataframes imported from csv files

I need to import large datasets and merge them. I know there are other questions similar to this, but I could not find an answer specific to my problem. It appears that with dask I was able to read the large datasets into a dataframe, but I could not…
jax
3 votes, 0 answers

Dask get_dummies splitting

I'm in the process of migrating my pandas operations to dask. When I was using pandas, the following line worked successfully: triggers = df.triggers.str.get_dummies(','). It split the string at the commas before taking them to be dummy variables. For…
sachinruk
3 votes, 1 answer

Python Dask Running Bag operations in parallel

I am trying to run a series of operations on a JSON file using Dask and read_text, but I find that when I check the Linux system monitor, only one core is ever used, at 100%. How do I know if the operations I am performing on a Dask Bag are able to be…
Billiam
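A small sketch of forcing parallelism, with `from_sequence` standing in for `read_text` (which often yields one partition per input file): make sure there are at least as many partitions as cores, and pin the processes scheduler so pure-Python work is not serialized by the GIL.

```python
import dask.bag as db

# Bags use the multiprocessing scheduler by default, but a single
# partition still serializes everything onto one core. More partitions
# than workers lets every core pick up work.
bag = db.from_sequence(range(1000), npartitions=8)
total = bag.map(lambda x: x * x).sum().compute(
    scheduler="processes", num_workers=2)
```

With a real `db.read_text(...)` source, `blocksize=` or an explicit `.repartition(...)` achieves the same partition count.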