Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
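
For example, the dataframe collection exposes a pandas-like API over data that may not fit in memory. A minimal sketch (the file pattern and column names are illustrative):

```python
import dask.dataframe as dd

# Lazily point at many CSVs; nothing is read yet.
df = dd.read_csv("data/2021-*.csv")           # hypothetical file pattern

# Familiar pandas-style operations build a task graph instead of running eagerly.
result = df.groupby("name")["amount"].mean()

# compute() hands the graph to a scheduler, which runs it in parallel.
print(result.compute())
```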

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3 votes, 1 answer

Is it possible to get the intersection of sets using dask?

I have a big dataset (50 million rows) in which I need to do some row-wise computations, like getting the intersection of two sets (each in a different column), e.g. col_1:{1587004, 1587005, 1587006, 1587007} col_2:{1587004,…
Ivo Leist
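
One possible approach, sketched under the assumption that the two columns hold Python sets inside a Dask dataframe (the example frame and column names are illustrative): run a plain pandas apply inside each partition via map_partitions.

```python
import pandas as pd
import dask.dataframe as dd

# Illustrative frame with two object columns holding Python sets.
pdf = pd.DataFrame({
    "col_1": [{1587004, 1587005, 1587006, 1587007}, {1, 2, 3}],
    "col_2": [{1587004, 1587010}, {2, 4}],
})
ddf = dd.from_pandas(pdf, npartitions=2)

def intersect_rows(part):
    # Plain pandas inside each partition: row-wise set intersection.
    return part.apply(lambda r: r["col_1"] & r["col_2"], axis=1).rename("intersection")

result = ddf.map_partitions(intersect_rows, meta=("intersection", "object"))
print(result.compute())
```
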
3 votes, 1 answer

xarray with dask sel is slow

A series of about 90 netCDF files, each around 27 MB, opened with xarray's open_mfdataset takes a long time to load a small space-time selection. Chunking the dimensions yields only marginal gains. decode_cf=True either inside the function or separately has…
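
A hedged sketch of the usual mitigation: pass chunks to open_mfdataset so the selection only touches a few chunks, and load only the subset. The variable names, coordinates, and chunk sizes below are assumptions, not taken from the question:

```python
import xarray as xr

# Open ~90 files lazily with Dask; chunk sizes here are illustrative.
ds = xr.open_mfdataset(
    "data/*.nc",
    combine="by_coords",
    chunks={"time": 100},   # align chunks with how the data will be selected
    parallel=True,          # read file metadata in parallel via dask.delayed
)

# Select a small space-time window, then pull only that piece into memory.
subset = ds["temperature"].sel(
    time=slice("2015-01-01", "2015-01-10"),
    lat=slice(40, 45),
    lon=slice(-10, 0),
).load()
```
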
3 votes, 1 answer

Is Dask's read_json() parallel?

I have the code below. It uses dask.distributed to read 100 JSON files (Workers: 5, Cores: 5, Memory: 50.00 GB): from dask.distributed import Client import dask.dataframe as dd client = Client('xxxxxxxx:8786') df =…
MT467
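
dd.read_json reads a glob of files into separate partitions, so with a distributed client the files are parsed in parallel across the workers. A minimal sketch (the scheduler address, paths, and options are illustrative):

```python
from dask.distributed import Client
import dask.dataframe as dd

client = Client("scheduler-address:8786")   # hypothetical scheduler address

# One partition per file (or more); each partition is parsed on a worker.
df = dd.read_json("data/records-*.json", orient="records", lines=True)
print(df.npartitions)

result = df.describe().compute()
```
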
3 votes, 1 answer

How can I get the result of a Dask compute on a different machine than the one that submitted it?

I am using Dask behind a Django server and the basic setup I have is summarised here: https://github.com/MoonVision/django-dask-demo/ where the Dask client can be found here:…
Matt Nicolls
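
One pattern for this, sketched under assumptions (the scheduler address, input, and dataset name are made up): publish the future on the cluster from the submitting client, then fetch it by name from a client on the other machine.

```python
from dask.distributed import Client
import dask.dataframe as dd

# Machine A: submit the work and publish the resulting future under a name.
client_a = Client("scheduler-address:8786")       # hypothetical address
df = dd.read_csv("s3://bucket/data-*.csv")        # hypothetical input
future = client_a.compute(df.sum())
client_a.publish_dataset(daily_sum=future)        # result stays on the cluster

# Machine B: connect to the same scheduler and retrieve it by name.
client_b = Client("scheduler-address:8786")
result = client_b.get_dataset("daily_sum").result()
print(result)
```
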
3 votes, 1 answer

How to get the name of a csv file causing an error in dask.read_csv?

My objective is to parallelize reading many (500+) CSV files containing measurement data. To do so I pass a list of paths (source_files) to a synchronous client. Additionally, I have specified dtypes and column names (order_list). df =…
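
One way to surface the offending file name, sketched with dask.delayed. The paths, dtypes, and column names below are placeholders standing in for the question's source_files, dtypes, and order_list:

```python
import pandas as pd
import dask.dataframe as dd
from dask import delayed

source_files = ["data/one.csv", "data/two.csv"]        # placeholder paths
dtypes = {"sensor": "object", "value": "float64"}      # placeholder dtypes
order_list = ["sensor", "value"]                       # placeholder column names

@delayed
def read_one(path):
    try:
        return pd.read_csv(path, dtype=dtypes, names=order_list)
    except Exception as exc:
        # Re-raise with the offending file name attached.
        raise RuntimeError(f"failed to read {path}") from exc

df = dd.from_delayed([read_one(p) for p in source_files])
```
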
3 votes, 0 answers

Dask hangs when using dask_xgboost train method

I am trying to reproduce the dask xgboost example from the dask-ml docs at http://ml.dask.org/examples/xgboost.html. Unfortunately, Dask doesn't seem to complete the training and I'm having a hard time tracking down the meaning of the errors and…
chicagoson
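
For reference, a hedged sketch of the training call from that example (the scheduler address, file pattern, and column names are assumptions); hangs of this kind are often connectivity problems between the workers rather than a problem with the call itself:

```python
from dask.distributed import Client
import dask.dataframe as dd
import dask_xgboost as dxgb

client = Client("scheduler-address:8786")    # hypothetical scheduler
df = dd.read_csv("data/train-*.csv")         # hypothetical training data
labels = df["target"]
data = df.drop(columns=["target"])

params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1}

# dask_xgboost starts an XGBoost worker alongside each Dask worker and
# hands each one its local partitions of `data` and `labels`.
bst = dxgb.train(client, params, data, labels, num_boost_round=50)
```
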
3 votes, 1 answer

Using Dask on an apply returning several columns (i.e. a DataFrame)

I'm trying to use dask on an apply with a function that outputs 5 floats. I'll simplify with an example here. def func1(row, param): return float(row.Val1) * param, float(row.Val1) * np.power(param, 2) data = pd.DataFrame(np.array([["A01", 12],…
mbahin
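
A sketch of one way to get named columns back from the apply: return a pandas Series from the function and give Dask an explicit meta. The output column names and param value here are illustrative:

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

def func1(row, param):
    # Returning a Series yields named output columns instead of a tuple.
    return pd.Series({
        "out1": float(row.Val1) * param,
        "out2": float(row.Val1) * np.power(param, 2),
    })

data = pd.DataFrame(np.array([["A01", 12], ["A02", 15]]),
                    columns=["Key", "Val1"])
ddf = dd.from_pandas(data, npartitions=2)

# meta tells Dask the columns/dtypes of the result so the graph is built lazily.
out = ddf.apply(func1, axis=1, param=3, meta={"out1": "f8", "out2": "f8"})
print(out.compute())
```
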
3 votes, 1 answer

DASK Metadata mismatch found in 'from_delayed' JSON file

I'm just starting my adventure with Dask and I'm learning on an example dataset in JSON format. I know that this is not the easiest data format in the world for a beginner :) I have a dataset in JSON format. I loaded the data via…
AWL
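
A hedged sketch of the usual fix: normalise every file to the same columns/dtypes inside the delayed loader and pass an explicit meta to from_delayed, so a deviating file fails loudly instead of surfacing later as a metadata mismatch. The file paths and schema here are illustrative:

```python
import pandas as pd
import dask.dataframe as dd
from dask import delayed

SCHEMA = {"id": "int64", "name": "object", "value": "float64"}  # illustrative

@delayed
def load_json(path):
    df = pd.read_json(path, lines=True)
    # Force every partition into the same column order and dtypes.
    return df.reindex(columns=list(SCHEMA)).astype(SCHEMA)

paths = ["data/part-0.json", "data/part-1.json"]   # placeholder files
meta = pd.DataFrame({c: pd.Series(dtype=t) for c, t in SCHEMA.items()})
ddf = dd.from_delayed([load_json(p) for p in paths], meta=meta)
```
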
3 votes, 1 answer

Actors and dask-workers

client = Client('127.0.0.1:8786', direct_to_workers=True) future1 = client.submit(Counter, workers='ninja', actor=True) counter1 = future1.result() print(counter1) All is well, but what if the client gets restarted? How do I…
chak
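
For context, a minimal sketch of the actor pattern the snippet uses, with the Counter class (which the question omits) defined along the lines of the Dask documentation; it does not address re-acquiring the actor after a client restart:

```python
from dask.distributed import Client

class Counter:
    """Tiny stateful actor; its state lives on a single worker."""
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

client = Client("127.0.0.1:8786", direct_to_workers=True)

future = client.submit(Counter, actor=True)   # create the actor on a worker
counter = future.result()                     # an Actor proxy
print(counter.increment().result())           # method calls return ActorFutures
```
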
3 votes, 1 answer

How can I read many large .7z files containing many CSV files?

I have many .7z files, each containing many large CSV files (more than 1 GB). How can I read these in Python (especially with pandas and Dask dataframes)? Should I change the compression format to something else?
Eghbal
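
Since pandas and Dask cannot read .7z archives directly, one workable sketch is to decompress first (here with the third-party py7zr package) and then hand the extracted CSVs to Dask; paths and block size are illustrative:

```python
import glob
import pathlib

import py7zr                     # pip install py7zr
import dask.dataframe as dd

out_dir = pathlib.Path("extracted")
out_dir.mkdir(exist_ok=True)

# Extract every archive once; this part is plain, sequential Python.
for archive in glob.glob("archives/*.7z"):          # hypothetical location
    with py7zr.SevenZipFile(archive, mode="r") as z:
        z.extractall(path=out_dir)

# Dask then reads the CSVs in parallel, splitting large files into blocks.
df = dd.read_csv(str(out_dir / "*.csv"), blocksize="128MB")
```

Re-compressing as gzip or zip (which pandas and Dask read natively via the compression= option) avoids the extraction step, at the cost of losing the ability to split compressed files into blocks.
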
3 votes, 1 answer

Multiple merge in Dask and field names

I am trying to merge multiple pandas dataframes onto a large Dask dataframe with fields ["a_id", "b_id", "c_id"]. Each pandas dataframe "A", "B", and "C" has a unique field ("a_id", "b_id", and "c_id") that joins it to the Dask dataframe. "B" and…
triphook
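
A sketch of one way to keep the field names apart: rename the value columns before merging, then chain the merges on their respective keys. The frames below are illustrative stand-ins for the question's A, B, C and the large Dask frame:

```python
import pandas as pd
import dask.dataframe as dd

big = dd.from_pandas(
    pd.DataFrame({"a_id": [1, 2], "b_id": [10, 20], "c_id": [100, 200]}),
    npartitions=2,
)
A = pd.DataFrame({"a_id": [1, 2], "val": [0.1, 0.2]}).rename(columns={"val": "val_a"})
B = pd.DataFrame({"b_id": [10, 20], "val": [1.1, 1.2]}).rename(columns={"val": "val_b"})
C = pd.DataFrame({"c_id": [100, 200], "val": [2.1, 2.2]}).rename(columns={"val": "val_c"})

# Renaming up front avoids the automatic _x/_y suffixes on colliding columns.
out = (
    big.merge(A, on="a_id", how="left")
       .merge(B, on="b_id", how="left")
       .merge(C, on="c_id", how="left")
)
print(out.compute())
```
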
3 votes, 1 answer

How do I share a large read-only object across Dask distributed workers

The problem: I'm trying to send a 2 GB CPython read-only object (which can be pickled) to Dask distributed workers via apply(). This ends up consuming a lot of memory for processes/threads (14+ GB). Is there a way to load the object only once into memory…
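
One commonly suggested pattern, sketched under assumptions (the scheduler address, object, and frame are placeholders): scatter the object once with broadcast=True and pass the resulting future into the tasks instead of the object itself, so it is not re-serialised into every task.

```python
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client("scheduler-address:8786")        # hypothetical scheduler

big_object = {"lookup": set(range(10_000))}      # stand-in for the 2 GB object

# One copy per worker; tasks receive the future, not a fresh pickle each time.
big_future = client.scatter(big_object, broadcast=True)

ddf = dd.from_pandas(pd.DataFrame({"x": range(100)}), npartitions=4)

def use_it(part, shared):
    # `shared` arrives as the already-deserialised object on the worker.
    return part["x"].isin(shared["lookup"])

result = ddf.map_partitions(use_it, big_future, meta=("x", "bool")).compute()
```
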
3 votes, 1 answer

Dask dataframe in requirements.txt?

I need to install a few python packages in a Docker container via requirements.txt, using pip install. One of the packages is dask. However, when installing it, it throws an error because it cannot find the package toolz. The question has been…
giosans
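
This typically happens when the bare dask package is installed without the extras that the dataframe module needs (toolz, partd, pandas, ...). A hedged sketch of the requirements.txt line that pulls them in:

```
# requirements.txt -- the extra installs the dataframe dependencies (toolz, partd, pandas, ...)
dask[dataframe]
# or, for everything including the distributed scheduler:
# dask[complete]
```
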
3 votes, 0 answers

How to read csv in parallel and write in Cassandra in parallel for achieving high throughput?

I have tried using execute, execute_async and execute_concurrent in Cassandra, but for reading 10M rows I could index them into Cassandra in no less than 55 minutes. Note that I had set the concurrent threads to 1000, tuned the YAML file's…
aviral sanjay
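
A hedged sketch of one way to parallelise both sides with Dask plus the DataStax driver (the contact point, keyspace, table, and column names are made up): read the CSV in blocks and have each partition write its own rows with execute_concurrent_with_args.

```python
import pandas as pd
import dask.dataframe as dd
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

def write_partition(part):
    # One connection per partition; each Dask worker writes its rows concurrently.
    cluster = Cluster(["cassandra-host"])            # hypothetical contact point
    session = cluster.connect("my_keyspace")         # hypothetical keyspace
    insert = session.prepare(
        "INSERT INTO measurements (id, value) VALUES (?, ?)"
    )
    rows = list(part[["id", "value"]].itertuples(index=False, name=None))
    execute_concurrent_with_args(session, insert, rows, concurrency=100)
    cluster.shutdown()
    return pd.Series([len(part)], name="written")

ddf = dd.read_csv("data/rows-*.csv", blocksize="64MB")   # hypothetical input
written = ddf.map_partitions(write_partition, meta=("written", "int64")).compute()
print(int(written.sum()))
```
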
3 votes, 2 answers

Euclidean distance calculation using Python and Dask

I'm attempting to identify elements in a Euclidean distance matrix that fall under a certain threshold. I then take the positional arguments for this search and use them to compare elements in a second array (for the sake of demonstration this array…
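
A sketch of the thresholding step with dask.array (the point set, chunking, and threshold are illustrative): build the pairwise distance matrix lazily by broadcasting and let argwhere return the positions below the threshold.

```python
import dask.array as da

# Illustrative 2-D point set, chunked so the pairwise matrix is built blockwise.
x = da.random.random((10_000, 3), chunks=(2_000, 3))

# Pairwise Euclidean distances via broadcasting: shape (10_000, 10_000), lazy.
diff = x[:, None, :] - x[None, :, :]
dist = da.sqrt((diff ** 2).sum(axis=-1))

# (i, j) index pairs where the distance falls under the threshold.
threshold = 0.05
positions = da.argwhere(dist < threshold).compute()
```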