Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers (see the sketch below).
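
Both components show up in a few lines of code. A minimal sketch (the array size and the delayed function are arbitrary examples):

    import dask
    import dask.array as da

    # Collection component: a parallel, larger-than-memory array that
    # mirrors the NumPy interface, split into 1000x1000 chunks.
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
    total = x.mean()  # builds a task graph; nothing runs yet

    # Scheduling component: dask.delayed turns ordinary functions
    # into tasks on the same dynamic scheduler.
    @dask.delayed
    def double(n):
        return 2 * n

    print(total.compute(), double(21).compute())
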

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3 votes · 2 answers

Cleanest way to support xarray, dask, and numpy arrays in one function

I have a function that accepts multiple 2D arrays and creates two new arrays with the same shape. It was originally written to only support numpy arrays, but was "hacked" to support dask arrays if a "chunks" attribute was seen. A user who was using…
djhoese
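
One way to avoid the "chunks"-attribute hack (a sketch, not necessarily the thread's accepted answer) is to write the core logic against the NumPy API and let duck typing dispatch; normalize below is a hypothetical stand-in for the asker's function:

    import numpy as np

    def normalize(arr):
        # Works unchanged for numpy.ndarray, dask.array.Array, and
        # xarray.DataArray: all three implement mean(), std(), and
        # the arithmetic operators, so no type sniffing is needed.
        return (arr - arr.mean()) / arr.std()

    print(normalize(np.arange(5.0)))
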
3 votes · 0 answers

Specify how to partition a dask dataframe?

I have a pandas df that's indexed by id and date. I would like to run some regressions for each id in parallel using dask. I know dask splits the df into N partitions, but is there a way to force it to split by the id column? This way when I do…
Alex
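
A sketch of one approach (the frame below is a stand-in for the asker's data): set_index shuffles the rows so that a given id never spans two partitions, after which per-id work can run partition-locally.

    import pandas as pd
    import dask.dataframe as dd

    # Stand-in for the asker's pandas df indexed by (id, date).
    pdf = pd.DataFrame({
        "id": [1, 1, 2, 2],
        "date": pd.date_range("2020-01-01", periods=4),
        "y": [1.0, 2.0, 3.0, 4.0],
    })

    # set_index("id") repartitions so all rows with the same id land
    # in the same partition; dask never splits one index value.
    ddf = dd.from_pandas(pdf, npartitions=2).set_index("id")

    # Per-id computation, e.g. group sizes, stays within partitions.
    print(ddf.map_partitions(lambda p: p.groupby(level=0).size()).compute())
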
3 votes · 1 answer

How do I get adaptive dask workers to run some code on startup?

I'm creating a dask scheduler using dask-kubernetes and putting it into adaptive mode. from dask_kubernetes import KubeCluster cluster = KubeCluster() cluster.adapt(minimum=0, maximum=40) I need each worker to run some setup code when they are…
Jacob Tomlinson
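
One mechanism that also covers workers the adaptive scaler starts later (a sketch; the setup body is a hypothetical placeholder) is Client.register_worker_callbacks, which replays the callback on every worker that joins:

    from dask.distributed import Client
    from dask_kubernetes import KubeCluster

    cluster = KubeCluster()              # as in the question
    cluster.adapt(minimum=0, maximum=40)
    client = Client(cluster)

    def setup():
        # Runs once on each worker, including workers spawned by
        # the adaptive scaler after registration.
        import os
        os.environ["MY_FLAG"] = "1"      # hypothetical setup work

    client.register_worker_callbacks(setup)
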
3 votes · 1 answer

Save dask dataframe to csv and find out its length without computing twice

Say I have some dask dataframe. I'd like to do some operations on it, then save it to csv and print its length. As I understand it, the following code will make dask compute df twice, am I right? df = dd.read_csv('path/to/file',…
elfinorr
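
A sketch of the usual trick: ask to_csv for delayed write tasks with compute=False, then evaluate them and the row count in a single dask.compute call, so the shared graph runs once (paths are placeholders):

    import dask
    import dask.dataframe as dd

    df = dd.read_csv("path/to/file")        # placeholder path
    # ... operations on df ...

    # compute=False returns delayed write tasks instead of running.
    writes = df.to_csv("out-*.csv", compute=False)

    # One compute call shares intermediates between the CSV write
    # and the row count, so df is evaluated only once.
    _, n_rows = dask.compute(writes, df.shape[0])
    print(n_rows)
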
3 votes · 1 answer

How to reliably clean up dask scheduler/worker

I'm starting up a dask cluster in an automated way by ssh-ing into a bunch of machines and running dask-worker. I noticed that I sometimes run into problems when processes from a previous experiment are still running. What's the best way to clean up…
John
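
A sketch of one defensive pattern (the scheduler address is a placeholder): client.shutdown() tears down the scheduler and all of its workers, so stale processes don't linger between experiments.

    from dask.distributed import Client

    client = Client("tcp://scheduler-host:8786")  # placeholder

    # Asks the scheduler to close every worker and then itself, so
    # no stale dask-worker processes survive the experiment.
    client.shutdown()

On the worker side, starting with dask-worker scheduler-host:8786 --death-timeout 60 makes an orphaned worker exit on its own if it cannot reach the scheduler for 60 seconds.
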
3 votes · 1 answer

Converting a Dask column into new Dask column of type datetime

I have an unparsed column in a dask dataframe (df) that I am using pandas to convert to datetime and put into a new column in the dask dataframe. However, it breaks because column assignment doesn't support type DatetimeIndex. df['New Column'] =…
Usherwood
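
A sketch using dask's own converter, which returns a lazy dask Series that column assignment accepts (the frame, column names, and date format are stand-ins):

    import pandas as pd
    import dask.dataframe as dd

    # Tiny stand-in for the asker's dataframe.
    df = dd.from_pandas(
        pd.DataFrame({"Unparsed": ["01/02/2018", "03/04/2018"]}),
        npartitions=1,
    )

    # Unlike pd.to_datetime on a computed column (which yields a
    # DatetimeIndex), dd.to_datetime stays lazy and returns a Series.
    df["New Column"] = dd.to_datetime(df["Unparsed"], format="%d/%m/%Y")
    print(df.compute())
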
3 votes · 1 answer

Dask - Drop duplicate index MemoryError

I'm getting a MemoryError when I try to drop duplicate timestamps on a large dataframe with the following code. import dask.dataframe as dd path = f's3://{container_name}/*' ddf = dd.read_parquet(path, storage_options=opts,…
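
A sketch of one mitigation, with a hypothetical column name: de-duplicating on a column with split_out keeps the result spread across many partitions instead of concatenating everything into a single in-memory partition.

    import dask.dataframe as dd

    ddf = dd.read_parquet("s3://bucket/*")  # placeholder path

    # Moving the timestamp out of the index allows drop_duplicates
    # with split_out, which shards the result over many output
    # partitions rather than building one giant one.
    ddf = (
        ddf.reset_index()
           .drop_duplicates(subset=["timestamp"], split_out=ddf.npartitions)
           .set_index("timestamp")
    )
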
3 votes · 1 answer

Masking in Dask

I was just wondering if someone could help show me how to apply functions such as "sum" or "mean" on masked arrays using dask. I wish to calculate the sum / mean of the array using only the values where there is no mask. Code: import dask.array as…
Chen
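
A sketch using a plain boolean mask (the sizes are arbitrary): selecting with the mask yields a lazy array of only the unmasked values, which sum and mean then reduce over.

    import dask.array as da

    arr = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))
    keep = arr > 0.5        # True marks the values to include

    # Boolean selection returns a lazy 1-d array of kept values.
    print(arr[keep].sum().compute(), arr[keep].mean().compute())
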
3 votes · 0 answers

Implementation of a recursive function using dask.delayed

How can I implement merge sort using dask.delayed or some other dask API, so that it becomes faster through parallelism?
Dhruv Kumar
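
A sketch of one way to express it (the threshold is an arbitrary tuning knob; below it, plain sorted() is cheaper than scheduling more tasks, and for ordinary Python lists the task overhead often outweighs the parallel gain):

    from dask import delayed

    @delayed
    def merge(left, right):
        # Standard two-way merge of two sorted lists.
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                out.append(left[i]); i += 1
            else:
                out.append(right[j]); j += 1
        out += left[i:] + right[j:]
        return out

    def merge_sort(xs, threshold=10_000):
        if len(xs) <= threshold:
            return delayed(sorted)(xs)   # leaf task: sort serially
        mid = len(xs) // 2
        return merge(merge_sort(xs[:mid]), merge_sort(xs[mid:]))

    result = merge_sort(list(range(50_000, 0, -1))).compute()
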
3 votes · 1 answer

Dask Apply of Python Function

I have a df: id log 0 24 2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0* 1 25 2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0* I…
OverflowingTheGlass
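
A sketch of applying a parser per row; parse_log is a hypothetical parser for the '*'-delimited format shown, and meta tells dask the output type without computing:

    import pandas as pd
    import dask.dataframe as dd

    # Stand-in for the asker's frame (log strings abbreviated).
    df = dd.from_pandas(
        pd.DataFrame({"id": [24, 25], "log": ["2*C316*...", "2*C316*..."]}),
        npartitions=1,
    )

    def parse_log(s):
        # Hypothetical parser for the delimited log format.
        return s.split("*")

    df["parsed"] = df["log"].apply(parse_log, meta=("parsed", "object"))
    print(df.compute())
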
3 votes · 0 answers

Fair and realistic comparison using dask

In order to get a better idea of the dask library in Python, I am trying to make a fair comparison between using dask and not. I used h5py to create a big dataset, which was later used to compute the mean along one axis as a numpy-style operation. I…
Fence
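
A sketch of one way to time the two sides fairly, reading the same h5py dataset eagerly and through dask (the file name, dataset name, and chunk size are placeholders, and the chunk size strongly affects the result):

    import time
    import h5py
    import numpy as np
    import dask.array as da

    with h5py.File("data.h5", "r") as f:       # placeholder file
        dset = f["x"]                          # placeholder dataset

        t0 = time.perf_counter()
        m_np = np.asarray(dset).mean(axis=0)   # eager NumPy baseline
        t_np = time.perf_counter() - t0

        arr = da.from_array(dset, chunks=(1_000, -1))
        t0 = time.perf_counter()
        m_da = arr.mean(axis=0).compute()      # parallel dask version
        t_da = time.perf_counter() - t0

    print(f"numpy {t_np:.2f}s, dask {t_da:.2f}s")
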
3 votes · 1 answer

File Not Found Error in Dask program run on cluster

I have 4 machines: M1, M2, M3, and M4. The scheduler, client, and a worker run on M1. I've put a csv file on M1; the rest of the machines are workers. When I run the program with read_csv in dask, it gives me a file-not-found error.
Dhruv Kumar
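
The usual cause is that read_csv tasks execute on workers, which cannot see a path that exists only on M1. A sketch of one workaround (the address and path are placeholders): read the file where it lives and scatter the data into the cluster; a shared filesystem or an s3:// path visible to every worker avoids the issue entirely.

    import pandas as pd
    from dask.distributed import Client

    client = Client("tcp://M1:8786")      # placeholder scheduler

    # Read on the machine that actually has the file, then ship
    # the data into cluster memory as a future.
    pdf = pd.read_csv("/data/input.csv")  # placeholder path
    [future] = client.scatter([pdf])
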
3 votes · 1 answer

Dask: set multiprocessing method from Python

Is there a way to set the multiprocessing method from Python? I do not see a method in the Client() API docs of Dask.distributed that indicates how to set this property. Update: For example, is there: client =…
ericmjl
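
For the local multiprocessing scheduler (not the distributed Client the question asks about), the start method is exposed through dask's config system; a sketch, with an arbitrary example collection:

    import dask
    import dask.array as da

    x = da.random.random((1_000, 1_000), chunks=(250, 250))

    # "spawn" and "forkserver" are the usual alternatives to "fork".
    with dask.config.set({"multiprocessing.context": "spawn"}):
        result = x.sum().compute(scheduler="processes")
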
3 votes · 3 answers

Dask read_json metadata mismatch

I'm trying to load json files into a dask df. files = glob.glob('**/*.json', recursive=True) df = dd.read_json(files, lines = False) There are some missing values in the data, and some of the files have extra columns. Is there a way to specify a…
Maria
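
A sketch of one way to force a uniform schema, with a hypothetical column list: load each file through delayed, reindex every frame to the expected columns, and hand from_delayed an explicit meta:

    import glob
    import pandas as pd
    import dask.dataframe as dd
    from dask import delayed

    files = glob.glob("**/*.json", recursive=True)
    columns = ["a", "b", "c"]      # hypothetical expected schema

    @delayed
    def load(path):
        df = pd.read_json(path)
        # Adds missing columns (as NaN) and drops extras, so every
        # partition matches the declared meta.
        return df.reindex(columns=columns)

    meta = pd.DataFrame(columns=columns)
    ddf = dd.from_delayed([load(f) for f in files], meta=meta)
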
3 votes · 0 answers

Dask scheduler behavior while reading/retrieving large datasets

This is a follow-up to this question. I'm experiencing problems with persisting a large dataset in distributed memory. I have a scheduler running on one machine and 8 workers each running on their own machines connected by 40 gigabit ethernet and a…
A.C.