Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
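To make the two components above concrete, here is a minimal sketch with a toy dataframe standing in for real data: a dask collection builds a task graph lazily, and the dynamic task scheduler executes it when compute() is called.

```python
import pandas as pd
import dask.dataframe as dd

# a "big data" collection: a dask dataframe wrapping a small pandas frame
pdf = pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=2)

# operations only build a task graph...
means = ddf.groupby("group")["value"].mean()

# ...which the dynamic task scheduler executes (threads by default)
print(means.compute())
```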

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub (tutorial): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
4
votes
1 answer

How to repartition a dataframe into fixed-size partitions?

I have a dask dataframe created from delayed functions, which consists of randomly sized partitions. I would like to repartition the dataframe into chunks of size (approx) 10000. I can calculate the correct number of partitions with…
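One hedged sketch of the approach the question is after, assuming a row-count target of 10,000 and made-up data: len() triggers a count over the partitions, and repartition() balances along existing divisions, so the resulting sizes are only approximate.

```python
import math
import pandas as pd
import dask.dataframe as dd

# stand-in for the dataframe built from delayed functions
ddf = dd.from_pandas(pd.DataFrame({"x": range(100_000)}), npartitions=20)

target_rows = 10_000
nrows = len(ddf)                                 # runs a count over all partitions
npartitions = max(1, math.ceil(nrows / target_rows))

# coalesce into roughly equal pieces; exact sizes depend on the divisions
ddf = ddf.repartition(npartitions=npartitions)
print(ddf.npartitions)
```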
Dave Hirschfeld
  • 768
  • 2
  • 6
  • 15
4
votes
1 answer

How can I sort values within a Dask dataframe group?

I have this code, which generates autoregressive terms within each unique combination of the variables 'grouping A' and 'grouping B': for i in range(1, 5): df.loc[:, 'var_' + str(i)] = df.sort_values(by='date') \ …
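A minimal sketch of sorting within each group using groupby().apply(); the column names mirror the question, the data is made up, and dask infers the output metadata (with a warning) because no meta= is passed. Shifting the sorted values inside the applied function would then give the per-group autoregressive terms.

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    "grouping A": ["x", "x", "y", "y"],
    "grouping B": [1, 1, 2, 2],
    "date": pd.to_datetime(["2020-01-02", "2020-01-01", "2020-01-04", "2020-01-03"]),
    "value": [1.0, 2.0, 3.0, 4.0],
})
ddf = dd.from_pandas(pdf, npartitions=2)

def sort_group(g):
    # each group arrives as a plain pandas DataFrame, so pandas sorting applies
    return g.sort_values("date")

result = ddf.groupby(["grouping A", "grouping B"]).apply(sort_group).compute()
print(result)
```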
user1566200
  • 1,826
  • 4
  • 27
  • 47
4
votes
1 answer

After using Dask pivot_table I lose the index column

I am losing the index column after I use pivot_table on a Dask DataFrame and save the data to a Parquet file. import dask.dataframe as dd import pandas as…
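A hedged sketch of one workaround, with made-up data: dask's pivot_table needs the columns axis to be categorical with known categories, and calling reset_index() turns the pivot index back into an ordinary column so it survives the write (a Parquet engine such as pyarrow or fastparquet is assumed to be installed).

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"id": [1, 1, 2, 2],
                    "key": ["a", "b", "a", "b"],
                    "val": [1.0, 2.0, 3.0, 4.0]})
ddf = dd.from_pandas(pdf, npartitions=1)

# the columns argument must be categorical with known categories
ddf = ddf.categorize(columns=["key"])
pivoted = ddf.pivot_table(index="id", columns="key", values="val")

# turn the pivot index back into an ordinary column before writing
pivoted = pivoted.reset_index()
pivoted.to_parquet("pivoted.parquet")
```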
keiv.fly
  • 3,343
  • 4
  • 26
  • 45
4
votes
1 answer

Dask get_dummies Does Not Transform Variable(s)

I'm trying to use get_dummies via dask but it does not transform my variable, nor does it error out: >>> import dask.dataframe as dd >>> import pandas as pd >>> df_d = dd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv') >>>…
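A likely explanation, sketched with made-up data: dd.get_dummies only expands categorical columns with known categories, and columns read from CSV arrive as object dtype, so they pass through unchanged; calling categorize() first makes the encoding happen.

```python
import pandas as pd
import dask.dataframe as dd

# stand-in for the CSV from the question
pdf = pd.DataFrame({"colour": ["red", "blue", "red", "green"]})
ddf = dd.from_pandas(pdf, npartitions=2)

# convert the object column to a known categorical, then encode it
ddf = ddf.categorize(columns=["colour"])
dummies = dd.get_dummies(ddf)
print(dummies.compute())
```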
Frank B.
  • 1,813
  • 5
  • 24
  • 44
4
votes
1 answer

Calling dask inside a dask-spawned process

We have a large project that comprises numerous tasks. We use a dask graph to schedule each task. A small sample of the graph is as follows. Note that dask is set to multiprocessing mode. dask_graph: universe: !!python/tuple…
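A minimal sketch of one task calling dask from inside a process spawned by the multiprocessing scheduler; it uses dask.delayed and the modern scheduler= keyword rather than the YAML graph from the question, and the function names are made up.

```python
from dask import delayed

def double(v):
    return v * 2

def inner(x):
    # this task itself builds and computes a small dask graph; the nested
    # compute() runs inside the worker process on a local scheduler
    return delayed(double)(x).compute(scheduler="synchronous")

if __name__ == "__main__":
    # the outer graph runs on the multiprocessing ("processes") scheduler
    result = delayed(inner)(10).compute(scheduler="processes")
    print(result)  # 20
```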
4
votes
1 answer

dask and parallel hdf5 writing

In my code I save multiple processed images (numpy arrays) in parallel to an hdf5 file using MPI (mpi4py/h5py). In order to do that, the file needs to be opened using the driver=mpio option. import h5py from mpi4py import…
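A minimal sketch of collective parallel writing with h5py's mpio driver; it assumes h5py built against parallel HDF5 and is meant to be launched with mpiexec, and the file and dataset names are made up.

```python
# run with, e.g.:  mpiexec -n 4 python write_parallel.py
import numpy as np
import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# the file must be opened collectively by all ranks with the mpio driver
with h5py.File("images.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("images", shape=(comm.Get_size(), 64, 64), dtype="f8")
    # each rank writes its own processed image (random data as a stand-in)
    dset[rank] = np.random.random((64, 64))
```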
s1mc0d3
  • 523
  • 2
  • 15
4
votes
1 answer

Pickle error when connecting to dask.distributed cluster

This is my simple code; I am trying to run my first program. from dask.distributed import Client client = Client('192.168.1.102:8786') def inc(x): return x + 1 x = client.submit(inc, 10) print(x.result()) When trying to run this code by using this…
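A common cause of pickle errors here is client code running at import time in a spawned process; another is mismatched dask/distributed versions between client, scheduler, and workers. A minimal sketch of the script with a __main__ guard (the scheduler address is the one from the question):

```python
from dask.distributed import Client

def inc(x):
    return x + 1

if __name__ == "__main__":
    # the guard lets this module be imported (and functions unpickled)
    # without re-running the connection logic
    client = Client("192.168.1.102:8786")
    fut = client.submit(inc, 10)
    print(fut.result())  # 11
```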
Sudip Das
  • 1,178
  • 1
  • 9
  • 24
4
votes
0 answers

Dask Dataframe Load by Index

I have a pandas dataframe with metadata on a bunch of text documents: meta_df = pd.read_csv( "./mdenny_copy_early2015/Metadata/Metadata/Bill_Metadata_1993-2014.csv", low_memory=False, parse_dates=['time'], …
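A hedged sketch of loading rows by index: with a sorted index and known divisions, .loc only touches the partitions covering the requested range. The file name and column are taken from the excerpt, and the date range is made up.

```python
import pandas as pd
import dask.dataframe as dd

# hypothetical metadata file and column, following the question
meta_df = pd.read_csv("Bill_Metadata_1993-2014.csv",
                      low_memory=False, parse_dates=["time"])

# a sorted index gives known divisions, which .loc can use to prune partitions
ddf = dd.from_pandas(meta_df.set_index("time").sort_index(), npartitions=8)

subset = ddf.loc["2000-01-01":"2000-12-31"].compute()
```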
saul.shanabrook
  • 3,068
  • 3
  • 31
  • 49
4
votes
2 answers

Read, process and concatenate pandas dataframes in parallel with dask

I'm trying to read and process a list of csv files in parallel and concatenate the output into a single pandas dataframe for further processing. My workflow consists of 3 steps: create a series of pandas dataframes by reading a list of csv files (all…
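A minimal sketch of the usual pattern: wrap the per-file read/process step in dask.delayed, stitch the pieces together with dd.from_delayed, and compute once at the end. Small CSVs are generated first so the sketch is self-contained; the processing step is a placeholder.

```python
import pandas as pd
import dask.dataframe as dd
from dask import delayed

# create a few small csv files so the example can run on its own
paths = []
for i in range(3):
    path = f"part_{i}.csv"
    pd.DataFrame({"x": range(5), "file": i}).to_csv(path, index=False)
    paths.append(path)

@delayed
def load_and_clean(path):
    df = pd.read_csv(path)
    df["x2"] = df["x"] ** 2          # per-file processing step
    return df

# lazily stitch the delayed frames into one dask dataframe, then materialise
ddf = dd.from_delayed([load_and_clean(p) for p in paths])
result = ddf.compute()                # a single pandas dataframe
print(len(result))                    # 15 rows, read and processed in parallel
```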
epifanio
  • 1,228
  • 1
  • 16
  • 26
4
votes
2 answers

How to program a stencil with Dask

On many occasions, scientists simulate a system's dynamics using a stencil, that is, by convolving a mathematical operator over a grid. Commonly, this operation consumes a lot of computational resources. Here is a good explanation of the idea. In…
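A minimal stencil sketch using dask.array's map_overlap, which exchanges ghost cells between neighbouring chunks so the operator can be applied block-wise; the 5-point Laplacian and the grid size are made up.

```python
import numpy as np
import dask.array as da

def laplacian(block):
    # simple 5-point stencil applied to one numpy block
    out = np.zeros_like(block)
    out[1:-1, 1:-1] = (
        block[:-2, 1:-1] + block[2:, 1:-1]
        + block[1:-1, :-2] + block[1:-1, 2:]
        - 4 * block[1:-1, 1:-1]
    )
    return out

grid = da.random.random((4096, 4096), chunks=(1024, 1024))

# depth=1 shares one row/column of ghost cells between neighbouring chunks
result = grid.map_overlap(laplacian, depth=1, boundary="reflect")
result.compute()
```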
4
votes
1 answer

Resetting dask dataframe index to allow join

Given that http://dask.readthedocs.io/en/latest/dataframe-api.html#dask.dataframe.DataFrame.reset_index says dask doesn't support drop=True for reset_index(), how do I join 2 dataframes that have different indexes (as viewed by head())?
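One hedged workaround with toy frames: since drop=True is not supported, keep the column that reset_index() creates, use it as an explicit join key, and drop it afterwards.

```python
import pandas as pd
import dask.dataframe as dd

left = dd.from_pandas(pd.DataFrame({"a": range(6)}), npartitions=2)
right = dd.from_pandas(pd.DataFrame({"b": range(6)}), npartitions=3)

# materialise each index as a column named "row" and join on it
left2 = left.reset_index().rename(columns={"index": "row"})
right2 = right.reset_index().rename(columns={"index": "row"})

joined = left2.merge(right2, on="row").drop("row", axis=1)
print(joined.compute())
```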
mobcdi
  • 1,532
  • 2
  • 28
  • 49
4
votes
1 answer

How to use a Future with the map method of the Executor from dask.distributed (Python library)?

I am running a dask.distributed cluster. My task includes chained computations, where the last step is parallel processing of a list, created in previous steps, using the Executor.map method. The length of the list is not known in advance, because it…
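Executor is the older name for what is now Client. A minimal sketch of the pattern, assuming a local cluster and made-up functions: the list-producing step is submitted as a future, its result is gathered, and map() then fans out over however many items came back.

```python
from dask.distributed import Client

def make_items(n):
    return list(range(n))

def square(x):
    return x * x

if __name__ == "__main__":
    client = Client()                      # or Client("scheduler-address:8786")

    # first stage: a future whose result is a list of unknown length
    items_future = client.submit(make_items, 5)

    # pull the list back to the client, then fan out over it with map()
    items = items_future.result()
    futures = client.map(square, items)
    print(client.gather(futures))          # [0, 1, 4, 9, 16]
    client.close()
```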
wl2776
  • 4,099
  • 4
  • 35
  • 77
4
votes
1 answer

Difference in processing time between map_blocks and map_overlap: is it due to dask.array to np.array conversion?

Introduction: I have an image stack (ImgStack) made of 42 planes, each 2048x2048 px, and a function that I use for the analysis: def All(ImgStack): (some filtering, more filtering). I determined that the most efficient way to process the array…
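A rough way to compare the two calls on a scaled-down stand-in for the stack; the filter is a placeholder, and map_overlap is expected to cost more because of the ghost-cell exchange, independent of any dask-to-numpy conversion.

```python
import time
import dask.array as da

# scaled-down stand-in for the 42 x 2048 x 2048 image stack
stack = da.random.random((8, 512, 512), chunks=(8, 128, 128))

def filt(block):
    # placeholder for the real filtering pipeline
    return block - block.mean()

for name, arr in [
    ("map_blocks", stack.map_blocks(filt)),
    ("map_overlap", stack.map_overlap(filt, depth=(0, 4, 4), boundary="reflect")),
]:
    t0 = time.time()
    arr.compute()
    print(name, round(time.time() - t0, 3), "s")
```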
s1mc0d3
  • 523
  • 2
  • 15
4
votes
2 answers

Why is a dot product in dask slower than in numpy?

a dot product in dask seems to run much slower than in numpy: import numpy as np x_np = np.random.normal(10, 0.1, size=(1000,100)) y_np = x_np.transpose() %timeit x_np.dot(y_np) # 100 loops, best of 3: 7.17 ms per loop import dask.array as…
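A sketch of completing the comparison: for an array this small the dask version builds and schedules a task graph, so that overhead is expected to dominate the actual multiplication (times will vary by machine).

```python
import numpy as np
import dask.array as da

x_np = np.random.normal(10, 0.1, size=(1000, 100))
y_np = x_np.transpose()

x_da = da.from_array(x_np, chunks=(1000, 100))   # a single chunk
y_da = da.from_array(y_np, chunks=(100, 1000))

# same numbers, but compute() pays graph-construction and scheduling overhead
result = x_da.dot(y_da).compute()
np.testing.assert_allclose(result, x_np.dot(y_np))
```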
istern
  • 363
  • 1
  • 4
  • 13
4
votes
2 answers

Why does the dask.dataframe compute() result give an IndexError in specific cases? How to find the reason for an async error?

When using the current version of dask ('0.7.5', github: [a1]), due to the large size of the data, I was able to perform partitioned calculations by means of the dask.dataframe API. But for a large DataFrame that was stored as a record in bcolz ('0.12.1', github:…
RA Prism
  • 59
  • 6