Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
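To make the two components above concrete, here is a minimal sketch with a toy dataframe standing in for real data: a dask collection builds a task graph lazily, and the dynamic task scheduler executes it when compute() is called.

```python
import pandas as pd
import dask.dataframe as dd

# a "big data" collection: a dask dataframe wrapping a small pandas frame
pdf = pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=2)

# operations only build a task graph...
means = ddf.groupby("group")["value"].mean()

# ...which the dynamic task scheduler executes (threads by default)
print(means.compute())
```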

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub (tutorial): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
4
votes
1 answer

How to repartition a dataframe into fixed-size partitions?

I have a dask dataframe created from delayed functions, which consists of randomly sized partitions. I would like to repartition the dataframe into chunks of size (approx) 10000. I can calculate the correct number of partitions with…
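One hedged sketch of the approach the question is after, assuming a row-count target of 10,000 and made-up data: len() triggers a count over the partitions, and repartition() balances along existing divisions, so the resulting sizes are only approximate.

```python
import math
import pandas as pd
import dask.dataframe as dd

# stand-in for the dataframe built from delayed functions
ddf = dd.from_pandas(pd.DataFrame({"x": range(100_000)}), npartitions=20)

target_rows = 10_000
nrows = len(ddf)                                 # runs a count over all partitions
npartitions = max(1, math.ceil(nrows / target_rows))

# coalesce into roughly equal pieces; exact sizes depend on the divisions
ddf = ddf.repartition(npartitions=npartitions)
print(ddf.npartitions)
```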
Dave Hirschfeld
  • 768
  • 2
  • 6
  • 15
4
votes
1 answer

How can I sort values within a Dask dataframe group?

I have this code, which generates autoregressive terms within each unique combination of the variables 'grouping A' and 'grouping B': for i in range(1, 5): df.loc[:, 'var_' + str(i)] = df.sort_values(by='date') \ …
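A minimal sketch of sorting within each group using groupby().apply(); the column names mirror the question, the data is made up, and dask infers the output metadata (with a warning) because no meta= is passed. Shifting the sorted values inside the applied function would then give the per-group autoregressive terms.

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    "grouping A": ["x", "x", "y", "y"],
    "grouping B": [1, 1, 2, 2],
    "date": pd.to_datetime(["2020-01-02", "2020-01-01", "2020-01-04", "2020-01-03"]),
    "value": [1.0, 2.0, 3.0, 4.0],
})
ddf = dd.from_pandas(pdf, npartitions=2)

def sort_group(g):
    # each group arrives as a plain pandas DataFrame, so pandas sorting applies
    return g.sort_values("date")

result = ddf.groupby(["grouping A", "grouping B"]).apply(sort_group).compute()
print(result)
```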
user1566200
  • 1,826
  • 4
  • 27
  • 47
4
votes
1 answer

After using Dask pivot_table I lose the index column

I am losing the index column after I use pivot_table on a Dask DataFrame and save the data to a Parquet file. import dask.dataframe as dd import pandas as…
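A hedged sketch of one workaround, with made-up data: dask's pivot_table needs the columns axis to be categorical with known categories, and calling reset_index() turns the pivot index back into an ordinary column so it survives the write (a Parquet engine such as pyarrow or fastparquet is assumed to be installed).

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"id": [1, 1, 2, 2],
                    "key": ["a", "b", "a", "b"],
                    "val": [1.0, 2.0, 3.0, 4.0]})
ddf = dd.from_pandas(pdf, npartitions=1)

# the columns argument must be categorical with known categories
ddf = ddf.categorize(columns=["key"])
pivoted = ddf.pivot_table(index="id", columns="key", values="val")

# turn the pivot index back into an ordinary column before writing
pivoted = pivoted.reset_index()
pivoted.to_parquet("pivoted.parquet")
```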
keiv.fly
  • 3,343
  • 4
  • 26
  • 45
4
votes
1 answer

Dask get_dummies Does Not Transform Variable(s)

I'm trying to use get_dummies via dask but it does not transform my variable, nor does it error out: >>> import dask.dataframe as dd >>> import pandas as pd >>> df_d = dd.read_csv('/datasets/dask_example/dask_get_dummies_example.csv') >>>…
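A likely explanation, sketched with made-up data: dd.get_dummies only expands categorical columns with known categories, and columns read from CSV arrive as object dtype, so they pass through unchanged; calling categorize() first makes the encoding happen.

```python
import pandas as pd
import dask.dataframe as dd

# stand-in for the CSV from the question
pdf = pd.DataFrame({"colour": ["red", "blue", "red", "green"]})
ddf = dd.from_pandas(pdf, npartitions=2)

# convert the object column to a known categorical, then encode it
ddf = ddf.categorize(columns=["colour"])
dummies = dd.get_dummies(ddf)
print(dummies.compute())
```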
Frank B.
  • 1,813
  • 5
  • 24
  • 44
4
votes
1 answer

Calling dask inside a dask-spawned process

We have a large project that comprises numerous tasks. We use a dask graph to schedule each task. A small sample of the graph is as follows. Note that dask is set to multiprocessing mode. dask_graph: universe: !!python/tuple…
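A minimal sketch of one task calling dask from inside a process spawned by the multiprocessing scheduler; it uses dask.delayed and the modern scheduler= keyword rather than the YAML graph from the question, and the function names are made up.

```python
from dask import delayed

def double(v):
    return v * 2

def inner(x):
    # this task itself builds and computes a small dask graph; the nested
    # compute() runs inside the worker process on a local scheduler
    return delayed(double)(x).compute(scheduler="synchronous")

if __name__ == "__main__":
    # the outer graph runs on the multiprocessing ("processes") scheduler
    result = delayed(inner)(10).compute(scheduler="processes")
    print(result)  # 20
```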
4
votes
1 answer

dask and parallel hdf5 writing

In my code I save multiple processed images (numpy arrays) in parallel to an hdf5 file using MPI (mpi4py/h5py). In order to do that, the file needs to be opened using the driver=mpio option. import h5py from mpi4py import…
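A minimal sketch of collective parallel writing with h5py's mpio driver; it assumes h5py built against parallel HDF5 and is meant to be launched with mpiexec, and the file and dataset names are made up.

```python
# run with, e.g.:  mpiexec -n 4 python write_parallel.py
import numpy as np
import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# the file must be opened collectively by all ranks with the mpio driver
with h5py.File("images.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("images", shape=(comm.Get_size(), 64, 64), dtype="f8")
    # each rank writes its own processed image (random data as a stand-in)
    dset[rank] = np.random.random((64, 64))
```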
s1mc0d3
  • 523
  • 2
  • 15
4
votes
1 answer

Pickle error when connecting to dask.distributed cluster

This is my simple code; I am trying to run my first program. from dask.distributed import Client client = Client('192.168.1.102:8786') def inc(x): return x + 1 x = client.submit(inc, 10) print(x.result()) When trying to run this code by using this…
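A common cause of pickle errors here is client code running at import time in a spawned process; another is mismatched dask/distributed versions between client, scheduler, and workers. A minimal sketch of the script with a __main__ guard (the scheduler address is the one from the question):

```python
from dask.distributed import Client

def inc(x):
    return x + 1

if __name__ == "__main__":
    # the guard lets this module be imported (and functions unpickled)
    # without re-running the connection logic
    client = Client("192.168.1.102:8786")
    fut = client.submit(inc, 10)
    print(fut.result())  # 11
```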
Sudip Das
  • 1,178
  • 1
  • 9
  • 24
4
votes
0 answers

Dask Dataframe Load by Index

I have a pandas dataframe with metadata on a bunch of text documents: meta_df = pd.read_csv( "./mdenny_copy_early2015/Metadata/Metadata/Bill_Metadata_1993-2014.csv", low_memory=False, parse_dates=['time'], …
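A hedged sketch of loading rows by index: with a sorted index and known divisions, .loc only touches the partitions covering the requested range. The file name and column are taken from the excerpt, and the date range is made up.

```python
import pandas as pd
import dask.dataframe as dd

# hypothetical metadata file and column, following the question
meta_df = pd.read_csv("Bill_Metadata_1993-2014.csv",
                      low_memory=False, parse_dates=["time"])

# a sorted index gives known divisions, which .loc can use to prune partitions
ddf = dd.from_pandas(meta_df.set_index("time").sort_index(), npartitions=8)

subset = ddf.loc["2000-01-01":"2000-12-31"].compute()
```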
saul.shanabrook
  • 3,068
  • 3
  • 31
  • 49
4
votes
2 answers

Read, process and concatenate pandas dataframes in parallel with dask

I'm trying to read and process a list of csv files in parallel and concatenate the output into a single pandas dataframe for further processing. My workflow consists of 3 steps: create a series of pandas dataframes by reading a list of csv files (all…
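A minimal sketch of the usual pattern: wrap the per-file read/process step in dask.delayed, stitch the pieces together with dd.from_delayed, and compute once at the end. Small CSVs are generated first so the sketch is self-contained; the processing step is a placeholder.

```python
import pandas as pd
import dask.dataframe as dd
from dask import delayed

# create a few small csv files so the example can run on its own
paths = []
for i in range(3):
    path = f"part_{i}.csv"
    pd.DataFrame({"x": range(5), "file": i}).to_csv(path, index=False)
    paths.append(path)

@delayed
def load_and_clean(path):
    df = pd.read_csv(path)
    df["x2"] = df["x"] ** 2          # per-file processing step
    return df

# lazily stitch the delayed frames into one dask dataframe, then materialise
ddf = dd.from_delayed([load_and_clean(p) for p in paths])
result = ddf.compute()                # a single pandas dataframe
print(len(result))                    # 15 rows, read and processed in parallel
```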
epifanio
  • 1,228
  • 1
  • 16
  • 26
4
votes
2 answers

How to program a stencil with Dask

On many occasions, scientists simulate a system's dynamics using a stencil, that is, by convolving a mathematical operator over a grid. Commonly, this operation consumes a lot of computational resources. Here is a good explanation of the idea. In…
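A minimal stencil sketch using dask.array's map_overlap, which exchanges ghost cells between neighbouring chunks so the operator can be applied block-wise; the 5-point Laplacian and the grid size are made up.

```python
import numpy as np
import dask.array as da

def laplacian(block):
    # simple 5-point stencil applied to one numpy block
    out = np.zeros_like(block)
    out[1:-1, 1:-1] = (
        block[:-2, 1:-1] + block[2:, 1:-1]
        + block[1:-1, :-2] + block[1:-1, 2:]
        - 4 * block[1:-1, 1:-1]
    )
    return out

grid = da.random.random((4096, 4096), chunks=(1024, 1024))

# depth=1 shares one row/column of ghost cells between neighbouring chunks
result = grid.map_overlap(laplacian, depth=1, boundary="reflect")
result.compute()
```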
4
votes
1 answer

Resetting dask dataframe index to allow join

Given that http://dask.readthedocs.io/en/latest/dataframe-api.html#dask.dataframe.DataFrame.reset_index says dask doesn't support drop=True for reset_index(), how do I join 2 dataframes that have different indexes (as viewed by head())?
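One hedged workaround with toy frames: since drop=True is not supported, keep the column that reset_index() creates, use it as an explicit join key, and drop it afterwards.

```python
import pandas as pd
import dask.dataframe as dd

left = dd.from_pandas(pd.DataFrame({"a": range(6)}), npartitions=2)
right = dd.from_pandas(pd.DataFrame({"b": range(6)}), npartitions=3)

# materialise each index as a column named "row" and join on it
left2 = left.reset_index().rename(columns={"index": "row"})
right2 = right.reset_index().rename(columns={"index": "row"})

joined = left2.merge(right2, on="row").drop("row", axis=1)
print(joined.compute())
```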
mobcdi
  • 1,532
  • 2
  • 28
  • 49
4
votes
1 answer

How to use a Future with the map method of the Executor from dask.distributed (Python library)?

I am running a dask.distributed cluster. My task includes chained computations, where the last step is parallel processing of a list, created in previous steps, using the Executor.map method. The length of the list is not known in advance, because it…
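Executor is the older name for what is now Client. A minimal sketch of the pattern, assuming a local cluster and made-up functions: the list-producing step is submitted as a future, its result is gathered, and map() then fans out over however many items came back.

```python
from dask.distributed import Client

def make_items(n):
    return list(range(n))

def square(x):
    return x * x

if __name__ == "__main__":
    client = Client()                      # or Client("scheduler-address:8786")

    # first stage: a future whose result is a list of unknown length
    items_future = client.submit(make_items, 5)

    # pull the list back to the client, then fan out over it with map()
    items = items_future.result()
    futures = client.map(square, items)
    print(client.gather(futures))          # [0, 1, 4, 9, 16]
    client.close()
```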
wl2776
  • 4,099
  • 4
  • 35
  • 77
4
votes
1 answer

Difference in processing time between map_blocks and map_overlap: is it due to dask.array to np.array conversion?

Introduction: I have an image stack (ImgStack) made of 42 planes, each 2048x2048 px, and a function that I use for the analysis: def All(ImgStack): (some filtering, more filtering). I determined that the most efficient way to process the array…
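A rough way to compare the two calls on a scaled-down stand-in for the stack; the filter is a placeholder, and map_overlap is expected to cost more because of the ghost-cell exchange, independent of any dask-to-numpy conversion.

```python
import time
import dask.array as da

# scaled-down stand-in for the 42 x 2048 x 2048 image stack
stack = da.random.random((8, 512, 512), chunks=(8, 128, 128))

def filt(block):
    # placeholder for the real filtering pipeline
    return block - block.mean()

for name, arr in [
    ("map_blocks", stack.map_blocks(filt)),
    ("map_overlap", stack.map_overlap(filt, depth=(0, 4, 4), boundary="reflect")),
]:
    t0 = time.time()
    arr.compute()
    print(name, round(time.time() - t0, 3), "s")
```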
s1mc0d3
  • 523
  • 2
  • 15
4
votes
2 answers

Why is a dot product in dask slower than in numpy?

a dot product in dask seems to run much slower than in numpy: import numpy as np x_np = np.random.normal(10, 0.1, size=(1000,100)) y_np = x_np.transpose() %timeit x_np.dot(y_np) # 100 loops, best of 3: 7.17 ms per loop import dask.array as…
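A sketch of completing the comparison: for an array this small the dask version builds and schedules a task graph, so that overhead is expected to dominate the actual multiplication (times will vary by machine).

```python
import numpy as np
import dask.array as da

x_np = np.random.normal(10, 0.1, size=(1000, 100))
y_np = x_np.transpose()

x_da = da.from_array(x_np, chunks=(1000, 100))   # a single chunk
y_da = da.from_array(y_np, chunks=(100, 1000))

# same numbers, but compute() pays graph-construction and scheduling overhead
result = x_da.dot(y_da).compute()
np.testing.assert_allclose(result, x_np.dot(y_np))
```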
istern
  • 363
  • 1
  • 4
  • 13
4
votes
2 answers

Why does the dask.dataframe compute() result give an IndexError in specific cases? How to find the reason for an async error?

When using the current version of dask ('0.7.5', github: [a1]), due to the large size of the data, I was able to perform partitioned calculations by means of the dask.dataframe API. But for a large DataFrame that was stored as a record in bcolz ('0.12.1', github:…
RA Prism
  • 59
  • 6