Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
5 votes · 3 answers

dask: shared memory in parallel model

I've read the dask documentation, blogs and SO, but I'm still not 100% clear on how to do it. My use case: I have about 10GB of reference data. Once loaded, they are read-only. Usually we load them into Dask/Pandas dataframes. I need these…
Juergen
  • 699
  • 7
  • 20
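One common pattern for large read-only reference data is to wrap it in `dask.delayed`, so every task refers to the same object instead of having a copy baked into each task definition (a sketch with a tiny stand-in table; `lookup` is a hypothetical task):

```python
import dask
import pandas as pd

# Stand-in for a ~10GB read-only reference table.
reference = pd.DataFrame({"key": [1, 2, 3], "value": ["a", "b", "c"]})

# Wrapping the object in delayed makes it a single node in the graph.
ref = dask.delayed(reference)

@dask.delayed
def lookup(ref_df, key):
    return ref_df.loc[ref_df.key == key, "value"].iloc[0]

# With the threaded scheduler, tasks share one copy in process memory.
results = dask.compute(lookup(ref, 1), lookup(ref, 3), scheduler="threads")
```

With the distributed scheduler the analogous move is `client.scatter` so the data is serialized once rather than per task.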
5 votes · 1 answer

Is there an advantage to pre-scattering data objects in Dask?

If I pre-scatter a data object across worker nodes, does it get copied in its entirety to each of the worker nodes? Is there an advantage in doing so if that data object is big? Using the futures interface as an example: client.scatter(data,…
ericmjl
  • 13,541
  • 12
  • 51
  • 80
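A runnable sketch of the scatter call in question, using an in-process client so it works on one machine (the worker counts and data here are placeholders):

```python
from dask.distributed import Client

# In-process client so this sketch runs on a single machine.
client = Client(processes=False, n_workers=1, threads_per_worker=1)

data = list(range(1_000))

# broadcast=True copies the object in its entirety to every worker up
# front; without it the data lives on one worker and moves on demand.
[future] = client.scatter([data], broadcast=True)

result = client.submit(sum, future).result()  # runs where the data is
client.close()
```

The advantage of pre-scattering a big object is paying the transfer cost once instead of on every task that needs it.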
5 votes · 1 answer

Dask compute is very slow

I have a dataframe that consists of 5 million records. I am trying to process it using the code below, leveraging dask dataframes in python: import dask.dataframe as dd dask_df = dd.read_csv(fullPath) …
Neno M.
  • 123
  • 1
  • 6
5 votes · 1 answer

Drop rows in dask dataFrame on condition

I'm trying to drop some rows in my dask dataframe with: df.drop(df[(df.A <= 3) | (df.A > 1000)].index) But this doesn't work and returns NotImplementedError: Drop currently only works for axis=1. I really need help
Mdhvince
  • 103
  • 1
  • 7
5 votes · 2 answers

How to use 'loc' for column selection of a dataframe in dask

Can anyone tell me how I should select one column with 'loc' in a dataframe using dask? As a side note, when I am loading the dataframe using dd.read_csv with header equal to None, the column names run from zero to 131094. I am about to…
user8034918
  • 441
  • 1
  • 9
  • 20
5 votes · 1 answer

What's the difference between dask=parallelized and dask=allowed in xarray's apply_ufunc?

In the xarray documentation for the function apply_ufunc it says: dask: ‘forbidden’, ‘allowed’ or ‘parallelized’, optional How to handle applying to objects containing lazy data in the form of dask arrays: ‘forbidden’ (default): raise an…
ThomasNicholas
  • 1,273
  • 11
  • 21
5 votes · 0 answers

Time series decimation benchmark: Dask vs Vaex

I currently use Vaex to generate binned data for histograms and to decimate big time-series data. Essentially I reduce millions of time series points into a number of bins and compute the mean & max & min for each bin. I would like to compare Vaex…
DougR
  • 3,196
  • 1
  • 28
  • 29
5 votes · 1 answer

custom dask graphs with functions that need dask computed keyword arguments

How can one construct a custom dask graph using a function that requires keyword arguments that are the result of another dask task? The dask documentation and several stackoverflow questions suggest using partial, toolz, or…
Will Holmgren
  • 696
  • 5
  • 12
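In a hand-written graph, keyword arguments are usually spelled with dask's `apply` helper plus a `(dict, [[key, value], ...])` task, where the value may itself be a graph key (a sketch; `scale` is a made-up function, and this assumes the classic tuple-based graph format):

```python
from dask.threaded import get
from dask.utils import apply

def scale(x, factor=1):
    return x * factor

# "f" below is a graph key, so the kwarg is computed by dask first.
dsk = {
    "x": 10,
    "f": 3,
    "y": (apply, scale, ["x"], (dict, [["factor", "f"]])),
}
result = get(dsk, "y")   # calls scale(10, factor=3)
```

The `(dict, [...])` tuple is itself a task, which is what lets task results flow into keyword positions without `functools.partial`.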
5 votes · 1 answer

Can I create a dask array with a delayed shape

Is it possible to create a dask array from a delayed value by specifying its shape with another delayed value? My algorithm won't give me the shape of the array until pretty late in the computation. Eventually, I will be creating some blocks with…
hmaarrfk
  • 417
  • 5
  • 10
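One workable pattern is to declare the unknown dimension as NaN, which dask treats as "unknown chunks" (a sketch; the delayed block here is a stand-in for a shape-producing computation):

```python
import numpy as np
import dask
import dask.array as da

@dask.delayed
def make_block():
    # Pretend the size is only known once this actually runs.
    return np.arange(7)

# NaN marks the dimension as unknown; compute still works, though
# shape-dependent operations are restricted until materialization.
x = da.from_delayed(make_block(), shape=(np.nan,), dtype=np.int64)
result = x.compute()
```

Operations that need concrete chunk sizes (e.g. slicing by position) will raise until the shape is known, but reductions and elementwise work go through.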
5 votes · 1 answer

memory usage when indexing a large dask dataframe on a single multicore machine

I am trying to turn the Wikipedia CirrusSearch dump into a Parquet-backed dask dataframe indexed by title on a 450G 16-core GCP instance. CirrusSearch dumps come as a single json line formatted file. The English Wikipedia dumps contain 5M records and…
Daniel Mahler
  • 7,653
  • 5
  • 51
  • 90
5 votes · 2 answers

How to replicate data when it is faster to compute than transfer in dask distributed?

I have a largish object (150 MB) that I need to broadcast to all dask distributed workers so it can be used in future tasks. I've tried a couple of approaches: Client.scatter(broadcast=True): This required sending all the data from one machine…
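Besides `scatter(broadcast=True)`, the distributed client has `replicate`, which copies already-scattered data to more workers over worker-to-worker links; a runnable sketch with an in-process client (the data is a tiny placeholder for the 150 MB object):

```python
from dask.distributed import Client

client = Client(processes=False, n_workers=1, threads_per_worker=1)

# Scatter once, then fan out copies between workers.
big = list(range(1_000))
[future] = client.scatter([big])
client.replicate([future])   # a no-op here with a single worker

result = client.submit(sum, future).result()
client.close()
```

When worker-to-worker bandwidth beats the client's uplink, replicate avoids re-sending the whole object from one machine.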
5 votes · 2 answers

using dask for scraping via requests

I like the simplicity of dask and would love to use it for scraping a local supermarket. My multiprocessing.cpu_count() is 4, but this code only achieves a 2x speedup. Why? from bs4 import BeautifulSoup import dask, requests, time import pandas as…
Sergio Lucero
  • 862
  • 1
  • 12
  • 21
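For I/O-bound scraping, the relevant knob is the scheduler: threads spend their time waiting on the network, so the thread count (not the CPU count) sets the parallelism. A self-contained sketch; `fetch` is a placeholder where a real version would call `requests.get(url).text`:

```python
import dask

# Placeholder fetch: no network access, just simulated page content.
def fetch(url):
    return f"<html>{url}</html>"

urls = [f"https://example.com/page/{i}" for i in range(8)]

# Threaded scheduler with an explicit worker count for I/O-bound work.
tasks = [dask.delayed(fetch)(u) for u in urls]
pages = dask.compute(*tasks, scheduler="threads", num_workers=8)
```

A 2x rather than 4x speedup with the default settings often just means the pool size or scheduler wasn't matched to the workload.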
5 votes · 1 answer

How do I time out a job submitted to Dask?

I am using Dask to run a pool of tasks, retrieving results in the order they complete by the as_completed method, and potentially submitting new tasks to the pool each time one returns: # Initial set of jobs futures =…
emitra17
  • 51
  • 2
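With the distributed futures interface, `Future.result` accepts a timeout in seconds; a sketch using an in-process client (the sleep and timeout values are placeholders):

```python
import time
from dask.distributed import Client

client = Client(processes=False, n_workers=1, threads_per_worker=1)

def slow():
    time.sleep(3)
    return "done"

future = client.submit(slow)
try:
    # Give up on this job after half a second.
    value = future.result(timeout=0.5)
except Exception:          # distributed raises a TimeoutError subclass
    value = "timed out"
    client.cancel(future)
client.close()
```

Note that cancelling cannot interrupt a task already running in a worker thread; it only stops it from being rescheduled.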
5 votes · 0 answers

Dask groupby with multiple columns issue

I have the following dataframe created by using the dataframe.from_delayed method that has the following columns: _id, hour_timestamp, http_method, total_hits, username, hour, weekday. Some details on the source…
Apostolos
  • 7,763
  • 17
  • 80
  • 150
5 votes · 1 answer

Asymmetric slicing python

Consider the following matrix: X = np.arange(9).reshape(3,3) array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]) Let's say I want to subset the following array array([[0, 4, 2], [3, 7, 5]]) It is possible with some…
jmamath
  • 190
  • 2
  • 13
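This particular subset falls out of NumPy's integer ("fancy") indexing, pairing a row-index array with a column-index array elementwise (a sketch reproducing the arrays from the question):

```python
import numpy as np

X = np.arange(9).reshape(3, 3)

# Row and column indices are paired elementwise: out[i, j] = X[rows[i, j], cols[i, j]]
rows = np.array([[0, 1, 0],
                 [1, 2, 1]])
cols = np.array([[0, 1, 2],
                 [0, 1, 2]])
out = X[rows, cols]
# array([[0, 4, 2],
#        [3, 7, 5]])
```

Since `cols` repeats `[0, 1, 2]` in every row, broadcasting allows the shorter `X[rows, [0, 1, 2]]` as well.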