Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
5 votes · 3 answers

dask: shared memory in parallel model

I've read the dask documentation, blogs and SO, but I'm still not 100% clear on how to do it. My use case: I have about 10GB of reference data. Once loaded, they are read-only. Usually we load them into Dask/Pandas dataframes. I need these…
Juergen
  • 699
  • 7
  • 20
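One common pattern for large read-only reference data is to wrap it in `dask.delayed`, so every task refers to the same object instead of having a copy baked into each task definition (a sketch with a tiny stand-in table; `lookup` is a hypothetical task):

```python
import dask
import pandas as pd

# Stand-in for a ~10GB read-only reference table.
reference = pd.DataFrame({"key": [1, 2, 3], "value": ["a", "b", "c"]})

# Wrapping the object in delayed makes it a single node in the graph.
ref = dask.delayed(reference)

@dask.delayed
def lookup(ref_df, key):
    return ref_df.loc[ref_df.key == key, "value"].iloc[0]

# With the threaded scheduler, tasks share one copy in process memory.
results = dask.compute(lookup(ref, 1), lookup(ref, 3), scheduler="threads")
```

With the distributed scheduler the analogous move is `client.scatter` so the data is serialized once rather than per task.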
5 votes · 1 answer

Is there an advantage to pre-scattering data objects in Dask?

If I pre-scatter a data object across worker nodes, does it get copied in its entirety to each of the worker nodes? Is there an advantage in doing so if that data object is big? Using the futures interface as an example: client.scatter(data,…
ericmjl
  • 13,541
  • 12
  • 51
  • 80
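A runnable sketch of the scatter call in question, using an in-process client so it works on one machine (the worker counts and data here are placeholders):

```python
from dask.distributed import Client

# In-process client so this sketch runs on a single machine.
client = Client(processes=False, n_workers=1, threads_per_worker=1)

data = list(range(1_000))

# broadcast=True copies the object in its entirety to every worker up
# front; without it the data lives on one worker and moves on demand.
[future] = client.scatter([data], broadcast=True)

result = client.submit(sum, future).result()  # runs where the data is
client.close()
```

The advantage of pre-scattering a big object is paying the transfer cost once instead of on every task that needs it.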
5 votes · 1 answer

Dask compute is very slow

I have a dataframe that consists of 5 million records. I am trying to process it using the code below, leveraging dask dataframes in python: import dask.dataframe as dd dask_df = dd.read_csv(fullPath) …
Neno M.
  • 123
  • 1
  • 6
5 votes · 1 answer

Drop rows in dask dataFrame on condition

I'm trying to drop some rows in my dask dataframe with: df.drop(df[(df.A <= 3) | (df.A > 1000)].index) But this doesn't work and returns NotImplementedError: Drop currently only works for axis=1. I really need help
Mdhvince
  • 103
  • 1
  • 7
5 votes · 2 answers

How to use 'loc' for column selection of a dataframe in dask

Can anyone tell me how I should select one column with 'loc' in a dataframe using dask? As a side note, when I am loading the dataframe using dd.read_csv with header equal to None, the column names run from zero to 131094. I am about to…
user8034918
  • 441
  • 1
  • 9
  • 20
5 votes · 1 answer

What's the difference between dask=parallelized and dask=allowed in xarray's apply_ufunc?

In the xarray documentation for the function apply_ufunc it says: dask: ‘forbidden’, ‘allowed’ or ‘parallelized’, optional How to handle applying to objects containing lazy data in the form of dask arrays: ‘forbidden’ (default): raise an…
ThomasNicholas
  • 1,273
  • 11
  • 21
5 votes · 0 answers

Time series decimation benchmark: Dask vs Vaex

I currently use Vaex to generate binned data for histograms and to decimate big time-series data. Essentially I reduce millions of time series points into a number of bins and compute the mean & max & min for each bin. I would like to compare Vaex…
DougR
  • 3,196
  • 1
  • 28
  • 29
5 votes · 1 answer

custom dask graphs with functions that need dask computed keyword arguments

How can one construct a custom dask graph using a function that requires keyword arguments that are the result of another dask task? The dask documentation and several stackoverflow questions suggest using partial, toolz, or…
Will Holmgren
  • 696
  • 5
  • 12
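In a hand-written graph, keyword arguments are usually spelled with dask's `apply` helper plus a `(dict, [[key, value], ...])` task, where the value may itself be a graph key (a sketch; `scale` is a made-up function, and this assumes the classic tuple-based graph format):

```python
from dask.threaded import get
from dask.utils import apply

def scale(x, factor=1):
    return x * factor

# "f" below is a graph key, so the kwarg is computed by dask first.
dsk = {
    "x": 10,
    "f": 3,
    "y": (apply, scale, ["x"], (dict, [["factor", "f"]])),
}
result = get(dsk, "y")   # calls scale(10, factor=3)
```

The `(dict, [...])` tuple is itself a task, which is what lets task results flow into keyword positions without `functools.partial`.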
5 votes · 1 answer

Can I create a dask array with a delayed shape

Is it possible to create a dask array from a delayed value by specifying its shape with another delayed value? My algorithm won't give me the shape of the array until pretty late in the computation. Eventually, I will be creating some blocks with…
hmaarrfk
  • 417
  • 5
  • 10
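One workable pattern is to declare the unknown dimension as NaN, which dask treats as "unknown chunks" (a sketch; the delayed block here is a stand-in for a shape-producing computation):

```python
import numpy as np
import dask
import dask.array as da

@dask.delayed
def make_block():
    # Pretend the size is only known once this actually runs.
    return np.arange(7)

# NaN marks the dimension as unknown; compute still works, though
# shape-dependent operations are restricted until materialization.
x = da.from_delayed(make_block(), shape=(np.nan,), dtype=np.int64)
result = x.compute()
```

Operations that need concrete chunk sizes (e.g. slicing by position) will raise until the shape is known, but reductions and elementwise work go through.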
5 votes · 1 answer

memory usage when indexing a large dask dataframe on a single multicore machine

I am trying to turn the Wikipedia CirrusSearch dump into a Parquet-backed dask dataframe indexed by title on a 450G 16-core GCP instance. CirrusSearch dumps come as a single json line formatted file. The English Wikipedia dumps contain 5M records and…
Daniel Mahler
  • 7,653
  • 5
  • 51
  • 90
5 votes · 2 answers

How to replicate data when it is faster to compute than transfer in dask distributed?

I have a largish object (150 MB) that I need to broadcast to all dask distributed workers so it can be used in future tasks. I've tried a couple of approaches: Client.scatter(broadcast=True): This required sending all the data from one machine…
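Besides `scatter(broadcast=True)`, the distributed client has `replicate`, which copies already-scattered data to more workers over worker-to-worker links; a runnable sketch with an in-process client (the data is a tiny placeholder for the 150 MB object):

```python
from dask.distributed import Client

client = Client(processes=False, n_workers=1, threads_per_worker=1)

# Scatter once, then fan out copies between workers.
big = list(range(1_000))
[future] = client.scatter([big])
client.replicate([future])   # a no-op here with a single worker

result = client.submit(sum, future).result()
client.close()
```

When worker-to-worker bandwidth beats the client's uplink, replicate avoids re-sending the whole object from one machine.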
5 votes · 2 answers

using dask for scraping via requests

I like the simplicity of dask and would love to use it for scraping a local supermarket. My multiprocessing.cpu_count() is 4, but this code only achieves a 2x speedup. Why? from bs4 import BeautifulSoup import dask, requests, time import pandas as…
Sergio Lucero
  • 862
  • 1
  • 12
  • 21
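For I/O-bound scraping, the relevant knob is the scheduler: threads spend their time waiting on the network, so the thread count (not the CPU count) sets the parallelism. A self-contained sketch; `fetch` is a placeholder where a real version would call `requests.get(url).text`:

```python
import dask

# Placeholder fetch: no network access, just simulated page content.
def fetch(url):
    return f"<html>{url}</html>"

urls = [f"https://example.com/page/{i}" for i in range(8)]

# Threaded scheduler with an explicit worker count for I/O-bound work.
tasks = [dask.delayed(fetch)(u) for u in urls]
pages = dask.compute(*tasks, scheduler="threads", num_workers=8)
```

A 2x rather than 4x speedup with the default settings often just means the pool size or scheduler wasn't matched to the workload.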
5 votes · 1 answer

How do I time out a job submitted to Dask?

I am using Dask to run a pool of tasks, retrieving results in the order they complete by the as_completed method, and potentially submitting new tasks to the pool each time one returns: # Initial set of jobs futures =…
emitra17
  • 51
  • 2
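With the distributed futures interface, `Future.result` accepts a timeout in seconds; a sketch using an in-process client (the sleep and timeout values are placeholders):

```python
import time
from dask.distributed import Client

client = Client(processes=False, n_workers=1, threads_per_worker=1)

def slow():
    time.sleep(3)
    return "done"

future = client.submit(slow)
try:
    # Give up on this job after half a second.
    value = future.result(timeout=0.5)
except Exception:          # distributed raises a TimeoutError subclass
    value = "timed out"
    client.cancel(future)
client.close()
```

Note that cancelling cannot interrupt a task already running in a worker thread; it only stops it from being rescheduled.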
5 votes · 0 answers

Dask groupby with multiple columns issue

I have the following dataframe created by using the dataframe.from_delayed method that has the following columns: _id, hour_timestamp, http_method, total_hits, username, hour, weekday. Some details on the source…
Apostolos
  • 7,763
  • 17
  • 80
  • 150
5 votes · 1 answer

Asymmetric slicing python

Consider the following matrix: X = np.arange(9).reshape(3,3) array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]) Let's say I want to subset the following array array([[0, 4, 2], [3, 7, 5]]) It is possible with some…
jmamath
  • 190
  • 2
  • 13
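This particular subset falls out of NumPy's integer ("fancy") indexing, pairing a row-index array with a column-index array elementwise (a sketch reproducing the arrays from the question):

```python
import numpy as np

X = np.arange(9).reshape(3, 3)

# Row and column indices are paired elementwise: out[i, j] = X[rows[i, j], cols[i, j]]
rows = np.array([[0, 1, 0],
                 [1, 2, 1]])
cols = np.array([[0, 1, 2],
                 [0, 1, 2]])
out = X[rows, cols]
# array([[0, 4, 2],
#        [3, 7, 5]])
```

Since `cols` repeats `[0, 1, 2]` in every row, broadcasting allows the shorter `X[rows, [0, 1, 2]]` as well.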