Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
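The two components above can be sketched in a few lines. This is a minimal illustration, assuming Dask is installed; the helper functions `inc` and `add` and the sample data are hypothetical, and the threaded scheduler is chosen here only to keep the example lightweight.

```python
import dask.bag as db
from dask import delayed

# 1) Dynamic task scheduling: calls are recorded lazily as a task graph.
@delayed
def inc(x):  # hypothetical helper, for illustration only
    return x + 1

@delayed
def add(x, y):  # hypothetical helper, for illustration only
    return x + y

total = add(inc(1), inc(2))  # nothing has run yet; `total` is a lazy graph
total_value = total.compute(scheduler="threads")  # the scheduler executes the graph

# 2) Collections: a bag behaves like a Python iterable split into partitions.
numbers = db.from_sequence(range(10), npartitions=2)
even_squares = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)
even_squares_sum = even_squares.sum().compute(scheduler="threads")

print(total_value, even_squares_sum)
```

Both halves end in a `compute()` call: the collection in the second part builds a task graph of the same kind as the first, which is what "run on top of the dynamic task schedulers" refers to.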

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
2
votes
1 answer

Avoid recomputing same values in Dask?

In the following code, I would expect the first computation to take 3+ seconds and the second one to be much faster. What should I do to make dask avoid redoing a computation for the client? (I had previously searched for the answer to this question,…
julienl
  • 161
  • 12
2
votes
1 answer

Python Datashader to plot large 2D arrays of points

I am looking for some help/advice on using datashader to plot a large 2D data array as a series of points, colored by amplitude. The data I deal with is housed in several 2D HDF5 datasets, with a time index stored in a separate dataset. The…
George Crowther
  • 548
  • 5
  • 16
2
votes
1 answer

Dask: outer join read from multiple csv files

import dask.dataframe as dd import numpy as np import pandas as pd from dask import delayed df1 = pd.DataFrame({'a': np.arange(10), 'b': np.random.rand()}) df1 = df1.astype({'a':np.float64}) df2 = pd.DataFrame({'a': np.random.rand(5), 'c':…
Alexander Reshytko
  • 2,126
  • 1
  • 20
  • 28
2
votes
1 answer

dask.DataFrame.apply and variable length data

I would like to apply a function to a dask.DataFrame that returns a Series of variable length. An example to illustrate this: def generate_varibale_length_series(x): '''returns pd.Series with variable length''' n_columns =…
Arco Bast
  • 3,595
  • 2
  • 26
  • 53
2
votes
1 answer

What is the most efficient way to utilize dask multiprocessing scheduler if data flow between tasks is big?

We have a dask compute graph (quite custom, so we use dask delayed instead of collections). I've read in the docs that the current scheduling policy is LIFO, so a worker process has a good chance of getting the data it has just computed for further steps…
Alexander Reshytko
  • 2,126
  • 1
  • 20
  • 28
2
votes
1 answer

How does dask distribute work amongst the cluster?

Can dask distributed handle uneven worker nodes? For example, if there is a dask worker on a 4 core computer and a second dask worker on a 2 core computer, will all 6 cores be utilised? Also is it a strict requirement for dask to distribute the…
Greg
  • 8,175
  • 16
  • 72
  • 125
2
votes
1 answer

How to make dworkers for multiprocess?

I am working on distributed cluster computing. To implement such a system I am trying to use the Python library dask.distributed. But there is a problem: the dworkers are not multiprocess, meaning 2 or 3 dworkers work together but don't…
Saikat Kundu
  • 350
  • 1
  • 15
2
votes
2 answers

How to combine two pandas dataframes with a conditional?

I have two pandas dataframes which I would like to combine using a rule. This is the first dataframe import pandas as pd df1 = pd.DataFrame() df1 rank begin end labels first 30953 31131 label1 first 31293 31435 …
EB2127
  • 1,788
  • 3
  • 22
  • 43
2
votes
1 answer

convert a dask dataframe to a matrix or 2-d array

Is there a way we can convert a dask dataframe to a matrix or 2-d array? I know that dask does not yet support multi-indexing. I don't know how we can use dask delayed for this.
Alger Remirata
  • 529
  • 1
  • 5
  • 17
2
votes
1 answer

Dask dataframe has no attribute '_meta_nonempty' while merging large CSVs in Python

I tried Pandas with: import pandas as pd df1 = pd.read_csv("csv1.csv") df2 = pd.read_csv("csv2.csv") my_keys = ["my_id", "my_subid"] joined_df = pd.merge(df1, df2, on=my_keys) joined_df.to_csv('out_df.csv', index=False) And got a memory error after…
CommonSurname
  • 76
  • 1
  • 5
2
votes
1 answer

dask csv reading order

I have a time series whose values are stored in different CSV files. Each file is sorted and contains a variable seconds that is a time scan. df = dd.read_csv('/home/data/derived/ips_subnets.7days/*') df.head() seconds IP …
Donbeo
  • 17,067
  • 37
  • 114
  • 188
2
votes
0 answers

Using DataFrame with dask.multiprocessing not executing in parallel

Why doesn't dask use all of the available cores? I'm running this code import pandas as pd import numpy as np for year in range(2000, 2005): #i have change days idx = pd.date_range(str(year), str(year + 1), freq='d', closed='left') …
sami
  • 501
  • 2
  • 6
  • 18
2
votes
1 answer

MemoryError merging two dataframes with pandas and dask: how can I do this?

I have two dataframes in pandas. I would like to merge these two dataframes, but I keep running into memory errors. What is a workaround I could use? Here is the setup: import pandas as pd df1 = pd.read_csv("first1.csv") df2 =…
EB2127
  • 1,788
  • 3
  • 22
  • 43
2
votes
1 answer

dask.bag processing data out-of-memory

I'm trying to use dask bag to run a word count over 30GB of JSON files. I strictly followed the tutorial from the official site: http://dask.pydata.org/en/latest/examples/bag-word-count-hdfs.html But it still does not work; my single machine has 32GB of memory and 8 cores…
SharpLu
  • 1,136
  • 2
  • 12
  • 28
2
votes
1 answer

correct pattern for dask compute minimum?

Is this the correct way to call compute()? def call_minmax_duration(data): mmin = dd.DataFrame.min(data).compute() mmax = dd.DataFrame.max(data).compute() return mmin, mmax
Dervin Thunk
  • 19,515
  • 28
  • 127
  • 217