Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components:

Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
“Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
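
The two components described above can be illustrated with a minimal sketch. It assumes dask and NumPy are installed; the helper functions `inc` and `add` and the array and chunk sizes are made up for illustration and are not part of Dask itself. The first half builds a small task graph with `dask.delayed`, the second half uses a chunked `dask.array` collection that mirrors the NumPy interface.

```python
import dask.array as da
from dask import delayed


# Hypothetical helper functions, wrapped so that calling them
# records a task instead of running immediately.
@delayed
def inc(x):
    return x + 1


@delayed
def add(x, y):
    return x + y


# Task scheduling: this only builds a graph; nothing runs yet.
total = add(inc(1), inc(2))
# compute() hands the graph to a scheduler and returns the result.
result = total.compute()

# Collections: a 1000x1000 array of ones split into 250x250 chunks.
# The sizes are arbitrary, chosen only for the example.
x = da.ones((1000, 1000), chunks=(250, 250))
# The reduction is also lazy until compute() is called.
col_sum = x.sum(axis=0).compute()
```

In both halves the work is deferred until `compute()` is called, which is how the collections end up running on top of the task schedulers.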

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions

votes

1 answer

Avoid recomputing same values in Dask?

I would expect in the following code, the first computation to take 3+sec and the second one to be much faster. What should I do to get dask to avoid re-doing a computation to the client? (I had previously searched for the answer to this question,…

python distributed dask

asked Jan 26 '17 at 20:17

julienl

votes

1 answer

Python Datashader to plot large 2D arrays of points

I am looking for some help/advice on using datashader to plot a large 2D data array as a series of points, colored by amplitude. The data I deal with is housed in several 2D HDF5 datasets, with a time index stored in a separate dataset. The…

python bokeh dask datashader

asked Jan 06 '17 at 10:41

George Crowther

votes

1 answer

Dask: outer join read from multiple csv files

import dask.dataframe as dd import numpy as np from dask import delayed df1 = pd.DataFrame({'a': np.arange(10), 'b': np.random.rand()}) df1 = df1.astype({'a':np.float64}) df2 = pd.DataFrame({'a': np.random.rand(5), 'c':…

dask

asked Dec 14 '16 at 15:06

Alexander Reshytko

votes

1 answer

dask.DataFrame.apply and variable length data

I would like to apply a function to a dask.DataFrame, that returns a Series of variable length. An example to illustrate this: def generate_varibale_length_series(x): '''returns pd.Series with variable length''' n_columns =…

python dask

asked Dec 13 '16 at 23:46

Arco Bast

votes

1 answer

What is the most efficient way to utilize dask multiprocessing scheduler if data flow between tasks is big?

We have a dask compute graph (quite custom, so we use dask delayed instead of collections). I've read in the docs that the current scheduling policy is LIFO, so that a worker process has a good chance of getting the data it has just computed for further steps…

python parallel-processing dask

asked Dec 12 '16 at 23:43

Alexander Reshytko

votes

1 answer

How does dask distribute work amongst the cluster?

Can dask distributed handle uneven worker nodes? For example, if there is a dask worker on a 4 core computer and a second dask worker on a 2 core computer, will all 6 cores be utilised? Also is it a strict requirement for dask to distribute the…

dask

asked Dec 09 '16 at 14:07

Greg

votes

1 answer

How to make dworkers for multiprocess?

I am working on distributed cluster computing. To implement such a system I am trying to use the Python library dask.distributed. The problem is that the dworkers are not multiprocess: 2 or 3 dworkers work together but don't…

ipython distributed-computing distributed dask

asked Dec 08 '16 at 20:58

Saikat Kundu

votes

2 answers

How to combine two pandas dataframes with a conditional?

I have two pandas dataframes which I would like to combine with a rule. This is the first dataframe: import pandas as pd df1 = pd.DataFrame() df1 rank begin end labels first 30953 31131 label1 first 31293 31435 …

python pandas dataframe merge dask

asked Dec 02 '16 at 03:17

EB2127

votes

1 answer

convert a dask dataframe to a matrix or 2-d array

Is there a way we can convert a dask dataframe to a matrix or 2-d array? I know that dask does not yet support multi-indexing. I don't know how we can use dask delayed for this.

python dask

asked Dec 01 '16 at 18:03

Alger Remirata

votes

1 answer

Dask dataframe has no attribute '_meta_nonempty' while merging large CSVs in Python

I tried Pandas with: import pandas as pd df1 = pd.read_csv("csv1.csv") df2 = pd.read_csv("csv2.csv") my_keys = ["my_id", "my_subid"] joined_df = pd.merge(df1, df1, on=my_keys) joined_df.to_csv('out_df.csv', index=False) And got a memory error after…

python pandas dask

asked Nov 30 '16 at 03:07

CommonSurname

votes

1 answer

dask csv reading order

I have a time series whose values are stored in different CSV files. Each CSV is sorted and contains a variable seconds that is a time scan. df = dd.read_csv('/home/data/derived/ips_subnets.7days/*') df.head() seconds IP …

python csv dask

asked Nov 29 '16 at 13:39

Donbeo

votes

0 answers

using DataFrame with dask.multiprocessing not executing in parallel

Why doesn't dask use all of the available cores? I'm running this code: import pandas as pd import numpy as np for year in range(2000, 2005): #i have change days idx = pd.date_range(str(year), str(year + 1), freq='d', closed='left') …

dask

asked Nov 27 '16 at 17:14

sami

votes

1 answer

MemoryError merging two dataframes with pandas and dask---how can I do this?

I have two dataframes in pandas. I would like to merge these two dataframes, but I keep running into memory errors. What is a workaround I could use? Here is the setup: import pandas as pd df1 = pd.read_csv("first1.csv") df2 =…

python pandas merge out-of-memory dask

asked Nov 23 '16 at 17:38

EB2127

votes

1 answer

dask.bag processing data out-of-memory

I'm trying to use dask bag to run a word count over 30GB of JSON files, strictly following the tutorial from the official website: http://dask.pydata.org/en/latest/examples/bag-word-count-hdfs.html It still does not work; my single machine has 32GB of memory and 8 cores…

dask blaze

asked Nov 01 '16 at 21:00

SharpLu

votes

1 answer

correct pattern for dask compute minimum?

Is this the correct way to call compute()? def call_minmax_duration(data): mmin = dd.DataFrame.min(data).compute() mmax = dd.DataFrame.max(data).compute() return mmin, mmax

python dask

asked Nov 01 '16 at 19:07

Dervin Thunk
