Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub (tutorial): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
4 votes, 2 answers

Dask job killed because of memory usage?

Hi, I have a Python script that uses the dask library to handle a very large data frame, larger than physical memory. I notice that the job gets killed in the middle of a run if memory usage stays at 100% of the machine for some time. Is it…
Bo Qiang (739)
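One common mitigation (a sketch under assumed settings, not the asker's actual setup) is to run under a distributed `Client` with an explicit per-worker `memory_limit`, so workers spill data to disk as they approach the threshold instead of growing until the OS kills the process:

```python
from dask.distributed import Client

# in-process cluster for illustration; memory_limit tells each worker
# to start spilling to disk before it exhausts the machine's memory
client = Client(processes=False, n_workers=1, threads_per_worker=2,
                memory_limit="1GB")
info = client.scheduler_info()
client.close()
```

On a real deployment the same `memory_limit` option is passed to `dask-worker` on the command line.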
4 votes, 1 answer

sort very large data with dask?

I need to sort a data table that is well over the size of the physical memory of the machine I am using. Pandas cannot handle it because it needs to read the entire data into memory. Can dask handle that? Thanks!
Bo Qiang (739)
4 votes, 2 answers

How to explicitly stop a running/live task through dask?

I have a simple task scheduled by dask-scheduler and running on a worker node. My requirement is to be able to stop the task on demand, whenever the user wants.
TheCodeCache (820)
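With the distributed scheduler, `Client.cancel` revokes a future on demand; pending tasks are dropped, though a task already executing in a worker thread cannot be forcibly interrupted mid-call. A sketch with an in-process cluster (the `time.sleep` task is a stand-in for the real workload):

```python
import time
from dask.distributed import Client

client = Client(processes=False, n_workers=1)

# submit a long-running task, then revoke it on demand
future = client.submit(time.sleep, 3)
client.cancel(future)
```

For cooperative interruption of an already-running task, the function itself needs to poll some flag (for example a distributed `Variable`).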
4 votes, 1 answer

create dask DataFrame from a list of dask Series

I need to create a dask DataFrame from a set of dask Series, analogously to constructing a pandas DataFrame from lists: pd.DataFrame({'l1': list1, 'l2': list2}). I am not seeing anything in the API. The dask DataFrame constructor is not supposed to…
Daniel Mahler (7,653)
4 votes, 1 answer

Dask Event loop was unresponsive - work not parallelized

This is a follow-up to this question. I'm now trying to run Dask on multiple EC2 nodes on AWS. I'm able to start up the scheduler on the first machine: I then start up workers on several other machines. From the other machines I'm able to access…
user554481 (1,875)
4 votes, 1 answer

Local Dask worker unable to connect to local scheduler

While running Dask 0.16.0 on OSX 10.12.6 I'm unable to connect a local dask-worker to a local dask-scheduler. I simply want to follow the official Dask tutorial. Steps to reproduce: Step 1: run dask-scheduler Step 2: Run dask-worker…
user554481 (1,875)
4 votes, 1 answer

Dask performances: workflow doubts

I'm confused about how to get the best from dask. The problem: I have a dataframe which contains several timeseries (each one has its own key) and I need to run a function my_fun on each of them. One way to solve it with pandas involves df =…
rpanai (12,515)
4 votes, 2 answers

python dask dataframe splitting column of tuples into two columns

I am using Python 2.7 with dask. I have a dataframe with one column of tuples that I created like this: table[col] = table.apply(lambda x: (x[col1], x[col2]), axis=1, meta=pd.DataFrame). I want to convert this tuple column back into two separate…
thebeancounter (4,261)
4 votes, 0 answers

Dask one-hot encoding MemoryError

I'm trying to encode categorical data with one-hot encoding using dask and export it to csv. The data in question is "movie-actors.dat" from hetrec2011-movielens-2k-v2 (available at https://grouplens.org/datasets/hetrec-2011/). It looks like this…
Pstrg (71)
4 votes, 1 answer

dask distributed 1.19 client logging?

The following code used to emit logs at some point, but no longer seems to do so. Shouldn't configuration of the logging mechanism in each worker permit logs to appear on stdout? If not, what am I overlooking? import logging from distributed import…
lebedov (1,371)
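One reliable pattern for worker-side logging (a sketch, not necessarily what changed in 1.19) is to push the logging configuration to every worker with `Client.run`, which executes a function on each worker and returns a dict keyed by worker address:

```python
import logging
from dask.distributed import Client

client = Client(processes=False, n_workers=1)

def setup_worker_logging():
    # configure a named application logger inside each worker
    logger = logging.getLogger("myapp")  # "myapp" is an illustrative name
    logger.setLevel(logging.DEBUG)
    return logger.level

# returns {worker_address: result} for every worker
levels = client.run(setup_worker_logging)
client.close()
```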
4 votes, 1 answer

Multiple images mean dask.delayed vs. dask.array

Background: I have a list with the paths of a thousand image stacks (3D numpy arrays), preprocessed and saved as .npy binaries. Case study: I would like to calculate the mean of all the images, and to speed up the analysis I thought to parallelise…
s1mc0d3 (523)
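The `dask.array` route wraps each `.npy` load in `delayed`, promotes it with `da.from_delayed` (the shape and dtype must be declared, since nothing is read eagerly), stacks, and takes the mean. A self-contained sketch using tiny temporary files in place of the real image stacks:

```python
import os
import tempfile

import numpy as np
import dask.array as da
from dask import delayed

# stand-in files: three 4x4 "images" filled with 0.0, 1.0, 2.0
tmp = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmp, "stack%d.npy" % i)
    np.save(p, np.full((4, 4), float(i)))
    paths.append(p)

shape, dtype = (4, 4), np.float64  # must be known up front
arrays = [da.from_delayed(delayed(np.load)(p), shape=shape, dtype=dtype)
          for p in paths]
stack = da.stack(arrays)               # lazy (3, 4, 4) array
mean_image = stack.mean(axis=0).compute()
```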
4 votes, 0 answers

How to load dataframe on all dask workers

I have a few thousand CSV files in S3, and I want to load them, concatenate them together into a single pandas dataframe, and share that entire dataframe with all dask workers on a cluster. All of the files are approximately the same size (~1MB). …
Peter Lubans (355)
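For sharing one already-built dataframe with every worker, `Client.scatter` with `broadcast=True` replicates the data to all workers once, so later tasks reference the future instead of re-shipping the frame. A sketch with an in-process cluster and a toy frame standing in for the concatenated CSVs:

```python
import pandas as pd
from dask.distributed import Client

client = Client(processes=False, n_workers=1)

df = pd.DataFrame({"x": [1, 2, 3]})  # stand-in for the concatenated frame

# scatter once; broadcast=True replicates the data to every worker
[fut] = client.scatter([df], broadcast=True)

# tasks receive the worker-local copy, not a fresh serialization
total = client.submit(lambda d: int(d["x"].sum()), fut).result()
client.close()
```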
4 votes, 1 answer

Subsetting Dask DataFrames

Is this a valid way of loading subsets of a dask dataframe into memory?

    while i < len_df:
        j = i + batch_size
        if j > len_df:
            j = len_df
        subset = df.loc[i:j, 'source_country_codes'].compute()

I read somewhere that this may not be…
sachinruk (9,571)
4 votes, 1 answer

dask dataframe set_index throws error

I have a dask dataframe created from a parquet file on HDFS. When setting the index using the set_index API, it fails with the error below. File "/ebs/d1/agent/conda/envs/py361/lib/python3.6/site-packages/dask/dataframe/shuffle.py", line 64, in…
Santosh Kumar (761)
4 votes, 2 answers

Aggregate a Dask dataframe and produce a dataframe of aggregates

I have a Dask dataframe that looks like this:

    url   referrer  session_id  ts                   customer
    url1  ref1      xxx         2017-09-15 00:00:00  a.com
    url2  ref2      yyy         2017-09-15 00:00:00  a.com
    url2  ref3      yyy         …
j-bennet (310)