Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub (tutorial): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
4 votes, 2 answers

Dask job killed because of memory usage?

Hi, I have a Python script that uses the dask library to handle a very large data frame, larger than physical memory. I notice that the job gets killed in the middle of a run if memory usage stays at 100% of the machine for some time. Is it…
Bo Qiang (739)
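One common mitigation (a sketch under assumed settings, not the asker's actual setup) is to run under a distributed `Client` with an explicit per-worker `memory_limit`, so workers spill data to disk as they approach the threshold instead of growing until the OS kills the process:

```python
from dask.distributed import Client

# in-process cluster for illustration; memory_limit tells each worker
# to start spilling to disk before it exhausts the machine's memory
client = Client(processes=False, n_workers=1, threads_per_worker=2,
                memory_limit="1GB")
info = client.scheduler_info()
client.close()
```

On a real deployment the same `memory_limit` option is passed to `dask-worker` on the command line.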
4 votes, 1 answer

sort very large data with dask?

I need to sort a data table that is well over the size of the physical memory of the machine I am using. Pandas cannot handle it because it needs to read the entire data into memory. Can dask handle that? Thanks!
Bo Qiang (739)
4 votes, 2 answers

How to explicitly stop a running/live task through dask?

I have a simple task scheduled by dask-scheduler and running on a worker node. My requirement is to be able to stop the task on demand, whenever the user wants.
TheCodeCache (820)
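With the distributed scheduler, `Client.cancel` revokes a future on demand; pending tasks are dropped, though a task already executing in a worker thread cannot be forcibly interrupted mid-call. A sketch with an in-process cluster (the `time.sleep` task is a stand-in for the real workload):

```python
import time
from dask.distributed import Client

client = Client(processes=False, n_workers=1)

# submit a long-running task, then revoke it on demand
future = client.submit(time.sleep, 3)
client.cancel(future)
```

For cooperative interruption of an already-running task, the function itself needs to poll some flag (for example a distributed `Variable`).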
4 votes, 1 answer

create dask DataFrame from a list of dask Series

I need to create a dask DataFrame from a set of dask Series, analogously to constructing a pandas DataFrame from lists: pd.DataFrame({'l1': list1, 'l2': list2}). I am not seeing anything in the API. The dask DataFrame constructor is not supposed to…
Daniel Mahler (7,653)
4 votes, 1 answer

Dask Event loop was unresponsive - work not parallelized

This is a follow-up to this question. I'm now trying to run Dask on multiple EC2 nodes on AWS. I'm able to start up the scheduler on the first machine: I then start up workers on several other machines. From the other machines I'm able to access…
user554481 (1,875)
4 votes, 1 answer

Local Dask worker unable to connect to local scheduler

While running Dask 0.16.0 on OSX 10.12.6 I'm unable to connect a local dask-worker to a local dask-scheduler. I simply want to follow the official Dask tutorial. Steps to reproduce: Step 1: run dask-scheduler Step 2: Run dask-worker…
user554481 (1,875)
4 votes, 1 answer

Dask performances: workflow doubts

I'm confused about how to get the best from dask. The problem: I have a dataframe which contains several timeseries (each one has its own key) and I need to run a function my_fun on each of them. One way to solve it with pandas involves df =…
rpanai (12,515)
4 votes, 2 answers

python dask dataframe splitting column of tuples into two columns

I am using Python 2.7 with dask. I have a dataframe with one column of tuples that I created like this: table[col] = table.apply(lambda x: (x[col1], x[col2]), axis=1, meta=pd.DataFrame). I want to convert this tuple column back into two separate…
thebeancounter (4,261)
4 votes, 0 answers

Dask one-hot encoding MemoryError

I'm trying to encode categorical data with one-hot encoding using dask and export it to csv. The data in question is "movie-actors.dat" from hetrec2011-movielens-2k-v2 (available at https://grouplens.org/datasets/hetrec-2011/). It looks like this…
Pstrg (71)
4 votes, 1 answer

dask distributed 1.19 client logging?

The following code used to emit logs at some point, but no longer seems to do so. Shouldn't configuration of the logging mechanism in each worker permit logs to appear on stdout? If not, what am I overlooking? import logging from distributed import…
lebedov (1,371)
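One reliable pattern for worker-side logging (a sketch, not necessarily what changed in 1.19) is to push the logging configuration to every worker with `Client.run`, which executes a function on each worker and returns a dict keyed by worker address:

```python
import logging
from dask.distributed import Client

client = Client(processes=False, n_workers=1)

def setup_worker_logging():
    # configure a named application logger inside each worker
    logger = logging.getLogger("myapp")  # "myapp" is an illustrative name
    logger.setLevel(logging.DEBUG)
    return logger.level

# returns {worker_address: result} for every worker
levels = client.run(setup_worker_logging)
client.close()
```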
4 votes, 1 answer

Multiple images mean dask.delayed vs. dask.array

Background: I have a list with the paths of a thousand image stacks (3D numpy arrays), preprocessed and saved as .npy binaries. Case study: I would like to calculate the mean of all the images, and to speed up the analysis I thought to parallelise…
s1mc0d3 (523)
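The `dask.array` route wraps each `.npy` load in `delayed`, promotes it with `da.from_delayed` (the shape and dtype must be declared, since nothing is read eagerly), stacks, and takes the mean. A self-contained sketch using tiny temporary files in place of the real image stacks:

```python
import os
import tempfile

import numpy as np
import dask.array as da
from dask import delayed

# stand-in files: three 4x4 "images" filled with 0.0, 1.0, 2.0
tmp = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmp, "stack%d.npy" % i)
    np.save(p, np.full((4, 4), float(i)))
    paths.append(p)

shape, dtype = (4, 4), np.float64  # must be known up front
arrays = [da.from_delayed(delayed(np.load)(p), shape=shape, dtype=dtype)
          for p in paths]
stack = da.stack(arrays)               # lazy (3, 4, 4) array
mean_image = stack.mean(axis=0).compute()
```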
4 votes, 0 answers

How to load dataframe on all dask workers

I have a few thousand CSV files in S3, and I want to load them, concatenate them together into a single pandas dataframe, and share that entire dataframe with all dask workers on a cluster. All of the files are approximately the same size (~1MB). …
Peter Lubans (355)
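For sharing one already-built dataframe with every worker, `Client.scatter` with `broadcast=True` replicates the data to all workers once, so later tasks reference the future instead of re-shipping the frame. A sketch with an in-process cluster and a toy frame standing in for the concatenated CSVs:

```python
import pandas as pd
from dask.distributed import Client

client = Client(processes=False, n_workers=1)

df = pd.DataFrame({"x": [1, 2, 3]})  # stand-in for the concatenated frame

# scatter once; broadcast=True replicates the data to every worker
[fut] = client.scatter([df], broadcast=True)

# tasks receive the worker-local copy, not a fresh serialization
total = client.submit(lambda d: int(d["x"].sum()), fut).result()
client.close()
```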
4 votes, 1 answer

Subsetting Dask DataFrames

Is this a valid way of loading subsets of a dask dataframe into memory?

    while i < len_df:
        j = i + batch_size
        if j > len_df:
            j = len_df
        subset = df.loc[i:j, 'source_country_codes'].compute()

I read somewhere that this may not be…
sachinruk (9,571)
4 votes, 1 answer

dask dataframe set_index throws error

I have a dask dataframe created from a parquet file on HDFS. When setting the index using the set_index API, it fails with the error below. File "/ebs/d1/agent/conda/envs/py361/lib/python3.6/site-packages/dask/dataframe/shuffle.py", line 64, in…
Santosh Kumar (761)
4 votes, 2 answers

Aggregate a Dask dataframe and produce a dataframe of aggregates

I have a Dask dataframe that looks like this:

    url   referrer  session_id  ts                   customer
    url1  ref1      xxx         2017-09-15 00:00:00  a.com
    url2  ref2      yyy         2017-09-15 00:00:00  a.com
    url2  ref3      yyy         …
j-bennet (310)