Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask

Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
4 votes · 2 answers

Use xarray with custom function and resample

I'm trying to take an array and resample it with a custom function. From this post: Apply function along time dimension of XArray def special_mean(x, drop_min=False): s = np.sum(x) n = len(x) if drop_min: s = s - x.min() n -=…
blueduckyy · 201
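A minimal sketch of the `.resample(...).reduce(...)` pattern this question is after (the hourly data and the `drop_min` behaviour here are illustrative; `reduce` hands the custom function the grouped values plus an `axis` keyword and forwards extra kwargs):

```python
import numpy as np
import pandas as pd
import xarray as xr

def special_mean(x, axis=None, drop_min=False):
    # mean that can optionally ignore the minimum value
    s = np.sum(x, axis=axis)
    n = np.sum(np.ones_like(x), axis=axis)  # element count along the same axes
    if drop_min:
        s = s - np.min(x, axis=axis)
        n = n - 1
    return s / n

time = pd.date_range("2020-01-01", periods=48, freq="h")
arr = xr.DataArray(np.arange(48.0), coords={"time": time}, dims="time")

# resample(...).reduce(...) calls special_mean once per day with that
# day's values; the function must accept the `axis` keyword.
daily = arr.resample(time="1D").reduce(special_mean, drop_min=True)
```
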
4 votes · 1 answer

Dask "Length of values does not match length of index" error

I encountered a very strange error having to do with assigning a new column to an existing dask dataframe. Given the below minimal example, import pandas as pd from dask import dataframe as dd from dask import array as da foo =…
emilaz · 1,722
4 votes · 1 answer

Reading custom file format to Dask dataframe

I have a huge custom text file (can't load the entire data into one pandas dataframe) which I want to read into a Dask dataframe. I wrote a generator to read and parse the data in chunks and create pandas dataframes. I want to load these pandas…
najeem · 1,841
4 votes · 2 answers

How best to parallelize grakn queries with Python?

I run Windows 10, Python 3.7, and have a 6-core CPU. A single Python thread on my machine submits 1,000 inserts per second to grakn. I'd like to parallelize my code to insert and match even faster. How are people doing this? My only experience with…
davideps · 541
4 votes · 1 answer

How to check if dask dataframe is empty if lazily evaluated?

I am aware of this question. But check the code (minimal working example) below: import dask.dataframe as dd import pandas as pd # initialise data of lists. data = {'Name': ['Tom', 'nick', 'krish', 'jack'], 'Age': [20, 21, 19, 18]} # Create…
MehmedB · 1,059
4 votes · 3 answers

memory efficient way to create a column that indicates a unique combination of values from a set of columns

I want to find a more efficient way (in terms of peak memory usage and possibly time) to do the work of pandas' groupby.ngroup so that I don't run into memory issues when working with large datasets (I provide reasons for why this column is useful…
jtorca · 1,531
4 votes · 2 answers

how to make a memory efficient multiple dimension groupby/stack using xarray?

I have a large time series of np.float64 with a 5-min frequency (size is ~2,500,000, i.e. ~24 years). I'm using xarray to represent it in memory, and the time dimension is named 'time'. I want to group by 'time.hour' and then 'time.dayofyear' (or…
Ziskin_Ziv · 116
4 votes · 0 answers

dask 100GB dataframe sorting / set_index on new column out of memory issues

I have a dask dataframe of around 100GB and 4 columns that does not fit into memory. My machine is an 8-core Xeon with 64GB of RAM with a local Dask cluster. I converted the dataframe to 150 partitions (700MB each). However, my simple set_index()…
user670186 · 2,588
4 votes · 2 answers

Groupby multiple columns and aggregation with dask

My dask dataframe looks like this: A B C D 1 foo xx this 1 foo xx belongs 1 foo xx together 4 bar xx blubb I want to group by columns A, B, C and join the strings from D with a blank in between, to get A …
bucky · 392
4 votes · 2 answers

How to use multiple cores with sklearn dbscan?

I'm trying to process a large volume of data through dbscan and would love to use all cores available to me on the machine to speed up the computation. I'm using a custom distance metric, but the distance matrix is not precomputed. I have tried…
Lauren K · 125
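A sketch of the `n_jobs` route with a custom callable metric (toy blobs; the metric and data are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 2)),   # blob around (0, 0)
    rng.normal(5.0, 0.1, size=(50, 2)),   # blob around (5, 5)
])

def manhattan(a, b):
    # custom callable metric; sklearn evaluates it pairwise, and
    # n_jobs=-1 spreads the neighbor search over all available cores
    return float(np.abs(a - b).sum())

db = DBSCAN(eps=0.5, min_samples=5, metric=manhattan, n_jobs=-1).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
```

Callable metrics take a slow Python-level path; for heavy workloads, precomputing a sparse distance matrix and passing `metric="precomputed"` is often faster than parallelizing the callable.
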
4 votes · 1 answer

Dask distributed workers always leak memory when running many tasks

What are some strategies to work around or debug this? distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 26.17 GB -- Worker memory limit: 32.66…
mathtick · 6,487
4 votes · 1 answer

df.groupby(...).apply(...) function in dask dataframe

I was using Python dask to process a large csv panel data set (15+GB), and I needed to conduct a groupby(...).apply(...) function to delete the last observations for each stock in each day. My dataset looks like stock date time spread …
FlyUFalcon · 314
4 votes · 1 answer

How do I filter dask.dataframe.read_parquet with timestamp?

I am trying to read some parquet files using dask.dataframe.read_parquet method. In the data I have a column named timestamp, which contains data such as: 0 2018-12-20 19:00:00 1 2018-12-20 20:00:00 2 2018-12-20 21:00:00 3 2018-12-20…
4 votes · 1 answer

How to assign tasks to GPU and CPU Dask workers?

I'm setting up a Dask script to be executed on PSC Bridges P100 GPU nodes. These nodes offer 2 GPUs and 32 CPU-cores. I would like to start CPU and GPU-based dask-workers. The CPU workers will be started: dask-worker --nprocs 1 --nthreads 1 while…
neiron21 · 71
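One common pattern for this (sketched below; the scheduler address, worker counts, and the `GPU`/`CPU` resource names are illustrative) is to tag workers with abstract resources at start-up:

```shell
# two GPU workers, one per GPU, each tagged with a "GPU" resource
dask-worker scheduler-host:8786 --nprocs 2 --nthreads 1 --resources "GPU=1"

# a pool of CPU-only workers tagged with a "CPU" resource
dask-worker scheduler-host:8786 --nprocs 28 --nthreads 1 --resources "CPU=1"
```

On the client side, `client.submit(fn, resources={"GPU": 1})` then routes that task only to workers advertising the matching resource.
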
4 votes · 1 answer

Does dask dataframe.persist() keep results for the next query?

I'm trying to understand how df.persist() works in dask. If I build the same expression again, will it be recalculated or loaded from cache? E.g. what happens when I do: ddf =…
Mark Horvath · 1,136