Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask

Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
4 votes · 2 answers

Use xarray with custom function and resample

I'm trying to take an array and resample it with a custom function. From this post: Apply function along time dimension of XArray def special_mean(x, drop_min=False): s = np.sum(x) n = len(x) if drop_min: s = s - x.min() n -=…
blueduckyy · 201
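A minimal sketch of the `.resample(...).reduce(...)` pattern this question is after (the hourly data and the `drop_min` behaviour here are illustrative; `reduce` hands the custom function the grouped values plus an `axis` keyword and forwards extra kwargs):

```python
import numpy as np
import pandas as pd
import xarray as xr

def special_mean(x, axis=None, drop_min=False):
    # mean that can optionally ignore the minimum value
    s = np.sum(x, axis=axis)
    n = np.sum(np.ones_like(x), axis=axis)  # element count along the same axes
    if drop_min:
        s = s - np.min(x, axis=axis)
        n = n - 1
    return s / n

time = pd.date_range("2020-01-01", periods=48, freq="h")
arr = xr.DataArray(np.arange(48.0), coords={"time": time}, dims="time")

# resample(...).reduce(...) calls special_mean once per day with that
# day's values; the function must accept the `axis` keyword.
daily = arr.resample(time="1D").reduce(special_mean, drop_min=True)
```
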
4 votes · 1 answer

Dask "Length of values does not match length of index" error

I encountered a very strange error having to do with assigning a new column to an existing dask dataframe. Given the below minimal example, import pandas as pd from dask import dataframe as dd from dask import array as da foo =…
emilaz · 1,722
4 votes · 1 answer

Reading custom file format to Dask dataframe

I have a huge custom text file (can't load the entire data into one pandas dataframe) which I want to read into a Dask dataframe. I wrote a generator to read and parse the data in chunks and create pandas dataframes. I want to load these pandas…
najeem · 1,841
4 votes · 2 answers

How best to parallelize grakn queries with Python?

I run Windows 10, Python 3.7, and have a 6-core CPU. A single Python thread on my machine submits 1,000 inserts per second to grakn. I'd like to parallelize my code to insert and match even faster. How are people doing this? My only experience with…
davideps · 541
4 votes · 1 answer

How to check if dask dataframe is empty if lazily evaluated?

I am aware of this question. But check the code (minimal working example) below: import dask.dataframe as dd import pandas as pd # initialise data of lists. data = {'Name': ['Tom', 'nick', 'krish', 'jack'], 'Age': [20, 21, 19, 18]} # Create…
MehmedB · 1,059
4 votes · 3 answers

memory efficient way to create a column that indicates a unique combination of values from a set of columns

I want to find a more efficient way (in terms of peak memory usage and possibly time) to do the work of pandas' groupby.ngroup so that I don't run into memory issues when working with large datasets (I provide reasons for why this column is useful…
jtorca · 1,531
4 votes · 2 answers

how to make a memory efficient multiple dimension groupby/stack using xarray?

I have a large time series of np.float64 with a 5-min frequency (size is ~2,500,000, i.e. ~24 years). I'm using xarray to represent it in memory, and the time dimension is named 'time'. I want to group by 'time.hour' and then 'time.dayofyear' (or…
Ziskin_Ziv · 116
4 votes · 0 answers

dask 100GB dataframe sorting / set_index on new column out of memory issues

I have a dask dataframe of around 100GB and 4 columns that does not fit into memory. My machine is an 8-core Xeon with 64GB of RAM with a local Dask cluster. I converted the dataframe to 150 partitions (700MB each). However, my simple set_index()…
user670186 · 2,588
4 votes · 2 answers

Groupby multiple columns and aggregation with dask

My dask dataframe looks like this: A B C D 1 foo xx this 1 foo xx belongs 1 foo xx together 4 bar xx blubb I want to group by columns A, B, C and join the strings from D with a blank in between, to get A …
bucky · 392
4 votes · 2 answers

How to use multiple cores with sklearn dbscan?

I'm trying to process a large volume of data through dbscan and would love to use all cores available to me on the machine to speed up the computation. I'm using a custom distance metric, but the distance matrix is not precomputed. I have tried…
Lauren K · 125
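A sketch of the `n_jobs` route with a custom callable metric (toy blobs; the metric and data are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 2)),   # blob around (0, 0)
    rng.normal(5.0, 0.1, size=(50, 2)),   # blob around (5, 5)
])

def manhattan(a, b):
    # custom callable metric; sklearn evaluates it pairwise, and
    # n_jobs=-1 spreads the neighbor search over all available cores
    return float(np.abs(a - b).sum())

db = DBSCAN(eps=0.5, min_samples=5, metric=manhattan, n_jobs=-1).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
```

Callable metrics take a slow Python-level path; for heavy workloads, precomputing a sparse distance matrix and passing `metric="precomputed"` is often faster than parallelizing the callable.
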
4 votes · 1 answer

Dask distributed workers always leak memory when running many tasks

What are some strategies to work around or debug this? distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 26.17 GB -- Worker memory limit: 32.66…
mathtick · 6,487
4 votes · 1 answer

df.groupby(...).apply(...) function in dask dataframe

I was using Python dask to process a large csv panel data set (15+GB), and I needed to conduct a groupby(...).apply(...) function to delete the last observations for each stock in each day. My dataset looks like stock date time spread …
FlyUFalcon · 314
4 votes · 1 answer

How do I filter dask.dataframe.read_parquet with timestamp?

I am trying to read some parquet files using dask.dataframe.read_parquet method. In the data I have a column named timestamp, which contains data such as: 0 2018-12-20 19:00:00 1 2018-12-20 20:00:00 2 2018-12-20 21:00:00 3 2018-12-20…
4 votes · 1 answer

How to assign tasks to GPU and CPU Dask workers?

I'm setting up a Dask script to be executed on PSC Bridges P100 GPU nodes. These nodes offer 2 GPUs and 32 CPU-cores. I would like to start CPU and GPU-based dask-workers. The CPU workers will be started: dask-worker --nprocs 1 --nthreads 1 while…
neiron21 · 71
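One common pattern for this (sketched below; the scheduler address, worker counts, and the `GPU`/`CPU` resource names are illustrative) is to tag workers with abstract resources at start-up:

```shell
# two GPU workers, one per GPU, each tagged with a "GPU" resource
dask-worker scheduler-host:8786 --nprocs 2 --nthreads 1 --resources "GPU=1"

# a pool of CPU-only workers tagged with a "CPU" resource
dask-worker scheduler-host:8786 --nprocs 28 --nthreads 1 --resources "CPU=1"
```

On the client side, `client.submit(fn, resources={"GPU": 1})` then routes that task only to workers advertising the matching resource.
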
4 votes · 1 answer

Does dask dataframe.persist() keep results for the next query?

I'm trying to understand how df.persist() works in dask. If I build the same expression again, will it be recalculated or loaded from cache? E.g. what happens when I do: ddf =…
Mark Horvath · 1,136