Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers (see the sketch below).
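
Both components show up in a few lines of code. A minimal sketch (the array size and the delayed function are arbitrary examples):

    import dask
    import dask.array as da

    # Collection component: a parallel, larger-than-memory array that
    # mirrors the NumPy interface, split into 1000x1000 chunks.
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
    total = x.mean()  # builds a task graph; nothing runs yet

    # Scheduling component: dask.delayed turns ordinary functions
    # into tasks on the same dynamic scheduler.
    @dask.delayed
    def double(n):
        return 2 * n

    print(total.compute(), double(21).compute())
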

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3 votes · 2 answers

Cleanest way to support xarray, dask, and numpy arrays in one function

I have a function that accepts multiple 2D arrays and creates two new arrays with the same shape. It was originally written to only support numpy arrays, but was "hacked" to support dask arrays if a "chunks" attribute was seen. A user who was using…
djhoese
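
One way to avoid the "chunks"-attribute hack (a sketch, not necessarily the thread's accepted answer) is to write the core logic against the NumPy API and let duck typing dispatch; normalize below is a hypothetical stand-in for the asker's function:

    import numpy as np

    def normalize(arr):
        # Works unchanged for numpy.ndarray, dask.array.Array, and
        # xarray.DataArray: all three implement mean(), std(), and
        # the arithmetic operators, so no type sniffing is needed.
        return (arr - arr.mean()) / arr.std()

    print(normalize(np.arange(5.0)))
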
3 votes · 0 answers

Specify how to partition a dask dataframe?

I have a pandas df that's indexed by id and date. I would like to run some regressions for each id in parallel using dask. I know dask splits the df into N partitions, but is there a way to force it to split by the id column? This way when I do…
Alex
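
A sketch of one approach (the frame below is a stand-in for the asker's data): set_index shuffles the rows so that a given id never spans two partitions, after which per-id work can run partition-locally.

    import pandas as pd
    import dask.dataframe as dd

    # Stand-in for the asker's pandas df indexed by (id, date).
    pdf = pd.DataFrame({
        "id": [1, 1, 2, 2],
        "date": pd.date_range("2020-01-01", periods=4),
        "y": [1.0, 2.0, 3.0, 4.0],
    })

    # set_index("id") repartitions so all rows with the same id land
    # in the same partition; dask never splits one index value.
    ddf = dd.from_pandas(pdf, npartitions=2).set_index("id")

    # Per-id computation, e.g. group sizes, stays within partitions.
    print(ddf.map_partitions(lambda p: p.groupby(level=0).size()).compute())
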
3 votes · 1 answer

How do I get adaptive dask workers to run some code on startup?

I'm creating a dask scheduler using dask-kubernetes and putting it into adaptive mode. from dask_kubernetes import KubeCluster cluster = KubeCluster() cluster.adapt(minimum=0, maximum=40) I need each worker to run some setup code when they are…
Jacob Tomlinson
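
One mechanism that also covers workers the adaptive scaler starts later (a sketch; the setup body is a hypothetical placeholder) is Client.register_worker_callbacks, which replays the callback on every worker that joins:

    from dask.distributed import Client
    from dask_kubernetes import KubeCluster

    cluster = KubeCluster()              # as in the question
    cluster.adapt(minimum=0, maximum=40)
    client = Client(cluster)

    def setup():
        # Runs once on each worker, including workers spawned by
        # the adaptive scaler after registration.
        import os
        os.environ["MY_FLAG"] = "1"      # hypothetical setup work

    client.register_worker_callbacks(setup)
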
3 votes · 1 answer

Save dask dataframe to csv and find out its length without computing twice

Say I have some dask dataframe. I'd like to do some operations on it, then save it to csv and print its length. As I understand it, the following code will make dask compute df twice, am I right? df = dd.read_csv('path/to/file',…
elfinorr
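
A sketch of the usual trick: ask to_csv for delayed write tasks with compute=False, then evaluate them and the row count in a single dask.compute call, so the shared graph runs once (paths are placeholders):

    import dask
    import dask.dataframe as dd

    df = dd.read_csv("path/to/file")        # placeholder path
    # ... operations on df ...

    # compute=False returns delayed write tasks instead of running.
    writes = df.to_csv("out-*.csv", compute=False)

    # One compute call shares intermediates between the CSV write
    # and the row count, so df is evaluated only once.
    _, n_rows = dask.compute(writes, df.shape[0])
    print(n_rows)
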
3 votes · 1 answer

How to reliably clean up dask scheduler/worker

I'm starting up a dask cluster in an automated way by ssh-ing into a bunch of machines and running dask-worker. I noticed that I sometimes run into problems when processes from a previous experiment are still running. What's the best way to clean up…
John
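
A sketch of one defensive pattern (the scheduler address is a placeholder): client.shutdown() tears down the scheduler and all of its workers, so stale processes don't linger between experiments.

    from dask.distributed import Client

    client = Client("tcp://scheduler-host:8786")  # placeholder

    # Asks the scheduler to close every worker and then itself, so
    # no stale dask-worker processes survive the experiment.
    client.shutdown()

On the worker side, starting with dask-worker scheduler-host:8786 --death-timeout 60 makes an orphaned worker exit on its own if it cannot reach the scheduler for 60 seconds.
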
3 votes · 1 answer

Converting a Dask column into new Dask column of type datetime

I have an unparsed column in a dask dataframe (df) that I am using pandas to convert to datetime and put into a new column in the dask dataframe. However, it breaks because column assignment doesn't support type DatetimeIndex. df['New Column'] =…
Usherwood
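
A sketch using dask's own converter, which returns a lazy dask Series that column assignment accepts (the frame, column names, and date format are stand-ins):

    import pandas as pd
    import dask.dataframe as dd

    # Tiny stand-in for the asker's dataframe.
    df = dd.from_pandas(
        pd.DataFrame({"Unparsed": ["01/02/2018", "03/04/2018"]}),
        npartitions=1,
    )

    # Unlike pd.to_datetime on a computed column (which yields a
    # DatetimeIndex), dd.to_datetime stays lazy and returns a Series.
    df["New Column"] = dd.to_datetime(df["Unparsed"], format="%d/%m/%Y")
    print(df.compute())
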
3 votes · 1 answer

Dask - Drop duplicate index MemoryError

I'm getting a MemoryError when I try to drop duplicate timestamps on a large dataframe with the following code. import dask.dataframe as dd path = f's3://{container_name}/*' ddf = dd.read_parquet(path, storage_options=opts,…
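
A sketch of one mitigation, with a hypothetical column name: de-duplicating on a column with split_out keeps the result spread across many partitions instead of concatenating everything into a single in-memory partition.

    import dask.dataframe as dd

    ddf = dd.read_parquet("s3://bucket/*")  # placeholder path

    # Moving the timestamp out of the index allows drop_duplicates
    # with split_out, which shards the result over many output
    # partitions rather than building one giant one.
    ddf = (
        ddf.reset_index()
           .drop_duplicates(subset=["timestamp"], split_out=ddf.npartitions)
           .set_index("timestamp")
    )
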
3 votes · 1 answer

Masking in Dask

I was just wondering if someone could help show me how to apply functions such as "sum" or "mean" on masked arrays using dask. I wish to calculate the sum / mean of the array using only the values where there is no mask. Code: import dask.array as…
Chen
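
A sketch using a plain boolean mask (the sizes are arbitrary): selecting with the mask yields a lazy array of only the unmasked values, which sum and mean then reduce over.

    import dask.array as da

    arr = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))
    keep = arr > 0.5        # True marks the values to include

    # Boolean selection returns a lazy 1-d array of kept values.
    print(arr[keep].sum().compute(), arr[keep].mean().compute())
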
3 votes · 0 answers

Implementation of a recursive function using dask.delayed

How can I implement merge sort using dask.delayed or some other dask API, so that it becomes faster through parallelism?
Dhruv Kumar
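
A sketch of one way to express it (the threshold is an arbitrary tuning knob; below it, plain sorted() is cheaper than scheduling more tasks, and for ordinary Python lists the task overhead often outweighs the parallel gain):

    from dask import delayed

    @delayed
    def merge(left, right):
        # Standard two-way merge of two sorted lists.
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                out.append(left[i]); i += 1
            else:
                out.append(right[j]); j += 1
        out += left[i:] + right[j:]
        return out

    def merge_sort(xs, threshold=10_000):
        if len(xs) <= threshold:
            return delayed(sorted)(xs)   # leaf task: sort serially
        mid = len(xs) // 2
        return merge(merge_sort(xs[:mid]), merge_sort(xs[mid:]))

    result = merge_sort(list(range(50_000, 0, -1))).compute()
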
3 votes · 1 answer

Dask Apply of Python Function

I have a df: id log 0 24 2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0* 1 25 2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0* I…
OverflowingTheGlass
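
A sketch of applying a parser per row; parse_log is a hypothetical parser for the '*'-delimited format shown, and meta tells dask the output type without computing:

    import pandas as pd
    import dask.dataframe as dd

    # Stand-in for the asker's frame (log strings abbreviated).
    df = dd.from_pandas(
        pd.DataFrame({"id": [24, 25], "log": ["2*C316*...", "2*C316*..."]}),
        npartitions=1,
    )

    def parse_log(s):
        # Hypothetical parser for the delimited log format.
        return s.split("*")

    df["parsed"] = df["log"].apply(parse_log, meta=("parsed", "object"))
    print(df.compute())
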
3 votes · 0 answers

Fair and realistic comparison using dask

In order to get a better idea of the dask library in Python, I am trying to make a fair comparison between using dask and not. I used h5py to create a big dataset, which was later used to compute the mean along one axis as a numpy-style operation. I…
Fence
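
A sketch of one way to time the two sides fairly, reading the same h5py dataset eagerly and through dask (the file name, dataset name, and chunk size are placeholders, and the chunk size strongly affects the result):

    import time
    import h5py
    import numpy as np
    import dask.array as da

    with h5py.File("data.h5", "r") as f:       # placeholder file
        dset = f["x"]                          # placeholder dataset

        t0 = time.perf_counter()
        m_np = np.asarray(dset).mean(axis=0)   # eager NumPy baseline
        t_np = time.perf_counter() - t0

        arr = da.from_array(dset, chunks=(1_000, -1))
        t0 = time.perf_counter()
        m_da = arr.mean(axis=0).compute()      # parallel dask version
        t_da = time.perf_counter() - t0

    print(f"numpy {t_np:.2f}s, dask {t_da:.2f}s")
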
3 votes · 1 answer

File Not Found Error in Dask program run on cluster

I have 4 machines: M1, M2, M3, and M4. The scheduler, client, and a worker run on M1. I've put a csv file on M1; the rest of the machines are workers. When I run the program with read_csv in dask, it gives me a file-not-found error.
Dhruv Kumar
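
The usual cause is that read_csv tasks execute on workers, which cannot see a path that exists only on M1. A sketch of one workaround (the address and path are placeholders): read the file where it lives and scatter the data into the cluster; a shared filesystem or an s3:// path visible to every worker avoids the issue entirely.

    import pandas as pd
    from dask.distributed import Client

    client = Client("tcp://M1:8786")      # placeholder scheduler

    # Read on the machine that actually has the file, then ship
    # the data into cluster memory as a future.
    pdf = pd.read_csv("/data/input.csv")  # placeholder path
    [future] = client.scatter([pdf])
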
3 votes · 1 answer

Dask: set multiprocessing method from Python

Is there a way to set the multiprocessing method from Python? I do not see a method in the Client() API docs of Dask.distributed that indicates how to set this property. Update: For example, is there: client =…
ericmjl
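
For the local multiprocessing scheduler (not the distributed Client the question asks about), the start method is exposed through dask's config system; a sketch, with an arbitrary example collection:

    import dask
    import dask.array as da

    x = da.random.random((1_000, 1_000), chunks=(250, 250))

    # "spawn" and "forkserver" are the usual alternatives to "fork".
    with dask.config.set({"multiprocessing.context": "spawn"}):
        result = x.sum().compute(scheduler="processes")
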
3 votes · 3 answers

Dask read_json metadata mismatch

I'm trying to load json files into a dask df. files = glob.glob('**/*.json', recursive=True) df = dd.read_json(files, lines = False) There are some missing values in the data, and some of the files have extra columns. Is there a way to specify a…
Maria
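
A sketch of one way to force a uniform schema, with a hypothetical column list: load each file through delayed, reindex every frame to the expected columns, and hand from_delayed an explicit meta:

    import glob
    import pandas as pd
    import dask.dataframe as dd
    from dask import delayed

    files = glob.glob("**/*.json", recursive=True)
    columns = ["a", "b", "c"]      # hypothetical expected schema

    @delayed
    def load(path):
        df = pd.read_json(path)
        # Adds missing columns (as NaN) and drops extras, so every
        # partition matches the declared meta.
        return df.reindex(columns=columns)

    meta = pd.DataFrame(columns=columns)
    ddf = dd.from_delayed([load(f) for f in files], meta=meta)
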
3 votes · 0 answers

Dask scheduler behavior while reading/retrieving large datasets

This is a follow-up to this question. I'm experiencing problems with persisting a large dataset in distributed memory. I have a scheduler running on one machine and 8 workers each running on their own machines connected by 40 gigabit ethernet and a…
A.C.