Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
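A minimal sketch of the two components in action (the function names and sizes here are illustrative, not from any question below):

```python
import dask
import dask.array as da

# Component 1 - dynamic task scheduling: build a graph of delayed calls,
# then let the scheduler execute them (potentially in parallel).
@dask.delayed
def double(x):
    return 2 * x

total = dask.delayed(sum)([double(i) for i in range(5)])
print(total.compute())  # 20

# Component 2 - a "Big Data" collection: a chunked, NumPy-like array
# whose chunks can be larger than memory in aggregate.
x = da.ones((1000, 1000), chunks=(250, 250))
print(x.sum().compute())  # 1000000.0
```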

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask (tutorial: https://github.com/dask/dask-tutorial)

Main Page: https://dask.org/

4440 questions
7 votes · 1 answer
Python PANDAS: Converting from pandas/numpy to dask dataframe/array

I am working to try to convert a program to be parallelizable/multithreaded with the excellent dask library. Here is the program I am working on converting: Python PANDAS: Stack by Enumerated Date to Create Records Vectorized import pandas as…
Pylander · 1,531
7 votes · 3 answers
Semaphores in dask.distributed?

I have a dask cluster with n workers and want the workers to do queries to the database. But the database is only capable of handling m queries in parallel where m < n. How can I model that in dask.distributed? Only m workers should work on such a…
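dask.distributed ships a cluster-wide `Semaphore` that models exactly this m-of-n constraint; a sketch on an in-process cluster (the name "database" and the `query` body are placeholders for the real database work):

```python
from dask.distributed import Client, Semaphore

# In-process cluster standing in for the real n-worker deployment.
client = Client(processes=False, n_workers=1, threads_per_worker=4)

# At most m = 2 leases cluster-wide, regardless of the number of workers.
sem = Semaphore(max_leases=2, name="database")

def query(i, sem):
    with sem:  # blocks until one of the m leases is free
        return i * i  # placeholder for the actual database call

futures = client.map(query, range(8), sem=sem)
results = client.gather(futures)
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
client.close()
```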
7 votes · 2 answers
What is the default directory where dask workers store results or files?

[mapr@impetus-i0057 latest_code_deepak]$ dask-worker 172.26.32.37:8786 distributed.nanny - INFO - Start Nanny at: 'tcp://172.26.32.36:50930' distributed.diskutils - WARNING - Found stale lock file and directory…
TheCodeCache · 820
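The answer hinges on the worker's `local_directory`, which derives from the `temporary-directory` config value (falling back to the system temp directory) and can be pinned with `dask-worker --local-directory`. A sketch probing it on an in-process cluster:

```python
import tempfile
import dask
from dask.distributed import Client

# Unless overridden, workers place spilled data under a worker-space
# directory derived from the "temporary-directory" config value
# (None means: fall back to the system temp directory).
print(dask.config.get("temporary-directory"))

# Pinning it explicitly; the CLI equivalent is `dask-worker --local-directory`.
scratch = tempfile.mkdtemp()
client = Client(processes=False, n_workers=1, local_directory=scratch)
worker = list(client.cluster.workers.values())[0]
print(worker.local_directory)  # a per-worker directory under `scratch`
client.close()
```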
7 votes · 1 answer
Dask: create strictly increasing index

As is well documented, Dask creates a strictly increasing index on a per partition basis when reset_index is called, resulting in duplicate indices over the whole set. What is the best way (e.g. computationally quickest) to create a strictly…
morganics · 1,209
7 votes · 1 answer
Is there a way to get the nlargest items per group in dask?

I have the following dataset: location category percent A 5 100.0 B 3 100.0 C 2 50.0 4 13.0 D 2 75.0 3 59.0 4 …
whisperstream · 1,897
7 votes · 1 answer
What is the equivalent to iloc for dask dataframe?

I have a situation where I need to index a dask dataframe by location. I see that there is not an .iloc method available. Is there an alternative? Or am I required to use label-based indexing? For example, I would like to import dask.dataframe…
Tim Morton · 240
7 votes · 1 answer
Why do pandas and dask perform better when importing from CSV compared to HDF5?

I am working with a system that currently operates with large (>5GB) .csv files. To increase performance, I am testing (A) different methods to create dataframes from disk (pandas VS dask) as well as (B) different ways to store results to disk (.csv…
sudonym · 3,788
7 votes · 2 answers
Groupby.transform doesn't work in dask dataframe

I'm using the following dask.dataframe AID: AID FID ANumOfF 0 1 X 1 1 1 Y 5 2 2 Z 6 3 2 A 1 4 2 X 11 5 2 B 18 I know in a pandas dataframe I could…
BKS · 2,227
7 votes · 1 answer
Dask.dataframe: out of memory when merging and groupby

I am new to Dask and having some troubles with it. I am using a machine ( 4GB RAM, 2 cores) to analyse two csv files ( key.csv: ~2 million rows about 300Mb, sig.csv: ~12 million row about 600Mb). With these data, pandas can't fit in the memory, so I…
7 votes · 2 answers
Row by row processing of a Dask DataFrame

I need to process a large file and to change some values. I would like to do something like that: for index, row in dataFrame.iterrows(): foo = doSomeStuffWith(row) lol = doOtherStuffWith(row) dataFrame['colx'][index] =…
Caerbanog · 71
7 votes · 1 answer
Non-deterministic results with dask

I'm getting non-deterministic results for some matrix computations with dask. I narrowed it down to this simple example: import numpy as np import dask.array as da seed = 1234 np.random.seed(seed) N = 1000 p = 10 X = np.random.random((N, p + 1)) X…
Thrasibule · 319
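Two things are commonly behind this: floating-point reductions whose summation order depends on chunking, and the fact that `np.random.seed` does not seed dask's generator. A sketch of reproducible dask randomness via `RandomState`:

```python
import dask.array as da

# np.random.seed does not seed dask; give dask its own seeded state.
a = da.random.RandomState(1234).random_sample((1000,), chunks=100)
b = da.random.RandomState(1234).random_sample((1000,), chunks=100)
print(bool((a.compute() == b.compute()).all()))  # True

# Note: reductions are only bitwise-reproducible for a fixed chunking;
# changing `chunks` changes the summation order, which can shift the
# result by a few ulps even with identical input data.
print(float(a.sum().compute()))
```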
7 votes · 2 answers
Item Assignment Not Supported in Dask

What are the ways we can perform item assignment in Dask arrays? Even a very simple item assignment like: a[0] = 2 does not work.
Alger Remirata · 529
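Newer dask releases do accept simple NumPy-style assignment like `a[0] = 2`; where that is unavailable, the functional workaround is `da.where`, which builds a new array instead of mutating in place. A sketch of the workaround:

```python
import dask.array as da

a = da.zeros(5, chunks=2)

# Functional replacement for `a[0] = 2`: select per element with where().
idx = da.arange(5, chunks=2)
b = da.where(idx == 0, 2, a)
print(b.compute())  # [2. 0. 0. 0. 0.]
```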
7 votes · 1 answer
How to convert an xarray dataset to pandas dataframes inside a dask dataframe

I have a calculation that expects a pandas dataframe as input. I'd like to run this calculation on data stored in a netCDF file that expands to 51GB - currently I've been opening the file with xarray.open_dataset and using chunks (my understanding…
7 votes · 1 answer
What are the scaling limits of Dask.distributed?

Are there any anecdotal cases of Dask.distributed deployments with hundreds of worker nodes? Is distributed meant to scale to a cluster of this size?
bcollins · 3,379
7 votes · 1 answer
How to map a column with dask

I want to apply a mapping on a DataFrame column. With Pandas this is straight forward: df["infos"] = df2["numbers"].map(lambda nr: custom_map(nr, hashmap)) This writes the infos column, based on the custom_map function, and uses the rows in numbers…
wishi · 7,188