Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
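A minimal sketch of the two components in action (the function names and sizes here are illustrative, not from any question below):

```python
import dask
import dask.array as da

# Component 1 - dynamic task scheduling: build a graph of delayed calls,
# then let the scheduler execute them (potentially in parallel).
@dask.delayed
def double(x):
    return 2 * x

total = dask.delayed(sum)([double(i) for i in range(5)])
print(total.compute())  # 20

# Component 2 - a "Big Data" collection: a chunked, NumPy-like array
# whose chunks can be larger than memory in aggregate.
x = da.ones((1000, 1000), chunks=(250, 250))
print(x.sum().compute())  # 1000000.0
```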

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask (tutorial: https://github.com/dask/dask-tutorial)

Main Page: https://dask.org/

4440 questions
7 votes · 1 answer
Python PANDAS: Converting from pandas/numpy to dask dataframe/array

I am working to try to convert a program to be parallelizable/multithreaded with the excellent dask library. Here is the program I am working on converting: Python PANDAS: Stack by Enumerated Date to Create Records Vectorized import pandas as…
Pylander · 1,531
7 votes · 3 answers
Semaphores in dask.distributed?

I have a dask cluster with n workers and want the workers to do queries to the database. But the database is only capable of handling m queries in parallel where m < n. How can I model that in dask.distributed? Only m workers should work on such a…
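dask.distributed ships a cluster-wide `Semaphore` that models exactly this m-of-n constraint; a sketch on an in-process cluster (the name "database" and the `query` body are placeholders for the real database work):

```python
from dask.distributed import Client, Semaphore

# In-process cluster standing in for the real n-worker deployment.
client = Client(processes=False, n_workers=1, threads_per_worker=4)

# At most m = 2 leases cluster-wide, regardless of the number of workers.
sem = Semaphore(max_leases=2, name="database")

def query(i, sem):
    with sem:  # blocks until one of the m leases is free
        return i * i  # placeholder for the actual database call

futures = client.map(query, range(8), sem=sem)
results = client.gather(futures)
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
client.close()
```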
7 votes · 2 answers
What is the default directory where dask workers store results or files?

[mapr@impetus-i0057 latest_code_deepak]$ dask-worker 172.26.32.37:8786 distributed.nanny - INFO - Start Nanny at: 'tcp://172.26.32.36:50930' distributed.diskutils - WARNING - Found stale lock file and directory…
TheCodeCache · 820
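The answer hinges on the worker's `local_directory`, which derives from the `temporary-directory` config value (falling back to the system temp directory) and can be pinned with `dask-worker --local-directory`. A sketch probing it on an in-process cluster:

```python
import tempfile
import dask
from dask.distributed import Client

# Unless overridden, workers place spilled data under a worker-space
# directory derived from the "temporary-directory" config value
# (None means: fall back to the system temp directory).
print(dask.config.get("temporary-directory"))

# Pinning it explicitly; the CLI equivalent is `dask-worker --local-directory`.
scratch = tempfile.mkdtemp()
client = Client(processes=False, n_workers=1, local_directory=scratch)
worker = list(client.cluster.workers.values())[0]
print(worker.local_directory)  # a per-worker directory under `scratch`
client.close()
```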
7 votes · 1 answer
Dask: create strictly increasing index

As is well documented, Dask creates a strictly increasing index on a per partition basis when reset_index is called, resulting in duplicate indices over the whole set. What is the best way (e.g. computationally quickest) to create a strictly…
morganics · 1,209
7 votes · 1 answer
Is there a way to get the nlargest items per group in dask?

I have the following dataset: location category percent A 5 100.0 B 3 100.0 C 2 50.0 4 13.0 D 2 75.0 3 59.0 4 …
whisperstream · 1,897
7 votes · 1 answer
What is the equivalent to iloc for dask dataframe?

I have a situation where I need to index a dask dataframe by location. I see that there is not an .iloc method available. Is there an alternative? Or am I required to use label-based indexing? For example, I would like to import dask.dataframe…
Tim Morton · 240
7 votes · 1 answer
Why do pandas and dask perform better when importing from CSV compared to HDF5?

I am working with a system that currently operates with large (>5GB) .csv files. To increase performance, I am testing (A) different methods to create dataframes from disk (pandas VS dask) as well as (B) different ways to store results to disk (.csv…
sudonym · 3,788
7 votes · 2 answers
Groupby.transform doesn't work in dask dataframe

I'm using the following dask.dataframe AID: AID FID ANumOfF 0 1 X 1 1 1 Y 5 2 2 Z 6 3 2 A 1 4 2 X 11 5 2 B 18 I know in a pandas dataframe I could…
BKS · 2,227
7 votes · 1 answer
Dask.dataframe: out of memory when merging and groupby

I am new to Dask and having some troubles with it. I am using a machine ( 4GB RAM, 2 cores) to analyse two csv files ( key.csv: ~2 million rows about 300Mb, sig.csv: ~12 million row about 600Mb). With these data, pandas can't fit in the memory, so I…
7 votes · 2 answers
Row by row processing of a Dask DataFrame

I need to process a large file and to change some values. I would like to do something like that: for index, row in dataFrame.iterrows(): foo = doSomeStuffWith(row) lol = doOtherStuffWith(row) dataFrame['colx'][index] =…
Caerbanog · 71
7 votes · 1 answer
Non-deterministic results with dask

I'm getting non-deterministic results for some matrix computations with dask. I narrowed it down to this simple example: import numpy as np import dask.array as da seed = 1234 np.random.seed(seed) N = 1000 p = 10 X = np.random.random((N, p + 1)) X…
Thrasibule · 319
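Two things are commonly behind this: floating-point reductions whose summation order depends on chunking, and the fact that `np.random.seed` does not seed dask's generator. A sketch of reproducible dask randomness via `RandomState`:

```python
import dask.array as da

# np.random.seed does not seed dask; give dask its own seeded state.
a = da.random.RandomState(1234).random_sample((1000,), chunks=100)
b = da.random.RandomState(1234).random_sample((1000,), chunks=100)
print(bool((a.compute() == b.compute()).all()))  # True

# Note: reductions are only bitwise-reproducible for a fixed chunking;
# changing `chunks` changes the summation order, which can shift the
# result by a few ulps even with identical input data.
print(float(a.sum().compute()))
```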
7 votes · 2 answers
Item Assignment Not Supported in Dask

What are the ways we can perform item assignment in Dask arrays? Even a very simple item assignment like: a[0] = 2 does not work.
Alger Remirata · 529
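Newer dask releases do accept simple NumPy-style assignment like `a[0] = 2`; where that is unavailable, the functional workaround is `da.where`, which builds a new array instead of mutating in place. A sketch of the workaround:

```python
import dask.array as da

a = da.zeros(5, chunks=2)

# Functional replacement for `a[0] = 2`: select per element with where().
idx = da.arange(5, chunks=2)
b = da.where(idx == 0, 2, a)
print(b.compute())  # [2. 0. 0. 0. 0.]
```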
7 votes · 1 answer
How to convert an xarray dataset to pandas dataframes inside a dask dataframe

I have a calculation that expects a pandas dataframe as input. I'd like to run this calculation on data stored in a netCDF file that expands to 51GB - currently I've been opening the file with xarray.open_dataset and using chunks (my understanding…
7 votes · 1 answer
What are the scaling limits of Dask.distributed?

Are there any anecdotal cases of Dask.distributed deployments with hundreds of worker nodes? Is distributed meant to scale to a cluster of this size?
bcollins · 3,379
7 votes · 1 answer
How to map a column with dask

I want to apply a mapping on a DataFrame column. With Pandas this is straight forward: df["infos"] = df2["numbers"].map(lambda nr: custom_map(nr, hashmap)) This writes the infos column, based on the custom_map function, and uses the rows in numbers…
wishi · 7,188