Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3
votes
0 answers

Dask memory overload

I have the following code, which gives me the value counts of the categorical variables and the number of NaN values in the column. It's running on a single computer. The data is from the Kaggle Elo Merchant competition. merchant_headers =…
3
votes
2 answers

How can I efficiently transpose a 67 GB file/Dask DataFrame without loading it entirely into memory?

I have 3 rather large files (67 GB, 36 GB, 30 GB) that I need to train models on. However, the features are rows and the samples are columns. Since Dask hasn't implemented transpose and stores DataFrames split by row, I need to write something to do…
Joe B
  • 912
  • 2
  • 15
  • 36
3
votes
1 answer

How to serialize custom classes as structs using pyarrow in dask dataframes?

I have a Dask DataFrame that has a column of type List[MyClass]. I want to save this DataFrame to Parquet files. Dask uses pyarrow as the backend, but it supports only primitive types. import pandas as pd import dask.dataframe as dd class…
cheap_grayhat
  • 400
  • 4
  • 7
3
votes
2 answers

Dask Dataframe View Entire Row

I want to see the entire row of a Dask DataFrame without the fields being cut off. In pandas the command is pd.set_option('display.max_colwidth', -1); is there an equivalent for Dask? I was not able to find anything.
Maria Nazari
  • 660
  • 1
  • 9
  • 27
3
votes
0 answers

Worker crashes during simple aggregation

I am trying to aggregate various columns on a 450 million row data set. Dask's built-in aggregations like 'min', 'max', 'std', and 'mean' keep crashing a worker in the process. The file that I am using can be found here:…
DannyK
  • 103
  • 2
  • 10
3
votes
1 answer

Python Dask .visualize() does not show the full graph

My Dask .visualize() does not show the graph properly. The code was taken from the 01_dask.delayed.ipynb notebook at http://github.com/dask/dask-tutorial/. I installed graphviz using pip and apt. Even though the graph is displayed, it is not fully shown.…
3
votes
0 answers

How to perform a Diff on a Dask SeriesGroup object

I have a multi-index Dask DataFrame on which I need to perform a groupby followed by a diff. This operation is trivial in pure pandas via the following command: df.groupby('IndexName')['ValueName'].diff(). Dask, however, doesn't implement the…
IAS_LLC
  • 135
  • 11
3
votes
1 answer

Mask dataframe column based on datetime index

Very similar to this question except I need to consider both date and time; indexer_between_time does not appear to support any datetime formats I can find. I have a dask dataframe that looks like this: logger_volt lat …
ZachP
  • 631
  • 6
  • 11
3
votes
1 answer

Dask lazy initialization very slow for list comprehension

I'm trying to see if Dask would be a suitable addition to my project and wrote some very simple test cases to look into its performance. However, Dask is taking a relatively long time to simply perform the lazy initialization. @delayed def…
ltt
  • 417
  • 3
  • 12
3
votes
1 answer

Scheduler closing stream warning

I have a periodic batch job running on my laptop. The code looks like this: client = Client() print(client.scheduler_info()) topic='raw_data' start = datetime.datetime.now() delta = datetime.timedelta(minutes=2) while True: end = start + delta …
Apostolos
  • 7,763
  • 17
  • 80
  • 150
3
votes
0 answers

How to ignore dtypes when reading a CSV file using Dask DataFrame

I have a large CSV file; it has 9600 columns and each column has a different type. When I read the file using a Dask DataFrame and call head(), I get the error Mismatched dtypes found in pd.read_csv/pd.read_table. How can I ignore it? I use pandas…
3
votes
1 answer

Slow Dask performance compared to native sklearn

I'm new to using Dask but have experienced painfully slow performance when attempting to re-write native sklearn functions in Dask. I've simplified the use-case as much as possible in hope of getting some help. Using standard sklearn/numpy/pandas…
Sykomaniac
  • 175
  • 3
  • 13
3
votes
0 answers

Dask, Tensorflow serving (and Kubernetes and Streamz)

What is the current 'state of technology' for a pipeline composed of Python code and TensorFlow/Keras models? We are trying to achieve scalability and reactive design using Dask and Streamz (for servers registered using Kubernetes). But…
Holi
  • 384
  • 2
  • 15
3
votes
1 answer

Override dask scheduler_port

I've tried several ports without success: 8787 is indeed busy serving rstudio. I could redirect rstudio, but shouldn't the following work? from distributed import Client, LocalCluster cluster = LocalCluster( scheduler_port = 8785 , n_workers = 2…
user2105469
  • 1,413
  • 3
  • 20
  • 37
3
votes
1 answer

SQL-style explode on Dask Series or DataFrame column

I have a Dask Series that contains lists of values. I want to perform a SQL-style explode to create a new row for each index value and corresponding list element. For this particular problem, the lists are all of the same…
marshackVB
  • 43
  • 1
  • 5