Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3
votes
0 answers

Dask memory overload

I have the following code, which gives me the value counts of the categorical variables and the number of NaN values in the column. It's running on a single computer. The data is from the Kaggle Elo Merchant competition. merchant_headers =…
3
votes
2 answers

How can I efficiently transpose a 67 GB file/Dask DataFrame without loading it entirely into memory?

I have 3 rather large files (67 GB, 36 GB, 30 GB) that I need to train models on. However, the features are rows and the samples are columns. Since Dask hasn't implemented transpose and stores DataFrames split by row, I need to write something to do…
Joe B
  • 912
  • 2
  • 15
  • 36
3
votes
1 answer

How to serialize custom classes as structs using pyarrow in dask dataframes?

I have a Dask DataFrame that has a column of type List[MyClass]. I want to save this DataFrame to Parquet files. Dask uses pyarrow as the backend, but it supports only primitive types. import pandas as pd import dask.dataframe as dd class…
cheap_grayhat
  • 400
  • 4
  • 7
3
votes
2 answers

Dask Dataframe View Entire Row

I want to see the entire row of a Dask DataFrame without the fields being cut off. In pandas the command is pd.set_option('display.max_colwidth', -1); is there an equivalent for Dask? I was not able to find anything.
Maria Nazari
  • 660
  • 1
  • 9
  • 27
3
votes
0 answers

Worker crashes during simple aggregation

I am trying to aggregate various columns on a 450 million row data set. Dask's built-in aggregations like 'min', 'max', 'std', and 'mean' keep crashing a worker in the process. The file that I am using can be found here:…
DannyK
  • 103
  • 2
  • 10
3
votes
1 answer

Python Dask .visualize() does not show the full graph

My Dask .visualize() does not show the graph properly. The code was taken from the 01_dask.delayed.ipynb notebook at http://github.com/dask/dask-tutorial/. I installed graphviz using pip and apt. Even though the graph is displayed, it is not fully shown.…
3
votes
0 answers

How to perform a Diff on a Dask SeriesGroup object

I have a multi-index Dask DataFrame on which I need to perform a groupby followed by a diff. This operation is trivial in pure pandas via the following command: df.groupby('IndexName')['ValueName'].diff(). Dask, however, doesn't implement the…
IAS_LLC
  • 135
  • 11
3
votes
1 answer

Mask dataframe column based on datetime index

Very similar to this question except I need to consider both date and time; indexer_between_time does not appear to support any datetime formats I can find. I have a dask dataframe that looks like this: logger_volt lat …
ZachP
  • 631
  • 6
  • 11
3
votes
1 answer

Dask lazy initialization very slow for list comprehension

I'm trying to see if Dask would be a suitable addition to my project and wrote some very simple test cases to look into its performance. However, Dask is taking a relatively long time to simply perform the lazy initialization. @delayed def…
ltt
  • 417
  • 3
  • 12
3
votes
1 answer

Scheduler closing stream warning

I have a periodic batch job running on my laptop. The code looks like this: client = Client() print(client.scheduler_info()) topic='raw_data' start = datetime.datetime.now() delta = datetime.timedelta(minutes=2) while True: end = start + delta …
Apostolos
  • 7,763
  • 17
  • 80
  • 150
3
votes
0 answers

How to ignore dtypes when reading a CSV file using Dask DataFrame

I have a large CSV file; it has 9600 columns and each column has a different type. When I read the file using a Dask DataFrame and call head(), I get the error Mismatched dtypes found in pd.read_csv/pd.read_table. How can I ignore it? I use pandas…
3
votes
1 answer

Slow Dask performance compared to native sklearn

I'm new to using Dask but have experienced painfully slow performance when attempting to re-write native sklearn functions in Dask. I've simplified the use-case as much as possible in hope of getting some help. Using standard sklearn/numpy/pandas…
Sykomaniac
  • 175
  • 3
  • 13
3
votes
0 answers

Dask, Tensorflow serving (and Kubernetes and Streamz)

What is the current 'state of technology' for a pipeline composed of Python code and TensorFlow/Keras models? We are trying to achieve scalability and reactive design using Dask and Streamz (for servers registered using Kubernetes). But…
Holi
  • 384
  • 2
  • 15
3
votes
1 answer

Override dask scheduler_port

I've tried several ports without success: 8787 is indeed busy serving rstudio. I could redirect rstudio, but shouldn't the following work? from distributed import Client, LocalCluster cluster = LocalCluster( scheduler_port = 8785 , n_workers = 2…
user2105469
  • 1,413
  • 3
  • 20
  • 37
3
votes
1 answer

SQL-style explode on Dask Series or DataFrame column

I have a Dask Series that contains lists of values. I want to perform a SQL-style explode to create a new row for each index value and corresponding list element. For this particular problem, the lists are all of the same…
marshackVB
  • 43
  • 1
  • 5