Questions tagged [dask-delayed]

Dask.Delayed refers to the Python interface built around the delayed function, which wraps a function or object to create lazy Delayed proxies. Use this tag for questions about that interface.
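
As a rough illustration of what that description means in practice, the sketch below wraps two ordinary functions with delayed, builds a small task graph from the resulting proxies, and runs it only when compute() is called. The function names square and total are made up for this example, and it assumes dask is installed and the default local scheduler is used.

```python
from dask import delayed
from dask.delayed import Delayed


@delayed
def square(x):
    # Hypothetical example function; calling it records a task instead of running it.
    return x * x


@delayed
def total(values):
    # Receives the already-computed squares as a plain list when the graph runs.
    return sum(values)


# Each call returns a Delayed proxy; no work has happened yet.
lazy_squares = [square(i) for i in range(5)]
lazy_total = total(lazy_squares)
is_proxy = isinstance(lazy_total, Delayed)

# compute() executes the whole graph and returns a concrete value.
result = lazy_total.compute()
print(is_proxy, result)
```

Passing the list of proxies straight into total keeps the whole computation in one graph, so a single compute() call is enough.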

290 questions
0 votes, 1 answer

Can you use xarray inside of dask delayed functions?

I want to use dask.delayed in my work. Some of the functions I want to apply it to will work with xarray-based data. I know xarray itself can use dask under the hood. For dask itself, it is recommended you don't pass dask arrays to delayed…
Caleb • 3,839 • 7 • 26 • 35
0 votes, 1 answer

Delayed Decorator in Dask Library - results are counterproductive

I am trying to learn how to use the dask library and followed the tutorial at https://www.machinelearningplus.com/python/dask-tutorial/ Code with the dask delayed decorator: import time import dask from dask import delayed @delayed def square(x): return…
madmatrix • 205 • 1 • 4 • 12
0 votes, 0 answers

Applying Tensorflow TextVectorization and StringIndexers to Dask Partitions in Parallel

I am trying to build a pipeline to parallelize writing tfrecords files on datasets that are too large to fit into memory. I have successfully used dask to do this many times in the past, but I have a new dataset requiring that TextVectorization and…
scribbles • 4,089 • 7 • 22 • 29
0 votes, 1 answer

Read custom binary file in parallel with dask

I currently read a custom binary file into a dask.bag using a generator and dask.delayed: @dask.delayed def get_entry_from_binary(file_name, chunk_size=8+4+4): with open(file_name, "rb") as f: while (entry := f.read(chunk_size)): …
BBG • 73 • 1 • 9
0 votes, 0 answers

Why do I fail to export a large dask df to a txt file?

I need to export a dask df with 10 000 000 rows and 11 columns to a .txt file. This is my code: csv_files = glob.glob("xxx_*.csv") used_cols = ["word", "word_freq", "doc_freq", "advis_word_freq", "advis_doc_freq", "story_word_freq",…
0 votes, 2 answers

Create new column in Dask DataFrame with specific value for each partition

I have two Dask DataFrames with the same number of partitions. The first one has few columns and few rows in each partition (so each partition is a pandas DataFrame), but the number of rows can differ between partitions (the columns do not). The second Dask…
0 votes, 1 answer

How to access nested data in Dask Bag while using dask mongo

Below is the sample data - ({'age': 61, 'name': ['Emiko', 'Oliver'], 'occupation': 'Medical Student', 'telephone': '166.814.5565', 'address': {'address': '645 Drumm Line', 'city': 'Kennewick'}, 'credit-card': {'number': '3792 459318…
gauravpks • 15 • 2
0 votes, 2 answers

Dask Distributed: Limit Dask distributed worker to 1 CPU

My system has 4 CPUs and 16 GB RAM. My aim is to deploy dask distributed workers that each use only 1 CPU to run the code assigned to them. I am deploying a scheduler container and worker containers using docker to run code that uses Dask delayed and dask…
SMI • 71 • 1 • 11
0 votes, 0 answers

Serializable object not serializable in dask

I am calling dask.delayed on the following function, for multiple "self" (different objects of the same class) in a loop. This is the delayed function, defined inside a custom-defined subclass of keras.engine.training.Model: def fit(self, X:…
Marx • 13 • 3
0 votes, 1 answer

How to limit # of parallel tasks on kuberay when computing a dask array?

I'm using the delayed decorator to create a lazy data cube: @dask.delayed def _delayed_func(x: int, y: int, z: int) -> np.array: return np.random.randn(1000,1000,1000) I use this function to create lazy data cubes: def create_cube(size): …
Romeo Kienzler • 3,373 • 3 • 36 • 58
0 votes, 0 answers

Python parallel processing requests.post() calls with dask errors

I'm using Python 3.8.5 on an Ubuntu machine. I have a function getData that makes an API call with requests.post for a specific string value and returns a pandas DataFrame converted to a list. When I iterate in serial over a list calendarCols_All…
Dr. Andrew • 2,532 • 3 • 26 • 42
0 votes, 0 answers

How to create a column from a delayed function in Dask

Is it possible to create a column from a delayed function in Dask? E.g., if we create a column in pyspark with df.withColumn('datetime', F.lit(datetime.now())), the value of this column is not calculated until we request it. My question is - can we do similar…
Hawii Hawii • 47 • 1 • 5
0 votes, 2 answers

Dask delayed data mismatch

I wish to combine many dataframes into 1 dataframe with dask. However, when I try to read those dataframes with dd.from_delayed(parts, meta=types), I get the error Metadata mismatch found in 'from_delayed'. The full error: Metadata mismatch found in…
Sam • 338 • 1 • 4 • 17
0 votes, 1 answer

Use Dask Dataframe On delayed function

I have three sources and a Dask Dataframe for each of them. I need to apply a function that computes an operation combining data from the three sources. The operation requires a state to be calculated (I can't change that). The three sources…
0 votes, 1 answer

dask distributed code is slower than corresponding serial execution

I have this dask example of a standalone Python script that runs on my desktop, which has 4 CPU nodes. It currently takes 0.735 seconds. The goal is to use separate processes on my Linux machine to overcome the limitations of the GIL etc. import numpy as…
gansub • 1,164 • 3 • 20 • 47