Questions tagged [dask-delayed]

Dask.Delayed refers to the Python interface built around the delayed function, which wraps a function or object to create lazy Delayed proxies. Use this tag for questions related to this interface.

290 questions
3 votes, 1 answer

File Not Found Error in Dask program run on cluster

I have 4 machines: M1, M2, M3, and M4. The scheduler, client, and one worker run on M1; the rest of the machines are workers. I've put a CSV file on M1. When I run the program, dask's read_csv gives me a file-not-found error
Dhruv Kumar
3 votes, 1 answer

Can we create a Dask cluster with both multiple CPU machines and multiple GPU machines?

Can we create a dask cluster with some CPU and some GPU machines together? If yes, how do we control that a certain task runs only on a CPU machine, some other type of task runs only on a GPU machine, and, if not specified, it picks whichever…
TheCodeCache
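Yes, via abstract worker "resources" in dask.distributed. A hedged sketch: the --resources flags in the comments are how workers would advertise a GPU, and dask.annotate pins tasks to them; the local scheduler used here to keep the sketch runnable simply ignores the annotation:

```python
import dask
from dask import delayed

def cpu_task(x):
    return x + 1

def gpu_task(x):
    return x * 2

# With dask.distributed, workers advertise abstract resources at startup:
#   dask worker scheduler:8786 --resources "GPU=1"   # on GPU machines
#   dask worker scheduler:8786                        # on CPU-only machines
# Tasks annotated with a resource only run on workers that advertise it;
# unannotated tasks may run anywhere.
with dask.annotate(resources={"GPU": 1}):
    g = delayed(gpu_task)(10)

c = delayed(cpu_task)(g)

# Annotations constrain placement only on a distributed cluster; the
# default local scheduler ignores them, so this sketch runs anywhere.
print(c.compute())  # 21
```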
3 votes, 1 answer

Using dask delayed with functions returning lists

I am trying to use dask.delayed to build up a task graph. This mostly works quite nicely, but I regularly run into situations like this, where I have a number of delayed objects that have a method returning a list of objects of a length that is not…
tt293
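Two common workarounds for delayed results of unknown length, sketched: nout when the length is known up front, and a two-stage compute when it is not:

```python
import dask
from dask import delayed

def make_items(n):
    return list(range(n))

# A Delayed wrapping a list cannot be iterated lazily when its length is
# unknown at graph-construction time.

# 1) If the length IS known up front, nout makes the result unpackable:
a, b, c = delayed(make_items, nout=3)(3)
print(dask.compute(a, b, c))  # (0, 1, 2)

# 2) Otherwise, compute the list in a first pass, then build the rest of
# the graph from the concrete result (a two-stage pattern):
items = delayed(make_items)(5).compute()
squares = [delayed(lambda x: x * x)(i) for i in items]
print(dask.compute(*squares))  # (0, 1, 4, 9, 16)
```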
3 votes, 1 answer

Reading LAZ to Dask dataframe using delayed loading

Action: reading multiple LAZ point cloud files into a Dask DataFrame. Problem: unzipping LAZ (compressed) to LAS (uncompressed) requires a lot of memory, and the varying file sizes and multiple processes created by Dask result in MemoryErrors. Attempts: I tried…
Tom Hemmes
2 votes, 2 answers

Dask dataframe parallel task

I want to create features (additional columns) from a dataframe, and I have the following structure for many functions. Following this documentation https://docs.dask.org/en/stable/delayed-best-practices.html I have come up with the code…
J.Ewa
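The pattern from that best-practices page, reduced to a toy example: wrap each feature function in delayed and evaluate them in a single dask.compute call so they run in parallel:

```python
import pandas as pd
import dask

df = pd.DataFrame({"a": [1, 2, 3]})

# Each feature function becomes one task; a single dask.compute call
# evaluates them together instead of one .compute() per feature.
@dask.delayed
def feat_double(d):
    return d["a"] * 2

@dask.delayed
def feat_square(d):
    return d["a"] ** 2

double, square = dask.compute(feat_double(df), feat_square(df))
out = df.assign(double=double, square=square)
print(out["double"].tolist())  # [2, 4, 6]
```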
2 votes, 0 answers

How can I make my code run in parallel with dask?

First import some packages: import numpy as np from dask import delayed Suppose I have two NumPy arrays: a1 = np.ones(5000000) a2 = np.ones(8000000) I would like to show the sum and length of the two arrays, and the functions are shown as: def…
Liang Ce
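A minimal sketch of that setup: wrap the functions (not the arrays) in delayed and run all four tasks through one dask.compute:

```python
import numpy as np
import dask
from dask import delayed

a1 = np.ones(5_000)  # smaller stand-ins for the 5M/8M arrays
a2 = np.ones(8_000)

# Wrapping np.sum and len makes each call one task; a single
# dask.compute runs the four tasks in parallel.
dsum = delayed(np.sum)
dlen = delayed(len)

results = dask.compute(dsum(a1), dlen(a1), dsum(a2), dlen(a2))
print(results)  # (5000.0, 5000, 8000.0, 8000)
```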
2 votes, 1 answer

What is the best way to lag a value in a Dask Dataframe?

I have a Dask Dataframe called data which is extremely large, cannot fit into main memory, and, importantly, is not sorted. The dataframe is unique on the following key: [strike, expiration, type, time]. What I need to accomplish in Dask is the…
2 votes, 1 answer

Can Dask automatically create a tree to parallelize a computation and reduce the copies between workers?

I've structured this in two sections, BACKGROUND and QUESTION. The Question is all the way at the bottom. BACKGROUND: Suppose I want to (using Dask distributed) do an embarrassingly parallel computation like summing 16 gigantic dataframes. I know…
user5406764
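Dask will not restructure a flat sum(...) into a tree on its own, but a manual tree reduction over delayed objects is short. A sketch with integers standing in for the 16 gigantic dataframes:

```python
from dask import delayed

# Manual tree reduction: pair up partial sums so no single task receives
# all 16 inputs at once, which also spreads the work across workers.
def tree_sum(items):
    items = list(items)
    while len(items) > 1:
        pairs = [delayed(lambda a, b: a + b)(items[i], items[i + 1])
                 for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:          # an odd element carries to the next round
            pairs.append(items[-1])
        items = pairs
    return items[0]

parts = list(range(16))             # stand-ins for the 16 dataframes
total = tree_sum(parts)
print(total.compute())  # 120
```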
2 votes, 1 answer

Get PARTITION_ID in Dask for Data Frame

Is it possible to get the partition_id in dask after splitting pandas DFs? For example: import dask.dataframe as dd import pandas as pd df = pd.DataFrame(np.random.randn(10,2), columns=["A","B"]) df_parts = dd.from_pandas(df, npartitions=2) part1 =…
data_person
2 votes, 1 answer

cluster.adapt() kills workers before moving their in-memory data to others

I am using Dask with Slurm cluster: cluster = SLURMCluster(cores=64, processes=64, memory="128G", walltime="24:00:00") #export DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES=100 cluster.adapt(minimum_jobs=1, maximum_jobs=2, interval="20 s",…
2 votes, 1 answer

Dask hanging when called from command prompt

I have a program that runs as expected in a Jupyter Notebook cell, but fails or hangs when put into a Python file and called from either a Jupyter Notebook or the command line. Here is the test code: import pandas as pd …
mpLoNsTa
2 votes, 1 answer

How to load my train.tfrecord files in saturn cloud for running via Dask?

I am working on object detection and I have two record files: Train.tfrecord (1.6 GB) and Test.tfrecord (65 MB). How do I load the training file in Saturn Cloud, as I want to speed up the training using Dask in Saturn Cloud?
uNIKx
2 votes, 1 answer

KilledWorker error in dask when doing embarrassingly parallel data concatenation

I have an embarrassingly parallel workload where I am reading a group of parquet files, concatenating them into bigger parquet files, and then writing them back to disk. I am running this on a distributed cluster (with a distributed filesystem)…
2 votes, 1 answer

Why does dask.delayed take longer than serial code when working with networkx?

I would like to speed up the execution of a function my_func() using parallel computation with dask.delayed. In a loop over 3 dimensions, my_func() extracts a value from an iris.cube.Cube (which is essentially a dask.array loaded from a file outside…
2 votes, 1 answer

Dask high memory usage when computing two values with common dependency

I am using Dask on a single machine (LocalCluster with 4 processes, 16 threads, 68.56GB memory) and am running into worker memory problems when trying to compute two results at once which share a dependency. In the example shown below, computing…
user73445
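A sketch of the shared-dependency behaviour: one dask.compute call evaluates both outputs from a single run of the common task, while separate .compute() calls would repeat it (memory pressure then depends on how long that shared result must stay live):

```python
import dask
from dask import delayed

calls = []

@delayed
def load():
    calls.append(1)        # count how many times the shared step runs
    return list(range(100))

@delayed
def total(data):
    return sum(data)

@delayed
def count(data):
    return len(data)

data = load()
a, b = total(data), count(data)

# One dask.compute call shares the `load` dependency between both
# outputs; a.compute() followed by b.compute() would run `load` twice.
print(dask.compute(a, b))  # (4950, 100)
print(len(calls))  # 1
```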