Questions tagged [dask-delayed]

Dask.Delayed refers to the Python interface built around the delayed function, which wraps a function or object to create Delayed proxies. Use this tag for questions related to that interface.
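
As a minimal sketch of what the tag description means by wrapping a function to create Delayed proxies: the helper functions `inc` and `add` below are hypothetical, made up purely for illustration. The example assumes `dask` is installed and uses the default single-machine scheduler.

```python
from dask import delayed


def inc(x):
    # Hypothetical helper: add one to a number.
    return x + 1


def add(a, b):
    # Hypothetical helper: sum two numbers.
    return a + b


# Calling a delayed-wrapped function does not run it; it returns a
# Delayed proxy that records the call in a task graph.
x = delayed(inc)(1)
y = delayed(inc)(2)
total = delayed(add)(x, y)

# Nothing has executed so far. compute() runs the graph and
# returns the concrete result.
result = total.compute()
print(result)
```

Because `x` and `y` do not depend on each other, the scheduler is free to run them in parallel before `add` combines them.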

290 questions
0
votes
1 answer

Communicate progress of work inside a Dask delayed task back to Client thread

I would like to use a Dask delayed task to call an external program, which outputs its progress to STDOUT. In the delayed task, I plan to monitor STDOUT and would like to update the Client process that is waiting for the delayed task with progress…
ipetrik
  • 1,749
  • 18
  • 28
0
votes
1 answer

fastest way to load 1.5 million images into a dask cluster

I'm trying to persist 1.5 million images to a dask cluster as a dask array, and then get some summary stats. I'm following an image processing tutorial from @mrocklin's blog and have edited my script to be a minimally reproducible example: import…
skeller88
  • 4,276
  • 1
  • 32
  • 34
0
votes
1 answer

How to create a custom Dask worker with imports

I'm setting up Dask, and I can use Dask for multiprocessing just fine. I run into issues, however, when I want to use pre-configured Dask workers: they don't have the same imports as my main process. How do I add custom…
Kivo360
  • 781
  • 3
  • 9
  • 12
0
votes
1 answer

How to compute CSV files stored on each worker in parallel without using HDFS?

The concept is the same as data locality on Hadoop, but I don't want to use HDFS. I have 3 dask-workers. I want to compute on a big CSV file, for example mydata.csv. I split mydata.csv into small files (mydata_part_001.csv ... mydata_part_100.csv) and store in…
0
votes
1 answer

Using Dask Delayed on Small/Partitioned Dataframes

I am working with time series data formatted so that each row is a single instance of an ID/time/data. This means that the rows don't correspond 1 to 1 for each ID. Each ID has many rows across time. I am trying to use dask delayed to have a…
0
votes
1 answer

Why is threaded dask example executing in parallel

For teaching purposes, I'm trying to create simple examples using dask delayed that highlight the GIL when using threads and not processes. I'm using the single-machine scheduler for now to keep things simple. My understanding was that switching…
jrinker
  • 2,010
  • 2
  • 14
  • 17
0
votes
1 answer

Computing dask delayed objects stored in dataframe

I am looking for the best way to compute many dask delayed objects stored in a dataframe. I am unsure if the pandas dataframe should be converted to a dask dataframe with delayed objects within, or if the compute call should be called on all values…
skurp
  • 389
  • 3
  • 13
0
votes
1 answer

dask.delayed and import statements

I'm learning dask and I want to generate random strings. But this only works if the import statements are inside the function f. This works: import dask from dask.distributed import Client, progress c = Client(host='scheduler') def f(): from…
offwhitelotus
  • 1,049
  • 9
  • 15
0
votes
0 answers

How to use dask to write a huge list of column data into columns of an Excel file?

I need a way to write a list containing specific column data to Excel, but I'm getting a memory error. How can I use dask to complete this task? My system has only 8 GB of RAM. I'm creating an Excel file out of a huge .dat file (containing text just like…
Ravi Teja
  • 1
  • 1
0
votes
1 answer

How do the batching instructions of Dask delayed best practices work?

I guess I'm missing something (still a Dask noob), but I'm trying the batching suggestion for avoiding too many Dask tasks from here: https://docs.dask.org/en/latest/delayed-best-practices.html and can't make it work. This is what I tried: import…
K.-Michael Aye
  • 5,465
  • 6
  • 44
  • 56
0
votes
1 answer

Using Not Yet Implemented Pandas Functions in Dask

I believe I saw a recommendation in one of the Dask tutorials on how to use Pandas functions that are not yet implemented in the Dask framework when working with Dask dataframes, but I seem to have misplaced where I saw that. For example, I would…
dan
  • 183
  • 13
0
votes
1 answer

Dask: what function variable is best to choose for visualize()

I am trying to understand Dask delayed more deeply, so I decided to work through the examples here. I modified some of the code to reflect how I want to use Dask (see below). But the results are different from what I expected, i.e. a tuple vs. a list. …
MikeB2019x
  • 823
  • 8
  • 23
0
votes
0 answers

Right way to use dask for efficient conditional pairwise row operations in a DataFrame

I have the following sequential code: c = [] for ind, a in df.iterrows(): for ind, b in df.iterrows(): if a.hit_id < b.hit_id : c.append(dist(a, b)) c = numpy.array(c) But the number of rows in the dataframe is close to 10^6.…
0
votes
1 answer

Dask visualise multiple output nodes in a Dask graph

The Dask graph that I'm creating has multiple outputs. I was wondering if it's possible to visualise multiple Dask outputs at the same time. When I try using dask.visualize(graph), where graph is a tuple or dictionary of Dask nodes, it appears to…
CMCDragonkai
  • 6,222
  • 12
  • 56
  • 98
0
votes
1 answer

How to parallelize a nested loop with dask.distributed?

I am trying to parallelize a nested loop using dask.distributed that looks like this: @dask.delayed def delayed_a(e): a = do_something_with(e) return something @dask.delayed def delayed_b(element): computations = [] for e in element: …