Questions tagged [dask-delayed]

Dask.Delayed refers to the Python interface consisting of the delayed function, which wraps a function or object to create Delayed proxies. Use this tag for questions related to this interface.

290 questions
2 votes · 0 answers

Large Dask Processes Fail When Creating and Storing DataFrame

I have a number of image files that I'm running a face recognition model on, in order to generate a Dask Dataframe of facial encodings, the file paths for the images that contain each face, and the coordinates in the image of each face. Because I…
asked by DataOrc (769)
2 votes · 1 answer

Do dask delayed functions use the same conda environment?

I've installed dask using conda. When I create delayed functions and run them over my PBS cluster using dask, how do I ensure that the worker nodes activate the same conda environment before running the delayed functions?
2 votes · 2 answers

Dask: How to use delayed functions with worker resources?

I want to make a Dask Delayed flow which includes CPU and GPU tasks. GPU tasks can only run on GPU workers, and a GPU worker only has one GPU and can only handle one GPU task at a time. Unfortunately, I see no way to specify worker resources in the…
asked by braddock (1,345)
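Worker resources do work with delayed graphs: declare an abstract resource when starting each worker (e.g. `dask-worker scheduler:8786 --resources "GPU=1"`), then constrain the computation when submitting it. A minimal runnable sketch on a local cluster, where `"GPU"` is just a label and no real GPU is assumed:

```python
import dask
from dask.distributed import Client, LocalCluster

@dask.delayed
def cpu_task(x):
    return x + 1

@dask.delayed
def gpu_task(x):
    return x * 10

# One in-process worker advertising a single abstract "GPU" resource, so it
# holds at most one GPU-constrained task at a time.
cluster = LocalCluster(n_workers=1, threads_per_worker=2,
                       processes=False, resources={"GPU": 1})
client = Client(cluster)

flow = gpu_task(cpu_task(1))
# A plain dict constrains every task in this computation; per-task
# constraints can instead be attached with dask.annotate(resources={...}).
future = client.compute(flow, resources={"GPU": 1})
result = future.result()  # (1 + 1) * 10 = 20
client.close(); cluster.close()
```

Because the worker advertises only one unit of the `"GPU"` resource, the scheduler will never run two GPU-constrained tasks on it concurrently, which is exactly the one-task-per-GPU behaviour the question asks for.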
2 votes · 1 answer

Adding a new column to dask dataframe throws ValueError: Length of values does not match length of index

I understand that this traceback, ValueError: Length of values does not match length of index, arises from the fact that one dataframe is longer or shorter than the other during ddf.assign(new_col=ts_col or the same operation in…
asked by gies0r (4,723)
2 votes · 1 answer

In Dask, is there a way to process dependencies as they become available, as in multiprocessing.imap_unordered?

I have a simple graph structure that takes N independent tasks and then aggregates them. I do not care in what order the results of the independent tasks are aggregated. Is there a way that I can speed up computation by acting on the dependencies as…
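Yes: with the distributed scheduler, `distributed.as_completed` yields futures in the order they finish, which plays the same role as `multiprocessing.imap_unordered`. A runnable sketch that aggregates results as soon as each one is ready:

```python
from dask.distributed import Client, LocalCluster, as_completed

def work(x):
    return x * x

cluster = LocalCluster(n_workers=2, processes=False)
client = Client(cluster)

futures = client.map(work, range(5))
total = 0
# as_completed yields futures in completion order, so aggregation starts as
# soon as the first task finishes rather than after all N are done.
for fut in as_completed(futures):
    total += fut.result()
# total == 0 + 1 + 4 + 9 + 16 == 30
client.close(); cluster.close()
```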
2 votes · 1 answer

How to write a dask dataframe to a single CSV in AWS S3 using dask delayed, to make it faster?

Currently I am using the code below, but it is taking too much time. I am converting the dask dataframe to a buffer and using multipart upload to push it to S3: def multi_part_upload_with_s3(file_buffer_obj,BUCKET_NAME,key_path): client =…
2 votes · 1 answer

Is it possible to read parquet metadata from Dask?

I have thousands of parquet files that I need to process. Before processing the files, I'm trying to get various information about the files using the parquet metadata, such as number of rows in each partition, mins, maxs, etc. I tried reading…
asked by dan (183)
2 votes · 1 answer

How to reduce the time taken to convert a dask dataframe to a pandas dataframe

I have a function that reads large CSV files using a dask dataframe and then converts to a pandas dataframe, which takes quite a lot of time. The code is: def t_createdd(Path): dataframe = dd.read_csv(Path, sep = chr(1), encoding = "utf-16") return…
asked by K.S (113)
2 votes · 1 answer

Dask distributed apparently not releasing memory on task completion

I'm trying to execute a custom dask graph on a distributed system, but it does not seem to release the memory of finished tasks. Am I doing something wrong? I've tried changing the number of processes and using a local cluster, but it…
2 votes · 1 answer

MODIS(MYD06_L2) file concatenation using xarray and dask

I am trying to open multiple MODIS files (MYD06_L2) using xarray (xr.open_mfdataset). I can open a single file, or maybe a few files, but I am not able to open many files or one day's files, as they have different dimensions. d06 = xr.open_mfdataset(M06_2040,…
2 votes · 0 answers

Generating batches of images in dask

I just started with dask because it offers great parallel processing power. I have around 40000 images on disk which I am going to use to build a classifier with some DL library, say Keras or TF. I collected this meta-info (image path and…
asked by enterML (2,110)
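One common pattern is to group the file paths into fixed-size batches and wrap the loader for each batch in delayed. The loader below is a hypothetical stub that returns a dummy array so the sketch runs without any image files; a real version would read and resize each image with e.g. PIL or a Keras utility:

```python
import numpy as np
import dask
from dask import delayed

# Hypothetical stub loader: a real one would open and decode the image file.
def load_image(path):
    return np.zeros((32, 32, 3), dtype="uint8")

@delayed
def load_batch(paths):
    # Stack one batch of images into a single (batch, H, W, C) array.
    return np.stack([load_image(p) for p in paths])

paths = [f"img_{i}.jpg" for i in range(10)]  # placeholder file names
batch_size = 4
batches = [load_batch(paths[i:i + batch_size])
           for i in range(0, len(paths), batch_size)]

arrays = dask.compute(*batches)  # materialize all batches in parallel
# shapes: [(4, 32, 32, 3), (4, 32, 32, 3), (2, 32, 32, 3)]
```

Each resulting array can then be fed to the training loop; delaying whole batches rather than single images keeps the task graph small, which matters at 40000 files.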
2 votes · 1 answer

Controlling number of cores/threads in dask

I have a workstation with these specifications: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 46 bits physical, 48 bits virtual CPU(s): 16 On-line CPU(s) list:…
asked by muammar (951)
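For the local schedulers, the worker count can be passed straight to `.compute(num_workers=...)`; with the distributed scheduler the equivalent knobs are `LocalCluster(n_workers=..., threads_per_worker=...)`. A sketch on the threaded scheduler:

```python
import dask.bag as db

nums = db.from_sequence(range(100), npartitions=8)

# Cap this computation's thread pool at 4 workers regardless of how many
# cores the machine reports.
total = nums.map(lambda x: x * x).sum().compute(scheduler="threads",
                                                num_workers=4)
# total == sum of squares 0..99 == 328350
```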
2 votes · 1 answer

Reading large CSV files using delayed (DASK)

I'm using delayed to read many large CSV files: import pandas as pd def function_1(x1, x2): df_d1 = pd.read_csv(x1) # Some calculations on df_d1 using x2. return df_d1 def function_2(x3): df_d2 = pd.read_csv(x3) …
asked by Eghbal (3,892)
2 votes · 1 answer

Merging a huge list of dataframes using dask delayed

I have a function which returns a dataframe to me. I am trying to use this function in parallel by using dask. I append the delayed objects of the dataframes into a list. However, the run-time of my code is the same with and without dask.delayed. I…
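A common reason delayed shows no speedup in this situation: calling `.compute()` on each list element executes the tasks one by one. Build the whole list lazily and make a single `dask.compute` call; with delayed pandas objects the final merge is then plain `pd.concat`. Sketch:

```python
import pandas as pd
import dask
from dask import delayed

@delayed
def make_df(i):
    # Stand-in for the real dataframe-producing function.
    return pd.DataFrame({"a": [i, i]})

parts = [make_df(i) for i in range(4)]

# One dask.compute call runs the whole list in parallel; looping and calling
# part.compute() on each element would execute them serially instead.
dfs = dask.compute(*parts)
merged = pd.concat(dfs, ignore_index=True)
# merged.shape == (8, 1)
```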
2 votes · 1 answer

Dask dataframe from delayed zip csv

I am trying to create a dask dataframe from a set of zipped CSV files. Reading up on the problem, it seems that dask needs to use dask.distributed delayed() import glob import dask.dataframe as dd import zipfile import pandas as pd from…
asked by user3237314 (21)