
Is it possible to create a column from a delayed function in Dask?

E.g., if we create a column in PySpark with `df.withColumn('datetime', F.lit(datetime.now()))`, the value of this column is not calculated until we request it.

My question is: can we do a similar thing in Dask? As far as I know, Dask is also lazy by default, but there seems to be no way to achieve the same result as in PySpark.

    Rather than asking us to translate spark, can you just describe more specifically what you’re trying to do in dask? Also, have you seen the function `dask.dataframe.from_delayed`? – Michael Delgado Feb 28 '23 at 17:11
  • I have a dataframe and want to add a column that stores the timestamp of the compute time. If I just set the column to `datetime.now()`, the value can differ significantly from the desired one when the computation time is lengthy. A workaround is to set the column after compute; I'm just wondering whether there is another way to do this. – Hawii Hawii Feb 28 '23 at 17:57
  • Huh. You want an entire column to have the same timestamp over and over? Seems like a good use case for a standalone variable… but yeah you could use `from_delayed` to do this. Alternatively if you want to have the time vary by partition you could map a function which assigns the column using `df.map_partitions` – Michael Delgado Feb 28 '23 at 19:06
  • 1
    If you want implementation help please edit the question to remove the spark references and clarify your goals and then set up a sample problem using code as a [mre]. Thanks! – Michael Delgado Feb 28 '23 at 19:08

0 Answers