
Let's suppose we have a dataframe with millions of records that we are processing using PySpark. At some point we add a column with lit(datetime.datetime.now()). In general, the preceding action completes at different times on different compute nodes, so if the column is created this way, records on different nodes should have different values for that column, differing by at least a few milliseconds. But when I did this, all the records had the same value for that column.

Could someone help me understand this behavior?

  • how is the `current_timestamp` variable created? it is used with `lit()`, so it is taken as a static literal value – samkart Jul 19 '22 at 12:34
  • @samkart the execution of lit(datetime.datetime.now()) takes place at different points in time on different compute nodes, so there should be a small difference in timestamp between records on different compute nodes. But every record shows the same timestamp in my case. What might be the reason? Correct me if I am wrong in understanding the concept. – Krishna Pavan Ayitha Jul 20 '22 at 10:17
  • lit(datetime.datetime.now()) is not executed per node. are you saying you got different timestamps when you tried? – samkart Jul 20 '22 at 10:20
  • @samkart No, I got the same value in every record, but I was expecting a difference in the value for records on different nodes. The process goes like this: Function1 => Function2 => lit(datetime.datetime.now()) => action is triggered. There will be skewness in the data across compute nodes, so the processing time for the first two functions will differ, and execution reaches lit(datetime.datetime.now()) at time t1 on compute node 1 and at time t2 on compute node 2. So the datetime.datetime.now() value would be different on different nodes: all rows within a node should have the same value, but differ from other nodes. – Krishna Pavan Ayitha Jul 21 '22 at 05:26
  • `datetime.datetime.now()` is executed at runtime and the *literal* value is posted to the nodes (as a static input). – samkart Jul 21 '22 at 05:33
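To make the comment thread above concrete, here is a minimal sketch of what effectively happens (assuming an existing dataframe df; the column name created_at is just for illustration):

import datetime
from pyspark.sql.functions import lit

# datetime.datetime.now() runs on the driver, exactly once, at the
# moment this Python line executes -- never on the executors
ts = datetime.datetime.now()

# lit() only ever sees the resulting constant; every executor is handed
# the same value, so every row gets an identical timestamp
df = df.withColumn('created_at', lit(ts))  # hypothetical column name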

2 Answers


Based on your interesting comments, here's a test that uses the same function twice: once as a UDF (which the executors run) and once as a literal value (evaluated on the driver when the query is built; not by the executors).

import datetime

from pyspark.sql import functions as func
from pyspark.sql.types import TimestampType

def curr_ts():
    return datetime.datetime.now()

# wrap the same function as a UDF -- executors will call it per record
curr_ts_udf = func.udf(curr_ts, TimestampType())

data_sdf. \
    withColumn('ts_pydttm', func.lit(datetime.datetime.now())). \
    withColumn('ts_pyudf', curr_ts_udf()). \
    show(10, truncate=False)

# +----+----------+----+--------------------------+--------------------------+
# |col1|dt        |col3|ts_pydttm                 |ts_pyudf                  |
# +----+----------+----+--------------------------+--------------------------+
# |1   |2022-01-01|1   |2022-07-21 05:59:56.281733|2022-07-21 05:59:56.462249|
# |1   |2022-01-02|-1  |2022-07-21 05:59:56.281733|2022-07-21 05:59:56.462347|
# |1   |2022-01-03|1   |2022-07-21 05:59:56.281733|2022-07-21 05:59:56.462395|
# |1   |2022-01-04|0   |2022-07-21 05:59:56.281733|2022-07-21 05:59:56.462406|
# |1   |2022-01-05|1   |2022-07-21 05:59:56.281733|2022-07-21 05:59:56.462415|
# |1   |2022-01-06|1   |2022-07-21 05:59:56.281733|2022-07-21 05:59:56.462425|
# |1   |2022-01-07|0   |2022-07-21 05:59:56.281733|2022-07-21 05:59:56.462434|
# |1   |2022-01-08|1   |2022-07-21 05:59:56.281733|2022-07-21 05:59:56.462443|
# |1   |2022-01-09|1   |2022-07-21 05:59:56.281733|2022-07-21 05:59:56.462453|
# |1   |2022-01-10|-1  |2022-07-21 05:59:56.281733|2022-07-21 05:59:56.462462|
# +----+----------+----+--------------------------+--------------------------+
# only showing top 10 rows

# root
#  |-- col1: long (nullable = true)
#  |-- dt: date (nullable = true)
#  |-- col3: long (nullable = true)
#  |-- ts_pydttm: timestamp (nullable = false)
#  |-- ts_pyudf: timestamp (nullable = true)

Look at the values of ts_pydttm: they're all the same, because the current timestamp was passed in as a static (literal) value. Now look at the values of ts_pyudf: they differ for every record, because the executors run the UDF (which fetches the current timestamp) once per record. Both produce a timestamp field, but they are handled very differently.
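To see where the literal ends up, here's a minimal sketch (assuming the same data_sdf as above) that prints the physical plan; the timestamp, already resolved on the driver, appears in the plan as a plain constant:

import datetime
from pyspark.sql import functions as func

df = data_sdf.withColumn('ts_pydttm', func.lit(datetime.datetime.now()))

# the resolved value is embedded in the plan itself, e.g.
# Project [..., 2022-07-21 05:59:56.281733 AS ts_pydttm#12]
df.explain()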

samkart

As samkart said, you should use the lit function to add the current timestamp as a new column to the dataframe; that way the column will hold the current time on the driver. Example:

from pyspark.sql.functions import lit
import datetime

current_time = datetime.datetime.now()

print(current_time)
# 2022-07-19 13:40:06.425681

df = spark.sql("select 1")
df = df.withColumn('current_ts', lit(current_time))

df.printSchema()

# root
#  |-- 1: integer (nullable = false)
#  |-- current_ts: timestamp (nullable = false)

df.show(1, False)
# +---+--------------------------+                                                
# |1  |current_ts                |
# +---+--------------------------+
# |1  |2022-07-19 13:40:06.425681|
# +---+--------------------------+

# Query the data (if X is your filtering column)
df_results = df.where("X >= current_ts")

# remove the current timestamp column
df_results = df_results.drop('current_ts')
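For comparison, a minimal sketch using Spark's built-in current_timestamp() function: per the Spark documentation it is evaluated once at the start of query evaluation, and all calls within the same query return the same value, so it also yields an identical timestamp on every row rather than per-node differences:

from pyspark.sql.functions import current_timestamp

# current_timestamp() is evaluated once per query, not per row or per
# node, so every row in the result shows the same timestamp
df = spark.sql("select 1").withColumn('spark_ts', current_timestamp())
df.show(truncate=False)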
Guy Melul
  • What if we directly use lit(datetime.datetime.now()) to add a column? This piece of code will execute at a different time on each compute node, so lit(datetime.datetime.now()) should produce different values on different compute nodes (the difference might be milliseconds). – Krishna Pavan Ayitha Jul 20 '22 at 10:12