I have a streaming source, say adClick, with a timestamp column named "adClick_time", and I want to calculate the difference between that timestamp and the current time. Since it is a streaming source, the current time should keep advancing as data arrives, but most of the approaches I have tried seem to fix the current time when the query is built (at compile/plan time, or something like that).
If possible, can you suggest a way to achieve this? These are the approaches I have tried:
SparkSession spark = SparkSession.builder()
        .config("spark.sql.session.timeZone", "UTC")
        .config("spark.sql.shuffle.partitions", 5)
        .master("local")
        .appName("streamstreamJoinTest")
        .getOrCreate();
spark.sparkContext().setLogLevel("WARN");

SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S");

// reading test input data
Dataset<Row> adClickDF = readStream(spark, "adClick_flow", "adClickFlowTest");

// per-row difference (in seconds) between the event time and now
Dataset<Long> delay = adClickDF.map(
        (MapFunction<Row, Long>) x ->
                sdf.parse(x.getAs("adClick_flow_time").toString()).getTime() / 1000
                        - System.currentTimeMillis() / 1000,
        Encoders.LONG());
But the main problem here is: how can I attach this result to adClickDF as a column named "delay"?
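What I am imagining is something along these lines: registering a UDF so the current time is read per row at execution time rather than baked into the plan. This is an untested sketch; delay_seconds is just a name I made up, and adClick_flow_time is assumed to arrive as a timestamp column:

```java
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class DelayUdfSketch {
    static Dataset<Row> withDelay(SparkSession spark, Dataset<Row> adClickDF) {
        // The UDF body runs per row on the executors, so currentTimeMillis()
        // is sampled at processing time, not at plan-construction time.
        spark.udf().register("delay_seconds",
                (UDF1<java.sql.Timestamp, Long>) ts ->
                        System.currentTimeMillis() / 1000 - ts.getTime() / 1000,
                DataTypes.LongType);

        // Attach the result directly as a "delay" column on the same Dataset.
        return adClickDF.withColumn("delay",
                callUDF("delay_seconds", col("adClick_flow_time")));
    }
}
```

Would something like this be the right direction, or is there a cleaner built-in way?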
So I tried another approach:
adClickDF
        .withColumn("time_epoch", functions.unix_timestamp(functions.col("adClick_flow_time")))
        .withColumn("delay", functions.expr(System.currentTimeMillis() / 1000 + " - time_epoch"));
But here the delay came out negative, so I suspect the current time is being fixed at the moment the expression string is built, rather than re-evaluated as each batch arrives.
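I also wondered whether functions.current_timestamp() would behave differently, since (if I understand correctly) it is evaluated when each micro-batch runs rather than when the query is constructed. An untested sketch of what I mean, again assuming adClick_flow_time is a timestamp column:

```java
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class CurrentTimestampSketch {
    static Dataset<Row> withDelay(Dataset<Row> adClickDF) {
        // current_timestamp() stays symbolic in the plan, so each micro-batch
        // should see its own value instead of one frozen at build time.
        return adClickDF.withColumn("delay",
                unix_timestamp(current_timestamp())
                        .minus(unix_timestamp(col("adClick_flow_time"))));
    }
}
```

Is this the expected semantics of current_timestamp() in a streaming query, or does it also get pinned somewhere?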
PS: the schema varies continuously across sources, so I need a solution that does not require passing the schema manually.