
I am looking for a way to add the Kafka timestamp to my Spark Structured Streaming schema. I have extracted the value field from Kafka and am building a DataFrame from it. My issue is that I need to get the timestamp field (from Kafka) along with the other columns.

Here is my current code:

val kafkaDatademostr = spark
  .readStream 
  .format("kafka")
  .option("kafka.bootstrap.servers","zzzz.xxx.xxx.xxx.com:9002")
  .option("subscribe","csvstream")
  .load

val interval = kafkaDatademostr
  .select(col("value").cast("string"))
  .alias("csv")
  .select("csv.*")

val xmlData = interval.selectExpr(
    "split(value,',')[0] as ddd",
    "split(value,',')[1] as DFW",
    "split(value,',')[2] as DTG",
    "split(value,',')[3] as CDF",
    "split(value,',')[4] as DFO",
    "split(value,',')[5] as SAD",
    "split(value,',')[6] as DER",
    "split(value,',')[7] as time_for",
    "split(value,',')[8] as fort")

How can I get the timestamp from Kafka and add it as a column along with the other columns?


3 Answers


The timestamp is included in the Kafka source schema. Just add timestamp to your select to get it, like below.

val interval = kafkaDatademostr
  .select(col("value").cast("string"), col("timestamp"))
  .alias("csv")
  .select("csv.*")  // expands to the cast value column plus timestamp
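
With the timestamp carried through, the downstream split from the question can keep it alongside the parsed CSV fields. A minimal sketch, reusing the question's own column names:

val xmlData = interval.selectExpr(
    "split(value,',')[0] as ddd",
    "split(value,',')[1] as DFW",
    // ... remaining CSV fields exactly as in the question ...
    "split(value,',')[8] as fort",
    "timestamp")  // the Kafka message timestamp, now a regular column
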
Joe Widen

On the official Apache Spark web page you can find the guide: Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher).

There you can find information about the schema of the DataFrame that is loaded from Kafka.

Each row from the Kafka source has the following columns:

  • key - message key
  • value - message value
  • topic - name of the topic the message came from
  • partition - partition the message came from
  • offset - offset of the message
  • timestamp - timestamp of the message
  • timestampType - timestamp type

All of the above columns are available to query. In your example you use only value, so to get the timestamp you just need to add timestamp to your select statement:

  val allFields = kafkaDatademostr.selectExpr(
    "CAST(value AS STRING) AS csv",
    "CAST(key AS STRING) AS key",
    "topic",
    "partition",
    "offset",
    "timestamp",
    "timestampType"
  )
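
You can also confirm the source schema yourself before any projection. A quick check against the stream from the question:

kafkaDatademostr.printSchema()
// root
//  |-- key: binary (nullable = true)
//  |-- value: binary (nullable = true)
//  |-- topic: string (nullable = true)
//  |-- partition: integer (nullable = true)
//  |-- offset: long (nullable = true)
//  |-- timestamp: timestamp (nullable = true)
//  |-- timestampType: integer (nullable = true)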
Bartosz Wardziński
  • What if I want to add timestamp to Kafka from Spark instead of the other way around? I didn't find any such guide in the link provided here. Is there any configuration I should check in Kafka or Spark? – Cyber Knight Mar 12 '20 at 06:44

In my case, I was receiving the values from Kafka in JSON format, containing the actual data along with the original event time, not the Kafka timestamp. Below is the schema.

import org.apache.spark.sql.types.{DoubleType, LongType, StructField, StructType}

val mySchema = StructType(Array(
  StructField("time", LongType),
  StructField("close", DoubleType)
))

In order to use the watermarking feature of Spark Structured Streaming, I had to cast the time field to the timestamp type.

import org.apache.spark.sql.functions._
import spark.implicits._

val df1 = df.selectExpr("CAST(value AS STRING)").as[String]
  .select(from_json($"value", mySchema).as("data"))
  .select(col("data.time").cast("timestamp").alias("time"), col("data.close"))

Now you can use the time field for window operations as well as for watermarking.

val windowedData = df1
  .withWatermark("time", "1 minute")
  .groupBy(
      window(col("time"), "1 minute", "30 seconds"),
      $"close"
  )
  .count()
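
To see the aggregation running end to end, the windowed stream needs to be written to a sink. A minimal sketch, assuming a console sink for local testing:

val query = windowedData.writeStream
  .outputMode("update")          // watermarked aggregations allow "update" or "append" mode
  .format("console")
  .option("truncate", "false")
  .start()

query.awaitTermination()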

I hope this answer clarifies things.

vijayraj34