I'm facing an issue with my Glue script that reads events from Kafka. Currently, I'm using Spark Structured Streaming and the script reads events starting from the earliest offset. However, I would like to modify it to read events based on a specific timestamp.

I tried using the startingOffsets option with a timestamp value, but it seems that Spark Structured Streaming does not directly support this feature for Kafka as a data source.

Is there a workaround or alternative approach to achieve timestamp-based reading from Kafka using Glue and Spark Structured Streaming? How can I modify my script to accomplish this?

Here is a simplified version of my Glue script:

import sys
import boto3
import traceback
import json
import pyspark
from pyspark import SparkContext,SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import StructType,StructField, StringType, IntegerType,BooleanType,DoubleType

sc = SparkContext()
sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider", )
hadoopConf.set("com.amazonaws.services.s3a.enableV4", "true")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark = SparkSession(sc).builder.getOrCreate()

try:
    options = {
      "kafka.sasl.jaas.config": 'org.apache.kafka.common.security.plain.PlainLoginModule required username="USERNAME" password="PASSWORD";',
      "kafka.sasl.mechanism": "PLAIN",
      "kafka.security.protocol": "SASL_SSL",
      "kafka.bootstrap.servers": "kafka_server",
      "subscribe": "my_topic_name",
        "startingOffsets":"earliest"
    }

    df = spark.readStream.format("kafka").options(**options).load()
    
    df=df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    df.writeStream.format("json") \
      .option("checkpointLocation", "s3://s3://mybucket/test/")\
      .outputMode("append") \
      .option("path",  "s3://mybucket/test/") \
      .start() \
      .awaitTermination()
      
except Exception as e:
  print(e)

Version with timestamp

It doesn't work; the job stops running without retrieving anything:

import sys
import boto3
import traceback
import json
import pyspark
from pyspark import SparkContext,SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import StructType,StructField, StringType, IntegerType,BooleanType,DoubleType

sc = SparkContext()
sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider", )
hadoopConf.set("com.amazonaws.services.s3a.enableV4", "true")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark = SparkSession(sc).builder.getOrCreate()

try:
    options = {
             "kafka.sasl.jaas.config": 'org.apache.kafka.common.security.plain.PlainLoginModule required username="USERNAME" password="PASSWORD";',
      "kafka.sasl.mechanism": "PLAIN",
      "kafka.security.protocol": "SASL_SSL",
      "kafka.bootstrap.servers": "lkc-xg1ox-lqjjp.eu-west-3.aws.glb.confluent.cloud:9092",
      "subscribe": "dev_cop_out_customeragreement_event_outstanding_ini",
      "startingOffsets": "timestamp",  # Change to read from a specific timestamp
        "startingTimestamp": "2023-06-20T00:00:00Z"  # Specify the desired starting timestamp
    }

    df = spark.readStream.format("kafka").options(**options).load()
    
    df=df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    df.writeStream.format("json") \
      .option("checkpointLocation", "s3://mybucket/test/")\
      .outputMode("append") \
      .option("path",  "s3://mybucket/test/") \
      .start() \
      .awaitTermination()
      
except Exception as e:
  print(e)
Smaillns

1 Answer


The Spark documentation shows replacing the startingOffsets value with a dictionary of partitions and offsets, not timestamps. You can build such a data structure with kafka-python's offsets_for_times function.
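
For example, something along these lines could build that structure. This is a rough sketch, not tested on Glue: it assumes kafka-python is available to the job, reuses the SASL/PLAIN and topic placeholders from your first script, and the epoch-millisecond timestamp is just an illustrative value.

from kafka import KafkaConsumer, TopicPartition
import json

topic = "my_topic_name"
timestamp_ms = 1687219200000  # 2023-06-20T00:00:00Z as epoch milliseconds (example value)

consumer = KafkaConsumer(
    bootstrap_servers="kafka_server",
    security_protocol="SASL_SSL",
    sasl_mechanism="PLAIN",
    sasl_plain_username="USERNAME",
    sasl_plain_password="PASSWORD",
)

# For each partition, look up the earliest offset whose record timestamp is >= timestamp_ms
partitions = consumer.partitions_for_topic(topic)
offsets = consumer.offsets_for_times(
    {TopicPartition(topic, p): timestamp_ms for p in partitions}
)

# Build the JSON shape Spark expects, e.g. {"my_topic_name": {"0": 42, "1": 17}}
# -2 means "earliest"; used when a partition has no record at or after the timestamp
starting_offsets = {
    topic: {
        str(tp.partition): (oat.offset if oat is not None else -2)
        for tp, oat in offsets.items()
    }
}
consumer.close()

options["startingOffsets"] = json.dumps(starting_offsets)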

You could probably do the same by importing the JVM KafkaConsumer library through PySpark, but then you'd have extra logic around type conversions.

Otherwise, depending on your Spark version, there are also the startingTimestamp and startingOffsetsByTimestamp options.

Note: startingTimestamp takes precedence over startingOffsetsByTimestamp and startingOffsets.
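
If your version supports them, here is a minimal sketch of how they might be set in the options dict from your script, using the my_topic_name placeholder from your first version. As far as I can tell, both expect epoch milliseconds rather than an ISO-8601 string like "2023-06-20T00:00:00Z", and the partition ids 0 and 1 below are assumptions.

import json

timestamp_ms = 1687219200000  # 2023-06-20T00:00:00Z as epoch milliseconds (example value)

# Option A: one timestamp applied to all partitions of the subscribed topic
# (drop the "startingOffsets": "timestamp" entry; that is not a recognized value there)
options["startingTimestamp"] = str(timestamp_ms)

# Option B: per-partition timestamps (partition ids 0 and 1 assumed here)
options["startingOffsetsByTimestamp"] = json.dumps(
    {"my_topic_name": {"0": timestamp_ms, "1": timestamp_ms}}
)

df = spark.readStream.format("kafka").options(**options).load()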

OneCricketeer
  • I just updated the question; I already tried **startingTimestamp**, maybe I missed something – Smaillns Jul 10 '23 at 11:06
  • Kafka defaults to removing data after 7 days. The date in the question is before that... If that time/offset isn't found, the consumer may hang or skip to the end of the topic – OneCricketeer Jul 10 '23 at 11:28
  • that's strange! We already have data in the topic – Smaillns Jul 10 '23 at 13:11
  • Sure, but you need to tell the Kafka consumer to start from the first available offset if the one you've requested (from a timestamp, or a number) is not available – OneCricketeer Jul 10 '23 at 15:18