How do we make the Kafka stream more stable, so that it runs constantly without us having to start the run again after it fails? (So far we are thinking about using the "continuous" run mode so that a new run starts automatically even after a failure.)
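(For reference, if we do not go with the continuous run mode, this is roughly the restart loop we would have to maintain ourselves; a minimal sketch where run_with_restart and build_query are hypothetical names and the retry policy is made up:)

import time

def run_with_restart(build_query, max_retries=5, backoff_seconds=60):
    # build_query is a hypothetical callable that (re)creates and starts
    # the streaming query; the checkpoint makes each restart pick up
    # where the failed run left off.
    attempts = 0
    while attempts <= max_retries:
        query = build_query()
        try:
            query.awaitTermination()  # blocks until the query stops or fails
            break                     # clean stop: leave the loop
        except Exception as err:
            attempts += 1
            print(f"Stream failed ({err}); restarting in {backoff_seconds}s "
                  f"(attempt {attempts}/{max_retries})")
            time.sleep(backoff_seconds)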
How do we optimize the Kafka stream? There are moments where the total CPU usage of the entire server is almost maxed out, and we suspect this might have to do with the Kafka streams. We are not sure our approach to setting up the Kafka stream is correct, so if anyone knows the industry standard for this, we could definitely use some advice.
If anyone has any suggestions please do share in the comments
Edit: Below is a template of how we run our Kafka streams.
First we define the schema of the JSON payload carried in the Kafka message's value:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Schema of the JSON document carried in the Kafka message value
schema = StructType([
    StructField("test_1", StringType()),
    StructField("test_2", StringType()),
    StructField("test_3", IntegerType()),
    StructField("test_4", StringType()),
    StructField("test_5", StringType()),
])
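(So a message value like the following, which is just an illustrative payload, parses into one row with these five columns:)

{"test_1": "foo", "test_2": "bar", "test_3": 42, "test_4": "baz", "test_5": "qux"}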
Then we set up the Kafka connector in the following way:
from pyspark.sql.functions import from_json, col

df = spark.readStream.format("kafka")\
    .option("kafka.bootstrap.servers", "kafka_address")\
    .option("subscribe", "topic_0")\
    .option("startingOffsets", "earliest")\
    .option("kafka.security.protocol", "SASL_SSL")\
    .option("kafka.sasl.mechanism", "PLAIN")\
    .option("failOnDataLoss", "false")\
    .option("kafka.group.id", "gpid_th")\
    .option("kafka.sasl.jaas.config", """kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="api name" password="api key";""")\
    .load()\
    .select(from_json(col("value").cast("string"), schema).alias("value"))  # parse the binary Kafka value as JSON
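One knob we have been reading about (an assumption on our side, we have not tried it in production) is capping how many offsets each micro-batch pulls, so that replaying a backlog from "earliest" cannot saturate the CPU; maxOffsetsPerTrigger is a standard option of the Spark Kafka source, and 10000 is only an illustrative value:

# Hypothetical variant of the reader above, throttled per micro-batch;
# 10000 offsets per trigger is an illustrative cap, not a recommendation.
df_throttled = spark.readStream.format("kafka")\
    .option("kafka.bootstrap.servers", "kafka_address")\
    .option("subscribe", "topic_0")\
    .option("startingOffsets", "earliest")\
    .option("maxOffsetsPerTrigger", 10000)\
    .option("failOnDataLoss", "false")\
    .load()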
After setting up the connector as shown above and extracting the content of the JSON-formatted value of the Kafka message, we proceed with a series of data-cleaning steps and finally write the cleaned data into a Delta table with the code below:
query_wtcth = df.writeStream\
    .format("delta")\
    .option("mergeSchema", "true")\
    .option("checkpointLocation", "/checkpoint save path")\
    .option("path", "save path")\
    .outputMode("append")\
    .start()
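If it helps frame the optimization question: one change we are considering (our assumption, not something we currently run) is an explicit processing-time trigger, so micro-batches fire on a fixed interval instead of back to back; trigger(processingTime=...) is part of the standard Structured Streaming API:

# Hypothetical variant of the sink above with a fixed trigger interval;
# "1 minute" is an illustrative interval, not a recommendation.
query_throttled = df.writeStream\
    .format("delta")\
    .option("mergeSchema", "true")\
    .option("checkpointLocation", "/checkpoint save path")\
    .option("path", "save path")\
    .outputMode("append")\
    .trigger(processingTime="1 minute")\
    .start()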
Then we set the Schedule to Manual and start a run by pressing "Run now", and the stream starts running. When the overall CPU usage peaks at 99 percent, as in the first picture, there are usually only 6 of these Kafka streams running. That is why we suspect the Kafka streams might be taking up too many resources, and why we want to see if there is a way to optimize this process.
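And since we usually have 6 of these streams running at once, another pattern we have seen in the Structured Streaming docs (again an assumption on our side, not something we have validated) is to run the streams in one application under separate fair-scheduler pools, so one busy stream cannot grab all the cores; this requires spark.scheduler.mode=FAIR on the cluster:

# Hypothetical sketch: assign each stream its own fair-scheduler pool.
# The pool names are illustrative; spark.scheduler.mode must be FAIR.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool_topic_0")
query_0 = df.writeStream\
    .format("delta")\
    .option("checkpointLocation", "/checkpoint save path")\
    .option("path", "save path")\
    .outputMode("append")\
    .start()

spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool_topic_1")
# query_1 = ... start the next stream the same way in its own pool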