  1. How can we make the Kafka stream more stable, so it runs constantly without us having to restart it manually after a failure? (So far we are thinking about using the "continuous" run mode so a new run starts automatically after a failure; a sketch of the in-code restart-loop alternative follows this list.)

  2. How can we optimize the Kafka streams? There are moments when the total CPU usage of the entire server nearly maxes out, and we suspect this may be caused by the Kafka streams. We are not sure whether our setup follows the correct approach, so if anyone knows the industry-standard way to set this up, we could definitely use some advice.
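
For illustration, here is a minimal sketch of the restart-loop alternative mentioned in question 1, assuming the stream definition is wrapped in a hypothetical build_query() helper (the back-off interval is an arbitrary choice; the Databricks "continuous" job mode achieves the same thing at the job level):

import time

while True:
    query = build_query()  # hypothetical helper returning a started StreamingQuery
    try:
        # blocks until the query stops; raises StreamingQueryException on failure
        query.awaitTermination()
        break  # clean stop: exit the loop
    except Exception as exc:
        print(f"Stream failed, restarting in 30s: {exc}")
        time.sleep(30)  # back off so a persistent error does not crash-loop hot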

If anyone has any suggestions, please share them in the comments.

Edit: Below is a template of how we run our Kafka streams.

First, we define the schema for the JSON format of the Kafka message's value:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Schema of the JSON payload carried in the Kafka message value
schema = StructType([
    StructField("test_1", StringType()),
    StructField("test_2", StringType()),
    StructField("test_3", IntegerType()),
    StructField("test_4", StringType()),
    StructField("test_5", StringType()),
])
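
Not part of the original setup, but a quick way to sanity-check that the schema parses a sample payload (the field values here are made up):

from pyspark.sql.functions import from_json, col

# Hypothetical sample value, used only to verify the schema parses as expected
sample = spark.createDataFrame(
    [('{"test_1": "a", "test_2": "b", "test_3": 3, "test_4": "d", "test_5": "e"}',)],
    ["value"],
)
sample.select(from_json(col("value"), schema).alias("parsed")).show(truncate=False)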


Then we set up the Kafka connector in the following way:

from pyspark.sql.functions import from_json, col

df = spark.readStream.format("kafka")\
    .option("kafka.bootstrap.servers", "kafka_address")\
    .option("subscribe", "topic_0")\
    .option("startingOffsets", "earliest")\
    .option("kafka.security.protocol", "SASL_SSL")\
    .option("kafka.sasl.mechanism", "PLAIN")\
    .option("failOnDataLoss", "false")\
    .option("kafka.group.id", "gpid_th")\
    .option("kafka.sasl.jaas.config", """kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="api name" password="api key";""")\
    .load()\
    .select(from_json(col("value").cast("string"), schema).alias("value"))
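
A possible variant worth noting (an assumption, not what we currently run): because startingOffsets is "earliest", the first micro-batch can pull the entire backlog at once and spike CPU. The Kafka source's standard rate-limiting options cap how much each micro-batch reads; the values below are illustrative guesses, and the SASL options are omitted for brevity:

# Variant with rate limiting, so a large backlog is processed gradually
df_throttled = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka_address")
    .option("subscribe", "topic_0")
    .option("startingOffsets", "earliest")
    # cap the number of records read per micro-batch
    .option("maxOffsetsPerTrigger", 10000)
    # minimum number of Spark partitions to read with
    .option("minPartitions", 8)
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("value"))
)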

After we set up the connector as indicated above and extract the content of the JSON-formatted value of the Kafka message, we proceed with a series of data-cleaning steps and finally write the cleaned data into a Delta table with the code below:

query_wtcth = df.writeStream\
    .format("delta")\
    .option("mergeSchema", "true")\
    .option("checkpointLocation", "/checkpoint save path")\
    .option("path", "save path")\
    .outputMode("append")\
    .start()
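
Another knob we have not tried yet (again an assumption, not part of our current job): an explicit processing-time trigger paces micro-batches instead of running them back to back, trading latency for lower sustained CPU. A sketch with an illustrative interval:

query_paced = (
    df.writeStream
    .format("delta")
    .option("mergeSchema", "true")
    .option("checkpointLocation", "/checkpoint save path")
    .option("path", "save path")
    .outputMode("append")
    # run one micro-batch every 30 seconds instead of as fast as possible
    .trigger(processingTime="30 seconds")
    .start()
)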

Then we open the job's Schedule settings, set it to Manual, and start a run by pressing "Run now", and the stream starts running. When the overall CPU usage peaks at 99 percent, as in the first picture, there are usually only 6 of these Kafka streams running. That is why we suspect the Kafka streams might be taking up too many resources, and why we want to see if there is a way to optimize this process.
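
For completeness, one pattern we are considering as an alternative to six separate jobs (a sketch, under the assumption that the streams can share one cluster): start all the queries from a single application and block on the whole set, so the Spark scheduler shares the cluster's cores between them. build_query(topic) is the same hypothetical helper as above, parameterized by topic, with its own checkpoint location per query:

topics = ["topic_0"]  # plus the other five topics (names not shown in the post)
queries = [build_query(t) for t in topics]

# Block until any query terminates; combine with the restart loop above
# to resubmit a failed query without stopping the healthy ones.
spark.streams.awaitAnyTermination()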

    please provide more information on what you are doing in your code, what kind of errors you get, etc. - without this information it's hard to say something, as stream may fail for very different reasons – Alex Ott Mar 25 '23 at 12:56
  • Just edited the post and added more detail of the code that we use to run this process. If there is any other info you need please do tell me and I will see what else I can provide. – Trodenn Mar 28 '23 at 17:38
  • I don't see the picture. But imho if you have 6 streams on the same cluster, then it will be expected. – Alex Ott Apr 01 '23 at 12:04
