
I have some values stored in Azure Key Vault (AKV).

An initial Google search gave me this:

# Read the Kafka credentials from the Azure Key Vault-backed secret scope
username = dbutils.secrets.get(scope = "DATAAI-CEC", key = "dai-kafka-cec-api-key")
pwd = dbutils.secrets.get(scope = "DATAAI-CEC", key = "dai-kafka-cec-secret")

from kafka import KafkaConsumer

consumer = KafkaConsumer('TOPIC',
                         bootstrap_servers = 'SERVER:PORT',
                         enable_auto_commit = False,
                         auto_offset_reset = 'earliest',
                         consumer_timeout_ms = 2000,
                         security_protocol = 'SASL_SSL',
                         sasl_mechanism = 'PLAIN',
                         sasl_plain_username = username,
                         sasl_plain_password = pwd)

This works once when the cell in Databricks runs; after that single run it finishes, it no longer listens for Kafka messages, and the cluster shuts down after the configured idle time (in my case 30 minutes). So it doesn't solve my problem.

My next Google search turned up this Databricks blog post (Processing Data in Apache Kafka with Structured Streaming in Apache Spark 2.2):

from pyspark.sql.types import *
from pyspark.sql.functions import *

# Schema of the JSON payload on the topic
schema = StructType() \
  .add("EventHeader", StructType() \
    .add("UUID", StringType()) \
    .add("APPLICATION_ID", StringType()) \
    .add("FORMAT", StringType())) \
  .add("EmissionReportMessage", StructType() \
    .add("reportId", StringType()) \
    .add("startDate", StringType()) \
    .add("endDate", StringType()) \
    .add("unitOfMeasure", StringType()) \
    .add("reportLanguage", StringType()) \
    .add("companies", ArrayType(StructType([StructField("ccid", StringType(), True)]))))


parsed_kafka = spark \
                .readStream \
                .format("kafka") \
                .option("kafka.bootstrap.servers", "SERVER:PORT") \
                .option("subscribe", "TOPIC") \
                .option("startingOffsets", "earliest") \
                .load()\
                .select(from_json(col("value").cast("string"), schema).alias("kafka_parsed_value"))

There are some issues:

  1. Where should I put my GenID or user/pass info?
  2. When I run the display command, it runs, but it never stops and it never shows the result.


  • @Alex Ott your input is appreciated. I added the `option("kafka.sasl.jaas.config", 'org.apache.kafka.common.security.plain.PlainLoginModule required username="{}" password="{}";'.format(username, password))` part, and I added `query = parsed_kafka.writeStream.format("console").start()` followed by `query.awaitTermination()`; the query is running forever – Ali Saberi Jul 19 '22 at 21:24

1 Answer


however, after a single run it is finished, and it is not listening to Kafka messages anymore

Given that you have `enable_auto_commit = False`, it should continue to work on subsequent runs. But this isn't using Spark...
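
For reference, a plain kafka-python consumer only receives messages while it is being iterated, and with `consumer_timeout_ms = 2000` that iteration ends after two seconds without new records. A minimal consume loop (just a sketch around the consumer from the first snippet) would be:

# Sketch: iterate the KafkaConsumer from the question.
# With consumer_timeout_ms = 2000 the loop exits after 2 seconds of
# silence, so the cell finishes instead of listening indefinitely.
for message in consumer:
    # Each ConsumerRecord exposes topic, partition, offset, and value
    print(message.topic, message.partition, message.offset, message.value)

consumer.close()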

Where should I put my GenID or user/pass info

You would add SASL/SSL properties into option() parameters.

For example, for SASL with the PLAIN mechanism:

option("kafka.sasl.jaas.config", 
  'org.apache.kafka.common.security.plain.PlainLoginModule required username="{}" password="{}";'.format(username, password))
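
Putting that into the stream from the question, a fuller sketch might look like the following (the `kafka.security.protocol` and `kafka.sasl.mechanism` options mirror the settings from the first KafkaConsumer snippet; the server, topic, and secret names are the question's placeholders):

# Sketch: the readStream from the question with SASL_SSL / PLAIN added.
# Kafka client properties are passed to Spark with a "kafka." prefix.
username = dbutils.secrets.get(scope = "DATAAI-CEC", key = "dai-kafka-cec-api-key")
pwd = dbutils.secrets.get(scope = "DATAAI-CEC", key = "dai-kafka-cec-secret")

jaas_config = (
    'org.apache.kafka.common.security.plain.PlainLoginModule required '
    'username="{}" password="{}";'.format(username, pwd)
)

parsed_kafka = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "SERVER:PORT") \
    .option("subscribe", "TOPIC") \
    .option("startingOffsets", "earliest") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.sasl.jaas.config", jaas_config) \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("kafka_parsed_value"))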

See related question

it will never stop

Because you're running a streaming query (starting with `readStream`) rather than a batch read.
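
For contrast, a batch version of the same query (a sketch; `endingOffsets` is only valid for batch reads) uses `spark.read` and finishes once the available offsets have been consumed:

# Sketch: bounded batch read of the same topic; this returns instead of
# running indefinitely, because "read" consumes a fixed range of offsets.
batch_df = spark \
    .read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "SERVER:PORT") \
    .option("subscribe", "TOPIC") \
    .option("startingOffsets", "earliest") \
    .option("endingOffsets", "latest") \
    .load()

display(batch_df.selectExpr("CAST(value AS string)"))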

it will never show the result

You'll need to use `parsed_kafka.writeStream.format("console")`, for example, somewhere (assuming you want to stick with `readStream` rather than `display()` and `read`).
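
A minimal sketch of that (reusing `parsed_kafka` from the question; the console sink just prints each micro-batch of results):

# Sketch: attach a console sink so the stream actually produces output.
# awaitTermination() blocks until the query is stopped; that is the
# expected behaviour of a streaming query, not a hang.
query = parsed_kafka \
    .writeStream \
    .format("console") \
    .start()

query.awaitTermination()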

OneCricketeer
  • For `enable_auto_commit = False`, does it go in the Databricks cluster configuration? – Ali Saberi Jul 19 '22 at 18:59
  • No? That is the code from your first code block. That is a KafkaConsumer parameter. Has nothing to do with Spark/Databricks – OneCricketeer Jul 19 '22 at 19:00
  • Got you. The thing is, I want to stream. I'm expecting that when a new message arrives, I'd be able to see it, or it could trigger my functions and run some work. But it is sleeping, and the cluster will sleep after 30 mins. What is the command to fetch new data? – Ali Saberi Jul 19 '22 at 19:15
  • Spark automatically consumes new Kafka data when you use `readStream`. This starts one consumer and waits for a producer to send data. It will never stop. If you use `read`, then you need to manually trigger a new consumer batch (e.g. run the notebook again) – OneCricketeer Jul 19 '22 at 19:37