I have a PySpark application that consumes messages from a Kafka topic. The messages are serialized by org.apache.kafka.connect.json.JsonConverter, and I'm using the Confluent Kafka JDBC connector to produce them.
The issue is that when I consume the messages, the ID column comes through as some kind of encoded text, such as "ARM=", when it should be a number type.
Here is the code I have now:
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

spark = SparkSession.builder.appName("my app").getOrCreate()
sc = spark.sparkContext
sc.setLogLevel('WARN')
ssc = StreamingContext(sc, 5)

kafka_params = {
    "bootstrap.servers": "kafkahost:9092",
    "group.id": "Deserialize"
}

kafka_stream = KafkaUtils.createDirectStream(ssc, ['mytopic'], kafka_params)
kafka_stream.foreachRDD(lambda rdd: rdd.foreach(lambda x: print(x)))

ssc.start()
ssc.awaitTermination()
I am aware that createDirectStream has a valueDecoder parameter I can set; the problem is that I don't know how to use it for decoding. I also know the schema ahead of time, so I can create one if need be.
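As a starting point, this is roughly what I imagine a valueDecoder would look like (I'm assuming it receives the raw message value as bytes and should return the parsed record; the function name is mine):

```python
import json

def decode_value(raw_bytes):
    # Assumption: valueDecoder is handed the raw Kafka message value as bytes.
    # Parse the JsonConverter output into a Python dict.
    if raw_bytes is None:
        return None
    return json.loads(raw_bytes.decode("utf-8"))

# Then, presumably, something like:
# kafka_stream = KafkaUtils.createDirectStream(
#     ssc, ['mytopic'], kafka_params, valueDecoder=decode_value)
```

But this alone would still leave the ID field as the encoded string, which is the part I don't know how to handle.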
For reference, this is the JSON I get when I print each record in the RDD:
{
  "schema": {
    "type": "struct",
    "fields": [
      {
        "type": "bytes",
        "optional": false,
        "name": "org.apache.kafka.connect.data.Decimal",
        "version": 1,
        "parameters": {
          "scale": "0"
        },
        "field": "ID"
      },
      {
        "type": "string",
        "optional": true,
        "field": "COLUMN1"
      }
    ],
    "optional": false
  },
  "payload": {
    "ID": "AOo=",
    "COLUMN1": "some string"
  }
}