xml kafka message to dataframe using pyspark

Question

I have an xml message from a kafka topic and im looking to convert the incoming message to a dataframe and use it further. I have achieved the same for json messages but not with xml. if anyone could help on the same.

code with json(working alright):

from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("key", StringType()),
    StructField("value", StringType())
])

df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "topic-name") \
    .load() \
    .select(from_json(df.value.cast("string"), schema).alias("parsed_value"))

I tried to use xml parse from xmltodict module to create a udf and apply it on the field of the data frame. It works as expected.

import xmltodict
import json
from pyspark.sql import functions as sf
from pyspark.sql.types import ArrayType, StringType

def xml_to_json_udf(xml_string):
    xml_dict = xmltodict.parse(xml_string)
    json_data = json.dumps(xml_dict)
    return json_data 

xml_to_json_udf = sf.udf(parse_xml, StringType())

Later used in the dataframe like this :

df = df.withColumn('xml_column_as_xml', xml_to_json_udf(df._kafka_value))  # _kafka_value in string datatype

However one error that i face is when the message comes from a kafka producer the string value doesnt get parsed as an xml(still trying to figure it out).

Try using spark-xml library... Otherwise, you need to write your own UDF where you can use xml2dict — OneCricketeer, Jul 26 '23 at 23:50
I created an udf as you had mentioned to parse the xml before making it a json. but somehow the incoming kafka message is adding a '\' before the equals sign and due to that the xml parser is not working. if i pass it without that the xml to json works properly. i tried all sorts to escape or remove the backslash but it always fails with the "xml.parsers.expat.ExpatError: not well-formed (invalid token) " my code is like below: def xml_to_json_udf(xml_string): xml_dict = xmltodict.parse(xml_string) json_data = json.dumps(xml_dict) return json_data — Deepak, Aug 08 '23 at 11:39
Please [edit] your question to include your new issue/code. You'll need to look at your kafka producer code to know why strings would be escaped — OneCricketeer, Aug 08 '23 at 12:53
I tried it using my edited code, but still facing an issue with the expat error due to some random string causing the malformed xml issue — Deepak, Aug 08 '23 at 14:05
Well, you can use try-except within your udf to `return None` or empty string, when parsing fails, then you can filter that out later — OneCricketeer, Aug 08 '23 at 16:29
So finally, this is what i did and was able to resolve my issue. The kafka message had text containing('/n','/t' and '/"') which was not letting the xmltodict to convert them to a json. i used python string (xml_string1.replace('\\"', '"').replace('\\n','').replace('\\t','').replace('\\r','')) after which i was able to resolve the expat issue. Thanks @OneCricketeer for your valuable guidance on this — Deepak, Aug 14 '23 at 18:22

Deepak · Answer 1 · 2023-08-21T16:20:10.497

Finally i used the below code to get this done. May be it will help someone in the future. this is what i did and was able to resolve my issue. The kafka message had text containing('/n','/t' and '/"') which was not letting the xmltodict to convert them to a json. i used python string (xml_string1.replace('\"', '"').replace('\n','').replace('\t','').replace('\r','')) after which i was able to resolve the expat issue

import xmltodict
import json
from pyspark.sql import functions as sf
from pyspark.sql.types import ArrayType, StringType

#xml_string is the data coming from the dataframe
def parse_xml(xml_string):
    #creating a functional string with the incoming data
    xml_string1=f"{xml_string}"
    #Trim the string
    xml_string1=xml_string1.strip()
    #check if my input string starts and ends with double qoutes and remove them
    if xml_string1.startswith('\"') and xml_string1.endswith('\"'):
        xml_string1 = xml_string1[1:-1]
    #replace the unwanted characters
    xml_string1 = xml_string1.replace('\\"', '"').replace('\\n','').replace('\\t','').replace('\\r','')
    xml_string1=f"{xml_string1}"
    #parse the new string 
    xml_dict = xmltodict.parse(xml_string1)
    #convert to json
    json_data = json.dumps(xml_dict)
    #print(json_data,"jsondata")
    return json_data

Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). — Community, Aug 20 '23 at 08:00

xml kafka message to dataframe using pyspark

1 Answers1