I have a requirement where I need to process a column in a DataFrame that contains XML. I am trying to convert the XML column into multiple individual columns based on the tags.
My DataFrame looks like this:

```
+---------+--------------------+
|       id|             xmldata|
+---------+--------------------+
|    18284|<?xml version="1....|
|    18307|<?xml version="1....|
|    18297|<?xml version="1....|
|    18282|<?xml version="1....|
|    18304|<?xml version="1....|
+---------+--------------------+
```
The XML looks like this (sample rows as (id, xmldata) pairs):

```
(18284, "123277311<Customers test="a">100</Customers>"),
(18307, "176344<Customers test="b">200</Customers>"),
(18297, "299366<Customers test="c">300</Customers>")
```
This is what I am doing:

```scala
import com.databricks.spark.xml.XmlReader
import spark.implicits._

val xmlrdd = df.select("xmldata").map(a => a.getString(0)).rdd

val xmldf = new XmlReader()
  .xmlRdd(spark.sqlContext, xmlrdd)
  .select(
    $"details.Customers._test".as("cust_test"),
    $"details.Customers._VALUE".as("cust_val"),
    $"details.addr.line1".as("Addrl1"),
    $"details.addr.line2".as("Addrl2"),
    $"details.addr.line3".as("Addrl3")
  )

xmldf.show
```
This works, but if I try to select an element that is expected but not present in any of the XML records, the select fails because that field is not in the inferred schema. How do I handle non-mandatory XML elements? I want NULL values for elements that are not present in the XML while parsing.
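One direction I am considering (a minimal sketch, assuming a spark-xml version whose `XmlReader` exposes `withSchema` and `xmlRdd`, and using field names taken from my select above, which may need adjusting for the real XML): declare the full expected schema up front instead of letting it be inferred, so that elements missing from the data still come back as nullable columns holding null.

```scala
import com.databricks.spark.xml.XmlReader
import org.apache.spark.sql.types._

// Hypothetical schema covering every field used in the select above.
// In spark-xml, attributes get the default "_" prefix and element text goes into "_VALUE".
val detailsSchema = StructType(Seq(
  StructField("details", StructType(Seq(
    StructField("Customers", StructType(Seq(
      StructField("_test", StringType, nullable = true),
      StructField("_VALUE", StringType, nullable = true)
    )), nullable = true),
    StructField("addr", StructType(Seq(
      StructField("line1", StringType, nullable = true),
      StructField("line2", StringType, nullable = true),
      StructField("line3", StringType, nullable = true)  // optional element
    )), nullable = true)
  )), nullable = true)
))

// Because every field is declared in the schema, a record whose XML has no <line3>
// (or no <addr> at all) still produces the column, populated with null,
// instead of failing when that column is selected.
val xmldf = new XmlReader()
  .withSchema(detailsSchema)
  .xmlRdd(spark.sqlContext, xmlrdd)
  .select(
    $"details.Customers._test".as("cust_test"),
    $"details.Customers._VALUE".as("cust_val"),
    $"details.addr.line1".as("Addrl1"),
    $"details.addr.line2".as("Addrl2"),
    $"details.addr.line3".as("Addrl3")
  )
```

If this is the right approach, supplying the schema would also skip schema inference over the whole RDD, which should help on larger inputs. Is this how non-mandatory elements are supposed to be handled, or is there a better way?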