I am using spark xpath to get the attribute values from an xml string. The xpath returns an array of values from the xml tag. If there are multiple rows present in a tag with one of the rows having an attribute with null, the xpath function is ignoring that value in the returned array. What I am looking for is, if the value is not present a default string to be returned so that it will not alter the order of values in the array.
df = spark.createDataFrame([['<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns1:entities><ns1:entity ns1:type="PHYSICIANS"><ns1:entity ns1:instance="207" ns1:type="PHYSICIAN" ns1:id="P1"><ns1:attribute ns1:name="ID">2071</ns1:attribute><ns1:attribute ns1:name="NAME"></ns1:attribute></ns1:entity><ns1:entity ns1:instance="208" ns1:type="PHYSICIAN" ns1:id="P2"><ns1:attribute ns1:name="ID">2081</ns1:attribute><ns1:attribute ns1:name="NAME">Dr. James Hanover</ns1:attribute></ns1:entity></ns1:entity></ns1:entities>']], ['visitors'])
df = df.selectExpr('xpath(visitors,"./entities/entity[@type=\'PHYSICIANS\']/entity/attribute[@name=\'ID\']/text()") ID','xpath(visitors,"./entities/entity[@type=\'PHYSICIANS\']/entity/attribute[@name=\'NAME\']/text()") NAME',)
display(df)
This gives me an output as below
What I am expecting as an output is as below
Can someone please help.?