Spark xpath function to return null if no value present for an attribute

Question

I am using spark xpath to get the attribute values from an xml string. The xpath returns an array of values from the xml tag. If there are multiple rows present in a tag with one of the rows having an attribute with null, the xpath function is ignoring that value in the returned array. What I am looking for is, if the value is not present a default string to be returned so that it will not alter the order of values in the array.

df = spark.createDataFrame([['<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns1:entities><ns1:entity ns1:type="PHYSICIANS"><ns1:entity ns1:instance="207" ns1:type="PHYSICIAN" ns1:id="P1"><ns1:attribute ns1:name="ID">2071</ns1:attribute><ns1:attribute ns1:name="NAME"></ns1:attribute></ns1:entity><ns1:entity ns1:instance="208" ns1:type="PHYSICIAN" ns1:id="P2"><ns1:attribute ns1:name="ID">2081</ns1:attribute><ns1:attribute ns1:name="NAME">Dr. James Hanover</ns1:attribute></ns1:entity></ns1:entity></ns1:entities>']], ['visitors'])

df = df.selectExpr('xpath(visitors,"./entities/entity[@type=\'PHYSICIANS\']/entity/attribute[@name=\'ID\']/text()") ID','xpath(visitors,"./entities/entity[@type=\'PHYSICIANS\']/entity/attribute[@name=\'NAME\']/text()") NAME',)

display(df)

This gives me an output as below

What I am expecting as an output is as below

Can someone please help.?

score 1 · Answer 1 · answered Sep 06 '22 at 18:34

I think the xpath is not working since it is parsing and creating a list. Since the first row and name column has an empty string, it is not added to the result list. I fixed the issue by replacing the empty string with a space.

However, this answer gives a single row, with two lists. Not what we really want.

A better solution is to use the xml library from databricks with a-lot simpler xml file.

The code below creates an example physician file using your data.

#
# 1 - create data file
#

# raw data 
xml_str = """
<?xml version="1.0"?>
<physicians>
   <physician>
      <id>2071</id>
      <name></name>
   </physician>
   <physician>
      <id>2081</id>
      <name>Dr. James Hanover</name>
   </physician>
</physicians>
"""

# raw file
dbutils.fs.put("/temp/physicians.xml", xml_str, True)

The next step is to install the library to the cluster. Please see documentation.

https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/xml

Next, read the file into a data frame that has 2 rows and 2 columns.

#
# 2 - read data file
#

df2 = spark.read \
    .format("xml") \
    .option("rootTag", "physicians") \
    .option("rowTag", "physician") \
    .load("/temp/physicians.xml")

display(df2)

Spark xpath function to return null if no value present for an attribute

1 Answers1