
I have an XML file stored in Azure Data Lake which I need to read from a Synapse notebook. But when I read it using the spark-xml library, I get this error:

org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `d:col`

Sample xml looks like this:

<m:properties>
            <d:FileSystemObjectType m:type="Edm.Int32">0</d:FileSystemObjectType>
            <d:Id m:type="Edm.Int32">10</d:Id>
            <d:Modified m:type="Edm.DateTime">2021-03-25T15:35:17Z</d:Modified>
            <d:Created m:type="Edm.DateTime">2021-03-25T15:35:17Z</d:Created>
            <d:ID m:type="Edm.Int32">10</d:ID>
            <d:Title m:null="true" />
            <d:Description m:type="Edm.String">Test</d:Description>
            <d:PurposeCode m:type="Edm.Int32">1</d:PurposeCode>
</m:properties>

Notice there are tags for d:Id and d:ID, which are causing the duplicate error. I found documentation stating that even though they differ in case, they are considered duplicates: https://learn.microsoft.com/en-us/azure/databricks/kb/sql/dupe-column-in-metadata But I cannot modify the xml and have to read it as is. Is there a workaround so I can still read the xml?

Or, is there a way to read the xml without using Spark? I'm thinking of using the scala.xml.XML library to load and parse the file. But when I attempt this, I get an error:

abfss:/<container>@<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml (No such file or directory)

Code snippet below:

import scala.xml.XML
val xml = XML.loadFile("abfss://<container>@<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml")

Note: the error only displays abfss:/ with a single slash, as opposed to the path passed in the parameter, which has //

Thanks.

  • Note a full answer since I'm not familiar with Azure, but `XML.loadFile` is only going to work for local files. If you can somehow obtain an `InputStream` or `Reader` of the data at that `abfss://` URI, you could use `XML.load` instead. – Dylan Apr 20 '22 at 14:09
  • Thanks @Dylan. I found a spark configuration that sets spark to case-sensitive and it works fine now: spark.conf.set("spark.sql.caseSensitive", "true") – oneDerer Apr 21 '22 at 01:06
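Following up on the comment about XML.load: a minimal sketch of that idea, assuming the Synapse session's Hadoop configuration already carries the storage credentials so the abfss:// URI can be opened through the Hadoop FileSystem API (the path is a placeholder):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.xml.XML

// Assumption: the notebook's Hadoop configuration can already authenticate to the storage account.
val uri = new URI("abfss://<container>@<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml")
val fs = FileSystem.get(uri, spark.sparkContext.hadoopConfiguration)
val in = fs.open(new Path(uri))
try {
  // XML.load accepts an InputStream, unlike XML.loadFile, which only resolves local paths.
  val xml = XML.load(in)
  println(xml \\ "Id")
} finally {
  in.close()
}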

1 Answer


Found a way to set Spark to be case sensitive, and it can now read the XML successfully:

spark.conf.set("spark.sql.caseSensitive", "true")
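For context, a minimal sketch of how the read might look once the flag is set; the format name, rowTag, and path here are assumptions to adapt to the actual file:

// Make schema resolution case sensitive so d:Id and d:ID are kept as distinct columns.
spark.conf.set("spark.sql.caseSensitive", "true")

// Hypothetical read with spark-xml; adjust rowTag and the path to the real file.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "m:properties")
  .load("abfss://<container>@<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml")

df.printSchema()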