
I am trying to parse a simple XML file by supplying an XSD schema, using the approach described here:

https://github.com/databricks/spark-xml#xsd-support

XML is here:

<note>
  <to>aa</to>
  <from>bb</from>
  <heading>cc</heading>
  <body>dd</body>
</note>

XSD is here:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:element name="note">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="to" type="xs:string"/>
      <xs:element name="from" type="xs:string"/>
      <xs:element name="heading" type="xs:string"/>
      <xs:element name="body" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

</xs:schema>

I read this XSD and build a Spark schema from it as follows:

import com.databricks.spark.xml.util.XSDToSchema
import java.nio.file.Paths
val schemaParsed = XSDToSchema.read(Paths.get("<local_linux_path>/sample_file.xsd"))
print(schemaParsed)

The schema is parsed successfully. Next, I read the XML file like this:

val df = spark.read.format("com.databricks.spark.xml").schema(schemaParsed).load("<hdfs_path>/sample_file.xml")

After this step I can display the schema of the DataFrame with df.printSchema(), but df.show() returns no content.

Where am I going wrong?

Note: This question is exactly the same as this one: How to parse XML with XSD using spark-xml package?

I am reposting it because I am not able to comment there. Thanks in advance.

Keds

1 Answer

I believe your XSD element names are incorrect. You can use tools like this online XSD / XML validator to pick out the errors.

Do these element names match your schema?

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="note">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="to" type="xs:string"/>
      <xs:element name="from" type="xs:string"/>
      <xs:element name="heading" type="xs:string"/>
      <xs:element name="body" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
</xs:schema>

RowTag: You may also need to pass a rowTag option to Spark to define the tag that marks each row of data. In Scala, which the question uses, that is:

.option("rowTag", "note")
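
To make the rowTag suggestion concrete, here is a minimal Scala sketch that combines it with the reader call from the question. The paths are the placeholders from the question, the app name is made up for illustration, and the whole thing assumes the spark-xml package is on the classpath. Whether the columns come back populated still depends on how XSDToSchema structured the schema, so it is worth comparing the printSchema() output with the elements that sit directly under <note>.

```scala
import java.nio.file.Paths
import org.apache.spark.sql.SparkSession
import com.databricks.spark.xml.util.XSDToSchema

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
// The app name is a hypothetical label for this sketch.
val spark = SparkSession.builder().appName("xsd-rowtag-sketch").getOrCreate()

// Placeholder paths copied from the question; substitute real locations.
val schemaParsed = XSDToSchema.read(Paths.get("<local_linux_path>/sample_file.xsd"))

val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "note") // treat each <note> element as one row
  .schema(schemaParsed)
  .load("<hdfs_path>/sample_file.xml")

df.printSchema()
df.show(false) // false = do not truncate long cell values
```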

Experimental features in use: According to this documentation, the XSDToSchema utility is experimental and only works with certain schemas.
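
If XSDToSchema turns out not to handle your XSD, one possible fallback (not something the spark-xml documentation prescribes for this case) is to write the schema by hand. The sketch below assumes rowTag is set to note, so the four xs:string children become top-level columns. The name noteSchema and the app name are made up for illustration, and the path is the placeholder from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().appName("manual-schema-sketch").getOrCreate()

// Hand-written counterpart of the four xs:string children of <note>.
val noteSchema = StructType(Seq(
  StructField("to", StringType, nullable = true),
  StructField("from", StringType, nullable = true),
  StructField("heading", StringType, nullable = true),
  StructField("body", StringType, nullable = true)
))

val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "note") // each <note> element becomes one row
  .schema(noteSchema)
  .load("<hdfs_path>/sample_file.xml") // placeholder path from the question

df.show(false)
```

Comparing noteSchema.treeString with the printSchema() output of the XSD-derived schema can show whether the two differ, for example by an extra level of nesting.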

John Glenn
  • Hey, sorry, it seems there was a typo on my side. I have already checked that the names match. Let me update the question with the correct names. – Keds Jan 27 '22 at 14:22
  • @Keds - I've updated my answer with two other considerations. Hopefully they help. – John Glenn Jan 27 '22 at 14:45