I'm struggling to find documentation on how schema validation can be performed in Databricks using XSD files. I found a bunch of answers for batch pipelines that load XML from a file, but I'm obtaining mine from an Event Hubs stream source.
This is what I currently have:
import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import com.databricks.spark.xml.XmlRelation
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType
import spark.implicits._
import org.apache.spark.sql.functions._
val dfEventHub = spark
  .readStream
  .format("eventhubs")
  .options(customEventhubParameters.toMap)
  .load()
  .select($"body".cast("string"), $"properties.source")
dfEventHub.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  // Attempted reading the XSD as text, but that can't be used as a validator
  // val schema = spark.read.text("/mnt/schemas/test_schema.xsd")
  // Also tried deriving the schema like this, which seems to work, but it only writes NULLs
  // to the destination dataframe instead of giving a proper success/failure response
  // val schema = XSDToSchema.read(Paths.get("/dbfs/mnt/schemas/test_schema.xsd"))
  if (!batchDF.isEmpty) {
    val parameters = collection.mutable.Map.empty[String, String]
    var schema: StructType = null
    val rdd: RDD[String] = batchDF.select($"body").as[String].rdd
    val relation = XmlRelation(
      () => rdd,
      None,
      parameters.toMap,
      schema)(spark.sqlContext) // <= this is where I need some magic happening :-)
    spark.baseRelationToDataFrame(relation)
      .write.format("delta")
      .mode("append")
      .saveAsTable("testtingxml")
  }
}.start()
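For completeness, this is roughly what the second commented-out attempt looks like when spelled out. It's only a sketch of my own experiment: XSDToSchema just converts the XSD into a Spark StructType, so it's a structural mapping rather than real XSD validation, which I suspect is why I end up with NULLs instead of an actual pass/fail result:

import java.nio.file.Paths
import com.databricks.spark.xml.util.XSDToSchema
import com.databricks.spark.xml.functions.from_xml
import org.apache.spark.sql.DataFrame

// Convert the XSD into a Spark StructType (structure only, no XSD constraint checks)
val xsdSchema = XSDToSchema.read(Paths.get("/dbfs/mnt/schemas/test_schema.xsd"))

dfEventHub.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  if (!batchDF.isEmpty) {
    batchDF
      // Parse the raw XML body against the XSD-derived schema; records that
      // don't fit the structure seem to just come back as NULL, which matches
      // what I'm seeing in the destination table
      .withColumn("parsed", from_xml($"body", xsdSchema))
      .write.format("delta")
      .mode("append")
      .saveAsTable("testtingxml")
  }
}.start()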
Has anyone managed to work through the same scenario?
Ideally, what I'm expecting is that for each row I receive from the Event Hubs dataframe, I can obtain the XML, validate it against a specific XSD schema file, and add the successful rows to one dataframe and the failed ones to another, so I can proceed to process them accordingly.
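Something along these lines is what I'm picturing, just a rough sketch assuming the JDK's javax.xml.validation API wrapped in a UDF (the testtingxml_failed table name is made up, and the Validator is rebuilt per call here for simplicity since it isn't serializable or thread-safe; it would really need to be cached per executor):

import java.io.{File, StringReader}
import javax.xml.XMLConstants
import javax.xml.transform.stream.StreamSource
import javax.xml.validation.SchemaFactory
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf
import scala.util.control.NonFatal

// True if the XML body validates against the XSD, false otherwise
// ($-syntax below reuses spark.implicits._ from the imports above)
val validatesAgainstXsd = udf { xml: String =>
  try {
    val validator = SchemaFactory
      .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
      .newSchema(new File("/dbfs/mnt/schemas/test_schema.xsd"))
      .newValidator()
    validator.validate(new StreamSource(new StringReader(xml)))
    true
  } catch {
    case NonFatal(_) => false
  }
}

dfEventHub.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  // Persist because the batch is written to two sinks
  val flagged = batchDF.withColumn("isValid", validatesAgainstXsd($"body")).persist()

  // Valid rows go to the main table, failed ones to a separate (made-up) table
  flagged.filter($"isValid").drop("isValid")
    .write.format("delta").mode("append").saveAsTable("testtingxml")
  flagged.filter(!$"isValid").drop("isValid")
    .write.format("delta").mode("append").saveAsTable("testtingxml_failed")

  flagged.unpersist()
}.start()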
I found a similar implementation, but without the schema validation, here => Reading schema of streaming Dataframe in Spark Structured Streaming
If anyone could shed some light, that would be much appreciated.