I'm struggling to find documentation on how schema validation can be performed in Databricks using XSD files. I found a bunch of answers for batch pipelines that load XML from a file, but I'm obtaining mine from an Event Hubs stream source.
This is what I currently have:
import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import com.databricks.spark.xml.XmlRelation
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType
import spark.implicits._
import org.apache.spark.sql.functions._
val dfEventHub = spark
  .readStream
  .format("eventhubs")
  .options(customEventhubParameters.toMap)
  .load()
  .select($"body".cast("string"), $"properties.source")
dfEventHub.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  // Attempted reading the XSD as text, but that can't be used as a validator
  // val schema = spark.read.text("/mnt/schemas/test_schema.xsd")
  // Also tried deriving the schema like this, which seems to work, but it only writes NULLs
  // to the destination dataframe instead of giving a proper success/failure response
  // val schema = XSDToSchema.read(Paths.get("/dbfs/mnt/schemas/test_schema.xsd"))
  if (!batchDF.isEmpty) {
    val parameters = collection.mutable.Map.empty[String, String]
    var schema: StructType = null
    val rdd: RDD[String] = batchDF.select($"body").as[String].rdd
    val relation = XmlRelation(
      () => rdd,
      None,
      parameters.toMap,
      schema)(spark.sqlContext) // <= this is where I need some magic happening :-)
    spark.baseRelationToDataFrame(relation)
      .write.format("delta")
      .mode("append")
      .saveAsTable("testtingxml")
  }
}.start()
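For completeness, this is roughly what the second commented-out attempt looks like when spelled out. It's only a sketch of my own experiment: XSDToSchema just converts the XSD into a Spark StructType, so it's a structural mapping rather than real XSD validation, which I suspect is why I end up with NULLs instead of an actual pass/fail result:

import java.nio.file.Paths
import com.databricks.spark.xml.util.XSDToSchema
import com.databricks.spark.xml.functions.from_xml
import org.apache.spark.sql.DataFrame

// Convert the XSD into a Spark StructType (structure only, no XSD constraint checks)
val xsdSchema = XSDToSchema.read(Paths.get("/dbfs/mnt/schemas/test_schema.xsd"))

dfEventHub.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  if (!batchDF.isEmpty) {
    batchDF
      // Parse the raw XML body against the XSD-derived schema; records that
      // don't fit the structure seem to just come back as NULL, which matches
      // what I'm seeing in the destination table
      .withColumn("parsed", from_xml($"body", xsdSchema))
      .write.format("delta")
      .mode("append")
      .saveAsTable("testtingxml")
  }
}.start()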
Has anyone managed to work through the same scenario?
Ideally, what I'm expecting is that for each row I receive from the Event Hubs dataframe, I can obtain the XML, validate it against a specific XSD schema file, and add the successful rows to one dataframe and the failed ones to another, so I can proceed to process them accordingly.
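Something along these lines is what I'm picturing, just a rough sketch assuming the JDK's javax.xml.validation API wrapped in a UDF (the testtingxml_failed table name is made up, and the Validator is rebuilt per call here for simplicity since it isn't serializable or thread-safe; it would really need to be cached per executor):

import java.io.{File, StringReader}
import javax.xml.XMLConstants
import javax.xml.transform.stream.StreamSource
import javax.xml.validation.SchemaFactory
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf
import scala.util.control.NonFatal

// True if the XML body validates against the XSD, false otherwise
// ($-syntax below reuses spark.implicits._ from the imports above)
val validatesAgainstXsd = udf { xml: String =>
  try {
    val validator = SchemaFactory
      .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
      .newSchema(new File("/dbfs/mnt/schemas/test_schema.xsd"))
      .newValidator()
    validator.validate(new StreamSource(new StringReader(xml)))
    true
  } catch {
    case NonFatal(_) => false
  }
}

dfEventHub.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  // Persist because the batch is written to two sinks
  val flagged = batchDF.withColumn("isValid", validatesAgainstXsd($"body")).persist()

  // Valid rows go to the main table, failed ones to a separate (made-up) table
  flagged.filter($"isValid").drop("isValid")
    .write.format("delta").mode("append").saveAsTable("testtingxml")
  flagged.filter(!$"isValid").drop("isValid")
    .write.format("delta").mode("append").saveAsTable("testtingxml_failed")

  flagged.unpersist()
}.start()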
I found a similar implementation, but without the schema validation, here => Reading schema of streaming Dataframe in Spark Structured Streaming
If anyone could shed some light, that would be much appreciated.