
I'm new to Apache Spark Structured Streaming. I'm trying to read some events from an Event Hub (in XML format) and to create a new Spark DataFrame from the nested XML.

I'm using the code example described at https://github.com/databricks/spark-xml; it runs perfectly in batch mode, but not in Spark Structured Streaming.

Code snippet from the spark-xml GitHub library:

import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import spark.implicits._
val df = ... /// DataFrame with XML in column 'payload' 
val payloadSchema = schema_of_xml(df.select("payload").as[String])
val parsed = df.withColumn("parsed", from_xml($"payload", payloadSchema))

My batch code

val df = Seq(
  (8, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"a\">d</Customers><tag1>7</tag1> <tag2>4</tag2> <mode>0</mode> <Quantity>1</Quantity></AccountSetup>"),
  (64, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"a\">d</Customers><tag1>6</tag1> <tag2>4</tag2>  <mode>0</mode> <Quantity>1</Quantity></AccountSetup>"),
  (27, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"a\">d</Customers><tag1>4</tag1> <tag2>4</tag2> <mode>3</mode> <Quantity>1</Quantity></AccountSetup>")
).toDF("number", "body")


val payloadSchema = schema_of_xml(df.select("body").as[String])
val parsed = df.withColumn("parsed", from_xml($"body", payloadSchema))

val final_df = parsed.select(parsed.col("parsed"))
display(final_df.select("parsed.*"))


I tried to apply the same logic in Spark Structured Streaming, as in the following code:

Structured Streaming code

import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import org.apache.spark.eventhubs.{ ConnectionStringBuilder, EventHubsConf, EventPosition }
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._


val streamingInputDF = 
  spark.readStream
    .format("eventhubs")
    .options(eventHubsConf.toMap)
    .load()

val payloadSchema = schema_of_xml(streamingInputDF.select("body").as[String])
val parsed = streamingInputDF.withColumn("parsed", from_xml($"body", payloadSchema))
val final_df = parsed.select(parsed.col("parsed"))

display(final_df.select("parsed.*"))

The line `val payloadSchema = schema_of_xml(streamingInputDF.select("body").as[String])` throws the error `Queries with streaming sources must be executed with writeStream.start();;`.
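I suspect this is because schema_of_xml has to run a batch job over the data in order to infer the schema, and that kind of action is not allowed on a streaming source. If that is right, a sketch like the following (the hard-coded sample payload is my own assumption; any representative payload would do) should avoid running the inference on the stream:

// Sketch: infer the schema once, in batch, from a static sample payload
// instead of from the streaming DataFrame. The sample string is hard-coded
// here purely for illustration.
val sampleDS = Seq(
  "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"a\">d</Customers><tag1>7</tag1><tag2>4</tag2><mode>0</mode><Quantity>1</Quantity></AccountSetup>"
).toDS()

// Runs a batch job over one static row, which is allowed
val payloadSchema = schema_of_xml(sampleDS)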

Update

I tried the following:


val streamingInputDF = 
  spark.readStream
    .format("eventhubs")
    .options(eventHubsConf.toMap)
    .load()
    .select($"body".cast("string"))

val body_value = streamingInputDF.select("body").as[String]
body_value.writeStream
    .format("console")
    .start()

spark.streams.awaitAnyTermination()


val payloadSchema = schema_of_xml(body_value)
val parsed = body_value.withColumn("parsed", from_xml($"body", payloadSchema))
val final_df = parsed.select(parsed.col("parsed"))

Now it does not run into the error, but Databricks stays in "Waiting" status.

Thanks!!

– basigow

1 Answer


There is nothing wrong with your code if it works in batch mode.

It is important not only to convert the source into a stream (by using readStream and load), but also to turn the sink part into a stream.

The error message you are getting is just reminding you to also look into the sink part. Your DataFrame final_df is actually a streaming DataFrame, which has to be started through start.

The Structured Streaming Guide gives you a good overview of all available Output Sinks; the easiest is to print the result to the console.

To summarize, you need to add the following to your program:

final_df.writeStream
    .format("console")
    .start()

spark.streams.awaitAnyTermination()
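
Applied to your XML case, a sketch of the full pipeline could look like this (assuming payloadSchema has been inferred beforehand from a static sample, as sketched in your question, since schema_of_xml itself cannot run on a streaming Dataset):

// from_xml is an ordinary column expression, so it works on the stream
val parsed = streamingInputDF
  .select($"body".cast("string").as("body"))
  .withColumn("parsed", from_xml($"body", payloadSchema))

// A streaming DataFrame has to be written through writeStream and start
parsed.select("parsed.*")
  .writeStream
  .format("console")
  .start()

// Blocks the current thread; any code placed after this line will not run
spark.streams.awaitAnyTermination()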
– Michael Heil
  • Thank you very much for your reply, Mike. I really need more detailed knowledge about Spark Structured Streaming. Anyway, there is something that doesn't work for me: why `final_df.writeStream`? If I run `schema_of_xml(streamingInputDF.select("body").as[String])`, it already fails with that error, before I even get to final_df. – basigow Jan 22 '21 at 09:39
  • Because your streamingInputDF is a streaming DataFrame and you are using it for your payloadSchema, whereas the payloadSchema does not have a "writeStream" and "start". – Michael Heil Jan 22 '21 at 09:42
  • If the answers in the other Stack Overflow post are not helping, I suggest creating a minimal reproducible example and opening a new question specifically pointing to the error/problem you are getting. – Michael Heil Jan 22 '21 at 09:43
  • I don't know to what extent I can make it more reproducible; it's just having an XML document (as in the batch code example) and reading it as a stream. Maybe I misunderstood you. As for the other post, I see it very focused on caching, and I think I still don't know a lot of basic concepts about Structured Streaming. – basigow Jan 22 '21 at 10:40
  • On the other hand, I tried to add your code before `display(final_df.select("parsed.*"))` and it still fails with the same error message... I've edited the post with an update, though I honestly don't know if that approach makes sense. Thank you very much for your time :) – basigow Jan 22 '21 at 10:41
  • Remember that all code after spark.streams.awaitAnyTermination will not be executed. It looks like there are too many open questions here. Maybe it is worth starting with a more simple first example on Structured Streaming. Make that work, then add things like XML-schemas. Unfortunately, Stackoverflow is not meant to provide full tutorials for beginners but rather tries to solve very specific problems. – Michael Heil Jan 22 '21 at 10:52
  • Thanks a lot, Mike, I'll do that. I'll try simpler examples :) – basigow Jan 22 '21 at 11:05