I am trying to consume a JSON data stream from an Azure Event Hub, to be processed further for analysis with PySpark on Databricks. I am having trouble extracting the JSON data into DataFrames in a notebook. I can successfully connect to the Event Hub and can see the data streams arriving.
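For context, the connection is set up with the azure-eventhubs-spark connector, roughly as below (a sketch; the connection string and names are placeholders, with the real value kept in a secret scope):
# Placeholder connection string; the real one is not shown here
connectionString = "Endpoint=sb://<namespace>.servicebus.windows.net/;EntityPath=<hub>;SharedAccessKeyName=<key-name>;SharedAccessKey=<key>"
ehConf = {}
# The connector expects the connection string to be encrypted via its helper
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)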
To explain the source JSON structure, below is a sample of the JSON data that is sent to the Event Hub.
[
  {
    "header": {
      "msg_type": "0003",
      "source_dev_id": "TP1",
      "user_id": "RP1156"
    },
    "body": {
      "event_type": "ARRIVAL",
      "gbtt_timestamp": "1626464040000",
      "original_loc_stanox": "FELIX0008"
    }
  },
  {
    "header": {
      "msg_type": "0003",
      "source_dev_id": "TP1",
      "user_id": "RP1156"
    },
    "body": {
      "event_type": "ARRIVAL",
      "gbtt_timestamp": "1626465080000",
      "original_loc_stanox": "CREW0008"
    }
  },
  {
    "header": {
      "msg_type": "0002",
      "source_dev_id": "",
      "user_id": "RP1156"
    },
    "body": {
      "event_type": "DEPARTURE",
      "gbtt_timestamp": "1626466070000",
      "original_loc_stanox": "FELIX0008"
    }
  }
]
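As an aside, the sample above can also be loaded into a static DataFrame to experiment with schemas offline (a sketch; the path is a placeholder and assumes the sample is saved to a file):
# multiLine is needed because the sample is a pretty-printed JSON array
static_df = spark.read.option("multiLine", "true").json("/tmp/sample.json")
static_df.printSchema()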
The following is what I am doing in the notebook after connecting to the Event Hub:
# Read the stream into a DataFrame
df = spark.readStream.format("eventhubs").options(**ehConf).load()
# display(df) shows me that the data is arriving
# the structure of df at this stage is as follows
df.printSchema()
root
|-- body: binary (nullable = true)
|-- partition: string (nullable = true)
|-- offset: string (nullable = true)
|-- sequenceNumber: long (nullable = true)
|-- enqueuedTime: timestamp (nullable = true)
|-- publisher: string (nullable = true)
|-- partitionKey: string (nullable = true)
|-- properties: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- systemProperties: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
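As a quick sanity check (illustrative only), the binary body can be cast to a string to inspect the raw JSON text:
import pyspark.sql.functions as F
# Decode the binary body column to text to eyeball the incoming JSON
display(df.select(F.col("body").cast("string").alias("raw_json")))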
# all the JSON data is in body as binary, which I cast to string when parsing below
# next I try to extract the JSON data into a DataFrame
from pyspark.sql.types import *
import pyspark.sql.functions as F

dfsch = ArrayType(
    StructField("body", StructType([
        StructField("event_type", StringType(), True),
        StructField("gbtt_timestamp", StringType(), True),
        StructField("original_loc_stanox", StringType(), True)
    ]), True),
    StructField("header", StructType([
        StructField("msg_type", StringType(), True),
        StructField("source_dev_id", StringType(), True),
        StructField("user_id", StringType(), True)
    ]), True)
)
dfj = df.select(F.from_json(F.col("body").cast("string"), dfsch).alias("payload"))
I get the following error:
TypeError: Object of type StructField is not JSON serializable
I am quite new to PySpark, so I am not sure whether I have correctly represented the JSON schema in dfsch. I would be immensely grateful if someone could point out what is wrong here.