
I am trying to consume a JSON data stream from an Azure Event Hub, to be further processed for analysis via PySpark on Databricks.

I am having trouble extracting the JSON data into data frames in a notebook.

I can successfully connect to the event hub and can see the data streams arriving.

In order to explain the source JSON structure, below is a sample of the JSON data that is sent to the event hub.

 [
    {
       "header": {
          "msg_type": "0003",
          "source_dev_id": "TP1",
          "user_id": "RP1156"
       },
       "body": {
          "event_type": "ARRIVAL",
          "gbtt_timestamp": "1626464040000",
          "original_loc_stanox": "FELIX0008"
       }
    },
    {
       "header": {
          "msg_type": "0003",
          "source_dev_id": "TP1",
          "user_id": "RP1156"
       },
       "body": {
          "event_type": "ARRIVAL",
          "gbtt_timestamp": "1626465080000",
          "original_loc_stanox": "CREW0008"
       }
    },
    {
       "header": {
          "msg_type": "0002",
          "source_dev_id": "",
          "user_id": "RP1156"
       },
       "body": {
          "event_type": "DEPARTURE",
          "gbtt_timestamp": "1626466070000",
          "original_loc_stanox": "FELIX0008"
       }
    }
 ]

The following is what I am doing in the notebook after connecting to the event hub:

# Read the stream into a data frame
df = spark.readStream.format("eventhubs").options(**ehConf).load()

# display(df) shows me that the data is arriving

# the structure of df at this stage is as follows
df.printSchema()
root
 |-- body: binary (nullable = true)
 |-- partition: string (nullable = true)
 |-- offset: string (nullable = true)
 |-- sequenceNumber: long (nullable = true)
 |-- enqueuedTime: timestamp (nullable = true)
 |-- publisher: string (nullable = true)
 |-- partitionKey: string (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- systemProperties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

# all the JSON data is in the body column as binary, which I cast to a string later on

# next I try to extract the JSON data into a data frame

from pyspark.sql.types import *
import pyspark.sql.functions as F

dfsch = ArrayType(
    StructField("body",StructType([
        StructField("event_type",StringType(),True),
        StructField("gbtt_timestamp",StringType(),True),
        StructField("original_loc_stanox",StringType(),True)
    ]),True),
    StructField("header",StructType([
        StructField("msg_type",StringType(),True),
        StructField("source_dev_id",StringType(),True),
        StructField("user_id",StringType(),True)
    ]),True)
)

dfj = df.select(F.from_json(F.col("body").cast("string"), dfsch).alias("payload"))

# at this stage I get the following error

TypeError: Object of type StructField is not JSON serializable


I am quite new to PySpark, so I am not sure whether I have correctly represented the JSON schema in `dfsch`.

I would be immensely grateful if someone could point out what is wrong here.

rarpal
  • what does `df.select(F.col("body").cast("string")).show()` output? – Steven Jul 19 '21 at 13:50
  • I could not use `.show()` because it is a streaming data frame, so I did the following: `dfs = df.select(F.col("body").cast("string"))` then `display(dfs)`. I can certainly see the JSON strings in the body column, in the same form as the original source given earlier, but I am not sure how to convert the JSON string into a data frame so that I can further process the actual column values. – rarpal Jul 19 '21 at 16:46

1 Answer


Looking more closely at the schema definition, it was actually something simple: `ArrayType` takes a single element type, not a list of fields, so the fields of each array element have to be wrapped in a `StructType`. I just had to change the schema to the following:

dfsch = ArrayType(
    StructType([
        StructField("body",StructType([
            StructField("event_type",StringType(),True),
            StructField("gbtt_timestamp",StringType(),True),
            StructField("original_loc_stanox",StringType(),True)
        ])),
        StructField("header",StructType([
            StructField("msg_type",StringType(),True),
            StructField("source_dev_id",StringType(),True),
            StructField("user_id",StringType(),True)
        ]))
    ])
)

It shows I am still new to PySpark. Onwards and upwards!

Thanks for your help.

rarpal