
I have a Structured Streaming job that takes JSON messages from a Kafka topic and stores them as Parquet on HDFS. From there the messages are picked up on a schedule, analysed, and stored into a MySQL database.

Now I want to move this to streaming: take the JSON messages from Kafka, analyse them, and store the results into MySQL directly.
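As far as I know there is no built-in JDBC sink for streaming queries, so I assume the write to MySQL would go through foreachBatch and a regular JDBC batch write, roughly like this (connection details and table name are placeholders, and resultDF stands for the analysed streaming dataframe):

import org.apache.spark.sql.DataFrame

resultDF.writeStream
  .outputMode("update") // an aggregation without a watermark rules out append mode
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.write
      .format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/mydb") // placeholder connection
      .option("dbtable", "measurements")              // placeholder table
      .option("user", "user")                         // placeholder credentials
      .option("password", "password")
      .mode("append")
      .save()
  }
  .start()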

My problem right now is that Spark Structured Streaming does not support pivot on streaming DataFrames.

My code right now looks like this:

val df = readFromHdfsParquet

val df_exploded = df
  .select($"device", $"owner", $"timestamp", explode($"channels").as("exploded"))
  .withColumn("channelName", $"exploded.channelName")
  .withColumn("value", $"exploded.value")
  .drop("exploded")

val df_grouped = df_exploded
  .groupBy($"device", $"owner", $"timestamp")
  .pivot("channelName")
  .agg(first($"value", ignoreNulls = false))

This results in the desired output structure containing all available channels.

My JSON looks like this:

{
  "device": "TypeA",
  "owner": "me",
  "timestamp": "2019-05-12 17:27:59",
  "channels": [
    {
      "channelName": "temperature",
      "state": 0,
      "value": "27"
    },
    {
      "channelName": "humidity",
      "state": 0,
      "value": "10"
    }
  ]
}

The length of the channels array is not fixed and can change from device to device.
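For the streaming variant I assume I would have to parse the Kafka value column myself with an explicit schema that mirrors this JSON. A sketch of how I picture that (broker and topic names are placeholders):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// schema mirroring the JSON message above
val channelSchema = new StructType()
  .add("channelName", StringType)
  .add("state", IntegerType)
  .add("value", StringType)

val messageSchema = new StructType()
  .add("device", StringType)
  .add("owner", StringType)
  .add("timestamp", StringType)
  .add("channels", ArrayType(channelSchema))

// read the raw messages from Kafka and parse the JSON payload
// (assuming `spark` is the active SparkSession)
val df_stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "devices")                   // placeholder topic
  .load()
  .select(from_json(col("value").cast("string"), messageSchema).as("msg"))
  .select("msg.*")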

What I want is a dataframe with the following structure, which I can then store into MySQL.

|device|owner|timestamp          |temperature|humidity|
|TypeA |me   |2019-05-12 17:27:59|27         |10      |

How is something like this possible with Structured Streaming? It would also be enough to get just a few channels instead of all of them by selecting them explicitly (e.g. temperature only, without humidity), as sketched below.
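To illustrate that last part: for a fixed, explicitly chosen set of channels, a manual pivot via conditional aggregation would already be enough for me. A sketch with the two channels from the sample hard-coded (assuming spark.implicits._ is in scope, as in the code above):

import org.apache.spark.sql.functions._

// manual "pivot" for explicitly named channels only
val df_manual = df_exploded
  .groupBy($"device", $"owner", $"timestamp")
  .agg(
    first(when($"channelName" === "temperature", $"value"), ignoreNulls = true).as("temperature"),
    first(when($"channelName" === "humidity", $"value"), ignoreNulls = true).as("humidity")
  )

I am aware that the aggregation would still need a suitable output mode (or a watermark for append), but whether this is the right direction is part of my question.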
