
In order to aggregate a time series (for example, every 10 minutes) I used "groupBy" and "window" as shown:

import org.apache.spark.sql.functions.window
import spark.implicits._  // for the $"column" syntax

val df2 = df.groupBy(
  window($"timestamp", "10 minutes"))
  .avg("field")

df2.show() looks like this:

+-------------------------------------------+----------+
|                                     window|avg(field)|
+-------------------------------------------+----------+
| [2018-06-10 03:30:00, 2018-06-10 03:40:00]|22        |
| [2018-06-10 03:30:00, 2018-06-10 03:40:00]|42        |
| [2018-06-10 03:30:00, 2018-06-10 03:40:00]|60        | 
+-------------------------------------------+----------+

This is its schema:

root
 |-- window: struct (nullable = true)
 |    |-- start: timestamp (nullable = true)
 |    |-- end: timestamp (nullable = true)
 |-- avg(field): int (nullable = true)

I wanted to save it to CSV, but I can't:

CSV data source does not support struct<start:timestamp,end:timestamp>

Do you know how I can flatten the window column? Or is there a better way to aggregate time series like that?

Thank you very much

Noa Be
  • I've already seen this post. My problem is that I don't know how to deal with the "struct" type: val stringify = udf((vs: StructType) => vs match { case null => null case _ => s"""[${vs.head}]""" }) is not working... – Noa Be Aug 24 '18 at 13:36
  • Found it! You can access a struct field by adding the field's name in parentheses: df.withColumn("window", $"window"("start")) (see the sketch below) – Noa Be Aug 28 '18 at 07:41
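Putting the comment above into practice, here is a minimal sketch of flattening the window struct and writing the result out. It assumes df2 is the aggregated DataFrame from the question; the output path and the renamed columns (window_start, window_end, avg_field) are just illustrative.

import spark.implicits._  // for the $"column" syntax, assuming a SparkSession named spark is in scope

// Pull the struct fields out as plain timestamp columns, which the CSV writer can handle
val flat = df2.select(
  $"window"("start").as("window_start"),
  $"window"("end").as("window_end"),
  $"avg(field)".as("avg_field"))

// The write now succeeds because no struct columns remain
flat.write
  .option("header", "true")
  .csv("/tmp/aggregated")  // hypothetical output path

Selecting $"window.start" and $"window.end" works as well, since Spark also resolves nested struct fields with dot notation.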
