I have a simple time series where a switch is turned on and off by an operator. My aim is to label each "turned on" phase with its own ID. For example, the result, with the added eventID column, would look like this:

val eventDF = sc.parallelize(List(("2016-05-01 10:00:00", 0, 0),
                                  ("2016-05-01 10:00:30", 0, 0),
                                  ("2016-05-01 10:01:00", 1, 1),
                                  ("2016-05-01 10:01:20", 1, 1),
                                  ("2016-05-01 10:02:10", 1, 1),
                                  ("2016-05-01 10:03:30", 0, 0),
                                  ("2016-05-01 10:04:00", 0, 0),
                                  ("2016-05-01 10:05:20", 0, 0),
                                  ("2016-05-01 10:06:10", 1, 2),
                                  ("2016-05-01 10:06:30", 1, 2),
                                  ("2016-05-01 10:07:00", 1, 2),
                                  ("2016-05-01 10:07:20", 0, 0),
                                  ("2016-05-01 10:08:10", 0, 0),
                                  ("2016-05-01 10:08:50", 0, 0)))
                .toDF("timestamp", "switch", "eventID")

So far, I have tried the rank, rangeBetween, and lag window functions without any luck, so any hint is appreciated.

Christoph
  • It is possible and not particularly hard, but won't scale unless you have another layer of grouping. – zero323 Jan 04 '18 at 15:32
  • Yes, you are right; this is the first step to separate/group the events. How would you do it? – Christoph Jan 04 '18 at 19:38
  • Take a look at https://stackoverflow.com/q/42448564/6910411 and https://stackoverflow.com/q/43806269/6910411 - it is essentially the same problem. – zero323 Jan 05 '18 at 20:05
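
For reference, here is a minimal sketch of one way this kind of labelling is commonly done with window functions. It is not taken from the linked questions. It flags each row where the switch goes from 0 to 1 using lag, takes a running sum of those flags, and zeroes the result wherever the switch is off. The object name LabelSwitchEvents, the helper labelEvents, and the intermediate columns isStart and runningStarts are made up for illustration. As the first comment warns, the window has no partitionBy, so Spark moves all rows into a single partition; this only scales if the data can first be partitioned by some other key (a device ID, say).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, sum, when}

object LabelSwitchEvents {

  // Hypothetical helper: expects columns "timestamp" and "switch" (0 or 1).
  def labelEvents(df: DataFrame): DataFrame = {
    // Assumption: no grouping key exists, so the window is ordered only.
    // Spark will pull every row into one partition to evaluate it.
    val w = Window.orderBy("timestamp")

    // 1 on the first row of each "on" phase (previous switch value was 0,
    // or there is no previous row), 0 everywhere else.
    val isStart =
      when(col("switch") === 1 && lag(col("switch"), 1, 0).over(w) === 0, 1)
        .otherwise(0)

    df.withColumn("isStart", isStart)
      // Running count of phase starts up to and including the current row.
      .withColumn("runningStarts", sum(col("isStart")).over(w))
      // Keep the running count only while the switch is on.
      .withColumn("eventID",
        when(col("switch") === 1, col("runningStarts")).otherwise(0))
      .drop("isStart", "runningStarts")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("label-switch-events")
      .getOrCreate()
    import spark.implicits._

    // Same timestamps and switch values as in the question.
    val switchDF = Seq(
      ("2016-05-01 10:00:00", 0), ("2016-05-01 10:00:30", 0),
      ("2016-05-01 10:01:00", 1), ("2016-05-01 10:01:20", 1),
      ("2016-05-01 10:02:10", 1), ("2016-05-01 10:03:30", 0),
      ("2016-05-01 10:04:00", 0), ("2016-05-01 10:05:20", 0),
      ("2016-05-01 10:06:10", 1), ("2016-05-01 10:06:30", 1),
      ("2016-05-01 10:07:00", 1), ("2016-05-01 10:07:20", 0),
      ("2016-05-01 10:08:10", 0), ("2016-05-01 10:08:50", 0)
    ).toDF("timestamp", "switch")

    labelEvents(switchDF).orderBy("timestamp").show(false)
    spark.stop()
  }
}
```

The timestamps here are strings in a sortable format, so ordering by them works as is; with other formats they would need to be cast to a timestamp type first. The resulting eventID column is a bigint because it comes from sum.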

0 Answers