You can achieve this by using `lead` on the `mode` column and checking for changes from 'x' --> 'y' and vice versa -
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from functools import reduce
import pyspark.sql.functions as F
from pyspark.sql import Window

# SQLContext is deprecated since Spark 2.0; SparkSession is the unified
# entry point for DataFrame creation.
spark = SparkSession.builder.getOrCreate()

# Sample (Driver_ID, Timestamp, mode) events for a single driver.
input_list = [('A', 1000, 'x'),
              ('A', 1010, 'x'),
              ('A', 1020, 'x'),
              ('A', 1023, 'y'),
              ('A', 1033, 'y'),
              ('A', 1037, 'x'),
              ('A', 1047, 'x'),
              ]
sparkDF = spark.createDataFrame(input_list, ['Driver_ID', 'Timestamp', 'mode'])

# Events ordered in time, per driver.
window = Window.partitionBy('Driver_ID').orderBy('Timestamp')

# Next row's mode, so each row can see whether the mode is about to change
# (null on the last row of each partition).
sparkDF = sparkDF.withColumn('mode_lead', F.lead(F.col('mode')).over(window))

### Condition to identify a mode change from 'x' --> 'y'
x_change_cond = ((F.col('mode') == 'x')
                 & (F.col('mode') != F.col('mode_lead'))
                 )

### Condition to identify a mode change from 'y' --> 'x'
### Equality is used here because the lag timestamp is required when the
### mode changes from y to x.
y_change_cond = ((F.col('mode') == 'y')
                 & (F.col('mode') == F.col('mode_lead'))
                 )

# Gap to the next event (null on the last row of each partition).
sparkDF = sparkDF.withColumn('timestamp_diff_lead',
                             F.lead(F.col('Timestamp')).over(window) - F.col('Timestamp'))

# Running total of gaps up to the x->y change row, 0 elsewhere.
# NOTE: an ordered window's default frame runs from unboundedPreceding to
# the current row, so this sums every gap seen so far.
sparkDF = sparkDF.withColumn('timestamp_sum_x_change',
                             F.when(x_change_cond,
                                    F.sum(F.col('timestamp_diff_lead')).over(window)
                                    ).otherwise(0)
                             )

# Only the current row's gap on 'y' rows whose successor is also 'y', 0
# elsewhere: rowsBetween(0, Window.currentRow) restricts the frame to the
# current row itself.
sparkDF = sparkDF.withColumn('timestamp_sum_y_change',
                             F.when(y_change_cond,
                                    F.sum(F.col('timestamp_diff_lead')).over(window.rowsBetween(0, Window.currentRow))
                                    ).otherwise(0))

cols = ['Timestamp', 'mode', 'mode_lead', 'timestamp_diff_lead',
        'timestamp_sum_x_change', 'timestamp_sum_y_change']
sparkDF.select(*cols).show(truncate=False)
+---------+----+---------+-------------------+----------------------+----------------------+
|Timestamp|mode|mode_lead|timestamp_diff_lead|timestamp_sum_x_change|timestamp_sum_y_change|
+---------+----+---------+-------------------+----------------------+----------------------+
|1000 |x |x |10 |0 |0 |
|1010 |x |x |10 |0 |0 |
|1020 |x |y |3 |23 |0 |
|1023 |y |y |10 |0 |10 |
|1033 |y |x |4 |0 |0 |
|1037 |x |x |10 |0 |0 |
|1047 |x |null |null |0 |0 |
+---------+----+---------+-------------------+----------------------+----------------------+
Using the above result set, you can filter for the rows matching either (`|`) of the change conditions -
# A row is a "change row" when either accumulated-time column is populated.
x_changed = F.col('timestamp_sum_x_change') != 0
y_changed = F.col('timestamp_sum_y_change') != 0
filter_cond = x_changed | y_changed
sparkDF.filter(filter_cond).select(*cols).show()
+---------+----+---------+-------------------+----------------------+----------------------+
|Timestamp|mode|mode_lead|timestamp_diff_lead|timestamp_sum_x_change|timestamp_sum_y_change|
+---------+----+---------+-------------------+----------------------+----------------------+
| 1020| x| y| 3| 23| 0|
| 1023| y| y| 10| 0| 10|
+---------+----+---------+-------------------+----------------------+----------------------+
Finally, you can aggregate the filtered result set to generate the output -
# Aggregate the change rows per driver: count of activations plus the
# accumulated times for each direction of change.
sparkDF.filter(filter_cond).groupby('Driver_ID').agg(
    F.count(F.col('Driver_ID')).alias('activation(x)'),
    F.max(F.col('timestamp_sum_x_change')).alias('time1'),
    F.max(F.col('timestamp_sum_y_change')).alias('time2'),
).show()
+---------+-------------+-----+-----+
|Driver_ID|activation(x)|time1|time2|
+---------+-------------+-----+-----+
| A| 2| 23| 10|
+---------+-------------+-----+-----+