I'm new to SQL and AWS Timestream, and I want to write a query that gives me the total time a device is active. I then want to translate that into energy usage (kWh) based on the power rating (kW) of the device.
The data points do not arrive at a fixed interval. The data looks something like this:
timestamp (s) | active (boolean) |
---|---|
1617697080 (10h18) | false |
1617697920 (10h32) | true |
1617698280 (10h38) | false (active for 6 minutes) |
1617699000 (10h50) | true |
1617699120 (10h52) | false (active for 2 minutes) |
etc. |
In the above, the total active time is 8 minutes (6 + 2).
The kinds of results I would like to get out are:
- Total active time (energy usage) over the last month (or other period)
- Total active time (energy usage) per day over the last month
What query would give me this info and be tolerant of the variable intervals?
There are two paths that I'm looking at but haven't quite figured out yet:
- Interpolate the data and fill with the last value to get a new time series with a consistent interval (then it is as simple as counting the values), or
- Use some date/time function to look at the timestamps between the data points and add up the total time that it is active.
I've been trying to get a query right to interpolate the data, but have not succeeded yet. I'm following the pattern in the AWS Timestream SQL docs, but I don't fully understand it yet.
I don't even know where to begin or where to look for examples of summing the timestamp differences. The logical process would be something like:
if (this_point == true) then
total_active_time += next_point_timestamp - this_point_timestamp
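The logic above can be checked outside Timestream with a few lines of Python. This is only a sketch: the `total_active_seconds` name is made up for illustration, and it assumes the samples are sorted by time and that each value holds until the next sample arrives.

```python
def total_active_seconds(samples):
    """Sum the time spent active, given (unix_seconds, active) pairs sorted by time."""
    total = 0
    # Pair each sample with the one that follows it.
    for (t, active), (t_next, _) in zip(samples, samples[1:]):
        if active:
            total += t_next - t
    return total


# Sample data from the table above.
samples = [
    (1617697080, False),
    (1617697920, True),
    (1617698280, False),
    (1617699000, True),
    (1617699120, False),
]

print(total_active_seconds(samples) / 60)  # 8.0 (minutes)
```

Consecutive "on" samples need no special handling here, because every active sample simply contributes the time until the next sample.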
Possible solution #1: Interpolate
Interpolating works well within certain bounds, although it feels like overkill for this problem. The on/off resolution of the device is 10 s, so if the interpolation step is much larger than that, the result starts to get skewed. The catch is that the interpolated time series cannot exceed 10,000 points: AWS Timestream throws the error "Result of sequence function must not have more than 10000 entries". At a 10 s step that limit is reached after roughly 27.8 hours of data, so this works for short periods, but if I want to figure out energy usage over a month, it will never work (at least not at 10 s resolution). Using a 1 m step gives me good enough accuracy over a 6-day period (about 8,640 points). It would be possible to run this query multiple times with different date ranges to cover a longer period.
WITH active_timeseries AS (
  -- One row per timestamp; Active > 0 when the device reported "on"
  SELECT time, COUNT_IF(measure_value::boolean) AS Active
  FROM "my_db"."data"
  WHERE measure_name = 'active'
    AND time > ago(6d)
  GROUP BY time
), interpolated_timeseries AS (
  -- Resample onto a fixed 1-minute grid, carrying the last observation forward
  SELECT INTERPOLATE_LOCF(
    CREATE_TIME_SERIES(time, Active),
    SEQUENCE(min(time), max(time), 1m)) AS interpolated_active
  FROM active_timeseries
), new_timeseries AS (
  -- Flatten the interpolated series back into (time, value) rows
  SELECT time, value
  FROM interpolated_timeseries
  CROSS JOIN UNNEST(interpolated_active)
)
-- Active minutes / 60.0 = active hours; the trailing "1" is the power rating in kW, so Daily is in kWh
SELECT bin(time, 1d) AS binned_ts, COUNT_IF(value > 0) / 60.0 * 1 AS Daily
FROM new_timeseries
GROUP BY bin(time, 1d)
ORDER BY binned_ts
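To run the interpolation query over a longer period, the date range has to be cut into windows that each stay under the 10,000-entry limit. The Python sketch below works out the longest usable window for a given step and splits a range accordingly. The helper names are hypothetical, and it assumes SEQUENCE counts both endpoints, so a window of N steps produces N + 1 entries.

```python
from datetime import datetime, timedelta

MAX_SEQUENCE_ENTRIES = 10_000  # limit quoted in the Timestream error message


def max_span(step, limit=MAX_SEQUENCE_ENTRIES):
    """Longest span whose sequence, counting both endpoints, stays within the limit."""
    return step * (limit - 1)


def split_range(start, end, step, limit=MAX_SEQUENCE_ENTRIES):
    """Yield (window_start, window_end) pairs that each fit within the limit."""
    span = max_span(step, limit)
    while start < end:
        window_end = min(start + span, end)
        yield start, window_end
        start = window_end


print(max_span(timedelta(seconds=10)))  # 1 day, 3:46:30
print(max_span(timedelta(minutes=1)))   # 6 days, 22:39:00

# A 30-day period at a 1-minute step needs five windows.
for window in split_range(datetime(2021, 4, 1), datetime(2021, 5, 1), timedelta(minutes=1)):
    print(window)
```

Adjacent windows share a boundary instant, so each query would want a half-open time filter (`time >= start AND time < end`) to avoid counting that point twice.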
Possible solution #2: Calculate the time difference
In the application there may be several consecutive "on" samples and several consecutive "off" samples within each on or off period. The transition samples (on-to-off and off-to-on) can be picked out using the LAG function. At first I wasn't able to get the LAG and LEAD functions to work on a timestamp, but it suddenly seems to be working without me having changed anything...not sure what to make of that. LEAD can then be used to determine the time interval between one transition and the next. Putting it all together, it looks like this:
WITH active_timeseries AS (
  SELECT time,
    measure_value::boolean AS active,
    -- State of the previous sample; NULL for the first row
    LAG(measure_value::boolean, 1, NULL) OVER (ORDER BY time ASC) AS last_active
  FROM "my_db"."data"
  WHERE measure_name = 'active'
    AND time > ago(6d)
  ORDER BY time ASC
)
-- Keep only the transition rows; LEAD then points at the next transition
SELECT time, active, last_active,
  (LEAD(time, 1, NULL) OVER (ORDER BY time ASC) - time) AS time_interval
FROM active_timeseries
WHERE (active = true AND last_active = false) OR
  (active = false AND last_active = true)
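The LAG/LEAD steps in the query can be mirrored in plain Python to sanity-check its output. This is a sketch, not Timestream code: `transition_intervals` and the sample data are made up for illustration, with repeated "on" and "off" samples like those described above.

```python
def transition_intervals(samples):
    """Keep only samples where the state flips, then attach the seconds until the next flip.

    samples: (unix_seconds, active) pairs sorted by time.
    Returns (unix_seconds, active, seconds_until_next_flip) tuples; the last gap is None.
    """
    transitions = []
    previous = None  # plays the role of LAG: there is no previous value for the first row
    for t, active in samples:
        if previous is not None and active != previous:
            transitions.append((t, active))
        previous = active

    result = []
    for i, (t, active) in enumerate(transitions):
        # Plays the role of LEAD(time): timestamp of the next transition, if any.
        next_t = transitions[i + 1][0] if i + 1 < len(transitions) else None
        result.append((t, active, None if next_t is None else next_t - t))
    return result


# Hypothetical data with repeated on and off samples.
samples = [(0, False), (10, False), (20, True), (30, True), (50, False), (60, False), (90, True)]

for row in transition_intervals(samples):
    print(row)
```

As in the SQL, the very first sample is dropped because it has no previous state to compare against, and the last transition has no following row to measure to.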
The query then gives you one row per transition, with data looking something like this (but with the timestamps in the native "timestamp" type):
timestamp (s) | active (boolean) | time interval |
---|---|---|
1617697080 (10h18) | false | 840 |
1617697920 (10h32) | true | 360 |
1617698280 (10h38) | false (active for 6 minutes) | 720 |
1617699000 (10h50) | true | 120 |
1617699120 (10h52) | false (active for 2 minutes) | etc. |
etc. |
This is great! Pretty much what I'm looking for....but now I can't convert the time interval (a "timestamp" type) into something usable. I need to do some basic operations on it to work it into a kWh usage, e.g.,
kWh = number_of_seconds_active(s) / seconds_in_an_hour(s) * power(kW)
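As a worked example of that formula, here is a small Python sketch. The `energy_kwh` name is made up, and the 3 kW rating is just an example value.

```python
SECONDS_PER_HOUR = 3600


def energy_kwh(active_seconds, power_kw):
    """Energy in kWh for a device drawing power_kw while active for active_seconds."""
    return active_seconds / SECONDS_PER_HOUR * power_kw


# 8 minutes (480 s) of activity for a 3 kW device is about 0.4 kWh.
print(energy_kwh(480, 3.0))
```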
After some churn and testing I discovered SQL EXTRACT(), which allows me to pull out days, hours, minutes, and seconds. So I can do this:
WITH active_timeseries AS (
  SELECT time,
    measure_value::boolean AS active,
    LAG(measure_value::boolean, 1, NULL) OVER (ORDER BY time ASC) AS last_active
  FROM "my_db"."data"
  WHERE measure_name = 'active'
    AND time > ago(6d)
  ORDER BY time ASC
), interval_timeseries AS (
  SELECT time, active, last_active,
    (LEAD(time, 1, NULL) OVER (ORDER BY time ASC) - time) AS time_interval
  FROM active_timeseries
  WHERE (active = true AND last_active = false) OR
    (active = false AND last_active = true)
)
SELECT time, active, last_active, time_interval,
  -- Interval length in seconds; the day term covers on-periods longer than 24 hours
  EXTRACT(day FROM time_interval) * 86400 + EXTRACT(hour FROM time_interval) * 3600 + EXTRACT(minute FROM time_interval) * 60 + EXTRACT(second FROM time_interval) AS interval_time,
  -- kWh = seconds / 3600.0 * power rating (3.0 kW here)
  (EXTRACT(day FROM time_interval) * 86400 + EXTRACT(hour FROM time_interval) * 3600 + EXTRACT(minute FROM time_interval) * 60 + EXTRACT(second FROM time_interval)) / 3600.0 * 3.0 AS kWh
FROM interval_timeseries
WHERE active = true
and that gives me the energy usage!
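One caveat when converting an interval to seconds from its extracted fields: the day field matters for any on-period longer than 24 hours. The Python sketch below illustrates this with `timedelta`, whose `seconds` attribute likewise holds only the sub-day part. It assumes Timestream's hour, minute, and second fields behave the same way (as they do in Presto, which Timestream's SQL resembles); I have not verified that against Timestream itself. The function names are made up.

```python
from datetime import timedelta


def seconds_without_days(interval):
    """Mimic hour*3600 + minute*60 + second, which only sees the sub-day part."""
    return interval.seconds  # timedelta.seconds excludes whole days


def seconds_with_days(interval):
    """Include the day field so long on-periods are counted in full."""
    return interval.days * 86400 + interval.seconds


long_run = timedelta(days=1, hours=2, minutes=30)
print(seconds_without_days(long_run))  # 9000
print(seconds_with_days(long_run))     # 95400
```

For on-periods shorter than a day both functions agree, so the difference only shows up if the device stays on for more than 24 hours.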