2

I need to calculate weekly and monthly moving averages per sensor per day for a large set of sample data based on some quality criteria. I have a working solution based on correlated sub-queries (or self joins), but I was wondering if using analytic functions is possible and would result in better performance?

Here is what I have right now (simplified):

CREATE TABLE Samples(
    SensorId int,
    SampleTime datetime,
    Value float,
    Quality float
)

WITH DailyAvg (SensorId, SampleDate, ValueSum, ValueCount)
AS
(
    SELECT
        SensorId,
        CAST(SampleTime AS DATE) AS SampleDate,
        SUM(Value) AS ValueSum,
        COUNT_BIG(Value) AS ValueCount
    FROM Samples
    WHERE Quality > 0.95
    GROUP BY SensorId, CAST(SampleTime AS DATE)
)
SELECT
    SensorId,
    SampleDate,
    ( SELECT SUM(d2.ValueSum) / SUM(d2.ValueCount) FROM DailyAvg AS d2 WHERE d2.SensorId = d1.SensorId AND d2.SampleDate BETWEEN DATEADD(DAY,   -7, d1.SampleDate) AND d1.SampleDate) AS AverageLastWeek,
    ( SELECT SUM(d2.ValueSum) / SUM(d2.ValueCount) FROM DailyAvg AS d2 WHERE d2.SensorId = d1.SensorId AND d2.SampleDate BETWEEN DATEADD(DAY,  -14, d1.SampleDate) AND d1.SampleDate) AS AverageLast2Weeks,
    ( SELECT SUM(d2.ValueSum) / SUM(d2.ValueCount) FROM DailyAvg AS d2 WHERE d2.SensorId = d1.SensorId AND d2.SampleDate BETWEEN DATEADD(MONTH, -1, d1.SampleDate) AND d1.SampleDate) AS AverageLastMonth
FROM DailyAvg d1
ORDER BY SensorId, SampleDate

I've tried replacing the sub-query for weekly average with the snippet below, but it obviously cannot handle days without any samples correctly. I thought of using RANGE or PARTITION BY expressions, but I cannot figure out how to specify the window frame to select the samples from e.g. "last week".

SUM(ValueSum) OVER(PARTITION BY SensorId ORDER BY SampleTime ROWS 7 PRECEDING) / SUM(ValueCount) OVER(PARTITION BY SensorId ORDER BY SampleTime ROWS 7 PRECEDING) AS AverageLastWeek

I even considered "Quirky Update", but besides being messy I don't think it makes sense with this many days being averaged over.

hjort
  • 133
  • 1
  • 7
  • do you need the lastmonth information or lastmonth per Id information? I am thinking about filtering SampleTime to the last month. – MelgoV May 06 '15 at 17:36
  • If I understand this correctly, I believe you could get away with using a PIVOT function. I am not sure how much faster that will be, but definitely worth a shot! – FutbolFan May 06 '15 at 20:43
  • I need moving averages for each id and for each day. For instance, on today's date any given Id would each give rise to a row with the weekly average based on samples from 4/30 through 5/6 and the monthly average based on samples from 4/7 through 5/6. So the window frames are larger than the steps or partitions. – hjort May 06 '15 at 21:07

3 Answers3

0

Check out the window functions "Lead" and "Lag". They were created exactly for this purpose (executing agg functions on moving windows over result sets).

Amir Pelled
  • 591
  • 4
  • 13
  • I actually did consider the LEAD / LAG analytic functions combined with PARTITION BY SensorId, but as far as I could tell, they're based on row count and thus doesn't handle days with no samples - much like my "ROWS 7 PRECEDING" clause mentioned in my original question. Can you provide me with a concrete example to help me better understand how these functions can help me? – hjort May 06 '15 at 22:35
  • That depends on whether you want "missing" days to count as days with 0 value, or not to be counted at all. For 0 value, you can create a "days" table and join it, thus "filling" th gaps for the offset with 0. For the second option, you can again create a "days" table, and join it with the CTE on the ID and the CTE.date between the Days.date -X and Days.Date. Then you can calculate the Count and Avg for each date. – Amir Pelled May 07 '15 at 07:46
0

My code compiled faster, had less scans and less logical reads with my little randomized dataset. It's kind of hard to tell with such a small dataset and I don't have your indexes and whatnot. So try it for yourself. If anything, it's simpler than your query. Now in my query, the Average month was tricky. If you want you could make it like the others that use a certain number of days. So you might do average for previous 30 days. What I put in was average for that current month only. I do have a feeling you could put your subquery in for that one, but I didn't try that cause it's late here.

Note: The fancy rows preceding assumes that their is a row for each day with no gaps or else your data will be skewed.

SELECT  SensorID,
        SampleDate,
        AVG(avg_VALUE) OVER (PARTITION BY SensorID,SampleDate) avg_per_date, --but I only have one row per date so that's why its a whole number each time
        AVG(avg_VALUE) OVER (PARTITION BY SensorID ORDER BY SampleDate ROWS BETWEEN 7 PRECEDING AND CURRENT ROW)  AverageLastWeek,
        AVG(avg_VALUE) OVER (PARTITION BY SensorID ORDER BY SampleDate ROWS BETWEEN 14 PRECEDING AND CURRENT ROW) AverageLast2Weeks,
        AVG(avg_VALUE) OVER (PARTITION BY SensorID,MONTH(SampleDate) ORDER BY SampleDate ROWS UNBOUNDED PRECEDING) AverageCurrentMonth
        --AVG(avg_VALUE) OVER (PARTITION BY SensorID ORDER BY SampleDate ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) AverageLast30Days --alternatively you could do this
FROM
(
    SELECT SensorId,CAST(SampleTime AS DATE) SampleDate,AVG(Value) Avg_value
    FROM Samples
    WHERE Quality > .95
    GROUP BY SensorId,CAST(SampleTime AS DATE)
) S
ORDER BY SensorID,SampleDate

If you have any questions or need anything else, let me know!

Stephan
  • 5,891
  • 1
  • 16
  • 24
  • Thanks for the concrete example. However, it very much resembles the code snippet I mentioned at the bottom of my original question, except your proposal doesn't do a proper weighted average to correctly handle days with differing number of samples. And as you point out yourself, neither of them can handle days with no samples, which is a requirement. – hjort May 07 '15 at 08:21
  • Yeah that's the limitations of Windows functions. The only way around that, which wouldn't be pretty, would be to create list of each data from minimum date to end date for each SensorID then LEFT JOIN it to your table. But at that point you are doing so much extra work, it's probably more efficient to stick with what you got. – Stephan May 07 '15 at 12:44
0

can you provide with some sample data. this can be done with recursive CTE. I can help after I have some test data to play with

sam
  • 1,242
  • 3
  • 12
  • 31