0

So, I've got one problem: getting an actual sum/avg with over, partitioning through product codes, of the last 12 months (counting the one being processed), in a scenario where there are products that may not exist in a month.

Tried to start with a TL;DR, let's go for the actual thing:

  • My table has 4 fields: year, month, product code and value;
  • I need sum over the last year (selected month + past 11 months);
  • I need avg over the last year (same).

I tried with:

SELECT * FROM (
  SELECT year, month, product,
  AVG(value) OVER (
    PARTITION BY product 
    ORDER BY year, month
    ROWS 11 PRECEDING
  ) as average,
  SUM(value) OVER (
    PARTITION BY product 
    ORDER BY year, month
    ROWS 11 PRECEDING
  ) as sum
  FROM suchDB.muchUSER.awesomeTABLE
) q
where year = <insert year> and month = <month>

Problems:

  • If any queried month doesn't have product '123', it'll ignore if the product has history on the table, it won't bring me an average or sum of past months.
  • If any past month, which should be part of history of a queried month, doesn't have product '123', it'll ignore that month and move on to a month past the 'last year' scope (a queried October without a July would return it's version of the previous year, instead of stopping on November).

We used to query this thing using 'GROUP BY product' and 'WHERE ((year-1)*100)+month > queriedYear-1Month AND (year*100)+month <= queriedYearMonth' until someone pointed us in the OVER/PARTITION direction and we changed everything... but even then we had some problems, like avg() function will ignore NULL months...

Help?

FORGOT ONE VERY IMPORTANT THING

The data in awesomeTABLE is versioned - each year/month can have more than one version. Only the latest must be used. I usually do this by joining with a select distinct year, month, max(version) from awesomeTABLE group by year, month but it seems to kill some of the possible solutions...

João Ciocca
  • 776
  • 1
  • 10
  • 23

4 Answers4

2

The big issue is that you need to get two lists to correctly summarize the values in your dataset - one list of dates, and one list of products. Without these two lists, a missing product in the final month means that product won't be reported on, or (as you discovered already) the wrong 12 months may be summarized (missing July means the 11 Preceding rows includes the start month).

Below is a fully expanded exploration of a procedure that generates those lists. It uses only the source data table (making an assumption that something was sold in each month. It can be made more concise (i.e. calculate the dates as in shawnt's example above), but is written to show all of the steps and assumptions. I encapsulated it into a stored procedure because it shows the value passed in explicitly.

    CREATE PROCEDURE DoTheThing 
    @startDate DATE  -- Should be Year-Month-01 or YearMonth01
    AS
    BEGIN
    DECLARE @yr INT, @mth INT,
        @yr2 INT, @mth2 INT,
        @endDate DATE   -- StartDt - 11 months

    -- if the date will be passed in with a day other than 01, add code here to set the day on the passed date to 01
    -- if only the high year and month are passed in, then create a @startDt value and continue.

    SET @endDate = DATEADD(MONTH, -11, @startDate)

    SELECT @yr = DATEPART(YEAR, @startDate),
        @mth = DATEPART(MONTH, @startdate),
        @yr2 = DATEPART(YEAR, @startDate),
        @mth2 = DATEPART(MONTH, @startdate)

    WITH mthYr AS (
        SELECT DISTINCT 
            YEAR, 
            MONTH
        FROM suchDB.muchUSER.awesomeTABLE   -- Get the data from the source table
        WHERE (
            YEAR = @yr              -- if in the passed-in year, then take all months less than or equal to the start month
            AND MONTH <= @mth
            )
            OR (
            YEAR = @yr2             -- if the period is Jan -- Dec in one year, this reiterates the above
            AND MONTH >= @mth2      -- if not, select the months in the second year where the month is greater than or equal to the calculated month
            )
        ), 
    prods AS (
        SELECT DISTINCT product     -- Return a list of products sold during the year.
        FROM suchDB.muchUSER.awesomeTABLE smt
        INNER JOIN mthYr
            ON mthYr.YEAR = smt.YEAR
            AND mthyr.MONTH = smt.MONTH
        )

    SELECT MAX(mthYr.YEAR) AS year,     -- current report only shows passed in year/month value
        MAX(mthYr.month) AS month, 
        prods.product,                  
        AVG(ISNULL(VALUE, 0.00)) average,   -- isNull value adds a zero into the list to be averaged
        SUM(ISNULL(VALUE, 0.00)) SUM        -- not really necessary, but no warnings about NULL values will be generated
    FROM mthYr CROSS JOIN prods         -- cross join of the two lists means all of the products sold will have a value for each month
    LEFT JOIN suchDB.muchUSER.awesomeTABLE smt  -- left join so missing productMonths will still be added in
        ON smt.YEAR = mthYr.year
        AND smt.MONTH = mthYr.month 
        AND prods.product = smt.product
    ORDER BY smt.product
END
Laughing Vergil
  • 3,706
  • 1
  • 14
  • 28
  • so, one thing I get from all the answers is that I actually should have 2 new separated tables - one for dates, and another for products, right? (sorry if this sound newbie, but all my knowledge about databases is practical, no theory) – João Ciocca Dec 11 '15 at 13:10
  • Yeah I didn't understand on my first reading of your question but it does look like you have the same problem with both products and dates. So yes, 2 "tables". – shawnt00 Dec 11 '15 at 13:37
  • 1
    ok, one more here - I tried it, but it complains that "prods.product" at the last select is invalid on the select list because it's not grouped. After fixing the group by, it worked. Followed this idea http://dattatreysindol.com/2012/05/29/tips-n-tricks-t-sql-an-elegant-way-to-create-or-alter-stored-procedures-in-one-go-for-easy-maintenance/, adding `USE suchDB / GO / IF OBJECT_ID (etc...)` at the start of your code, ending with `EXEC suchDB.dbo.DoTheThing '20150901'` and it seems to work, just need to validate it now =D – João Ciocca Dec 11 '15 at 13:45
  • Something's serioulsy strange here: I'm getting exactly the same value for average and SUM columns... – João Ciocca Dec 11 '15 at 13:53
  • Turns out it isn't calculating anything. It's bringing me the value for that month. prod 110 value for 2015/09 is 26,291,162,867.79 - and the procedure is giving me the same value for average AND sum. – João Ciocca Dec 11 '15 at 14:18
  • Did you group only by product? – shawnt00 Dec 11 '15 at 14:39
  • That is odd. I really can't see why you would get that result. – Laughing Vergil Dec 11 '15 at 16:35
  • I see it now, and it is a stupid typo on my part. When assigning the values to the Yr2 and Mth2 variables, the data should come from EndDate, not StartDate. (_beats head against desk_) – Laughing Vergil Dec 11 '15 at 16:44
1

If you have a table of products and a table of dates, I would left join those with the query above, so that all the products and all the dates are represented, then sum the results with the method you have.

Your first bullet should be solved because all dates and all products would be represented

I believe your second bullet is caused by "rows 11 preceding" truly getting a past 11 values irrespective of date; that could be solved by substituting 0 for null for the month/product combinations that are currenly missing, but would be supplied in the case of the left join method.

onupdatecascade
  • 3,336
  • 22
  • 35
1

Lots of possible fixes for the missing months. Here's one approach that doesn't change your original query too much:

WITH ym as (
    select
         year(dateadd(month, n, cast('<year>' + '<month>' + '01' as date))) as y,
        month(dateadd(month, n, cast('<year>' + '<month>' + '01' as date))) as m
    from (values
         (0), (-1), (-2), (-3),  (-4),  (-5),
        (-6), (-7), (-8), (-9), (-10), (-11)
    ) ofs(n)
)
SELECT
    y, m, p.product,
    AVG(coalesce(value, 0)) OVER (
      PARTITION BY p.product 
      ORDER BY y, m
      ROWS 11 PRECEDING
    ) as average,
    SUM(value) OVER (
      PARTITION BY p.product 
      ORDER BY y, m
      ROWS 11 PRECEDING
    ) as sum
FROM
    ym cross join
    (select distinct product from suchDB.muchUSER.awesomeTABLE) p
    left outer join suchDB.muchUSER.awesomeTABLE t
        on t."year" = ym.y and t."month" = ym.m and t.product = p.product
where y = <insert year> and m = <month>

Rather than the where at the end you could do an order by y, m desc and select top 1. I usually try to avoid top but I'm not sure how you're dropping in parameters and it might be less of a hassle to avoid referring to them twice with some programming libraries. And even if you're doing it by hand you still have to remember to look in two places across a long query.

Since it appears you only want aggregates for a single month, I think you can get away with this simpler version that doesn't use window functions:

WITH ym as (
    select
        dateadd(month, n, cast('<year>' + '<month>' + '01' as date) as dt,
         year(dateadd(month, n, cast('<year>' + '<month>' + '01' as date))) as y,
        month(dateadd(month, n, cast('<year>' + '<month>' + '01' as date))) as m
    from (values
         (0), (-1), (-2), (-3),  (-4),  (-5),
        (-6), (-7), (-8), (-9), (-10), (-11)
    ) ofs(n)
)
SELECT
    year(max(dt)) as "year", month(max(dt)) as "month", p.product,
    AVG(coalesce(value, 0)) as average,
    SUM(value) as "sum"
FROM
    ym cross join
    (select distinct product from awesome) p
    left outer join awesome t
    on t."year" = ym.y and t."month" = ym.m and t.product = p.product
GROUP BY p.product

I don't know if you want to limit only to products actually sold during the previous year so I didn't handle that there.

If you start thinking about how to make this more generic and reusable you might end up more like this one. I went ahead and added that limitation on product to only the previous year of activity:

WITH dt as (
    select dateadd(month, n, cast('<year>' + '<month>' + '01' as date) as dt
),
ym as (
    select dt, year(dateadd(month, n, dt)) as y, month(dateadd(month, n, dt)) as m
    from (values
         (0), (-1), (-2), (-3),  (-4),  (-5),
        (-6), (-7), (-8), (-9), (-10), (-11)
    ) ofs(n), dt
)
SELECT
    year(max(dt)) as "year", month(max(dt)) as "month", p.product,
    AVG(coalesce(value, 0)) as average, SUM(value) as "sum"
FROM
    ym cross join
    (
        select distinct product from awesome
        where cast("year" + "month" + '01' as date) between
                (select min(dt) from ym) and (select max(dt) from ym)
    ) p
    left outer join (
        select distinct /* get the latest "version" only */
            first_value("year")
                over (partition by "year", "month", product order by version desc),
            first_value("month")
                over (partition by "year", "month", product order by version desc),
            product,
            first_value(value)
                over (partition by "year", "month", product order by version desc)
        from awesome
    ) t
        on t."year" = ym.y and t."month" = ym.m and t.product = p.product
GROUP BY p.product

The final query also attempts to handle your filter for only the most recent version. You'll need a later edition of SQL Server for the first_value() function though.

shawnt00
  • 16,443
  • 3
  • 17
  • 22
  • tried this one but... somethings seem off. First, there's missing ) on ym's y and m. Then, SQL complains on "offsets(n)" with "expecting ')', or ','." – João Ciocca Dec 11 '15 at 13:28
  • Yeah, looks like I had missed a couple parents in my rush. Hopefully you got the idea. – shawnt00 Dec 11 '15 at 13:39
  • I do, I maybe just lack the knowledge to correctly apply... `offsets(n)` now complains "Expecting AS, ID, or QUOTED_ID". "n" on the ym also complains "invalid column name 'n'" – João Ciocca Dec 11 '15 at 13:48
  • I think that was because I had too many paren for the values table constructor. I always forget whether I need parens around the whole thing. SQL Fiddle wasn't working so I couldn't test it out but it should be a lot better now. – shawnt00 Dec 11 '15 at 14:26
  • on the third example, I still dunno if I'm adapting it wrong to the actual field and table names, but SQL still complains on `offsets(n)` – João Ciocca Dec 14 '15 at 19:58
  • 1
    Looks like `offsets` is a reserved word. I changed it and did some testing with that subquery to verify it works. – shawnt00 Dec 14 '15 at 21:13
  • ok... almost everything now. but for some reason it complains about "ym.y" and "ym.m" on the outer join. Multi-part id couldn't be bound. – João Ciocca Dec 15 '15 at 19:44
  • 1
    Maybe change `ym,` to `ym cross join`. Seems like I have had problems when mixing the two join syntaxes. – shawnt00 Dec 15 '15 at 19:50
  • This seem to have done the job - including both bullet problems described on my post =D just going to double check the results! Thanks @shawnt00 – João Ciocca Dec 15 '15 at 20:05
  • HOLY COW, I forgot one important problem I have here - there are VERSIONS to each date! Meaning a date can have more than one version - and I _should_ only use the latest one. The way I usually do this is just add a join with distinct year, month, max(version) grouped by year, month - but for this case, it seems that doing this kills the latest month for the product that's missing (and which your solution covered!) any ideas on getting around this? version field is date, and it's set by "processing start time", defined at the start of import routine each time it runs. – João Ciocca Dec 15 '15 at 20:14
  • Do you have a recent edition of SQL Server with `first_value()`? You could also do the same join logic you've used before inside the derived table that I've called `t`. – shawnt00 Dec 15 '15 at 20:30
  • I've got 2012, from [doc](https://msdn.microsoft.com/pt-br/library/hh213018(v=sql.110).aspx) it seems to be available. use that with `t.`, will try that. – João Ciocca Dec 15 '15 at 20:55
  • I already edited my answer but I was assuming you had the function available. – shawnt00 Dec 15 '15 at 20:56
  • 1
    Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/98023/discussion-between-shawnt00-and-joao-ciocca). – shawnt00 Dec 15 '15 at 21:05
  • Re-reading your comment about versions I'm wondering if I misunderstand the relationship to product and what you're actually versioning. In other words is the versioning applied to month-year combinations or month-year-product combinations (or something different?) – shawnt00 Dec 15 '15 at 21:10
1

The first thing to do when working with date is to have a datetime field, not separated fields for year, month and day

SELECT year, month, product, value
     , DATEFROMPARTS(year, month, 1) fullDate
FROM   suchDB.muchUSER.awesomeTABLE

As the row can be versioned we need to get the last version of every year, month and product, that can be done in a number of way, for example using windowing function or with an auto-join, an example of the latter is

SELECT b.[year], b.[month], b.[product], [value]
     , DATEFROMPARTS(b.[year], b.month, 1) fullDate
FROM   suchDB.muchUSER.awesomeTABLE b
       INNER JOIN (SELECT [year], [month], [product], max([version]) lv
                   FROM   suchDB.muchUSER.awesomeTABLE
                   GROUP BY [year], [month], [product]
                  ) m ON b.[year] = m.year AND b.month = m.month 
                     AND b.product = m.product AND b.[version] = m.lv

Also having the list of the products will come handy

SELECT DISTINCT product
FROM   suchDB.muchUSER.awesomeTABLE

Getting the last year worth of data can be done in more than one way, I like CROSS APPLY

Declare @_year int
Declare @_month int

Set @_year = 2015
Set @_month = 11

;With baseDate AS (
  SELECT b.[year], b.[month], b.[product], [value]
       , DATEFROMPARTS(b.[year], b.month, 1) fullDate
  FROM   suchDB.muchUSER.awesomeTABLE b
         INNER JOIN (SELECT [year], [month], [product], max([version]) lv
                     FROM   suchDB.muchUSER.awesomeTABLE
                     GROUP BY [year], [month], [product]
                    ) m ON b.[year] = m.year AND b.month = m.month 
                       AND b.product = m.product AND b.[version] = m.lv
), Products AS (
  SELECT DISTINCT [product]
  FROM   suchDB.muchUSER.awesomeTABLE
)
SELECT @_year [Year], @_month [Month], p.[product]
     , ly.Average
     , ly.[Sum]
FROM   Products p
       CROSS APPLY (SELECT Sum(lastYear.Value) / 12.0 Average
                         , Sum(lastYear.Value) [Sum]
                    FROM   baseDate lastYear
                    WHERE  lastYear.fullDate > DATEFROMPARTS(@_year - 1, @_month, 1)
                      AND  lastYear.fullDate <= DATEFROMPARTS(@_year, @_month, 1)
                      AND  lastYear.product = p.product
                   ) ly 
WHERE  ly.[Sum] IS NOT NULL

The WHERE condition removes the products that have not row in the year before the parameters

To remove the variables and get the values for every month a calendar table is needed. If between all the products in the table all the month are there we can get the dates with a DISTINCT, using the first CTE instead of the table let us get also the full date

SELECT DISTINCT [year], [month], fullDate
FROM   baseDate

Otherwise there are different ways to create a calendar table. We can add the calendar table to the main query CTEs and use it in the CROSS APPLY instead of the variables

;With baseDate AS (
  SELECT b.[year] _y, b.[month] _m, b.[product], [value]
       , DATEFROMPARTS(b.[year], b.month, 1) fullDate
  FROM   suchDB.muchUSER.awesomeTABLE b
         INNER JOIN (SELECT [year], [month], [product], max([version]) lv
                     FROM   suchDB.muchUSER.awesomeTABLE
                     GROUP BY [year], [month], [product]
                    ) m ON b.[year] = m.year AND b.month = m.month 
                       AND b.product = m.product AND b.[version] = m.lv
), Products AS (
  SELECT DISTINCT [product]
  FROM   suchDB.muchUSER.awesomeTABLE
), Months As (
  SELECT DISTINCT _y, _m, fullDate
  FROM   baseDate
)
SELECT _y [Year], _m [Month], p.[product]
     , ly.Average
     , ly.[Sum]
     , ly.[Count]
FROM   Products p
       CROSS APPLY (SELECT m._y, m._m
                         , Sum(bd.Value) / 12.0 Average
                         , Sum(bd.Value) [Sum]
                         , Count(Value) [Count]
                    FROM   Months m
                           LEFT JOIN baseDate bd 
                                  ON bd.fullDate > DATEADD(YY, -1, m.fullDate)
                                 AND bd.fullDate <= m.fullDate
                    WHERE  bd.product = p.product
                    GROUP BY m._y, m._m
                   ) ly 
WHERE  ly.[Sum] IS NOT NULL
Serpiton
  • 3,676
  • 3
  • 24
  • 35
  • sorry for the delay, only now got around to test this. And once again I think I may lack more knowledge =p I got the idea of the auxiliary distinct tables and the date in one field - had no idea about that "datefromparts" function, thanks! I don't actually understand the "lastYear.year" and "lastYear.month" on the third line, though. I can see the "from baseDate lastYear" on the 4th from last line, but how does it interact with the baseDate table? – João Ciocca Dec 15 '15 at 18:45
  • ok, I think I got it to work =) thanks, taught me a lot in this one! but... it won't solve the two problems listed =/ one example I have caught here, a product that is not available on Sep/15, just won't show up if Sep/15 is queried... – João Ciocca Dec 15 '15 at 19:39
  • I forgot to explain [this](https://stackoverflow.com/questions/34212733/mssql2012-over-partition-last-12-months-nulls-included#comment56338839_34213293) before. Query works now, but probably because of different versions for the same year/month, the result isn't valid. – João Ciocca Dec 15 '15 at 20:22
  • Works, and almost valid! One problem remains: if there's no product for that month, avg will treat as null, and average the latest 11 months by 11, instead of considering it 0 and averaging the latest 11 months by 12. I tried adding "coalesce", but it didn't solved =( – João Ciocca Dec 15 '15 at 21:21
  • the sqlfiddle show just what I've said: even though product B's lacking entry for 2015/11 and 2015/06, the average for product B is still 10 - when it should be 8.33 ((10*10)/12) – João Ciocca Dec 15 '15 at 21:50
  • Nice. works AND validated! One last question, for my knowledge only: how would you change this query to make it show this average and sum for all dates, instead of having to input any - still keeping the "12 months window"? – João Ciocca Dec 15 '15 at 22:13
  • guess I need some time dissecting that example =D it worked perfectly (there's a product here that don't show up for 9/15 and 10/15, and it's results are showing as expected using that example!), but I have no idea how to reproduce =p thanks a LOT =) Any idea to make the "sum" and "avg" by OVER/PARTITION, so the value for that month can be show too? – João Ciocca Dec 21 '15 at 22:48
  • I do know that... i just don't know how to enforce the `where lastYear.fullDate > DATEFROMPARTS(@_year - 1, @_month, 1) AND lastYear.fullDate <= DATEFROMPARTS(@_year, @_month, 1)` part =/ – João Ciocca Dec 22 '15 at 10:08