-1

I have a simple table that contains a record of products and their total sales per day over a year (just 3 columns - Product, Date, Sales). So, for example, if product A is sold every single day, it'll have 365 records. Similarly, if product B is sold for only 50 days, the table will have just 50 rows for that product - one for each day of sale.

I need to calculate the daily average sales and standard deviation for the entire year, which means that, for product B, I need to have additional 365-50=315 entries with zero sales to be able to calculate the daily average and standard deviation for the year correctly.

Is there a way to do this efficiently and dynamically in SQL?

Thanks

Chipmunk_da
  • 467
  • 2
  • 9
  • 27

3 Answers3

2

We can generate 366 rows and join the sales data to it:

WITH rg(rn) AS (
   SELECT 1 AS rn
   UNION ALL
   SELECT a.rn + 1 AS rn
   FROM   rg a
   WHERE  a.rn <= 366
)

SELECT
  *
FROM
  rg
  LEFT JOIN (
    SELECT YEAR(saledate) as yr, DATEPART(dayofyear, saledate) as doy, count(*) as numsales 
    FROM sales 
    GROUP BY YEAR(saledate), DATEPART(dayofyear, saledate) 
  ) s ON rg.rn = s.doy

OPTION (MAXRECURSION 370);

You can replace the nulls (where there is no sale data for that day) with e.g. AVG(COALESCE(numsales, 0)). You'll probably also need a WHERE clause to eliminate the 366th day on non leap years (such as MODULO the year by 4 and only do 366 rows if it's 0).

If you're only doing a single year, you can use a where clause in the sales subquery to give only the relevant records; most efficient is to use a range like WHERE salesdate >= DATEFROMPARTS(YEAR(GetDate()), 1, 1) AND salesdate < DATEFROMPARTS(YEAR(GetDate()) + 1, 1, 1) rather than calling a function on every sales date to extract the year from it to compare to a constant. You can also drop the YEAR(salesdate) from the select/group by if there is only a single year

If you're doing multiple years, you could make the rg generate more rows, or (perhaps simpler) cross join it to a list of years so you get 366 rows multiplied by e.g. VALUES (2015),(2016),(2017),(2018),(2019),(2020) (and make the year from the sales part of the join too)

Caius Jard
  • 72,509
  • 5
  • 49
  • 80
  • Wouldn't this give an error since `YEAR(saledate)` isn't the Group By? – Chipmunk_da Jun 26 '20 at 09:52
  • So the above solution will add `NULL` for the missing entries. But to calculate the std dev for each product I need to identify NULLs (replaced with 0) related to each product. How can I do that? – Chipmunk_da Jun 29 '20 at 09:30
  • The first sentence under the code - *You can replace the nulls (where there is no sale data for that day) with e.g. AVG(COALESCE(numsales, 0))* > - if you coalesce the nulls with 0, before you average, then you will get a bunch of 0s that drag your average down. If you do not coalesce then the nulls are not considered in the average. `10 6 0 0` avg = 4, `10 6 null null` avg = 8 – Caius Jard Jun 29 '20 at 10:44
  • By my giving a query that is returning the data as null instead of 0 you can know the difference between eg "a day with no sales" versus "a day with 1 sale and 1 refund" if that's how you choose to code things up.. (If a sale is +1 and a refund is -1, then the sum of the day would be 0, the absence of the day results in null. You can hence choose whether to treat them the same by making the nulls 0, or not, by doing something else with the nulls – Caius Jard Jun 29 '20 at 10:48
  • Yes, but when we join with the `rg` table and have the product name in the `SELECT` and `GROUPBY` statement, what that would give me essentially is two rows per product (one grouped by `NULL` and the other by Product Name). And so the `Average` and `Standard Deviation` would be calculated separately for `NULL` and the Product Name. – Chipmunk_da Jun 29 '20 at 13:14
  • If you weren't selecting the product name and using it in the where clause instead (calculating `Average` and `Standard Deviation` for one product at a time), then the calculations would work correctly. But I need to do this for all products in the table. – Chipmunk_da Jun 29 '20 at 13:24
  • *If you weren't selecting the product name and using it in the where clause* - I didn't understand you; where did I mention product name at all? I never selected any product name nor used it in any where clause – Caius Jard Jun 29 '20 at 13:59
  • Sorry, what I mean is that I need to calculate the average and std dev for each product separately (per the original question). And so would need to include the product name in the select statement but then the solution wouldn't work – Chipmunk_da Jun 29 '20 at 14:16
  • 1
    If you want to have results that are "per day per product" you need a matrix that is every product id crossed with every day, and then you join your sales data onto that, for example `(SELECT r.rn as doy, p.id as prod FROM rg CROSS JOIN products) matrix LEFT JOIN (sales...) s ON s.saleday = matrix.doy and s.productid = matrix.prod` – Caius Jard Jun 29 '20 at 15:09
  • 1
    ( I mention this at the end o the answer, but there I was talking about 1 product spread across a yearXday matrix, whereas this is 1 year spread across a productXday matrix. Same thing, just different info involved) – Caius Jard Jun 29 '20 at 15:16
0

find the first and last day of the year and then use datediff() to find number of days in that year.

After that don't use AVG on sales, but SUM(Sales) / days_in_year

select  *,
         days_in_year = datediff(day, first_of_year, last_of_year) + 1
from    (values (2019), (2020)) v(year)
        cross apply
        (
            select  first_of_year = dateadd(year, year - 1900, 0),
                    last_of_year  = dateadd(year, year - 1900 + 1, -1)
        ) d
Squirrel
  • 23,507
  • 4
  • 34
  • 32
  • Thanks @Squirrel. I also need to calculate the standard deviation and don't think this would work for that. Sorry I hadn't included that in my question earlier but have updated it now. – Chipmunk_da Jun 26 '20 at 09:33
0

There's a different way to look at it - don't try to add additional empty rows, just divide by the number of days in a year. While the number of days a year isn't constant (a leap year will have 366 days), it can be calculated easily since the first day of the year is always January 1st and the last is always December 31st:

SELECT   YEAR(date), 
         product,
         SUM(sales) / DATEPART(dy, DATEFROMPARTS(YEAR(date)), 12, 31))
FROM     sales_table
GROUP BY YEAR(date), product
Mureinik
  • 297,002
  • 52
  • 306
  • 350
  • Thanks @Mureinik. I also need to calculate the standard deviation and don't think this would work for that. Sorry I hadn't included that in my question earlier but have updated it now. – Chipmunk_da Jun 26 '20 at 09:37