3

I often face the situation where I need to compare aggregated data of different periods from the same source.

I usually deal with it this way:

SELECT
    COALESCE(SalesThisYear.StoreId, SalesLastYear.StoreId) StoreId
    , SalesThisYear.Sum_Revenue RevenueThisYear
    , SalesLastYear.Sum_Revenue RevenueLastYear
FROM
    (
        SELECT   StoreId, SUM(Revenue) Sum_Revenue
        FROM     Sales
        WHERE    Date BETWEEN '2017-09-01' AND '2017-09-30'
        GROUP BY StoreId
    ) SalesThisYear
    FULL JOIN (
        SELECT   StoreId, SUM(Revenue) Sum_Revenue
        FROM     Sales
        WHERE    Date BETWEEN '2016-09-01' AND '2016-09-30'
        GROUP BY StoreId
    ) SalesLastYear
    ON (SalesLastYear.StoreId = SalesThisYear.StoreId)

-- execution time 337 ms

It is not very elegant in my opinion, because it visits the table twice, but it works.

Another way to achieve the same result is:

SELECT
    Sales.StoreId
    , SUM(CASE YEAR(Date) WHEN 2017 THEN Revenue ELSE 0 END) RevenueThisYear
    , SUM(CASE YEAR(Date) WHEN 2016 THEN Revenue ELSE 0 END) RevenueLastYear
FROM
    Sales
WHERE
    Date BETWEEN '2017-09-01' AND '2017-09-30' 
    or Date BETWEEN '2016-09-01' AND '2016-09-30'
GROUP BY
    StoreId

-- execution time 548 ms

Both solutions perform almost the same on my data set (1,929,419 rows in the selected period, with all indexes in place); the first one is a little faster. And it doesn't matter if I include more periods: the first one is always better on my data set.

This is only a simple example, but sometimes it involves more than two intervals and even some extra logic (e.g. comparing ISO week/weekday instead of month/day, comparing different stores, etc.).
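
For instance, here is a rough sketch of how the same conditional-aggregation pattern could be extended to compare the same ISO week across two years (this assumes SQL Server's DATEPART and a date-typed column, so it would need adapting to how the dates are actually stored):

SELECT
    StoreId
    , SUM(CASE WHEN YEAR(Date) = 2017 THEN Revenue ELSE 0 END) RevenueThisYear
    , SUM(CASE WHEN YEAR(Date) = 2016 THEN Revenue ELSE 0 END) RevenueLastYear
FROM
    Sales
WHERE
    DATEPART(ISO_WEEK, Date) = 39  -- same ISO week in both years
    AND YEAR(Date) IN (2016, 2017)
GROUP BY
    StoreId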

Although I have already figured out several ways to do this, I was wondering if there is a cleverer approach: maybe a cleaner solution, or one more suitable for big data sets (over a TB).

For example, I suppose the second one is less resource-intensive for a big data set, since it does a single Index Scan over the table. The first one, on the other hand, requires two Index Scans and a Merge. If the table is too big to fit in memory, what will happen? Or is the first one always better?
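
In case it helps, a minimal sketch of how the two variants could be compared beyond raw execution time (assuming SQL Server, which is where the Index Scan / Merge terminology comes from):

SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- run each candidate query here and compare elapsed time and logical reads

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;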

Diego Queiroz

2 Answers

0

There is very rarely a case where one way of doing things is always better, especially when the alternatives are doing very similar things.

What I will suggest, however, is that you try to follow best practice wherever you can, such as minimising the use of scalar functions in your queries, as this can inhibit index usage.

For example, by changing your second query to the following, I would imagine you will see at least some improvement performance-wise:

SELECT
    Sales.StoreId
    , SUM(CASE WHEN Date BETWEEN '2017-09-01' AND '2017-09-30' THEN Revenue ELSE 0 END) RevenueThisYear
    , SUM(CASE WHEN Date BETWEEN '2016-09-01' AND '2016-09-30' THEN Revenue ELSE 0 END) RevenueLastYear
FROM
    Sales
WHERE
    Date BETWEEN '2017-09-01' AND '2017-09-30' 
    or Date BETWEEN '2016-09-01' AND '2016-09-30'
GROUP BY
    StoreId
iamdave
  • This was not exactly my point, but your recommendation makes sense, so I was curious about the improvement I would get. Sadly, there was no improvement. The average execution time of your query was 597 ms. – Diego Queiroz Oct 28 '17 at 14:59
  • It is important to note that my table stores dates as VARCHAR in the format YYYYMMDD, so YEAR(Date) is actually LEFT(Date, 4). However, that is also a scalar function. – Diego Queiroz Oct 28 '17 at 14:59
  • What do you think about my performance concerns over big data sets? Do you think the merge operation is a resource killer? – Diego Queiroz Oct 28 '17 at 14:59
0

The second looks better, but I guess the YEAR() part is slowing the query. Let's take out YEAR() and use a plain date comparison instead: '2017-01-01' is less than every date in this year's range ('2017-09-01' to '2017-09-30') and greater than every date in last year's range ('2016-09-01' to '2016-09-30').

SELECT
    Sales.StoreId
    , SUM(CASE WHEN Date > '2017-01-01' THEN Revenue ELSE 0 END) RevenueThisYear
    , SUM(CASE WHEN Date < '2017-01-01' THEN Revenue ELSE 0 END) RevenueLastYear
FROM
    Sales
WHERE
    Date BETWEEN '2017-09-01' AND '2017-09-30'
    OR Date BETWEEN '2016-09-01' AND '2016-09-30'
GROUP BY
    StoreId

If the FULL JOIN is working great, let's try this.

SELECT
    COALESCE(SalesThisYear.StoreId, SalesLastYear.StoreId) StoreId
    , SUM(SalesThisYear.Revenue) RevenueThisYear
    , SUM(SalesLastYear.Revenue) RevenueLastYear
FROM
    Sales SalesThisYear
    FULL JOIN Sales SalesLastYear
    ON SalesLastYear.StoreId = SalesThisYear.StoreId
WHERE
    SalesThisYear.Date BETWEEN '2017-09-01' AND '2017-09-30'
    AND SalesLastYear.Date BETWEEN '2016-09-01' AND '2016-09-30'
GROUP BY
    COALESCE(SalesThisYear.StoreId, SalesLastYear.StoreId)

Edit:

SELECT
    q.StoreId
    , SUM(CASE WHEN q.Date > '2017-01-01' THEN Revenue ELSE 0 END) RevenueThisYear
    , SUM(CASE WHEN q.Date < '2017-01-01' THEN Revenue ELSE 0 END) RevenueLastYear
FROM
    (SELECT StoreId, Date, Revenue
     FROM   Sales
     WHERE  Date BETWEEN '2017-09-01' AND '2017-09-30'
        OR  Date BETWEEN '2016-09-01' AND '2016-09-30') q
GROUP BY
    q.StoreId
Valli
  • The average execution time in my table was 520ms. A little improvement, but the 'join' solution is still better in time. What do you think about my performance concerns over big data sets? Do you think the merge operation is a resource killer? – Diego Queiroz Oct 28 '17 at 14:56
  • I know. Sometimes our theory goes wrong, like here, where a full join is saving a lot of time. I have rewritten your full join query. – Valli Oct 28 '17 at 15:10
  • The last one was very interesting, but it was the worst. I didn't wait for it to finish because it took more than a minute, so I gave up. I analyzed the Execution Plan, and the Hash Match is now done after the Merge (before, there were two Hash Match operations happening before the Merge, one for each table). Apparently, sub-queries are not as terrible as I thought. – Diego Queiroz Oct 29 '17 at 22:31
  • The subqueries are doing a good job here. Your data is huge, so the subquery restricts the data. I am just trying the above query to avoid a full join. – Valli Oct 29 '17 at 23:19