0

Not for any specific case, just a question on how SQL software (e.g. Oracle) implements aggregate/analytic functions in order to help me write more efficient code.

Basically, I'm curious whether Oracle does anything smart about caching computed aggregate/analytic function values if they're referenced multiple times. For example, take the following two queries:

-- Query A
select 
    PRODUCT, 
    sum(SALES) as TOTSALES,                 -- (1)
    log(sum(SALES)+1,10) as LOG_TOTSALES    -- (2)
from EXAMPLE_TABLE
group by PRODUCT
order by sum(SALES);                        -- (3)

-- Query B
select 
    PRODUCT, 
    TOTSALES,
    log(TOTSALES+1,10) as LOG_TOTSALES
from (
    select PRODUCT, sum(SALES) as TOTSALES
    from EXAMPLE_TABLE
    group by PRODUCT
) 
order by TOTSALES;

which accomplish the same thing. A naive implementation of Query A would evaluate the sum(SALES) three times (twice in the select and once in the order by). Query B only evaluates it once in the inner query, and then calculates/orders by the explicit NUM column.

So the question is whether Oracle/MySQL/etc do anything smart under the hood for cases like Query A. I can imagine a case where, when the compiler encounters the first sum(SALES), it saves a new identifier. Then when it sees the log(sum(SALES)+1,10) (for which it does not have an identifier) and tries to parse it, recognizes the sum(SALES) already computed and looks up the cached value.

The same could work if the same analytic function is invoked > 1 times in a select, e.g.

select 
   DATE,
   SALES as TODAY_SALES,
   lag(SALES,1) over (order by DATE) as YEST_SALES,
   SALES - (lag(SALES,1) over (order by DATE)) as DIFF_SALES
from DAILY_SALES;

where lag(SALES,1) doesn't need to be computed twice if it is cached the first time. But I don't know if that would create other headaches in implementation, in which case I should be writing code like Query B to be more efficient.

Thanks!

Edit: replaced count(*) with sum(SALES) for a more illustrative example.

tpapalex
  • 1
  • 2
  • 1
    Question is to broad to be answerd.. – Raymond Nijland Aug 01 '19 at 17:31
  • @tpapalex . . . `count(*)` is a total non-issue in terms of performance. The expensive part is aggregating the data. – Gordon Linoff Aug 01 '19 at 17:33
  • *"So the question is whether Oracle/MySQL/etc "* MySQL does have SQL part caching.. -> https://dev.mysql.com/doc/refman/8.0/en/explain-extended.html notice the `` explainment *"The expression (such as a scalar subquery) is executed once and the resulting value is saved in memory for later use. For results consisting of multiple values, a temporary table may be created and you will see instead. "* – Raymond Nijland Aug 01 '19 at 17:34
  • @GordonLinoff understood, but I guess the same question stands if `count(*), log(count(*)+1,10)` is replaced by `sum(SALES), log(sum(SALES)+1)` – tpapalex Aug 01 '19 at 18:06
  • @RaymondNijland yes, part/subquery caching seems to be what I was referring to - thanks! I'll put together an answer, it seems that Oracle calls it a result cache. – tpapalex Aug 01 '19 at 18:09
  • @RaymondNijland actually on closer reading, MySQL's Query Cache seems to be caching the entire table resulting from a query, in case it is run again and hasn't changed. What I was more interested in was, while the query is being executed, whether the results of aggregate functions are cached (e.g. in the group by hash) so that they can be retrieved easily if they're used again. – tpapalex Aug 01 '19 at 18:20
  • *" actually on closer reading, MySQL's Query Cache "* Where did i mention the MySQL Query cache? The thing i mentioned is a optimizer cache pretty sure of it.. – Raymond Nijland Aug 01 '19 at 18:26
  • ah I see; that must be separate, I was trying to do some further googling on subquery caching and confused it with the query cache. So when the documentation says "scalar subsquery", that could include the result of an aggregate funciton? thanks! – tpapalex Aug 01 '19 at 18:29
  • *"What I was more interested in was, while the query is being executed, whether the results of aggregate functions are cached (e.g. in the group by hash) so that they can be retrieved easily if they're used again"* Right you mean as example if query has aggregates like `SELECT id, COUNT(*), COUNT(*) FROM table GROUP BY id` that the second `COUNT(*)` will use the cached result off the first `COUNT(*)` ? – Raymond Nijland Aug 01 '19 at 18:30
  • yup that's right -- i guess that's what the optimizer is for, was just not sure what to google – tpapalex Aug 01 '19 at 18:34
  • *"i guess that's what the optimizer is for"* it sounds logical if i would program a RDMS it would be in there.. – Raymond Nijland Aug 01 '19 at 19:05
  • iám using [this](https://www.db-fiddle.com/f/tTEYUT1hva3YKuBa7wxiPP/0) to track that happens in the MySQL source code the result -> `optimize function in sql_select.cc file starting on line 865` i found this [comment](https://github.com/mysql/mysql-server/blob/5.5/sql/sql_select.cc#L1040) *" Extract all table-independent conditions and replace the WHERE clause with them. All other conditions were computed by opt_sum_query and the MIN/MAX/COUNT function(s) have been replaced by constants,"* the functions `make_cond_for_table()` and `opt_sum_query()` seams to be used – Raymond Nijland Aug 01 '19 at 19:11
  • So yes based on that short look in the source code i would say that aggregates values are cached on the same query level in the optimizer – Raymond Nijland Aug 01 '19 at 19:16

0 Answers0