3

I am doing a subquery in which I have a calculated column involving random number generation. In the base query I select this column twice. MySQL 5.6 works as I expect, the calculated value being called once and fixed. The 5.7+/8.0+ execution seems to re-evaluate the subquery's column value individually for each selection. Is this correct behavior? What can I do to force it work as expected in newer versions of MySQL?

CREATE TABLE t (
  `id` BIGINT(20) NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=InnoDB;

insert into t values();
insert into t values();
insert into t values();
insert into t values();
insert into t values();

SELECT  
        q.i,
        q.r,
        q.r
FROM    (
        SELECT  
                id AS i,
                (FLOOR(RAND(100) * 4)) AS r
        FROM t
        ) q;

MySQL 5.6 yields (values are the same):

+---+-----+-----+
| i |  r  |  r  |
+---+-----+-----+
| 1 |   0 |   0 |
| 2 |   2 |   2 |
| 3 |   3 |   3 |
| 4 |   2 |   2 |
| 5 |   1 |   1 |
+---+-----+-----+

while 5.7 yields (values are different):

+---+-----+-----+
| i |  r  |  r  |
+---+-----+-----+
| 1 |   0 |   2 |
| 2 |   3 |   2 |
| 3 |   1 |   1 |
| 4 |   2 |   1 |
| 5 |   2 |   0 |
+---+-----+-----+
Pavel S.
  • 1,202
  • 1
  • 13
  • 29
  • Maybe it's dependent on whether the subquery is materialized or not. It looks like you can force the materialization in 5.7 (which I expect will get you the behavior of 5.6) with the hint /*+ SUBQUERY(MATERIALIZATION) */ (add after your inner select) – Kevin Postlewaite Jun 02 '17 at 20:46
  • @KevinPostlewaite I've tried your suggestion with 5.7, unfortunately it does not change things... – Pavel S. Jun 02 '17 at 21:44
  • Sorry to send you down a path that didn't work: that's a surprising change in behavior that you've found and I didn't see any documentation of this change. – Kevin Postlewaite Jun 02 '17 at 21:54
  • @KevinPostlewaite Thank you anyway for your attention! – Pavel S. Jun 02 '17 at 22:07
  • 1
    Oracle says it's not a [bug](https://bugs.mysql.com/bug.php?id=72447). They more or less say the optimizer can ignore that it's non-deterministic in optimizing a query. But their own documentation says non-deterministic functions can impede optimization, so I suspect they are not meeting the SQL standard, and I'll post an answer when I've read referenced mine. See MySQL [12.6.2 Mathematical Functions](https://dev.mysql.com/doc/refman/5.7/en/mathematical-functions.html#function_rand) and [8.2.1.17 Function Call Optimization](https://dev.mysql.com/doc/refman/5.7/en/function-optimization.html) – philipxy Jun 03 '17 at 01:25
  • @philipxy Thanks for your reply! If both behaviors are correct, there still should be a way to force the expected behavior... After migration from 5.6 to 5.7 my queries got corrupted, and it's been a pain figuring out the reason. – Pavel S. Jun 03 '17 at 10:13
  • Remember, if the spec doesn't guarantee something, your expectations are simply unjustified & irrelevant. PS I got the bug link as 5th hit googling 'mysql 5.7 rand error'. Googling also finds workarounds (which I expect are not guaranteed to work either). – philipxy Jun 03 '17 at 16:21
  • 1
    In restrospect my linked bug is re ORDER BY RAND(), so it doesn't apply. MySQL RAND() docs says it's called once per WHERE row and illustrates (but doesn't actually say) it's called once per SELECT row (and says it's not to be used in GROUP BY or ORDER BY; and ON & HAVING are unaddressed, but they are defined in terms of WHERE). (It also doesn't clarify re multiple calls in one clause.) Adding that to the SQL conceptual execution of a query, that the SELECT clause should select from the output of the FROM, makes this a bug. – philipxy Jun 03 '17 at 17:37
  • 1
    @philipxy I submitted a corresponding [bug report](https://bugs.mysql.com/bug.php?id=86624) at MySQL. – Pavel S. Jun 08 '17 at 12:59
  • FYI Re [Postgres](https://www.postgresql.org/docs/current/static/xfunc-volatility.html): IMMUTABLE = DETERMINISTIC "same results given the same arguments", STABLE "allows the optimizer to optimize multiple calls of the function to a single call", VOLATILE = other. (Though "using a volatile function will re-evaluate the function at every row where its value is needed."--per conceptual or optimized execution??) Only deterministic vs non-deterministic is standard. So if Postgres is conforming then the standard does not obligate mulitple non-deterministic implementation calls per statement. – philipxy Jun 18 '17 at 02:04
  • @philipxy Even if the standard does not obligate multiple calls, a behavior with multiple calls does not indicate a bug, right? Roy Lyseng gave me solutions at bugs.mysql.com of how to get desired behavior with MySQL 5.7 and 8.0. I'll post an answer shortly. – Pavel S. Jun 19 '17 at 15:45
  • 1
    I agree my fyi is a separate issue. But whatever reasonable behaviour is given for multiple text calls, your double select clause has *none*. Sadly the MySQL documentation does not say what you can rely on given a program text for non-deterministic funcitons. Nor for variables--they talk about some kind of execution model involving "the client" but they don't explain *that*. – philipxy Jun 19 '17 at 23:42

1 Answers1

3

As mentioned in The MySQL 8.0.0 Milestone Release is available,

In MySQL 5.6 and earlier, derived tables were always materialized. In 5.7, derived tables are merged into the outer query in most cases, and materialized in some cases.

...

Enabling merging a derived table or view through a optimizer hint (WL#9307) — This work by Guilhem Bichot allows users to control whether a derived table or view will be merged or materialized using the “merge” and “no_merge” hints.

I suppose that this is the cause of the behavior I am observing in newer versions of MySQL. The mentioned hint can be used with MySQL 8.0 to force RAND() be called only once:

SELECT  /* NO_MERGE(q) */
        q.i,
        q.r,
        q.r
FROM    (
        SELECT 
                id AS i,
                (FLOOR(RAND(100) * 4)) AS r
        FROM t
        ) AS q;

+---+-----+-----+
| i |  r  |  r  |
+---+-----+-----+
| 1 |   0 |   0 |
| 2 |   2 |   2 |
| 3 |   3 |   3 |
| 4 |   2 |   2 |
| 5 |   1 |   1 |
+---+-----+-----+

This however is not available in 5.7. To achieve the desired behavior with 5.7, add LIMIT <a very high number> to the derived table definition (I'm using signed LONG_MAX below). Thanks to Roy Lyseng for this workaround.

SELECT
        q.i,
        q.r,
        q.r
FROM    (
        SELECT 
                id AS i,
                (FLOOR(RAND(100) * 4)) AS r
        FROM t LIMIT 9223372036854775807
        ) AS q;

+---+-----+-----+
| i |  r  |  r  |
+---+-----+-----+
| 1 |   0 |   0 |
| 2 |   2 |   2 |
| 3 |   3 |   3 |
| 4 |   2 |   2 |
| 5 |   1 |   1 |
+---+-----+-----+

As philipxy mentioned in the comment, the result of a query expression must be strictly defined regardless of any optimizations being applied. Which means it is an optimizer bug in MySQL 5.7/8.0.

Pavel S.
  • 1,202
  • 1
  • 13
  • 29
  • 1
    All this means is the optimizer has a bug in it. Optimizations have nothing to with the definition of what the value of a query expression is allowed to be. – philipxy Jun 19 '17 at 23:40
  • @philipxy I've added it to the answer. Thank you for your contribution! – Pavel S. Jun 20 '17 at 11:52
  • 2
    For me, using `GROUP BY id` instead of `LIMIT high number` also works and is a bit "cleaner": https://stackoverflow.com/questions/48432654/mysql-rand-and-if-without-limit-leads-to-unexpected-results – Horen Jan 24 '18 at 22:57