6

Let's say I have a table

+------+---------+--------+
| lang | title   | url    |
+------+---------+--------+
| pt   | Livro 1 | o294jl |
| en   | Book 1  | o294jl |
| en   | Book 2  | o294jl |
+------+---------+--------+

And I run a query

SELECT lang, title
FROM table
GROUP BY url

The result of the query is not obvious because the values of lang and title are different among the group.

How does an SQL engine choose which row to return from a group? Which row must be selected in my example? Is it specified in the SQL standard?

Finesse
  • 9,793
  • 7
  • 62
  • 92
  • [Group by clause in mySQL and postgreSQL, why the error in postgreSQL?](https://stackoverflow.com/a/33629201/5070879) - this query is not a valid one according to the standard with `sql_mode = ONLY_FULL_GROUP_BY` you will get an error – Lukasz Szozda Aug 24 '19 at 11:01

2 Answers2

3

Values are chosen from arbitrary matching rows for each group. The values could come from different rows for different runs. In theory, different columns in the same SELECT could come from different rows.

The documentation explains this:

If ONLY_FULL_GROUP_BY is disabled, a MySQL extension to the standard SQL use of GROUP BY permits the select list, HAVING condition, or ORDER BY list to refer to nonaggregated columns even if the columns are not functionally dependent on GROUP BY columns. . . . In this case, the server is free to choose any value from each group, so unless they are the same, the values chosen are nondeterministic, which is probably not what you want.

You should read the complete documentation on the subject.

Note that the default behavior of MySQL is now to reject such queries. Yay!

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
1

I addition to Gorden's answer – In practice the engine will just do the least work, which is to choose the values from the first found row in the group. However – Which row is the first depends on the execution plan, in particular on the chosen index.

Assuming the following schema and data:

CREATE TABLE test (
  `lang`  VARCHAR(2),
  `title` VARCHAR(50),
  `url`   VARCHAR(50)
) engine=InnoDB;

INSERT INTO test (`lang`, `title`, `url`) VALUES
  ('pt', 'Livro 1', 'o294jl'),
  ('en', 'Book 1', 'o294jl'),
  ('en', 'Book 2', 'o294jl');

Executing the query

SELECT lang, title FROM test GROUP BY url;

returns

| lang | title   |
| ---- | ------- |
| pt   | Livro 1 |

Which is the first row in insertion order (using the clustered index).

If we now define an index on (url, lang, title)

ALTER TABLE test ADD INDEX url_lang_title (url, lang, title);

The same SELECT query returns

| lang | title  |
| ---- | ------ |
| en   | Book 1 |

which is the first row in the new url_lang_title index.

View on DB Fiddle

As you can see: Having exactly same data and exactly same query – MySQL can return different results. And even if you don't change the indices, you can't rely on a particular index being chosen. The engine can choose another index for many other reasons.

The moral of the story: Don't ask what the engine will return. Instead tell it exactly what you want it to return by writing deterministic queries.

Paul Spiegel
  • 30,925
  • 5
  • 44
  • 53