mysql sum over derived table is faster than sum up without it

Question

I cannot understand why the first query, which is using a derived table, is slower than the second one. My table:

  CREATE TABLE `test` (
  `someid` binary(16) NOT NULL,
  `indexedcolumn1` int(11) NOT NULL,
  `indexedcolumn2` int(10) unsigned NOT NULL,
  `data` int(11) NOT NULL,
  KEY `indexedcolumn1` (`indexedcolumn1`),
  KEY `indexedcolumn2` (`indexedcolumn2`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

This table contains 4.514.856 rows

The faster query:

SELECT SUM(isSame) AS same, SUM(isDifferent) AS diff, SUM(isNotSet) AS notSet, indexedcolumn1 FROM (
      SELECT   
        CASE WHEN t.indexedcolumn1 = t.data   
           THEN 1   
           ELSE 0   
        END AS isSame,  
        CASE WHEN t.indexedcolumn1 != t.data   
           THEN 1   
           ELSE 0   
        END AS isDifferent,  
        CASE WHEN t.data = 0  
           THEN 1   
           ELSE 0   
        END AS isNotSet,  
        indexedcolumn1
      FROM
              test as t 
      WHERE
          t.indexedcolumn2 >= 10000000
)AS tempTable GROUP BY indexedcolumn1;

Result:

72 rows in set (4.70 sec)

The slower query:

SELECT   
SUM(CASE WHEN t.indexedcolumn1 = t.data   
   THEN 1   
   ELSE 0   
END) AS same,  
SUM(CASE WHEN t.indexedcolumn1 != t.data   
   THEN 1   
   ELSE 0   
END) AS diff,  
SUM(CASE WHEN t.data = 0  
   THEN 1   
   ELSE 0   
END) AS notSet,  
indexedcolumn1
FROM
      test as t 
WHERE
  t.indexedcolumn2 >= 10000000
GROUP BY indexedcolumn1;

Result:

72 rows in set (5.90 sec)

I thought you should avoid a derived table whenever its possible. Even EXPLAIN does not give any hint: for query1:

+----+-------------+------------+------+----------------+------+---------+------+---------+---------------------------------+
| id | select_type | table      | type | possible_keys  | key  | key_len | ref  | rows    | Extra                           |
+----+-------------+------------+------+----------------+------+---------+------+---------+---------------------------------+
|  1 | PRIMARY     | <derived2> | ALL  | NULL           | NULL | NULL    | NULL | 2257428 | Using temporary; Using filesort |
|  2 | DERIVED     | t          | ALL  | indexedcolumn2 | NULL | NULL    | NULL | 4514856 | Using where                     |
+----+-------------+------------+------+----------------+------+---------+------+---------+---------------------------------+

for query 2:

+----+-------------+-------+-------+---------------------------+------------+---------+------+---------+-------------+
| id | select_type | table | type  | possible_keys             | key        | key_len | ref  | rows    | Extra       |
+----+-------------+-------+-------+---------------------------+------------+---------+------+---------+-------------+
|  1 | SIMPLE      | t    | index | indexedcolumn1,indexedcolumn2 | indexedcolumn1 | 4       | NULL | 4514856 | Using where |
+----+-------------+-------+-------+---------------------------+------------+---------+------+---------+-------------+

I also tried the tests several times, always with the same result: The first query was faster.... But why? The results are the same.

EDIT: I did a additional test: I removed the where clause. Even then I get better results for the first query (EXPLAIN):

+----+-------------+------------+------+---------------+------+---------+------+---------+---------------------------------+
| id | select_type | table      | type | possible_keys | key  | key_len | ref  | rows    | Extra                           |
+----+-------------+------------+------+---------------+------+---------+------+---------+---------------------------------+
|  1 | PRIMARY     | <derived2> | ALL  | NULL          | NULL | NULL    | NULL | 4514856 | Using temporary; Using filesort |
|  2 | DERIVED     | t          | ALL  | NULL          | NULL | NULL    | NULL | 4514856 | NULL                            |
+----+-------------+------------+------+---------------+------+---------+------+---------+---------------------------------+

Explain Query 2:

+----+-------------+-------+-------+---------------+------------+---------+------+---------+-------+
| id | select_type | table | type  | possible_keys | key        | key_len | ref  | rows    | Extra |
+----+-------------+-------+-------+---------------+------------+---------+------+---------+-------+
|  1 | SIMPLE      | t     | index | indexedcolumn1  | indexedcolumn1 | 4       | NULL | 4514856 | NULL  |
+----+-------------+-------+-------+---------------+------------+---------+------+---------+-------+

score 0 · Answer 1 · answered Feb 02 '18 at 17:34

0

I am initially surprised at the difference in performance. The derived table incurs the overhead of materialization. MySQL might combine that with the first step of the sorting (used for GROUP BY), so it might make no difference with ORDER BY.

Given that you are only working with 72 rows, the overhead for the GROUP BY would seem to be minimal, so I would expect the two to be pretty similar.

But the key is in the index usage. The first version uses an index to filter the data -- essentially looking up each of the 72 rows -- and then doing the group by. I'm surprised this takes multiple seconds.

The second is using the index on the group by. This saves a sort, but it requires a full table scan.

answered Feb 02 '18 at 17:34

Gordon Linoff

1,242,037
58
646
786

Thanks for your reply. The initial table contain currently 4.514.856 rows, only the result table is containing 72 rows. I still do not get the point. Why is the first query faster? BEcuase it is using the other index? – holodoc Feb 02 '18 at 17:40
What about signed vs unsigned. I've read that [unsigned int can increase performance](https://rakesh.sankar-b.com/2010/08/25/mysql-unsigned-int-to-signed-int-performance-tips-index/) but it looks like the index is not actually used in query 1. What about int(10) vs int(11)? I can't find anything that supports smaller display ints improve performance as well. – Larry Beasley Feb 02 '18 at 18:43
@mly . . . The first query uses the index to pull out the 72 rows and then does the aggregation. The second scans all the data (well, the index) to set the data up for the aggregation. The filtering happens during the scan. – Gordon Linoff Feb 03 '18 at 03:18
@Gordon: So when I got you correctly it would be different when I would remove the WHERE clause? I tested this and got even worse results with the second query. In this case the first query is not even using one index (because of the missing where clause). – holodoc Feb 04 '18 at 15:06
@mly . . . The more rows being aggregated, the slower it will be. – Gordon Linoff Feb 04 '18 at 22:30
@Gordan: But without the WHERE statement it cannot use any index for query 1 (see edit: EXPLAIN statement) .How could it still be faster than second query? – holodoc Feb 05 '18 at 15:05

score 0 · Answer 2 · answered Feb 11 '18 at 16:28

I have 5 comments, of varying degrees of relevance:

(1) Using a "covering" index is likely to be faster:

INDEX(indexedcolumn2, indexedcolumn1, data)

I expect this will make the non-subquery approach beat the other.

(2) You really should have a `PRIMARY KEY for any InnoDB table.

(3) Slight shortening:

    CASE WHEN t.indexedcolumn1 = t.data   
       THEN 1   
       ELSE 0   
    END AS isSame,

-->

    t.indexedcolumn1 = t.data  AS isSame,

and

SUM(CASE WHEN t.indexedcolumn1 = t.data   
   THEN 1   
   ELSE 0   
END) AS same,

-->

SUM(t.indexedcolumn1 = t.data) AS same,

(4) When running timings, run the query twice -- the first time may involve I/O (for caching) than the second.

(5) The query is in a gray area where the Optimizer does not have enough knowledge of the distribution of data to necessarily pick the best way to perform the query. The faster query made better use of an index to help with the filtering (WHERE) than the slower query, which banked on avoiding the 'sort' to deal with the GROUP BY.

Version 5.7 has a different "cost model" and may have picked the 'right' approach. What version are you using.

mysql sum over derived table is faster than sum up without it

2 Answers2