5

One of the things my app does a fair amount is:

select count(distinct id) from x;

with id the primary key for table x. With MySQL 5.1 (and 5.0), it looks like this:

mysql> explain SELECT count(distinct id) from x;
+----+-------------+----------+-------+---------------+-----------------+---------+------+---------+-------------+
| id | select_type | table    | type  | possible_keys | key             | key_len | ref  | rows    | Extra       |
+----+-------------+----------+-------+---------------+-----------------+---------+------+---------+-------------+
|  1 | SIMPLE      | x        | index | NULL          | ix_blahblahblah | 1       | NULL | 1234567 | Using index |
+----+-------------+----------+-------+---------------+-----------------+---------+------+---------+-------------+

On InnoDB, this isn't exactly blazing, but it's not bad, either.

This week I'm trying out MySQL 5.5.11, and was surprised to see that the same query is many times slower. With the cache primed, it takes around 90 seconds, compared to 5 seconds before. The plan now looks like this:

mysql> explain select count(distinct id) from x;
+----+-------------+----------+-------+---------------+---------+---------+------+---------+-------------------------------------+
| id | select_type | table    | type  | possible_keys | key     | key_len | ref  | rows    | Extra                               |
+----+-------------+----------+-------+---------------+---------+---------+------+---------+-------------------------------------+
|  1 | SIMPLE      | x        | range | NULL          | PRIMARY | 4       | NULL | 1234567 | Using index for group-by (scanning) |
+----+-------------+----------+-------+---------------+---------+---------+------+---------+-------------------------------------+

One way to make it go fast again is to use select count(id) from x, which is safe because id is a primary key, but I'm going through some abstraction layers (like NHibernate) that make this a non-trivial task.

I tried analyze table x but it didn't make any appreciable difference.

It looks kind of like this bug, though it's not clear what versions that applies to, or what's happening (nobody's touched it in a year yet it's "serious/high/high").

Is there any way, besides simply changing my query, to get MySQL to be smarter about this?

UPDATE:

As requested, here's a way to reproduce it, more or less. I wrote this SQL script to generate 1 million rows of dummy data (takes 10 or 15 minutes to run):

delimiter $$
drop table if exists x;
create table x (
  id integer unsigned not null auto_increment,
  a integer,
  b varchar(100),
  c decimal(9,2),
  primary key (id),
  index ix_a (a),
  index ix_b (b),
  index ix_c (c)
) engine=innodb;
drop procedure if exists fill;
create procedure fill()
begin
  declare i int default 0;
  while i < 1000000 do
    insert into x (a,b,c) values (1,"one",1.0);
    set i = i+1;
  end while;
end$$
delimiter ;
call fill();

When it's done, I observe this behavior:

  • 5.1.48
    • select count(distinct id) from x
      • EXPLAIN is: key: ix_a, Extra: Using index
      • takes under 1.0 sec to run
    • select count(id) from x
      • EXPLAIN is: key: ix_a, Extra: Using index
      • takes under 0.5 sec to run
  • 5.5.11
    • select count(distinct id) from x
      • EXPLAIN is: key: PRIMARY, Extra: Using index for group-by
      • takes over 7.0 sec to run
    • select count(id) from x
      • EXPLAIN is: key: ix_a, Extra: Using index
      • takes under 0.5 sec to run

EDIT:

If I modify the query in 5.5 by saying

select count(distinct id) from x force index (ix_a);

it runs much faster. Indexes b and c also work (to varying degrees), and even forcing index PRIMARY helps.

Ken
  • 516
  • 1
  • 5
  • 11
  • Are the two x tables exactly the same? Please run `SHOW CREATE TABLE x\G` on both DBs and post the results. I'm particularly interested in the ix_blahblahblah index which was used in 5.1 but not 5.5. – Ike Walker Apr 19 '11 at 18:30
  • Ike: I don't want to post the actual code I'm working on, but I'll try to come up with something similar that exhibits this problem. – Ken Apr 19 '11 at 18:44
  • Ike: Simplified example added! The table is a bit smaller (fewer rows and fewer columns) than my actual data, but the performance drop in 5.5 is still almost 10x. – Ken Apr 19 '11 at 21:43
  • Are the configurations for 5.1 and 5.5 the same? If 5.5's got smaller limits for stuff like cache and key storage, you'd see results like this. – Marc B Apr 19 '11 at 22:59
  • Marc: They are the same. When I installed 5.5, I manually copied all of the parts of my 5.1 config that seemed like they would matter to query performance. Just now I copied the rest of the options, for good measure, and restarted, and I'm seeing the same performance as before. – Ken Apr 20 '11 at 00:20

6 Answers6

1

I'm not making any promises that this will be better but, as a possible work around, you could try:

SELECT COUNT(*)
    FROM (SELECT id
              FROM x
              GROUP BY id) t
Joe Stefanelli
  • 132,803
  • 19
  • 237
  • 235
1

I'm not sure why you need DISTINCT on a unique primary key. It looks like MySQL is viewing the DISTINCT keyword as an operator and losing the ability to make use of the index (as would any operation on a field.) Other SQL engines also sometimes don't optimize searches on expressions very well, so it's not a surprise.


I note your comment in another answer about this being an artifact of your ORM. Have you ever read the famous Leaky Abstractions blog by Joel Spolsky? I think you are there. Sometimes you end up spending more time straightening out the tool than you spend on the problem you're using the tool to solve.

dkretz
  • 37,399
  • 13
  • 80
  • 138
  • I'm not sure what your point is. My ORM is less of a problem than my RDBMS. I can obviously bypass any part of the abstraction layer easily except SQL (as Joel observes), but even if I could put inline assembly language in my SQL statements, that wouldn't be the *right* answer here. MySQL 5.5 is significantly worse than MySQL 5.1 (and 5.0 and Postgres and everything else I've tried) in some obvious cases, and "stay with an old version" is a very easy solution. Did you have a better answer here? – Ken Apr 20 '11 at 00:34
  • I first answered your question assuming it was described in the title; when in fact your question is not about mysql (the answer being to leave "DISTINCT" out of your query) where the real problem seems to be NHibernate. But if it has an exit to direct SQL insertion, why not just use it? Ultimately NHibernate is going to need to adapt to MySQL and not vice versa. In your context SQL insertion would be the equivalent of assembly insertion. Maybe for you staying with an old version is preferable, dunno. YMMV. But it's not like MySQL is doing something illegitimate. – dkretz Apr 20 '11 at 01:11
  • I don't know why you say it's "not about mysql". It's absolutely about MySQL, with the requirement that I can't modify the query arbitrarily. Obviously the question is not simply "how many records are in this table?", since even at 90 seconds that's far less time than I spent typing this question. My system builds arbitrary queries where sometimes DISTINCT is necessary, and the easiest place to recognize this is in MySQL, and in fact MySQL up to 5.1 seemed to do exactly this. – Ken Apr 20 '11 at 17:04
  • As I (almost) said above, there's an incomapibility ("impedance mismatch") between NHibernate and MySQL. You can blame either one. You apparently choose to blame MySQL. Although I can hardly see how they should be expected to anticipate the vast variety of expressions that can be created, including (as in this case) redundancies, and treat them all equally effectively. – dkretz Apr 20 '11 at 18:30
  • I'm blaming MySQL because in non-trivial cases DISTINCT is required (I can post a big and complex case that exhibits the same problem as this simple case, if you really want). I think MySQL can be expected to anticipate this *particular* case because every version of MySQL before this did, and all the release notes and changelogs for MySQL 5.5 indicate that it has improved performance, with no mention of any regressions. If I had upgraded NHibernate and performance got an order-of-magnitude worse, I would blame NHibernate instead. – Ken Jun 15 '11 at 01:13
1

I dont know if you have realiased, but counting the rows on a large database with InnoDB is slow, even without the distinct keyword. InnoDB does not cache the rowcount in the table metadata, MyISAM does.

I would suggest you do one of two things

1) create a trigger that inserts/updates distinct counts into another table on insertion.

2) slave another MySQL server to your database, but change the table type on the slave only, to MyISAM and perform your query there (this is probarbly overkill).

Geoffrey
  • 10,843
  • 3
  • 33
  • 46
  • I have realized that, and configuration #2 is in fact what we have now. (#1 isn't an option because it's not just one query we want to run; this was simply an example.) It's a big kluge that's a minor nightmare to maintain and we're trying to get away from it. For straight up counts of table rows, the query cache can mitigate most of this, but it's not a perfect solution. – Ken Jun 15 '11 at 01:16
0

I may be missreading your question, but if id is the primary key of table x, then the following two queries are logically equivalent:

select count(distinct id) from x;

select count(*) from x;

...regardless of whether the optimizer realizes this. Distinct generally implies a sort or scanning the index in order, which is considerably slower than just counting the rows.

Ronnis
  • 12,593
  • 2
  • 32
  • 52
  • Yes, this is true. My ORM and other tools generate the former. MySQL 5.5 (completely unlike 5.1 and 5.0) don't seem to understand that it's the same as the latter. I'm asking for a way to convince MySQL of this fact. – Ken Apr 19 '11 at 18:31
  • @ken, ah, I missed the part about the DAL. Can't you configure the DAL to understand that the id is unique, thus dropping the distinct? Or is the real case more complex than your description? – Ronnis Apr 19 '11 at 18:35
  • It already knows it's a PK. If there's some magic flag to make NH skip the DISTINCT in this case I'm all ears! (I'm downloading the source code for that right now... :-) – Ken Apr 19 '11 at 18:40
  • @ken, I've never used nhibernate. I added the NH tag to your question, it may attract the right people. Can you also share the NH related code that generates the query? – Ronnis Apr 19 '11 at 18:47
  • We've got a fair amount of infrastructure around NHibernate -- and to be fair, it's probably how we're using NH, not NH itself. That's why I'm looking for a MySQL solution first. – Ken Apr 19 '11 at 19:05
  • This is one of those cases where you need to insist that your abstraction layer has the ability to exit to a direct function call in the underlying layer. Don't you hate it when you spend more time on the tool than on the problem? – dkretz Apr 19 '11 at 22:46
  • le dorfier: I'd love to "insist" on that but I don't know any RDBMS in the world that offers that so it probably means migrating a few hundred thousand lines of code and a medium-sized programming team to a new database technology, and I was hoping for something a bit less involved this week. :-) – Ken Apr 20 '11 at 00:37
0

Creative use of autoincrement fields
Note that your id is autoincrement.
It will add +1 after each insert.

However it does not reuse numbers, so if you delete a row you need to track of that.
My idea goes something like this.

 Count(rows) = Max(id) - number of deletions - starting(id) + 1

Scenario using update

Create a separate table with the totals per table.

table counts 
  id integer autoincrement primary key
  tablename varchar(45)  /*not needed if you only need to count 1 table*/
  start_id integer default maxint
  delete_count 

Make sure you extract the starting_id before the first delete(!) into the table and do

INSERT INTO counts (tablename, start_id, delete_count)
  SELECT 'x', MIN(x.id), 0
  FROM x;

Now create a after delete trigger.

DELIMITER $$

CREATE TRIGGER ad_x_each AFTER DELETE ON x FOR EACH ROW
BEGIN
  UPDATE counts SET delete_count = delete_count + 1 WHERE tablename = 'x';
END $$

DELIMITER ;

IF you want to have the count, you do

SELECT max(x.id) - c.start_id + 1 - c.delete_count as number_of_rows
FROM x 
INNER JOIN counts c ON (c.tablename = 'x') 

This will give you your count instantly, with requiring a trigger to count on every insert.

insert scenario

If you have lots of deletes, you can speed up the proces by doing an insert instead of an update in the trigger and selecting

TABLE count_x  /*1 counting table per table to keep track of*/
  id integer autoincrement primary key /*make sure this field starts at 1*/
  start_id integer default maxint  /*do not put an index on this field!*/

Seed the starting id into the count table.

INSERT INTO counts (start_id) SELECT MIN(x.id) FROM x;

Now create a after delete trigger.

DELIMITER $$

CREATE TRIGGER ad_x_each AFTER DELETE ON x FOR EACH ROW
BEGIN
  INSERT INTO count_x (start_id) VALUES (default);     
END $$

DELIMITER ;

SELECT max(x.id) - min(c.start_id) + 1 - max(c.id) as number of rows
FROM x
JOIN count_x as c  ON (c.id > 0)

You'll have to test which approach works best for you.

Note that in the insert scenario you don't need delete_count, because you are using the autoincrementing id to keep track of the number of deletions.

Johan
  • 74,508
  • 24
  • 191
  • 319
  • I think a trigger-based mechanism could work, but that it's more complex than this. MyISAM does not reuse numbers, but InnoDB does a SELECT MAX on server startup, so if the server happens to be restarted between a DELETE and an INSERT, it will indeed re-use autoinc IDs. – Ken Jun 15 '11 at 01:24
0
select count(*)
from ( select distinct(id) from x)
Marek Grzenkowicz
  • 17,024
  • 9
  • 81
  • 111
tonng
  • 1