46

I have a table with more than 100 million rows in InnoDB.

I need to know whether there are more than 5000 rows where the foreign key = 1. I don't need the exact number.

I did some testing:

SELECT COUNT(*) FROM table WHERE fk = 1 => 16 seconds
SELECT COUNT(*) FROM table WHERE fk = 1 LIMIT 5000 => 16 seconds
SELECT primary FROM table WHERE fk = 1 => 0.6 seconds

The third query means more network traffic and processing time on my side, but it could save up to 15.4 seconds!

Do you have a better idea?

Thanks

Edit: [Added OP's relevant comments]

I tried SELECT SQL_NO_CACHE COUNT(fk) FROM table WHERE fk = 1 but it took 25 seconds.

MySQL was tuned for InnoDB with MySQL Tuner.

CREATE TABLE table (
  pk bigint(20) NOT NULL AUTO_INCREMENT,
  fk tinyint(3) unsigned DEFAULT '0',
  PRIMARY KEY (pk),
  KEY idx_fk (fk) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=100380914 DEFAULT CHARSET=latin1

DB Stuff:

have_innodb = YES
ignore_builtin_innodb = OFF
innodb_adaptive_hash_index = ON
innodb_additional_mem_pool_size = 20971520
innodb_autoextend_increment = 8
innodb_autoinc_lock_mode = 1
innodb_buffer_pool_size = 25769803776
innodb_checksums = ON
innodb_commit_concurrency = 0
innodb_concurrency_tickets = 500
innodb_data_file_path = ibdata1:10M:autoextend
innodb_data_home_dir = ''
innodb_doublewrite = ON
innodb_fast_shutdown = 1
innodb_file_io_threads = 4
innodb_file_per_table = OFF
innodb_flush_log_at_trx_commit = 1
innodb_flush_method = ''
innodb_force_recovery = 0
innodb_lock_wait_timeout = 50
innodb_locks_unsafe_for_binlog = OFF
innodb_log_buffer_size = 8388608
innodb_log_file_size = 26214400
innodb_log_files_in_group = 2
innodb_log_group_home_dir = ./
innodb_max_dirty_pages_pct = 90
innodb_max_purge_lag = 0
innodb_mirrored_log_groups = 1
innodb_open_files = 300
innodb_rollback_on_timeout = OFF
innodb_stats_on_metadata = ON
innodb_support_xa = ON
innodb_sync_spin_loops = 20
innodb_table_locks = ON
innodb_thread_concurrency = 8
innodb_thread_sleep_delay = 10000
innodb_use_legacy_cardinality_algorithm = ON

Update 2015: I've been using the same method up to now, with 600 million rows and 640,000 new rows per day. It's still working fine.

hotips
  • the count would go faster if you chose a column in the `COUNT()`, as such: `SELECT COUNT(fk) FROM table WHERE fk = 1` – ClydeFrog Jun 11 '12 at 08:06
  • take a look at [this website](http://www.mysqlperformanceblog.com/2007/04/10/count-vs-countcol/) for more information – ClydeFrog Jun 11 '12 at 08:07
  • @ClydeFrog: Really? According to [the manual](http://dev.mysql.com/doc/en/group-by-functions.html#function_count), *`COUNT(*)` is optimized to return very quickly if the `SELECT` retrieves from one table, no other columns are retrieved, and there is no `WHERE` clause*. Indeed, the blog to which you linked suggests that `COUNT(*)` is faster than `COUNT(column)`. – eggyal Jun 11 '12 at 08:07
  • I tested it now and the time spent was the same: 16 seconds – hotips Jun 11 '12 at 08:08
  • `SELECT COUNT(*) FROM table WHERE fk = 1` needs 16 seconds? Do you have an index on `fk`? – ypercubeᵀᴹ Jun 11 '12 at 08:11
  • Yes, sure. It's a 100-million-row table... I'm not connected locally for my test here ;-) The VPN is not very fast. – hotips Jun 11 '12 at 08:20
  • What happens when you do a `SELECT SQL_NO_CACHE COUNT(fk) FROM table WHERE fk = 1` and you have an index defined on `fk`? – Salman A Jun 11 '12 at 08:41
  • "I have to know if there is more than 5000 rows where the foreign key = 1. I don't need the exact number." How does `SELECT COUNT(fk) > 5000 FROM table WHERE fk = 1` perform? – jensgram Jun 11 '12 at 08:43
  • Have you done any InnoDB configuration optimization or is it running out of the box? – N.B. Jun 11 '12 at 08:47
  • try with `SELECT COUNT(1) FROM table WHERE fk = 1` and let me know how many seconds it takes? – Fahim Parkar Jun 11 '12 at 08:47
  • As @N.B. implied: Default MySQL settings are not optimized for InnoDB. Perhaps you should check them. – ypercubeᵀᴹ Jun 11 '12 at 08:52
  • @si2w As counting happens locally in the server, you don't need a fast connection for that. – glglgl Jun 11 '12 at 08:57
  • I tried SELECT SQL_NO_CACHE COUNT(fk) FROM table WHERE fk = 1 but it took 25 seconds... – hotips Jun 11 '12 at 08:59
  • MySQL was tuned for InnoDB with MySQL Tuner. In production it's faster, but still too slow for me... – hotips Jun 11 '12 at 09:00
  • Can you add the table's definition in the question? – ypercubeᵀᴹ Jun 11 '12 at 09:00
  • Please post the output of `SHOW CREATE TABLE table;` and `SHOW VARIABLES LIKE '%innodb%';` so we can see the necessary info first hand. – N.B. Jun 11 '12 at 09:21

7 Answers

31

You don't seem interested in the actual count, so give this a try:

SELECT 1 FROM table WHERE fk = 1 LIMIT 5000, 1

If a row is returned, you have 5000 or more records. I presume the fk column is indexed.
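
If you would rather have the server return 0 or 1 than check for an empty resultset client-side, here is a minimal sketch wrapping the same probe in a derived table (the alias probe and the column name has_5000_plus are mine):

SELECT COUNT(*) AS has_5000_plus
FROM (SELECT 1 FROM table WHERE fk = 1 LIMIT 5000, 1) AS probe;

This returns 1 when more than 5000 rows match and 0 otherwise, and it still stops scanning after 5001 index entries.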

Salman A
  • This is interesting. Have you tested such a solution, and does it perform well? – ypercubeᵀᴹ Jun 11 '12 at 08:55
  • @ypercube: I checked on dummy data with 3M rows, no index on fk and consistently got results in < 1s (first run was ~3s). This query heavily depends on distribution of the data so YMMV. – Salman A Jun 11 '12 at 09:41
  • Here's how that query works: Read 5001 'rows' from an index containing `fk` (or the data, if no index). If those rows happen to be in RAM, it is a reasonably quick CPU task. If those rows are on disk, it could take longer; however, they would be brought into cache, ready for your next query to use them. In this latter case, the cost of the check was "free". – Rick James Nov 04 '18 at 16:38
23

A counter table or some other caching mechanism is the solution:

InnoDB does not keep an internal count of rows in a table because concurrent transactions might “see” different numbers of rows at the same time. To process a SELECT COUNT(*) FROM t statement, InnoDB scans an index of the table, which takes some time if the index is not entirely in the buffer pool. If your table does not change often, using the MySQL query cache is a good solution. To get a fast count, you have to use a counter table you create yourself and let your application update it according to the inserts and deletes it does. If an approximate row count is sufficient, SHOW TABLE STATUS can be used. See Section 14.3.14.1, “InnoDB Performance Tuning Tips”.
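
As a sketch of what such a counter table might look like for the question's schema, maintained here by triggers (the names fk_counts, t_ins, and t_del are hypothetical, and an application-side update would work just as well):

CREATE TABLE fk_counts (
  fk  tinyint unsigned NOT NULL PRIMARY KEY,
  cnt bigint unsigned NOT NULL DEFAULT 0
) ENGINE=InnoDB;

CREATE TRIGGER t_ins AFTER INSERT ON table FOR EACH ROW
  INSERT INTO fk_counts (fk, cnt) VALUES (NEW.fk, 1)
  ON DUPLICATE KEY UPDATE cnt = cnt + 1;

CREATE TRIGGER t_del AFTER DELETE ON table FOR EACH ROW
  UPDATE fk_counts SET cnt = cnt - 1 WHERE fk = OLD.fk;

-- The "more than 5000" check then becomes a single-row lookup:
SELECT cnt > 5000 FROM fk_counts WHERE fk = 1;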

scriptin
  • I have a WHERE condition => SHOW TABLE STATUS won't help. I have 400,000 new rows every day... I'm lucky! – hotips Jun 11 '12 at 09:01
  • @si2w I didn't mean to advise using `SHOW TABLE STATUS`. I was talking about counter tables and caching. – scriptin Jun 11 '12 at 09:38
11

I gotta add another Answer -- I have many corrections/additions to the comments and Answers so far.

For MyISAM, SELECT COUNT(*) without WHERE is dead-reckoned -- very fast. All other situations (including the InnoDB table in the Question) must count through either the data's BTree or an index's BTree to get the answer. So we need to see how much there is to count through.

InnoDB caches data and index blocks (16KB each). But when the table's data or index BTree is bigger than innodb_buffer_pool_size, you are guaranteed to hit the disk. Hitting the disk is almost always the slowest part of any SQL.

The Query Cache, when involved, usually results in query times of about 1 millisecond; this does not seem to be an issue with any of the timings quoted. So I won't dwell on it.

But... Running the same query twice in a row will often exhibit:

  • First run: 10 seconds
  • Second run: 1 second

This is symptomatic of the first run having to fetch most of the blocks from disk, while the second found it all in RAM (the buffer_pool). I suspect that some of the timings listed are bogus because of not realizing this caching issue. (16 sec vs 0.6 sec may be explained by this.)

I will harp on "disk hits" or "blocks needed to be touched" as the real metric of which SQL is faster.

COUNT(x) checks x for IS NOT NULL before tallying. This adds a tiny amount of processing, but does not change the number of disk hits.

The proffered table has a PK and a second column. I wonder if that is the real table?? It makes a difference --

  • If the Optimizer decides to read the data -- that is, scan in PRIMARY KEY order -- it will be reading the data BTree, which is usually (but not in this lame example) much wider than secondary index BTrees.
  • If the Optimizer decides to read a secondary index (but not need to do a sort), there will be fewer blocks to touch. Hence, faster.

Comments on the original queries:

SELECT COUNT(*) FROM table WHERE fk = 1 => 16 seconds
    -- INDEX(fk) is optimal, but see below
SELECT COUNT(*) FROM table WHERE fk = 1 LIMIT 5000 => 16 seconds
    -- the LIMIT does nothing, since there is only one row in the result
SELECT primary FROM table WHERE fk = 1 => 0.6 seconds
    -- Again INDEX(fk), but see below

WHERE fk = 1 begs for INDEX(fk, ...), preferably just INDEX(fk). Note that in InnoDB, each secondary index contains a copy of the pk. That is, INDEX(fk) is effectively INDEX(fk, primary). Hence, the 3rd query can use that as "covering" and not need to touch the data.
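
A quick way to check this covering behavior, assuming the question's schema: EXPLAIN should report "Using index" in the Extra column when the query is satisfied from the secondary index alone.

EXPLAIN SELECT pk FROM table WHERE fk = 1;
-- Extra: Using index  => no data-BTree blocks are touched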

If the table is truly just the two columns then probably the secondary index BTree will be fatter than the data BTree. But in realistic tables, the secondary index will be smaller. Hence an index scan will be faster (fewer blocks to touch) than a table scan.

The third query is also delivering a large resultset; this could cause the query to take a long time -- but it won't be included in the quoted "time"; it is network time, not query time.

innodb_buffer_pool_size = 25,769,803,776 I would guess that the table and its secondary index (from the FK) are each about 3-4GB. So, any timing might first have to load a lot of stuff. Then a second run would be entirely cached. (Of course, I don't know how many rows have fk=1; presumably less than all the rows?)
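
One way to sanity-check that guess (the table name follows the question's placeholder; information_schema figures are estimates):

SELECT data_length/1024/1024/1024  AS data_gb,
       index_length/1024/1024/1024 AS index_gb
FROM information_schema.TABLES
WHERE table_name = 'table' AND table_schema = DATABASE();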

But... At 600M rows, the table and its index are each approaching the 25GB buffer_pool. So, the day may come soon that it becomes I/O bound -- this will make you wish to get back to 16 (or 25) seconds; yet you won't be able to. We can then talk about alternatives to doing the COUNT.

SELECT 1 FROM tbl WHERE fk = 1 LIMIT 5000,1 -- Let's analyze this. It will scan the index, but it will stop after 5000 rows. If all you need is "more than 5K", that is the best way to get it. It will be consistently fast (touching only a dozen blocks), regardless of the total number of rows in the table. (It is still subject to buffer_pool_size and the cache characteristics of the system. But a dozen blocks takes much less than a second, even with a cold cache.)

MariaDB's LIMIT ROWS_EXAMINED may be worth looking into. Without that, you could do

SELECT COUNT(*) AS count_if_less_than_5K
    FROM ( SELECT 1 FROM tbl WHERE fk = 1 LIMIT 5000 ) AS x;  -- the derived table needs an alias

It may be faster than delivering the rows to the client; it will have to collect the rows internally in a tmp table, but deliver only the COUNT.

A side note: 640K rows inserted per day -- this approaches the limit for single-row INSERTs in MySQL with your current settings on an HDD (not SSD). If you need to discuss the potential disaster, open another Question.

Bottom line:

  • Be sure to avoid the Query cache (by using SQL_NO_CACHE or turning the QC off); a sketch follows this list.
  • Run any timing query twice; use the second time.
  • Understand the structure and size of the BTree(s) involved.
  • Don't use COUNT(x) unless you need the null check.
  • Do not use PHP's mysql_* interface; switch to mysqli_* or PDO.
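
Putting the first two bullets together, a repeatable timing run might look like this (SQL_NO_CACHE bypasses the Query Cache for that one statement):

SELECT SQL_NO_CACHE COUNT(*) FROM table WHERE fk = 1;  -- first run: may hit disk
SELECT SQL_NO_CACHE COUNT(*) FROM table WHERE fk = 1;  -- second run: use this timing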
Rick James
3

It is an old question, but I had the same issue and maybe this will help someone: with 4 million records the COUNT query took more than 20 seconds. In my case, after I added simple filtering by the primary key it became much faster and took only 4 seconds. So the final query is:

SELECT COUNT(*) FROM Table
WHERE PK > 0;

And in my case the PK was INT.

O. V.
  • Actually this answer led me to a solution for a similar problem that I had. And the simple reason is: in InnoDB, SELECT COUNT(*) without a WHERE must do an index scan. The Optimizer will pick the 'smallest' index, which in your case is the PRIMARY KEY. – kcm Dec 12 '22 at 13:21
0

If you are using PHP, you could call mysql_num_rows on the result of SELECT primary FROM table WHERE fk = 1 (the 0.6-second query); I think that will be efficient.

But it depends on what server-side language you are using.

nischayn22
  • C# with the latest official driver. I think the driver gives a cursor to the data, so I can get the row count without having to retrieve the whole dataset. – hotips Jun 11 '12 at 08:09
  • @si2w There are 2 ways to get data from the server: `mysql_store_result()`, where the whole result set is sent to the client and you can count it, and `mysql_use_result()`, where the data is sent as needed, but all data has to be fetched before issuing other commands. – glglgl Jun 11 '12 at 08:59
  • Can not confirm this answer. In my case the `COUNT()` takes 1.6s, and a plain SELECT with `mysql_num_rows`, without retrieving the data, takes 1.8s. – mgutt Apr 17 '15 at 10:01
  • This is a terrible idea; if you have a large table, let's say GBs of data, a standard PHP setup would run out of memory in an instant! – Ben Osborne Oct 01 '18 at 12:27
0

If you're not interested in knowing the number of rows and you just want to test the COUNT against some value, you can use the standard script below:

SELECT 'X'
FROM mytable
WHERE myfield='A'
HAVING COUNT(*) >5

This will return a single row or no row at all, depending on whether the condition is met.

This script is ANSI compliant and could, in principle, complete without evaluating the full value of COUNT(*). If MySQL implements an optimization to stop evaluating rows after the condition is met (I really hope it does), then you'll get a performance improvement. Unfortunately I can't test this behavior myself because I don't have a big MySQL database available. If you do this test, please share the result here :)
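
Adapted to the question's schema and its 5000-row threshold, that would be:

SELECT 'X'
FROM table
WHERE fk = 1
HAVING COUNT(*) > 5000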

Gerardo Lima
0

In the end, the fastest approach was to query the first X rows using C# and count them.

My application processes the data in batches. The amount of time between two batches depends on the number of rows that need to be processed.

SELECT pk FROM table WHERE fk = 1 LIMIT X

I got the result in 0.9 seconds.

Thanks all for your ideas!

hotips
  • I don't see how you counted the rows. Mind adding that code as well? – nischayn22 Jun 12 '12 at 15:01
  • My application processes the data in batches. The amount of time between two batches depends on the number of rows that need to be processed. – hotips Jun 14 '12 at 18:17
  • The potential problem with this approach is that it must transfer all `X` pks to the client. In some situations, that network time is significant. – Rick James Nov 04 '18 at 16:41
  • For example: if the limit is 50 million, the 50 million PK ids will be sent to the code. We need to store them in a variable, which will consume RAM. Then, if this process happens on multiple threads at the same time, you may run out of memory. Any updated solution? – vinieth Apr 17 '19 at 12:59