
I have read a lot of answers about the performance of counting records, for example https://stackoverflow.com/a/1332730/7906257, and they all say there is no simple solution. I use InnoDB and have a CL_GAME table with a PK primary key. Most importantly, I have a special use case that can definitely be optimized: records are only ever added to the table, never deleted. Is it possible to tell MySQL to skip its internal checks (like those explained in the link above) while counting records:

SELECT COUNT(pk) FROM CL_GAME WHERE pk <= 1072370;

There are some ideas, none of which is perfect:

  1. One obvious trick is to normalize the table and remove any gaps between pk values, so that each pk reflects the count of records before it. But this looks dangerous, because a rolled-back transaction can break it (ok, there are some non-trivial workarounds, like https://www.percona.com/blog/2011/11/29/avoiding-auto-increment-holes-on-innodb-with-insert-ignore/)

  2. Another idea is to have an additional table that contains pairs <pk, count>, where count is the number of records with pk <= the pk from the pair. That way only part of the table has to be counted, and pre-calculated values cover the rest (see the sketch after this list). But I'd prefer to avoid any additional structures/caches, because implementing, validating and supporting them takes time
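
For reference, a minimal sketch of what idea 2 could look like; the table name CL_GAME_COUNT and its columns are hypothetical, not part of the actual schema:

    -- Hypothetical checkpoint table: cnt = number of CL_GAME rows with pk <= checkpoint_pk
    CREATE TABLE CL_GAME_COUNT (
        checkpoint_pk BIGINT NOT NULL PRIMARY KEY,
        cnt           BIGINT NOT NULL
    );

    -- Count rows with pk <= 1072370: take the nearest checkpoint at or below the
    -- requested pk and scan only the remaining gap (returns no row if there is
    -- no checkpoint yet, so the caller should fall back to a plain COUNT).
    SELECT c.cnt + (SELECT COUNT(*)
                    FROM CL_GAME g
                    WHERE g.pk > c.checkpoint_pk
                      AND g.pk <= 1072370) AS total
    FROM CL_GAME_COUNT c
    WHERE c.checkpoint_pk <= 1072370
    ORDER BY c.checkpoint_pk DESC
    LIMIT 1;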

rupashka

3 Answers


How many rows do you anticipate? How often will the COUNT be needed? How precise does the number need to be?

  • If a precise answer is not needed, see Rows in TABLE STATUS.
  • InnoDB will use the 'smallest' index to do the counting, so add such an index. The "smallest" secondary index would probably be an index on the smallest column (other than the first column in the PK). A TINYINT column, if you have one, is only 1 byte.
  • A TRIGGER could do the counting.
  • If a slightly stale count is good enough, then a periodic COUNT(*) that is stored somewhere would work.
  • Or you could look at this as a simple special case of a "Summary Table" in Data Warehouse applications. This might hold daily counts; then you would only need to count today's rows to finish up the total (see the sketch after this list). (It would help to see SHOW CREATE TABLE. Then I could be more specific.)
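
A minimal sketch of the summary-table idea, purely illustrative: the CL_GAME_DAILY table and the created_at column on CL_GAME are assumptions, since the actual schema was not posted:

    -- Assumed helper table holding one pre-computed count per day
    CREATE TABLE CL_GAME_DAILY (
        day DATE   NOT NULL PRIMARY KEY,
        cnt BIGINT NOT NULL
    );

    -- Run once per day (cron job or MySQL EVENT) to summarize yesterday
    INSERT INTO CL_GAME_DAILY (day, cnt)
    SELECT DATE(created_at), COUNT(*)
    FROM CL_GAME
    WHERE created_at >= CURRENT_DATE - INTERVAL 1 DAY
      AND created_at <  CURRENT_DATE
    GROUP BY DATE(created_at);

    -- Total = sum of the summarized days + today's rows (the only part counted live)
    SELECT (SELECT COALESCE(SUM(cnt), 0) FROM CL_GAME_DAILY)
         + (SELECT COUNT(*) FROM CL_GAME WHERE created_at >= CURRENT_DATE) AS total;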
Rick James
  • Most of your suggestions don't work, because I need a summary with conditions like `pk <= 1072370`. What do you mean by the 'smallest' index? – rupashka Aug 20 '19 at 07:10
  • @rupashka - I augmented my Answer. – Rick James Aug 20 '19 at 22:06
  • I'd like to avoid tricks like triggers etc. In long-term applications such workarounds usually become problems, so I'm looking for the simplest solution. Yes, I think I can use a stale count, and `TABLE STATUS` can help. I'll post my final solution here – rupashka Aug 21 '19 at 12:34
  • Just found info about `TABLE STATUS`: *For other storage engines, such as InnoDB, this value is an approximation, and may vary from the actual value by as much as 40% to 50%* (https://dev.mysql.com/doc/refman/8.0/en/show-table-status.html). Sounds bad – rupashka Aug 21 '19 at 12:49

You can slightly improve your second option. You don't need a persistent table for the pk->count mapping; you would only ever use the most recent record from that table.

Instead of a table, you can keep this value in memory / in a service and update it periodically. It still requires some additional coding work, but you avoid repeatedly scanning the table to count old records.

Keep in mind that an uncommitted transaction may insert records that are not visible to other transactions. In other words, the pk value kept for this pk->count mapping should be less than any pk in active transactions.

Alexey Vlasov
  • 1. I was not clear about the second option: I suggested keeping pairs not for every pk, but with some gap (or even choosing pk values so that the counts follow some predefined step). So I could count not all rows, but only the rows between `SELECT MAX(PK) FROM CL_PRECALCULATED WHERE PK <= REQUIRED_PK` and `REQUIRED_PK`. 2. Yes, a service is a good idea, but I'd like to avoid it if possible. 3. Because records are inserted with an auto-increment PK, there is no problem with the pre-calculated values – rupashka Aug 20 '19 at 07:17
  • I don't think there is any other trick around this. In most cases, the need to count the number of table records smells bad. Getting the number of records is usually a pretty expensive operation, and I try to avoid such scenarios. If you need rate metrics, consider counting the rate in the service and don't touch the db. If you only need non-precise estimates, you have the options mentioned above – Alexey Vlasov Aug 20 '19 at 19:07
  • I have a game with a lot of players (100k+). It's possible to search for players with a *filter*, and I show the total number of players (if the filter is empty) or the number of filtered players otherwise. I'd like to show these numbers, because it's great for players to know how many players there are in the game... – rupashka Aug 20 '19 at 20:13
  • If so, are you planning to have disabled users? Inactive ones? Or other categories inaccessible to search? If so, you need to refine your search for totals. Also, I don't think you need a precise number of players. It doesn't really matter whether 100,001 or 100,005 players are available for search. – Alexey Vlasov Aug 20 '19 at 20:52
  • All users are going to fall into the results. That's a good point about the count not needing to be precise, thanks! – rupashka Aug 21 '19 at 12:27

In the end I decided that the precision of the count doesn't matter. There are several possible solutions:

  1. TABLE STATUS - bad, because for InnoDB this value is an approximation and may vary from the actual value by as much as 40% to 50%
  2. Keep a cached count value and the pk for which the count was calculated. After inserting a new record, count the records between the old pk (pk_old) and the pk of the inserted record (pk_new). Any such implementation has problems: the difference between the cached count and the real count can grow without limit because of rolled-back transactions, or because another transaction inserted a record between pk_old and pk_new but has not committed yet
  3. Increase the cached count after every insert. A transaction can still be rolled back, in which case the count stays wrong until the server restarts

I selected approach #3 because it's easy and the fastest (only one SQL query at server start, as sketched below).
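
A minimal sketch of approach #3, assuming the running count is held in application memory and incremented after each successful insert (the surrounding application code is not shown and is an assumption):

    -- Run once at server start to seed the in-memory counter; afterwards the
    -- application increments its copy on every insert, accepting that a
    -- rolled-back insert leaves the value slightly off until the next restart.
    SELECT COUNT(*) AS total_games FROM CL_GAME;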

rupashka