1

I have the below table (simplified) where there are approximately 2 million records in the table, with tickets.spam = FALSE for the majority of records, and only 31 records with tickets.spam = TRUE.

CREATE TABLE tickets (
  id INT PRIMARY KEY,
  account_id INT,
  w_id INT,
  spam BOOLEAN,
  due_by DATE
);

The below query returns data in under 200ms.

SET SESSION query_cache_type = OFF;
SELECT SQL_NO_CACHE tickets.*
FROM tickets
WHERE tickets.account_id = 95
  AND tickets.w_id IN (2, 3, 4, 5, 6)
  AND tickets.spam = TRUE
LIMIT 25 OFFSET 0;

This is the query plan

id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE tickets p127 ref --- index_acc_due_ws_id 8 const 1049925 5.00 Using index condition; Using where

But if change the limit to 26 instead of 25 the execution time goes for 5 seconds.

FROM tickets
WHERE tickets.account_id = 95
  AND tickets.w_id IN (2, 3, 4, 5, 6)
  AND tickets.spam = TRUE
LIMIT 26 OFFSET 0;
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE tickets p127 ref --- index_acc_due_ws_id 8 const 1049925 5.00 Using index condition; Using where

And index_acc_due_ws_id is an index on account_id, due_by, w_id, and id in that order.

My understanding is when the LIMIT is greater than the available number of records it might default to a full table scan to fetch the required number of records. However, since we have 31 records I'm not sure why this happens at 26 records.

Also, how do we effectively work with columns that have a huge data skew?

thebenman
  • 1,621
  • 14
  • 35

1 Answers1

1

It's hard to tell exactly why you had such a big discontinuity in execution times without using EXPLAIN ANALYZE FORMAT=TREE, but that is not available in MySQL 5.7.

You do need to include your spam column in your multicolumn index, especially considering your WHERE ... spam = TRUE filter gets rid of the vast majority of your rows. Try this.

CREATE INDEX spam_account_wid ON tickets(account_id, spam, w_id);

Once this index is in place, you may be able to speed things up another little bit by doing tickets.w_id BETWEEN 2 AND 6 instead of using IN(2,3,4,5,6). It does a range scan on the last column of the multi-column index.

O. Jones
  • 103,626
  • 17
  • 118
  • 172