
I'd like to verify that my assumptions are right. I have two tables, which only differ in index order.

They look like this:

CREATE TABLE `ipcountry` (
    `id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
    `ipFROM` INT(10) UNSIGNED ZEROFILL NOT NULL DEFAULT '0000000000',
    `ipTO` INT(10) UNSIGNED ZEROFILL NOT NULL DEFAULT '0000000000',
    `countrySHORT` CHAR(2) NOT NULL DEFAULT '' COLLATE 'utf8_czech_ci',
    `countryLONG` VARCHAR(255) NOT NULL DEFAULT ' ' COLLATE 'utf8_czech_ci',
    PRIMARY KEY (`id`),
    INDEX `ipINDEX` (`ipTO`, `ipFROM`)
)
COLLATE='utf8_czech_ci'
ENGINE=InnoDB
AUTO_INCREMENT=2490331
;


CREATE TABLE `ipcountry2` (
    `id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
    `ipFROM` INT(10) UNSIGNED ZEROFILL NOT NULL DEFAULT '0000000000',
    `ipTO` INT(10) UNSIGNED ZEROFILL NOT NULL DEFAULT '0000000000',
    `countrySHORT` CHAR(2) NOT NULL DEFAULT '' COLLATE 'utf8_czech_ci',
    `countryLONG` VARCHAR(255) NOT NULL DEFAULT ' ' COLLATE 'utf8_czech_ci',
    PRIMARY KEY (`id`),
    INDEX `ipINDEX` (`ipFROM`, `ipTO`)
)
COLLATE='utf8_czech_ci'
ENGINE=InnoDB
AUTO_INCREMENT=2490331
;

Both tables contain the same number of rows, roughly 2,500,000.

When performing EXPLAIN SELECT * FROM `ipcountry` WHERE ipFROM<=3548978221 AND ipTO>=3548978221 I get

{
    "table": "UnknownTable",
    "rows":
    [
        {
            "id": 1,
            "select_type": "SIMPLE",
            "table": "ipcountry",
            "partitions": null,
            "type": "range",
            "possible_keys": "ipINDEX",
            "key": "ipINDEX",
            "key_len": "4",
            "ref": null,
            "rows": 83260,
            "filtered": 33.33,
            "Extra": "Using index condition"
        }
    ]
}

When performing EXPLAIN SELECT * FROM `ipcountry2` WHERE ipFROM<=3548978221 AND ipTO>=3548978221 I get

{
    "table": "UnknownTable",
    "rows":
    [
        {
            "id": 1,
            "select_type": "SIMPLE",
            "table": "ipcountry2",
            "partitions": null,
            "type": "ALL",
            "possible_keys": "ipINDEX",
            "key": null,
            "key_len": null,
            "ref": null,
            "rows": 2515343,
            "filtered": 16.66,
            "Extra": "Using where"
        }
    ]
}

Is it because of the precedence of operators?

Boann
Jiri
    For the second table, does it use the index then when you swap the conditions? – Progman Aug 04 '19 at 13:43
    Please include the *selectivity* of both predicates. That is, the percentage of rows each predicate would **separately** select, compared to the total rows of the table. To be clear this is predicate #1: `ipFROM <= 3548978221` and this is predicate #2: `ipTO >= 3548978221`. – The Impaler Aug 04 '19 at 14:23
  • @Progman If I swap predicates, the result remains the same. However, if I change `ipFROM<=3548978221 AND ipTO>=3548978221` to `ipFROM>=3548978221 AND ipTO<=3548978221` the result is different (but obviously, changing the condition is not something that's wanted) – Jiri Aug 04 '19 at 19:27

2 Answers


Notice in the first EXPLAIN:

        "key_len": "4",

This shows that the query reads only the first INT of the index (4 bytes) for the lookup. You can see that this lookup narrows the search from roughly 2.5M rows down to about 83K, about 30:1 selectivity.

        "rows": 83260,

When you have two range conditions as in your query, MySQL can't use both columns of the index for the B-tree search. It can do a B-tree search on the first column, but subsequent columns of the index can't be used in that search.
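A toy sketch of this limitation (in Python, with small made-up numbers standing in for IP ranges): entries in a composite B-tree are sorted by the first column, so a binary search can bound the scan only on that column, and the second range condition has to be checked entry by entry.

```python
import bisect

# Hypothetical miniature of a composite (ipFROM, ipTO) index: entries
# sorted by ipFROM first, then ipTO -- the same ordering a B-tree uses.
index = sorted([
    (10, 19), (20, 29), (30, 39), (40, 49), (50, 59),
])

target = 35

# The B-tree can seek on the FIRST column only: every entry with
# ipFROM <= target lies in a contiguous prefix of the sorted index.
upper = bisect.bisect_right(index, (target, float("inf")))
candidates = index[:upper]          # range scan: ipFROM <= 35

# The ipTO >= target condition cannot narrow the seek; it has to be
# checked row by row within that prefix (which is what ICP does
# inside the storage engine).
matches = [row for row in candidates if row[1] >= target]

print(candidates)  # three entries pass the first-column range
print(matches)     # only (30, 39) also satisfies ipTO >= 35
```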

Your query also filters by the other column at the storage engine level with index condition pushdown, indicated by the Extra note:

        "Extra": "Using index condition"

This isn't part of the B-tree search, but it helps a little bit by filtering out rows before they're returned from the storage engine to the SQL layer.

The bottom line is that there's no way to use a B-tree index search to optimize two range conditions on different columns in the same table.

MySQL will also skip using the index entirely if it estimates the cost of reading the whole table would be approximately the same as using the index. The more rows match your condition, the more likely this is. It's extra work for InnoDB to read rows via a secondary index, so it defaults to doing a table-scan if it estimates that your index lookup will match a large number of rows. The threshold at which this happens is not official or documented, but I've observed it happens when your condition matches at least 20% of the rows in the table.
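As a rough sketch of that decision (using the observed ~20% rule of thumb, not the real InnoDB cost model, which is far more involved):

```python
# Hypothetical simplification: choose the access method purely by the
# fraction of the table the optimizer estimates the condition matches.
TOTAL_ROWS = 2_515_343

def plan(estimated_matching_rows, threshold=0.20):
    """Pick 'range' (index scan) vs 'ALL' (table scan) by match ratio."""
    ratio = estimated_matching_rows / TOTAL_ROWS
    return "range" if ratio < threshold else "ALL"

print(plan(83_260))      # ~3% of rows  -> index range scan, like table 1
print(plan(2_515_343))   # ~100% of rows -> full table scan, like table 2
```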

In your second table, given that it also can only filter on the first column, we can reason that the condition on ipFROM alone will match a large subset of the rows in your table. You're searching for all IP addresses less than 3548978221, or 211.137.28.45, which is pretty high in the range of IPv4 addresses. It's not surprising that at least 20% of your rows have values less than that number.
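You can verify both the dotted-quad conversion and how high that value sits in the address space with a couple of lines of Python:

```python
import ipaddress

# The boundary value from the query, shown as a dotted quad.
ip_int = 3548978221
ip = ipaddress.IPv4Address(ip_int)
print(ip)                        # 211.137.28.45

# Fraction of the full 32-bit IPv4 space that lies below this address:
print(round(ip_int / 2**32, 3))  # ~0.826 -- most of the address space
```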

So the MySQL optimizer concludes that in the second query, it won't give enough benefit to use the index, and it decides to do a table-scan. It can't use the second column of the index without using the first column.

Bill Karwin
    Thank you, Bill, for your in-depth clarification. I'll need to go through it a few more times, but it did clear up stuff for me :). – Jiri Aug 05 '19 at 06:18

This is due to the different number of records in the chosen range.

Looking up data via a secondary index is a two-step process. First, MySQL uses the index to find the entries that satisfy the range condition on the first indexed column. Then, since you select *, it looks up each complete record in the table by its primary key. Row for row, this is significantly slower than reading the same records sequentially from the table (or from the index alone).
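A toy model of that two-step lookup (hypothetical rows; InnoDB stores the primary key in every secondary-index entry, which is what makes step 2 possible):

```python
# Toy model of InnoDB's lookup: the secondary index maps indexed
# columns -> primary key, and SELECT * then needs a second lookup of
# the full row by that primary key in the clustered index.
clustered = {                      # PK -> full row
    1: {"ipFROM": 10, "ipTO": 19, "countrySHORT": "CZ"},
    2: {"ipFROM": 20, "ipTO": 29, "countrySHORT": "DE"},
}
secondary = [                      # (ipFROM, ipTO, PK), sorted by ipFROM
    (10, 19, 1),
    (20, 29, 2),
]

# Step 1: range scan on the secondary index (cheap, sequential).
hits = [pk for (ip_from, ip_to, pk) in secondary if ip_from <= 15 <= ip_to]

# Step 2: fetch each full row via its primary key (a separate lookup
# per row -- the part that makes many index hits expensive).
rows = [clustered[pk] for pk in hits]
print(rows)
```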

So if you have to read a large fraction of the records anyway, it can be faster to read all of them (cheap per record) and throw away those you do not need than to read only the records you need (expensive per record). The break-even point is typically somewhere around 10-20% of the table size.

Assuming the autoincrement value approximates the number of records: for your first query/table, MySQL estimates that 83260 out of 2490331 records (about 3%) fulfill the condition ipFROM <= 3548978221, so getting the complete record from the table via the index is viable. For your second query/table, the estimate is that it has to read 2515343 out of 2490331 records (101%, sic) for ipTO >= 3548978221, so it will just read the whole table without doing the slower two-step lookup.
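The arithmetic behind those percentages, using the row estimates straight from the two EXPLAIN outputs:

```python
# Estimates taken from the question's EXPLAIN outputs; the AUTO_INCREMENT
# value 2490331 stands in for the true row count (both are estimates).
auto_inc_rows = 2_490_331

pct_table1 = 83_260 / auto_inc_rows        # first table's row estimate
pct_table2 = 2_515_343 / auto_inc_rows     # second table's row estimate

print(f"{pct_table1:.1%}")   # ~3.3%  -> index lookup pays off
print(f"{pct_table2:.1%}")   # ~101.0% -> just scan the table
```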

You can compare this to doing an index lookup by forcing MySQL to use the index:

SELECT * FROM `ipcountry2` force index (`ipINDEX`)
WHERE ipFROM<=3548978221 AND ipTO>=3548978221 

If you selected only columns that are present in the index (or that are part of the primary key), e.g. SELECT ipFROM, ipTO FROM ..., the index would be a covering index: MySQL could fulfill your request without the second lookup, and it will then always use this index.
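Extending the earlier toy model: when every requested column already lives in the secondary index entry (InnoDB appends the primary key to each entry anyway), the per-row clustered-index lookup disappears entirely.

```python
# Toy illustration of a covering read: the query only needs columns
# stored in the secondary index, so there is no step 2.
secondary = [   # (ipFROM, ipTO, PK) -- everything the query asks for
    (10, 19, 1),
    (20, 29, 2),
]

# SELECT ipFROM, ipTO ... : answered straight from the index entries,
# no clustered-index lookup per row.
covered = [(f, t) for (f, t, pk) in secondary if f <= 15 <= t]
print(covered)   # [(10, 19)]
```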

Solarflare
  • The strategy you describe makes sense. However it strongly applies to database engines that use the "heap model". But as you see the OP is using InnoDB tables that use the "clustering index model". They operate in a different way. – The Impaler Aug 04 '19 at 14:30
  • @TheImpaler While a heap should behave similarly, this is meant to apply to innodb/clustering indexes. Maybe I was unclear at some point? I can then elaborate or clarify. – Solarflare Aug 04 '19 at 14:42
  • These are still the same tables but a bit different IP range `EXPLAIN SELECT * FROM \`ipcountry2\` force index (\`ipINDEX\`) WHERE ipFROM<=4294967295 AND ipTO>=4294967295 ` shows `{ "table": "UnknownTable", "rows": [ { "id": 1, "select_type": "SIMPLE", "table": "ipcountry2", "partitions": null, "type": "range", "possible_keys": "ipINDEX", "key": "ipINDEX", "key_len": "4", "ref": null, "rows": 1201813, "filtered": 33.33, "Extra": "Using index condition" } ] }` – Jiri Aug 04 '19 at 19:40
  • `EXPLAIN SELECT * FROM \`ipcountry\` force index (\`ipINDEX\`) WHERE ipFROM<=4294967295 AND ipTO>=4294967295 ` results in `{ "table": "UnknownTable", "rows": [ { "id": 1, "select_type": "SIMPLE", "table": "ipcountry", "partitions": null, "type": "range", "possible_keys": "ipINDEX", "key": "ipINDEX", "key_len": "4", "ref": null, "rows": 1, "filtered": 33.33, "Extra": "Using index condition" } ] }` – Jiri Aug 04 '19 at 19:42
  • @Jiri Is this a follow up question/is there something unclear about those new queries? – Solarflare Aug 04 '19 at 21:15
  • @Solarflare Thank you for the reply, and reactions. I just wanted to contribute with the results when `force index (`ipINDEX`)` being used. I'll need to go through the answers to fully comprehend them a few more times but as of now, it's pointless for me to ask other questions. Thank you! – Jiri Aug 05 '19 at 06:16
  • Ah, ok. Although asking questions is never pointless. One thing though: the interesting part of using the forced index is not the execution plan (as you force it, so you know how it will look, although it's never wrong to verify), but the actual execution time, to see how much faster it actually is to not use the index (e.g., if MySQL made the correct decision not to use it). (When doing timing tests, use SQL_NO_CACHE, e.g. `select SQL_NO_CACHE * from ...`, to prevent the use of the query cache) – Solarflare Aug 05 '19 at 07:52