0

In case I have a table partitioned by year; how do I avoid the scanning of all partitions when I have to lookup a row by its ID and can't use partition pruning in the lookup query?

CREATE TABLE part_table (
    id bigint NOT NULL auto_increment,
    moment datetime NOT NULL,

    KEY (id),
    KEY (moment)
)-- partitioning information (in years)
    PARTITION BY RANGE( YEAR(moment) ) (
    PARTITION p2020 VALUES LESS THAN (2021),
    PARTITION p2021 VALUES LESS THAN (2022),
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION p2024 VALUES LESS THAN (2025),
    PARTITION p2025 VALUES LESS THAN (2026),
    PARTITION pFuture VALUES LESS THAN (maxvalue) )
;

With e.g. lookup query:

SELECT * FROM part_table WHERE ID = <nr>
whale70
  • 333
  • 2
  • 9
  • Try using `PRIMARY KEY(id, moment)`? – Barmar Dec 03 '20 at 01:27
  • @Barmar; using that primary key does not result in a different set of partitions being scanned - when using EXPLAIN SELECT * FROM part_table WHERE id = 1, the same set of all partitions shows up for both scenarios. – whale70 Dec 03 '20 at 03:12

1 Answers1

1
  • Don't you want PRIMARY KEY(id, moment) or PRIMARY KEY(moment, id) instead of INDEX(id)?
  • Indexes are partitioned. Each partition is essentially a "table". It has a `BTree for the data and PK, and a BTree for each secondary index.
  • So, to find id=123 requires checking INDEX(id) in each partition. Herein lies one of the reasons why a PARTITIONed table is sometimes slower than the equivalent non-partitioned table.
  • It is inefficient to pre-create future partitions (other than one).

Show us the main queries you have. I will probably explain why you should not partition the table. I see two possible benefits in your definition:

  • Dropping 'old' data is much faster than DELETEing it.
  • `WHERE something-else AND moment between ..

Some cases

For this discussion, I assuming partitioning by a datetime in some fashion (BY RANGE(TO_DAYS(moment)) or BY ... (YEAR(moment)), etc).

WHERE id BETWEEN 111 and 222

Partitioning probably hurts slightly because, regardless of what indexes are available, the query must look in every partition.

WHERE id BETWEEN 111 and 222
  AND moment > NOW() - INTERVAL 1 MONTH
with some index starting with `id`

This is a case where partition "pruning" is beneficial. It will look in one or two partitions (depending on whether or not the query is being run in January). Then it will somewhat efficiently use the index to lookup by id.

Now let be discuss two flavors if an index starting with id (and assuming either of the WHERE clauses, above:

PRIMARY KEY(id, moment)

The PK is "clustered" with the data. That is, the data is sorted by first id then moment. Hence the id BETWEEN... will find the rows consecutively in the BTree -- this is the most efficient. The AND moment... works to filter out some of the rows.

INDEX(id)

is not "clustered". It is a secondary index. Secondary indexes take two steps. (1) search the secondary BTree for the ids, but without filtering by moment; (2) reach into the data BTree using the artificial PK that was provided for you; (3) now the filtering by moment can happen. More steps, more blocks to read, etc.

DROP PARTITION p2020

id much faster and less invasive than `DELETE .. WHERE moment < '2021-01-01'.

More

It is important to look at all the main queries. X=constant versus X BETWEEN... can make a big difference in optimization; please provide concrete examples that are realistic for your app.

Also, sometimes a "covering" index can make up for otherwise inefficient indexes. So those examples need to show all the columns in the important queries. And what datatypes they are.

In the absence of such details, I will make the following broad statements (which might be invalidated by the specifics):

  • If the WHERE references only one column, the PARTITIONing is probably never beneficial.
  • If the WHERE has one = test and one 'range' test, there is probably a composite index that will work much better than partitioning.
  • Partitioning may shine when there are two range tests, but only if 'pruning' can be applied. (There are a lot of limitations on pruning.)
  • With 2 ranges, the one that is not being pruned on should be at the beginning of the PRIMARY KEY.
  • When pruning is used but the rest of the WHERE cannot use some index, that implies a scan of the partition. If there are only a few partitions, that could be a big scan.
  • Don't pre-build more than one partition. When not pruning, it is somewhat costly to open all the partitions only to find some are empty.
Rick James
  • 135,179
  • 13
  • 127
  • 222
  • The primary key has to include the partition key. – Barmar Dec 03 '20 at 03:58
  • @Rick James + Barmar: indeed, columns used for partitioning need to be included in ALL unique indices, so that's why the id index can't be unique. I was hoping there would be a way of using the id column for partition pruning as well; e.g. if each partition remembers the min/max id values of its id index the pruning algorithm might be able to check that certain partitions can be skipped when looking up a certain value. – whale70 Dec 03 '20 at 04:06
  • @Rick James: the main benefit of partitioning would be is easy deletion of old data. Most queries would operate on rows relating to the current year, so that's probably the best viable approach to limit lookup queries to a certain partition - combining the id criteria with a moment (year) criteria in the WHERE clause. – whale70 Dec 03 '20 at 04:10
  • @Rick James; what would the benefit be of using PRIMARY KEY(id, moment) over just KEY(id)? The uniqueness criteria is not really meaningful because of the use of the column moment. Or am I missing something? :-) – whale70 Dec 04 '20 at 05:18
  • @whale70 - I added some annotated examples. – Rick James Dec 04 '20 at 19:34
  • @Thanks Rick, that kind of summarizes the dillemmas and next-best solution indeed. Talking to an Oracle expert on this issue, I was told Oracle can maintain min/max id information PER partition (apart from the moment time-based criteria), so it can easily use the id lookup value for pruning as well; this would be a welcome addition to MySQL in the future :-) Thanks for thinking along; I think the predominant theme is to try to add partition pruning WHERE clauses whenever possible to avoid an all-partition scan. – whale70 Dec 09 '20 at 23:38
  • The one query you have presented is slower with partitioning than without, _even_ if you play around with augmenting the WHERE. Please provide another of your queries that you _think_ will be sped up by partitioning. I will probably explain that it won't be sped up. – Rick James Dec 09 '20 at 23:43
  • @Rick, can you explain to me why e.g. using a "WHERE id = AND moment = '2020-12-15' " would not result in partition pruning using the moment filter and then subsequent lookup with only the id key in that specific partition? I'm not getting that bit yet :-) The actual queries I would use would be a lot more complex/specific but this pattern is fairly representative. – whale70 Dec 15 '20 at 00:27
  • @whale70 - That example should do pruning. But, if `id` is unique, there there is only one row to look up. So partitioning will be no faster. _Every table should have a `PRIMARY KEY`; maybe your example is trimmed down too much?_ – Rick James Dec 15 '20 at 01:20
  • @rick, ok, thanks, in the scenarios i have to deal with, some queries may only use an id based lookup (possibly using extra moment criteria as well), but other queries may use only moment based criteria. So to find an optimum for both query categories, I would think partitioning would help with the 2nd type, while not hurting (but also not really helping) the 1st type. In case an id lookup is combined with partition pruning; the id lookup in 1 partition still would be faster than for an non-partitioned larger table? – whale70 Jan 01 '21 at 06:12
  • @whale70 - Good points. It would help to have specifics. Meanwhile, I added some general statements. – Rick James Jan 01 '21 at 17:44