
I have a SQL query which I know is inefficient, based on the MySQL slow query log. I am using MySQL 8.

I have three different tables; see the show create table output at the bottom. Basically, a catch belongs to exactly one beat, and a beat can have many catches. A catch belongs to exactly one river, and a river can have many catches. A beat belongs to exactly one river, and a river can have many beats. I guess you could say the link from catch to river is redundant, but maybe that was done to ease other queries elsewhere that do not depend on beats.

The query is as follows:

select date
from catch_logs
where catch_logs.removed = false
  and exists (select * from rivers where catch_logs.river_id = rivers.id and active_on_laksebors = 1 and rivers.deleted_at is null)
  and exists (select * from beats where catch_logs.beat_id = beats.id and active_on_laksebors = 1)
  and no_catch = false
order by date asc
limit 1;

So the query effectively joins with two other tables. When running EXPLAIN on the query, I get (a bit abbreviated; I have included what I think is relevant):

table      | type   | possible_keys      | key                | rows | filtered | Extra
catch_logs | ref    | <several indexes>  | <several indexes>  | 1656 | 1        | Using where; Using index
beats      | eq_ref | PRIMARY            | PRIMARY            | 1    | 100      | Using where
rivers     | ALL    | PRIMARY            | null               | 554  | 10       | Using where; Using temporary; Using filesort

554 is the number of entries in rivers. Beats has ~1500 rows, but only 1 is checked. Basically, what I want to achieve is that both beats and rivers check only 1 row each, as I already have these ids from catch_logs.

I have seen in the MySQL documentation for type that ALL, which is used for rivers, is the worst join type, and the docs state:

A full table scan is done for each combination of rows from the previous tables. This is normally not good if the table is the first table not marked const, and usually very bad in all other cases. Normally, you can avoid ALL by adding indexes that enable row retrieval from the table based on constant values or column values from earlier tables.

As far as I can see, from catch_logs' perspective there is no difference between how it accesses beats and rivers. The only difference I can think of is that beats also has a reference to rivers.

I have tried creating various indexes (see the create table definition of catch_logs), but without any change in the result.

How can I make an index so that only 1 single row of both beats and rivers is checked? Here are some partial questions which may point in some direction:

  • Why does beats only have "Using where" under "Extra", while rivers has "Using where; Using temporary; Using filesort"?
  • beats uses PRIMARY for both possible_keys and key, but rivers only has this value for possible_keys, and null for key. The docs list the following explanation, but I cannot understand why it chooses to do so:

It is possible that key may name an index that is not present in the possible_keys value. This can happen if none of the possible_keys indexes are suitable for looking up rows, but all the columns selected by the query are columns of some other index. That is, the named index covers the selected columns, so although it is not used to determine which rows to retrieve, an index scan is more efficient than a data row scan.

Why is my possible_key "not suitable for looking up rows"?

I have improved my query somewhat by selecting min(date) instead of using order by date asc limit 1, but my joins are still causing performance issues.
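For reference, the min(date) variant mentioned above would look roughly like this (a sketch; the filter conditions are unchanged from the original query):

```sql
-- Sketch of the min(date) rewrite: same filters as before, but the
-- aggregate avoids the explicit sort that order by ... limit 1 implies.
select min(date)
from catch_logs
where catch_logs.removed = false
  and no_catch = false
  and exists (select 1 from rivers
              where catch_logs.river_id = rivers.id
                and active_on_laksebors = 1
                and rivers.deleted_at is null)
  and exists (select 1 from beats
              where catch_logs.beat_id = beats.id
                and active_on_laksebors = 1);
```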

EDIT: Here are the table definitions:

CREATE TABLE `catch_logs` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `created_at` timestamp NULL DEFAULT NULL,
  `updated_at` timestamp NULL DEFAULT NULL,
  `date` date DEFAULT NULL,
  `river_id` int(10) unsigned DEFAULT NULL,
  `beat_id` int(10) unsigned DEFAULT NULL,
  `no_catch` tinyint(1) NOT NULL DEFAULT '0',
  `removed` tinyint(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  KEY `catch_logs_beat_id_foreign` (`beat_id`),
  KEY `catch_logs_river_id_foreign` (`river_id`),
  KEY `catch_logs_removed_no_catch_river_id_date_beat_id_index` (`removed`,`no_catch`,`river_id`,`date`,`beat_id`),
  CONSTRAINT `catch_logs_beat_id_foreign` FOREIGN KEY (`beat_id`) REFERENCES `beats` (`id`) ON DELETE SET NULL,
  CONSTRAINT `catch_logs_river_id_foreign` FOREIGN KEY (`river_id`) REFERENCES `rivers` (`id`) ON DELETE SET NULL
) ENGINE=InnoDB AUTO_INCREMENT=194046 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

CREATE TABLE `rivers` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  `created_at` timestamp NULL DEFAULT NULL,
  `updated_at` timestamp NULL DEFAULT NULL,
  `slug` varchar(200) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  `county_id` int(10) unsigned DEFAULT NULL,
  `latitude` varchar(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `longitude` varchar(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `active_on_laksebors` tinyint(1) NOT NULL DEFAULT '0',
  `deleted_at` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `rivers_county_id_foreign` (`county_id`),
  CONSTRAINT `rivers_county_id_foreign` FOREIGN KEY (`county_id`) REFERENCES `counties` (`id`) ON DELETE SET NULL
) ENGINE=InnoDB AUTO_INCREMENT=662 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

CREATE TABLE `beats` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  `river_id` int(10) unsigned DEFAULT NULL,
  `created_at` timestamp NULL DEFAULT NULL,
  `updated_at` timestamp NULL DEFAULT NULL,
  `removed` tinyint(1) NOT NULL DEFAULT '0',
  `latitude` varchar(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `longitude` varchar(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `active_on_laksebors` tinyint(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  KEY `beats_river_id_foreign` (`river_id`),
  CONSTRAINT `beats_river_id_foreign` FOREIGN KEY (`river_id`) REFERENCES `rivers` (`id`) ON DELETE SET NULL
) ENGINE=InnoDB AUTO_INCREMENT=1868 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Here is the output of `explain format=json`:

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "4516.29"
    },
    "ordering_operation": {
      "using_temporary_table": true,
      "using_filesort": true,
      "cost_info": {
        "sort_cost": "806.13"
      },
      "nested_loop": [
        {
          "table": {
            "table_name": "rivers",
            "access_type": "ALL",
            "possible_keys": [
              "PRIMARY"
            ],
            "rows_examined_per_scan": 557,
            "rows_produced_per_join": 5,
            "filtered": "1.00",
            "cost_info": {
              "read_cost": "56.89",
              "eval_cost": "0.56",
              "prefix_cost": "57.45",
              "data_read_per_join": "28K"
            },
            "used_columns": [
              "id",
              "active_on_laksebors",
              "deleted_at"
            ],
            "attached_condition": "((`database`.`rivers`.`active_on_laksebors` = 1) and (`database`.`rivers`.`deleted_at` is null))"
          }
        },
        {
          "table": {
            "table_name": "catch_logs",
            "access_type": "ref",
            "possible_keys": [
              "catch_logs_beat_id_foreign",
              "catch_logs_river_id_foreign",
              "catch_logs_removed_no_catch_river_id_date_beat_id_index"
            ],
            "key": "catch_logs_removed_no_catch_river_id_date_beat_id_index",
            "used_key_parts": [
              "removed",
              "no_catch",
              "river_id"
            ],
            "key_length": "7",
            "ref": [
              "const",
              "const",
              "database.rivers.id"
            ],
            "rows_examined_per_scan": 1447,
            "rows_produced_per_join": 8061,
            "filtered": "100.00",
            "using_index": true,
            "cost_info": {
              "read_cost": "25.15",
              "eval_cost": "806.13",
              "prefix_cost": "888.73",
              "data_read_per_join": "24M"
            },
            "used_columns": [
              "id",
              "date",
              "river_id",
              "beat_id",
              "no_catch",
              "removed"
            ],
            "attached_condition": "(`database`.`catch_logs`.`beat_id` is not null)"
          }
        },
        {
          "table": {
            "table_name": "beats",
            "access_type": "eq_ref",
            "possible_keys": [
              "PRIMARY"
            ],
            "key": "PRIMARY",
            "used_key_parts": [
              "id"
            ],
            "key_length": "4",
            "ref": [
              "database.catch_logs.beat_id"
            ],
            "rows_examined_per_scan": 1,
            "rows_produced_per_join": 806,
            "filtered": "10.00",
            "cost_info": {
              "read_cost": "2015.31",
              "eval_cost": "80.61",
              "prefix_cost": "3710.17",
              "data_read_per_join": "4M"
            },
            "used_columns": [
              "id",
              "active_on_laksebors"
            ],
            "attached_condition": "(`database`.`beats`.`active_on_laksebors` = 1)"
          }
        }
      ]
    }
  }
}
larsesen
    It would be very helpful if we could see the actual table and index DML, rather than your summary of them. Is that possible? – Andy Lester Jan 14 '22 at 14:28
  • 3
    554 rows is so few that mysql probably does not care about indexes – Shadow Jan 14 '22 at 14:48
  • You may be looking at the wrong place, as the interesting index is probably the one for A. You may not have a good one there, so MySQL is doing something strange (yet actually creatively smart) with C. I would probably try "date, something" plus maybe "b_id, c_id". It may depend on other details (e.g. how many rows are something=false). The more details you give us about your actual table/index/execution plan/data distribution, the more we can help. Also please add the execution time. – Solarflare Jan 14 '22 at 17:09
  • It might be much easier to optimize without `and active_web = 1 and deleted_at is null`. – Rick James Jan 14 '22 at 18:40
  • 1
    _Please_ provide `SHOW CREATE TABLE` for each table; we desperately need to see the indexes, including the PK. It appears that `id` is not the `PRIMARY KEY` of `C`?? `table_a_b_id_foreign` is not specific enough. – Rick James Jan 14 '22 at 18:42
  • @AndyLester I have updated my question with the actual query, and all table definitions with *show create table*. @RickJames I need these as filtering criteria. I don't know how this affects the index, but only ~10% of the rivers have deleted_at = null – larsesen Jan 15 '22 at 11:12
  • I'm puzzled. There are two ways to get from `beats` to `rivers` -- either directly via `beats.river_id` or via `catch_logs`. What is the intended relationship: many:many? one:many? – Rick James Jan 15 '22 at 16:30
  • 1
    Another request: `EXPLAIN FORMAT=JSON SELECT ...` -- It may provide some more details on why the query failed to use river's PK. – Rick James Jan 15 '22 at 16:43
  • @RickJames A catch belongs to exactly one beat, a beat can have many catches. A catch belongs to exactly one river, a river can have many catches. A beat belongs to exactly one river, a river can have many beats. I guess you can say that the link from *catch* to *river* is excess, but maybe that is done for easing other queries elsewhere which is not dependent on *beats* – larsesen Jan 15 '22 at 19:13
  • Please clarify via edits, not comments. Please don't abuse formatting, code format is for code. Large passages under typographical emphasis is not needed & are an impediment to reading; just organize & write clearly. – philipxy Jan 15 '22 at 21:16

1 Answer

  • If A has PRIMARY KEY(id), then there is already a good index starting with id, namely the data's BTree itself. Adding create index idx_tablea_b_c on a(id, b_id, c_id); is likely to be useless and ignored.

  • On the other hand, one of these might be useful because it starts with the item in the WHERE and includes all the columns needed. (I don't know which would be better.)

    INDEX(something, date, some_id, some_other_id)
    INDEX(something, some_id, some_other_id, date)
    
  • Your description mentions A.b_id and A.some_id; were they supposed to be the same? (Addressing this may lead to answering your question about accessing C only once.)

  • Those are not "joins" but subqueries used for filtering.

  • WHERE ... EXISTS ( ... ) is good. LEFT JOIN ... IS NOT NULL ... is not likely to be any better.
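For comparison, an explicit inner-join formulation of the same two filters would look roughly like this (a sketch; with the one-to-one links from catch to beat and river described in the question, it should return the same rows, though as the bullet above notes it is not necessarily any faster):

```sql
-- Inner-join equivalent of the two EXISTS filters; since each catch
-- references at most one river and one beat, no duplicates can appear.
select cl.date
from catch_logs cl
join rivers r on r.id = cl.river_id
             and r.active_on_laksebors = 1
             and r.deleted_at is null
join beats b  on b.id = cl.beat_id
             and b.active_on_laksebors = 1
where cl.removed = false
  and cl.no_catch = false
order by cl.date asc
limit 1;
```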

More

  • Change SELECT * to SELECT 1. (This probably makes no difference.)
  • Please add this index to rivers: INDEX(deleted_at, active_on_laksebors, id). (It seems that the Optimizer has decided to change EXISTS() into JOIN; I did not know it would try to do such a thing. This index should speed up rivers a little.)
  • Run the "Optimizer trace"; it may tell us why it picked the table order it did. Tips on how: Index Cookbook
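Concretely, the suggested index and a minimal optimizer-trace session might look like this (a sketch; the index name is a made-up example, and you would substitute the actual slow query for the placeholder):

```sql
-- Covering index for the rivers EXISTS filter, as suggested above.
-- (rivers_deleted_active_id_index is a hypothetical name.)
alter table rivers
  add index rivers_deleted_active_id_index (deleted_at, active_on_laksebors, id);

-- Capture the optimizer trace for the query under test.
set optimizer_trace = 'enabled=on';
-- ... run the slow query here ...
select trace from information_schema.optimizer_trace;
set optimizer_trace = 'enabled=off';
```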
Rick James