
I'm not a MySQL guy, but I've recently gotten the opportunity to work with it: I have to optimize a query on MySQL 5.7 (I also have to cover 5.6, but I started with 5.7 as it apparently exposes more information in EXPLAIN) running on AWS Aurora. The query takes a lot of time, some joins are involved, etc. I started by cutting off branches and chose to begin 'debugging' with only two tables. These are not too big (~2M and ~1.5M rows), but in general I don't consider them too well designed (primary keys on VARCHAR(255) columns, etc.).

The thing is that I wanted to look inside the execution, so I used EXPLAIN FORMAT=JSON, and now I'm trying to get some insight from its output.
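
Concretely, I just prefix the statement under test, i.e. for the simplified query discussed below:

    explain format=json
    select cc.id, cc.col1, cc.col2, ct.col1
    from my_table cc
    inner join my_table ct on ct.cc_id = cc.id;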

So, let's say that for

select cc.id, cc.col1, cc.col2, ct.col1
from my_table cc 
inner join my_table ct on ct.cc_id = cc.id

I am getting something like

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "5630369.56"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "cc",
          "access_type": "index",
          "possible_keys": [
            "PRIMARY"
          ],
          "key_length": "258",
          "rows_examined_per_scan": 1248725,
          "rows_produced_per_join": 1248725,
          "filtered": "100.00",
          "using_index": true,
          "cost_info": {
            "read_cost": "3732979.00",
            "eval_cost": "249745.00",
            "prefix_cost": "3982724.00",
            "data_read_per_join": "1G"
          }
        }
      },
      {
        "table": {
          "table_name": "ct",
          "access_type": "ref",
          "possible_keys": [
            "cc_id_idx"
          ],
          "key": "cc_id_idx",
          "key_length": "257",
          "ref": [
            "cc.id"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 1373037,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "1373037.97",
            "eval_cost": "274607.59",
            "prefix_cost": "5630369.56",
            "data_read_per_join": "3G"
          },
          "used_columns": [
            "id",
            "col_1",
            "col_2",
            ...
          ]
        }
      }
    ]
  }
}

and I have problems understanding what exactly happens here. I think MySQL performs a nested loop join: it starts with table cc and then, for each row, reaches into ct (access_type is ref and "rows_examined_per_scan": 1), doing that more than a million times ("rows_produced_per_join": 1373037). Is that correct? I was looking for documentation on this, but I didn't find anything specific about how to read these values in the context of a join; maybe my google-fu just isn't strong enough. Could anyone give me a clue?

(As I'm looking for any option to speed this up, I wanted to force MySQL to do a hash join, which is available on Aurora (I was trying on Aurora 2.09) in the form of /*+ HASH_JOIN(cc) */, but it didn't affect the query plan at all; that is rather a matter for another question.)
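
For reference, this is how I was placing the hint. From what I understand (an assumption I still need to verify against the AWS docs), the hash join feature may also have to be enabled on the cluster or session before any such hint can take effect:

    -- hint placement I tried on Aurora 2.09; if the hash join feature is not
    -- enabled (parameter group / optimizer settings, depending on the Aurora
    -- version), the hint may simply be ignored
    select /*+ HASH_JOIN(cc) */ cc.id, cc.col1, cc.col2, ct.col1
    from my_table cc
    inner join my_table ct on ct.cc_id = cc.id;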

Adam Tokarski

1 Answer

select cc.id, cc.col1, cc.col2, ct.col1
    from my_table cc 
    inner join my_table ct  on ct.cc_id = cc.id

If there were a WHERE clause, that would probably cause the Optimizer to pick the table mentioned in WHERE as the "first" table in the JOIN.

Without a WHERE, the Optimizer usually picks the smaller table as "first".

Then it usually does a NLJ (Nested Loop Join):

  1. Read all the rows (possibly filtered by WHERE) of the "first" table.
  2. For each of those rows it reaches into the other table (NLJ).
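
Conceptually, step 2 behaves as if, for each row fetched from the "first" table, the engine ran a point lookup against the second one. With your plan (ref access via cc_id_idx), each probe is roughly equivalent to this, where the literal stands in for the current cc.id value:

    -- one "ref" probe of ct per row of cc; 'abc123' is illustrative only
    SELECT ct.col1
    FROM my_table ct
    WHERE ct.cc_id = 'abc123';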

That step 2 is sometimes done by reading the entire "second" table into memory and building a hash. But this only works for a "small" second table. Your table seems to be too big for that.

I bring all this up because the query you presented is simplified; the 'real' query may not be executed the same as that EXPLAIN.

Some observations:

  • "key_length" is "big" and inconsistent -- check declarations, including the collation. Don't blindly use VARCHAR(255) when some sensible length is practical. Mixing collations in a JOIN is deadly on performance. (The is no real proof that the query has this problem.)
  • Although I just criticized the VARCHAR(255) for a PRIMARY KEY, I disagree with those who exclaim that it is terrible and that you "must" switch to an INT. Can you tell us what type of data is there? UUID vs short string vs URL vs ... -- different optimization techniques may be applicable depending on the data.
  • "using_index": true -- This means that all the columns needed from the table in that query are found in the INDEX being used. This is good for performance. But you could kill that if you add another column to the SELECT clause.
  • "used_columns" -- The list seems longer than what the query needs? Do you have a lot of columns in the index? Does the EXPLAIN not match the query?
  • A "hash" is rarely better than a "BTree". (I don't know the specifics of Aurora, so I can't say anything about the index hint you tried.)

The simple query, as written, will take a long time -- a scan of one table (or index), plus lots of BTree lookups into the other table.

Back to your original question. Here's what I see in the Explain:

  • "Nested loop" == NLJ, as expected and as discussed above.
  • Big costs -- to be expected when joining big tables
  • "Using index" == good, but the only "possible key" is "PRIMARY". So, it is actually a table scan, not really an "index scan".
  • "Key_length" -- see above
  • "Used_columns" -- see above
  • access_type is ref and "rows_examined_per_scan": 1 -- the lookup (see above). But note that rows_examined_per_scan is an estimate; the real value depends on whether cc_id is UNIQUE (a direct check follows this list).
  • "rows_produced_per_join" -- Not that important

Please provide SHOW CREATE TABLE for questions like this.

Rick James
Thanks for that answer! I really appreciate its level of detail and the suggestions you gave me. The issue was generally with collations and slightly mismatched data types. – Adam Tokarski Oct 30 '20 at 17:58