How to optimize this query with a self-join?

Question

I have the following table:

CREATE TABLE lab_data (
  id int(11) NOT NULL,
  patient_sid int(11) DEFAULT NULL,
  double_value double DEFAULT NULL,
  string_value varchar(7) DEFAULT NULL,
  data_type_id int(11) DEFAULT NULL,
  event_date datetime DEFAULT NULL,
  attribute_id int(11) DEFAULT NULL,
  lft int(11) DEFAULT NULL,
  rgt int(11) DEFAULT NULL,
  parent int(11) DEFAULT NULL,
  num_children int(11) DEFAULT NULL,
  PRIMARY KEY (id),
  KEY idx_bucket (attribute_id,string_value),
  KEY idx_test (attribute_id,double_value,event_date,patient_id,lft,rgt)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

This is a very large table (11 million rows), and I really need to optimize the following self-join query:

SELECT  distinct(patient_sid) as patient_sid
FROM lab_data l1 
LEFT JOIN (SELECT patient_sid, lft, rgt
           FROM lab_data
           WHERE attribute_id = 36 AND double_value >= 1.2 AND event_date >= '1776-01-01' 
         ) AS l2 
ON l1. patient_sid = l2.patient_sid AND l1.lft >= l2.lft AND l1.rgt <= l2.rgt
WHERE l1.attribute_id = 33 AND l1.string_value = '2160-0'

(I have tried moving the range search for AND l1.lft >= l2.lft AND l1.rgt <= l2.rgt into the outer where clause, and did not see much difference.)

The index, idx_bucket is correctly being used for the outer query, but the idx_test is not used for the inner sub query when I do an EXPLAIN query plan. Instead, it is using idx_bucket, too.

# id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra
'1', 'SIMPLE', 'l1', NULL, 'ref', 'idx_bucket,idx_test', 'idx_bucket', '29', 'const,const', '517298', '100.00', 'Using temporary'
'1', 'SIMPLE', 'lab_data', NULL, 'ref', 'idx_bucket,idx_test', 'idx_bucket', '5', 'const', '13657', '100.00', 'Using where; Distinct'

If I force the inner subquery to use idx_test, I get the following query plan:

# id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra
'1', 'SIMPLE', 'l1', NULL, 'ref', 'idx_bucket,idx_test', 'idx_bucket', '29', 'const,const', '517298', '100.00', 'Using temporary'
'1', 'SIMPLE', 'lab_data', NULL, 'ref', 'idx_test', 'idx_test', '5', 'const', '21808', '100.00', 'Using where; Distinct'

And from the JSON output, I only see attribute_id under used_key_parts used for this index ? According to the MySQL documentation (B-Tree Index Characteristics), the btree indices are such that, "A B-tree index can be used for column comparisons in expressions that use the =, >, >=, <, <=, or BETWEEN operators."

 "table": {
  "table_name": "lab_data",
  "access_type": "ref",
  "possible_keys": [
    "idx_test"
  ],
  "key": "idx_test",
  "used_key_parts": [
    "attribute_id"
  ],
  "key_length": "5",
  "ref": [
    "const"
  ],
  "rows_examined_per_scan": 8898041,
  "rows_produced_per_join": 988473,
  "filtered": "11.11",
  "index_condition": "((`ns_large2_2016`.`lab_data`.`double_value` >= 1.2) and (`ns_large2_2016`.`lab_data`.`event_date` >= '1776-01-01'))",
  "cost_info": {
    "read_cost": "339069.00",
    "eval_cost": "197694.69",
    "prefix_cost": "2118677.20",
    "data_read_per_join": "82M"
  },
  "used_columns": [
    "patient_sid",
    "double_value",
    "event_date",
    "attribute_id",
    "lft",
    "rgt"
  ]

Am I misunderstanding what used_key_parts is? I am assuming that these are the columns of the index being used. The documentation for b-tree indexes makes me believe that range comparisons should be included.

What? Why was this downvoted as soon as I posted it? I do not see any vagueness or ambiguity in my question? Oy! — horcle_buzz, Oct 21 '16 at 17:43
Not sure why, but you could improve a litle the query format, and not many ppl can read explain plan in json. — Juan Carlos Oropeza, Oct 21 '16 at 17:49
Appreciate the feedback. Sometimes one wonders what is going on here! (I got the query and am working on the explain plan...) — horcle_buzz, Oct 21 '16 at 18:03
I know there is a site convert that explain in a graphic representation. But dont have it at hand now. — Juan Carlos Oropeza, Oct 21 '16 at 18:04
https://dev.mysql.com/doc/workbench/en/wb-performance-explain.html — Juan Carlos Oropeza, Oct 21 '16 at 18:05
Problem with the graphic or tabular representation is that you can't get specifics about the indices, which is why I pasted in the JSON. I will just instead cut out the fluff and leave the part of the plan that is in question. — horcle_buzz, Oct 21 '16 at 18:20
Regarding your problem, if you run the inner select alone use the index? — Juan Carlos Oropeza, Oct 21 '16 at 18:25

score 0 · Answer 1 · answered Oct 21 '16 at 18:35

0

Try creating an index with

 KEY idx_test2 (attribute_id, double_value, event_date)

answered Oct 21 '16 at 18:35

Juan Carlos Oropeza

47,252
12
78
118

If I use a strict `=` for the double_value comparison in my query, then it works great, but with a range like given above with a `>=`, then it does not use this column in the compound index, even if I specify the `use index` hint. This goes back to my vague question about b-tree indexes. According to the documentation, "A B-tree index can be used for column comparisons in expressions that use the =, >, >=, <, <=, or BETWEEN operators..." however, it does not appear to be the case. – horcle_buzz Oct 21 '16 at 19:14
1

maybe this help MySQL index [**TIPS**](http://mysql.rjweb.org/doc.php/index_cookbook_mysql) – Juan Carlos Oropeza Oct 21 '16 at 19:19
Indeed it is quite helpful. Not sure it solves my particular case, but it at least is a good start. – horcle_buzz Oct 21 '16 at 19:54

score 0 · Answer 2 · answered Oct 21 '16 at 23:49

0

You need INDEX(patient_sid, attribute_id). Unfortunately, that is about all that is useful for l2.
Remove LEFT -- it could lead to extra patient_sid values that you did not want.
Do not expect double_value >= 1.2 to necessarily include "1.2". Floating point values have some quirky rounding issues. (The one failure case that comes to mind is if "1.2" were put into a FLOAT, then moved to a DOUBLE.)
DISTINCT(x) AS y may happen to work, but it is not parsed the way you expect. DISTINCT is not a function. Say SELECT DISTINCT l1.patient_sid FROM ....
See if the following works; it might be faster:

SELECT l1.patient_sid FROM lab_data l1 JOIN lab_data l2 ON l1.patient_sid = l2.patient_sid AND l1.lft >= l2.lft AND l1.rgt <= l2.rgt WHERE l1.attribute_id = 33 AND l1.string_value = '2160-0' AND l2.attribute_id = 36 AND l2.double_value >= 1.2 AND l2.event_date >= '1776-01-01'

answered Oct 21 '16 at 23:49

Rick James

135,179
13
127
222

I had tried that form, with no success. See [optimization-of-query-using-covering-indices](http://stackoverflow.com/questions/37125283/optimization-of-query-using-covering-indices) – horcle_buzz Oct 22 '16 at 00:56
Also, these queries are being constructed by the SQLAlchemy ORM, so use of DISTINCT has to be on a column as a function, not a set of selected columns. – horcle_buzz Oct 22 '16 at 14:39
Well, MySQL does not work that way: `SELECT DISTINCT(a), b FROM ...` is the same as `SELECT DISTINCTROW a,b FROM ...`. "The ALL and DISTINCT options specify whether duplicate rows should be returned. ALL (the default) specifies that all matching rows should be returned, including duplicates. DISTINCT specifies removal of duplicate rows from the result set. ... DISTINCTROW is a synonym for DISTINCT. " – Rick James Oct 22 '16 at 15:23

score 0 · Accepted Answer · answered Oct 22 '16 at 00:54

The solution ended up being using a adjacency list/parent child relationship in the self-join, versus the nested set representation of the self-join:

SELECT  distinct(patient_sid) as patient_sid
FROM lab_data l1 
LEFT JOIN (SELECT parent
           FROM lab_data
           WHERE attribute_id = 36 AND double_value >= 1.2 AND event_date >= '1776-01-01' 
         ) AS l2 
ON l1.id = l2.parent
WHERE l1.attribute_id = 33 AND l1.string_value = '2160-0'

Then, I defined an index on the table using

KEY idx_test (attribute_id, parent)

This ended up speeding up the query by a factor of 80 (with the nested set representation, it took 40+ minutes to execute and fetch the results, while with the adjacency list representation, it took only 28 seconds to complete). The only values now I need to do a range scan on are potentially double_value and event_date.

How to optimize this query with a self-join?

3 Answers3