
When I add DISTINCT to my query, the query time increases from 0.015 seconds to over 6 seconds.

I want to join several tables that are linked via foreign keys and select the distinct values of one column:

select distinct table3.idtable3
from table1
join table2 on table1.idtable1 = table2.fkey
join table3 on table2.idtable2 = table3.fkey
where table1.idtable1 = 1

The DISTINCT query takes over 6 seconds, which seems like it should be improvable.

With the plain select:

duration : 0.015s / fetch:5.532s (5.760.434 rows)

Explain:

id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra
1   SIMPLE  table1      index   asd asd 137     10  10.00   Using where; Using index
1   SIMPLE  table2      ALL idtable2                200 25.00   Using where; Using join buffer (Block Nested Loop)
1   SIMPLE  table3      ref fkey_table2_table_3_idx fkey_table2_table_3_idx 138 mydb.table2.idtable2    66641   100.00  

With the distinct select:

duration : 6.625s / fetch:0.000s (1000 rows)

Explain:

id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra
1   SIMPLE  table1      index   asd asd 137     10  10.00   Using where; Using index; Using temporary
1   SIMPLE  table2      ALL idtable2                200 25.00   Using where; Using join buffer (Block Nested Loop)
1   SIMPLE  table3      ref fkey_table2_table_3_idx fkey_table2_table_3_idx 138 mydb.table2.idtable2    66641   100.00  

Database: the MySQL Workbench script is included in the test code below.

Code for testing / MCRE:

import mysql.connector
import numpy as np


""" 
-- MySQL Script generated by MySQL Workbench
-- Fri Jan 17 12:19:26 2020
-- Model: New Model    Version: 1.0
-- MySQL Workbench Forward Engineering

SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0;
SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0;
SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION';

-- -----------------------------------------------------
-- Schema mydb
-- -----------------------------------------------------

-- -----------------------------------------------------
-- Schema mydb
-- -----------------------------------------------------
CREATE SCHEMA IF NOT EXISTS `mydb` DEFAULT CHARACTER SET utf8 ;
USE `mydb` ;

-- -----------------------------------------------------
-- Table `mydb`.`table1`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `mydb`.`table1` (
  `idtable1` VARCHAR(45) NOT NULL,
  INDEX `asd` (`idtable1` ASC) VISIBLE)
ENGINE = InnoDB;


-- -----------------------------------------------------
-- Table `mydb`.`table2`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `mydb`.`table2` (
  `idtable2` VARCHAR(45) NOT NULL,
  `fkey` VARCHAR(45) NULL,
  INDEX `link_table1_table2_idx` (`fkey` ASC) INVISIBLE,
  INDEX `idtable2` (`idtable2` ASC) VISIBLE,
  CONSTRAINT `link_table1_table2`
    FOREIGN KEY (`fkey`)
    REFERENCES `mydb`.`table1` (`idtable1`)
    ON DELETE NO ACTION
    ON UPDATE NO ACTION)
ENGINE = InnoDB;


-- -----------------------------------------------------
-- Table `mydb`.`table3`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `mydb`.`table3` (
  `idtable3` VARCHAR(45) NOT NULL,
  `fkey` VARCHAR(45) NULL,
  INDEX `fkey_table2_table_3_idx` (`fkey` ASC) VISIBLE,
  CONSTRAINT `fkey_table2_table_3`
    FOREIGN KEY (`fkey`)
    REFERENCES `mydb`.`table2` (`idtable2`)
    ON DELETE NO ACTION
    ON UPDATE NO ACTION)
ENGINE = InnoDB;


SET SQL_MODE=@OLD_SQL_MODE;
SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS;
SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS;


"""


def insertData():
    for _ in range(2):  # run twice to double the data volume (duplicates included)
        num_distinct_table1_values = 5
        num_distinct_table2_values = 10
        num_distinct_table3_values = 1000

        num_entries_table2 = num_distinct_table2_values * 10
        num_entries_table3 = num_distinct_table3_values * 300

        random_numbers_table2_id = np.random.randint(num_distinct_table2_values, size=num_entries_table2)
        random_numbers_table2_fkey = np.random.randint(num_distinct_table1_values, size=num_entries_table2)

        random_numbers_table3_id = np.random.randint(num_distinct_table3_values, size=num_entries_table3)
        random_numbers_table3_fkey = np.random.randint(num_distinct_table2_values, size=num_entries_table3)

        # build comma-separated VALUES lists for multi-row inserts
        value_string_table1 = ','.join(f"('{name}')" for name in range(num_distinct_table1_values))
        value_string_table2 = ','.join(
            f"('{random_numbers_table2_id[i]}','{random_numbers_table2_fkey[i]}')"
            for i in range(num_entries_table2))
        value_string_table3 = ','.join(
            f"('{random_numbers_table3_id[i]}','{random_numbers_table3_fkey[i]}')"
            for i in range(num_entries_table3))

        # fill table 1
        cursor.execute(f"INSERT INTO table1 (idtable1) VALUES {value_string_table1}")
        conn.commit()
        print("Done table 1")
        # fill table 2
        cursor.execute(f"INSERT INTO table2 (idtable2, fkey) VALUES {value_string_table2}")
        print("Done table 2")
        # fill table 3
        cursor.execute(f"INSERT INTO table3 (idtable3, fkey) VALUES {value_string_table3}")
        print("Done table 3")

        conn.commit()

conn = mysql.connector.connect(user='root', password='admin', host='127.0.0.1',
                               database='mydb', raise_on_warnings=True, autocommit=False)
cursor = conn.cursor()


insertData()


conn.close()
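For readers without a MySQL server handy, here is a self-contained sketch of the same experiment using the standard library's sqlite3 (an assumption of mine, not part of the original post). Row counts are smaller and SQLite's optimizer differs from MySQL's, so timings will not match; it only reproduces the shape of the test.

```python
# Rebuild the three-table schema in an in-memory SQLite database and compare
# the plain vs DISTINCT versions of the join. Table, column, and index names
# match the Workbench script above.
import random
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE table1 (idtable1 TEXT NOT NULL);
    CREATE TABLE table2 (idtable2 TEXT NOT NULL, fkey TEXT);
    CREATE TABLE table3 (idtable3 TEXT NOT NULL, fkey TEXT);
    CREATE INDEX asd ON table1 (idtable1);
    CREATE INDEX idtable2 ON table2 (idtable2);
    CREATE INDEX fkey_table2_table_3_idx ON table3 (fkey);
""")

# same value distributions as insertData(), just fewer rows
cur.executemany("INSERT INTO table1 VALUES (?)",
                [(str(i),) for i in range(5)])
cur.executemany("INSERT INTO table2 VALUES (?, ?)",
                [(str(random.randrange(10)), str(random.randrange(5)))
                 for _ in range(100)])
cur.executemany("INSERT INTO table3 VALUES (?, ?)",
                [(str(random.randrange(1000)), str(random.randrange(10)))
                 for _ in range(10000)])

query = """
    SELECT {d} table3.idtable3 FROM table1
    JOIN table2 ON table1.idtable1 = table2.fkey
    JOIN table3 ON table2.idtable2 = table3.fkey
    WHERE table1.idtable1 = '1'
"""
all_rows = cur.execute(query.format(d="")).fetchall()
distinct_rows = cur.execute(query.format(d="DISTINCT")).fetchall()
print(len(all_rows), len(distinct_rows))
```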
symcbean
Langer

1 Answer


Thanks for the CREATE TABLEs; you might never have gotten an Answer without them.

  • Every table should have a PRIMARY KEY. If you have a column (or combination of columns) that 'naturally' works, use that. Else use an AUTO_INCREMENT.
  • When timing queries, (1) make sure the "Query cache" is not being used, and (2) run the query twice to check for other variations in timing.
  • INDEX(fkey) is INVISIBLE, hence not used. Don't waste learning time on VISIBLE/INVISIBLE, you may never need them in your career.
  • When experimenting, be sure to have more than a few rows in each table, and have their values varying in a realistic way. Else, the Optimizer may take shortcuts that will only confuse your learning experience.
  • And...

    duration : 0.015s / fetch:5.532s (5.760.434 rows)
    duration : 6.625s / fetch:0.000s (1000 rows)
    

Notice how both are about 6 seconds. It is just that the time is split up differently.
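The schema-level suggestions in the bullets above could be applied to the posted schema roughly like this (an untested sketch; the surrogate `id` columns and `SQL_NO_CACHE` are the answer's suggestions, not part of the original script; note that the schema's INVISIBLE index implies MySQL 8.0, where the query cache no longer exists and `SQL_NO_CACHE` is a no-op, while on 5.7 it bypasses the cache):

```sql
-- 1. Give every table a PRIMARY KEY (surrogate AUTO_INCREMENT here, since
--    the generated test data contains duplicate id values)
ALTER TABLE table1 ADD COLUMN id INT AUTO_INCREMENT PRIMARY KEY;
ALTER TABLE table2 ADD COLUMN id INT AUTO_INCREMENT PRIMARY KEY;
ALTER TABLE table3 ADD COLUMN id INT AUTO_INCREMENT PRIMARY KEY;

-- 2. Make the invisible foreign-key index on table2 usable again
ALTER TABLE table2 ALTER INDEX link_table1_table2_idx VISIBLE;

-- 3. Time the query without the query cache (pre-8.0 only)
SELECT SQL_NO_CACHE DISTINCT table3.idtable3
FROM table1
JOIN table2 ON table1.idtable1 = table2.fkey
JOIN table3 ON table2.idtable2 = table3.fkey
WHERE table1.idtable1 = 1;
```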

  • With 6M rows and no DISTINCT, the query can pump out the data immediately, but takes a long time due to network latency.
  • With the DISTINCT, the first row cannot come out until after performing the "de-duplication", which probably involves a "temporary" (see the EXPLAIN) and a sort. So now all of the time is spent computing before any data is sent.
  • The confusion is that you looked only at the "duration", not at the sum of the two times. That is, the total time is the one to note.
  • The DISTINCT one is slightly slower (total time) because of the extra step of collecting and sorting 5.7M rows.
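The split between "duration" and "fetch" can be pictured outside SQL. Without DISTINCT, each row can be emitted as soon as it is produced; with a sort-based DISTINCT (the "Using temporary" in the EXPLAIN), every input row must be collected and sorted before the first result comes out. A minimal pure-Python analogy (an illustration, not MySQL's actual implementation):

```python
# Why DISTINCT moves time from "fetch" into "duration": sort-based
# de-duplication must consume ALL input before emitting anything.

def produce_rows():
    # stand-in for the 5.7M-row join result: many rows, few distinct values
    for i in range(100_000):
        yield i % 1000

def plain_select(rows):
    # no DISTINCT: each row is emitted the moment it is produced, so
    # time-to-first-row ("duration") is tiny and "fetch" dominates
    for row in rows:
        yield row

def distinct_select(rows):
    # sort-based DISTINCT: sorted() blocks until every input row has been
    # collected, so all the work happens before the first row comes out
    previous = object()
    for row in sorted(rows):
        if row != previous:
            yield row
            previous = row

plain = list(plain_select(produce_rows()))
distinct = list(distinct_select(produce_rows()))
print(len(plain), len(distinct))  # -> 100000 1000
```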
Rick James
  • 'the first row cannot come out until after preforming the "de-duplication"'--To be pedantic, the second row. – philipxy Jan 17 '20 at 22:14
  • @philipxy - That depends on the algorithm. I think it chooses between sorting the data versus building a hash in RAM. Evidence: Sometimes `DISTINCT` is sorted, sometimes not. – Rick James Jan 17 '20 at 22:55
  • The quoted statement is about output of some vague class of implementations that do something then "de-duplicate". The last step of the something, or the first step of the "de-duplicate", could be to output any row. So the statement is false. Moreover, as a true-or-false statement it doesn't "depend on" anything. I am sure we would agree on certain other statements being true & your comment being true of them, but they're not what you wrote. (My original point was just to point out an edge case.) – philipxy Jan 17 '20 at 23:54
  • 1) I omitted the primary key because it would be more effort to put data in. Just don't ask... But it does not affect the performance here. 2) This cache thing has killed my benchmarking so often... but not this time. How can you avoid using the cache? 3) Invisible index: okay, but it didn't change anything in performance. 4) The data was realistic. More or less... I know what you mean. Rest) I can add 2 numbers and see the sum, but thanks for this math lesson ;) I was wondering why this distinct takes so long and I wanted to speed it up. I will prob. not increase network speed. – Langer Jan 20 '20 at 06:17
  • @Langer - The "Query cache" can be avoided on selects thus: `SELECT SQL_NO_CACHE ...`. – Rick James Jan 20 '20 at 06:19
  • So the question remains: How can I speed up the distinct or is this a realistic maximum? – Langer Jan 20 '20 at 06:20
  • @Langer - I claim that `DISTINCT` cost only 20% of the total (5.5s vs 6.6s). And that 1.1s is not bad for dedupping 5.7M rows. Next I claim that "time-to-first-byte" is a useless metric, since the 'consumer' of the data will be bogged down handling 5.7M rows. And I claim that you should focus on avoiding the 5.7M, perhaps by figuring out what the user _really_ needs and send him only that. – Rick James Jan 20 '20 at 06:26
  • @Langer - Oh, what was the meaning of "(1000 rows)"? – Rick James Jan 20 '20 at 06:29
  • I get back 1000 rows, which is what the user "really needs". This is why I want the distinct. – Langer Jan 20 '20 at 06:53
  • @Langer - Oh, there are 5.7M rows in the table, but only 1K distinct values? Well, that is both filtered and de-dupped. Is `table2` a many-to-many mapping table? If so, see the tips here: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table – Rick James Jan 20 '20 at 07:53
  • @RickJames - I'm not sure how to implement it in my example. Could you do this for this small MCRE, so that I could see how it works and directly benchmark it? This would help me a lot! – Langer Jan 20 '20 at 10:24
  • @Langer - Let's start by using realistic table names like student/class/course/instructor or customer/order/order_item. Then state the question as "how many different _customers_ are there". – Rick James Jan 20 '20 at 18:22
  • @RickJames: We have 3 tables: product->chips->results. Those results have test point names (tpNames). Every product family has a lot of chips, and those chips have a lot of results, but only ~1000 distinct test point names. Now when I say "give me all the tpNames for the product 'SomeProduct'", I want all tpNames, but as a distinct list. So maybe for that product we only have 100 test point names, but 1M results, because of the number of chips. – Langer Jan 21 '20 at 06:50
  • @RickJames: I'm still struggling to get better performance with the distinct query. Do you have more suggestions? – Langer Jan 28 '20 at 08:18
  • @Langer - Next suggestion is to try something totally different. Then a 3rd schema and query. There are a thousand aspects of databases; you have spent 11 days on this one aspect. It will be years before you get through... – Rick James Jan 28 '20 at 16:12
  • @Langer - and... How about "when to use `GROUP BY` versus `DISTINCT`"? Should they ever be used at the same time? (No.) What about without a `JOIN`? (There is at least one nifty optimization that can happen.) Are there different optimizations for different situations? (Yes.) Does anyone use `VARCHAR(45)` for keys? (Rarely for PK.) Would `INT` be 'better'? What is the overhead of a `FOREIGN KEY`? (I still don't have a good handle on that question.) – Rick James Jan 28 '20 at 16:18
  • @Langer - Or... Sorry, I have run out of ideas on this one Question about `DISTINCT`. – Rick James Jan 28 '20 at 16:19