Efficient MySQL query to find entries in A where not matched in B

Question

I have a couple of tables (products and suppliers) and want to find out which items are no longer listed in the suppliers table.

Table uc_products has the products. Table uc_supplier_csv has supplier stocks. uc_products.model joins against uc_suppliers.sku.

I am seeing very long queries when trying to identify the stock in the products table which are not referred to in the suppliers table. I only want to extract the nid of the entries which match; sid IS NULL is just so I can identify which items don't have a supplier.

For the first of the queries below, it takes the DB server (4GB ram / 2x 2.4GHz intel) an hour to get a result (507 rows). I didn't wait for the second query to finish.

How can I make this query more optimal? Is it due to the mismatched character sets?

I was thinking that the following would be the most efficient SQL to use:

         SELECT nid, sid 
           FROM uc_products p
LEFT OUTER JOIN uc_supplier_csv c
             ON p.model = c.sku
         WHERE sid IS NULL ;

For this query, I get the following EXPLAIN result:

mysql> EXPLAIN SELECT nid, sid FROM uc_products p LEFT OUTER JOIN uc_supplier_csv c ON p.model = c.sku WHERE sid IS NULL;
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows   | Extra                   |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------------------+
|  1 | SIMPLE      | p     | ALL  | NULL          | NULL | NULL    | NULL |   6526 |                         | 
|  1 | SIMPLE      | c     | ALL  | NULL          | NULL | NULL    | NULL | 126639 | Using where; Not exists | 
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------------------+
2 rows in set (0.00 sec)

I would have thought that the keys idx_sku and idx_model would be valid for use here, but they aren't. Is that because the tables' default charsets do not match? One is UTF-8 and one is latin1.

I also considered this form:

SELECT nid 
  FROM uc_products 
 WHERE model 
NOT IN ( 
         SELECT DISTINCT sku FROM uc_supplier_csv 
       ) ;

EXPLAIN shows the following results for that query:

mysql> explain select nid from uc_products where model not in ( select sku from uc_supplier_csv ) ;
+----+--------------------+-----------------+-------+-----------------------+---------+---------+------+--------+--------------------------+
| id | select_type        | table           | type  | possible_keys         | key     | key_len | ref  | rows   | Extra                    |
+----+--------------------+-----------------+-------+-----------------------+---------+---------+------+--------+--------------------------+
|  1 | PRIMARY            | uc_products     | ALL   | NULL                  | NULL    | NULL    | NULL |   6520 | Using where              | 
|  2 | DEPENDENT SUBQUERY | uc_supplier_csv | index | idx_sku,idx_sku_stock | idx_sku | 258     | NULL | 126639 | Using where; Using index | 
+----+--------------------+-----------------+-------+-----------------------+---------+---------+------+--------+--------------------------+
2 rows in set (0.00 sec)

And just so I don't miss anything out, here are a few more exciting details: the table sizes and stats, and the table structure :)

mysql> show table status where Name in ( 'uc_supplier_csv', 'uc_products' ) ;
+-----------------+--------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
| Name            | Engine | Version | Row_format | Rows   | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time         | Update_time         | Check_time          | Collation         | Checksum | Create_options | Comment |
+-----------------+--------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
| uc_products     | MyISAM |      10 | Dynamic    |   6520 |             89 |      585796 | 281474976710655 |       232448 |       912 |           NULL | 2009-04-24 11:03:15 | 2009-10-12 14:23:43 | 2009-04-24 11:03:16 | utf8_general_ci   |     NULL |                |         | 
| uc_supplier_csv | MyISAM |      10 | Dynamic    | 126639 |             26 |     3399704 | 281474976710655 |      5864448 |         0 |           NULL | 2009-10-12 14:28:25 | 2009-10-12 14:28:25 | 2009-10-12 14:28:27 | latin1_swedish_ci |     NULL |                |         | 
+-----------------+--------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+

and

CREATE TABLE `uc_products` (
  `vid` mediumint(9) NOT NULL default '0',
  `nid` mediumint(9) NOT NULL default '0',
  `model` varchar(255) NOT NULL default '',
  `list_price` decimal(10,2) NOT NULL default '0.00',
  `cost` decimal(10,2) NOT NULL default '0.00',
  `sell_price` decimal(10,2) NOT NULL default '0.00',
  `weight` float NOT NULL default '0',
  `weight_units` varchar(255) NOT NULL default 'lb',
  `length` float unsigned NOT NULL default '0',
  `width` float unsigned NOT NULL default '0',
  `height` float unsigned NOT NULL default '0',
  `length_units` varchar(255) NOT NULL default 'in',
  `pkg_qty` smallint(5) unsigned NOT NULL default '1',
  `default_qty` smallint(5) unsigned NOT NULL default '1',
  `unique_hash` varchar(32) NOT NULL,
  `ordering` tinyint(2) NOT NULL default '0',
  `shippable` tinyint(2) NOT NULL default '1',
  PRIMARY KEY  (`vid`),
  KEY `idx_model` (`model`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 

CREATE TABLE `uc_supplier_csv` (
  `sid` int(10) unsigned NOT NULL default '0',
  `sku` varchar(255) default NULL,
  `stock` int(10) unsigned NOT NULL default '0',
  `list_price` decimal(8,2) default '0.00',
  KEY `idx_sku` (`sku`),
  KEY `idx_stock` (`stock`),
  KEY `idx_sku_stock` (`sku`,`stock`),
  KEY `idx_sid` (`sid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1

EDIT: Adding query plans for a couple of suggested queries from Martin below:

mysql> explain SELECT nid FROM uc_products p WHERE NOT EXISTS ( SELECT 1 FROM uc_supplier_csv c WHERE p.model = c.sku ) ;
+----+--------------------+-------+-------+---------------+---------+---------+------+--------+--------------------------+
| id | select_type        | table | type  | possible_keys | key     | key_len | ref  | rows   | Extra                    |
+----+--------------------+-------+-------+---------------+---------+---------+------+--------+--------------------------+
|  1 | PRIMARY            | p     | ALL   | NULL          | NULL    | NULL    | NULL |   6526 | Using where              | 
|  2 | DEPENDENT SUBQUERY | c     | index | NULL          | idx_sku | 258     | NULL | 126639 | Using where; Using index | 
+----+--------------------+-------+-------+---------------+---------+---------+------+--------+--------------------------+
2 rows in set (0.00 sec)

mysql> explain SELECT nid FROM uc_products WHERE model NOT IN ( SELECT sku  FROM uc_supplier_csv ) ;
+----+--------------------+-----------------+-------+-----------------------+---------+---------+------+--------+--------------------------+
| id | select_type        | table           | type  | possible_keys         | key     | key_len | ref  | rows   | Extra                    |
+----+--------------------+-----------------+-------+-----------------------+---------+---------+------+--------+--------------------------+
|  1 | PRIMARY            | uc_products     | ALL   | NULL                  | NULL    | NULL    | NULL |   6526 | Using where              | 
|  2 | DEPENDENT SUBQUERY | uc_supplier_csv | index | idx_sku,idx_sku_stock | idx_sku | 258     | NULL | 126639 | Using where; Using index | 
+----+--------------------+-----------------+-------+-----------------------+---------+---------+------+--------+--------------------------+
2 rows in set (0.00 sec)

Your use of HAVING in the first query is incorrect -- since there's no GROUP BY, it should be a simple WHERE. Not sure why MySQL isn't giving you an error message, but I guess that's what messing up the query plan! — Alex Martelli, Oct 12 '09 at 04:28
I tested the four query forms on this page last night on my laptop (MBP2.4GHz/4GB/OSX/MAMP MySQL). * The LEFT OUTER JOIN form above took 3526s to execute. * The subselect form above took 1021s to execute. * Martin's suggestion below took 637s to execute. * James's was marginally faster than Martin's, but returned a different result from the other three forms. — Chris Burgess, Oct 12 '09 at 20:00

martin clayton · Accepted Answer · 2009-10-13T07:31:30.640

3

Perhaps try using NOT EXISTS rather than counts? For example:

SELECT nid 
  FROM uc_products p
 WHERE NOT EXISTS ( 
       SELECT 1 
         FROM uc_supplier_csv c
        WHERE p.model = c.sku
       )

SO user Quassnoi has a short article outlining some tests that suggest that this might also be worth a try:

SELECT nid 
  FROM uc_products
 WHERE model NOT IN ( 
       SELECT sku 
       FROM uc_supplier_csv
       )

basically as per your original query, without the DISTINCTion.

Another one for you Chris, this time with help for the cross-encoding join:

SELECT nid
  FROM uc_products p
 WHERE NOT EXISTS (
       SELECT 1
       FROM uc_supplier_csv c
       WHERE CONVERT( p.model USING latin1 )  = c.sku
       )

edited Oct 13 '09 at 07:31

answered Oct 12 '09 at 07:26

martin clayton

76,436
32
213
198

This query was the fastest suggested solution to return the correct result. It took 637s to execute. – Chris Burgess Oct 12 '09 at 19:56
What did the query plan look like? – martin clayton Oct 12 '09 at 21:03
added to question (formatting isn't working for me in the comments?) – Chris Burgess Oct 13 '09 at 04:38
Third time lucky? See above. – martin clayton Oct 13 '09 at 07:32
Wow. 0.17s instead of ten minutes. Nice. The third example (using WHERE CONVERT() above) gives the right answer almost instantly. Is it about the CONVERT you think? I had updated my local table to be DEFAULT CHARSET UTF8 but didn't see any change in behaviour from doing that. – Chris Burgess Oct 13 '09 at 19:18
The convert helps the optimiser when there is a 'cross encoding' needed. Best if you can match encodings in the join fields though. If you applied 'ALTER TABLE uc_supplier_csv DEFAULT CHARACTER SET utf8' you should see (using SHOW CREATE TABLE) that the existing sku field data has not been converted - so the optimiser still needs help. If you can apply 'ALTER TABLE uc_supplier_csv CONVERT TO CHARACTER SET utf8' that should migrate the data over, and the convert should no longer be needed. – martin clayton Oct 13 '09 at 21:17

Efficient MySQL query to find entries in A where not matched in B

1 Answers1