Select from table where in other table most efficient way

Question

I have a list of ID numbers as one table in my mySQL database; I have a second table that has From_ID, To_ID, and Frequency columns.

I'd like to make a third table that has the same structure as the second, but with only those rows for which the 'from' and 'to' IDs are in the first table.

The first table has ≈ 80k rows and the second has ≈ 45 million. It is taking so long that the process does not seem to end in a reasonable time (not less than a day).

My current query is as follows:

CREATE table3 AS (SELECT * FROM table2 
                  WHERE from_id IN (SELECT id FROM table1) 
                  AND to_id IN (SELECT id FROM table1);

If anybody could tell me a more efficient way to go about this I would really appreciate it!

score 2 · Accepted Answer · answered Mar 21 '15 at 20:02

2

First, use exists rather than in:

SELECT t2.*
FROM table2 t2
WHERE EXISTS (SELECT 1 FROM table1 t1 WHERE t1.id = t2.from_id) AND
      EXISTS (SELECT 1 FROM table1 t1 WHERE t1.id = t2.to_id);

Then be sure that you have an index on table1(id). The latter is really important.

As a note: you can test the query in a user interface by putting limit 100, then limit 1000 etc. onto the query. This will let you see what the performance is like as the data grows.

answered Mar 21 '15 at 20:02

Gordon Linoff

1,242,037
58
646
786

This worked great - thanks! Went from > 24hrs to 175 seconds. – Pete Ludlow Mar 21 '15 at 20:23
99.8% improvement ;) Actually, the index gets credit for that. – Gordon Linoff Mar 21 '15 at 20:57

Schwern · Answer 2 · 2015-03-21T20:20:33.083

I'd like to make a third table that has the same structure as the second, but with only those rows for which the 'from' and 'to' IDs are in the first table.

This is referred to as "denormalization" and while there are valid reasons to do it, it's not considered good database design and should be avoided.

Presumably you want to do this because your queries are so slow. So let's look at your query.

SELECT *
FROM  table2 
WHERE from_id IN (SELECT id FROM table1) 
  AND to_id   IN (SELECT id FROM table1)

This might be slow if MySQL has to do a full table scan of table1, but it appears to be smart enough to recognize it can use an index.

mysql> explain SELECT * FROM table2                    WHERE from_id IN (SELECT id FROM table1)                    AND to_id IN (SELECT id FROM table1);
+----+-------------+--------+--------+---------------+---------+---------+---------------------+------+-------------+
| id | select_type | table  | type   | possible_keys | key     | key_len | ref                 | rows | Extra       |
+----+-------------+--------+--------+---------------+---------+---------+---------------------+------+-------------+
|  1 | SIMPLE      | table2 | ALL    | NULL          | NULL    | NULL    | NULL                |    4 | Using where |
|  1 | SIMPLE      | table1 | eq_ref | PRIMARY       | PRIMARY | 4       | test.table2.from_id |    1 | Using index |
|  1 | SIMPLE      | table1 | eq_ref | PRIMARY       | PRIMARY | 4       | test.table2.to_id   |    1 | Using index |
+----+-------------+--------+--------+---------------+---------+---------+---------------------+------+-------------+
3 rows in set (0.00 sec)

I think it can be expressed better by explicitly asking for the exact ID in a sub query.

SELECT t2.*
FROM   table2 t2
WHERE  (SELECT 1 FROM table1 t1 WHERE t1.id = t2.from_id)
  AND  (SELECT 1 FROM table1 t1 WHERE t1.id = t2.to_id)

mysql> explain SELECT t2.*     FROM   table2 t2     WHERE  (SELECT 1 FROM table1 t1 WHERE t1.id = t2.from_id)       AND  (SELECT 1 FROM table1 t1 WHERE t1.id = t2.to_id);
+----+--------------------+-------+--------+---------------+---------+---------+-----------------+------+-------------+
| id | select_type        | table | type   | possible_keys | key     | key_len | ref             | rows | Extra       |
+----+--------------------+-------+--------+---------------+---------+---------+-----------------+------+-------------+
|  1 | PRIMARY            | t2    | ALL    | NULL          | NULL    | NULL    | NULL            |    4 | Using where |
|  3 | DEPENDENT SUBQUERY | t1    | eq_ref | PRIMARY       | PRIMARY | 4       | test.t2.to_id   |    1 | Using index |
|  2 | DEPENDENT SUBQUERY | t1    | eq_ref | PRIMARY       | PRIMARY | 4       | test.t2.from_id |    1 | Using index |
+----+--------------------+-------+--------+---------------+---------+---------+-----------------+------+-------------+
3 rows in set (0.00 sec)

It's hard to say which one will be faster, I don't have your data set. So long as table2.from_id, table2.to_id and t1.id are indexed, and they should be so long as they're properly declared as foreign and primary keys, you should be ok.

If it's still not fast enough, instead of denormalizing I would suggest you create a view or a temporary table or a query cache. These can effectively cache the query without having to denormalize. Which you choose depends on how often your data updates and how sensitive your application is to changes.

Thanks for such a detailed answer. I'm going with the denormalization because a) I have to work with the subset a lot b) I need to export the subset and c) the data isn't going to change... I guess b) isn't really a valid reason, but (I think?) a) is. Also Sequel Pro doesn't let you create views... — Pete Ludlow, Mar 21 '15 at 20:26

Select from table where in other table most efficient way

2 Answers2