2

I am building a sql query with a large set of data but query is too slow

I've got 3 tables; movies, movie_categories, skipped_movies
The movies table is normalized and I am trying to query a movie based on a category while excluding ids from skipped_movies table.

However I am trying to use WHERE IN and WHERE NOT IN to in my query.

movies table has approx. 2 million rows (id, name, score)
movie_categories approx. 5 million (id, movie_id, category_id)
skipped_movies has approx. 1k rows (id, movie_id, user_id)

When the skipped_movies table is very small 10 - 20 rows the query is quite fast. (about 40 - 50 ms) but when the table gets somewhere around 1k of data I get somewhere around 7 to 8 seconds on the query.

This is the query I'm using.

SELECT SQL_NO_CACHE * FROM `movies` WHERE `id` IN (SELECT `movie_id` FROM `movie_categories` WHERE `category_id` = 1) AND `id` NOT IN (SELECT `movie_id` FROM `skipped_movies` WHERE `user_id` = 1) AND `score` <= 9 ORDER BY `score` DESC LIMIT 1;

I've tried many ways that came to mind but this was the fastest one. I even tried the EXISTS method to no extent.

I'm using the SQL_NO_CACHE just for testing.

And I guess that the ORDER BY statement is running very slow.

gblblgl
  • 137
  • 2
  • 12

3 Answers3

1

Assuming that (movie_id,category_id) is unique in movies_categories table, I'd get the specified result using join operations, rather than subqueries.

To exclude "skipped" movies, an anti-join pattern would suffice... that's a left outer join to find matching rows in skipped_movies, and then a predicate in the WHERE clause to exclude any matches found, leaving only rows that didn't have a match.

SELECT SQL_NO_CACHE m.*
  FROM movies m
  JOIN movie_categories c 
    ON c.movie_id = m.id 
   AND c.category_id = 1
  LEFT
  JOIN skipped_movies s
    ON s.movie_id = m.id
   AND s.user_id = 1
 WHERE s.movie_id IS NULL
   AND m.score <= 9
 ORDER
    BY m.score DESC
 LIMIT 1

And appropriate indexes will likely improve performance...

... ON movie_categories (category_id, movie_id)
... ON skipped_movies (user_id, movie_id)
spencer7593
  • 106,611
  • 15
  • 112
  • 140
  • Thanks spencer! This way has gotten it down to 2 seconds. I still have to optimize it further but this has pointed me in the right way. I'll accept your question. – gblblgl Dec 29 '14 at 21:44
  • I've set an index on the score column on the Movies table and now it's 50 ms. Thanks again! – gblblgl Dec 29 '14 at 21:46
  • Yes, a "Using filesort" operation can be expensive on large sets, that whole resultset is going to be sorted BEFORE the LIMIT clause is applied. By adding that index with a leading column of `score`, you probably nudged MySQL into using that index to return rows in the specified order, rather than performing a sort operation. (If that's the case, the `EXPLAIN` output won't include "Using filesort" in the Extra column.) Very nice. – spencer7593 Dec 29 '14 at 21:50
1

Most IN/NOT IN queries can be expressed using JOIN/LEFT JOIN, which usually gives the best performance.

Convert your query to use joins:

SELECT m.*
FROM movies m
JOIN movie_categories mc ON m.id = mc.movie_id AND mc.category_id = 1
LEFT JOIN skipped_movies sm ON m.id = sm.movie_id AND sm.user_id = 1
WHERE sm.movie_id IS NULL
AND score <= 9
ORDER BY score DESC
LIMIT 1
Bohemian
  • 412,405
  • 93
  • 575
  • 722
0

Your query seem to be all right. Just a small tweak need. You can replace * with with the column/attribute names in your table. It will make this query work faster then ever. Since * operation is really slow

ahmad rabbani
  • 323
  • 2
  • 9