Efficient search in many-to-many related tables

Question

I have two tables related many-to-many via third connecting table: products and categories. Each product can be in a several categories. It's a typical many-to-many realationship:

products
-------------
id
product_name


categories
-------------
id
category_name


products_to_categories
-------------
product_id
caregory_id

I want to allow users to do a search for the products, that are in some of the chosen categories and are not in other chosen categories at the same time.

Example: find all producst that are in categories "Computers" and "Software", but are not in categories "Games", "Programming" and "Education".

Here is the query I designed to do this:

SELECT product_name
FROM products
WHERE
    EXISTS (SELECT product_id FROM products_to_categories WHERE category_id = 1 AND product_id = products.id) 
    AND EXISTS (SELECT product_id FROM products_to_categories WHERE category_id = 2 AND product_id = products.id) 
    AND NOT EXISTS (SELECT product_id FROM products_to_categories WHERE category_id = 3 AND product_id = products.id)
    AND NOT EXISTS (SELECT product_id FROM products_to_categories WHERE category_id = 4 AND product_id = products.id) 
    AND NOT EXISTS (SELECT product_id FROM products_to_categories WHERE category_id = 5 AND product_id = products.id)
ORDER BY id

It works. But it's so incredibly slow, that I just can't use it in production. All idexes are in place, but this query results in 5 dependant subqueries and the tables are huge.

Is there a way to solve the same task without dependant subqueries or optimize this query in some other way?

UPDATE

Indexes are:

products: PRIMARY KEY (id)
categories: PRIMARY KEY (id)
products_to_categories: PRIMARY KEY (product_id, caregory_id)

All tables are InnoDB

EXISTS is slow because it looks at every value. Try limiting the results you get and put a "more button" or other technique. — AlfredoVR, Feb 11 '12 at 17:36
@Silver Light: "indexes are in place". What indexes do you have? (especially in the `products_to_categories` table) — ypercubeᵀᴹ, Feb 11 '12 at 17:56
Try adding a `(caregory_id, product_id)` index. It would really help with any version of your query (and with other searches you might have in ture). — ypercubeᵀᴹ, Feb 11 '12 at 21:25

ypercubeᵀᴹ · Accepted Answer · 2012-02-12T10:51:57.227

2

Please post the tables' definitions (so the engine used and the indexes defined are shown).

You could also post the execution plan of your query (using the EXPLAIN statement).

You could also try rewriting the query in various ways. Here's one:

SELECT p.product_name
FROM products  AS p
  JOIN products_to_categories  AS pc1
    ON pc1.category_id = 1 
    AND pc1.product_id = p.id
  JOIN products_to_categories  AS pc2
    ON  pc2.category_id = 2 
    AND pc2.product_id = p.id
WHERE
    NOT EXISTS 
    ( SELECT * 
      FROM products_to_categories  AS pc 
      WHERE pc.category_id IN (3, 4, 5)
        AND pc.product_id = p.id
    )

Update: You don't have a (category_id, product_id) index. Try adding it.

edited Feb 12 '12 at 10:51

answered Feb 11 '12 at 18:02

ypercubeᵀᴹ

113,259
19
174
235

I think it would be worth trying this query with the `NOT EXISTS` changed for a `LEFT JOIN` instead. It would also help to see the output from `SHOW INDEXES FROM products_to_categories`. – user1191247 Feb 11 '12 at 18:32
This is a great optomization. Exactly what I was looking for, thank you – Silver Light Feb 12 '12 at 10:49
@Silver Light: Have you added that index? – ypercubeᵀᴹ Feb 12 '12 at 10:50
@ypercube, yes, and several others to fit more complex conditions. – Silver Light Feb 12 '12 at 10:59
Do you have other columns, too, in the intermediate table? – ypercubeᵀᴹ Feb 12 '12 at 11:00

score 0 · Answer 2 · edited Feb 11 '12 at 18:17

SELECT product_name
FROM products
-- we can use an inner join as an optimization, as some categories MUST exist
INNER JOIN products_to_categories ON products.product_id=products_to_categories.product_id
WHERE 
  products_to_categories.category_id NOT IN (3,4,5) -- substitute unwanted category IDs
  AND EXISTS (SELECT product_id FROM products_to_categories WHERE category_id = 1 AND product_id = products.id) 
  AND EXISTS (SELECT product_id FROM products_to_categories WHERE category_id = 2 AND product_id = products.id)

Optimist · Answer 3 · 2012-02-11T17:53:12.947

0

I removed my answer because other answers were more comprehensive. Just a general tip. To reduce the number of AND's in your statements you can use the IN operator to check for multiple categories

where category_id IN(1,2)

or

where category_id NOT IN(1,2)

edited Feb 11 '12 at 17:53

answered Feb 11 '12 at 17:35

Optimist

387
1
5

The `NOT IN(1,2)` is same as the OP's query. `a NOT IN(1,2)` means `NOT a=1 AND NOT a=2`. But the `IN(1,2)` is not the same. `x IN(1,2)` is the same as `x=1 OR x=2` – ypercubeᵀᴹ Feb 12 '12 at 11:03

score 0 · Answer 4 · answered Feb 11 '12 at 17:45

I think you want to avoid the in clauses because SQL server will do multiple queries or do an "or", which will be less efficient than what I'm pasting below, because it may not be able to take advantage of indexes.

You could also get rid of the #product_categories_filtered temp table and do it all in one big query, and use aliased sub-queries if you want to. You might want to play around with different configs and see which one is best, but temp tables have never been a performance issue in my app unless someone tries to query something with dozens of millions of records. I used #product_categories_filtered because in some cases SQL server queries run better when you break the queries down to use fewer joins, especially on larger tables like your product one.

create table #includes (category_id int not null primary key)
create table #excludes (category_id int not null primary key)

insert #includes (category_id) 
    select 1
    union all select 2
insert #excludes (category_id) 
    select 3
    union all select 4
    union all select 5

select 
  pc.product_id
into #product_catories_filtered
from 
  product_categories pc
  join #includes i 
    on pc.category_id = i.category_id
  left join #excludes e 
    on pc.category_id = i.category_id
where 
  e.category_id is null


select distinct
  p.product_name
from 
  #product_categories_filtered pc
  join products p
    on pc.product_id = p.id
order by 
  p.id

Efficient search in many-to-many related tables

4 Answers4