3

I have different categories of user, and a join table that allows users to be in more than one category. My join table is called categories_users and it consists of a user_id and a category_id.

I want to filter for users that are in both category1 and category2. For example, I want to find everyone that's interested in both baseball and football.

What is the best way to do this in PostgreSQL? I have the following working:

select * from users 
  where users.id IN 
    ( Select categories_users.user_id from categories_users 
      JOIN categories ON categories.id = categories_users.category_id 
      where categories.id = 1 OR categories.parent_id = 1) 
  AND users.id IN 
    (Select categories_users.user_id from categories_users 
    JOIN categories ON categories.id = categories_users.category_id 
    where categories.id = 2 OR categories.parent_id = 2)

However this feels clunky, and I am wondering if there is a better way to do this. I've tried a variety of joins, but always end up searching for rows in the categories_users table that have a category_id of 1 and 2, which isn't possible.

Edit, I actually need to also search on category parent, so I've changed the above query to include the parent_id

wildrhombus
  • 113
  • 7

3 Answers3

2

Just join with the same table twice (using aliases):

SELECT u.*
    FROM users u
    JOIN categories_users cu1 ON cu1.user_id = u.id
    JOIN categories_users cu2 ON cu2.user_id = u.id
    WHERE cu1.category_id = 1 AND cu2.category_id = 2
jcaron
  • 17,302
  • 6
  • 32
  • 46
  • Yes, thank you, that worked. Is this a more efficient query in Postgres? – wildrhombus Apr 18 '17 at 12:35
  • You can check the output of `EXPLAIN` or `EXPLAIN ANALYZE` for both queries to see which one is most efficient. Didn't check, but I wouldn't be surprised if the query plans were exactly the same using sub-queries or joins. Of course, I skipped the unnecessary lookup of `categories` as you already have the ids, but you may need to change that depending on how you find those. And proper indexes will certainly be necessary as soon as you tables grow a bit. – jcaron Apr 18 '17 at 12:38
  • Again thank you, I will try EXPLAIN. I do actually need the categories lookup for other reasons, this is actually a simplified version of what I was trying to do. What would you consider to be proper indexes? Both user_id and category_id are indexes now - created using btree. – wildrhombus Apr 18 '17 at 12:44
  • For categories_users you'll indeed probably need both indexes, as a common plan would be to search categories_users based on the category, find users, and then search categories_users again based on those user ids. But it'll probably depend a lot on the cardinality of the data. – jcaron Apr 18 '17 at 13:17
  • Actually I can't figure out how to do this using the category.parent_id. I've changed the above query to show what I actually want. – wildrhombus Apr 18 '17 at 19:44
  • Just join each categories_users with a category and add condition on each of those. – jcaron Apr 18 '17 at 19:46
  • I tried this, but it returned nothing. SELECT u.* FROM users u LEFT JOIN categories_users cu1 ON cu1.user_id = u.id LEFT JOIN categories c1 ON c1.id = cu1.category_id LEFT JOIN categories_users cu2 ON cu2.user_id = u.id LEFT JOIN categories c2 on c2.id = cu2.category_id WHERE (c1.id = 1 or c1.parent_id = 1) AND (c2.id = 2 or c2.parent_id = 2) – wildrhombus Apr 18 '17 at 20:30
  • Why are you using LEFT JOINs? – jcaron Apr 18 '17 at 20:31
  • Sorry, that wasn't the correct query to post, I am using only joins, I was just trying left joins to see if it made a difference. – wildrhombus Apr 18 '17 at 20:35
  • If it's not actually an answer, post it as an update to your question rather than as an answer. Also, please include a relevant sample of the data. – jcaron Apr 18 '17 at 21:42
  • You are correct, I was using the wrong category id when I was testing it. So this works now. I really appreciate your patience and help!!! – wildrhombus Apr 18 '17 at 21:53
1
select u.*
from
    users u
    inner join (
        select user_id
        from categories_users
        group by user_id
        having
            bool_or(1 in (category_id, parent_id)) and
            bool_or(2 in (category_id, parent_id))
    ) s on s.user_id = u.id
Clodoaldo Neto
  • 118,695
  • 26
  • 233
  • 260
  • I also want to search on the category.parent_id, and I've changed my original query above to show this. How would you add the parent_id into this way of approaching the problem? – wildrhombus Apr 18 '17 at 19:40
1

You can also use COUNT(*) over a partition in order to see how many categories a user has in the set of searched categories.

I created the following sample in order to see how this can be defined and parameterized. I created a function test.find_users_in_categories(BIGINT[]) that accepts the array of categories for which we need the list of users. So the function will return all users that are in all given categories.

Solution for - get the users found in all given categories

CREATE SCHEMA test;

CREATE TABLE test.categories_users (
  category_id BIGINT NOT NULL,
  user_id BIGINT NOT NULL
);

INSERT INTO test.categories_users
  (user_id, category_id)
  VALUES
  (33, 103),
  (34, 104),
  (35, 105),
  (37, 105),
  (35, 106),
  (37, 106);

CREATE OR REPLACE FUNCTION test.find_users_in_categories(BIGINT[])
  RETURNS TABLE (
    user_id BIGINT
  )
AS
$$
DECLARE
  categories ALIAS FOR $1;
BEGIN
  RETURN QUERY
  SELECT t.user_id
    FROM
      (
        SELECT
          cu.user_id,
          cu.category_id,
          COUNT(*) OVER (PARTITION BY cu.user_id ) AS cnt
        FROM test.categories_users AS cu
        WHERE cu.category_id = ANY(categories)
      ) AS t
      WHERE t.cnt = array_length(categories, 1)
      GROUP BY t.user_id;
END;
$$
LANGUAGE plpgsql;

SELECT * FROM test.find_users_in_categories(ARRAY[105, 106]);

DROP SCHEMA test CASCADE;

EDIT - [recursive solution]

Solution for - get the users found in all given categories and sub-categories

Please see the following code about implementing a solution using JOIN + recursive CTE. I used a JOIN instead of COUNT() because it looks better for this case.

CREATE SCHEMA test;

CREATE TABLE test.categories (
  category_id BIGINT PRIMARY KEY,
  parent_id BIGINT REFERENCES test.categories(category_id)
);

CREATE TABLE test.categories_users (
  category_id BIGINT NOT NULL REFERENCES test.categories(category_id),
  user_id BIGINT NOT NULL
);

INSERT INTO test.categories
  (category_id, parent_id)
  VALUES
  (100, NULL),
  (101, 100),
  (102, 100),
  (103, 101),
  (104, 101),
  (105, 101),
  (106, NULL);


INSERT INTO test.categories_users
  (user_id, category_id)
  VALUES
  (33, 103),
  (34, 104),
  (35, 105),
  (37, 105),
  (35, 106),
  (37, 106);


CREATE OR REPLACE FUNCTION test.find_users_in_categories(BIGINT[])
  RETURNS TABLE (
    user_id BIGINT
  )
AS
$$
DECLARE
  main_categories ALIAS FOR $1;
BEGIN
  RETURN QUERY
  WITH
    -- get all main categories and subcategories
    RECURSIVE cte_categories (category_id, main_category_id) AS
    (
      SELECT cat.category_id, cat.category_id AS main_category_id
        FROM test.categories AS cat
        WHERE cat.category_id = ANY(main_categories)
      UNION ALL
      SELECT cat.category_id, cte.main_category_id
        FROM cte_categories AS cte
        INNER JOIN test.categories AS cat
          ON cte.category_id = cat.parent_id
    ),
    -- filter main categories that are found as children of other categories
    cte_categories_unique AS
    (
      SELECT cte.*
        FROM cte_categories AS cte
        LEFT JOIN
        (
          SELECT category_id
            FROM cte_categories
            WHERE category_id <> main_category_id
            GROUP BY category_id
        ) AS to_exclude
          ON cte.main_category_id = to_exclude.category_id
        WHERE to_exclude.category_id IS NULL
    ),
    -- compute the count of main categories
    cte_main_categories_count AS
    (
      SELECT COUNT(DISTINCT main_category_id) AS cnt
        FROM cte_categories_unique
    )
  SELECT t.user_id
    FROM
      (
        -- get the users which are found in each category/sub-category then group them under the main category
        SELECT
          cu.user_id,
          cte.main_category_id
        FROM test.categories_users AS cu
        INNER JOIN cte_categories_unique AS cte
          ON cu.category_id = cte.category_id
        GROUP BY cu.user_id, cte.main_category_id
      ) AS t
      GROUP BY t.user_id
      -- filter users that do not have a match on all main categories or their sub-categories
      HAVING COUNT(*) = (SELECT cnt FROM cte_main_categories_count);
END;
$$
LANGUAGE plpgsql;


SELECT * FROM test.find_users_in_categories(ARRAY[101, 106]);

DROP SCHEMA test CASCADE;
andreim
  • 3,365
  • 23
  • 21
  • I've edited my original query to include category.parent_id in the search, which is what I actually want. How would I include the parent_id if I approached it this way? – wildrhombus Apr 18 '17 at 19:39
  • @wildrhombus it depends on how do you want to behave. So, you have a tree of categories, which users should match your criteria? Lets suppose that you pass two categories as arguments. These two categories are parents of their own sub-trees. What kind of users should match this criteria? A user that is part of at least one category in each sub-tree or a user that has all the categories in every sub-tree? By being part of all categories I mean, if A is parent of B and C, and C is parent of D and E. Then the following might be equivalent - user has A, or user has B and C or user has B, D, E. – andreim Apr 19 '17 at 05:47
  • @wildrhombus should have asked this in the first place: how deep is your tree of categories? 2-3 levels or as many levels as can be supported? – andreim Apr 19 '17 at 05:49
  • A user can belong to a parent category A, and/or one of the subcategories. There are no restriction on how a user is connect to a category/subcategory. – wildrhombus Apr 19 '17 at 12:56
  • @andrea-macarie right now my tree is only 2 levels deep - a top level category has children, the children don't have children. However it's likely that it will change, so my answer is probably - as many levels as can be supported. – wildrhombus Apr 19 '17 at 12:58
  • If a search for the categorys that a user is included in, I would like to also find the users that are in that categories children. – wildrhombus Apr 19 '17 at 12:59
  • yes, but if A is parent of B and C, and C is parent of D and E. And you request all users for category A and the user is only in category D then he will be a valid user? Or in order to be a match he needs to have [A] or [B, C] or [B, D, E]? I mean the complete set of categories of the parent or a single one will suffice? – andreim Apr 19 '17 at 13:02
  • If you have 2 levels deep then you can go with a simple JOIN solution (the one you marked as an answer). If you think that this will change then you can go with JOIN/COUNT + recursive CTE. – andreim Apr 19 '17 at 13:03
  • For right now, it's just two levels deep, so I don't have to worry about D. But it's a valid point, I might have to at some point. For now I'm only going to address two levels and stay with the join. However I'm going to look at the recursive option for the future. Do you know of any good resources for learning about recursive queries? – wildrhombus Apr 19 '17 at 14:01
  • @wildrhombus I edited my answer to include also the recursive solution. For a reference to recursive CTE you can check the docs about [WITH RECURSIVE...](https://www.postgresql.org/docs/9.1/static/queries-with.html). – andreim Apr 19 '17 at 15:07
  • Thank you so much! – wildrhombus Apr 19 '17 at 20:45