3

I have one table that I'm using to build groups with in my database. The table contains a list of group names and ids. I have another table that has users, and a third table showing the relationships. (userid, groupid).

The situation is this, I need to create a list of userids that belong to a specific subset of groups. So for instance, I want all users that are in group 1, 3, and 8. That is straight forward enough. It gets more complicated though, I may need a list of all users that are in groups 1, 3, and 8, or 1, 2, and 8. Then I might need to exclude users that fit that criteria, but are also in group 27.

So I've got a script dynamically creating a query, using sub queries that works to a point. I have two problems with it. I don't think I'm handling the not-in part properly, because as I ad criteria, eventually, it just kinda hangs. (I think this is a result of me using sub-selects instead of joins, but I could not figure out how to build this with joins.)

Here is an example of a query with 4 ANDed OR groups, and 2 NOT clauses.

Please let me know if there is a better way to optimize this stmt. (I can handle the dynamic building of it in PHP)

If I need to clarify anything or provide more details, let me know.


select * from users_table where username IN
(
    select user_id from
    (
        select distinct user_id from group_user_map where user_id in 
        (
            select user_id from 
            (
                select * from 
                (
                    select count(*) as counter, user_id from  
                    (
                        (
                            select distinct(user_id) from group_user_map where group_id in (2601,119)
                        ) 
                        union all
                        (
                            select distinct(user_id) from group_user_map where group_id in (58,226)
                        ) 
                        union all
                        (
                            select distinct(user_id) from group_user_map where group_id in (1299,525)
                        ) 
                        union all
                        (
                            select distinct(user_id) from group_user_map where group_id in (2524,128)
                        ) 
                    ) 
                    thegroups group by user_id
                ) 
                getall where counter = 4
            ) 
            getuserids
        ) 
        and user_id not in 
        (
            select user_id from group_user_map where group_id in (2572)
        ) 
    ) 
    biggergroup 
);

Note, the first part of the query is comparing an id to a username. This is because I have the usernames stored as id's from the other table. (This whole thing is a link between two completely different databases).

(Also, if it looks like I have any extra sub-queries, that was to try to force mysql to evaluate the inner queries first.)

Thanks.

Aaron.

SilicaGel
  • 459
  • 3
  • 11
  • 1
    Silica, what does mean group_id IN (2601,119). I'd imagine you at this point are looking for just one group (1, 3 or 8). Why do you have a IN and 2 ids? – medina Apr 26 '13 at 02:38
  • Aaron, I don't think this does what you think it does. I suggest you provide a simplified version of the problem in the form of a set of CREATE and INSERT statements, the search criteria, and the desired resultset. – Strawberry Apr 27 '13 at 10:21
  • Please provide script structure/data sample for these tables. – Ronak Vyas Apr 28 '13 at 18:27

4 Answers4

1

It would be easier to understand your problem if you post the table structure and some sample data. But here are a few suggestions based on your current query that you might be able to use.

These queries reduce the number of subqueries that you are using. One of the obvious changes is the difference in the way it gets the list of user_id's with each group:

select user_id
from group_user_map 
where group_id in (2601,119)
union all
select user_id 
from group_user_map 
where group_id in (58,226)
union all
select user_id 
from group_user_map 
where group_id in (1299,525)
union all
select user_id 
from group_user_map 
where group_id in (2524,128);

This uses a UNION ALL which will list all of the user_id even if they are duplicated. Once you have this list of user_id's then you get the count by applying a count(distinct user_id) and use a HAVING clause to find those that have 4 occurrences.

First, you could consolidate your current query to the following version in a WHERE clause:

select * 
from users_table 
where username IN (select user_id
                  from
                  (
                    select user_id
                    from group_user_map 
                    where group_id in (2601,119)
                    union all
                    select user_id 
                    from group_user_map 
                    where group_id in (58,226)
                    union all
                    select user_id 
                    from group_user_map 
                    where group_id in (1299,525)
                    union all
                    select user_id 
                    from group_user_map 
                    where group_id in (2524,128)
                  ) thegroups
                  where user_id not in (select user_id 
                                        from group_user_map 
                                        where group_id in (2572)) 
                  group by userid
                  having count(distinct userid) = 4);

Or you could use the query in the WHERE clause in a subquery that you JOIN to:

select ut.* 
from users_table ut
inner join
(
  select user_id
  from
  (
    select user_id
    from group_user_map 
    where group_id in (2601,119)
    union all
    select user_id 
    from group_user_map 
    where group_id in (58,226)
    union all
    select user_id 
    from group_user_map 
    where group_id in (1299,525)
    union all
    select user_id 
    from group_user_map 
    where group_id in (2524,128)
  ) thegroups
  where user_id not in (select user_id 
                        from group_user_map 
                        where group_id in (2572)) 
  group by userid
  having count(distinct userid) = 4
) biggergroup
  on ut.username = biggergroup.user_id;
Taryn
  • 242,637
  • 56
  • 362
  • 405
  • @SilicaGel : Is AlL UNION Statements fetches data from different databases(tables). If not then please also try to optimize UNIONS and NOT IN clause also can be excluded because that group_id will never come. Thanks. – Ronak Vyas Apr 29 '13 at 12:03
1

Avoiding subselects used for IN clauses:-

SELECT * 
FROM users_table
INNER JOIN 
(
    SELECT Sub1.user_id 
    FROM (
            SELECT COUNT(*) AS counter, user_id   
            FROM (
                SELECT distinct(user_id) FROM group_user_map WHERE group_id IN (2601,119)
                UNION ALL
                SELECT distinct(user_id) FROM group_user_map WHERE group_id IN (58,226)
                UNION ALL
                SELECT distinct(user_id) FROM group_user_map WHERE group_id IN (1299,525)
                UNION ALL
                SELECT distinct(user_id) FROM group_user_map WHERE group_id IN (2524,128)
            ) thegroups
            GROUP BY user_id
            HAVING counter = 4
    ) Sub1
    LEFT OUTER JOIN (SELECT user_id FROM group_user_map WHERE group_id IN (2572)) Sub2
    ON group_user_map.user_id = Sub2.user_id
    WHERE Sub2.user_id IS NULL
) Sub3
ON  users_table.username = Sub3.user_id

Or avoiding using the COUNTs to check that the user id exists in all 4 tables, instead using inner joins

SELECT * 
FROM users_table
INNER JOIN 
(
    SELECT Sub1.user_id 
    FROM (
        SELECT z.user_id   
        FROM (
            SELECT distinct(user_id) FROM group_user_map WHERE group_id IN (2601,119)) z
            INNER JOIN
            (SELECT distinct(user_id) FROM group_user_map WHERE group_id IN (58,226)) y ON z.user_id = y.user_id
            INNER JOIN
            (SELECT distinct(user_id) FROM group_user_map WHERE group_id IN (1299,525)) x ON z.user_id = x.user_id
            INNER JOIN
            (SELECT distinct(user_id) FROM group_user_map WHERE group_id IN (2524,128)) w ON z.user_id = w.user_id
    ) Sub1
    LEFT OUTER JOIN (SELECT user_id FROM group_user_map WHERE group_id IN (2572)) Sub2
    ON group_user_map.user_id = Sub2.user_id
    WHERE Sub2.user_id IS NULL
) Sub3
ON  users_table.username = Sub3.user_id

Cleaning up that 2nd query a bit

SELECT * 
FROM users_table
INNER JOIN 
(
    SELECT z.user_id   
    FROM (SELECT distinct(user_id) FROM group_user_map WHERE group_id IN (2601,119)) z
    INNER JOIN (SELECT distinct(user_id) FROM group_user_map WHERE group_id IN (58,226)) y 
    ON z.user_id = y.user_id
    INNER JOIN (SELECT distinct(user_id) FROM group_user_map WHERE group_id IN (1299,525)) x 
    ON z.user_id = x.user_id
    INNER JOIN (SELECT distinct(user_id) FROM group_user_map WHERE group_id IN (2524,128)) w 
    ON z.user_id = w.user_id
    LEFT OUTER JOIN (SELECT user_id FROM group_user_map WHERE group_id IN (2572)) Sub2
    ON z.user_id = Sub2.user_id
    WHERE Sub2.user_id IS NULL
) Sub3
ON  users_table.username = Sub3.user_id

Using your SQL in the comment below, it can be cleaned up to :-

select SQL_NO_CACHE id 
from users_table 
INNER JOIN ( SELECT distinct(user_id) FROM group_user_map WHERE group_id IN (0, 67) ) ij1 
ON users_table.username = ij1.user_id 
LEFT OUTER JOIN ( SELECT user_id FROM group_user_map WHERE group_id IN (0) ) Sub2 
ON users_table.username = Sub2.user_id 
WHERE Sub2.user_id IS NULL 

Cleaning up my SQL in the same way:-

SELECT users_table.* 
FROM users_table
INNER JOIN (SELECT distinct(user_id) FROM group_user_map WHERE group_id IN (2601,119)) z ON users_table.username = z.user_id
INNER JOIN (SELECT distinct(user_id) FROM group_user_map WHERE group_id IN (58,226)) y ON users_table.username = y.user_id
INNER JOIN (SELECT distinct(user_id) FROM group_user_map WHERE group_id IN (1299,525)) x ON users_table.username = x.user_id
INNER JOIN (SELECT distinct(user_id) FROM group_user_map WHERE group_id IN (2524,128)) w ON users_table.username = w.user_id
LEFT OUTER JOIN (SELECT user_id FROM group_user_map WHERE group_id IN (2572)) Sub2 ON users_table.username = Sub2.user_id
WHERE Sub2.user_id IS NULL

Removing the subselects and doing the joins directly (might help or hinder, suspect it will depend on how many duplicate user_id records there are for each set of group_id records)

SELECT DISTINCT users_table.* 
FROM users_table
INNER JOIN group_user_map z ON users_table.username = z.user_id AND z.group_id IN (2601,119)
INNER JOIN group_user_map y ON users_table.username = y.user_id AND y.group_id IN (58,226)
INNER JOIN group_user_map x ON users_table.username = x.user_id AND x.group_id IN (1299,525)
INNER JOIN group_user_map w ON users_table.username = w.user_id AND w.group_id IN (2524,128)
LEFT OUTER JOIN group_user_map Sub2 ON users_table.username = Sub2.user_id AND Sub2.group_id IN (2572)
WHERE Sub2.user_id IS NULL
Kickstart
  • 21,403
  • 2
  • 21
  • 33
  • Hey Kickstart, this last query works really well using the criteria in my example. However, as I'm building this whole query dynamically, this query seems to take forever when i only have one IN, and one NOT IN. (Basically deleting the three inner joins of y,x and w) Any idea why, or if I can fix this? – SilicaGel May 08 '13 at 19:10
  • What kind of record numbers is it dealing with? Could you do an EXPLAIN for it and post the results? Wonder if the first IN is resulting in lots of records and no suitable index for the join (might be worth trying a covering index on group_id and user_id) – Kickstart May 08 '13 at 23:32
  • Hey Kickstart, i've zipped up a simplified version of my data. There are indeed a lot of records, which is probably part of the problem. The zipped sql file is available here: https://dl.dropboxusercontent.com/u/8317351/user_group_db.zip Please let me know if you can think of a way to better optimize this. I really appreciate your help on this. – SilicaGel May 13 '13 at 19:15
  • Been having a play and struggling to get it to perform. The killer seems to be the group_id field being a varchar. I swapped that for an int with the test data and it was MASSIVELY faster. Could it be changed to an INT field on you real data? Or maybe put to a different table with just the int id on that being used in the group_user_map table. – Kickstart May 13 '13 at 21:50
  • Hi Kickstart, first, thanks a lot for you help with this. I have changed the field to an INT, but it's still taking 6-20 seconds to run some of the queries. The problem is that it will have to run these queries (upwards of 2-3000 nightly.) which could take many hours (way too long). This is one of the simplest queries. (They're built dynamically, which is why i have the extra 1=1 at the top, and a 0 id join at the bottom. (next comment.) I'd appreciate it if we could have a Skype or email conversation about this. Let me know if you'd be willing to do that. Thanks – SilicaGel May 28 '13 at 15:32
  • select id from users_table where username in ( SELECT z.user_id FROM ( SELECT distinct(user_id) FROM group_user_map WHERE 1=1 ) z INNER JOIN ( SELECT distinct(user_id) FROM group_user_map WHERE group_id IN (0, 67) ) ij1 ON z.user_id = ij1.user_id LEFT OUTER JOIN ( SELECT user_id FROM group_user_map WHERE group_id IN (0) ) Sub2 ON z.user_id = Sub2.user_id WHERE Sub2.user_id IS NULL ) – SilicaGel May 28 '13 at 15:33
  • @SilicaGel - I have had a quick play with your SQL there. The subselect with WHERE 1=1 seems to cause a major amount of processing (that will bring back every single user id). Removing that (as unnecessary) results in the query taking 0.0017 seconds on my machine, as opposed to 0.5 seconds with it in place. I will put up the SQL I used in the main answer – Kickstart May 28 '13 at 16:55
  • Wow, that last query in your answer REALLY makes a difference, 30 second queries are down to 900ms. Thanks a lot, I was almost ready to write some php code to handle it instead. – SilicaGel May 28 '13 at 18:48
0

It's not exactly clear what you mean when you say "I want all users that are in group 1, 3, and 8" and then write

select distinct(user_id) from group_user_map where group_id in (58,226)

because the English suggests you want a user who is in all three groups but the SQL gives you users who are in any 1 of the groups. So you need to be clearer about what exactly you want.

It's a little hard to believe that you are trying to find users that are in all of 4 supergroups with each supergroup being made of exactly 2 groups. It makes me question what you are doing and why.

There are a few different approaches I can think of depending on what you are really going to run into. Obviously the simplest is to break it into multiple queries and combine the results in your code. You can auto-join the group table if it's not too big, but it probably is too big to to join 3 times. You might get better performance with NOT EXISTS than with NOT IN but probably not. You can try to further leverage aggregation functions with CASE functions to compute success values in an intermediate table, but that's getting pretty crazy. More likely you'd be better off reworking your data structure.

The main problem I see with your existing solution is the large number of temporary tables you create. In general you are going to need a temporary table of some kind to do something this complicated so I would focus on limiting it to two tables, each of which is smaller than the relationships table.

Old Pro
  • 24,624
  • 7
  • 58
  • 106
0

Is this right query

  select * from users_table where username IN    
            (
(select distinct(user_id) from group_user_map where group_id in (2601,119)) a 
inner join
(select distinct(user_id) from group_user_map where group_id in (58,226)) b 
on a.user_id = b.user_id inner join 
(select distinct(user_id) from group_user_map where group_id in (1299,525)) c 
on a.user_id = c.user_id inner join 
(select distinct(user_id) from group_user_map where group_id in (2524,128)) d
on a.user_id = d.user_id 
)  and user_id  not in (select user_id from group_user_map where group_id in (2572))

Instead of union all and finally filter with having counter of 4, I replaced with intersect. Kindly check whether result is correct and it runs fast ?

Vinit

Vinit Prajapati
  • 1,593
  • 1
  • 17
  • 29