3

Simplifying, I have four tables.

ref_TagGroup (top-level descriptive containers for various tags)
ref_Tag (tags with name and unique tagIDs)
ref_Product
ref_TagMap (TagID,Container,ContainerType)
A fifth table, ref_ProductFamily exists but is not directly part of this query.

I use the ref_TagMap table to map tags to products, but also to map Tags to TagGroups and also to product families. The ContainerType is set to PROD/TAGGROUP/PRODFAM accordingly.

So, I want to return the tag group, tagname and the number of products AND product families that the tag is mapped to...so results like:

GroupName | TagName | TagHitCnt

My question is, why does the first query come back in milliseconds, the second query comes back in milliseconds but the third query (which is just adding an "OR" condition to include both tag to product and tag to family mappings) takes forever (well, over ten minutes anyway...I haven't let it run all night yet.)

QUERY 1:

SELECT ref_taggroup.groupname,ref_tag.tagname,COUNT(DISTINCT IFNULL(ref_product.familyid,ref_product.id + 100000000),ref_product.name) AS 'taghitcnt' 
FROM (ref_taggroup,ref_tag,ref_product) 
LEFT JOIN ref_tagmap GROUPMAP ON GROUPMAP.containerid=ref_taggroup.groupid 
LEFT JOIN ref_tagmap PRODMAP ON PRODMAP.containerid=ref_product.id 
WHERE 
GROUPMAP.tagid=ref_tag.tagid AND GROUPMAP.containertype='TAGGROUP' 
AND
PRODMAP.tagid=ref_tag.tagid AND PRODMAP.containertype='PROD' 
GROUP BY tagname 
ORDER BY groupname,tagname ;

QUERY 2:

SELECT ref_taggroup.groupname,ref_tag.tagname,COUNT(DISTINCT IFNULL(ref_product.familyid,ref_product.id + 100000000),ref_product.name) AS 'taghitcnt' 
FROM (ref_taggroup,ref_tag,ref_product) 
LEFT JOIN ref_tagmap GROUPMAP ON GROUPMAP.containerid=ref_taggroup.groupid 
LEFT JOIN ref_tagmap PRODFAMMAP ON PRODFAMMAP.containerid=ref_product.familyid 
WHERE 
GROUPMAP.tagid=ref_tag.tagid AND GROUPMAP.containertype='TAGGROUP' 
AND
PRODFAMMAP.tagid=ref_tag.tagid AND PRODFAMMAP.containertype='PRODFAM' 
GROUP BY tagname 
ORDER BY groupname,tagname ;

QUERY 3:

SELECT ref_taggroup.groupname,ref_tag.tagname,COUNT(DISTINCT IFNULL(ref_product.familyid,ref_product.id + 100000000),ref_product.name) AS 'taghitcnt' 
FROM (ref_taggroup,ref_tag,ref_product) 
LEFT JOIN ref_tagmap GROUPMAP ON GROUPMAP.containerid=ref_taggroup.groupid 
JOIN ref_tagmap PRODMAP ON PRODMAP.containerid=ref_product.id 
JOIN ref_tagmap PRODFAMMAP ON PRODFAMMAP.containerid=ref_product.familyid 
WHERE 
GROUPMAP.tagid=ref_tag.tagid AND GROUPMAP.containertype='TAGGROUP' 
AND
((PRODMAP.tagid=ref_tag.tagid AND PRODMAP.containertype='PROD') 
OR 
(PRODFAMMAP.tagid=ref_tag.tagid AND PRODFAMMAP.containertype='PRODFAM' ))
GROUP BY tagname 
ORDER BY groupname,tagname ;

-- To answer a question that may come up, the COUNT Distinct ifnull in the select is designed to return one record for large numbers of products that are grouped into families and one record for each 'standalone' product that isn't in a family as well. This code works well in other queries.

I've tried doing a UNION on the first two queries, and that works and comes back very quickly, but it's not practical for other reasons that I won't go into here.

What is the best way to do this? What am I doing wrong?

Thanks!


ADDING EXPLAIN OUTPUT

QUERY1                                  
id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   SIMPLE  GROUPMAP    ALL                 5640    Using where; Using temporary; Using filesort
1   SIMPLE  ref_tag ref PRIMARY PRIMARY 4   lsslave01.GROUPMAP.tagid    1   Using index
1   SIMPLE  ref_taggroup    ref PRIMARY PRIMARY 4   lsslave01.GROUPMAP.containerid  3   Using index
1   SIMPLE  PRODMAP ALL                 5640    Using where; Using join buffer
1   SIMPLE  ref_product eq_ref  PRIMARY PRIMARY 4   lsslave01.PRODMAP.containerid   1   

QUERY2                                  
id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   SIMPLE  GROUPMAP    ALL                 5640    Using where; Using temporary; Using filesort
1   SIMPLE  ref_tag ref PRIMARY PRIMARY 4   lsslave01.GROUPMAP.tagid    1   Using index
1   SIMPLE  ref_taggroup    ref PRIMARY PRIMARY 4   lsslave01.GROUPMAP.containerid  3   Using index
1   SIMPLE  PRODFAMMAP  ALL                 5640    Using where; Using join buffer
1   SIMPLE  ref_product ref FixtureType FixtureType 5   lsslave01.PRODFAMMAP.containerid    39  Using where

QUERY3                                  
id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   SIMPLE  GROUPMAP    ALL                 5640    Using where; Using temporary; Using filesort
1   SIMPLE  ref_tag ref PRIMARY PRIMARY 4   lsslave01.GROUPMAP.tagid    1   Using index
1   SIMPLE  ref_taggroup    ref PRIMARY PRIMARY 4   lsslave01.GROUPMAP.containerid  3   Using index
1   SIMPLE  PRODMAP ALL                 5640    Using join buffer
1   SIMPLE  PRODFAMMAP  ALL                 5640    Using where; Using join buffer
1   SIMPLE  ref_product eq_ref  PRIMARY,FixtureType PRIMARY 4   lsslave01.PRODMAP.containerid   1   Using where

enter code here

One more update for anyone who is interested: I finally let the third query above run to completion. It took right around 1000 seconds. Dividing this time by the time it takes each of the queries (1 or 2) to run, we get a number around 6000...which is very close to the size of the ref_tagmap table that we're using in our dev environment (much larger in production). So, it looks like we're running one query against each record in that table...but I still can't see why.

Any help would be much appreciated...and I mean seriously, seriously appreciated.

OldSchool
  • 41
  • 4
  • What do you get when you put `EXPLAIN` before each query? Can you show us the `CREATE TABLE` output? Why are you mixing join syntax? – Konerak Mar 03 '12 at 00:18
  • I appended the explain output for each of the three queries. Which tables do you need creates for? They're pretty simple...I think the problem lies in how I'm using the "OR" clause in the third query...any guidance there? As for user of LEFT JOINS and JOINS, that's just how I ended up...I've tried multiple combinations. Of course, I'm not a sql genius. – OldSchool Mar 03 '12 at 00:40
  • Try rewriting Query3 into a Union. – ypercubeᵀᴹ Mar 03 '12 at 00:56
  • 2
    index the columns used in the join conditions – scibuff Mar 03 '12 at 01:59
  • ypercube, I've done that and it speeds things up as expected, but I can't use unions in this piece of code for other reasons as explained. – OldSchool Mar 03 '12 at 03:34
  • scibuff, I've tried indexing everything...I don't think that's the issue. I can't see why two queries that each execute in less than an eighth of a second would then take 5 minutes or more when combined with an OR in the third one. Does anyone know why that would happen? – OldSchool Mar 03 '12 at 03:36
  • Did you ever get this resolved, or is it still open? – DRapp Apr 23 '12 at 11:57

1 Answers1

0

This is less an "answer" than a couple of observations/suggestions.

First, I'm curious whether you could GROUP BY on an integer ID instead of the tag name? I'd change the ref_TagMap.containertype field to hold tinyint enumerated values representing the three possible values of TAGGROUP, PROD and PRODFAM. An indexed tinyint field should be slightly faster than an index of string values. It probably won't make much difference though because it's the second conditional in the join clause and there isn't that much spread in the indexed values anyway.

Next is the observation/reminder that when the first half of an OR statement evaluates to FALSE often, then you're making MySQL evaluate both halves of the conditional every time. So you want to put the condition most likely to evaluate to TRUE first (aka prior to the OR).

I doubt either of those two issues are your real problem... though the issue in the second paragraph may play a part. Seems like the quickest way to a performant version of query 3 may be to simply populate a temp table with the results from the first two queries and pull from that temp table to get the results you're looking for from the third. Perhaps in doing so you'll discover why that third query is so slow.

codemonkey
  • 2,661
  • 2
  • 24
  • 34