14

So I have this query written by someone else that I'm trying to refactor, which pulls some features/materials for an item(shoes, generally).

There are a lot of products, and thus a whole lot of joining table-entries, but only a few few features that are available for them. I'm thinking that there has to be a way to cut down the need to touch upon the "big" list of items, to get these features, and I have heard that distinct is to be avoided, but I don't have a statement that can replace the "distinct" options here.

According to my logs, I'm getting slow result times:

Query_time: 7 Lock_time: 0 Rows_sent: 32 Rows_examined: 5362862

Query_time: 8 Lock_time: 0 Rows_sent: 22 Rows_examined: 6581994

As the message says, sometimes it is taking 7 or 8 seconds and sometimes or every time it is querying over 5 million rows.

That may be due to other load occurring at the same time, because here are the selects run on the database directly from the mysql command line:

mysql> SELECT DISTINCT features.FeatureId, features.Name
       FROM features, itemsfeatures, items
       WHERE items.FlagStatus != 'U'
         AND items.TypeId = '13'
         AND features.Type = 'Material'
         AND features.FeatureId = itemsfeatures.FeatureId
       ORDER BY features.Name;
+-----------+--------------------+
| FeatureId | Name               |
+-----------+--------------------+
|        40 | Alligator          |
|        41 | Burnished Calfskin |
|        42 | Calfskin           |
|        59 | Canvas             |
|        43 | Chromexcel         |
|        44 | Cordovan           |
|        57 | Cotton             |
|        45 | Crocodile          |
|        58 | Deerskin           |
|        61 | Eel                |
|        46 | Italian Leather    |
|        47 | Lizard             |
|        48 | Nappa              |
|        49 | NuBuck             |
|        50 | Ostrich            |
|        51 | Patent Leather     |
|        60 | Rubber             |
|        52 | Sharkskin          |
|        53 | Silk               |
|        54 | Suede              |
|        56 | Veal               |
|        55 | Woven              |
+-----------+--------------------+
22 rows in set (0.00 sec)

mysql> select count(*) from features;
+----------+
| count(*) |
+----------+
|      122 |
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from itemsfeatures;
+----------+
| count(*) |
+----------+
|    38569 |
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from items;
+----------+
| count(*) |
+----------+
|     8656 |
+----------+
1 row in set (0.00 sec)

explain SELECT DISTINCT features.FeatureId, features.Name  FROM features, itemsfeatures, items    WHERE items.FlagStatus != 'U'  AND items.TypeId = '13'  AND features.Type = 'Material' AND features.FeatureId = itemsfeatures.FeatureId  ORDER BY features.Name;
+----+-------------+---------------+------+-------------------+-----------+---------+---------------------------------+------+----------------------------------------------+
| id | select_type | table         | type | possible_keys     | key       | key_len | ref                             | rows | Extra                                        |
+----+-------------+---------------+------+-------------------+-----------+---------+---------------------------------+------+----------------------------------------------+
|  1 | SIMPLE      | features      | ref  | PRIMARY,Type      | Type      | 33      | const                           |   21 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | itemsfeatures | ref  | FeatureId         | FeatureId | 4       | sherman_live.features.FeatureId |  324 | Using index; Distinct                        |
|  1 | SIMPLE      | items         | ALL  | TypeId,FlagStatus | NULL      | NULL    | NULL                            | 8656 | Using where; Distinct; Using join buffer     |
+----+-------------+---------------+------+-------------------+-----------+---------+---------------------------------+------+----------------------------------------------+
3 rows in set (0.04 sec)

Edit:

Here is sample results without the distinct, (but with a limit, since otherwise it just hangs) for comparison:

SELECT features.FeatureId, features.Name        FROM features, itemsfeatures, items        WHERE items.FlagStatus != 'U'          AND items.TypeId = '13'          AND features.Type = 'Material'          AND features.FeatureId = itemsfeatures.FeatureId        ORDER BY features.Name limit 10;
+-----------+-----------+
| FeatureId | Name      |
+-----------+-----------+
|        40 | Alligator |
|        40 | Alligator |
|        40 | Alligator |
|        40 | Alligator |
|        40 | Alligator |
|        40 | Alligator |
|        40 | Alligator |
|        40 | Alligator |
|        40 | Alligator |
|        40 | Alligator |
+-----------+-----------+
10 rows in set (23.30 sec)

here's using a group by instead of a select distinct:

SELECT features.FeatureId, features.Name        FROM features, itemsfeatures, items        WHERE items.FlagStatus != 'U'          AND items.TypeId = '13'          AND features.Type = 'Material'          AND features.FeatureId = itemsfeatures.FeatureId        group by features.name ORDER BY features.Name;
+-----------+--------------------+
| FeatureId | Name               |
+-----------+--------------------+
|        40 | Alligator          |
|        41 | Burnished Calfskin |
|        42 | Calfskin           |
|        59 | Canvas             |
|        43 | Chromexcel         |
|        44 | Cordovan           |
|        57 | Cotton             |
|        45 | Crocodile          |
|        58 | Deerskin           |
|        61 | Eel                |
|        46 | Italian Leather    |
|        47 | Lizard             |
|        48 | Nappa              |
|        49 | NuBuck             |
|        50 | Ostrich            |
|        51 | Patent Leather     |
|        60 | Rubber             |
|        52 | Sharkskin          |
|        53 | Silk               |
|        54 | Suede              |
|        56 | Veal               |
|        55 | Woven              |
+-----------+--------------------+
22 rows in set (13.28 sec)

Edit: Added a bounty

...Because I'm trying to understand this general problem, how to replace bad select distinct queries in general, in addition to the slowness that this query specifically tends to cause.

I'm wondering whether the replacement for a select distinct is generally a group by (although in this case that isn't a comprehensive solution since it's still slow)?

Community
  • 1
  • 1
Kzqai
  • 22,588
  • 25
  • 105
  • 137
  • 1
    These queries are taking practically no time to execute. Why are you worrying? – Mark Ransom May 27 '11 at 19:19
  • 1
    Why full outer join on itemFeatures to items? – dotjoe May 27 '11 at 19:22
  • @Mark Ransom - I'm guessing it's because of "Using temporary; Using filesort" on the first explain. But if the table won't grow much beyond current size, then not really an issue I guess. – AJ. May 27 '11 at 19:26
  • 1
    @Mark Ransom Ah, sorry, should have added that it's resulting in slow queries on the live server (though it's relatively fast on a dev copy, since that copy isn't under load). Edited in the "this query taking 7 seconds" log info. – Kzqai May 27 '11 at 19:41
  • Just re-ran the queries on the live server, but when I run them from a mysql console, they run fast and furious, so now I'm wondering what conditions cause the slowdown. – Kzqai May 27 '11 at 19:48
  • Can you explain in plain english what your query is trying to do? As it stands, it is very hard to understand the purpose. – Phil Sandler Jul 14 '11 at 14:30
  • 1
    Also, try removing the distinct and see how many results you get back. That may give you a better understanding of why your query is going wrong. – Phil Sandler Jul 14 '11 at 14:35
  • Edited the question to reflect what happens when you drop the distinct. – Kzqai Jul 14 '11 at 16:29

3 Answers3

10

Looks like you're missing a JOIN condition linking itemsfeatures to items. It's more obvious if you write the query using explicit JOIN operations.

SELECT DISTINCT f.FeatureId, f.Name  
    FROM features f
        INNER JOIN itemsfeatures ifx
            ON f.FeatureID = ifx.FeatureID
        INNER JOIN items i
            ON ifx.ItemID = i.ItemID /* This is the part you're missing */
    WHERE i.FlagStatus != 'U'  
        AND i.TypeId = '13'  
        AND f.Type = 'Material' 
    ORDER BY f.Name;
Joe Stefanelli
  • 132,803
  • 19
  • 237
  • 235
  • Hmmm, that seems to give different results... checking into it further. – Kzqai May 27 '11 at 19:38
  • @Tchalvak: Yes, I'd expect the results to be different. – Joe Stefanelli May 27 '11 at 19:40
  • I mean a completely different resultset (whether it's an inner join or outer). – Kzqai May 27 '11 at 19:45
  • 10 results instead of the 22 of the original. Not sure why that is. – Kzqai May 27 '11 at 19:56
  • Would whoever anonymously downvoted this today care to explain their objection? – Joe Stefanelli Jul 14 '11 at 14:22
  • was me, doesn't return the same results as the select distinct, so I can't rely on it. – Kzqai Jul 14 '11 at 14:24
  • 2
    @Tchalvak: We discussed this on May 27. This query should not return the same results as your original because I added what I believed was a missing join condition. – Joe Stefanelli Jul 14 '11 at 14:30
  • 4
    Tchalvak: look at *why* the results are different. Which results are "missing" and why are they missing? Joe's query makes perfect sense, yours is (at best) confusing. – Phil Sandler Jul 14 '11 at 14:34
  • 1
    @Tchalvak It returns different results, but this query is good. Your old query has a bug. Don't you think so? – Karolis Jul 14 '11 at 14:38
  • Oh trust me, the original is not my query either, so I'm confused by it too. – Kzqai Jul 14 '11 at 14:48
  • I'm trying to understand why the added join condition is necessary, considering that the original query "works" and returns a certain resultset... – Kzqai Jul 14 '11 at 14:55
  • 1
    Without that added condition, every row from items that has a typeid of '13' is being combined with every row from features that has a feature type of 'material' without any regard to the fact that these two rows may be talking about different items. – EvilAmarant7x Jul 14 '11 at 15:11
  • 1
    @Joe +1 for a good query but your aliasing a table "if" makes me cringe :D – Matthew Jul 14 '11 at 18:11
  • @Matthew PK: I know, I know. :-) It's such a habit of mine to alias tables with the first letter of each word that it didn't even occur to me at first that I was using a keyword. – Joe Stefanelli Jul 14 '11 at 18:14
  • Not worthy of a whole answer, but be sure that if you are worried about speed (and not disk space) to have indexes on the fields you are joining on and the fields you are distinct-ing – Evan Jul 19 '11 at 23:54
6

As Joe states, there does seem to be a missing join condition

This is your current query

SELECT DISTINCT 
        features.FeatureId, 
        features.Name
FROM    features, 
        itemsfeatures, 
        items
WHERE   items.FlagStatus != 'U'
        AND items.TypeId = '13'
        AND features.Type = 'Material'
        AND features.FeatureId = itemsfeatures.FeatureId
ORDER BY features.Name

This is your query with explicit joins

SELECT DISTINCT 
        features.FeatureId, 
        features.Name
FROM    features INNER JOIN
        itemsfeatures on features.FeatureId = itemsfeatures.FeatureId CROSS JOIN
        items
WHERE   items.FlagStatus != 'U'
        AND items.TypeId = '13'
        AND features.Type = 'Material'
ORDER BY features.Name

I can't be 100% sure but it looks like removing any reference to the items table should give you the exact same result

SELECT DISTINCT 
        features.FeatureId, 
        features.Name
FROM    features, 
        itemsfeatures
WHERE   features.Type = 'Material'
        AND features.FeatureId = itemsfeatures.FeatureId
ORDER BY features.Name

The way the query is written, it seems it wants a list of materials for items with a typeID of 13 and Flagstatus <> U. If that is the case the result being returned by the orignial query are wrong. It is simply returning all materials for all items.

So as Joe states add the inner join for items and use explicit joins as they make the meaning clearer. I prefer to use group by but distinct will do the same thing.

SELECT  features.FeatureId, 
        features.Name
FROM    features INNER JOIN
        itemsfeatures on features.FeatureId = itemsfeatures.FeatureId INNER JOIN
        items on itemsfeatures.ItemID = items.ItemID
WHERE   items.FlagStatus != 'U'
        AND items.TypeId = '13'
        AND features.Type = 'Material'
GROUP BY features.FeatureId, 
        features.Name
ORDER BY features.Name

With that now sorted, now comes the speed. Create the following three indexes.

FeaturesIndex(Type,FeatureID,Name)
ItemsFeaturesIndex(FeatureId)
ItemsIndex(TypeId,FlagStatus,ItemID)

This should speed up both your current query and the one I listed.

John Petrak
  • 2,898
  • 20
  • 31
  • Yeah, I'll check these options out. I've actually started by using a group-by and I'm going to deploy that and see what speed difference it makes, and then refactor more from there. Is your query fully-joined substantially different from joe's (apart from the cross join one). – Kzqai Jul 18 '11 at 13:53
  • No both queries are pretty much the same apart from the join, mine just follows the ANSI standard for the joins. Both will work and it is more of what you prefer to use. I know a few Oracle guys and some from the old school hate using explicit joins, but others like them. I'm mostly in the MSSQL space and prefer to use them. – John Petrak Jul 19 '11 at 00:56
  • Since the actual code is dynamically generated with php (with ugly additions happening in the middle based on certain criteria), I'm going to have to mix & match approaches, and just using `group by` has already sped up the result so it's not the beast that it was. – Kzqai Jul 20 '11 at 15:33
3

I am almost confident that Joe's answer is correct. But if you think that Joe is wrong and you want to get the same results as your original query, but faster, then use this query:

SELECT DISTINCT features.FeatureId, features.Name
    FROM features, itemsfeatures
    WHERE features.Type = 'Material'
        AND features.FeatureId = itemsfeatures.FeatureId
    ORDER BY features.Name;
Karolis
  • 9,396
  • 29
  • 38
  • That is faster indeed, and may be what I need, short of just deleting all the code that requires that original horrible query from the codebase. – Kzqai Jul 14 '11 at 15:00
  • You've eliminated the conditions on the `items` table from the original query: `items.FlagStatus != 'U' AND items.TypeId = '13'` @Tchalvak: Are those conditions not relevant to your query? – Joe Stefanelli Jul 14 '11 at 15:03
  • 1
    @Joe Yeah, I have eliminated them because these conditions have no impact for the final result if no joining condition for table **items** is NOT provided. – Karolis Jul 14 '11 at 15:11
  • 2
    @Karolis: Understood. I'm trying to help @Tchalvak see that there is probably a reason why those conditions were included in the original query and, although he's getting results, I don't think he's getting the *right* results without the additional join to make those conditions on the `items` table valid. – Joe Stefanelli Jul 14 '11 at 15:16
  • That's fair enough. I'll try both of these alternatives (either that or use the delete key liberally) and see if one or the other is more apt for the purpose. – Kzqai Jul 14 '11 at 15:26