
Can anyone help me optimise this query? It's taking a while to run.

The su_pref table has almost 900,000 rows:

SELECT p.mykey,
       p.merchant_name,
       m.merchant_url, 
       p.name,
       p.description,
       p.image_url, 
       p.deep_link,
       p.rrp_price, 
       p.display_price, 
       c.category,
       p.su_parent_name 
FROM su_pref p
        INNER JOIN su_categoryrefs cr ON p.mykey = cr.mykey 
        INNER JOIN su_categories c ON cr.id = c.id 
        INNER JOIN su_merchants m ON p.merchant_id=m.id 
WHERE 
     cr.id =36 
 ORDER BY p.date_created DESC LIMIT 0,20 

And this is what I get when I use EXPLAIN:

(screenshot of EXPLAIN output)

I'm really struggling to understand how to use EXPLAIN to optimise a query, so any help would be greatly appreciated.

Here are the table definitions. The su_pref table is quite big, so I stripped some fields out:

CREATE TABLE `su_pref` (
  `mykey` varchar(50) NOT NULL,
  `merchant_id` smallint(3) default NULL,
  `merchant_name` varchar(50) NOT NULL default '',
  `brand` varchar(50) default NULL,
  `merchantproductkey` varchar(50) default NULL,
  `upc` varchar(15) default NULL,
  `name` varchar(255) NOT NULL default '',
  `description` varchar(2500) NOT NULL default '',
  `short_description` varchar(500) default NULL,
  `thumb_url` varchar(500) default NULL,
  `image_url` varchar(500) default NULL,
  `deep_link` varchar(1000) default NULL,
  `merchant_link` varchar(255) default NULL,
  `rrp_price` decimal(11,2) default NULL,
  `display_price` decimal(11,2) default NULL,
  `delivery_cost` decimal(11,2) default NULL,
  `price_range` tinyint(1) default NULL,
  `keywords` varchar(500) default NULL,
  `su_parent_name` enum('Women','Men','') NOT NULL,
  `date_created` date default NULL,
  `date_modified` datetime default NULL,
  `wp_featured` varchar(1) default NULL,
  `hp_featured` varchar(1) default NULL,
  `published` varchar(1) default NULL,
  `in_todays_feed` varchar(1) default NULL,
  PRIMARY KEY  (`mykey`),
  KEY `merchant_id` (`merchant_id`),
  KEY `date_created` (`date_created`),
  FULLTEXT KEY `product_search` 
                (`name`,`description`,`short_description`,
                 `keywords`, `product_type`,`colour`,`style`,
                 `material`,`datafeed_category_name`,
                 `datafeed_subcategory_name`,
                 `brand`,`merchant_name`),
  FULLTEXT KEY `name` (`name`,`datafeed_category_name`, 
                       `datafeed_subcategory_name`,`product_type`,`keywords`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1

-

CREATE TABLE `su_categories` (
  `id` mediumint(9) NOT NULL,
  `category` varchar(100) NOT NULL,
  `parent_id` tinyint(4) NOT NULL,
  `update_query` varchar(3000) default NULL,
  `url` varchar(150) default NULL,
  `last_update` datetime default NULL,
  PRIMARY KEY  (`id`,`category`,`parent_id`),
  KEY `parent_id` (`parent_id`),
  KEY `category_id` (`id`),
  FULLTEXT KEY `category_name` (`category`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1

-

CREATE TABLE `su_categoryrefs` (
  `mykey` varchar(255) NOT NULL,
  `id` smallint(4) NOT NULL,
  PRIMARY KEY  (`mykey`,`id`),
  KEY `id` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1

-

CREATE TABLE `su_merchants` (
  `id` mediumint(9) NOT NULL,
  `merchant_name` varchar(40) NOT NULL,
  `merchant_url` varchar(40) default NULL,
  `merchant_website` varchar(40) default NULL,
  `merchant_description` varchar(2000) default NULL,
  `merchant_featured` varchar(1) default NULL,
  PRIMARY KEY  (`id`),
  UNIQUE KEY `merchant_name` (`merchant_name`)
) ENGINE=MyISAM AUTO_INCREMENT=128 DEFAULT CHARSET=latin1

Thank you in advance

Amara
  • I can't see the screenshot, but do you have indexes on some of those fields? – JonH Jul 31 '12 at 18:47
  • Hi, Yes I have indexes on all of the fields I am joining and on date_created – Amara Jul 31 '12 at 18:58
  • The two columns in the join: `ON cr.id = c.id` are of different datatype (one is `SMALLINT` and the other is `MEDIUMINT`). That's not good. – ypercubeᵀᴹ Jul 31 '12 at 21:37
  • @ypercube, thanks for that. I've changed it. I performed an analysis on the large table a couple of days ago and reduced the sizes of the fields based on the data. I forgot to do the same for the other tables – Amara Jul 31 '12 at 21:49
  • I see the other columns involved in the joins have the same issue. I fear this is not why the query is slow but you should still fix it. Just be careful not to delete any data (if you convert the `VARCHAR(255)` to `VARCHAR(40)`, anything that is longer than 40 characters will be lost). – ypercubeᵀᴹ Jul 31 '12 at 21:53
  • Thank you @ypercube. I dealt with most of these issues last night – Amara Aug 01 '12 at 18:30
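
To line up the mismatched join-column types flagged in the comments above, a minimal sketch of the ALTER statements might look like this (an assumption based on the table definitions above; check the existing data first, since shrinking mykey from VARCHAR(255) to VARCHAR(50) will truncate anything longer):

-- Sketch only: make each foreign-key column match the column it joins to.
ALTER TABLE su_categoryrefs
  MODIFY `mykey` varchar(50) NOT NULL,              -- matches su_pref.mykey
  MODIFY `id` mediumint(9) NOT NULL;                -- matches su_categories.id

ALTER TABLE su_pref
  MODIFY `merchant_id` mediumint(9) default NULL;   -- matches su_merchants.id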

1 Answer


It's impossible to tell which table each of those unqualified columns is being returned from, which table the "merchant_id" in the JOIN predicate comes from, or which table the date_created in the ORDER BY clause comes from. (We can make some guesses, but they are only that: guesses.)

The first step, really, is to qualify all of those column references with either the table_name, or preferably a table alias.

Also, formatting the statement so that mere mortals can comprehend it helps a great deal.

Your reformatted statement with qualified column references helps, and the inclusion of the schema definition also helps. (I will get my answer updated.)

SELECT p.mykey
     , p.merchant_name
     , merchant_url
     , `name`
     , description
     , image_url
     , deep_link
     , rrp_price
     , display_price
     , category
     , su_parent_name
 FROM su_pref p
 JOIN su_categoryrefs cr ON cr.mykey = p.mykey
 JOIN su_categories c    ON c.id     = cr.id
 JOIN su_merchants m     ON m.id     = p.merchant_id
WHERE cr.id = 36
ORDER BY date_created DESC LIMIT 0,20

As a general rule of thumb, it's desirable to have an index whose leading columns are the columns referenced in the join predicates. Typically, these are the PRIMARY KEY and the foreign keys.

In addition, if there are "covering indexes" (an index that contains all of the columns referenced in the query), you will see "Using index" in the EXPLAIN output; this means the query is being satisfied from the index without referencing the data blocks.

The EXPLAIN seems to show a "colour_id" index being referenced, but it's not at all obvious which columns are contained in that index. (The EXPLAIN PLAN output image is too hard for me to read.)
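
As a side note, a plain-text plan is much easier to read and share than a screenshot: from the mysql command-line client you just prefix the query with EXPLAIN (and ending it with \G gives vertical, one-row-per-line output). For example, with the query above (select list trimmed here for brevity):

EXPLAIN
SELECT p.mykey, p.merchant_name, m.merchant_url, c.category
  FROM su_pref p
  JOIN su_categoryrefs cr ON cr.mykey = p.mykey
  JOIN su_categories c    ON c.id     = cr.id
  JOIN su_merchants m     ON m.id     = p.merchant_id
 WHERE cr.id = 36
 ORDER BY p.date_created DESC
 LIMIT 0, 20\G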


Here are some suggestions.

Ensure you have indexes on these column combinations (a DDL sketch for the missing one follows the list):

ON su_categories (id, category)     -- you do
ON su_merchants (id, merchant_url)  -- never mind this one, it's a small table
ON su_categoryrefs (id, mykey)
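
A sketch of the DDL for that last one (the index name is just a placeholder, and, as noted in the comments below, the single-column id index becomes redundant once this exists):

ALTER TABLE su_categoryrefs
  ADD INDEX `id_mykey` (`id`, `mykey`),   -- covers WHERE cr.id = 36 and the join on mykey
  DROP INDEX `id`;                        -- redundant once (id, mykey) exists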

I'm assuming that the id column in the su_categoryrefs table is a foreign key referencing su_categories.id, and not the primary key of the table. (This looks like a relationship table that resolves a many-to-many relationship between su_categories and su_pref, but I'm just guessing.)

spencer7593
  • Hi, thank you for your answer; I've updated my question as much as I can, thanks for the advice. The colour_id reference was incorrect: I had copied another table and forgotten to rename the index. I've updated that and changed the image, but I can't seem to make the image bigger, unfortunately. As for the indexes, they are all single-column indexes apart from the mykey index on the cr table, which includes a foreign key as well. – Amara Jul 31 '12 at 20:11
  • I guess it's going to be difficult for anyone to help me without seeing the full table definitions – Amara Jul 31 '12 at 20:15
  • @Amara, the plain text output from the EXPLAIN (as given from mysql command line client) would work well. – spencer7593 Jul 31 '12 at 20:31
  • Thanks for your updated query. Do I create the 3rd index you mentioned when I already have one on (mykey,id)? – Amara Jul 31 '12 at 21:08
  • Yes, I think you need an index with id as the leading column. Your predicate is on cr.id = 36; if there's no index with id as a leading column, that's gonna be a scan of some sort (either table or index). The inclusion of the mykey column will make it a covering index, that is, allow the index to satisfy the query without any need for the query to visit the related data pages. I see you already have an index on id. If you do add an index on `(id,mykey)`, you can drop the index on `(id)`; it would be redundant. – spencer7593 Jul 31 '12 at 21:49
  • Just realised I've got rid of filesort and temporary, so problem solved. It's still a bit slow though... – Amara Jul 31 '12 at 21:52
  • @Amara: it sounds as if you are learning how to use the EXPLAIN output to do query tuning. Not every occurrence of filesort in the explain is evil, but it's often a culprit. I'm working on an alternate query that may perform better for you. – spencer7593 Jul 31 '12 at 21:55
  • yep I'm learning, it's really confusing me if I'm honest. I don't actually understand why you suggested the id,key index, but it seemed to work. It's very kind of you to help me with a better query. I really appreciate your help – Amara Jul 31 '12 at 22:08
  • Just saw your previous comment on why I should add the composite index. Missed it before. Am I right in my thinking that by adding mykey to the index, means that only mykeys with an id that matches 36 will be looked at? – Amara Jul 31 '12 at 22:20
  • @Amara: that's the idea... to restrict, as quickly as possible, the number of blocks we need to look at. What you don't want the query to do (usually) is to look at every single one of the 900,000 rows in su_pref and check the other tables to see if it's one that is needed. (But sometimes, that does turn out to be the most efficient approach.) In your case, though, you need to get ALL the rows to get them ordered, and then return only the first 20. So even though you are getting only 20 rows, the query may be a LOT more expensive. – spencer7593 Jul 31 '12 at 22:25
  • Thanks guys, but when I removed the order by clause it made no difference to the number of rows included. I wish there was a way to limit the number of rows scanned, as I only return 20 – Amara Jul 31 '12 at 22:44
  • @Amara: that is correct. the LIMIT clause is applied LAST in the execution plan; it doesn't limit the number of rows as the resultset is being prepared; it is only applied AFTER all the rows satisfying the predicates in the query are found, AFTER the HAVING clause is applied, and AFTER any sort operation. – spencer7593 Jul 31 '12 at 22:47
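
The alternate query mentioned in the comments above was never posted here. Purely as an illustration of one common pattern for this situation (a "deferred join" that finds the 20 newest keys first and only then pulls in the wide columns such as description), something like the following is sometimes tried; the aliases are placeholders, and whether it actually runs faster is an assumption to verify against your own data:

SELECT p.mykey, p.merchant_name, m.merchant_url, p.name, p.description,
       p.image_url, p.deep_link, p.rrp_price, p.display_price,
       c.category, p.su_parent_name
  FROM ( SELECT cr2.mykey                      -- find just the 20 newest keys in category 36
           FROM su_categoryrefs cr2
           JOIN su_pref p2 ON p2.mykey = cr2.mykey
          WHERE cr2.id = 36
          ORDER BY p2.date_created DESC
          LIMIT 0, 20
       ) AS top20
  JOIN su_pref p       ON p.mykey = top20.mykey
  JOIN su_categories c ON c.id = 36             -- same filter as the original cr.id = 36
  JOIN su_merchants m  ON m.id = p.merchant_id
 ORDER BY p.date_created DESC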