3

I have two tables named seller and item. They are connected through a third table (seller_item) in an n-to-m (many-to-many) foreign key relation.

Now I am trying to satisfy the requirement: "As a seller, I want a list of my competitors with a count of the items that I am selling and they are selling as well." In other words: a list of all sellers with the count of overlapping items relative to one specific seller, sorted by that count and limited. But the query uses a temporary table and a filesort, which is very slow. EXPLAIN says:

Using where; Using index; Using temporary; Using filesort

How can I speed this up?

Here is the query:

SELECT
          COUNT(*) AS itemCount,
          s.sellerName
        FROM
          seller s,
          seller_item si
        WHERE
          si.itemId IN 
           (SELECT itemId FROM seller_item WHERE sellerId = 4711)
        AND
          si.sellerId=s.id
        GROUP BY
          sellerName
        ORDER BY
          itemCount DESC
        LIMIT 50;

the table defs:

CREATE TABLE `seller` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `sellerName` varchar(50) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `unique_index` (`sellerName`)
) ENGINE=InnoDB

contains about 200,000 rows

--

CREATE TABLE `item` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `itemName` varchar(20) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `unique_index` (`itemName`)
) ENGINE=InnoDB

contains about 100,000,000 rows

--

CREATE TABLE `seller_item` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `sellerId` bigint(20) unsigned NOT NULL,
  `itemId` bigint(20) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `sellerId` (`sellerId`,`itemId`),
  KEY `item_id` (`itemId`),
  CONSTRAINT `fk_1` FOREIGN KEY (`sellerId`) REFERENCES `seller` (`id`) ON DELETE CASCADE ON UPDATE NO ACTION,
  CONSTRAINT `fk_2` FOREIGN KEY (`itemId`) REFERENCES `item` (`id`) ON DELETE CASCADE ON UPDATE NO ACTION
) ENGINE=InnoDB

contains about 170,000,000 rows

The database is Percona Server (MySQL) 5.6.

Output of EXPLAIN:

+----+-------------+-------------+--------+----------------------+--------------+---------+---------------------+------+----------------------------------------------+
| id | select_type | table       | type   | possible_keys        | key          | key_len | ref                 | rows | Extra                                        |
+----+-------------+-------------+--------+----------------------+--------------+---------+---------------------+------+----------------------------------------------+
|  1 | SIMPLE      | s           | index  | PRIMARY,unique_index | unique_index | 152     | NULL                |    1 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | si          | ref    | sellerId,item_id     | sellerId     | 8       | tmp.s.id            |    1 | Using index                                  |
|  1 | SIMPLE      | seller_item | eq_ref | sellerId,item_id     | sellerId     | 16      | const,tmp.si.itemId |    1 | Using where; Using index                     |
+----+-------------+-------------+--------+----------------------+--------------+---------+---------------------+------+----------------------------------------------+
mostart
  • can you provide the full explain plan please? need current keys that it is using to traverse the data... – BK435 Apr 13 '16 at 20:26
  • why do you have `id` in `seller_item`? – Quassnoi Apr 13 '16 at 20:37
  • It is a bridge table between the seller and items...it needs its own surrogate key... – BK435 Apr 13 '16 at 20:39
  • @BK435: why does it need a surrogate key? – Quassnoi Apr 13 '16 at 20:39
  • how do you traverse the table otherwise unless you make the PK the composition of sellerID and itemID...Why not use a separate surrogate key and make the alternate key the composition of sellerID and itemID? – BK435 Apr 13 '16 at 20:41
  • @BK435: because it's an extra field in a huge table which serves no actual purpose. Making `(seller, item)` the PK is exactly what you should do. – Quassnoi Apr 13 '16 at 20:49
  • @Quassnoi: Guess you're right, and I could remove the extra auto_inc; it's just kind of a personal best practice to add an auto_inc key to every table, and I think removing it would not really speed up the query. – mostart Apr 13 '16 at 20:54
  • @mostart: right now, you have two exact copies of your `seller_item` table on disk: the first one is the actual table (three fields), the second one is the unique index on `(seller, item)` (two fields + pk reference). And the database can't even find seller by item in one lookup: first, it needs to find id by item, then seller by id. – Quassnoi Apr 13 '16 at 21:03
  • @Quassnoi: I was not disagreeing with you. I think in this case you are right. I was just curious what your reasoning was... Thank you, you have definitely taught me something today... +1 on your answer below, very interesting method... – BK435 Apr 13 '16 at 21:04

5 Answers

3

I doubt it's feasible to make a query like that run fast in real time on a database of your size, especially for sellers with lots of popular items in stock.

You should materialize it. Create a table like this

CREATE TABLE
        matches
        (
        seller INT NOT NULL,
        competitor INT NOT NULL,
        matches INT NOT NULL,
        PRIMARY KEY
                (seller, competitor)
        )

and update it in batches in a cron script:

DELETE
FROM    matches
WHERE   seller = :seller;

INSERT
INTO    matches (seller, competitor, matches)
SELECT  si.sellerId, sc.sellerId, COUNT(*) cnt
FROM    seller_item si
JOIN    seller_item sc
ON      sc.itemId = si.itemId
        AND sc.sellerId <> si.sellerId
WHERE   si.sellerId = :seller
GROUP BY
        si.sellerId, sc.sellerId
ORDER BY
        cnt DESC
LIMIT   50;
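
Reading from the materialized table is then a cheap primary-key lookup plus a sort of at most fifty rows; a sketch, using the names defined above:

SELECT  m.competitor, s.sellerName, m.matches AS itemCount
FROM    matches m
JOIN    seller s
ON      s.id = m.competitor
WHERE   m.seller = 4711
ORDER BY
        m.matches DESC
LIMIT   50;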

You also need to make (sellerId, itemId) the PRIMARY KEY on seller_item. The way it is now, finding a seller by item requires two lookups instead of one: first id by itemId using KEY (itemId), then sellerId by id using the PRIMARY KEY (id).

Quassnoi
  • This does not solve the temp table and filesort problem, so this select is still very slow. – mostart Apr 13 '16 at 21:30
  • @mostart: it won't. That particular problem is very well known and hard to solve. But you at least run the slow queries just once per day per seller, not every time a user visits the seller's page. Querying the materialized table is instant. – Quassnoi Apr 13 '16 at 21:32
  • @mostart - Temp tables and filesorts are not the _cause_ of a slow query as much as a _symptom_. – Rick James Apr 14 '16 at 16:08
2

I believe you're under a misimpression that you can eliminate the Using temporary; Using filesort steps from this query. Queries of the form

 SELECT COUNT(*), grouping_value
   FROM table
  GROUP BY grouping_value
  ORDER BY COUNT(*) 
  LIMIT n

always use a temporary in-memory result set, and always sort that result set. That's because the result set doesn't exist anywhere until the query runs, and it has to be sorted before the LIMIT clause can be satisfied.

"Filesort" is somewhat misnamed. It doesn't necessarily mean the sorting happens in a file in the file system, just that a temporary result set is being sorted. If that result set is massive, the sort can spill out of RAM into the file system, but it doesn't have to. Please read this: https://www.percona.com/blog/2009/03/05/what-does-using-filesort-mean-in-mysql/ Don't be distracted by the Using filesort item in your EXPLAIN results.

One of the tricks to getting better performance from this sort of query is to minimize the size of the sorted results. You've already filtered them down to the stuff you want; that's good.

But, you can still arrange to sort less stuff, by sorting just the seller.id and the count, then joining the (longer) sellerName in after you know the exact fifty rows you need. That also has the benefit of letting you do your aggregating with just the seller_item table, rather than with the resultset that comes from joining the two.

Here's what I mean. This subquery generates the list of fifty sellerId values you need. All it has to sort is the count and sellerId. That's faster than sorting the count and sellerName because there's less data, and fixed-length data, to shuffle around in the sort operation.

SELECT COUNT(*) AS itemCount,
       sellerId
  FROM seller_item
 WHERE itemId IN
        (SELECT itemId FROM seller_item WHERE sellerId = 4711)
 GROUP BY sellerId
 ORDER BY COUNT(*) DESC
 LIMIT 50

Notice that this sorts a big result set, then discards most of it. It gives you the exact fifty seller id values you need.

You can make this even faster by filtering out more rows by adding HAVING COUNT(*) > 1 right after your GROUP BY clause, but that changes the meaning of your query and may not meet your business requirements.
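
For illustration, here is the same subquery with that filter applied (only the HAVING line is new):

SELECT COUNT(*) AS itemCount,
       sellerId
  FROM seller_item
 WHERE itemId IN
        (SELECT itemId FROM seller_item WHERE sellerId = 4711)
 GROUP BY sellerId
HAVING COUNT(*) > 1
 ORDER BY COUNT(*) DESC
 LIMIT 50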

Once you have those fifty items, you can retrieve the seller names. The whole query looks like this:

SELECT s.sellerName, c.itemCount
  FROM seller s
  JOIN (
         SELECT COUNT(*) AS itemCount, sellerId
           FROM seller_item
          WHERE itemId IN
                (SELECT itemId FROM seller_item WHERE sellerId = 4711)
          GROUP BY sellerId
          ORDER BY COUNT(*) DESC
          LIMIT 50
       ) c ON c.sellerId = s.id
 ORDER BY c.itemCount DESC

Your indexing effort should be spent trying to make the inner queries fast. The outer query will be fast no matter what; it's only handling fifty rows, and using an indexed id value to look up other values.

The inmost query is SELECT itemId FROM seller_item WHERE sellerId = 4711. This will benefit greatly from your existing (sellerId, itemId) compound index: it can random-access and then scan that index, which is very quick.

The SELECT COUNT(*)... query will benefit from a (itemId, sellerId) compound index. That part of your query is the hard and slow part, but still, this index will help.

Look, others have mentioned this, and so will I. Having both a unique composite key (sellerId, itemId) and a primary key id on that seller_item table is, with respect, incredibly wasteful.

  • It makes your updates and inserts slower.
  • It means your table is organized as a tree based on the meaningless id rather than the meaningful value pair.

If you make one of the two indexes I mentioned the primary key, and create the other one without making it unique, you'll have a much more efficient table. These many-to-many join tables don't need, and should not have, surrogate keys.

O. Jones
2

Reformulation

I think this is what you really wanted:

SELECT  si2.sellerId, COUNT(DISTINCT si2.itemId) AS itemCount
    FROM  seller_item si1
    JOIN  seller_item si2 ON si2.itemId = si1.itemId
    WHERE  si1.sellerId = 4711
    GROUP BY  si2.sellerId
    ORDER BY  itemCount DESC
    LIMIT  50;

(Note: DISTINCT is probably unnecessary.)

In words: For seller #4711, find the items he sells, then find which sellers are selling nearly the same set of items. (I did not try to filter out #4711 from the result set.)
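
If you do want to filter him out, one extra predicate on the self-join does it; a sketch of the same query with that one change:

SELECT  si2.sellerId, COUNT(DISTINCT si2.itemId) AS itemCount
    FROM  seller_item si1
    JOIN  seller_item si2 ON si2.itemId = si1.itemId
                         AND si2.sellerId <> si1.sellerId  -- excludes #4711 itself
    WHERE  si1.sellerId = 4711
    GROUP BY  si2.sellerId
    ORDER BY  itemCount DESC
    LIMIT  50;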

More efficient N:M

But there is still an inefficiency. Let's dissect your many-to-many mapping table (seller_item).

  • It has an id which is probably not used for anything. Get rid of it.
  • Then promote UNIQUE(sellerId, itemId) to PRIMARY KEY(sellerId, itemId).
  • Now change INDEX(itemId) to INDEX(itemId, sellerId) so that the last stage of the query can be "Using index". (A combined ALTER sketch follows after this list.)
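
Taken together, the three bullets amount to a single table rebuild. A sketch only; on a 170M-row table you would likely run it via pt-online-schema-change, and MySQL may insist the FOREIGN KEYs be dropped first while the indexes are swapped:

ALTER TABLE seller_item
    DROP PRIMARY KEY,                      -- drop the surrogate clustering
    DROP COLUMN id,                        -- the unused auto_increment column
    DROP KEY `sellerId`,                   -- superseded by the new PRIMARY KEY
    ADD PRIMARY KEY (sellerId, itemId),    -- rows now clustered by seller
    DROP KEY `item_id`,
    ADD KEY `item_id` (itemId, sellerId);  -- covering index for item -> seller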

Blog discussing that further.

You have a very large dataset, and you have debugged your app; consider removing the FOREIGN KEYs, since they are somewhat costly.
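
If you do remove them, the constraint names from the table definition make it straightforward (a sketch):

ALTER TABLE seller_item
    DROP FOREIGN KEY `fk_1`,
    DROP FOREIGN KEY `fk_2`;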

Getting sellerName

It may be possible to JOIN to seller to get sellerName. But try it with just sellerId first. Then add the name. Verify that the count does not inflate (that often happens) and that the query does not slow down.

If either thing goes wrong, then do

SELECT s.sellerName, x.itemCount
    FROM ( .. the above query .. ) AS x
    JOIN seller AS s  ON s.id = x.sellerId;

(Optionally you could add ORDER BY sellerName.)

Rick James
  • +10. Adding the covering index on (itemId, sellerId), and putting off the join to seller until after limiting to 50 rows. There's no getting around the "Using filesort" operation without a precalculated (and presorted) summary available. – spencer7593 Apr 14 '16 at 06:01
  • A "filesort" (which might actually happen in RAM) is necessary when doing `GROUP BY one_thing ORDER BY something_else`. – Rick James Apr 14 '16 at 16:02
  • @RickJames: why `COUNT(DISTINCT item)`? – Quassnoi Apr 14 '16 at 16:18
  • Given a guarantee that (sellerId,itemId) tuple is unique in seller_item table, the DISTINCT keyword doesn't change the result. (It would potentially give a different result if that tuple wasn't unique.) – spencer7593 Apr 14 '16 at 16:26
0

I'm not sure how fast this would be on your database, but I'd write the query like this:

    select * from (
        select seller.sellerName,
        count(otherSellersItems.itemId) itemCount from (
            select sellerId, itemId from seller_item where sellerId != 4711
        ) otherSellersItems
        inner join ( 
            select itemId from seller_item where sellerId = 4711
        ) thisSellersItems
        on otherSellersItems.itemId = thisSellersItems.itemId
        inner join seller
        on otherSellersItems.sellerId = seller.id
        group by seller.sellerName
    ) itemsSoldByOtherSellers
    order by itemCount desc
    limit 50 ;
Keith John Hutchison
0

Since we are limiting the (potentially large) result set to at most 50 rows, I would put off getting the sellerName until after we have the counts, so we only need to fetch 50 seller names.

First, we get the itemCount by sellerId:

SELECT so.sellerId
     , COUNT(*) AS itemCount
  FROM seller_item si
  JOIN seller_item so
    ON so.itemId = si.itemId
 WHERE si.sellerId = 4711
 GROUP BY so.sellerId
 ORDER BY COUNT(*) DESC, so.sellerId DESC
 LIMIT 50

For improved performance, I would make a suitable covering index available for the join to so, e.g.

CREATE UNIQUE INDEX seller_item_UX2 ON seller_item (itemId, sellerId)

By using a "covering index", MySQL can satisfy the query entirely from the index pages, without a need to visit the pages in the underlying table.

Once the new index is created, I would drop the index on the singleton itemId column, since that index is now redundant. (Any query that could make effective use of that index will be able to make effective use of the composite index, which has itemId as its leading column.)
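
A sketch of that cleanup (fk_2 only needs some index whose leading column is itemId, and seller_item_UX2 now provides one):

DROP INDEX `item_id` ON seller_item;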

There's no getting around a "Using filesort" operation. MySQL has to evaluate the COUNT() aggregate on each row, before it can perform a sort. There's no way (given the current schema) for MySQL to return the rows in order using an index to avoid a sort operation.

Once we have that set of (at most) fifty rows, we can get the sellerName.

To get the sellerName, we could use either a correlated subquery in the SELECT list or a join operation.

1) Using a correlated subquery in SELECT list, e.g.

SELECT so.sellerId
     , ( SELECT s.sellerName
           FROM seller s
          WHERE s.id = so.sellerId
          ORDER BY s.id, s.sellerName
          LIMIT 1
       ) AS sellerName
     , COUNT(*) AS itemCount
  FROM seller_item si
  JOIN seller_item so
    ON so.itemId = si.itemId
 WHERE si.sellerId = 4711
 GROUP BY so.sellerId
 ORDER BY COUNT(*) DESC, so.sellerId DESC
 LIMIT 50

(We know that subquery will be executed (at most) fifty times, once for each row returned by the outer query. Fifty executions (with a suitable index available) isn't that bad, at least compared to 50,000 executions.)

Or, 2) using a join operation, e.g.

SELECT c.sellerId
     , s.sellerName
     , c.itemCount
  FROM (
         SELECT so.sellerId
              , COUNT(*) AS itemCount
           FROM seller_item si
           JOIN seller_item so
             ON so.itemId = si.itemId
          WHERE si.sellerId = 4711
          GROUP BY so.sellerId
          ORDER BY COUNT(*) DESC, so.sellerId DESC
          LIMIT 50
       ) c
  JOIN seller s
    ON s.id = c.sellerId
 ORDER BY c.itemCount DESC, c.sellerId DESC

(Again, we know that the inline view c will return (at most) fifty rows, and getting fifty sellerName values (using a suitable index) should be fast.)


SUMMARY TABLE

If we denormalized the implementation and added a summary table containing itemId (as the primary key) and a "count" of the number of sellers of that itemId, our query could take advantage of that.

As an illustration of what that might look like:

CREATE TABLE item_seller_count
( itemId BIGINT UNSIGNED NOT NULL PRIMARY KEY
, sellerCount BIGINT NOT NULL
) ENGINE=InnoDB
;

INSERT INTO item_seller_count (itemId, sellerCount)
SELECT d.itemId
     , COUNT(*)
  FROM seller_item d
 GROUP BY d.itemId
 ORDER BY d.itemId
;

CREATE UNIQUE INDEX item_seller_count_IX1
  ON item_seller_count (sellerCount, itemId)
;

The new summary table will become "out of sync" when rows are inserted/updated/deleted from the seller_item table.

And populating this table would take resources. But having this available would speed up queries of the type we're working on.
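
One way to keep it in sync on the insert path, per the INSERT ... ON DUPLICATE KEY UPDATE idea from the comments below (a sketch; @newItemId is a placeholder for the item being added, and this assumes all seller_item writes go through one code path):

INSERT INTO item_seller_count (itemId, sellerCount)
VALUES (@newItemId, 1)
ON DUPLICATE KEY UPDATE sellerCount = sellerCount + 1;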

spencer7593
  • A summary table can be updated at the same time as the "Fact" table is augmented. `INSERT .. ON DUPLICATE KEY UPDATE ..` might be optimal here. – Rick James Apr 14 '16 at 16:04
  • @RickJames: Yes, maintenance of the summary table could be automated. My point was that having the aggregate results pre-calculated and indexed would make for a very fast query. But that requires a schema change. It might be possible to do the maintenance of the summary table in real time "behind the scenes" using INSERT/UPDATE/DELETE triggers. Or the module/library that stewards the seller_item table could do it. (After reading the specification again, I'm still not sure I understand it. I really need example data and the expected query output to understand what the query is supposed to return.) – spencer7593 Apr 14 '16 at 16:19