Alternative to COUNT for innodb to prevent table scan?

Question

I've managed to put together a query that works for my needs, albeit more complicated than I was hoping. But, for the size of tables the query is slower than it should be (0.17s). The reason, based on the EXPLAIN provided below, is because there is a table scan on the meta_relationships table due to it having the COUNT in the WHERE clause on an innodb engine.

Query:

SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
RIGHT JOIN meta_relationships ON (posts.post_id = meta_relationships.object_id)
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE meta.meta_name = computers AND meta_relationships.object_id 
NOT IN (SELECT meta_relationships.object_id FROM meta_relationships
        GROUP BY meta_relationships.object_id HAVING count(*) > 1)
GROUP BY meta_relationships.object_id

This particular query, selects posts which have ONLY the computers category. The purpose of count > 1 is to exclude posts that contain computers/hardware, computers/software, etc. The more categories that are selected, the higher the count would be.

Ideally, I'd like to get it functioning like this:

WHERE meta.meta_name IN ('computers') AND meta_relationships.meta_order IN (0)

or

WHERE meta.meta_name IN ('computers','software') 
AND meta_relationships.meta_order IN (0,1)

etc..

But unfortunately this doesn't work, because it doesn't take into consideration that there may be a meta_relationships.meta_order = 2.

I've tried...

WHERE meta.meta_name IN ('computers')
GROUP BY meta_relationships.meta_order
HAVING meta_relationships.meta_order IN (0) AND meta_relationships.meta_order NOT IN (1)

but it doesn't return the correct amount of rows.

EXPLAIN:

id  select_type   table               type    possible_keys          key               key_len ref                                   rows   Extra   
1   PRIMARY       meta                ref     PRIMARY,idx_meta_name  idx_meta_name     602     const                                 1      Using where; Using index; Using temporary; Using filesort
1   PRIMARY       meta_data           ref     PRIMARY,idx_meta_id    idx_meta_id       8       database.meta.meta_id                 1  
1   PRIMARY       meta_relationships  ref     idx_meta_data_id       idx_meta_data_id  8       database.meta_data.meta_data_id       11     Using where
1   PRIMARY       posts               eq_ref  PRIMARY                PRIMARY           4       database.meta_relationships.object_id 1  
2   MATERIALIZED  meta_relationships  index   NULL                   idx_object_id     4       NULL                                  14679  Using index

Tables/Indexes:
meta
This table contains the category and tag names.
indexes:
PRIMARY KEY (meta_id), KEY idx_meta_name (meta_name)
meta_data
This table contains additional data about the categories and tags such as type (category or tag), description, parent, count.
indexes:
PRIMARY KEY (meta_data_id), KEY idx_meta_id (meta_id)
meta_relationships
This is a junction/lookup table. It contains a foreign key to the posts_id, a foreign key to the meta_data_id, and also contains the order of the categories.
indexes:
PRIMARY KEY (relationship_id), KEY idx_object_id (object_id), KEY idx_meta_data_id (meta_data_id)

The count allows me to only select the posts with that correct level of category. For example, the category computers has posts with only the computers category but it also has posts with computers/hardware. The count filters out posts that contain those extra categories. I hope that makes sense.
I believe the key to optimizing the query is to get away completely from doing the COUNT.
An alternative to the COUNT would possibly be using meta_relationships.meta_order or meta_data.parent instead.
The meta_relationships table will grow quickly and with the current size (~15K rows) I'm hoping to achieve an execution time in the 100th of seconds rather than the 10ths of seconds.
Since there needs to be multiple conditions in the WHERE clause for each category/tag, any answer optimized for a dynamic query is preferred.
I have created an IDE with sample data.

How can I optimize this query?

EDIT :

I was never able to find an optimal solution to this problem. It was really a combination of smcjones recommendation of improving the indexes for which I would recommend doing an EXPLAIN and looking at EXPLAIN Output Format then change the indexes to whatever gives you the best performance.
Also, hpf's recommendation to add another column with the total count helped tremendously. In the end, after changing the indexes, I ended up going with this query.

SELECT posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
JOIN meta_relationships ON meta_relationships.object_id = posts.post_id
JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE posts.meta_count = 2
GROUP BY posts.post_id
HAVING category = 'category,subcategory'

After getting rid of the COUNT, the big performance killer was the GROUP BY and ORDER BY, but the indexes are your best friend. I learned that when doing a GROUP BY, the WHERE clause is very important, the more specific you can get the better.

Can you provide `SHOW CREATE TABLE tablename` for each table, especially `meta_relation`, so we can see what the indices are composed of. — Willem Renzema, Apr 04 '15 at 16:54
Explain (in English) the purpose of the `NOT IN`; that's where the table scan is. (And you are lucky -- in older versions, it would run _much_ slower.) — Rick James, Apr 05 '15 at 22:02
@RickJames - The purpose of it is to eliminate any `object_id` that has more than one category or tag. — EternalHour, Apr 05 '15 at 23:54
It's a bit `meta` to me. You want to select posts that have at most one tag ? — LeGEC, Apr 20 '15 at 08:18
@LeGEC - On the example query yes, but I need to be able to select any number of categories. — EternalHour, Apr 20 '15 at 17:52

score 3 · Accepted Answer · edited Jun 20 '20 at 09:12

With a combination of optimized queries AND optimizing your tables, you will have fast queries. However, you cannot have fast queries without an optimized table.

I cannot stress this enough: If your tables are structured correctly with the correct amount of indexes, you should not be experiencing any full table reads on a query like GROUP BY... HAVING unless you do so by design.

Based on your example, I have created this SQLFiddle.

Compare that to SQLFiddle #2, in which I added indexes and added a UNIQUE index against meta.meta_naame.

From my testing, Fiddle #2 is faster.

Optimizing Your Query

This query was driving me nuts, even after I made the argument that indexes would be the best way to optimize this. Even though I still hold that the table is your biggest opportunity to increase performance, it did seem that there had to be a better way to run this query in MySQL. I had a revelation after sleeping on this problem, and used the following query (seen in SQLFiddle #3):

SELECT posts.post_id,posts.post_name,posts.post_title,posts.post_description,posts.date,meta.meta_name
   FROM posts
   LEFT JOIN meta_relationships ON meta_relationships.object_id = posts.post_id
   LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
   LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
   WHERE meta.meta_name = 'animals'
   GROUP BY meta_relationships.object_id
   HAVING sum(meta_relationships.object_id) = min(meta_relationships.object_id);

HAVING sum() = min() on a GROUP BY should check to see if there is more than one record of each type. Obviously, each time the record shows up, it will add more to the sum. (Edit: On subsequent tests it seems like this has the same impact as count(meta_relationships.object_id) = 1. Oh well, the point is I believe you can remove subquery and have the same result).

I want to be clear that you won't notice much if any optimization on the query I provided you unless the section, WHERE meta.meta_name = 'animals' is querying against an index (preferably a unique index because I doubt you'll need more than one of these and it will prevent accidental duplication of data).

So, instead of a table that looks like this:

CREATE TABLE meta_data (
  meta_data_id BIGINT,
  meta_id BIGINT,
  type VARCHAR(50),
  description VARCHAR(200),
  parent BIGINT,
  count BIGINT);

You should make sure you add primary keys and indexes like this:

CREATE TABLE meta_data (
  meta_data_id BIGINT,
  meta_id BIGINT,
  type VARCHAR(50),
  description VARCHAR(200),
  parent BIGINT,
  count BIGINT,
  PRIMARY KEY (meta_data_id,meta_id),
  INDEX ix_meta_id (meta_id)
);

Don't overdo it, but every table should have a primary key, and any time you are aggregating or querying against a specific value, there should be indexes.

When indexes are not used, the MySQL will walk through each row of the table until it finds what you want. In such a limited example as yours this doesn't take too long (even though it's still noticeably slower), but when you add thousands or more records, this will become extraordinarily painful.

In the future, when reviewing your queries, try to identify where your full table scans are occurring and see if there is an index on that column. A good place to start is wherever you are aggregating or using the WHERE syntax.

A note on the `count` column

I have not found putting count columns into the table to be helpful. It can lead to some pretty serious integrity issues. If a table is properly optimized, It should be very easy to use count() and get the current count. If you want to have it in a table, you can use a VIEW, although that will not be the most efficient way to make the pull.

The problem with putting count columns into a table is that you need to update that count, using either a TRIGGER or, worse, application logic. As your program scales out that logic can either get lost or buried. Adding that column is a deviation from normalization and when something like this is to occur, there should be a VERY good reason.

Some debate exists as to whether there is ever a good reason to do this, but I think I'd be wise to stay out of that debate because there are great arguments on both sides. Instead, I will pick a much smaller battle and say that I see this causing you more headaches than benefits in this use case, so it is probably worth A/B testing.

Although I do agree with you that the indexes need to be improved, the database is small enough that they shouldn't have a large impact. I believe the key to significantly improving the speed is to optimize the query itself and prevent the table scan. — EternalHour, Apr 20 '15 at 00:45
@EternalHour you do not know enough to make statements like that. Indices (indexes for the yanks) are what prevent table scans regardless of 'optimisation'. — David Soussan, Apr 22 '15 at 22:45
@DavidSoussan - I tend to agree with you, since "indices" are considered by the optimizer to perform the query. But regardless of what you THINK I know, I do already have indexes in place (same as suggested) and I still don't believe that's what's causing the performance issue. — EternalHour, Apr 23 '15 at 00:27
I am not sure your SQL is correct then, because if you tried to put a `PRIMARY KEY` on `meta_data.meta_data_id` then it would fail because three rows share 'ID 10.' Updating my answer to reflect things like this. — smcjones, Apr 23 '15 at 03:41
Your indexes are not optimized if you followed my instructions to the tee in the first run-through. You need a `UNIQUE` constraint against `meta.meta_name`. Otherwise the table will inevitably do a full table scan when you search for "animals" or "entertainment" or "computers" or whatever else. — smcjones, Apr 23 '15 at 13:35
I have made some final edits. Please let me know how that works. — smcjones, Apr 23 '15 at 14:43
It's a good answer @smcjones, currently testing it. I have lost sleep over this problem :) — EternalHour, Apr 23 '15 at 19:36
It is slightly faster after tweaking the indexes a bit. But I can't seem to find a way to select the records in a hierarchical fashion this way. For example, it selects every record that matches `animals`, but I may only need to select the `animals` rows that are not associated with a second category. — EternalHour, Apr 23 '15 at 20:33
Your examples don't really show this use case, or at least I had a hard time seeing it. By "category" do you mean what is found under the meta_data table and compared using `relationships`? I assume that's what you mean and not the `parent` column. The `GROUP BY... HAVING(item_relationships.object_id) = 1` should restrict to only one type. — smcjones, Apr 24 '15 at 01:01
The categories are held in the `meta` table. The `meta_data` table links the categories to `meta_relationships`, and `meta_relationships` links the posts to `meta_data` and each category. I've kind of lost hope that I will ever improve the performance. I may start trying to come up with a different schema instead. — EternalHour, Apr 24 '15 at 19:37
If it is possible it would be a good idea. Your tables are not quite 3NF and if they were your queries would be much simpler. In any case you should be able to group by category and meta data and then use the `HAVING` syntax to restrict by relationship ID. — smcjones, Apr 24 '15 at 19:50

hpf · Answer 2 · 2015-04-20T21:48:53.167

2

Since the HAVING seems to be the issue, can you instead create a flag field in the posts table and use that instead? If I understand the query correctly, you're trying to find posts with only one meta_relationship link. If you created a field in your posts table that was either a count of the meta_relationships for that post, or a boolean flag for whether there was only one, and indexed it of course, that would probably be much faster. It would involve updating the field if the post was edited.

So, consider this:

Add a new field to the posts table called "num_meta_rel". It can be an unsigned tinyint as long as you'll never have more than 255 tags to any one post.

Update the field like this:

UPDATE posts
SET num_meta_rel=(SELECT COUNT(object_id) from meta_relationships WHERE object_id=posts.post_id);

This query will take some time to run, but once done you have all the counts precalculated. Note this can be done better with a join, but SQLite (Ideone) only allows subqueries.

Now, you rewrite your query like this:

SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
RIGHT JOIN meta_relationships ON (posts.post_id = meta_relationships.object_id)
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
LEFT JOIN meta ON meta_data.meta_id = meta.meta_id
WHERE meta.meta_name = computers AND posts.num_meta_rel=1
GROUP BY meta_relationships.object_id

If I've done this correctly, the runnable code is here: http://ideone.com/ZZiKgx

Note that this solution requires that you update the num_meta_rel (choose a better name, that one is terrible...) if the post has a new tag associated with it. But that should be much faster than scanning your entire table over and over.

edited Apr 20 '15 at 21:48

answered Apr 19 '15 at 11:54

hpf

428
2
9

It's not really based on the count of categories/tags, the count allows me to only select the posts with that correct level of category. For example, the category computers has posts with only the computers category but it also has posts with computers/hardware. The count filters out posts that contain those extra categories. I hope that makes sense. – EternalHour Apr 19 '15 at 19:44
Right. The problem is the subquery: SELECT meta_relationships.object_id FROM meta_relationships GROUP BY meta_relationships.object_id HAVING count(*) > 1. This requires that the entire table (or index) be read and grouped, and then all of the rows with >1 count be dropped. See my edit above for more details. – hpf Apr 20 '15 at 20:40
I'm glad you understand the problem. I'm not excited about having another column to update, but if it works this may be the way to go. `meta_relationships.meta_order` actually functions much the same way you are proposing. But when I do `WHERE meta.meta_name = computers AND meta_relationships.meta_order=1` it still returns every row with the category of computers because there could also be a meta_order = 2. Ideally this is the way I'd like to get it working. Can't test your solution at the moment. – EternalHour Apr 20 '15 at 22:32
Using meta_order will only work if you only want to match rows where "computer" (or whatever tag you're looking for) is always in the first position. In that case, you could use an OUTER JOIN looking for the id for "computer" in position 1 and NULL in position 2 (meaning there isn't another tag. – hpf Apr 20 '15 at 23:01
It will always be in the first position, second category will be in second position, etc. – EternalHour Apr 21 '15 at 00:09
I was really excited about this method, the query executed in 0.06s! But, I found out after trying to select more than one category at a time (animals, birds) that it didn't select any rows. – EternalHour Apr 23 '15 at 20:44
First, I think that smcjones has good points and this is a less-than-ideal solution. But, it will work fine if you're looking for two categories -- just make sure you change the posts.num_meta_rel=1 to posts.num_meta_rel=2. In any case, further optimization as above is probably the way to go. – hpf Apr 24 '15 at 21:22
I was changing it to 2. But unfortunately I think there is a much bigger problem here. My original query in the question is not working properly anymore and also has the same problem of only being able to select one category. Once I get that sorted out I think one of these answers may actually be what I need. – EternalHour Apr 24 '15 at 21:39

score 1 · Answer 3 · answered Apr 06 '15 at 00:14

1

See if this gives you the right answer, possibly faster:

SELECT  p.post_id, p.post_name,
        GROUP_CONCAT(IF(md.type = 'category', meta.meta_name, null)) AS category,
        GROUP_CONCAT(IF(md.type = 'tag', meta.meta_name, null)) AS tag
    FROM  
      ( SELECT  object_id
            FROM  meta_relation
            GROUP BY  object_id
            HAVING  count(*) = 1 
      ) AS x
    JOIN  meta_relation AS mr ON mr.object_id = x.object_id
    JOIN  posts AS p ON p.post_id = mr.object_id
    JOIN  meta_data AS md ON mr.meta_data_id = md.meta_data_id
    JOIN  meta ON md.meta_id = meta.meta_id
    WHERE  meta.meta_name = ?
    GROUP BY  mr.object_id

answered Apr 06 '15 at 00:14

Rick James

135,179
13
127
222

Unfortunately this query is only slightly faster than mine at 0.16s. It also still reads all rows in `meta_relation`. – EternalHour Apr 06 '15 at 08:28
I can't think of a way to do the HAVING without reading all the rows. Or at least all the rows of one index, which you seemed to have, since it said "Using index". – Rick James Apr 06 '15 at 14:46
Oh, I have another idea -- but it depends on `WHERE meta.meta_name = ?` being quite selective; is it? – Rick James Apr 06 '15 at 14:46
`WHERE meta.meta_name = ?` can contain multiple categories and tags. – EternalHour Apr 06 '15 at 19:23

score 1 · Answer 4 · answered Apr 20 '15 at 16:42

1

Unfortunately I have no possibility to test performance,

But try my query using your real data:

http://sqlfiddle.com/#!9/81b29/13

SELECT
posts.post_id,posts.post_name,
GROUP_CONCAT(IF(meta_data.type = 'category', meta.meta_name,null)) AS category,
GROUP_CONCAT(IF(meta_data.type = 'tag', meta.meta_name,null)) AS tag
FROM posts
INNER JOIN (
  SELECT meta_relationships.object_id
   FROM meta_relationships 
   GROUP BY meta_relationships.object_id 
   HAVING count(*) < 3
  ) mr ON mr.object_id = posts.post_id
LEFT JOIN meta_relationships ON mr.object_id = meta_relationships.object_id
LEFT JOIN meta_data ON meta_relationships.meta_data_id = meta_data.meta_data_id
INNER JOIN (
  SELECT * 
  FROM meta
  WHERE  meta.meta_name = 'health'
  ) meta ON meta_data.meta_id = meta.meta_id
GROUP BY posts.post_id

answered Apr 20 '15 at 16:42

Alex

16,739
1
28
51

Thanks for the answer Alex. This query is actually faster @ 0.11s, but there is still a table scan :( – EternalHour Apr 20 '15 at 17:59
Could you provide more data for fiddle and debug? – Alex Apr 20 '15 at 18:26
Sure, what data do you need? In the fiddle, you can see that there are 18 rows for `meta_relationships`. If you do an `EXPLAIN` it shows 18 rows were read (table scan), this is what I'm trying to avoid. The query returns 2 rows so I'm hoping to see 2 rows for `meta_relationships` in the `EXPLAIN`. – EternalHour Apr 20 '15 at 18:35
I just need more data sample to see speed improvements if any – Alex Apr 20 '15 at 18:42
Well there are 15,000 rows in that table, I won't be able to add them in the fiddle. If you can avoid the table scan, performance will improve automatically. – EternalHour Apr 20 '15 at 18:44
I also added more details to the question, hopefully it helps. – EternalHour Apr 20 '15 at 19:07

score 1 · Answer 5 · edited Apr 22 '15 at 11:00

1

Use

sum(1)

instead of

count(*)

edited Apr 22 '15 at 11:00

eebbesen

5,070
8
48
70

answered Apr 22 '15 at 10:24

Ratan K

64
4

This has the same result but is a little slower. – EternalHour Apr 22 '15 at 18:40

Alternative to COUNT for innodb to prevent table scan?

5 Answers5

Optimizing Your Query

A note on the count column

A note on the `count` column