0

I have an article table, it can be associated with many categories.

Here is solution 1 with FIND_IN_SET:

 CREATE TABLE `article` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `cat` varchar(512) DEFAULT NULL,
  `created_at` timestamp NULL DEFAULT NULL,
  `updated_at` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5090104 DEFAULT CHARSET=utf8mb4

And here is solution 2 with table join:

 CREATE TABLE `article_2` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `created_at` timestamp NULL DEFAULT NULL,
  `updated_at` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5511464 DEFAULT CHARSET=utf8mb4 


 CREATE TABLE `article_cat` (
  `cat` int(10) unsigned NOT NULL,
  `article_id` int(10) unsigned NOT NULL,
  `created_at` timestamp NULL DEFAULT NULL,
  `updated_at` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`cat`,`article_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 

Then, here is some query comparison:

mysql> select count(1) from article;
+----------+
| count(1) |
+----------+
|  5207355 |
+----------+
1 row in set (1.12 sec)

mysql> select count(1) from article_2;
+----------+
| count(1) |
+----------+
|  5207313 |
+----------+
1 row in set (0.74 sec)

mysql> select count(1) from article_cat;
+----------+
| count(1) |
+----------+
| 20589375 |
+----------+
1 row in set (2.85 sec)


mysql> select count(1) from `article` where FIND_IN_SET('2',`article`.`cat`);
+----------+
| count(1) |
+----------+
|  2157053 |
+----------+
1 row in set (1.22 sec)

mysql> 
mysql> select count(1) from `article_2` u INNER JOIN article_cat c ON c.article_id = u.id WHERE c.cat =2;
+----------+
| count(1) |
+----------+
|  2156622 |
+----------+
1 row in set (13.72 sec)


mysql> select * from `article` where FIND_IN_SET('2',`article`.`cat`) limit 2000000,10;
+---------+-----------------+---------------------+---------------------+
| id      | cat             | created_at          | updated_at          |
+---------+-----------------+---------------------+---------------------+
| 5132154 | 2,5,7           | 2019-05-13 15:24:58 | 2019-05-13 15:24:58 |
| 5132155 | 1,2,3,5,6,8,9   | 2019-05-13 15:24:58 | 2019-05-13 15:24:58 |
| 5132156 | 1,2,6,7         | 2019-05-13 15:24:58 | 2019-05-13 15:24:58 |
| 5132157 | 1,2,3,5,9       | 2019-05-13 15:24:58 | 2019-05-13 15:24:58 |
| 5132158 | 1,2,3,4,6,7,9   | 2019-05-13 15:24:58 | 2019-05-13 15:24:58 |
| 5132159 | 1,2,5,6,7       | 2019-05-13 15:24:58 | 2019-05-13 15:24:58 |
| 5132160 | 2,3,4,5,7,9     | 2019-05-13 15:24:58 | 2019-05-13 15:24:58 |
| 5132161 | 1,2,5,9         | 2019-05-13 15:24:58 | 2019-05-13 15:24:58 |
| 5132164 | 1,2,3,4,6,7,8,9 | 2019-05-13 15:24:58 | 2019-05-13 15:24:58 |
| 5132166 | 1,2,3,4,5,8     | 2019-05-13 15:24:58 | 2019-05-13 15:24:58 |
+---------+-----------------+---------------------+---------------------+
10 rows in set (1.28 sec)

mysql> select * from `article_2` u INNER JOIN article_cat c ON c.article_id = u.id WHERE c.cat =2 limit 2000000,10 ;
+---------+---------------------+---------------------+-----+------------+---------------------+---------------------+
| id      | created_at          | updated_at          | cat | article_id | created_at          | updated_at          |
+---------+---------------------+---------------------+-----+------------+---------------------+---------------------+
| 5133109 | 2019-05-13 15:25:01 | 2019-05-13 15:25:01 |   2 |    5133109 | 2019-05-13 15:25:01 | 2019-05-13 15:25:01 |
| 5133110 | 2019-05-13 15:25:01 | 2019-05-13 15:25:01 |   2 |    5133110 | 2019-05-13 15:25:01 | 2019-05-13 15:25:01 |
| 5133113 | 2019-05-13 15:25:01 | 2019-05-13 15:25:01 |   2 |    5133113 | 2019-05-13 15:25:01 | 2019-05-13 15:25:01 |
| 5133116 | 2019-05-13 15:25:01 | 2019-05-13 15:25:01 |   2 |    5133116 | 2019-05-13 15:25:01 | 2019-05-13 15:25:01 |
| 5133117 | 2019-05-13 15:25:01 | 2019-05-13 15:25:01 |   2 |    5133117 | 2019-05-13 15:25:01 | 2019-05-13 15:25:01 |
| 5133120 | 2019-05-13 15:25:01 | 2019-05-13 15:25:01 |   2 |    5133120 | 2019-05-13 15:25:01 | 2019-05-13 15:25:01 |
| 5133124 | 2019-05-13 15:25:01 | 2019-05-13 15:25:01 |   2 |    5133124 | 2019-05-13 15:25:01 | 2019-05-13 15:25:01 |
| 5133133 | 2019-05-13 15:25:01 | 2019-05-13 15:25:01 |   2 |    5133133 | 2019-05-13 15:25:01 | 2019-05-13 15:25:01 |
| 5133137 | 2019-05-13 15:25:01 | 2019-05-13 15:25:01 |   2 |    5133137 | 2019-05-13 15:25:01 | 2019-05-13 15:25:01 |
| 5133138 | 2019-05-13 15:25:01 | 2019-05-13 15:25:01 |   2 |    5133138 | 2019-05-13 15:25:01 | 2019-05-13 15:25:01 |
+---------+---------------------+---------------------+-----+------------+---------------------+---------------------+
10 rows in set (14.01 sec)

From what I know, FIND_IN_SET does a full scan on columns, it doesn't benefit from an index, but here I see it has a better performance than join table solution, is this normal?

would it be a good idea to use FIND_IN_SET where the table article size increased to 10000000+? if not, then when it would be better to use FIND_IN_SET?

UPDATE:

FIND_IN_SET does perform better in some cases, the only problem is that is not relational database design style.

seaguest
  • 2,510
  • 5
  • 27
  • 45
  • 1
    Apples to oranges. The `FIND_IN_SET` query only has to do a single table scan, while the join may have to do many tables scans. These queries are not doing the same thing. But, you should avoid storing CSV data in your tables. Having the need to even use `FIND_IN_SET` is a smell in your code, IMHO. – Tim Biegeleisen May 13 '19 at 03:25
  • Use `EXPLAIN` to show index usage. Your `idx_c` index needs to swap its fields around (or better, use this as the primary key), then it ca use the `WHERE c.cat=2` and the `ON c.article_id` on that same index. – danblack May 13 '19 at 03:27
  • @danblack I add index on (cat, article_id), the article_cat table has 2123798 record, about 4 times of article's size. the new index is used, but table join performance gets worse. – seaguest May 13 '19 at 03:39
  • Try moving the `WHERE c.cat = 2` into the `JOIN` condition i.e. `ON c.article_id = u.id AND c.cat =2` so that MySQL doesn't attempt to do the `WHERE` on the entire output of the `JOIN`. That should also improve performance with the index you added. – Nick May 13 '19 at 03:48
  • @Nick both use idx_c1, both take 4.7s, while FIND_IN_SET takes 0.4s. – seaguest May 13 '19 at 03:54
  • Solution 1 isn't a solution. – Strawberry May 13 '19 at 05:32

1 Answers1

0

The join performance is slow because you're looking up a lot of rows in the index. and relooking them up to get timestamps and other fields (which aren't in the index.

Recomendations for article_cat.

  1. id isn't needed
  2. cat, article_id is a good primary key, especially for this query. If article_ids are looked up more, then swap the order and add cat,article_id as a UNIQUE KEY

remove created_at, updated_at, given these are in the article table these would be a duplicate, and a duplicate that is slowing down the JOIN queries.

danblack
  • 12,130
  • 2
  • 22
  • 41
  • I did with your advises, but it doesn't improve much (see the query stats in question). Especially when the limit is very big, the difference is huge. – seaguest May 13 '19 at 07:56
  • The raw query time of the query in question is the important one. Do you really want 5M articles returned? Selecting the count from the full table doesn't lead to much of a conclusion compared to the real query. The `EXPLAIN {query}` is a validation on ensuring the right indexes are in the query and also hints as to if other affects are in play. – danblack May 13 '19 at 08:51
  • in some cases I may need scan the full table. even though that is not a relational design, it indeed performs better. – seaguest May 14 '19 at 08:43