
I have a table with over 1.5 billion rows in a high-load project. I want to find duplicates by 3 columns using this query:

SELECT v.product_id
     , v.value_id
     , COUNT(*) counter 
  FROM ann2_product_value AS v
 GROUP 
    BY v.product_id
     , v.value_id
     , v.set 
HAVING counter > 1
 ORDER 
    BY v.product_id
     , v.value_id 
 LIMIT 20;

and the CREATE TABLE statement is:

CREATE TABLE `ann2_product_value` (
  `product_id` bigint(20) unsigned NOT NULL,
  `value_id` int(11) NOT NULL,
  `index` int(11) NOT NULL DEFAULT '0',
  `set` int(11) NOT NULL,
  `fitment` enum('yes','no') NOT NULL DEFAULT 'yes',
  PRIMARY KEY (`product_id`,`value_id`,`index`,`set`,`fitment`),
  KEY `value_id` (`value_id`),
  KEY `product_fitment` (`product_id`,`fitment`),
  CONSTRAINT `ann2_product_value_ibfk_1` FOREIGN KEY (`product_id`) REFERENCES `xcart_products` (`productid`) ON DELETE CASCADE ON UPDATE CASCADE,
  CONSTRAINT `fk_value_id_to_value` FOREIGN KEY (`value_id`) REFERENCES `ann2_value` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 STATS_PERSISTENT=1 STATS_AUTO_RECALC=0 ROW_FORMAT=DYNAMIC

The problem is that the query above is very, very slow, and EXPLAIN showed me these results:

+----+-------------+----------+-------+----------------------------------+----------+---------+------+------------+----------------------------------------------+
| id | select_type | table    | type  | possible_keys                    | key      | key_len | ref  | rows       | Extra                                        |
+----+-------------+----------+-------+----------------------------------+----------+---------+------+------------+----------------------------------------------+
|  1 | SIMPLE      | prod_val | index | PRIMARY,value_id,product_fitment | value_id | 4       | NULL | 1499189079 | Using index; Using temporary; Using filesort |
+----+-------------+----------+-------+----------------------------------+----------+---------+------+------------+----------------------------------------------+

And the main thing I want is to skip the first record of each duplicate group and select only the duplicates that come after it. How can I do this? Please help. Thanks.
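For reference, one way to express "everything after the first record of each duplicate group" is to number the rows within each (product_id, value_id, `set`) group and keep only the rows numbered above 1. The sketch below is an illustration, not a tested solution: it assumes MySQL 8.0+ for ROW_NUMBER(), and it assumes "first" means the row with the lowest (`index`, fitment) values, since those are the remaining PRIMARY KEY columns:

-- Sketch: assumes MySQL 8.0+ (window functions) and that "first" means the
-- lowest (`index`, fitment) within each (product_id, value_id, `set`) group
SELECT product_id
     , value_id
     , `set`
     , `index`
     , fitment
  FROM (SELECT product_id
             , value_id
             , `set`
             , `index`
             , fitment
             , ROW_NUMBER() OVER (PARTITION BY product_id, value_id, `set`
                                      ORDER BY `index`, fitment) AS rn
          FROM ann2_product_value
       ) AS numbered
 WHERE rn > 1
 LIMIT 20;

On MySQL 5.7 and earlier, which lack window functions, the same rows can be found with a self-join that matches each row to an "earlier" row of its group on the remaining key columns; the deletion sketch after the comments below uses that pattern.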

  • Pretend that there are 12 rows, and see: [Why should I provide an MCVE for what seems to me to be a very simple SQL query?](https://meta.stackoverflow.com/questions/333952/why-should-i-provide-an-mcve-for-what-seems-to-me-to-be-a-very-simple-sql-query) – Strawberry Aug 01 '18 at 13:06
  • I have optimized the query by adding a composite index like this: CREATE INDEX product_id_value_id_set ON ann2_product_value (product_id, value_id, `set`) (building it took about 1 hour and 25 minutes), but I don't understand how to choose the correct duplicate rows while skipping the first record of each duplicate group. – Анатолий Костяков Aug 01 '18 at 13:18
  • Also, when you group by a column that is absent from the select, how do you know which count belongs to which `set`? – Strawberry Aug 01 '18 at 13:28
  • How big is your server memory? – Jacob Aug 01 '18 at 14:48
  • A testing server with 64 GB of RAM and 16 threads, but I am not the only person working on this server – Анатолий Костяков Aug 01 '18 at 21:27
  • Is the EXPLAIN for your query above? It refers to another table, prod_val – Jacob Aug 01 '18 at 23:21
  • Either get rid of `v.set` from the `GROUP BY`, or justify its existence. – Rick James Aug 19 '18 at 03:05
  • For such a large table, consider using datatypes smaller than the 4-byte `INT`. – Rick James Aug 19 '18 at 03:07
  • Is your goal to delete duplicate rows? If so, define "duplicate". The _current_ `PRIMARY KEY` defines "duplicate" as all 5 columns. Your query seems to be defining it as three columns. – Rick James Aug 19 '18 at 03:09
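
If the eventual goal is to delete the duplicates (the question raised in the last comment), a self-join DELETE is one option that also works before MySQL 8.0. This is a sketch under the same assumption as above: the row to keep per (product_id, value_id, `set`) group is the one with the lowest (`index`, fitment):

-- Sketch only: keeps the row with the lowest (`index`, fitment) per group.
-- fitment + 0 yields the ENUM's index ('yes' = 1, 'no' = 2), giving an explicit order.
-- On 1.5 billion rows, run this in bounded batches rather than as one statement.
DELETE a
  FROM ann2_product_value AS a
  JOIN ann2_product_value AS b
    ON b.product_id = a.product_id
   AND b.value_id   = a.value_id
   AND b.`set`      = a.`set`
   AND (b.`index` < a.`index`
        OR (b.`index` = a.`index` AND b.fitment + 0 < a.fitment + 0));

The composite index from the comments, (product_id, value_id, `set`), should help both this join and the GROUP BY in the original query.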

0 Answers