MySQL update join query to solve duplicate Values

Question

I have a Categories table which has some duplicate Categories as described below,

`Categories`
+========+============+============+
| cat_id | cat_name   | item_count |
+========+============+============+
|      1 | Category 1 |         2  |
|      2 | Category 1 |         1  |
|      3 | Category 2 |         2  |
|      4 | Category 3 |         1  |
|      5 | Category 3 |         1  |
+--------+------------+------------+

Here is another junction table which relates to another Items table. The item_count in the first table is the total number of items per cat_id.

`Junction`
+========+=========+
| cat_id | item_id |
+========+=========+
|      1 |     100 |
|      1 |     101 |
|      2 |     102 |
|      3 |     103 |
|      3 |     104 |
|      4 |     105 |
|      5 |     106 |
+--------+---------+

How do I add or combine those items from the duplicate Categories into ones each having maximum item_count among their duplicates? (e.g. Category 1).

Also, if the item_count is the same for those duplicate ones, then the Category with maximum cat_id will be chosen and item_count will be combined to that record. (e.g. Category 3).

Note: Instead of removing the duplicate records, the item_count will be set to 0.

Below is the expected result.

+========+============+============+
| cat_id | cat_name   | item_count |
+========+============+============+
|      1 | Category 1 |         3  |
|      2 | Category 1 |         0  |
|      3 | Category 2 |         2  |
|      4 | Category 3 |         0  |
|      5 | Category 3 |         2  |
+--------+------------+------------+

+========+=========+
| cat_id | item_id |
+========+=========+
|      1 |     100 |
|      1 |     101 |
|      1 |     102 |
|      3 |     103 |
|      3 |     104 |
|      5 |     105 |
|      5 |     106 |
+--------+---------+

In the result, there are two duplicates Category 1 and Category 3. And we have 2 scenarios,

cat_id=2 is eliminated because its item_count=1 is less than that of cat_id=1 which is item_count=2.
cat_id=4 is eliminated even though its item_count is the same as that of cat_id=5 since 5 is the maximum among duplicate Category 3.

Please help me if any query that can join and update both tables in order to solve the duplicates.

Do you want minimum or maximum cat_id? what you write and what in the tables are not the same: Id=3 cat_id=1 — genespos, Nov 24 '15 at 14:49
@Strawberry `id` column has nothing to do with the result set but an autocrement primary column for indexing — Aung Myo Linn, Nov 24 '15 at 14:53
@genespos `cat_id` = 1 has the maximum `item_count` among its duplicates, thus the item related to the duplicate `cat_id` = 2 should be added to `cat_id` = 1. — Aung Myo Linn, Nov 24 '15 at 15:05
@Strawberry In the result, we have two cases, (1) `cat_id` = 2 is eliminated because its `item_count` = 1 is less than that of `cat_id` = 1 which is `item_count` = 2. (2) `cat_id` = 4 is eliminated even though its `item_count` is the same as that of `cat_id` = 5 since 5 is the maximum among duplicate `Category 3`. — Aung Myo Linn, Nov 24 '15 at 15:48

score 3 · Accepted Answer · answered Nov 24 '15 at 16:03

Here's a SELECT. You can figure out to adapt it to an UPDATE ;-)

I've ignored the jucntion table for simplicity

SELECT z.cat_id
     , z.cat_name
     , (z.cat_id = x.cat_id) * new_count item_count
  FROM categories x 
  LEFT 
  JOIN categories y 
    ON y.cat_name = x.cat_name 
   AND (y.item_count > x.item_count OR (y.item_count = x.item_count AND y.cat_id > x.cat_id))
  LEFT
  JOIN 
     ( SELECT a.cat_id, b.*
         FROM categories a
         JOIN 
            ( SELECT cat_name, SUM(item_count) new_count, MAX(item_count) max_count FROM categories GROUP BY cat_name) b
           ON b.cat_name = a.cat_name
     ) z
    ON z.cat_name = x.cat_name
 WHERE y.cat_id IS NULL;

+--------+------------+------------+
| cat_id | cat_name   | item_count |
+--------+------------+------------+
|      1 | Category 1 |          3 |
|      2 | Category 1 |          0 |
|      3 | Category 2 |          2 |
|      4 | Category 3 |          0 |
|      5 | Category 3 |          2 |
+--------+------------+------------+

+1 Thanks again, your query can solve the first table. Now I'm trying the update query part for the second one. :) — Aung Myo Linn, Nov 24 '15 at 16:36
The easiest way: `UPDATE categories x JOIN (the query) y ON y.cat_id = x.cat_id SET x.item_count = y.item_count;` but note that it's not usually a great idea to store derived data. — Strawberry, Nov 25 '15 at 07:16

alessalessio · Answer 2 · 2015-11-24T16:55:54.810

 DELIMITER $$
 DROP PROCEDURE IF EXISTS  cursor_proc $$
 CREATE PROCEDURE cursor_proc()
 BEGIN
   DECLARE @cat_id   INT;
   DECLARE @cat_name VARCHAR(255);
   DECLARE @item_count INT;

   DECLARE @prev_cat_Name VARCHAR(255);
   DECLARE @maxItemPerCategory INT;
   DECLARE @maxItemId INT DEFAULT 0;
   DECLARE @totalItemsCount INT;
   -- this flag will be set to true when cursor reaches end of table
   DECLARE exit_loop BOOLEAN;         
   -- Declare the cursor
   DECLARE categories_cursor CURSOR FOR
     SELECT select cat_id ,cat_name ,item_count from Categories Order By cat_name, cat_id;
   -- set exit_loop flag to true if there are no more rows
   DECLARE CONTINUE HANDLER FOR NOT FOUND SET exit_loop = TRUE;
   -- open the cursor
   OPEN categories_cursor;
   -- start looping
   categories_loop: LOOP
     -- read the name from next row into the variables 
     FETCH  categories_cursor INTO @cat_id, @cat_name, @item_count ;

     -- close the cursor and exit the loop if it has.
     IF exit_loop THEN
         CLOSE categories_loop;
         LEAVE categories_loop;
     END IF;

       IF(@prev_cat_Name <> @cat_name)
        THEN 
        -- Category has changed, set the item_count of the 'best' category with the total items count
        IF(@maxItemId > 0)
        THEN
          UPDATE Categories  
            SET Categories.item_count=@totalItemsCount
           WHERE Categories.cat_id=@maxItemId;         
        END IF;

       -- Reset Values with the actual row values                        
          SET @maxItemPerCategory = @item_count;
          SET @prev_cat_Name = @cat_name;
          SET @maxItemId = @cat_id
          SET @totalItemsCount = @item_count;
       ELSE
      -- increment the total items count
          SET @totalItemsCount = @totalItemsCount + @item_count

       -- if the actual row has the maximun item counts, then it is the 'best'
           IF (@maxIntPerCategory < @item_count)
           THEN 
             SET @maxIntPerCategory = @item_count
             SET @maxItemId = @cat_id
           ELSE
         -- else, this row is not the best of its Category
              UPDATE Categories  
                 SET Categories.item_count=0
               WHERE Categories.cat_id=@cat_id; 
           END IF; 

       END IF;


   END LOOP categories_loop;
 END $$
 DELIMITER ;

@ alessalessio @Mikey Could you please elaborate and shed me some light? :) — Aung Myo Linn, Nov 24 '15 at 15:21

score 1 · Answer 3 · edited Nov 25 '15 at 07:35

1

It's not pretty and copied in part from Strawberry's SELECT

UPDATE categories cat, 
    junction jun,
    (select 
    (z.cat_id = x.cat_id) * new_count c,
     x.cat_id newcatid,
     z.cat_id oldcatid
    from categories x 
      LEFT 
      JOIN categories y 
        ON y.cat_name = x.cat_name 
       AND (y.item_count > x.item_count OR (y.item_count = x.item_count AND y.cat_id > x.cat_id))
      LEFT
      JOIN 
         ( SELECT a.cat_id, b.*
             FROM categories a
             JOIN 
                ( SELECT cat_name, SUM(item_count) new_count, MAX(item_count) max_count FROM categories GROUP BY cat_name) b
               ON b.cat_name = a.cat_name
         ) z
        ON z.cat_name = x.cat_name

     WHERE
     y.cat_id IS NULL) sourceX

     SET cat.item_count = sourceX.c, jun.cat_id = sourceX.newcatid
     WHERE cat.cat_id = jun.cat_id and cat.cat_id = sourceX.oldcatid

edited Nov 25 '15 at 07:35

Aung Myo Linn

2,820
3
27
38

answered Nov 24 '15 at 17:29

Mikey

629
1
11
19

Yes, the adaptation could be prettier – Strawberry Nov 24 '15 at 17:33
I've tested your update query, it works. Thanks @Mikey but I have to give credit to @ Strawberry. – Aung Myo Linn Nov 25 '15 at 07:22
The credit belongs to @Strawberry ;). I just wanted to see if it all works witha single query :) – Mikey Nov 25 '15 at 07:33

genespos · Answer 4 · 2015-11-24T18:08:58.450

I think it's better to do what you want one step at time:

First, get data you need:

SELECT Max(`cat_id`), sum(`item_count`) FROM `Categories` GROUP BY `cat_name`

With these data you'll be able to check if update was correctly done.

Then, with a loop on acquired data, update:

update Categories set item_count =
    (
    Select Tot FROM (
        Select sum(`item_count`) as Tot
        FROM `Categories`
        WHERE `cat_name` = '@cat_name') as tmp1
    )
WHERE cat_id = (
    Select MaxId
    FROM (
        select max(cat_id) as MaxId
        FROM Categories
        WHERE `cat_name` = '@cat_name') as tmp2)

Pay attention, if you run twice this code the result will be wrong.

Finally, set others Ids to 0

UPDATE Categories set item_count = 0
WHERE `cat_name` = '@cat_name'
AND cat_id <> (
    Select MaxId
    FROM (
        select max(cat_id) as MaxId
        FROM items
        WHERE `cat_name` = '@cat_name0') as tmp2)

Your second update doesn't work, `items` in the sub query should be `Categories`? even if so, the result is still wrong. — Aung Myo Linn, Nov 25 '15 at 02:31

MySQL update join query to solve duplicate Values

4 Answers4