0

I have a Categories table which has some duplicate Categories as described below,

`Categories`
+========+============+============+
| cat_id | cat_name   | item_count |
+========+============+============+
|      1 | Category 1 |         2  |
|      2 | Category 1 |         1  |
|      3 | Category 2 |         2  |
|      4 | Category 3 |         1  |
|      5 | Category 3 |         1  |
+--------+------------+------------+

Here is another junction table which relates to another Items table. The item_count in the first table is the total number of items per cat_id.

`Junction`
+========+=========+
| cat_id | item_id |
+========+=========+
|      1 |     100 |
|      1 |     101 |
|      2 |     102 |
|      3 |     103 |
|      3 |     104 |
|      4 |     105 |
|      5 |     106 |
+--------+---------+

How do I add or combine those items from the duplicate Categories into ones each having maximum item_count among their duplicates? (e.g. Category 1).

Also, if the item_count is the same for those duplicate ones, then the Category with maximum cat_id will be chosen and item_count will be combined to that record. (e.g. Category 3).

Note: Instead of removing the duplicate records, the item_count will be set to 0.

Below is the expected result.

+========+============+============+
| cat_id | cat_name   | item_count |
+========+============+============+
|      1 | Category 1 |         3  |
|      2 | Category 1 |         0  |
|      3 | Category 2 |         2  |
|      4 | Category 3 |         0  |
|      5 | Category 3 |         2  |
+--------+------------+------------+

+========+=========+
| cat_id | item_id |
+========+=========+
|      1 |     100 |
|      1 |     101 |
|      1 |     102 |
|      3 |     103 |
|      3 |     104 |
|      5 |     105 |
|      5 |     106 |
+--------+---------+

In the result, there are two duplicates Category 1 and Category 3. And we have 2 scenarios,

  1. cat_id=2 is eliminated because its item_count=1 is less than that of cat_id=1 which is item_count=2.
  2. cat_id=4 is eliminated even though its item_count is the same as that of cat_id=5 since 5 is the maximum among duplicate Category 3.

Please help me if any query that can join and update both tables in order to solve the duplicates.

Aung Myo Linn
  • 2,820
  • 3
  • 27
  • 38
  • Do you want minimum or maximum cat_id? what you write and what in the tables are not the same: Id=3 cat_id=1 – genespos Nov 24 '15 at 14:49
  • @Strawberry `id` column has nothing to do with the result set but an autocrement primary column for indexing – Aung Myo Linn Nov 24 '15 at 14:53
  • 1
    @genespos `cat_id` = 1 has the maximum `item_count` among its duplicates, thus the item related to the duplicate `cat_id` = 2 should be added to `cat_id` = 1. – Aung Myo Linn Nov 24 '15 at 15:05
  • What if you have tied item_counts? – Strawberry Nov 24 '15 at 15:37
  • @Strawberry In the result, we have two cases, (1) `cat_id` = 2 is eliminated because its `item_count` = 1 is less than that of `cat_id` = 1 which is `item_count` = 2. (2) `cat_id` = 4 is eliminated even though its `item_count` is the same as that of `cat_id` = 5 since 5 is the maximum among duplicate `Category 3`. – Aung Myo Linn Nov 24 '15 at 15:48
  • @Strawberry Thanks for improving the question by the way :) – Aung Myo Linn Nov 24 '15 at 15:57

4 Answers4

3

Here's a SELECT. You can figure out to adapt it to an UPDATE ;-)

I've ignored the jucntion table for simplicity

SELECT z.cat_id
     , z.cat_name
     , (z.cat_id = x.cat_id) * new_count item_count
  FROM categories x 
  LEFT 
  JOIN categories y 
    ON y.cat_name = x.cat_name 
   AND (y.item_count > x.item_count OR (y.item_count = x.item_count AND y.cat_id > x.cat_id))
  LEFT
  JOIN 
     ( SELECT a.cat_id, b.*
         FROM categories a
         JOIN 
            ( SELECT cat_name, SUM(item_count) new_count, MAX(item_count) max_count FROM categories GROUP BY cat_name) b
           ON b.cat_name = a.cat_name
     ) z
    ON z.cat_name = x.cat_name
 WHERE y.cat_id IS NULL;

+--------+------------+------------+
| cat_id | cat_name   | item_count |
+--------+------------+------------+
|      1 | Category 1 |          3 |
|      2 | Category 1 |          0 |
|      3 | Category 2 |          2 |
|      4 | Category 3 |          0 |
|      5 | Category 3 |          2 |
+--------+------------+------------+
Strawberry
  • 33,750
  • 13
  • 40
  • 57
  • +1 Thanks again, your query can solve the first table. Now I'm trying the update query part for the second one. :) – Aung Myo Linn Nov 24 '15 at 16:36
  • 1
    The easiest way: `UPDATE categories x JOIN (the query) y ON y.cat_id = x.cat_id SET x.item_count = y.item_count;` but note that it's not usually a great idea to store derived data. – Strawberry Nov 25 '15 at 07:16
1
 DELIMITER $$
 DROP PROCEDURE IF EXISTS  cursor_proc $$
 CREATE PROCEDURE cursor_proc()
 BEGIN
   DECLARE @cat_id   INT;
   DECLARE @cat_name VARCHAR(255);
   DECLARE @item_count INT;

   DECLARE @prev_cat_Name VARCHAR(255);
   DECLARE @maxItemPerCategory INT;
   DECLARE @maxItemId INT DEFAULT 0;
   DECLARE @totalItemsCount INT;
   -- this flag will be set to true when cursor reaches end of table
   DECLARE exit_loop BOOLEAN;         
   -- Declare the cursor
   DECLARE categories_cursor CURSOR FOR
     SELECT select cat_id ,cat_name ,item_count from Categories Order By cat_name, cat_id;
   -- set exit_loop flag to true if there are no more rows
   DECLARE CONTINUE HANDLER FOR NOT FOUND SET exit_loop = TRUE;
   -- open the cursor
   OPEN categories_cursor;
   -- start looping
   categories_loop: LOOP
     -- read the name from next row into the variables 
     FETCH  categories_cursor INTO @cat_id, @cat_name, @item_count ;

     -- close the cursor and exit the loop if it has.
     IF exit_loop THEN
         CLOSE categories_loop;
         LEAVE categories_loop;
     END IF;

       IF(@prev_cat_Name <> @cat_name)
        THEN 
        -- Category has changed, set the item_count of the 'best' category with the total items count
        IF(@maxItemId > 0)
        THEN
          UPDATE Categories  
            SET Categories.item_count=@totalItemsCount
           WHERE Categories.cat_id=@maxItemId;         
        END IF;

       -- Reset Values with the actual row values                        
          SET @maxItemPerCategory = @item_count;
          SET @prev_cat_Name = @cat_name;
          SET @maxItemId = @cat_id
          SET @totalItemsCount = @item_count;
       ELSE
      -- increment the total items count
          SET @totalItemsCount = @totalItemsCount + @item_count

       -- if the actual row has the maximun item counts, then it is the 'best'
           IF (@maxIntPerCategory < @item_count)
           THEN 
             SET @maxIntPerCategory = @item_count
             SET @maxItemId = @cat_id
           ELSE
         -- else, this row is not the best of its Category
              UPDATE Categories  
                 SET Categories.item_count=0
               WHERE Categories.cat_id=@cat_id; 
           END IF; 

       END IF;


   END LOOP categories_loop;
 END $$
 DELIMITER ;
alessalessio
  • 1,224
  • 1
  • 15
  • 28
1

It's not pretty and copied in part from Strawberry's SELECT

UPDATE categories cat, 
    junction jun,
    (select 
    (z.cat_id = x.cat_id) * new_count c,
     x.cat_id newcatid,
     z.cat_id oldcatid
    from categories x 
      LEFT 
      JOIN categories y 
        ON y.cat_name = x.cat_name 
       AND (y.item_count > x.item_count OR (y.item_count = x.item_count AND y.cat_id > x.cat_id))
      LEFT
      JOIN 
         ( SELECT a.cat_id, b.*
             FROM categories a
             JOIN 
                ( SELECT cat_name, SUM(item_count) new_count, MAX(item_count) max_count FROM categories GROUP BY cat_name) b
               ON b.cat_name = a.cat_name
         ) z
        ON z.cat_name = x.cat_name

     WHERE
     y.cat_id IS NULL) sourceX

     SET cat.item_count = sourceX.c, jun.cat_id = sourceX.newcatid
     WHERE cat.cat_id = jun.cat_id and cat.cat_id = sourceX.oldcatid
Aung Myo Linn
  • 2,820
  • 3
  • 27
  • 38
Mikey
  • 629
  • 1
  • 11
  • 19
0

I think it's better to do what you want one step at time:

First, get data you need:

SELECT Max(`cat_id`), sum(`item_count`) FROM `Categories` GROUP BY `cat_name`

With these data you'll be able to check if update was correctly done.

Then, with a loop on acquired data, update:

update Categories set item_count =
    (
    Select Tot FROM (
        Select sum(`item_count`) as Tot
        FROM `Categories`
        WHERE `cat_name` = '@cat_name') as tmp1
    )
WHERE cat_id = (
    Select MaxId
    FROM (
        select max(cat_id) as MaxId
        FROM Categories
        WHERE `cat_name` = '@cat_name') as tmp2)

Pay attention, if you run twice this code the result will be wrong.

Finally, set others Ids to 0

UPDATE Categories set item_count = 0
WHERE `cat_name` = '@cat_name'
AND cat_id <> (
    Select MaxId
    FROM (
        select max(cat_id) as MaxId
        FROM items
        WHERE `cat_name` = '@cat_name0') as tmp2)
genespos
  • 3,211
  • 6
  • 38
  • 70