0

The problem

I have a sql database with a table for Hashtags, of which many are duplicates with regard to their names.

A statement like

SELECT     *  
FROM       HashTag  
ORDER BY   Name

returns something like

Id   |  Name
1947 |  test
1950 |  sample
1962 |  test
1963 |  sample
1986 |  test
2014 |  example

I want to keep only the hashtag with the lowest Id for each Name (1947 for 'test' and 1950 for 'sample') and update other tables with this Id, replacing the higher Ids (example: updating hashtag 'test'; lowest Id = 1947, higher Ids = 1962, 1986). These sql statements are updated manually as of now and would be as follows:

UPDATE             HashTaggedActivity
SET [HashTag_id]   = 1947
WHERE HashTag_id   in (1962, 1986)

Update             HashTaggedGroup
SET [HashTag_id]   = 1947
WHERE HashTag_id   in (1962, 1986)

DELETE             ht
FROM               HashTag ht
WHERE              ht.Id in (1962, 1986)

After this I have to do this for HashTag 'sample', which is an error prone and tedious process. The HashTag 'example' is not a duplicate and should not result in updating other tables.

Is there any way to write an sql statement for doing this for each occurence of duplicate names in the table HashTag?

What I've tried so far

I think I have to combine a statement for getting a duplicate count ordered by Id

select ht.Id, ht.Name, htc.dupeCount
from HashTag ht
inner join (
    SELECT ht.Name, COUNT(*) AS dupeCount
    FROM HashTag ht
    GROUP BY ht.Name
    HAVING COUNT(*) > 1
) htc on ht.Name = htc.Name
ORDER BY Id

which gives

Id   |  Name    | dupeCount
1947 |  test    | 3
1950 |  sample  | 2
1962 |  test    | 3
1963 |  sample  | 2
1986 |  test    | 3
2014 |  example | 1

with my UPDATE and DELETE statements according to the dupeCount, but I'm not sure how to do this ;-)

Thanks in advance and best regards,

Michael

  • Is there anything else unique to the lowest `id` hashtags that would make our search easier? Perhaps a field only they have filled or a creation date? I would also ask, what is the goal here. It appears that you are altering data in an activity log. This is usually not advisable. – Edward Jul 30 '18 at 16:35
  • @Edward The goal is to repair a database for a collaboration tool, which, due to a programming error, allowed for some time creating posts with already existent hashtags, leading to these duplicates. As a result, editing and saving of these posts result in an error, because saving a post with duplicate HashTags is not allowed. There is no other unique feature about these hashtags other than their ids. – Michael Boehnke Jul 30 '18 at 16:46

2 Answers2

0

The first two update statements first get the name based on the hashtag_id (innermost select), then get the minimum of all ids in hashtag that share the same name (next select) to then update the hashtag_id accordingly. In this case, it will also update the records with hashtag_id 1947 and 1950 - but the new value will be identical to the old value.

update HashTaggedGroup
 set hashtag_id = 
    (select min(id) 
     from hashtag h1 
     where (
        select name 
        from hashtag h2 
        where h2.id=HashTaggedGroup.hashtag_id)=h1.name);


update HashTaggedActivity
 set hashtag_id = 
    (select min(id) 
     from hashtag h1 
     where (
        select name 
        from hashtag h2 
        where h2.id=HashTaggedActivity.hashtag_id)=h1.name);

The delete as below will work for Mysql and SQLServer, it may need adjustment for other DBs (the idea remains the same though). If you are certain that all ids from hashtag are present in HashTaggedActivity, that would make it possible to have the query simpler.

delete h1 from hashtag as h1 
    inner join hashtag as h2 on 
              h1.name = h2.name and 
              h1.id > h2.id;

SQLFiddle for the above

Koen
  • 724
  • 4
  • 10
0

I would use window functions:

with ht as (
      select ht.*, min(id) over (partition by name) as minid
      from hashtag ht
     )
update hta
    set hashtag_id = ht.minid
    from HashTaggedActivity hta join
         ht
         on hta.hashtag_id = ht.id
    where ht.minid <> hta.hashtag_id;

And then do the delete in a similar way:

with ht as (
      select ht.*, min(id) over (partition by name) as minid
      from hashtag ht
     )
delete from ht
    where ht.minid <> id;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks for your idea. Once I got my head around this approach, I consider this a very elegant solution, combining WITH statement wit UPDATE/DELETE statements. – Michael Boehnke Jul 31 '18 at 05:03