De-duplicating similar but not identical URLs with a SQL query

Question

I have a dataset with thousands of URLs stored in a column called Website (type VARCHAR) in a table called WebsiteData. There are many pairs of URLs (stored in separate rows) that are identical except that one begins with www, e.g. www.google.com and the other does not, e.g., google.com. How would I design a SQL query that identifies these pseudo-duplicates and deletes the version that does not start with www?

You should know that those aren't necessarily duplicate URLs — Lamak, Sep 25 '14 at 15:40
Well, just because *commonly* `www.someurl.com` directs to `someurl.com` it doesn't means that this is always the case at all — Lamak, Sep 25 '14 at 15:53
Fair. For this dataset, I think it is a reasonable enough assumption that I won't be losing any critical information if I make it. — zgall1, Sep 25 '14 at 15:54

score 1 · Accepted Answer · answered Sep 25 '14 at 16:26

I derived 2 tables one with urls that have www. and one without. Join them together by adding www. to the urls without.

-- SELECT first to review the records.
select *
from
(select * from website where url not like 'www.%') wA
join 
(select * from website where url like 'www.%') wB
    on 'www.' + wa.url = wb.url


delete wA
from
(select * from website where url not like 'www.%') wA
join 
(select * from website where url like 'www.%') wB
    on 'www.' + wa.url = wb.url

score 0 · Answer 2 · answered Sep 25 '14 at 16:25

using SUBSTRING to get the website part after first . and matching with any duplicate entry and removing it.

DELETE  tableW
FROM tableW W
JOIN 
(
select W1.website
FROM tableW W1
inner join tableW W2
on W1.website = SUBSTRING(W2.website,CHARINDEX('.',W2.website, 0)+1, LEN(w2.website))
) T
on T.website = W.website

De-duplicating similar but not identical URLs with a SQL query

2 Answers2