I have a dataset with thousands of URLs stored in a column called Website
(type VARCHAR
) in a table called WebsiteData
. There are many pairs of URLs (stored in separate rows) that are identical except that one begins with www
, e.g. www.google.com
and the other does not, e.g., google.com
. How would I design a SQL query that identifies these pseudo-duplicates and deletes the version that does not start with www
?
Asked
Active
Viewed 56 times
0

zgall1
- 2,865
- 5
- 23
- 39
-
3You should know that those aren't necessarily duplicate URLs – Lamak Sep 25 '14 at 15:40
-
Can you please elaborate? – zgall1 Sep 25 '14 at 15:51
-
1Well, just because *commonly* `www.someurl.com` directs to `someurl.com` it doesn't means that this is always the case at all – Lamak Sep 25 '14 at 15:53
-
Fair. For this dataset, I think it is a reasonable enough assumption that I won't be losing any critical information if I make it. – zgall1 Sep 25 '14 at 15:54
2 Answers
1
I derived 2 tables one with urls that have www.
and one without. Join them together by adding www. to the urls without.
-- SELECT first to review the records.
select *
from
(select * from website where url not like 'www.%') wA
join
(select * from website where url like 'www.%') wB
on 'www.' + wa.url = wb.url
delete wA
from
(select * from website where url not like 'www.%') wA
join
(select * from website where url like 'www.%') wB
on 'www.' + wa.url = wb.url

SQLChao
- 7,709
- 1
- 17
- 32
0
using SUBSTRING to get the website part after first . and matching with any duplicate entry and removing it.
DELETE tableW
FROM tableW W
JOIN
(
select W1.website
FROM tableW W1
inner join tableW W2
on W1.website = SUBSTRING(W2.website,CHARINDEX('.',W2.website, 0)+1, LEN(w2.website))
) T
on T.website = W.website

radar
- 13,270
- 2
- 25
- 33