0

I have a dataset with thousands of URLs stored in a column called Website (type VARCHAR) in a table called WebsiteData. There are many pairs of URLs (stored in separate rows) that are identical except that one begins with www, e.g. www.google.com and the other does not, e.g., google.com. How would I design a SQL query that identifies these pseudo-duplicates and deletes the version that does not start with www?

zgall1
  • 2,865
  • 5
  • 23
  • 39

2 Answers2

1

I derived 2 tables one with urls that have www. and one without. Join them together by adding www. to the urls without.

-- SELECT first to review the records.
select *
from
(select * from website where url not like 'www.%') wA
join 
(select * from website where url like 'www.%') wB
    on 'www.' + wa.url = wb.url


delete wA
from
(select * from website where url not like 'www.%') wA
join 
(select * from website where url like 'www.%') wB
    on 'www.' + wa.url = wb.url
SQLChao
  • 7,709
  • 1
  • 17
  • 32
0

using SUBSTRING to get the website part after first . and matching with any duplicate entry and removing it.

DELETE  tableW
FROM tableW W
JOIN 
(
select W1.website
FROM tableW W1
inner join tableW W2
on W1.website = SUBSTRING(W2.website,CHARINDEX('.',W2.website, 0)+1, LEN(w2.website))
) T
on T.website = W.website
radar
  • 13,270
  • 2
  • 25
  • 33