-1

Hi I have a column of messy URL links within Google Sheets I'm trying to clean up, I want all formats of website links to be the same so that I can run a duplicate check on them.

For example, I have a list of URLs with various http, http://, https:// etc. I am trying to use the REGEXREPLACE tool to remove all http combination elements from the column entries, however cannot get it to work. This is what I have:

Before:

http://www.website1.com/

https://website2.com/

https://www.website3.com/

And I want - After:

website.com

website2.com

website3.com

It is ok if this takes place over a number of formulas and thus columns to the end result.

player0
  • 124,011
  • 12
  • 67
  • 124
Cal Nold
  • 13
  • 1
  • Be aware that http://example.net/ can differ from http://www.example.net/ and https://example.net - it's only by chance all leading to the same. – AmigoJack Aug 12 '20 at 02:20
  • 2
    @AmigoJack The "friendly" Stack Overflow comment formatting stripped off the www. and the http:// or https:// prefix so your comment looks like indeed they are identical. – tripleee Aug 12 '20 at 04:03

2 Answers2

1

try:

=ARRAYFORMULA(IFERROR(REGEXEXTRACT(INDEX(SPLIT(
 REGEXREPLACE(A1:A, "https?://www.|https?://|www.", ), "/"),,1), 
 "\.(.+\..+)"), INDEX(IFERROR(SPLIT(
 REGEXREPLACE(A1:A, "https?://www.|https?://|www.", ), "/")),,1)))

enter image description here


or shorter:

=INDEX(IFERROR(REGEXEXTRACT(A1:A, "^(?:https?:\/\/)?(?:www\.)?([^\/]+)")))

enter image description here

player0
  • 124,011
  • 12
  • 67
  • 124
0

You can try the following formula

=ArrayFormula(regexreplace(LEFT(P1:P3,LEN(P1:P3)-1),"(.*//www.)|(.*//)",""))

Please do adjust ranges as needed.

enter image description here

marikamitsos
  • 10,264
  • 20
  • 26