
Below is a list of web addresses. I would like to print only the hostname of each address.

http://www.askoxford.com
http://www.hydrogencarsnow.com
http://www.bnsf.com
http://web.archive.org

Expected result:

askoxford.com
hydrogencarsnow.com
bnsf.com
web.archive.org

My code:

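import re

# Raw string keeps the backslashes in the Windows path literal
with open(r"D:\Python\gg.txt", encoding="utf-8") as raw:
    string = raw.read()

# Captures only hosts that follow "www.", ending in .com or .org
link = re.findall(r'www\.(\w+\.com|\w+\.org)', string)
print(link)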

Current Output:

['askoxford.com', 'askoxford.com', 'hydrogencarsnow.com', 'bnsf.com']

The current output does not include the .org hostname (web.archive.org). I'm unsure how to express an OR condition in the regex for the prefix in front of the hostname.

My attempt: link = re.findall(r'(http://www\.|http://)(\w+\.com|\w+\.org)',string), but it does not work, because it also collects the http... prefix along with the hostname.
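To illustrate the behaviour described above: when a pattern has more than one capturing group, re.findall returns one tuple per match, which is why the prefix shows up next to the hostname. A minimal sketch of one possible fix, assuming the input is the four URLs from the question (inlined here as `text` instead of read from the file), is to make the prefix a non-capturing group `(?:...)`. The names `text`, `with_prefix`, and `hosts` are illustrative only.

```python
import re

# Sample input, inlined for the sketch (the question reads it from a file)
text = """http://www.askoxford.com
http://www.hydrogencarsnow.com
http://www.bnsf.com
http://web.archive.org
"""

# Two capturing groups: findall yields (prefix, host) tuples
with_prefix = re.findall(r'(http://www\.|http://)(\w+\.com|\w+\.org)', text)

# Non-capturing prefix and optional "www.": only the host is captured.
# (?:\w+\.)+ allows multi-label hosts such as web.archive.org
hosts = re.findall(r'https?://(?:www\.)?((?:\w+\.)+(?:com|org))', text)

print(hosts)
```

This sketch only covers .com and .org hosts, as in the question; other top-level domains would need the final alternation extended.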

Sociopath
user234568
  • You already have [`urllib.parse`](https://docs.python.org/3/library/urllib.parse.html) to do that. Also, there already exist [regexes](https://stackoverflow.com/questions/27745/getting-parts-of-a-url-regex) that extract various parts of URLs, so you should probably use those instead of rolling your own. – ForceBru Oct 01 '18 at 09:06
  • Try `(?m)^(?:https?://)?(?:www\.)?(.*\.(?:com|org))$`, see https://regex101.com/r/Roh87C/1 – Wiktor Stribiżew Oct 01 '18 at 09:08
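
The `urllib.parse` suggestion in the comments could look roughly like the sketch below. It assumes every line is a full URL with a scheme, and that a leading "www." should be dropped to match the expected result in the question; the helper name `bare_hostname` and the inlined `urls` list are hypothetical, not part of the original code.

```python
from urllib.parse import urlparse

def bare_hostname(url):
    """Return the URL's hostname without a leading 'www.' label."""
    # hostname is None when the URL has no network location
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

# Sample input from the question, inlined for the sketch
urls = [
    "http://www.askoxford.com",
    "http://www.hydrogencarsnow.com",
    "http://www.bnsf.com",
    "http://web.archive.org",
]

hosts = [bare_hostname(u) for u in urls]
for host in hosts:
    print(host)
```

Unlike the regex approach, this does not depend on the top-level domain, so it should also handle hosts that end in something other than .com or .org.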

0 Answers