Below is a list of web addresses. I would like to print only the hostname of each address, without the http:// scheme or the www. prefix.
http://www.askoxford.com
http://www.hydrogencarsnow.com
http://www.bnsf.com
http://web.archive.org
Expected result:
askoxford.com
hydrogencarsnow.com
bnsf.com
web.archive.org
My code:
import re
import codecs
raw = codecs.open(r"D:\Python\gg.txt", 'r', encoding='utf-8')  # raw string so the backslashes are not treated as escapes
string = raw.read()
link = re.findall(r'www\.(\w+\.com|\w+\.org)',string)
print(link)
Current Output:
['askoxford.com', 'askoxford.com', 'hydrogencarsnow.com', 'bnsf.com']
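The current output does not include web.archive.org, because that address has no www. prefix and my pattern requires one. I'm not sure how to write an OR condition at the front of the regex so that the prefix can be either http://www. or just http://.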
My Tries:
link = re.findall(r'(http://www\.|http://)(\w+\.com|\w+\.org)',string)
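This does not work either: because the pattern now has two capture groups, re.findall returns the http:// prefix together with the hostname instead of the hostname alone.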
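For reference, here is a minimal sketch of one way this might be done, not a definitive answer. It assumes the file contents look like the four addresses listed above (inlined here as `text` instead of being read from disk). The names `HOST_RE` and `hostname` are made up for illustration. The idea is to make the www. prefix optional with a non-capturing group `(?:...)`, so that re.findall only returns the single captured group. A second variant uses `urllib.parse.urlsplit` from the standard library rather than a regex.

```python
import re
from urllib.parse import urlsplit

# Sample input, assumed to match the contents of the file in the question.
text = """http://www.askoxford.com
http://www.hydrogencarsnow.com
http://www.bnsf.com
http://web.archive.org
"""

# (?:www\.)? is a non-capturing, optional "www." prefix.
# The only capturing group is the hostname, so findall returns plain strings.
# [\w.-]+ allows dots inside the name, which web.archive.org needs.
HOST_RE = re.compile(r'https?://(?:www\.)?([\w.-]+\.(?:com|org))')

hosts = HOST_RE.findall(text)
print(hosts)


def hostname(url):
    """Hypothetical helper: parse the URL, then drop a leading 'www.'."""
    host = urlsplit(url).hostname or ''
    return host[4:] if host.startswith('www.') else host


print([hostname(u) for u in text.split()])
```

Both print statements should give the four hostnames from the expected result. The urlsplit variant does not depend on the domain ending in .com or .org, which may matter if the real list has other endings.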