
So I have multiple lists that contain links to websites:

['www.google.com', 'www.yahoo.com', 'www.amazon.com']

And I want to obtain a list as follows:

['google', 'yahoo', 'amazon']

How can I use urllib to achieve this? So far I have the following:

from urllib.parse import urlparse
domain = urlparse('http://www.google.com').netloc
print(domain)

But I do not know how to do this for a list, and it gives www.google.com as a result instead of just google.
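For reference, a minimal sketch of mapping the same urlparse call over a list and then cutting out the middle label by hand (this assumes every entry has the www.name.tld shape):

from urllib.parse import urlparse

links = ['www.google.com', 'www.yahoo.com', 'www.amazon.com']

# urlparse only fills in netloc when a scheme is present, so prepend one
netlocs = [urlparse('http://' + link).netloc for link in links]

# take the label between 'www.' and the TLD (assumes a www.name.tld shape)
names = [netloc.split('.')[1] for netloc in netlocs]
print(names)  # ['google', 'yahoo', 'amazon']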

Tobias
    What's the purpose of the list? You're stripping away a lot of information by reducing it down to just, for example, `amazon`. Should `amazon.co.uk` and `amazon.net` be considered the same thing? Should `google.com` and `mail.google.com` be considered the same? – Kemp Aug 10 '21 at 10:55
  • I am going to combine the lists to create hyperlinks, so I only need the name of the website that you can click on to go to it (a sketch of this step is shown just after these comments). – Tobias Aug 10 '21 at 10:57
  • If you are not bound to urllib, then alternatively you can use a regex, or a simple string split will also do just fine (a regex would be overkill in this case). – Abhishek Aug 10 '21 at 11:04
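A minimal sketch of the hyperlink step described in the comment above, pairing the extracted names (assumed to come from one of the answers below) back with the original URLs; the HTML anchor format is just an illustration:

links = ['www.google.com', 'www.yahoo.com', 'www.amazon.com']
names = ['google', 'yahoo', 'amazon']

# pair each display name with its original URL to build a clickable link
hyperlinks = [f'<a href="http://{link}">{name}</a>' for link, name in zip(links, names)]
print(hyperlinks[0])  # <a href="http://www.google.com">google</a>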

3 Answers

1

If you just have plain links like the ones you provided, then this will do the job:

links = ['www.google.com', 'www.yahoo.com', 'www.amazon.com']
print([link.split(".")[1] for link in links])
# ['google', 'yahoo', 'amazon']

But if a link has multiple subdomains (or no www. prefix), it won't work as expected.
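For example (these hosts are just illustrations of where the fixed index goes wrong):

print('google.com'.split(".")[1])           # 'com'    (no www, so index 1 is the TLD)
print('afe.amdfad.azon.com'.split(".")[1])  # 'amdfad' (not the label you wanted)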

I have found a library that will do what you want, tldextract:

import tldextract

links = ['https://www.google.com/asdfl', 'translate.google.com', 'afe.amdfad.azon.com']

print([tldextract.extract(link).domain for link in links])
# ['google', 'google', 'azon']
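tldextract is a third-party package, so it has to be installed first (usually pip install tldextract). It matches hostnames against the public suffix list, which is why it also copes with suffixes such as .co.uk where a plain split on dots would not; in recent versions the result also exposes a registered_domain attribute if the full name plus suffix is wanted:

import tldextract

print(tldextract.extract('translate.google.co.uk').registered_domain)
# google.co.uk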
imxitiz
0

The code below should work.

import re
domains = ['www.google.com', 'www.yahoo.com', 'www.amazon.com']
result = [re.findall(r'www\.(.*)\.\w{2,5}', domain)[0] for domain in domains]

Result:

['google', 'yahoo', 'amazon']
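One caveat: the pattern only matches strings that start with a literal www. and have exactly one label in the middle. A small variant with the www. made optional (still assuming a single-label TLD at the end; the sample list is just an illustration):

import re

domains = ['www.google.com', 'yahoo.com', 'www.amazon.com']
# the leading 'www.' is now optional
result = [re.findall(r'^(?:www\.)?(.*)\.\w{2,5}$', domain)[0] for domain in domains]
print(result)  # ['google', 'yahoo', 'amazon']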
Edgg
0

I would suggest giving the entire netloc to the user; it makes things a lot easier to understand. But for your problem, if the data is consistent, you could also simply use split.

from urllib.parse import urlparse

domain = urlparse('http://www.google.com').netloc
print(domain.split('.')[1])

This method does have its downsides with subdomains though: translate.google.com -> google (which I think would be wrong)
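If picking the label just before the TLD is acceptable for this data, indexing from the end avoids the problem of a missing www. prefix, although it still collapses subdomains as described above and breaks on two-part suffixes such as .co.uk (an assumption about the input):

from urllib.parse import urlparse

for url in ['http://www.google.com', 'http://google.com', 'http://translate.google.com']:
    netloc = urlparse(url).netloc
    print(netloc.split('.')[-2])  # 'google' in all three cases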

veedata