
So I have multiple lists that contain links to websites:

['www.google.com', 'www.yahoo.com', 'www.amazon.com']

And I want to obtain a list as follows:

['google', 'yahoo', 'amazon']

How can I use urllib to achieve this? So far I have the following:

from urllib.parse import urlparse
domain = urlparse('http://www.google.com').netloc
print(domain)

But I do not know how to do this for a list, and it gives www.google.com as a result instead of just google.
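For reference, a minimal sketch of mapping the same urlparse call over a list and then cutting out the middle label by hand (this assumes every entry has the www.name.tld shape):

from urllib.parse import urlparse

links = ['www.google.com', 'www.yahoo.com', 'www.amazon.com']

# urlparse only fills in netloc when a scheme is present, so prepend one
netlocs = [urlparse('http://' + link).netloc for link in links]

# take the label between 'www.' and the TLD (assumes a www.name.tld shape)
names = [netloc.split('.')[1] for netloc in netlocs]
print(names)  # ['google', 'yahoo', 'amazon']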

Tobias
    What's the purpose of the list? You're stripping away a lot of information by reducing it down to just, for example, `amazon`. Should `amazon.co.uk` and `amazon.net` be considered the same thing? Should `google.com` and `mail.google.com` be considered the same? – Kemp Aug 10 '21 at 10:55
  • I am going to combine the lists to create hyperlinks, so I only need the name of the website that you can click on to go to it (a sketch of this step is shown just after these comments). – Tobias Aug 10 '21 at 10:57
  • If you are not bound to urllib, then alternatively you can use a regex, or a simple string split will also do just fine (a regex would be overkill in this case). – Abhishek Aug 10 '21 at 11:04
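A minimal sketch of the hyperlink step described in the comment above, pairing the extracted names (assumed to come from one of the answers below) back with the original URLs; the HTML anchor format is just an illustration:

links = ['www.google.com', 'www.yahoo.com', 'www.amazon.com']
names = ['google', 'yahoo', 'amazon']

# pair each display name with its original URL to build a clickable link
hyperlinks = [f'<a href="http://{link}">{name}</a>' for link, name in zip(links, names)]
print(hyperlinks[0])  # <a href="http://www.google.com">google</a>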

3 Answers

1

If you just have plain links like the ones you provided, then this will do the job:

links = ['www.google.com', 'www.yahoo.com', 'www.amazon.com']
print([link.split(".")[1] for link in links])
# ['google', 'yahoo', 'amazon']

But if a link has multiple subdomains (or no www. prefix), it won't work as expected.
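For example (these hosts are just illustrations of where the fixed index goes wrong):

print('google.com'.split(".")[1])           # 'com'    (no www, so index 1 is the TLD)
print('afe.amdfad.azon.com'.split(".")[1])  # 'amdfad' (not the label you wanted)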

I have found a library that will do what you want, tldextract:

import tldextract

links = ['https://www.google.com/asdfl', 'translate.google.com', 'afe.amdfad.azon.com']

print([tldextract.extract(link).domain for link in links])
# ['google', 'google', 'azon']
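tldextract is a third-party package, so it has to be installed first (usually pip install tldextract). It matches hostnames against the public suffix list, which is why it also copes with suffixes such as .co.uk where a plain split on dots would not; in recent versions the result also exposes a registered_domain attribute if the full name plus suffix is wanted:

import tldextract

print(tldextract.extract('translate.google.co.uk').registered_domain)
# google.co.uk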
imxitiz
0

The code below should work.

import re
domains = ['www.google.com', 'www.yahoo.com', 'www.amazon.com']
result = [re.findall(r'www\.(.*)\.\w{2,5}', domain)[0] for domain in domains]

Result:

['google', 'yahoo', 'amazon']
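One caveat: the pattern only matches strings that start with a literal www. and have exactly one label in the middle. A small variant with the www. made optional (still assuming a single-label TLD at the end; the sample list is just an illustration):

import re

domains = ['www.google.com', 'yahoo.com', 'www.amazon.com']
# the leading 'www.' is now optional
result = [re.findall(r'^(?:www\.)?(.*)\.\w{2,5}$', domain)[0] for domain in domains]
print(result)  # ['google', 'yahoo', 'amazon']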
Edgg
0

I would suggest giving the entire netloc to the user; it makes things a lot easier to understand. But for your problem, if the data is consistent, you could also simply use split.

from urllib.parse import urlparse

domain = urlparse('http://www.google.com').netloc
print(domain.split('.')[1])

This method does have its downsides with subdomains though: translate.google.com -> google (which I think would be wrong)
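If picking the label just before the TLD is acceptable for this data, indexing from the end avoids the problem of a missing www. prefix, although it still collapses subdomains as described above and breaks on two-part suffixes such as .co.uk (an assumption about the input):

from urllib.parse import urlparse

for url in ['http://www.google.com', 'http://google.com', 'http://translate.google.com']:
    netloc = urlparse(url).netloc
    print(netloc.split('.')[-2])  # 'google' in all three cases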

veedata