
I'm having a bit of trouble trying to save the links from a website into a list without repeating URLs that share the same domain.

Example:
www.python.org/download and www.python.org/about

Only the first one (www.python.org/download) should be saved; the second should be skipped because its domain has already been seen.


This is what I've got so far:

from bs4 import BeautifulSoup
import requests
from urllib.parse import urlparse

url = "https://docs.python.org/3/library/urllib.request.html#module-urllib.request"
result = requests.get(url)                       # fetch the page
doc = BeautifulSoup(result.text, "html.parser")  # parse the HTML
atag = doc.find_all('a', href=True)              # every <a> element that has an href
links = []
# below should be some kind of for loop


Krosvick
  • Yes, you will need a for loop which parses the domain from the full URL then checks if you already have that domain. The last part can be made easier by using a `set` instead of a `list`. – Code-Apprentice Sep 26 '22 at 18:07
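A minimal sketch of the loop that comment describes, continuing from the question's `atag` and `links` variables (the helper name `seen_domains` is just illustrative, not from the original post): parse the domain with `urlparse` and track domains already seen in a set, keeping the first full URL per domain.

seen_domains = set()                   # domains already stored; set membership checks are cheap
for a in atag:                         # atag from the question: all <a> tags with an href
    href = a["href"]
    domain = urlparse(href).netloc     # e.g. "www.python.org"; "" for relative links
    if domain and domain not in seen_domains:
        seen_domains.add(domain)       # remember the domain
        links.append(href)             # keep the first full URL seen for that domain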

1 Answer


As a one-liner:

links = {nl for a in doc.find_all('a', href=True) if (nl := urlparse(a["href"]).netloc) != ""}

Explained:

links = set()  # define empty set
for a in doc.find_all('a', href=True):  # loop over every <a> element
    nl = urlparse(a["href"]).netloc  # get netloc from url
    if nl:
        links.add(nl)  # add to set if exists

Output:

{'www.w3.org', 'datatracker.ietf.org', 'www.python.org', 'requests.readthedocs.io', 'github.com', 'www.sphinx-doc.org'}
bitflip
  • This is a great solution, but I forgot to be clearer about the result I wanted. I've edited the example in the original post to be more concise, but basically I want to keep the entire URL, just one per domain. Sorry for the misunderstanding. – Krosvick Sep 26 '22 at 19:24
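To address that follow-up, here is a sketch in the same spirit as the answer above (not part of the original answer; the name `by_domain` is illustrative): a dict keyed on the netloc keeps one full URL per domain, with the first href seen for each domain winning.

by_domain = {}                          # netloc -> first full href seen for that domain
for a in doc.find_all('a', href=True):
    href = a["href"]
    nl = urlparse(href).netloc
    if nl and nl not in by_domain:      # skip relative links and already-seen domains
        by_domain[nl] = href

links = list(by_domain.values())        # one full URL per domain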