Extract specific urls from HTML with BeautifulSoup

Question

From a given HTML document I need to extract specific URLs. For example, an <a> tag and its href attribute look like this:

<a href="https://hoster.com/some_description-specific_name-more_description.html">

I need to extract only the URLs that include both "hoster.com" and "specific_name".

I have used BeautifulSoup on a Raspberry Pi, but I can only do the basic thing, which extracts all URLs from an HTML file:

from bs4 import BeautifulSoup

with open("page.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
    for link in soup.find_all('a'):
        print(link.get('href'))

score 1 · Accepted Answer · answered Apr 04 '22 at 08:41

You could select your elements more specifically with CSS selectors:

soup.select('a[href*="hoster.com"][href*="specific_name"]')
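
As a minimal sketch of how that selector behaves, the snippet below runs it against a small hand-written document. The sample markup and the names `html` and `matches` are assumptions for illustration. `*=` is the CSS substring-match attribute operator, and chaining two attribute selectors means both must hold for the same href:

```python
from bs4 import BeautifulSoup

# Sample markup, assumed for illustration only.
html = '''
<a href="https://hoster.com/some_description-specific_name-more_description.html">match</a>
<a href="https://hoster.com/some_description-other_name.html">wrong name</a>
<a href="https://lobster.com/some_description-specific_name.html">wrong host</a>
<a>no href</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# [href*="..."] matches when the attribute value contains the substring;
# an <a> without an href never matches an attribute selector.
selector = 'a[href*="hoster.com"][href*="specific_name"]'
matches = [a['href'] for a in soup.select(selector)]

for url in matches:
    print(url)
```

Only the first link should be printed, since it is the only one whose href contains both substrings.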

But in case multiple patterns have to match, I would recommend:

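# pattern is a list of substrings that must all appear in the href
for link in soup.find_all('a', href=True):  # skip <a> tags without an href
    if all(s in link['href'] for s in pattern):
        print(link['href'])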

Example

from bs4 import BeautifulSoup

html = '''
<a href="https://hoster.com/some_description-specific_name-more_description.html">
<a href="https://lobster.com/some_description-specific_name-more_description.html">
<a href="https://hipster.com/some_description-specific_name-more_description.html">
'''

soup = BeautifulSoup(html, 'html.parser')

pattern = ['hoster.com','specific_name']

for link in soup.find_all('a', href=True):  # skip <a> tags without an href
    if all(s in link['href'] for s in pattern):
        print(link['href'])

Output

https://hoster.com/some_description-specific_name-more_description.html
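
One caveat with plain substring checks is that "hoster.com" would also match a host such as "nothoster.com", or a URL that merely mentions it in its query string. If that matters, a stricter variant could compare the parsed hostname instead. The following is only a sketch under that assumption: the helper name `is_wanted`, its default arguments, and the sample `hrefs` list are hypothetical and not part of the question.

```python
from urllib.parse import urlparse

def is_wanted(href, host="hoster.com", needle="specific_name"):
    """Return True if href points at `host` (or a subdomain of it)
    and its path contains `needle`. Hypothetical helper."""
    if not href:
        # link.get('href') returns None for <a> tags without an href
        return False
    parts = urlparse(href)
    hostname = parts.hostname or ""
    host_ok = hostname == host or hostname.endswith("." + host)
    return host_ok and needle in parts.path

# Sample values, assumed for illustration only.
hrefs = [
    "https://hoster.com/some_description-specific_name-more_description.html",
    "https://nothoster.com/some_description-specific_name.html",
    "https://example.org/?ref=hoster.com&q=specific_name",
    None,
]

wanted = [h for h in hrefs if is_wanted(h)]
print(wanted)
```

With BeautifulSoup this could be used as `[a['href'] for a in soup.find_all('a', href=True) if is_wanted(a['href'])]`.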
