From a given HTML file I need to extract specific URLs. For example, an <a> element with its href attribute looks like this:

<a href="https://hoster.com/some_description-specific_name-more_description.html">

I need to extract only the URLs that contain both "hoster.com" and "specific_name".

I have used BeautifulSoup on a Raspberry Pi, but I only manage the basic case, which extracts all URLs from an HTML file:

from bs4 import BeautifulSoup

with open("page.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
    for link in soup.find_all('a'):
        print(link.get('href'))
HedgeHog
Fisch133

1 Answer

You could select your elements more specifically with CSS selectors:

soup.select('a[href*="hoster.com"][href*="specific_name"]')
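
A minimal, self-contained sketch of the selector approach follows. The sample HTML and the variable name `matches` are made up for illustration, and it assumes a Beautiful Soup version with CSS selector support (4.7 or later, which uses the soupsieve package).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: only the first link contains both substrings.
html = '''
<a href="https://hoster.com/some_description-specific_name-more_description.html">match</a>
<a href="https://hoster.com/some_description-other_name.html">wrong name</a>
<a href="https://lobster.com/some_description-specific_name-more_description.html">wrong host</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# [href*="x"] matches when the href attribute contains the substring x.
# Chaining two attribute selectors means both conditions must hold.
selector = 'a[href*="hoster.com"][href*="specific_name"]'
matches = [a['href'] for a in soup.select(selector)]

print(matches)
```

Because the selector only matches elements that actually have an href attribute, there is no need to guard against `<a>` tags without one.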

But in case multiple patterns have to match, I would recommend:

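pattern = ['hoster.com', 'specific_name']

# href=True skips <a> tags without an href, which would otherwise raise a KeyError
for link in soup.find_all('a', href=True):
    if all(s in link['href'] for s in pattern):
        print(link['href'])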
Example
from bs4 import BeautifulSoup

html = '''
<a href="https://hoster.com/some_description-specific_name-more_description.html">
<a href="https://lobster.com/some_description-specific_name-more_description.html">
<a href="https://hipster.com/some_description-specific_name-more_description.html">
'''

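# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')

pattern = ['hoster.com', 'specific_name']

# href=True skips <a> tags without an href, which would otherwise raise a KeyError
for link in soup.find_all('a', href=True):
    if all(s in link['href'] for s in pattern):
        print(link['href'])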
Output
https://hoster.com/some_description-specific_name-more_description.html
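
One caveat with plain substring checks: "hoster.com" is also a substring of a host such as "nothoster.com". If you need the host to match exactly, a possible refinement is to parse each URL first. The sketch below is only an illustration under that assumption; the function name `extract_links`, its parameters, and the sample HTML are hypothetical and not part of Beautiful Soup.

```python
from urllib.parse import urlsplit

from bs4 import BeautifulSoup


def extract_links(markup, host, name_part):
    """Return hrefs whose host is `host` (or a subdomain of it)
    and whose path contains `name_part`."""
    soup = BeautifulSoup(markup, 'html.parser')
    result = []
    for link in soup.find_all('a', href=True):
        url = urlsplit(link['href'])
        hostname = url.hostname or ''  # relative links have no hostname
        host_ok = hostname == host or hostname.endswith('.' + host)
        if host_ok and name_part in url.path:
            result.append(link['href'])
    return result


# Hypothetical sample markup for illustration
html = '''
<a href="https://hoster.com/some_description-specific_name-more_description.html">a</a>
<a href="https://nothoster.com/some_description-specific_name-more_description.html">b</a>
<a href="https://www.hoster.com/x-specific_name.html">c</a>
<a>no href</a>
'''

links = extract_links(html, 'hoster.com', 'specific_name')
print(links)
```

Here the "nothoster.com" link is rejected even though it contains the substring "hoster.com", while the "www.hoster.com" subdomain is accepted. Whether subdomains should count is a design choice; drop the `endswith` check if they should not.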
HedgeHog