Scraping all URLs from search result page BeautifulSoup

Question

I'm trying to get 100 URLs from the following search result page:

https://www.willhaben.at/iad/kaufen-und-verkaufen/marktplatz/fahrraeder-radsport/fahrraeder-4552?rows=100&areaId=900

Here's the test code I have:

import requests
from bs4 import BeautifulSoup

urls = []

def get_urls(url):
   page = requests.get(url)
   soup = BeautifulSoup(page.content,'html.parser')
   s = soup.find('a', class_="header w-brk")
   urls.append(s)
   print(urls)

Unfortunately the list returns [None]. I've also tried using href=True in the soup.find or soup.find_all method but unfortunately that doesn't work either. I can see another problem with this:

The URL the page provides in the source is for example: a href="/iad/kaufen-und-verkaufen/d/fahrrad-429985104/" just the end of the willhaben.at URL. When I do get all of these URLs appended to my list, I won't be able to scrape them just like they are, I'll have to somehow add the root URL to it before my scraper can load it.

What is the most effective way I can solve this?

Thanks!

You're scraping for links (``) with a class `header w-brk`. The elements that have these classes assigned are `
`s though. `soup.find('div', class_="header w-brk")` holds your result. — Oryon, Dec 13 '20 at 14:06
I see! Thank you. The output of that is the text of the whole class. Is there a good way to only extract the `href= ` part? — scrape_noob, Dec 13 '20 at 14:10
if your variable is named `s`: it's just `s.a['href']` or `s.a.get('href', default)`. Maybe have a look at the [docs](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start) — Oryon, Dec 13 '20 at 14:12
That works! But only extracts one one URL. I've tried `soup.find_all` but I get an error that says `are you sure you're not meant to call find()` — scrape_noob, Dec 13 '20 at 14:15

score 4 · Answer 1 · answered Dec 13 '20 at 14:15

You can choose many ways to get anchor URLs.

soup.select elegant way:

urls.extend([a.attrs['href'] for a in soup.select('div.header.w-brk a')])

soup.select simpler way:

for a in soup.select('div.header.w-brk a'):
    urls.append(a.attrs['href'])

soup.find_all simpler way:

for div in soup.find_all('div', class_="header w-brk"):
    urls.append(div.find('a').attrs['href'])

soup.find_all elegant way:

urls.extend([div.find('a').attrs['href'] for div in soup.find_all('div', class_="header w-brk")])

score 2 · Answer 2 · answered Dec 13 '20 at 14:08

2

For the second part of your question you could use a simple list comprehension:

urls_with_base = [f"{base_url}/{url}" for url in urls]

answered Dec 13 '20 at 14:08

Oryon

131
1
3
11

score 2 · Accepted Answer · answered Dec 13 '20 at 14:29

Checkout:

import requests
from bs4 import BeautifulSoup

urls = []

url = "https://www.willhaben.at/iad/kaufen-und-verkaufen/marktplatz/fahrraeder-radsport/fahrraeder-4552?rows=100&areaId=900"

def get_urls(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content,'html.parser')
    s = soup.findAll("div", {"class": "w-brk"})
    for link in s:
        l = link.find("a")
        urls.append("https://www.willhaben.at"+l['href'])
    print(urls)

get_urls(url)

Manbir Judge · Answer 4 · 2022-01-10T03:07:17.620

1

This is the code you are looking for. I hope that you do not need any explanations for this code:

import requests
from bs4 import BeautifulSoup

urls = []

def get_urls(page_url):
    global urls

    page = requests.get(page_url)
    soup = BeautifulSoup(page.content, "html.parser")

    anchor_tags = soup.find_all("a", href=True)
    urls = [anchor_tag.get("href") for anchor_tag in anchor_tags]

edited Jan 10 '22 at 03:07

answered Dec 13 '20 at 14:19

Manbir Judge

113
1
13

Scraping all URLs from search result page BeautifulSoup

4 Answers4