
I am trying to scrape some information from a website using BeautifulSoup and I am having huge trouble with it. I have been searching and trying to figure this out for hours now and I cannot. I am trying to scrape the company title from (https://www.duckduckgo.com/privacy), written in bold red text, along with the number of offers (the number at the bottom of the description). I am aware that the code currently only looks for the "h2" and not for the paragraph, and I am also aware that the exact match is a hyperlink "a", but I couldn't find a way to search for multiple classes at once in one tag. The hyperlink's original classes are "class="link ng-binding"" and I don't know how to reference more than one of them at the same time, so instead I am targeting the single "h2" title element, which contains the hyperlink inside of it. This is the code that I am having trouble with:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Scrape company names, offers

toScrape = "https://www.duckduckgo.com/privacy"

requestPage = urlopen(toScrape)
pageHTML = requestPage.read()
requestPage.close()

HTMLSoup = BeautifulSoup(pageHTML, 'html.parser')

scrapedItems = HTMLSoup.find_all('h2')

CSVExport = 'ConectHeader.csv'
save = open(CSVExport, 'w')

CSVHeaders = 'Price, stock\n'

for item in scrapedItems:
    company = item.find('h2', class_="title").text
    offers = item.find('p', class_="estates-cnt").text

    save.write(company + '' + stock)

I don't get any errors or even warnings in my IDE. The process finishes with exit code 0, but when I open the final .csv file it doesn't contain any information whatsoever. I cannot figure out why the output doesn't get saved into the csv file. I have also tried print, and it returned "[]", which probably means the problem is not directly caused by the step that saves the data into the csv file. Thanks to anyone for any help with this, I am tearing my hair out right now!
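As an aside on the multiple-class problem mentioned above: BeautifulSoup can require several classes on a single tag via a CSS selector. A minimal sketch (the HTML string is a made-up stand-in for the real page, not scraped content):

```python
from bs4 import BeautifulSoup

# stand-in HTML mimicking the structure described in the question
html = '<h2 class="title"><a class="link ng-binding" href="#">ACME s.r.o.</a></h2>'
soup = BeautifulSoup(html, "html.parser")

# "a.link.ng-binding" requires BOTH classes to be present on the <a> tag
link = soup.select_one("a.link.ng-binding")
print(link.text)  # ACME s.r.o.
```

Note that passing a list to `class_` (e.g. `class_=["link", "ng-binding"]`) matches tags with *any* of the listed classes, so the CSS-selector form is the one that enforces all of them at once.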

541daw35d

2 Answers


BeautifulSoup can't see dynamically rendered content, which is what this page uses. But there's an API you can query that returns all the data you need.

Here's how:

import time

import requests

# Query page 2 of the companies endpoint; tms appears to be a
# millisecond timestamp used as a cache buster
data = requests.get(f"https://www.sreality.cz/api/cs/v2/companies?page=2&tms={int(time.time() * 1000)}").json()

for company in data["_embedded"]["companies"]:
    print(f"{company['url']} - {company['locality']}")

This prints:

Sklady-cz-Praha-Stodulky - Praha, Stodůlky, Bucharova
PATOMA-Praha-Nove-Mesto - Praha, Nové Město, Washingtonova
Molik-reality-s-r-o-Most - Most, Moskevská
ERA-Reality-Praha-Holesovice - Praha, Holešovice, Jankovcova
LOKATIO-Praha-Zizkov - Praha, Žižkov, Kubelíkova
REAL-SPEKTRUM-Brno-Veveri - Brno, Veveří, Lidická
FARAON-reality-Praha-Vinohrady - Praha, Vinohrady, Polská
108-AGENCY-s-r-o-Praha-Zizkov - Praha, Žižkov, Příběnická
Realitni-spolecnost-Mgr-Jan-Vodenka-Praha-Nove-Mesto - Praha, Nové Město, Václavské náměstí
RapakCo-s-r-o-Praha-Zizkov - Praha, Žižkov, Žerotínova
Euro-Reality-Plzen-s-r-o-Plzen-Vychodni-Predmesti - Plzeň, Východní Předměstí, Šafaříkovy sady
Happy-House-Rentals-s-r-o-realitni-kancelar-Praha-Vinohrady - Praha, Vinohrady, Uruguayská
VIAGEM-servisni-s-r-o-Praha-Karlin - Praha, Karlín, Sokolovská
I-E-T-Reality-s-r-o-Brno-Brno-mesto - Brno, Brno-město, náměstí Svobody
RK-NIKA-realitni-kancelar-Semily - Semily, Sokolská
FF-Reality-2014-s-r-o-Praha-Karlin - Praha, Karlín, Pernerova
REALITY-PRORADOST-Breclav - Břeclav, Lidická
ORCA-ESTATE-a-s-Kyjov - Kyjov, Jungmannova
RAZKA-reality-Tachov - Tachov, náměstí Republiky
LUXENT-Exclusive-Properties-Praha-Josefov - Praha, Josefov, Pařížská

You can take that a bit further: first make a request that returns the total result count, derive the number of pages from it, and then loop over each page.

import math
import time

import requests

api_endpoint = "https://www.sreality.cz/api/cs/v2/companies?"
query = f"tms={int(time.time() * 1000)}"

initial_request = requests.get(f"{api_endpoint}{query}").json()

# result_size is the number of companies, not the number of pages,
# so derive the page count from how many entries one page holds
total_results = initial_request["result_size"]
per_page = len(initial_request["_embedded"]["companies"])
total_pages = math.ceil(total_results / per_page)

for page in range(1, total_pages + 1):
    current_url = f"{api_endpoint}page={page}&tms={int(time.time() * 1000)}"
    data = requests.get(current_url).json()
    for company in data["_embedded"]["companies"]:
        print(f"{company['url']} - {company['locality']}")
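Since the original goal was a CSV file, the inner loop can feed Python's `csv` module instead of `print`. A minimal sketch, where the sample rows are stand-ins shaped like the API's company entries (not live data):

```python
import csv

# stand-in rows shaped like the entries in data["_embedded"]["companies"]
companies = [
    {"url": "Sklady-cz-Praha-Stodulky", "locality": "Praha, Stodůlky, Bucharova"},
    {"url": "PATOMA-Praha-Nove-Mesto", "locality": "Praha, Nové Město, Washingtonova"},
]

# newline="" prevents blank lines on Windows; utf-8 keeps Czech characters intact
with open("companies.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "locality"])  # header row
    for company in companies:
        writer.writerow([company["url"], company["locality"]])
```

In the paginated loop above, you would open the file once before the loop and call `writerow` where the `print` is.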

baduker

Selenium is a lot better and easier to use than BeautifulSoup for pages like this.

There is also far more support available for Selenium.