
I need to get the publication date displayed on the following web page with BeautifulSoup in Python:

https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410

The point is that when I search the HTML shown by the browser's 'inspect' tool, I find the publication date quickly, but when I search the HTML retrieved with Python, I cannot find it at all, even with the functions find() and find_all().

I tried this code:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content)

soup.find_all('span', id_= 'biblio-publication-number-content')

but it gives me `[]`, even though the tag is there in the 'inspect' view of the online page.

What am I doing wrong, and why is the 'inspect' HTML different from the one I get with BeautifulSoup?

How can I solve this issue and get the number?

Umberto
    *Always and first of all, take a look at your soup to see if all the expected ingredients are there or additional info is present* What do you find in your `soup`? -> Espacenet may reject your requests if you are using any automated tools, perform too many queries per minute or generate queries that result in the system attempting to retrieve unusually large numbers of documents or unusually large documents. – HedgeHog Jan 07 '23 at 13:48
  • BeautifulSoup can't get dynamic content. – gre_gor Jan 07 '23 at 13:49
  • Does this answer your question? [How to scrape dynamic content from a website?](https://stackoverflow.com/questions/55709463/how-to-scrape-dynamic-content-from-a-website) – gre_gor Jan 07 '23 at 15:56

2 Answers


The problem, I believe, is that the content you are looking for is loaded by JavaScript after the initial page loads. requests only returns the initial page content, before the DOM has been modified by JavaScript.
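A quick way to confirm this, before reaching for Selenium, is to check whether the id string appears anywhere in the raw response at all (`r.text.find(...)` returns -1 when it does not). A minimal sketch, using a made-up "app shell" page standing in for the real server response:

```python
# Hypothetical app-shell HTML, standing in for what the server actually
# sends before any JavaScript runs (not the real Espacenet response).
initial_html = (
    '<html><body>'
    '<div id="app"></div>'
    '<script src="bundle.js"></script>'
    '</body></html>'
)

# str.find returns the index of the substring, or -1 if it is absent.
# Running the same check on r.text from the real request also gives -1,
# which tells you the element is injected later by JavaScript.
print(initial_html.find('biblio-publication-number-content'))  # -1
```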

To handle this you can install selenium and then download a Selenium web driver for your specific browser. Put the driver in some directory that is on your path, and then (here I am using Chrome):

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as bs

options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)

try:

    driver.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')

    # Wait (for up to 10 seconds) for the element we want to appear:
    driver.implicitly_wait(10)
    elem = driver.find_element(By.ID, 'biblio-publication-number-content')

    # Now we can use soup:
    soup = bs(driver.page_source, "html.parser")
    print(soup.find("span", {"id": "biblio-publication-number-content"}))
finally:
    driver.quit()

Prints:

<span id="biblio-publication-number-content"><span class="search">CN105030410</span>A·2015-11-11</span>
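If you only need the publication date itself, it can be pulled out of that span's text: the publication number and the date are separated by a '·' character. A small sketch reusing the markup printed above as a static snippet:

```python
from bs4 import BeautifulSoup as bs

# The span printed above, reused here as a static example.
html = ('<span id="biblio-publication-number-content">'
        '<span class="search">CN105030410</span>A\u00b72015-11-11</span>')

span = bs(html, 'html.parser').find('span', {'id': 'biblio-publication-number-content'})

# get_text() flattens the nested spans into 'CN105030410A·2015-11-11',
# so splitting on the '·' separator yields the number and the date.
number, pub_date = span.get_text().split('\u00b7')
print(number)    # CN105030410A
print(pub_date)  # 2015-11-11
```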
Booboo
  • This helps me a lot, now it is working. Thank you! I just have a doubt now: from another VPN, the method I used in my question gives me 'Blacklist': 'The IP address has been blacklisted.'. With your code it works perfectly, but at this point I don't know if it is 'legal' or not. It should be public data, but if you have an idea on how to find out I would really appreciate it. – Umberto Jan 08 '23 at 11:57
  • In the first reference I gave you it says, "**Is the data publicly available?** If the data isn’t hidden behind a login, then the website’s terms and conditions aren’t enforceable, so you can legally scrape the public data." – Booboo Jan 08 '23 at 12:07

Umberto, if you are looking for an HTML `span` element, use the following code:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')

results = soup.find_all('span')
print(results)

If you are looking for any element with the id 'biblio-publication-number-content', use the following code:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')

soup.find_all(id='biblio-publication-number-content')

In the first case you are fetching all `span` elements; in the second case you are fetching all elements with the id 'biblio-publication-number-content', regardless of their tag.

I suggest you look into HTML tags and elements for a deeper understanding of how they work and the semantics behind them.
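To make the difference between the two calls concrete, here is a small self-contained example (the HTML is made up for illustration, with the id placed on a `div` rather than a `span`):

```python
from bs4 import BeautifulSoup as bs

# Made-up snippet: the id sits on a <div>, not a <span>.
html = '''
<div id="biblio-publication-number-content"><span>CN105030410A</span></div>
<span class="other">unrelated</span>
'''
soup = bs(html, 'html.parser')

spans = soup.find_all('span')                                  # every <span>, whatever its id
by_id = soup.find_all(id='biblio-publication-number-content')  # any tag with that exact id

print(len(spans))               # 2
print([t.name for t in by_id])  # ['div']
```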

J.D.
    The point here is not his selection of the elements themselves; that is basically fine in my opinion. The crux is that the elements do not exist in the soup, so the proposed approach would not lead to the goal either. Or does it, if you execute the code you used? – HedgeHog Jan 07 '23 at 14:05
  • I did. The site does not contain an element with id='biblio-publication-number-content'; a quick `r.text.find('biblio-publication-number-content')` shows that. – J.D. Jan 07 '23 at 14:07
  • Yes, this is exactly the point: I didn't find the data I wanted in the HTML retrieved with my code, but it was there in the 'inspect' view of the page. – Umberto Jan 08 '23 at 11:59