
I need to extract some text from a webpage, but the page is built dynamically by a plugin, i.e. I need to include a JavaScript SDK

<div id="fb-root"></div>
<script async defer crossorigin="anonymous" src="https://connect.facebook.net/en_US/sdk.js#xfbml=1&version=v11.0" nonce="4HbUqy4w"></script>

and then place the code where I want the plugin to appear on my page

<div class="fb-comments" data-href="https://developers.facebook.com/docs/plugins/comments#configurator" data-width="1" data-numposts="1"></div>

so in total, I have something like

<html>
    <body>
        <div id="fb-root"></div>
        <script async defer crossorigin="anonymous" src="https://connect.facebook.net/en_US/sdk.js#xfbml=1&version=v11.0" nonce="4HbUqy4w"></script>
        <div class="fb-comments" data-href="https://developers.facebook.com/docs/plugins/comments#configurator" data-width="1" data-numposts="1"></div>
    </body>
</html>

Rendering this page in a browser should automatically load some data which I now want to scrape. Is there a way to render this HTML in Python? I've tried using

from requests_html import HTML

doc = "..."  # the content above
html = HTML(html=doc)
page = html.render(keep_page=True, sleep=120)

but the page is always None
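
For reference, here is a fuller sketch of that attempt. As far as I can tell from the requests_html docs, render() only returns the result of an optional script argument (so None without one), and the rendered markup is written back to html.html, with the pyppeteer page exposed as html.page when keep_page=True; reload=False is my guess at how to render the in-memory markup rather than re-fetching a URL.

from requests_html import HTML

doc = "..."  # the full HTML shown above

html = HTML(html=doc)

# render() returns the result of an optional `script` argument, so it returns None here;
# the rendered DOM is written back onto the object instead.
# reload=False is an assumption: render the in-memory markup instead of re-fetching html.url.
html.render(reload=False, keep_page=True, sleep=10)

print(html.html)                          # markup after the SDK script has run
iframe = html.find("iframe", first=True)  # the comments iframe, if it loaded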

Ideally, I would like something like

html_code = #here
loaded_html_code = a_package.render(html_code) # This should render my HTML which in turn causes an Iframe to be loaded.
E_K
  • Beautiful Soup can help. You tagged it, but you haven't tried it yet. Read [this](https://realpython.com/beautiful-soup-web-scraper-python/#dynamic-websites) – Raptor Aug 30 '21 at 02:02
  • Thanks @Raptor for the link, but I can't see a way to do it directly using Beautiful Soup. One of the suggestions given there is `requests_html`, which I'm using above. – E_K Aug 30 '21 at 02:15

1 Answer


You can use Beautiful Soup and Selenium WebDriver to achieve your goal. Here is some example code:

import time

from selenium import webdriver
from bs4 import BeautifulSoup

URL = "https://example.com/"

driver = webdriver.Firefox()
driver.get(URL)

time.sleep(15)  # in seconds. 15 seconds should be enough to load the contents from the API, JS, AJAX, etc.

html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, "html.parser")  # specify a parser explicitly

# find elements by ID
results = soup.find(id="target_id")
Raptor
  • This will work if I'm getting the HTML content from a website, but in my case I already have the HTML content; I just want it to be executed so that the plugin loads the new data, which I will then read. – E_K Aug 30 '21 at 03:10
  • Please find the revised code. Selenium WebDriver is added to serve your case. – Raptor Aug 30 '21 at 03:25