Get visible content of a page using selenium and BeautifulSoup

Question

I want to retrieve all visible content of a web page, for example this webpage. I am using a headless Firefox browser remotely with Selenium.

The script I am using looks like this:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Remote('http://0.0.0.0:xxxx/wd/hub', desired_capabilities)
driver.get(url)
dom = BeautifulSoup(driver.page_source, parser)

f = dom.find('iframe', id='dsq-app1')
driver.switch_to.frame('dsq-app1')
s = driver.page_source
f.replace_with(BeautifulSoup(s, 'html.parser'))

# dom.encode() returns bytes, so open the file in binary mode
with open('out.html', 'wb') as fe:
    fe.write(dom.encode('utf-8'))

This is supposed to load the page, parse the DOM, and then replace the iframe with id dsq-app1 with its visible content. If I execute those commands one by one in my Python interpreter, it works as expected and I can see the paragraphs with all the visible content. When I instead execute all the commands at once, either by running the script or by pasting the whole snippet into my interpreter, it behaves differently: the paragraphs are missing. The content is still there in JSON format, but that is not what I want.

Any idea why this may be happening? Something to do with replace_with, maybe?

score 1 · Accepted Answer · edited May 23 '17 at 10:34


It sounds like the DOM elements are not yet loaded when your code tries to reach them.

Try waiting until the elements are fully loaded, and only then do the replacement.

It works when you run it command by command because the driver then has time to load all the elements before you execute the next command.
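
One way to express "wait, then replace" is a small polling helper. The sketch below is only an illustration, not the answerer's code: `wait_until` and `has_paragraphs` are hypothetical names, the `'p'` selector is an assumed marker for the content that loads late, and `driver` is assumed to be a Selenium WebDriver that has already been switched to the frame of interest.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.2):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.time() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.time() >= deadline:
            raise TimeoutError('condition not met within %s seconds' % timeout)
        time.sleep(poll_interval)


def has_paragraphs(driver, css_selector='p'):
    """Return the matching elements; an empty list means "not loaded yet"."""
    # 'css selector' is the string value of Selenium's By.CSS_SELECTOR.
    return driver.find_elements('css selector', css_selector)
```

It could be called as `wait_until(lambda: has_paragraphs(driver))` just before reading `driver.page_source`, so the source is only captured once the expected elements exist.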

edited May 23 '17 at 10:34 · Community

answered Oct 04 '16 at 10:12 · Or Duan

LetsPlayYahtzee · Answer 2 · 2016-10-04T15:12:15.997

To add to Or Duan's answer, here is what I ended up doing. Determining whether a page, or part of a page, has loaded completely is an intricate problem. I tried implicit and explicit waits, but I still ended up receiving half-loaded frames. My workaround is to check the readyState of the original document and the readyState of the iframes.

Here is a sample function:

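import time

def _check_if_load_complete(driver, timeout=10, poll_interval=0.1):
    # Poll the current document's readyState until it is 'complete' or
    # until `timeout` seconds have elapsed (poll_interval is in seconds).
    deadline = time.time() + timeout
    while time.time() < deadline:
        if driver.execute_script('return document.readyState') == 'complete':
            return True
        time.sleep(poll_interval)
    return False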

I then call that function right after switching the driver's focus to the iframe:

driver.switch_to.frame('dsq-app1')
_check_if_load_complete(driver, timeout=10)
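
The whole sequence (switch into the iframe, wait for its readyState, read its source, switch back) could be wrapped up roughly as follows. This is a sketch under assumptions rather than the code I actually ran: `frame_source_when_ready` is a hypothetical name, and it uses `driver.switch_to.frame`, the newer spelling of `switch_to_frame`.

```python
import time


def frame_source_when_ready(driver, frame_id, timeout=10.0, poll_interval=0.1):
    """Return an iframe's page source once its document reports 'complete'.

    If the timeout expires first, the source is returned as it is at that moment.
    """
    driver.switch_to.frame(frame_id)
    try:
        deadline = time.time() + timeout
        while time.time() < deadline:
            state = driver.execute_script('return document.readyState')
            if state == 'complete':
                break
            time.sleep(poll_interval)
        return driver.page_source
    finally:
        # Always hand control back to the top-level document.
        driver.switch_to.default_content()
```

The returned string can then be parsed with BeautifulSoup and passed to `replace_with`, as in the question. Note that a readyState of 'complete' does not by itself guarantee that content rendered later by scripts has arrived.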

score 0 · Answer 3 · answered Oct 04 '16 at 13:50

Try getting the page source only after detecting the required element by ID, CSS selector, class name, or link text.

You can use Selenium WebDriver's explicit wait for this.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Remote('http://0.0.0.0:xxxx/wd/hub', desired_capabilities)
driver.get(url)
# Wait up to 10 seconds for the element with the given id to be present.
# idName is a placeholder: provide the id you are waiting for.
f = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, idName)))
dom = BeautifulSoup(driver.page_source, parser)

f = dom.find('iframe', id='dsq-app1')
driver.switch_to.frame('dsq-app1')
s = driver.page_source
f.replace_with(BeautifulSoup(s, 'html.parser'))

# dom.encode() returns bytes, so open the file in binary mode
with open('out.html', 'wb') as fe:
    fe.write(dom.encode('utf-8'))

Let me know if this does not work.

That was my first attempt, but it did not work properly, because the element may appear before it is truly fully loaded. I guess this whole subject of waiting until a page is loaded is an intricate one, but I mostly solved it by checking the readyState of the iframes. - LetsPlayYahtzee, Oct 04 '16 at 13:53
