Scraping HTML code using Selenium with Python

Question

I was trying to scrape some image data from some stores. For example, I was looking at some images from Nordstrom (tested with 'https://www.nordstrom.com/browse/men/clothing/sweaters').

I had initially used requests.get() to get the code, but I noticed that I was getting some JavaScript, and upon further research I found that this occurred because the content was loaded dynamically into the HTML using JavaScript.

To remedy this issue, following this post (Python requests.get(url) returning javascript code instead of the page html), I tried to use Selenium to get the HTML code. However, I still ran into issues trying to access all the HTML: it was still returning a lot of JavaScript. Finally, I added a time delay, as I thought the page might need some time to load all of the HTML, but this still failed. Is there a way to get all the HTML using Selenium? I have attached the current code below:

from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
def create_browser(webdriver_path):
    #create a selenium object that mimics the browser
    browser_options = Options()
    #headless tag created an invisible browser
    browser_options.add_argument("--headless")
    browser_options.add_argument('--no-sandbox')
    browser = webdriver.Chrome(webdriver_path, chrome_options=browser_options)
    print("Done Creating Browser")
    return browser
url = 'https://www.nordstrom.com/browse/men/clothing/sweaters'
browser = create_browser('path/to/chromedriver_win32/chromedriver.exe')
browser.implicitly_wait(10)
browser.get(url)
time.sleep(10)
html_source = browser.page_source
print(html_source)

Is there something that I am not doing properly to load in all of the html code?

It seems to work if I open up the browser without the --headless tag, which suggests to me that my viewing of the page is triggering some js on the backend, or there is some issue in selenium that I made somehow. — Frederick, Nov 25 '20 at 08:39

score -1 · Answer 1 · answered Nov 25 '20 at 09:00

browser.page_source may return the initial HTML source rather than the current DOM state. Try reading the root element's outerHTML instead:

from selenium.webdriver.common.by import By  # locator constants for find_element

time.sleep(10)
html_source = browser.find_element(By.TAG_NAME, 'html').get_attribute('outerHTML')

JaSON


In general, is there a way to force load all the HTML? For instance, adding a wait time generates some of the HTML, but I have no guarantee that it generates everything I need. Additionally, do you know any reason why this would work fine when I don't add the headless argument, but fail when I add headless? – Frederick Nov 25 '20 at 17:24
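
One heuristic for the "have I got everything?" question in the comment above is to poll the page source until it stops changing. This is only a sketch and not a guarantee: content that loads on scroll or on a timer can still be missed. The names `wait_until_stable`, `stable_polls`, and `fake_page_source` are hypothetical, and the fake function stands in for `browser.page_source` so the example runs without a browser.

```python
import time


def wait_until_stable(get_value, stable_polls=3, poll_interval=0.5, max_wait=30.0):
    """Poll get_value() until it returns the same value stable_polls times in a row.

    Returns the last value seen, whether or not it stabilised before max_wait.
    """
    deadline = time.monotonic() + max_wait
    last = get_value()
    unchanged = 1
    while unchanged < stable_polls and time.monotonic() < deadline:
        time.sleep(poll_interval)
        current = get_value()
        # Count consecutive identical snapshots; reset when the value changes.
        unchanged = unchanged + 1 if current == last else 1
        last = current
    return last


# Stand-in for browser.page_source: the "page" grows for a few polls, then stops.
snapshots = iter(["<html>", "<html><a>", "<html><a><img>"])
state = {"html": ""}


def fake_page_source():
    state["html"] = next(snapshots, state["html"])
    return state["html"]


final_html = wait_until_stable(fake_page_source, poll_interval=0.01)
print(final_html)
```

With Selenium, the same helper could be called as `wait_until_stable(lambda: browser.page_source)`.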

score -1 · Answer 2 · answered Nov 25 '20 at 09:17

I would recommend reading "Test-Driven Development with Python", you'll get an answer for your question and so many more. You can read it for free here: https://www.obeythetestinggoat.com/ (and then you can also buy it ;-) )

Regarding your question, you have to wait until the element you're looking for has actually loaded. You may use time.sleep, but you'll get inconsistent behavior depending on the speed of your internet connection and browser.

A better solution is explained here in depth: https://www.obeythetestinggoat.com/book/chapter_organising_test_files.html#self.wait-for

You can use the proposed solution:

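import time
from selenium.common.exceptions import WebDriverException

MAX_WAIT = 10  # seconds to keep retrying; assumed value, tune it for your site

# Written as a method on a test class, hence the self parameter.
def wait_for(self, fn):
    start_time = time.time()
    while True:
        try:
            return fn()
        except (AssertionError, WebDriverException) as e:
            # give up once MAX_WAIT seconds have elapsed, otherwise retry
            if time.time() - start_time > MAX_WAIT:
                raise e
            time.sleep(0.5)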

fn is just a function finding the element in the page.
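
As a usage sketch, here is a standalone variant of the helper above that can be tried without a browser. The names `wait_for_result`, `max_wait`, `poll_interval`, `retry_on`, and `flaky_lookup` are hypothetical and not from the book. With Selenium you would pass a function that performs the element lookup and add WebDriverException to `retry_on`.

```python
import time


def wait_for_result(fn, max_wait=10.0, poll_interval=0.5, retry_on=(AssertionError,)):
    """Call fn() repeatedly until it stops raising, or max_wait seconds pass."""
    start_time = time.monotonic()
    while True:
        try:
            return fn()
        except retry_on:
            # Re-raise the last error once the time budget is used up.
            if time.monotonic() - start_time > max_wait:
                raise
            time.sleep(poll_interval)


# Stand-in for an element lookup that fails twice before the "page" is ready.
attempts = {"count": 0}


def flaky_lookup():
    attempts["count"] += 1
    assert attempts["count"] >= 3, "element not loaded yet"
    return "element found"


result = wait_for_result(flaky_lookup, max_wait=2.0, poll_interval=0.01)
print(result, "after", attempts["count"], "attempts")
```

Unlike a fixed time.sleep, the helper returns as soon as the lookup succeeds and only waits the full budget when the element never appears.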

Abhishek Rai · Answer 3 · 2020-11-26T04:43:10.123

Just add a user agent. Chrome's headless user agent says "headless"; that is the problem.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

def create_browser(webdriver_path):
    # create a selenium object that mimics the browser
    browser_options = Options()
    browser_options.add_argument('--headless')
    # override the headless user agent; inner quotes omitted so they cannot end up inside the value
    browser_options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 4.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36')
    browser = webdriver.Chrome(webdriver_path, options=browser_options)
    print("Done Creating Browser")
    return browser

url = 'https://www.nordstrom.com/browse/men/clothing/sweaters'
browser = create_browser('C:/bin/chromedriver.exe')
browser.implicitly_wait(10)
browser.get(url)
# collect every anchor element and print its visible text
links = browser.find_elements(By.TAG_NAME, 'a')
for link in links:
    print(link.text)

Output (displays the text of all links on the page):

Patagonia Better Sweater® Quarter Zip Pullover
   (56)

Nordstrom Men's Shop Regular Fit Cashmere Quarter Zip Pullover (Regular & Tall)
(73)

Nordstrom Cashmere Crewneck Sweater
(51)

Cutter & Buck Lakemont Half Zip Sweater
(22)

Nordstrom Washable Merino Quarter Zip Sweater
(2)

ALLSAINTS Mode Slim Fit Merino Wool Sweater

Process finished with exit code -1

Could you clarify what you mean by "Chrome's headless user agent says headless is the problem"? Additionally, trying to follow your solution gives me a connection refused error. — Frederick, Nov 25 '20 at 17:14
@Frederick The user agent string includes HeadlessChrome instead of Chrome when running in `--headless` mode, so the site might be blocking it. I have added the output of the program. It works fine. Please recheck your system and look for any copy/paste errors. — Abhishek Rai, Nov 26 '20 at 04:08
I'd ask anyone to copy and paste my code and let me know if it's not working. — Abhishek Rai, Nov 26 '20 at 05:37
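
To illustrate the HeadlessChrome point from the comments above, here is a small sketch that checks a user agent string for the headless token and swaps it for the regular one. The helper names `looks_headless` and `mask_headless` are hypothetical, and the sample string is illustrative only; the real value depends on the OS and Chrome version.

```python
def looks_headless(user_agent):
    """Return True if the user agent string advertises headless Chrome."""
    return "HeadlessChrome" in user_agent


def mask_headless(user_agent):
    """Replace the HeadlessChrome token with the regular Chrome token."""
    return user_agent.replace("HeadlessChrome", "Chrome")


# Illustrative sample only; the exact string depends on OS and Chrome version.
sample_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/87.0.4280.66 Safari/537.36"
)

masked_ua = mask_headless(sample_ua)
print(looks_headless(sample_ua))
print(looks_headless(masked_ua))
print(masked_ua)
```

A masked string like this could then be passed through the `--user-agent=` argument, as in the answer's code.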
