I was trying to scrape some image data from some stores. For example, I was looking at some images from Nordstrom (tested with 'https://www.nordstrom.com/browse/men/clothing/sweaters').
I had initially used requests.get() to get the code, but I noticed that I was getting some javascript -- and upon further researc I found that this occured because it was dynamically loaded in the html using javascript.
To remedy this issue, following this post (Python requests.get(url) returning javascript code instead of the page html), I tried to use selenium to get the html code. However, I still ran into issues trying to access all the html: it was still returning alot of javascript. Finally, I added in some time delay as I thought maybe it needed some time to load in all of the html, but this still failed. Is there a way to get all the html using selenium? I have attached the current code below:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
def create_browser(webdriver_path):
#create a selenium object that mimics the browser
browser_options = Options()
#headless tag created an invisible browser
browser_options.add_argument("--headless")
browser_options.add_argument('--no-sandbox')
browser = webdriver.Chrome(webdriver_path, chrome_options=browser_options)
print("Done Creating Browser")
return browser
url = 'https://www.nordstrom.com/browse/men/clothing/sweaters'
browser = create_browser('path/to/chromedriver_win32/chromedriver.exe')
browser.implicitly_wait(10)
browser.get(url)
time.sleep(10)
html_source = browser.page_source
print(html_source)
Is there something that I am not doing properly to load in all of the html code?