
I am trying to scrape a page with BeautifulSoup, but I get an empty result set when I call findAll:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url='https://www.sainsburys.co.uk/webapp/wcs/stores/servlet/SearchDisplayView?catalogId=10123&langId=44&storeId=10151&krypto=70KutR16JmLgr7Ka%2F385RFXrzDpOkSqx%2FRC3DnlU09%2BYcw0pR5cfIfC0kOlQywiD%2BTEe7ppq8ENXglbpqA8sDUtif1h3ZjrEoQkV29%2B90iqljHi2gm2T%2BDZHH2%2FCNeKB%2BkVglbz%2BNx1bKsSfE5L6SVtckHxg%2FM%2F%2FVieWp8vgaJTan0k1WrPjCrVuDs5WnbRN#langId=44&storeId=10151&catalogId=10123&categoryId=&parent_category_rn=&top_category=&pageSize=60&orderBy=RELEVANCE&searchTerm=milk&beginIndex=0&hideFilters=true&categoryFacetId1='

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html,'html.parser')

containers = page_soup.findAll("div",{"class":"product"}) 
containers

I also got empty results when I followed the suggestions in these questions: "findAll returning empty for html" and "BeautifulSoup find_all() returns no data".

Can anyone offer any help?

frank
  • I think you just got unlucky. Look at the page source. You'll notice for "product" there is a rogue space after the name: `class="product "`, which means you are referencing a class that doesn't exist. If you do Ctrl+F for `class="product"`, you'll find 0 results, but for `class="product "`, you'll find 54. – Recessive Mar 27 '19 at 23:17
  • Please don't post pictures of code. Use the snippet tool via [edit] to include HTML, and for Python code, insert it, select the code and press Ctrl + K. – QHarr Mar 28 '19 at 03:02
  • Noted. Removed the pictures of code. – frank Mar 28 '19 at 09:33
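
On the trailing-space point raised in the comments: as far as I can tell, BeautifulSoup treats `class` as a whitespace-separated list of tokens, so `class="product "` is still matched by `{"class": "product"}`. A minimal check, using a made-up HTML snippet rather than the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the two kinds of product <div> on the page
html = """
<div class="product ">plain product</div>
<div class="product hl-product highlighted">highlighted product</div>
<div class="productInfo">not a product container</div>
"""

page_soup = BeautifulSoup(html, "html.parser")

# The class value is split on whitespace, so the trailing space is ignored
# and any element carrying the token "product" matches.
containers = page_soup.find_all("div", {"class": "product"})
matched_texts = [div.get_text() for div in containers]
print(len(containers))
```

If that holds, the trailing space alone would not explain an empty result.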

1 Answer


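The page content is loaded with JavaScript, so the HTML that urlopen returns does not yet contain the product elements, and BeautifulSoup alone has nothing to find. You need another module, such as Selenium, which drives a real browser so that the JavaScript actually runs.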
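
To see why the original approach comes back empty, here is a small sketch with a made-up page (not Sainsbury's actual markup): the server sends an empty container plus a script that would fill it in a browser. BeautifulSoup only parses the text it is given and never runs the script, so the product elements are simply not there.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML as a plain HTTP request would receive it: the product list
# is empty, and a script is responsible for filling it in later.
static_html = """
<html><body>
  <div id="productLister"></div>
  <script>
    document.getElementById("productLister").innerHTML =
      '<div class="product ">Milk</div>';
  </script>
</body></html>
"""

page_soup = BeautifulSoup(static_html, "html.parser")

# The markup inside <script> is treated as script text, not as elements
containers = page_soup.find_all("div", {"class": "product"})
print(containers)  # empty: the script was never executed
```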

Here is an example using Selenium:

from bs4 import BeautifulSoup as soup
from selenium import webdriver

url='https://www.sainsburys.co.uk/webapp/wcs/stores/servlet/SearchDisplayView?catalogId=10123&langId=44&storeId=10151&krypto=70KutR16JmLgr7Ka%2F385RFXrzDpOkSqx%2FRC3DnlU09%2BYcw0pR5cfIfC0kOlQywiD%2BTEe7ppq8ENXglbpqA8sDUtif1h3ZjrEoQkV29%2B90iqljHi2gm2T%2BDZHH2%2FCNeKB%2BkVglbz%2BNx1bKsSfE5L6SVtckHxg%2FM%2F%2FVieWp8vgaJTan0k1WrPjCrVuDs5WnbRN#langId=44&storeId=10151&catalogId=10123&categoryId=&parent_category_rn=&top_category=&pageSize=60&orderBy=RELEVANCE&searchTerm=milk&beginIndex=0&hideFilters=true&categoryFacetId1='

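# Launch a real browser so the page's JavaScript runs (needs geckodriver on PATH)
driver = webdriver.Firefox()
driver.get(url)

# page_source is the page as the browser currently has it, after scripts have run
page = driver.page_source
driver.quit()  # close the browser once the HTML has been captured
page_soup = soup(page,'html.parser')

containers = page_soup.findAll("div",{"class":"product"})
print(containers)
print(len(containers))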

OUTPUT:

[
<div class="product "> ...
...,
<div class="product hl-product hookLogic highlighted straplineRow" ...    
]

64
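
One caveat with reading `page_source` straight after `get()`: the scripts may not have finished adding the products yet. A hedged variation is to wait explicitly until at least one product element exists, and to run Firefox headless. The function names `fetch_rendered_html` and `count_products`, the 15 second timeout and the `div.product` selector are my own choices for illustration, and geckodriver still has to be installed.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, timeout=15):
    """Return the page HTML after JavaScript has added the product elements."""
    # Selenium is imported here so count_products() works without it installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # run Firefox without opening a window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # Block until a product <div> is present, or raise TimeoutException
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.product"))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors


def count_products(html):
    """Count <div> elements whose class list contains "product"."""
    page_soup = BeautifulSoup(html, "html.parser")
    return len(page_soup.find_all("div", {"class": "product"}))
```

With the `url` from the example above, `count_products(fetch_rendered_html(url))` should then report the number of product containers on the rendered page.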
Maaz
  • Unfortunately, I am having issues installing selenium: WebDriverException: Message: 'geckodriver' executable needs to be in PATH. Hoping once I solve that, I can accept your answer – frank Mar 28 '19 at 23:29
  • You can check here for this: https://stackoverflow.com/questions/40208051/selenium-using-python-geckodriver-executable-needs-to-be-in-path – Maaz Mar 29 '19 at 07:45
  • Selenium is a heavy dependency: hard to deploy, OS-dependent and complicated. I hope someone gives a better answer. – greendino Apr 22 '22 at 04:55
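
On the `'geckodriver' executable needs to be in PATH` error from the comments: a quick way to check what Python can see is `shutil.which`, which looks an executable up on `PATH` roughly the way the shell would. The helper name `driver_on_path` is just for illustration.

```python
import shutil


def driver_on_path(name="geckodriver"):
    """Return the full path of the named executable, or None if PATH lacks it."""
    return shutil.which(name)


location = driver_on_path()
if location is None:
    print("geckodriver was not found on PATH; download it and add its folder to PATH")
else:
    print("geckodriver found at", location)
```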