
I am new to using Selenium. I previously wrote a scraper using Beautiful Soup and it was working fine until I ran into an "accept cookies" overlay.

I attempted to use Selenium to click the "X" button and then pass the page_source to BeautifulSoup so I could reuse my previous script. But my soup still shows the page with the "accept cookies" overlay, so none of the classes can be found.

This is the website I want to scrape: https://sturents.com/s/newcastle/newcastle?ne=54.9972%2C-1.5544&sw=54.9546%2C-1.6508

Here is the script:

```python
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(PATH)
driver.get(link)

# close the "accept cookies" overlay
element = driver.find_element(By.CLASS_NAME, "new--icon-cross")
element.click()

time.sleep(10)
driver.refresh()

soup = BeautifulSoup(driver.page_source, 'html.parser')
rooms = soup.find_all('a', class_="new--listing-item js-listing-item")
```

`rooms` would return an empty string.

I also tried printing the soup, and it showed the page as if the button had never been clicked.

  • Does SO thread [Handling "Accept Cookies" popup with Selenium in Python](https://stackoverflow.com/questions/64032271/handling-accept-cookies-popup-with-selenium-in-python) help? – Heelara Nov 27 '22 at 11:11

1 Answer


> But my soup is still showing the page with the "accept cookies" overlay, so none of the classes can be found

But the listings should show up even without closing the cookies notice, and the part of your code that closes it looks fine anyway (maybe refreshing brings it back?).


You might have noticed that at first the page shows a loading screen.

So I think you need to give the listings some time to load after refreshing. (On that note, why are you refreshing at all? If you don't refresh, the 10-second wait should be enough time for them to load.)

```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## [ BEFORE BeautifulSoup ] ##

maxWait = 10  # adjust as preferred
wait = WebDriverWait(driver, maxWait)
wait.until(EC.visibility_of_all_elements_located((By.CLASS_NAME, 'js-listing-item')))

## [ NOW YOU CAN PARSE WITH BeautifulSoup ] ##
```

That should force the program to wait until the listings are loaded.


Btw, `.find_all('a', class_="new--listing-item js-listing-item")` might not be the most reliable way to get rooms, since the `new--listing-item` class only shows up on wide screens; it might be better to use `.find_all('a', {'class': 'js-listing-item'})` or `.select('a.js-listing-item')` instead.
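To illustrate the difference (a minimal sketch on a made-up HTML snippet, not the live page): `find_all` with a multi-class string only matches tags whose `class` attribute is exactly that string, while selecting by the single shared class catches both variants.

```python
from bs4 import BeautifulSoup

# hypothetical snippet mimicking the listings markup (not the real page)
html = """
<a class="new--listing-item js-listing-item" href="/p/1">Room 1</a>
<a class="js-listing-item" href="/p/2">Room 2</a>
"""
soup = BeautifulSoup(html, "html.parser")

# exact string match on the class attribute -> only the first link
both_classes = soup.find_all("a", class_="new--listing-item js-listing-item")
# match on the single shared class -> both links
shared_class = soup.select("a.js-listing-item")
print(len(both_classes), len(shared_class))  # 1 2
```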


> rooms would return empty string

Are you sure? `find_all` is supposed to return a ResultSet, which behaves like a list, so an empty result would be an empty ResultSet rather than an empty string.
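You can check that quickly on any HTML with no matches (a tiny sketch, independent of the site):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>no links here</p>", "html.parser")
rooms = soup.find_all("a", class_="js-listing-item")

# an empty result is an empty ResultSet (a list subclass), not a string
print(type(rooms).__name__, len(rooms), rooms == [])  # ResultSet 0 True
```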


Also (this isn't essential, but) it's better for the identifiers passed to find_element to be as unique as possible: there are at least 3 other elements with the `new--icon-cross` class (although the cookies-close button is the first, so that part of your code should work anyway), but only the cookies-close button has the `js-cookie-close` class.

Driftr95