I'm using selenium and BeautifulSoup to scrape data from a website (http://www.grownjkids.gov/ParentsFamilies/ProviderSearch) that has a next button, which I'm clicking in a loop. I was struggling with StaleElementReferenceException previously, but got past it by looping to re-find the element on the page. However, I've run into a new problem: the script can now click all the way to the end, but when I check the csv file it writes, most of the data looks good, yet there are often duplicate rows in batches of 5 (which is the number of results each page shows).
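The retry I ended up with looks roughly like this (a simplified version of the block in the full script further down, so driver and the exception imports are the same as there):

attempts = 0
while attempts < 100:
    try:
        #re-find the button on every attempt so a stale reference gets replaced
        nextButton = driver.find_element_by_class_name("next")
        nextButton.send_keys('\n')
        break
    except NoSuchElementException:
        break
    except StaleElementReferenceException:
        attempts += 1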
Pictorial example of what I mean: https://www.dropbox.com/s/ecsew52a25ihym7/Screen%20Shot%202019-02-13%20at%2011.06.41%20AM.png?dl=0
I have a hunch this is because my program re-extracts the current page's data every time it attempts to find the next button. I'm confused about why this would happen, since from my understanding the actual scraping part only happens after the inner while loop (the one that retries finding the next button) finishes and control returns to the outer loop. (Let me know if I'm not understanding this correctly, as I'm comparatively new to this stuff.)
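To check my understanding of the control flow, here's a toy example of the same loop shape (nothing to do with selenium, just illustrating when the code after the inner loop runs):

for page in range(3):
    attempts = 0
    while attempts < 2:
        #inner retry loop, like the one that re-finds the next button
        attempts += 1
    #this runs once per outer iteration, only after the inner loop has finished
    print("scraping page %d" % page)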
Additionally, the data I get is different on every run of the program, which makes sense given the error: in the past, the StaleElementReferenceExceptions occurred at sporadic locations, so if results get duplicated every time the exception occurs, it makes sense for the duplications to show up sporadically as well. Even worse, a different batch of results ends up being skipped on each run - I cross-compared the results from two different runs, and some results were present in one but not the other.
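For what it's worth, this is roughly how I cross-compared two runs (the filenames are just placeholders for two saved copies of output_file.csv):

import csv

def read_rows(path):
    #reads a CSV produced by the scraper into a set of row tuples for comparison
    with open(path) as f:
        return set(tuple(row) for row in csv.reader(f))

run1 = read_rows("output_run1.csv")
run2 = read_rows("output_run2.csv")
print("in run 1 but not run 2: %d" % len(run1 - run2))
print("in run 2 but not run 1: %d" % len(run2 - run1))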
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from bs4 import BeautifulSoup
import csv
chrome_options = Options()
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument("--headless")
url = "http://www.grownjkids.gov/ParentsFamilies/ProviderSearch"
driver = webdriver.Chrome('###location###', options=chrome_options)
driver.implicitly_wait(10)
driver.get(url)
#clears text box
driver.find_element_by_class_name("form-control").clear()
#clicks on search button without putting in any parameters, getting all the results
search_button = driver.find_element_by_id("searchButton")
search_button.click()
df_list = []
headers = ["Rating", "Distance", "Program Type", "County", "License", "Program Name", "Address", "Phone", "Latitude", "Longitude"]
while True:
    #keeps on clicking next button to fetch each group of 5 results
    try:
        nextButton = driver.find_element_by_class_name("next")
        nextButton.send_keys('\n')
    except NoSuchElementException:
        break
    except StaleElementReferenceException:
        attempts = 0
        while attempts < 100:
            try:
                nextButton = driver.find_element_by_class_name("next")
                if nextButton:
                    nextButton.send_keys('\n')
                    break
            except NoSuchElementException:
                break
            except StaleElementReferenceException:
                attempts += 1

    #finds table of center data on the page
    table = driver.find_element_by_id("results")
    html_source = table.get_attribute('innerHTML')
    soup = BeautifulSoup(html_source, "lxml")

    #iterates through centers, extracting the data
    for center in soup.find_all("div", {"class": "col-sm-7 fields"}):
        mini_list = []
        #all fields except latlong
        for row in center.find_all("div", {"class": "field"}):
            material = row.find("div", {"class": "value"})
            if material is not None:
                mini_list.append(material.getText().encode("utf8").strip())
        #parses latlong from link
        for link in center.find_all('a', href=True):
            content = link['href']
            latlong = content[34:-1].split(',')
            mini_list.append(latlong[0])
            mini_list.append(latlong[1])
        df_list.append(mini_list)

#writes content into csv
with open('output_file.csv', "wb") as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(row for row in df_list if row)
Anything would help! If you have other recommendations about the way I've used selenium/BeautifulSoup/Python that would improve my programming in the future, I'd appreciate those too.
Thanks so much!