
I'm using Selenium and BeautifulSoup to scrape data from a website (http://www.grownjkids.gov/ParentsFamilies/ProviderSearch) that has a next button, which I click in a loop. I was struggling with StaleElementReferenceException previously, but overcame this by looping to re-find the element on the page. However, I've run into a new problem: the script is now able to click all the way to the end, but when I check the CSV file it writes to, most of the data looks good except that there are often duplicate rows in batches of 5 (which is the number of results each page shows).

Pictorial example of what I mean: https://www.dropbox.com/s/ecsew52a25ihym7/Screen%20Shot%202019-02-13%20at%2011.06.41%20AM.png?dl=0

I have a hunch this is due to my program re-extracting the data currently on the page every time it attempts to find the next button. I was confused about why this would happen, since, as I understand it, the actual scraping only happens after you break out of the inner while loop (which attempts to find the next button) and back into the outer one. (Let me know if I'm not understanding this correctly, as I'm comparatively new to this stuff.)

Additionally, the output is different on every run of my program (which makes sense given the error, since in the past the StaleElementReferenceExceptions were occurring at sporadic locations; if results get duplicated every time this exception occurs, it would make sense for the duplications to appear sporadically as well). Even worse, a different batch of results ends up being skipped on each run - I cross-compared the results from two different outputs, and some results were present in one but not the other.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options 
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from bs4 import BeautifulSoup
import csv


chrome_options = Options()  
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument("--headless")  

url = "http://www.grownjkids.gov/ParentsFamilies/ProviderSearch"

driver = webdriver.Chrome('###location###')
driver.implicitly_wait(10)

driver.get(url)

#clears text box 
driver.find_element_by_class_name("form-control").clear()

#clicks on search button without putting in any parameters, getting all the results
search_button = driver.find_element_by_id("searchButton")
search_button.click()

df_list = []
headers = ["Rating", "Distance", "Program Type", "County", "License", "Program Name", "Address", "Phone", "Latitude", "Longitude"]

while True: 
    #keeps on clicking next button to fetch each group of 5 results 
    try:
        nextButton = driver.find_element_by_class_name("next")
        nextButton.send_keys('\n') 
    except NoSuchElementException: 
        break
    except StaleElementReferenceException:
        attempts = 0
        while (attempts < 100):
            try: 
                nextButton = driver.find_element_by_class_name("next")
                if nextButton:
                    nextButton.send_keys('\n') 
                    break
            except NoSuchElementException: 
                break
            except StaleElementReferenceException:
                attempts += 1

    #finds table of center data on the page
    table = driver.find_element_by_id("results")
    html_source = table.get_attribute('innerHTML')
    soup = BeautifulSoup(html_source, "lxml")

    #iterates through centers, extracting the data
    for center in soup.find_all("div", {"class": "col-sm-7 fields"}):
        mini_list = []
        #all fields except latlong
        for row in center.find_all("div", {"class": "field"}):
            material = row.find("div", {"class": "value"})
            if material is not None:
                mini_list.append(material.getText().encode("utf8").strip())
        #parses latlong from link
        for link in center.find_all('a', href = True):
            content = link['href']
            latlong = content[34:-1].split(',')
            mini_list.append(latlong[0])
            mini_list.append(latlong[1])

        df_list.append(mini_list)

#writes content into csv
with open ('output_file.csv', "wb") as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(row for row in df_list if row)

Anything would help! If you have other recommendations about the way I've used Selenium/BeautifulSoup/Python that would improve my programming in the future, I would appreciate them.

Thanks so much!

Ella Bei
2 Answers


I would use Selenium to grab the results count, then make an API call to get the actual results. If the result count is greater than the limit the API allows for the pageSize query-string argument, you can loop in batches and increment the currentPage argument until you have reached the total count (a sketch of that variant is shown further down); or, as I do below, simply request all results in one go. Then extract what you want from the JSON.

import requests
from selenium import webdriver

initUrl = 'http://www.grownjkids.gov/ParentsFamilies/ProviderSearch'
driver = webdriver.Chrome()
driver.get(initUrl)
numResults = driver.find_element_by_css_selector('#totalCount').text
driver.quit()
newURL = 'http://www.grownjkids.gov/Services/GetProviders?latitude=40.2171&longitude=-74.7429&distance=10&county=&toddlers=false&preschool=false&infants=false&rating=&programTypes=&pageSize=' + numResults + '&currentPage=0'
data = requests.get(newURL).json()

The response is a collection of dictionaries that you can iterate over. An example of writing out some values:

if len(data) > 0:
    for item in data:
        print(item['Name'], '\n', item['Address'])
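
If the total count were ever larger than what the pageSize argument accepts, the batched variant mentioned above could look roughly like this - a sketch only, where the page size of 50 is an arbitrary choice and the endpoint's actual limit would need checking:

import requests

base = ('http://www.grownjkids.gov/Services/GetProviders?latitude=40.2171&longitude=-74.7429'
        '&distance=10&county=&toddlers=false&preschool=false&infants=false&rating=&programTypes=')
page_size = 50                # assumed batch size, not verified against the API
total = int(numResults)       # count scraped with selenium above

all_results = []
page = 0
while page * page_size < total:
    url = base + '&pageSize={}&currentPage={}'.format(page_size, page)
    all_results.extend(requests.get(url).json())  # each page is assumed to return a JSON list
    page += 1

print(len(all_results))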

If you are worried about the lat and long values, you can grab them from one of the script tags when using Selenium.

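The general idea is to pull the inline scripts with Selenium (before calling driver.quit()) and search them for the coordinate values. A rough sketch - the regex and the assumption about how the coordinates appear inside the script are illustrative, not taken from the page:

import re

# search every inline <script> for latitude/longitude pairs (assumed format)
coords = []
for script in driver.find_elements_by_tag_name('script'):
    text = script.get_attribute('innerHTML') or ''
    coords += re.findall(r'latitude["\s:=]+(-?\d+\.\d+).{0,100}?longitude["\s:=]+(-?\d+\.\d+)',
                         text, re.I | re.S)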

You can find the alternate URL I use for the XHR (jQuery GET) request by opening dev tools (F12) on the page, refreshing with F5, and inspecting the requests made in the Network tab.


QHarr

You should re-read the HTML contents on every iteration of the while loop. Example below:

while counter < page_number_limit:
    counter = counter + 1
    new_data = driver.page_source
    page_contents = BeautifulSoup(new_data, 'lxml')
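
Re-parsing the fresh page source alone may not be enough if the click on the next button returns before the new results have rendered, so the parse should also wait until the content has actually changed. A minimal sketch of that idea (the helper name and the CSS selector are illustrative, not taken from the original code):

from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

def first_result_text(driver):
    # text of the first result field, used to detect when the table has refreshed
    return driver.find_element_by_css_selector("#results .field .value").text

previous = first_result_text(driver)
driver.find_element_by_class_name("next").send_keys('\n')

# wait up to 10 seconds for the first result to differ from the page just scraped,
# ignoring stale references while the table is being re-rendered
wait = WebDriverWait(driver, 10, ignored_exceptions=(StaleElementReferenceException,))
wait.until(lambda d: first_result_text(d) != previous)

# only now re-read and parse the fresh HTML
page_contents = BeautifulSoup(driver.page_source, 'lxml')

That way each iteration only parses a page once it differs from the previous one, which should remove the duplicated batches of 5.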