
I am currently scraping an online database with Selenium in Python. The database's layout requires navigating between pages to reach the data I am interested in, and every time I run the code I eventually hit a 502 Bad Gateway error (picture below).

[Image: Bad Gateway error screen]

The error sometimes goes away on its own, but whether it does seems to depend on where in the loop the 502 appears. Any advice on how to avoid it would be greatly appreciated. I have also attached the portion of my code that interacts with Chrome below for reference:

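# ! Final !
import math
import re
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

#### Define Driver & Starting URL ####
# Location of chromedriver (currently unused: webdriver.Chrome() below is called without it)
driver_path = "/Users/shrey/Desktop/Python Projects/Selenium/chromedriver"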

# Beginning url & initialize driver
url = "https://tamu.libguides.com/az.php"
driver = webdriver.Chrome()

# Make driver wait for elements to load when find_element() is run for the rest of our code
driver.implicitly_wait(10)

# Launch driver
driver.get(url)

# Press "Ancestry Database" link
driver.find_element(By.LINK_TEXT,
                    "Ancestry Library").click()

# Give time for user to login to database
time.sleep(30)

# Go to link where we can search from
home = "https://www.ancestrylibrary.com/search/collections/1742/"
driver.get(home)

# Switch to first tab (Search tab we just opened)
driver.switch_to.window(driver.window_handles[0])

#### Loop through each year present in the data ####
for yr in range(1886, 1952):
    # Go to search home
    driver.get(home)
    
    # Find textbox & Input Year --------
    year_input = driver.find_element(By.CSS_SELECTOR, "#sfs_SelfCivilYear")
    year_input.send_keys(str(yr))

    # Press "search" button
    driver.find_element(By.CSS_SELECTOR, "#searchButton").click()

    # Determine number of times we need to loop --------
    # Find text which includes total number of results (formatted as "Results 1–20 of 1,351")
    n_raw = driver.find_element(By.XPATH,
                                '//*[@id="results-header"]/h3').text

    # Isolate the important number (1,351)
    n_num = n_raw.split()[-1]  # pulls the last word from the string - our desired number

    # Remove comma and convert to number ("1,351" >>> 1351)
    n_total = int(re.sub(",", "", n_num))

    # Determine number of pages to scrape (20 results per page);
    # ceil avoids an extra iteration when n_total is an exact multiple of 20
    loop_count = math.ceil(n_total / 20)

    # Loop thru pages and collect links --------
    # Init empty list
    links = []
    
    # Loop n times (calc'd earlier)
    for i in range(loop_count):

        # Find & Store all "View Result" links
        current_pg_links = driver.find_elements(By.CSS_SELECTOR,
                                                ".srchFoundDB a")

        # Loop through all links pulled & append the url from each 'href' attribute
        for link in current_pg_links:
            links.append(link.get_attribute('href'))

        # Press "next page" button on every iteration except the last
        if i < loop_count - 1:
            driver.find_element(By.CSS_SELECTOR,
                                "a.ancBtn.sml.green.icon.iconArrowRight").click()

undetected Selenium
  • this seems like a server-side error (something wrong with one of their servers, not your machine or code). – ocean moist Jun 10 '23 at 23:24
  • @oceanmoat that's what my (very very uninformed) research led me to believe as well, but is there no workaround? Is this data just "unscrapeable" due to their server issues? – shrey_shankar Jun 10 '23 at 23:29
  • yeah, just try until the error disappears. 502 is not really something you would see if you were getting rate limited. – ocean moist Jun 10 '23 at 23:35
  • @oceanmoist is there any way to except for this error occurring in the code? Maybe if it does, wait x seconds and then reload the page? Or does the existence of this kind of error just mean this data is unscrapeable? – shrey_shankar Jun 11 '23 at 18:20
  • Your idea is correct. Wait some time and try again (reload). – ocean moist Jun 11 '23 at 19:49

1 Answer


502 Bad Gateway Cloudflare Error


A 502 Bad Gateway Cloudflare Error occurs when Cloudflare cannot establish a valid connection with your website’s origin web server. While this error relates to the server side (i.e. your web host), it can also happen if the Cloudflare service is down or not correctly configured.


Details

When you visit a website, the client sends a request to a web server. The web server receives and processes the request, then sends back the requested resources along with HTTP headers and an HTTP status code. Generally an HTTP status code isn't seen unless something goes wrong. But when a website uses Cloudflare, the request is sent to Cloudflare before it reaches the origin web server, so Cloudflare acts as a proxy between the client and the origin. If Cloudflare cannot get a valid response from the origin, the result is a 502. The status code is the server's way of notifying you that something has gone wrong, and the code itself indicates what kind of failure to diagnose.
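As a small illustration of the status code described above, Python's standard library already knows the 502 code and its standard reason phrase. This is only a lookup sketch; the variable names are arbitrary.

```python
from http import HTTPStatus

# Look up the standard definition of the 502 status code.
status = HTTPStatus.BAD_GATEWAY
print(status.value, status.phrase)  # 502 Bad Gateway

# Codes in the 500-599 range signal a server-side failure.
is_server_error = 500 <= status.value <= 599
```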

An example:

[Image: 502 Bad Gateway Error]

Depending on your web server and browser, you might see a differently worded 502 error, but they all mean the same thing:

  • 502 Bad Gateway
  • Error 502
  • 502 Proxy Server
  • HTTP 502
  • 502 Proxy Error
  • Temporary Error (502)
  • HTTP Error 502 – Bad Gateway
  • 502 Bad Gateway Nginx
  • 502 Server Error: The web server encountered a temporary error and could not complete your request
  • 502. That’s an error
  • 502 Service Temporarily Overloaded

Some websites also customize how a 502 Bad Gateway error looks. However, all variations have the same meaning: the server acting as a proxy has not received a valid response from the origin server.
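Because the Selenium WebDriver API does not hand back the HTTP status code of a page load, a scraper usually has to recognize the error page from its content. The following is a minimal sketch under that assumption. `looks_like_502` and `BAD_GATEWAY_MARKERS` are hypothetical names, and the marker strings are simply a subset of the variants listed above, so a site with a custom error page would need its own markers.

```python
# Marker strings taken from the list of 502 variants above (assumption:
# the site's error page uses one of these wordings). Extend as needed.
BAD_GATEWAY_MARKERS = (
    "502 bad gateway",
    "error 502",
    "http 502",
    "502 proxy error",
    "temporary error (502)",
    "502 server error",
)


def looks_like_502(title, body_text=""):
    """Return True if the page title or visible text contains a known 502 marker."""
    haystack = f"{title}\n{body_text}".lower()
    return any(marker in haystack for marker in BAD_GATEWAY_MARKERS)
```

With a live driver, the title could come from `driver.title` and the text from `driver.find_element(By.TAG_NAME, "body").text`. Checking only the title is the safer choice if ordinary result pages might contain one of these strings.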


Reason

The two possible causes for this 502 Bad Gateway Cloudflare Error are:

  • 502 status code from the origin web server
  • 502 error from Cloudflare

Solution

A 502 Bad Gateway Cloudflare error is usually a network or server-side problem, but at times it can also be a client-side issue. Some common steps to fix the error and get back up and running are as follows:

  • Clear the Browser Cache and reload the page.
  • Check for DNS Server issues.
  • Check with the Host machine.
  • Temporarily disable Cloudflare Proxy.
  • Temporarily disable CDN or Firewall.
  • Check for Plugin/Theme conflict.
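From the scraper's side, the wait-and-reload idea suggested in the comments can be sketched roughly as follows. `load_with_retry` and `is_error_page` are hypothetical names, not part of Selenium; the only WebDriver member the function itself relies on is `driver.get()`. The attempt count and delays are arbitrary assumptions to tune for the site.

```python
import time


def load_with_retry(driver, url, is_error_page, max_attempts=5,
                    base_delay=5.0, sleep=time.sleep):
    """Load url, pausing and reloading while the page still looks like an error.

    is_error_page is a caller-supplied function (hypothetical) that takes the
    driver and returns True when the loaded page is an error page.
    Returns the number of attempts it took to get a good page.
    """
    for attempt in range(1, max_attempts + 1):
        driver.get(url)
        if not is_error_page(driver):
            return attempt
        if attempt < max_attempts:
            # Wait a little longer after each failure: 5 s, 10 s, 15 s, ...
            sleep(base_delay * attempt)
    raise RuntimeError(
        f"{url} still showed an error page after {max_attempts} attempts"
    )
```

In the loop from the question, `driver.get(home)` could then become `load_with_retry(driver, home, lambda d: "502" in d.title)`, assuming the error page's title contains "502". A 502 that appears after clicking "next page" would need a similar check after the click, for example by reloading `driver.current_url`.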


undetected Selenium