Could you help me find an error in my code? I have been stuck on it for a week, so I am asking the community. I am trying to download 14,000 HTML pages into a folder with Selenium. I have a long list of ids that I paste into a webpage address. Because the website I am downloading from is protected with a captcha, I am using a proxy: first I scrape free proxies from an online source and try to find a working one, and when a proxy fails I tell my driver to close. The problem I am facing is the following:

  1. Using a working driver (with working proxy credentials), I get the page for every id in my list. This works fine.
  2. I inspect the page for a table. If it is there, I download the page; if driver.get returns a captcha, I want to close the driver. BUT IT DOES NOT CLOSE. Selenium is perfectly fine downloading pages with no captcha, but when it gets a captcha it just doesn't do anything, as if PyCharm were stuck.

The code that scrapes proxies and finds a workable driver is okay; I think the error is in the last lines of my code. Please see the code below:

# Function to find an element. Returns 1 if it is found and 0 if not.

def find_element(driver, test_xpath = 'restab') -> int:
    if driver.find_elements(By.ID, test_xpath):
        var = 1
    else:
        var = 0
    return var

#function to download pages if the element is found and close driver if element is not located

def data_fill(id_list: str, driver) -> int:
    for id in id_list:
        author_page = "https://www.elibrary.ru/author_profile_new_titles.asp?id={}".format(id)
        driver.implicitly_wait(300)
        driver.get(author_page)
        result = find_element(driver)
        if result == 0:
            driver.close()
        else:
            n = os.path.join(f"/Users/dariagerashchenko/PycharmProjects/python_practice/hist/j_profile{id}.html")
            f = codecs.open(n, "w", "utf-8")
            h = driver.page_source
            f.write(h)
    return 1

# calling a function to get the code running
k = 0
while True:

    if k % 5 == 0:
        proxy_list = get_proxies()
    k += 1
    driver = get_best_driver(driver_path = driver_path, proxy_list = proxy_list) # find the working driver
    if driver is None:
        continue
    session_result = data_fill(id_list = id_list, driver=driver)
    if session_result == 1:  # data is collected
        print("Data collected.")

I tried several different ways of telling the driver to close, but all of them failed. I previously worked in R and only recently switched to Python, so maybe it is just my lack of knowledge.

Daria

3 Answers

Thanks to Nikhil Devadiga for his ideas; I eventually found an answer myself. Here it is:

k = 0
while True:
    if k % 5 == 0:
        proxy_list = get_proxies()
    k += 1
    driver = get_best_driver(driver_path = driver_path, proxy_list = proxy_list)
    for id in id_list:
        session_result = data_fill(id_list = id, driver=driver)
        if session_result == 0:
            driver.close()
            break
        continue
    print('done')

Before that, I also modified data_fill so that it handles a single id and returns 0 instead of closing the driver itself:

def data_fill(id_list: str, driver) -> int:
    author_page = "https://www.elibrary.ru/author_profile_new_titles.asp?id={}".format(id_list)
    driver.get(author_page)
    result = find_element(driver)
    if result == 0:
        output = 0
    else:
        n = os.path.join(f"/Users/dariagerashchenko/PycharmProjects/python_practice/hist/j_profile{id_list}.html")
        f = codecs.open(n, "w", "utf-8")
        h = driver.page_source
        f.write(h)
        output = 1
    return output
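
One thing to watch in the loop above: `for id in id_list` starts again from the first id every time a new driver is created, so pages that were already saved get requested again after each captcha. Below is a minimal sketch of one way to resume from the id that failed. `fill_until_blocked`, `fetch_one` and `fake_fetch` are hypothetical names used only for illustration; in the real script `fetch_one` would wrap `data_fill` with the current driver.

```python
from typing import Callable, Iterable, List


def fill_until_blocked(ids: Iterable[str],
                       fetch_one: Callable[[str], int]) -> List[str]:
    """Fetch ids in order and return the ids that still need fetching.

    fetch_one is expected to return 1 on success and 0 when the page
    could not be saved (for example because a captcha was shown).
    """
    pending = list(ids)
    for position, author_id in enumerate(pending):
        if fetch_one(author_id) == 0:
            return pending[position:]   # keep the failed id for the next driver
    return []                           # everything was saved


# Stand-in for data_fill: id "3" is blocked the first time it is requested.
blocked_once = {"3"}

def fake_fetch(author_id: str) -> int:
    if author_id in blocked_once:
        blocked_once.discard(author_id)
        return 0
    return 1


pending = ["1", "2", "3", "4"]
sessions = 0
while pending:
    sessions += 1                       # real script: get a fresh driver here
    pending = fill_until_blocked(pending, fake_fetch)

print(f"finished after {sessions} driver sessions")
```

In the real script the body of the `while pending` loop would first obtain a fresh driver from `get_best_driver` and close it after `fill_until_blocked` returns. The loop also ends by itself once every id has been saved, which `while True` does not.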
Daria

Closing a Selenium driver is a manual step: you have to call the method yourself. See this page if you need details: https://www.geeksforgeeks.org/close-driver-method-selenium-python/
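
Note also that `driver.close()` can only run once the element lookup before it has returned. The question sets `driver.implicitly_wait(300)`, and with an implicit wait `find_elements` keeps retrying until it finds something or the timeout expires. On a captcha page (no `restab` element) the script may therefore sit for up to 300 seconds before `close()` is reached, which can look like a freeze. The sketch below imitates that behaviour in plain Python so it runs without a browser. `find_with_implicit_wait` and `lookup` are made-up names, and this is not Selenium's actual implementation.

```python
import time
from typing import Callable, List


def find_with_implicit_wait(lookup: Callable[[], List[str]],
                            timeout: float,
                            poll_interval: float = 0.05) -> List[str]:
    """Retry lookup() until it returns a non-empty list or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        found = lookup()
        if found or time.monotonic() >= deadline:
            return found
        time.sleep(poll_interval)


# Element present: the lookup returns straight away.
start = time.monotonic()
present = find_with_implicit_wait(lambda: ["restab"], timeout=0.5)
fast = time.monotonic() - start

# Element absent (as on a captcha page): the whole timeout elapses first.
start = time.monotonic()
absent = find_with_implicit_wait(lambda: [], timeout=0.5)
slow = time.monotonic() - start

print(f"present: {fast:.2f}s, absent: {slow:.2f}s")
```

If this is what is happening, a much smaller implicit wait (a few seconds) should make the captcha case return quickly. Separately, `driver.quit()` ends the whole browser session, while `driver.close()` only closes the current window.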

Looks like the code enters an infinite loop. The while loop is to blame for this.

while True:
    ## ...
    driver = get_best_driver(driver_path = driver_path, proxy_list = proxy_list) # find the working driver
    if driver is None:
        continue
    ## ...

Notice that if driver is None on every iteration, the loop never ends. So set a limit on the maximum number of tries by using a for loop instead of a while loop.

for _ in range(100):
  ## ...
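
As a rough sketch of that idea, the retry can be wrapped in a helper that gives up after a fixed number of attempts and reports each failed attempt. `first_working_driver` and `make_driver` are hypothetical names; `make_driver` stands in for a call to `get_best_driver`, and the limit of 100 simply mirrors `range(100)`.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 100  # assumed cap, mirroring range(100)


def first_working_driver(make_driver: Callable[[], Optional[object]],
                         max_attempts: int = MAX_ATTEMPTS) -> Optional[object]:
    """Call make_driver up to max_attempts times; return the first non-None result."""
    for attempt in range(1, max_attempts + 1):
        driver = make_driver()
        if driver is not None:
            return driver
        print(f"attempt {attempt}: driver is None, retrying")
    return None  # every attempt failed, so the caller can stop instead of looping forever


# In the real script this might be called as (not executed here):
# driver = first_working_driver(
#     lambda: get_best_driver(driver_path=driver_path, proxy_list=get_proxies()))
```

Returning `None` after the last attempt lets the calling code decide what to do (stop, wait, or fetch a new proxy list) rather than spinning silently.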

Could you please post the code for the function get_best_driver for further clarity?

Nikhil Devadiga
  • Thank you for your answer! Here is my function for the best driver:

        # returns a webdriver object ready for scraping (or None if all proxies failed):
        def get_best_driver(driver_path: str, proxy_list: list):
            for proxy in proxy_list:
                session = launch_driver(driver_path = driver_path, proxy = proxy)
                driver, result = session[0], session[1]
                if result == 1:
                    return driver
                driver.close() #close driver if connection wasnt established
            return None

    – Daria Mar 13 '23 at 07:37
  • Looks like the function `get_best_driver` is actually returning `None`. Try adding the line `print("driver is None")` inside the if block to know for sure whether the loop keeps continuing for this reason. If you see this printed repeatedly, the most likely explanation is that `get_best_driver` runs through the whole `proxy_list` without finding any proxy that can connect. If you don't see it printed, something else has gone wrong and you will have to debug further. – Nikhil Devadiga Mar 16 '23 at 18:13