I'm a beginner and have a lot to learn, so please be patient with me.
Using Python and Selenium, I'm trying to scrape table data from a website while navigating through its pages. As I move from page to page, the table shows updated data, but the page itself doesn't reload and the URL stays the same.
To get the refreshed table data and avoid the stale element exception, I used WebDriverWait with expected_conditions on the tr elements. Even with the wait, my code didn't pick up the refreshed data: it kept reading the old rows from the previous page and threw the exception. Adding time.sleep() after clicking the next-page button solved that problem.
However, I noticed the code getting slower the more pages I navigated, and at around page 120 it threw the stale element exception again and could no longer pick up the refreshed data. My guess is that the for loop nested inside the while loop is dragging down performance.
I tried an implicit wait and gradually increased the time.sleep() duration to avoid the staleness exception, but nothing worked. Each page has 100 table rows, and there are about 3,100 pages in total.
My questions are:
- Why am I getting the stale element exception, and how can I avoid it?
- How can I make the code more efficient?
I searched a lot and genuinely tried to fix this on my own before deciding to post here. I'm stuck and don't know what to do. Please help, and thank you so much for your time.
while True:
    # wait until the table rows are visible after the page loads;
    # this is a must for Selenium to scrape data from the dynamic table
    # when navigating through different pages
    tr = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//*[@id='erdashboard']/tbody/tr")))
    for record in tr:
        count += 1
        posted_date = datetime.strptime(record.find_element(By.XPATH, './td[7]').text, "%m/%d/%Y").date()
        exclusion_request_dict["ID"].append(int(record.find_element(By.XPATH, './td[1]').text))
        exclusion_request_dict["Company"].append(record.find_element(By.XPATH, './td[2]').text)
        exclusion_request_dict["Product"].append(record.find_element(By.XPATH, './td[3]').text)
        exclusion_request_dict["HTSUSCode"].append(record.find_element(By.XPATH, './td[4]').text)
        exclusion_request_dict["Status"].append(record.find_element(By.XPATH, './td[5]').text)
        exclusion_request_dict["Posted Date"].append(posted_date)
    next_button = driver.find_element(By.ID, "erdashboard_next")
    next_button_classes = next_button.get_attribute("class").split(" ")  # reuse next_button instead of a second lookup
    print(next_button_classes)
    print("Current Page:", page, "Total Counts:", count)
    if next_button_classes[-1] == "disabled":
        break
    next_button.click()  # goes to the next page
    time.sleep(wait + 0.01)