I need to get the innerText of multiple pages that look like this: website.com/abc/xyz/page-{}.
So far I'm using Selenium and undetected_chromedriver to bypass Cloudflare, load the page, and get the content using JavaScript, then click the next button to move to the next page.
The problem is that this method is quite slow. Each page takes quite some time to load, and getting and writing the content takes a little more, so ~2000 pages took me an hour.
Also, for some reason, several consecutive output files sometimes end up containing the content of the same page, though this does not happen if I don't use my laptop while the script is running.
Is there a faster way to do this, e.g. with another library? Or can I handle multiple tabs simultaneously? I don't want to open ~10 different windows at once or use 10 different drivers, especially since I also need to load an adblocker to block ads embedded in the "next" button.
Currently, I use
    WebDriverWait(driver, 3).until(
        EC.presence_of_element_located((By.CLASS_NAME, "class_name"))
    )
to wait for the element to load, then execute return document.getElementsByClassName('className')[0].innerText to get the content, and write it to a file.
Then I find the next button by CLASS_NAME and click it.
I used to execute location.href = {link.format(i+1)} (where link is the variable storing the link, and i is the loop index) to move to the next page. I'm not sure whether that is faster.
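For reference, the URL-based loop could be sketched roughly as below. This is only a minimal sketch under assumptions: driver is an already-created undetected_chromedriver/Selenium driver, the helper names read_inner_text and scrape_by_url and the out_dir argument are made up for illustration, and the element is polled through execute_script rather than WebDriverWait so the sketch needs no Selenium imports. It uses driver.get instead of assigning location.href, and makes no claim about which is faster.

```python
import time
from pathlib import Path


def read_inner_text(driver, class_name, timeout=3.0, poll=0.1):
    """Poll until an element with the class exists, then return its innerText."""
    script = (
        "const el = document.getElementsByClassName(arguments[0])[0];"
        "return el ? el.innerText : null;"
    )
    deadline = time.monotonic() + timeout
    while True:
        text = driver.execute_script(script, class_name)
        if text is not None:
            return text
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no element with class {class_name!r}")
        time.sleep(poll)


def scrape_by_url(driver, link, first, last, class_name, out_dir):
    """Visit link.format(i) for each i and save the text to page_<i>.txt."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i in range(first, last + 1):
        # Navigate straight to page i instead of clicking the "next" button.
        driver.get(link.format(i))
        text = read_inner_text(driver, class_name)
        (out / f"page_{i}.txt").write_text(text, encoding="utf-8")
```

A call would look something like scrape_by_url(driver, "https://website.com/abc/xyz/page-{}", 1, 2000, "className", "pages"), where the https:// prefix and the "pages" directory are placeholders.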
Edit:
- By "duplicate" I mean, for example, page 1 content is abc, page 2 content is def, ... But for some reason, the code stays at page 1 and keeps writing its content to page_2.txt, ..., page_n.txt. It just doesn't move to the next page.
- Managed to fix the "duplicate" issue by reverting to the location.href method.
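If I ever go back to clicking the next button, a guard like the one below might catch the stuck-on-page-1 case before anything is written. It is only a sketch: wait_for_change is a hypothetical helper, not part of Selenium, and it assumes the URL (or the page text) actually differs between consecutive pages.

```python
import time


def wait_for_change(read_value, previous, timeout=5.0, poll=0.1):
    """Call read_value() until it returns something other than `previous`."""
    deadline = time.monotonic() + timeout
    while True:
        current = read_value()
        if current != previous:
            return current
        if time.monotonic() >= deadline:
            # The click did not navigate; fail loudly instead of
            # silently writing the old page into the next file.
            raise TimeoutError("page did not change after navigation")
        time.sleep(poll)
```

After clicking, something like wait_for_change(lambda: driver.current_url, old_url) would block until the browser has actually left the old page, where old_url is the value of driver.current_url saved before the click.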