I need to get the innerText of multiple pages whose URLs look like this: website.com/abc/xyz/page-{}.

So far I'm using Selenium with undetected_chromedriver to bypass Cloudflare: I load the page, get the content with JavaScript, then click the next button to move to the next page.

The problem is that this method is quite slow. Each page takes a while to load, and a little more time to read and write the content, so ~2000 pages took me about an hour.

Also, for some reason, the content written to several consecutive files is sometimes a duplicate of a single page. This does not happen if I don't use my laptop while the script is running.

Is there a faster way to do this, e.g. with another library? Or can I handle multiple tabs simultaneously? I don't want to open ~10 different windows at once or use 10 different drivers, especially since I also need to load an adblocker to block the ads embedded in the "next" button.

Currently, I use

WebDriverWait(driver, 3).until(
    EC.presence_of_element_located((By.CLASS_NAME, "class_name"))
)

to wait for the element to load, then execute return document.getElementsByClassName('className')[0].innerText to get the content, and write it to a file.

Then I find the next button by CLASS_NAME and click it.

I used to execute location.href = {link.format(i+1)} (where link is the variable storing the link, and i is the loop index) to move to the next page. I'm not sure whether it's faster.
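
For illustration, the direct-navigation approach could be sketched roughly as below. This is a minimal sketch under assumptions, not code tested against the real site: `BASE_URL` and `scrape_page` are hypothetical names, the URL is the placeholder from the question, and the function assumes only that `driver` exposes Selenium's `get` and `execute_script` methods. It polls for the element by hand instead of using `WebDriverWait`.

```python
import time

# Placeholder URL pattern taken from the question; replace with the real site.
BASE_URL = "https://website.com/abc/xyz/page-{}"

# JavaScript that returns the element's innerText, or null if it is not loaded yet.
GET_TEXT_JS = (
    "const el = document.getElementsByClassName(arguments[0])[0];"
    "return el ? el.innerText : null;"
)


def scrape_page(driver, page, class_name, timeout=3.0, poll=0.1):
    """Navigate directly to one page and return its innerText (hypothetical helper)."""
    driver.get(BASE_URL.format(page))
    deadline = time.monotonic() + timeout
    while True:
        # Extra arguments to execute_script are exposed to the script as `arguments`.
        text = driver.execute_script(GET_TEXT_JS, class_name)
        if text is not None:
            return text
        if time.monotonic() >= deadline:
            raise TimeoutError(f"element .{class_name} not found on page {page}")
        time.sleep(poll)
```

Because each call builds the URL from the page number, a missed click on the next button cannot leave the browser on the previous page.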

Edit:

  1. By "duplicate" I mean, for example: page 1's content is abc, page 2's content is def, and so on. But for some reason the code stays on page 1 and keeps writing its content to page_2.txt, ..., page_n.txt. It just doesn't move to the next page.
  2. I managed to fix the "duplicate" issue by reverting to the location.href method.
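
One way to catch the "duplicate" case described above before it reaches disk is to compare each page's content with the previous page's and retry on a match. The following is only a sketch: `fetch_distinct` is a hypothetical helper, and `fetch` stands for whatever function loads a page and returns its text. It would also reject two consecutive pages that genuinely have identical content.

```python
import hashlib


def fetch_distinct(fetch, page, previous_digest, retries=3):
    """Fetch a page, retrying while its content matches the previous page.

    `fetch` is any callable that takes a page number and returns its text.
    Returns (text, digest); raises RuntimeError if every attempt is a duplicate.
    """
    for _ in range(retries):
        text = fetch(page)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest != previous_digest:
            return text, digest
    raise RuntimeError(f"page {page} still matches the previous page")
```

The returned digest is passed as `previous_digest` for the next page; use `None` for the first page.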
PhanLong
    I don't think you can easily handle multiple tabs simultaneously. It would be easier and faster to open multiple independent browser windows and run the same code in each one. With 10 windows you can do 2000 pages in 6 minutes instead of 1 hour (200 pages in each window) – sound wave Jan 02 '23 at 10:31
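
The split suggested in the comment above could be sketched as follows. This is a rough sketch under assumptions: `split_pages` and `run_in_parallel` are hypothetical names, and `scrape_range` stands for a function that creates its own driver, scrapes the pages it is given, and quits the driver afterwards. The 6-minute estimate assumes the per-page time does not grow as windows are added.

```python
from concurrent.futures import ThreadPoolExecutor


def split_pages(first, last, workers):
    """Split the inclusive range first..last into at most `workers` contiguous ranges."""
    total = last - first + 1
    size, extra = divmod(total, workers)
    ranges = []
    start = first
    for i in range(workers):
        # The first `extra` chunks take one additional page each.
        count = size + (1 if i < extra else 0)
        if count == 0:
            continue
        ranges.append(range(start, start + count))
        start += count
    return ranges


def run_in_parallel(scrape_range, first, last, workers=10):
    """Run `scrape_range(pages)` once per chunk, each in its own thread.

    `scrape_range` is a hypothetical callable that should create its own
    driver, scrape the given pages, and quit the driver when done.
    """
    chunks = split_pages(first, last, workers)
    if not chunks:
        return []
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        return list(pool.map(scrape_range, chunks))
```

With `first=1`, `last=2000`, and `workers=10`, each worker receives 200 consecutive pages, matching the figures in the comment.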