
I am trying to scrape many pages of a single domain. The URLs have the following structure:

URL = 'https://somewebsite.eu/id/{}'.format(ID), where the variable ID takes many, many values. The website is protected by Cloudflare, so I decided to use Selenium with undetected-chromedriver to bypass it. All the other methods, such as requests with sessions and cfscrape, do not work with this website.

Since I need to parse many pages with the same URL structure, I loop over all values of the ID variable:

import time

import numpy as np
from undetected_chromedriver import Chrome
from selenium.webdriver.chrome.options import Options

def extracting_html_files_v11(ids):
    options = Options()
    options.add_argument("--start-maximized")
    for x in ids:
        start_time = time.time()
        browser = Chrome(options=options)
        print('initialization of the browser')
        url = 'https://somewebsite.eu/id/{}/'.format(x)
        print(url)
        browser.get(url)
        print('the page was downloaded')

        # give Cloudflare time to clear the challenge before saving
        time_to_wait = np.random.uniform(low=7, high=10)
        time.sleep(time_to_wait)

        file_name = 'data_8000_9000/case_{}.html'.format(x)
        with open(file_name, 'w', encoding="utf-8") as f:
            f.write(browser.page_source)
        print('the file was saved')
        browser.quit()
        print('the browser was quit')
        print("--- %s seconds ---" % (time.time() - start_time))
        for _ in range(3):
            print('_____')

However, this process takes too long. After each launch of the browser, I have to wait roughly 5 seconds for Cloudflare to let me download the page (which is why I have time.sleep(time_to_wait)). Can the code be optimized? And should I look into parallel programming or something like that? (I am a complete beginner with parallel processing.)

  • Would not recommend multi-threading or multi-processing; the website might think you are DDoS'ing them and trigger more protections. – SPYBUG96 Apr 24 '22 at 14:22

1 Answer


Why do this multiple times? browser = Chrome(options=options)

Just do it once, outside the loop, and pass browser in as an argument.
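
A minimal sketch of that change, reusing the asker's placeholder URL and file paths (the ID range is an assumption based on the folder name). Once the browser has passed the Cloudflare check and holds a clearance cookie, later pages may also need less of the per-page wait:

import time

import numpy as np
from undetected_chromedriver import Chrome
from selenium.webdriver.chrome.options import Options

def extracting_html_files_v12(browser, ids):
    # the driver is created by the caller and reused for every ID
    for x in ids:
        url = 'https://somewebsite.eu/id/{}/'.format(x)
        browser.get(url)
        # still give Cloudflare a moment on each page
        time.sleep(np.random.uniform(low=7, high=10))
        with open('data_8000_9000/case_{}.html'.format(x), 'w', encoding='utf-8') as f:
            f.write(browser.page_source)

options = Options()
options.add_argument('--start-maximized')
browser = Chrome(options=options)  # one launch for the whole run
try:
    extracting_html_files_v12(browser, range(8000, 9000))  # assumed ID range
finally:
    browser.quit()  # one shutdown at the end

This removes the browser start-up cost (and the first Cloudflare challenge) from every iteration, which is where most of the per-page time was going.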

Also, something you can investigate, although it may be too much work: open new tabs for, say, 10 pages without waiting for the results, then cycle back through each tab and do what you need to do. You should get overlapped downloading across the tabs that way.

Selenium 4 has new APIs for opening and switching tabs; you'd have to read up on those.
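
A rough sketch of that idea using Selenium 4's switch_to.new_window. The batch size of 10 and the ID range are assumptions, and since get() normally blocks until the page has loaded, overlapping the downloads also needs a non-default page_load_strategy:

import time

from undetected_chromedriver import Chrome
from selenium.webdriver.chrome.options import Options

def save_batch(browser, batch_ids):
    first_tab = browser.current_window_handle
    handles = {}
    for x in batch_ids:
        browser.switch_to.new_window('tab')  # Selenium 4: open a new tab and switch to it
        browser.get('https://somewebsite.eu/id/{}/'.format(x))  # returns early under 'eager'
        handles[x] = browser.current_window_handle
    time.sleep(10)  # one shared wait while all the tabs finish loading and clear Cloudflare
    for x, handle in handles.items():
        browser.switch_to.window(handle)
        with open('data_8000_9000/case_{}.html'.format(x), 'w', encoding='utf-8') as f:
            f.write(browser.page_source)
        browser.close()  # close the tab we just saved
    browser.switch_to.window(first_tab)  # go back to the original tab

options = Options()
options.page_load_strategy = 'eager'  # don't block on the full page load
browser = Chrome(options=options)
try:
    ids = list(range(8000, 9000))  # assumed ID range
    for i in range(0, len(ids), 10):
        save_batch(browser, ids[i:i + 10])
finally:
    browser.quit()

As the comment above warns, firing off many requests at once can itself look like abuse, so keep the batches small.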