I am trying to scrape many different pages of a single domain. The URLs have the following structure:
URL = 'https://somewebsite.eu/id/{}'.format(ID)
where the variable ID takes many values. The website is protected by Cloudflare, so I decided to use Selenium with undetected_chromedriver to bypass it. All the other methods, such as requests with sessions and cfscrape, do not work with this website.
Since I need to parse many pages with the same URL structure, I decided to loop over all values of the ID variable.
import numpy as np
from undetected_chromedriver import Chrome
from selenium.webdriver.chrome.options import Options
import time
def extracting_html_files_v11(ids):
    options = Options()
    options.add_argument("start-maximized")
    for x in ids:
        start_time = time.time()
        # a fresh browser is launched for every single ID
        browser = Chrome(options=options)  # the keyword is options, not option
        print('initialization of the browser')
        url = 'https://somewebsite.eu/id/{}/'.format(x)
        print(url)
        browser.get(url)
        print('the page was downloaded')
        # random pause so Cloudflare has time to clear the request
        time_to_wait = np.random.uniform(low=7, high=10)
        time.sleep(time_to_wait)
        file_name = 'data_8000_9000/case_{}.html'.format(x)
        with open(file_name, 'w', encoding="utf-8") as f:
            f.write(browser.page_source)
        print('the file was saved')
        browser.quit()
        print('the browser was quit')
        print("--- %s seconds ---" % (time.time() - start_time))
        for i in range(3):
            print('_____')
However, this process takes too long: after each launch of the browser I need to wait roughly 5 seconds for Cloudflare to let me download the page (that is why I have the time.sleep(time_to_wait) call).
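Most of the time here goes into launching and tearing down Chrome for every single ID. Cloudflare's clearance is normally stored in a cookie tied to the browser session, so with one reused browser only the first request should need the full wait. Below is a minimal sketch of that idea, reusing the names from the code above; the shorter pause for later pages is an assumption and may need tuning for this particular site:

import time
import numpy as np
from undetected_chromedriver import Chrome
from selenium.webdriver.chrome.options import Options

def extracting_html_files_single_browser(ids):
    options = Options()
    options.add_argument("start-maximized")
    browser = Chrome(options=options)  # launched once, reused for every ID
    first_page = True
    try:
        for x in ids:
            browser.get('https://somewebsite.eu/id/{}/'.format(x))
            if first_page:
                # full wait for the initial Cloudflare challenge
                time.sleep(np.random.uniform(low=7, high=10))
                first_page = False
            else:
                # assumption: later pages ride on the stored clearance
                # cookie, so a short polite pause should be enough
                time.sleep(np.random.uniform(low=1, high=2))
            with open('data_8000_9000/case_{}.html'.format(x), 'w', encoding='utf-8') as f:
                f.write(browser.page_source)
    finally:
        browser.quit()  # quit once, at the very end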
Can the code be optimized further? And should I think about parallel programming or something like that? (I am a complete beginner with parallel processing.)
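If one reused browser is still too slow, the job parallelizes naturally: split the IDs into chunks and hand each chunk to a separate process, each of which owns its own browser (WebDriver objects cannot be shared between processes). A rough sketch with the standard library's concurrent.futures; the ID range and worker count are illustrative assumptions, and the worker count should stay small because every Chrome instance is heavy:

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def scrape_chunk(ids_chunk):
    # runs in its own process with its own browser instance
    extracting_html_files_single_browser(list(ids_chunk))

if __name__ == '__main__':
    all_ids = list(range(8000, 9000))  # assumption: inferred from the folder name
    n_workers = 3                      # assumption: tune to available CPU/RAM
    chunks = np.array_split(all_ids, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(scrape_chunk, chunks))  # consume the iterator to surface errors

Since every ID is written to its own case_{x}.html file, the workers never collide on output.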