Problem:
I wrote a few Selenium + Scrapy web spiders just for a school assignment and wanted to crawl politely (DOWNLOAD_DELAY = 5, i.e. 5 seconds per page), but I don't even need that delay, because it already takes far too long to crawl a single page. Finding all the elements on one page can take as much as 30 seconds, because on every page I look for 13 elements that may or may not be present.
The slowdown seems to happen between the PyCharm terminal, from which I run the Python script, and the Selenium-driven Chrome browser, during the selection of data elements by XPath.
Behavior:
What my spider does (a sketch follows after this list):
- load 500 URLs from a .txt file into a dictionary
- process the URLs one by one
- on every URL, check 13 elements
- if an element exists, gather its data; if not, set a default value
- at the end, write the gathered data to a short .csv
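To make the flow concrete, here is a minimal sketch of the per-URL step; the class name, the 'price' field and its XPath are my own illustrative placeholders, and only the h1 XPath is from the real spider:

import scrapy
from selenium import webdriver

class ProductsSpider(scrapy.Spider):
    name = "products"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # One Selenium-controlled Chrome instance for the whole crawl
        self.driver = webdriver.Chrome()

    def parse(self, response):
        # Selenium opens the same URL that Scrapy just requested
        self.driver.get(response.url)
        item = {}
        # Hypothetical subset of the 13 fields checked on every page
        fields = {
            'name': '//*[@id="h1c"]/h1',
            'price': '//*[@id="price"]',
        }
        for field, xpath in fields.items():
            try:
                element = self.driver.find_element_by_xpath(xpath)
                item[field] = element.get_attribute('innerHTML')
            except Exception:
                item[field] = "empty"
        # The yielded dicts can be written to a .csv via Scrapy's feed export,
        # e.g. scrapy crawl products -o products.csv
        yield item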
The terminal sends a request (POST method) to the Selenium Chrome browser to find one specific element by XPath, and if that element is not present on the page, the browser always responds with a delay of about 5 seconds per XPath lookup.
The page itself loads quickly in the Selenium browser - within about one second.
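For reference, a standalone timing snippet like the one below (the URL and XPath are placeholders chosen only for illustration) shows what I mean: on my setup a failed lookup blocks for roughly 5 seconds before the exception arrives.

import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

start = time.time()
try:
    # An XPath that matches nothing on the page
    driver.find_element_by_xpath('//*[@id="does-not-exist"]')
except NoSuchElementException:
    pass
print("failed lookup took %.1f s" % (time.time() - start))

driver.quit()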
Documentation:
If an element is not found, an exception is raised, which I handle in the spider code like this (the IDE waits 5 seconds for the Selenium Chrome instance to throw the exception):
from selenium.common.exceptions import NoSuchElementException

# 1. name
try:
    # If the element is not found, find_element_by_xpath raises NoSuchElementException
    element = self.driver.find_element_by_xpath('//*[@id="h1c"]/h1')
    # Get the data; this line only runs when the element was found
    name = str(element.get_attribute('innerHTML'))
except NoSuchElementException:
    # Element missing - fall back to a default value
    name = "empty"
Loading URLs to crawl [*Updated]:
def start_requests(self):
    temp_dictionary = []
    # OPEN FILE AND LOAD URLS HERE
    with open("products_urls_en.txt") as file:
        for line in file:
            temp_dictionary.append({'url': line})
    # REMOVE DUPLICATES - if any https://stackoverflow.com/questions/8749158/removing-duplicates-from-dictionary
    products_url_links = []
    for value in temp_dictionary:
        if value not in products_url_links:
            products_url_links.append({'url': value.get('url')})
    print("NUM OF LINKS: " + str(len(products_url_links)))
    self.counter_all = int(len(products_url_links))
    for url in products_url_links:
        yield scrapy.Request(url=url.get('url'), callback=self.parse)
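As a side note, because the URLs are plain strings, the duplicate removal could also be done with dict.fromkeys(), which keeps insertion order; this is just an alternative sketch of the same start_requests(), not what the spider currently uses:

import scrapy

# Goes inside the spider class, replacing the version above (sketch)
def start_requests(self):
    with open("products_urls_en.txt") as file:
        urls = [line.strip() for line in file]
    # dict.fromkeys() drops duplicates while preserving insertion order
    unique_urls = list(dict.fromkeys(urls))
    print("NUM OF LINKS: " + str(len(unique_urls)))
    self.counter_all = len(unique_urls)
    for url in unique_urls:
        yield scrapy.Request(url=url, callback=self.parse)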
During crawling, the terminal output shows exactly the behavior described above: every lookup of a missing element adds roughly 5 seconds.
I would like to link to similar problems here, but I did not find any. People mostly discussed problems on the server side [1] [2], whereas I think the problem is on my side.
Settings and versions
- Python - 3.6, pip 18.0
- PyCharm - 2018.1.5
- Selenium - 3.14.0 (I believe the latest; installed through the PyCharm IDE)
- Scrapy - 1.5.1 (I believe the latest; installed through the PyCharm IDE)
- Windows - Win10 Pro 2018
- Spider settings - all default (I also tried polite settings - example values below - but that did not change the problem)
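For completeness, by "polite settings" I mean values along these lines in settings.py; these are example values, not a claim that they affect the delay:

# settings.py - example "polite" crawl settings
DOWNLOAD_DELAY = 5                  # wait 5 s between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one request at a time per domain
ROBOTSTXT_OBEY = True               # respect robots.txt
AUTOTHROTTLE_ENABLED = True         # adapt the delay to server response times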
Q:
Could someone explain to me why this takes so much time, and how to fix it - i.e. reduce that explicit time delay?