
Problem:

My problem is that I wrote a few Selenium/Scrapy web spiders for a school assignment and I wanted to crawl politely (DOWNLOAD_DELAY = 5, i.e. 5 seconds per page), but I don't even need the delay, because crawling a single page already takes far too long. Finding all the elements on one page takes up to 30 seconds, even though on each page I only look for 13 elements that may or may not be present.

The delay happens between the PyCharm IDE terminal, from which I run the Python script, and the Selenium-driven Chrome browser, while selecting data elements by XPath.

Behavior:

What my spider does:

  1. load 500 URLs from a .txt file into a list
  2. process the URLs one by one
  3. on every URL, check 13 elements
  4. if an element exists, gather its data; if not, use a default value
  5. at the end, write the gathered data to a short .csv (see the sketch after this list)
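Step 5 is not shown in the code below; roughly it looks like this sketch (the file name, gathered_rows and the column names are placeholders, not my real names): the rows are collected while crawling and written out when the spider closes.

import csv

# Sketch of step 5: dump the rows collected during crawling to a short .csv
# when the spider finishes (Scrapy calls closed() automatically at the end).
def closed(self, reason):
    with open("products_en.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "name"])  # placeholder columns
        writer.writeheader()
        writer.writerows(self.gathered_rows)  # list of dicts filled in parse()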

The terminal sends a request (with the POST method) to the Selenium Chrome browser to find one specific element by XPath, and if that element is not present on the page, the Selenium Chrome browser always responds with a delay of 5 seconds per XPath lookup.

The page itself loads quickly in the Selenium browser - in about one second.
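To see the per-lookup delay in isolation, it is enough to time a single failing find_element call outside the spider (a small diagnostic sketch; the URL and the XPath are just examples):

import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

# time one lookup of an element that does not exist on the page
driver = webdriver.Chrome()
driver.get("https://www.alza.sk/sony-mdr-ex110lpb-d1481391.htm")

start = time.perf_counter()
try:
    driver.find_element_by_xpath('//*[@id="element-that-does-not-exist"]')
except NoSuchElementException:
    pass
print("lookup took %.1f s" % (time.perf_counter() - start))
# roughly 5 s when a 5-second implicit wait is set on the driver, well under 1 s without it

driver.quit()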

Documentation:

If an element is not found, an exception is raised, which I handle like this in the spider code (the IDE waits 5 seconds for Selenium Chrome to throw the exception):

from selenium.common.exceptions import NoSuchElementException

# 1. name
try:
    # find_element raises NoSuchElementException if the element is not on the page
    element = self.driver.find_element_by_xpath('//*[@id="h1c"]/h1')
    # get the data from the found element
    name = str(element.get_attribute('innerHTML'))
except NoSuchElementException:
    # element not present - fill the field with a default value
    name = "empty"

Loading URLs to crawl [Updated]:

def start_requests(self):

    # OPEN FILE AND LOAD URLS HERE (strip the trailing newline from every line)
    with open("products_urls_en.txt") as file:
        urls = [line.strip() for line in file if line.strip()]

    # REMOVE DUPLICATES - if any, keeping the original order
    # https://stackoverflow.com/questions/8749158/removing-duplicates-from-dictionary
    products_url_links = [{'url': url} for url in dict.fromkeys(urls)]

    print("NUM OF LINKS: " + str(len(products_url_links)))
    self.counter_all = len(products_url_links)

    for url in products_url_links:
        yield scrapy.Request(url=url.get('url'), callback=self.parse)
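The parse callback then just hands each URL to the Selenium driver and runs the 13 lookups described above (a trimmed sketch of its shape, using the hypothetical helper from earlier, not the full code):

def parse(self, response):
    # Selenium loads the page again and the 13 XPath lookups run against it
    self.driver.get(response.url)

    name = self.get_text_or_default('//*[@id="h1c"]/h1')  # 1. name
    # ... 12 more lookups like the one above ...

    yield {'url': response.url, 'name': name}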

During crawling, the terminal output described above looks as follows:

[screenshot of the terminal output]

I would also like to link to similar problems here, but I did not find any. People mostly discussed problems on the server side [1] [2], but I think the problem is on my side.

Settings and versions

  • Python - 3.6, pip 18.0
  • PyCharm - 2018.1.5
  • Selenium - 3.14.0 (I think the latest - it was installed through the PyCharm IDE)
  • Scrapy - 1.5.1 (I think the latest - it was installed through the PyCharm IDE)
  • Windows - Win10 Pro (2018)
  • Spider settings - all default (I tried polite settings - it did not change the problem)

Q:

Could someone explain to me why it takes so much time, and how to fix it - i.e. reduce that 5-second delay?

Marek Bernád
  • Could you post the actual code? Seems like every xpath you are downloading new page and `DOWNLOAD_DELAY` is kicking in. – Granitosaurus Oct 01 '18 at 01:05
  • Can you give us a URL which you want to crawl, and the expected item. – Yash Pokar Oct 01 '18 at 02:48
  • You should read this https://medium.com/@yashpokar/scrape-any-website-in-the-internet-without-using-splash-or-selenium-68a6c9733369, It might be helpful to you – Yash Pokar Oct 01 '18 at 02:49
  • @Granitosaurus I cant post here whole code due to my copyrights, and other school mates could copy it, but all special what I have I posted here ... just in code I made class of spider, set name, after that I loaded it with urls that I want to traverse - I can update it ... and after that I look for elements as I wrote up here... My code resemble to: https://doc.scrapy.org/en/latest/intro/tutorial.html – Marek Bernád Oct 01 '18 at 06:51
  • @YashPokar Example page I was trying it on is: https://www.alza.sk/sony-mdr-ex110lpb-d1481391.htm?catid=18843602 For example I want this element here: //*[@id="detailText"]/div[2] I read that article in medium ... it means due to JS ajax on page it is painful to use selenium xpath with scrapy...? But I have hundreds of URL pages, maybe for another tasks - thousands ... for every product one... for example now I have 36695 not identical URLs in .txt to crawl, one url - one product ... it means it is impossible to do it with scrapy on selenium? .. – Marek Bernád Oct 01 '18 at 06:57
  • Do you use `implicit_wait`? Also, I suggest that you write the simplest program to reproduce the problem and share it's code. This program should only open the browser, navigate directly to the URL, and search for the element with the mentioned Xpath. – Arnon Axelrod Oct 01 '18 at 07:08
  • @ArnonAxelrod I posted almost whole my code, and as you see, I have no explicit waits there, because if you read my post at the begin I wrote it .... that I even do not need it ... ou, yes.. that's all what my code do ... open browser, navigate to the URL for all URLs, search element and save ... as you can see in code – Marek Bernád Oct 01 '18 at 07:13
  • @Marek, You posted a whole lot of code, but I don't see where you initialize `driver`. Also, your code does many things that seem irrelevant to the problem, so I suggest to you to create this small program so it will be easier for you to investigate the problem. – Arnon Axelrod Oct 01 '18 at 07:20
  • @ArnonAxelrod .... I am tired ... you ... are absolutely right.. I am very sorry... that was the problem... after driver init I have explicit wait for 5 seconds set... sorry again and thanks ...I dont understand how I overlooked it.. would you post it as answer or I should delete this question? – Marek Bernád Oct 01 '18 at 07:25
  • Everything fine. I posted it as answer. – Arnon Axelrod Oct 01 '18 at 07:59

1 Answer


You're probably using an implicit_wait of 5 seconds. Because of that, when find_element doesn't find anything, it waits the full 5 seconds to give the element a chance to appear...
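A minimal sketch of the difference (assuming a plain Chrome WebDriver, not the asker's actual spider): remove the implicit wait (or set it to 0) so missing elements fail immediately, and use an explicit wait only where an element really needs time to appear.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.implicitly_wait(0)  # fail fast: a missing element no longer blocks for 5 seconds

# Only where an element genuinely needs time to appear, wait for it explicitly:
try:
    element = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.XPATH, '//*[@id="h1c"]/h1'))
    )
    name = element.get_attribute('innerHTML')
except TimeoutException:
    name = "empty"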

Arnon Axelrod