Cause of slow Scrapy scraper

Question

I have created a new Scrapy spider that is extremely slow. It only scrapes around two pages per second, whereas the other Scrapy crawlers that I have created have been crawling a lot faster.

I was wondering what is it that could cause this issue, and how to possibly fix that. The code is not very different from the other spiders and I am not sure if it is related to the issue, but I'll add it if you think it may be involved.

In fact, I have the impression that the requests are not asynchronous. I have never run into this kind of problem, and I am fairly new to Scrapy.

EDIT

Here's the spider :

class DatamineSpider(scrapy.Spider):
    name = "Datamine"
    allowed_domains = ["domain.com"]
    start_urls = (
        'http://www.example.com/en/search/results/smth/smth/r101/m2108m',
    )

    def parse(self, response):
        for href in response.css('.searchListing_details .search_listing_title .searchListing_title a::attr("href")'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_stuff)
        next_page = response.css('.pagination .next a::attr("href")')
        next_url = response.urljoin(next_page.extract()[0])
        yield scrapy.Request(next_url, callback=self.parse)

    def parse_stuff(self, response):
        item = Item()
        item['value'] = float(response.xpath('//*[text()="Price" and not(@class)]/../../div[2]/span/text()').extract()[0].split(' ')[1].replace(',',''))
        item['size'] =  float(response.xpath('//*[text()="Area" and not(@class)]/../../div[2]/text()').extract()[0].split(' ')[0].replace(',', '.'))
        try:
            item['yep'] = float(response.xpath('//*[text()="yep" and not(@class)]/../../div[2]/text()').extract()[0])
        except IndexError:
            print "NO YEP"
        else:
            yield item

There are a lot of things that might cause this. Could you provide spider source and crawl logs? You can do `scrapy crawl spider 2>&1 spider.log` and then post that log here if you are running unix system. — Granitosaurus, Jul 23 '16 at 10:28
I added the spider, and I'll add the log as soon as possible, (I have it running currently). When I look at the logs, the speed is between 45 and 80 pages/min. ;( — AimiHat, Jul 23 '16 at 10:37
You can try to discover were the code is slow profiling the code, with something like https://github.com/rkern/line_profiler — Ceppo93, Jul 23 '16 at 10:59
@Granitosaurus you command does not work :( It say `running 'scrapy crawl' with more than one spider is no longer supported ` — AimiHat, Jul 23 '16 at 11:37
Two pages per second is pretty fast, many websites don't allow this request rate. Are you scraping a single website? — Mikhail Korobov, Jul 24 '16 at 01:26

neverlastn · Accepted Answer · 2016-07-24T16:00:37.790

There are only two potential reasons, given that your spiders indicate that you're quite careful/experienced.

Your target site's response time is very low
Every page has only 1-2 listing pages (the ones that you parse with parse_stuff()).

Highly likely the latter is the reason. It's reasonable to have a response time of half a second. This means that by following the pagination (next) link, you will be effectively be crawling 2 index pages per second. Since you're browsing - I guess - as single domain, your maximum concurrency will be ~ min(CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN). This typically is 8 for the default settings. But you won't be able to utilise this concurrency because you don't create listing URLs fast enough. If your .searchListing_details .search_listing_title .searchListing_title a::attr("href") expression creates just a single URL, the rate with which you create listing URLs is just 2/second whereas to fully utilise your downloader with a concurrency level of 8 you should be creating at least 7 URLs/index page.

The only good solution is to "shard" the index and start crawling e.g. multiple categories in parallel by setting many non-overlaping start_urls. E.g. you might want to crawl TVs, Washing machines, Stereos or whatever other categories in parallel. If you have 4 such categories and Scrapy "clicks" their 'next' button for each one of them 2 times a second, you will be creating 8 listing pages/second and roughly speaking, you would utilise much better your downloader.

P.S. next_page.extract()[0] == next_page.extract_first()

Update after discussing this offline: Yes... I don't see anything extra-weird on this website apart from that it's slow (either due to throttling or due to their server capacity). Some specific tricks to go faster. Hit the indices 4x as fast by settings 4 start_urls instead of 1.

start_urls = (
    'http://www.domain.com/en/search/results/smth/sale/r176/m3685m',
    'http://www.domain.com/en/search/results/smth/smth/r176/m3685m/offset_200',
    'http://www.domain.com/en/search/results/smth/smth/r176/m3685m/offset_400',
    'http://www.domain.com/en/search/results/smth/smth/r176/m3685m/offset_600'
)

Then use higher concurrency to allow for more URLs to be loaded in parallel. Essentially "deactivate" CONCURRENT_REQUESTS_PER_DOMAIN by setting it to a large value e.g. 1000 and then tune your concurrency by setting CONCURRENT_REQUESTS to 30. By default your concurrency is limited by CONCURRENT_REQUESTS_PER_DOMAIN to 8 which in, for example, your case where the response time for listing pages is >1.2 sec, means a max of 6 listing pages per second crawling speed. So call your spider like this:

scrapy crawl MySpider -s CONCURRENT_REQUESTS_PER_DOMAIN=1000 -s CONCURRENT_REQUESTS=30

and it should do better.

One more thing. I observe from your target site, that you can get all the information you need including Price, Area and yep from the index pages themselves without having to "hit" any listing pages. This would instantly 10x your crawling speed since you don't need to download all these listing pages in with the for href... loop. Just parse the listings from the index page.

Thank you for you long and detailed answer. The problem however is that each page has around 10 listing pages, and the spider still manages to achieve 1page per min rates.. Is it possible that the site itself limits my requests? I can't find a reasonable explanation to this — AimiHat, Jul 24 '16 at 08:06
`CONCURRENT_REQUESTS_PER_DOMAIN` was surprise for me. Considering that it is not included id default scrapy config created with `scrapy startproject` — PATAPOsha, Mar 16 '21 at 20:14

Cause of slow Scrapy scraper

1 Answers1

Linked