Scrapy process less than succesfully crawled

Question

I have 2 problems with my scraper:

It get's a lot of 302s after a while, despite the fact I use 'COOKIES_ENABLED': False, and rotating proxy which should provide different IP for each request. I solved it by restarting scraper after several 302s
I see that scraper successfully crawls much more than it process, and I can't do anything with it. In the example below I've got 121 200s responses, but only 27 was processed.

Spider

class MySpider(Spider):
    name = 'MySpider'
    custom_settings = {
        'DOWNLOAD_DELAY': 0,
        'RETRY_TIMES': 1,
        'LOG_LEVEL': 'DEBUG',
        'CLOSESPIDER_ERRORCOUNT': 3,
        'COOKIES_ENABLED': False,
    }
    # I need to manually control when spider to stop, otherwise it runs forever
    handle_httpstatus_list = [301, 302]

    def start_requests(self):
        for row in self.df.itertuples():
            yield Request(
                url=row.link,
                callback=self.parse,
                priority=100
            )

    def close(self, reason):
        self.logger.info('TOTAL ADDED: %s' % self.added)

    def parse(self, r):
        if r.status == 302:
            # I need to manually control when spider to stop, otherwise it runs forever
            raise CloseSpider("")
        else:
            # do parsing stuff
                self.added += 1
                self.logger.info('{} left'.format(len(self.df[self.df['status'] == 0])))

Output

2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url1> (referer: None)
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url2> (referer: None)
2018-08-08 12:24:24 [MySpider] INFO: 52451 left
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url3> (referer: None)
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url4> (referer: None)
2018-08-08 12:24:24 [MySpider] INFO: 52450 left
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url4> (referer: None)


2018-08-08 12:24:37 [MySpider] INFO: TOTAL ADDED: 27
2018-08-08 12:24:37 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
...
...
 'downloader/response_status_count/200': 121,
 'downloader/response_status_count/302': 4,

It succesfully crawls much (3x or 4x times more than crawls). How can I force to process everything that was crawled?

I can sacrifice the speed, but I don't want to waste what was successfully crawled 200s

If you open the page the 302 is pointing to, what do you see? — Aankhen, Aug 08 '18 at 10:50
Okay, so that’s not really something in your control, I’d think. Re: the missing 200s, are you sure you’re not returning early in your parsing, before you increment `self.added`? — Aankhen, Aug 08 '18 at 11:23
1. I'm 100% sure it's not parsed. 2. It's really to strange to scrape, but exit without parsing. I believe there must be the way to control it — Bendeberia, Aug 08 '18 at 11:29
Sorry, I can’t see anything in the code that would cause this! — Aankhen, Aug 09 '18 at 20:45
* If the skipped 200s are duplicates, use `dont_filter=True`. See https://stackoverflow.com/questions/23131283/how-to-force-scrapy-to-crawl-duplicate-url * Not sure about redirects, could it be a rate-limiting thing preventing overload on the server? In that case, add some gap. — Ghasem Naddaf, Aug 16 '18 at 04:16
try to yield request with another callback (don't override ```parse``` method). As long as i remember, ```parse``` method are used by Spider class and can lead to conflicts in some configuration cases, and also add ```dont_filter=True``` option (as @GhasemNaddaf wrote) — dorintufar, Aug 16 '18 at 11:58

score 1 · Answer 1 · answered Oct 24 '18 at 15:37

1

The scheduler may not have delivered all the 200 responses to the parse() method when you CloseSpider().

Log and ignore the 302s, and let the spider finish.

answered Oct 24 '18 at 15:37

Apalala

9,017
3
30
48

Scrapy process less than succesfully crawled

1 Answers1