
I have 2 problems with my scraper:

  1. It gets a lot of 302s after a while, despite the fact that I use 'COOKIES_ENABLED': False and a rotating proxy, which should provide a different IP for each request. I solved it by restarting the scraper after several 302s.

  2. I see that the scraper successfully crawls much more than it processes, and I can't do anything about it. In the example below I got 121 responses with status 200, but only 27 were processed.

Spider

from scrapy import Spider, Request
from scrapy.exceptions import CloseSpider


class MySpider(Spider):
    name = 'MySpider'
    custom_settings = {
        'DOWNLOAD_DELAY': 0,
        'RETRY_TIMES': 1,
        'LOG_LEVEL': 'DEBUG',
        'CLOSESPIDER_ERRORCOUNT': 3,
        'COOKIES_ENABLED': False,
    }
    # I need to manually control when the spider stops, otherwise it runs forever
    handle_httpstatus_list = [301, 302]

    added = 0  # number of successfully processed responses

    def start_requests(self):
        # self.df is a pandas DataFrame of links, loaded elsewhere
        for row in self.df.itertuples():
            yield Request(
                url=row.link,
                callback=self.parse,
                priority=100
            )

    def close(self, reason):
        self.logger.info('TOTAL ADDED: %s' % self.added)

    def parse(self, r):
        if r.status == 302:
            # I need to manually control when the spider stops, otherwise it runs forever
            raise CloseSpider("")
        else:
            # do parsing stuff
            self.added += 1
            self.logger.info('{} left'.format(len(self.df[self.df['status'] == 0])))

Output

2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url1> (referer: None)
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url2> (referer: None)
2018-08-08 12:24:24 [MySpider] INFO: 52451 left
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url3> (referer: None)
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url4> (referer: None)
2018-08-08 12:24:24 [MySpider] INFO: 52450 left
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url4> (referer: None)


2018-08-08 12:24:37 [MySpider] INFO: TOTAL ADDED: 27
2018-08-08 12:24:37 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
...
...
 'downloader/response_status_count/200': 121,
 'downloader/response_status_count/302': 4,

It successfully crawls 3x or 4x more than it processes. How can I force it to process everything that was crawled?

I can sacrifice speed, but I don't want to waste the 200s that were successfully crawled.

Bendeberia
  • If you open the page the 302 is pointing to, what do you see? – Aankhen Aug 08 '18 at 10:50
  • It redirects to a CAPTCHA page – Bendeberia Aug 08 '18 at 11:16
  • Okay, so that’s not really something in your control, I’d think. Re: the missing 200s, are you sure you’re not returning early in your parsing, before you increment `self.added`? – Aankhen Aug 08 '18 at 11:23
  • 1. I'm 100% sure it's not parsed. 2. It's really too strange to scrape but then exit without parsing. I believe there must be a way to control it – Bendeberia Aug 08 '18 at 11:29
  • Sorry, I can’t see anything in the code that would cause this! – Aankhen Aug 09 '18 at 20:45
  • * If the skipped 200s are duplicates, use `dont_filter=True`. See https://stackoverflow.com/questions/23131283/how-to-force-scrapy-to-crawl-duplicate-url * Not sure about redirects, could it be a rate-limiting thing preventing overload on the server? In that case, add some gap. – Ghasem Naddaf Aug 16 '18 at 04:16
  • Try yielding the request with another callback (don't override the `parse` method). As far as I remember, the `parse` method is used by the Spider class and can lead to conflicts in some configurations; also add the `dont_filter=True` option (as @GhasemNaddaf wrote) – dorintufar Aug 16 '18 at 11:58
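
A minimal sketch of the two suggestions from the comments above (a dedicated callback plus `dont_filter=True`); the callback name `parse_row` is my own choice, not from the question:

from scrapy import Spider, Request


class MySpider(Spider):
    name = 'MySpider'

    def start_requests(self):
        # self.df is assumed to be the same DataFrame of links as in the question
        for row in self.df.itertuples():
            yield Request(
                url=row.link,
                callback=self.parse_row,  # dedicated callback instead of overriding parse()
                dont_filter=True,         # bypass the duplicate-request filter
                priority=100
            )

    def parse_row(self, response):
        # do parsing stuff here
        self.logger.info('parsed %s', response.url)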

1 Answer


The scheduler may not have delivered all of the 200 responses to the parse() method yet when you raise CloseSpider().

Log and ignore the 302s, and let the spider finish.
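
A minimal sketch of that approach, keeping the rest of the spider as posted in the question (`self.added`, `self.df`, and the settings unchanged):

    def parse(self, r):
        if r.status in (301, 302):
            # log and skip instead of raising CloseSpider,
            # so the queued 200 responses still get parsed
            self.logger.warning('redirected (%s): %s', r.status, r.url)
            return
        # do parsing stuff
        self.added += 1
        self.logger.info('{} left'.format(len(self.df[self.df['status'] == 0])))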

Apalala