Unable to make my script stop when some urls are scraped

Question

I've created a script in scrapy to parse the titles of different sites listed in start_urls. The script is doing its job flawlessly.

What I wish to do now is make my script stop after two of the urls are parsed, no matter how many urls there are.

I've tried so far with:

import scrapy
from scrapy.crawler import CrawlerProcess

class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/","https://www.yahoo.com/","https://www.bing.com/"]

    def parse(self, response):
        yield {'title':response.css('title::text').get()}

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0', 
    })
    c.crawl(TitleSpider)
    c.start()

How can I make my script stop when two of the listed urls are scraped?

Which two? The first in the sequence? — DirtyBit, Apr 22 '19 at 09:22
I'm unfamiliar with scrapy. How *do* you stop a spider? — quamrana, Apr 22 '19 at 10:57

score 1 · Answer 1 · answered Apr 25 '19 at 06:03

As Gallaecio proposed, you can add a counter, but the difference here is that you export an item after the if statement. This way, it will almost always end up exporting 2 items.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import CloseSpider


class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/", "https://www.yahoo.com/", "https://www.bing.com/"]
    item_limit = 2

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.counter = 0

    def parse(self, response):
        self.counter += 1
        if self.counter > self.item_limit:
            raise CloseSpider

        yield {'title': response.css('title::text').get()}

Why almost always, you may ask? It has to do with a race condition in the parse method.

Imagine that self.counter is currently equal to 1, which means that one more item is expected to be exported. But now Scrapy receives two responses at the same moment and invokes the parse method for both of them. If the two threads running the parse method increase the counter simultaneously, they will both see self.counter equal to 3 and thus will both raise the CloseSpider exception.

In this case (which is very unlikely, but can still happen), the spider will export only one item.

score 1 · Accepted Answer · edited Apr 27 '19 at 12:59

Currently I see only one way to stop this script immediately: forcing an exit with os._exit:

import os
import scrapy
from scrapy.crawler import CrawlerProcess

class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/","https://www.yahoo.com/","https://www.bing.com/"]
    item_counter = 0

    def parse(self, response):
        yield {'title': response.css('title::text').get()}
        self.item_counter += 1
        print(self.item_counter)
        if self.item_counter >= 2:
            self.crawler.stats.close_spider(self, "2 items")
            os._exit(0)

if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0' })
    c.crawl(TitleSpider)
    c.start()
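To see why the stats call has to come before os._exit, here is a small sketch that does not use Scrapy at all. It starts two child Python processes, one leaving through sys.exit and one through os._exit. The child script and the run_child helper are made up for illustration, and the atexit handler only stands in for any end-of-run work such as the stats dump.

```python
import subprocess
import sys

# Hypothetical child script: registers a cleanup handler, then exits
# through whichever call is substituted for {exit_call}.
CHILD_TEMPLATE = """
import atexit, os, sys
atexit.register(lambda: print("cleanup ran"))
print("working", flush=True)
{exit_call}
"""


def run_child(exit_call):
    """Run the child script and return the lines it wrote to stdout."""
    code = CHILD_TEMPLATE.format(exit_call=exit_call)
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    print(run_child("sys.exit(0)"))   # normal interpreter shutdown
    print(run_child("os._exit(0)"))   # immediate exit, handlers skipped
```

The sys.exit child reports its cleanup line, while the os._exit child stops right after "working". Anything that normally happens at shutdown therefore has to be called by hand first.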

Other things that I tried, none of which gave the required result (stopping the script immediately after 2 scraped items, with only 3 urls in start_urls):

Passing the CrawlerProcess instance into the spider settings and calling CrawlerProcess.stop, reactor.stop, and other methods from the parse method.

Using the CloseSpider extension with the following CrawlerProcess definition:

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'EXTENSIONS': {
        'scrapy.extensions.closespider.CloseSpider': 500,
    },
    'CLOSESPIDER_ITEMCOUNT': 2,
})

Reducing the CONCURRENT_REQUESTS setting to 1 (with a raise CloseSpider condition in the parse method).
By the time the application has scraped 2 items and reaches the line with raise CloseSpider, the 3rd request has already been sent.
With the conventional way of stopping a spider, the application stays active until it has processed the previously sent requests and their responses, and only then does it close.

As your application has a relatively small number of urls in start_urls, it starts processing all of them long before it reaches raise CloseSpider.
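The timing can be pictured with a toy model. This is plain asyncio, not Scrapy's engine: fetch, crawl, run_demo and the delays are invented for illustration, and no network request is made. All three "requests" start together, so the third response still arrives after the limit of two has been reached.

```python
import asyncio


async def fetch(url, delay):
    # Stand-in for a download: sleeps instead of touching the network.
    await asyncio.sleep(delay)
    return url


async def crawl(urls_with_delays, item_limit):
    # All requests are started up front, as happens when the number of
    # start urls is below the concurrency limit.
    tasks = [asyncio.create_task(fetch(url, delay))
             for url, delay in urls_with_delays]
    scraped, arrived_after_limit = [], []
    for next_done in asyncio.as_completed(tasks):
        url = await next_done
        if len(scraped) < item_limit:
            scraped.append(url)
        else:
            # The limit was hit earlier, yet this response still arrives
            # because its request was already in flight.
            arrived_after_limit.append(url)
    return scraped, arrived_after_limit


def run_demo():
    # Hypothetical delays, spaced out so the completion order is stable.
    urls_with_delays = [
        ("https://www.google.com/", 0.01),
        ("https://www.yahoo.com/", 0.10),
        ("https://www.bing.com/", 0.20),
    ]
    return asyncio.run(crawl(urls_with_delays, item_limit=2))


if __name__ == "__main__":
    print(run_demo())
```

In this model the only way to avoid the late arrival is to not start the third request at all, or to kill the process outright.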

It seems to be the perfect solution I was looking for. I find it difficult to understand the parameter within the following `stats.close_spider(self,"2 items")`. Could you clarify it in comment? Thanks. — MITHU, Apr 27 '19 at 05:23
Scrapy usually prints stats data when it finishes (the log lines starting with `INFO: Dumping Scrapy stats:`) using this function. It requires the spider and a reason as arguments. Because `os._exit` stops the process immediately, the application will not print stats data as scrapy usually does, so I manually added this function call before `os._exit`. Calling `self.crawler.stats.close_spider(self,"2 items")` is optional. — Georgiy, Apr 27 '19 at 06:50

Gallaecio · Answer 3 · 2019-04-22T12:06:35.023

Building on https://stackoverflow.com/a/38331733/939364, you can define a counter in the constructor of your spider, and use parse to increase it and raise CloseSpider when it reaches 2:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import CloseSpider  # 1. Import CloseSpider

class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/","https://www.yahoo.com/","https://www.bing.com/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.counter = 0  # 2. Define a self.counter property

    def parse(self, response):
        yield {'title':response.css('title::text').get()}
        self.counter += 1  # 3. Increase the count on each parsed URL
        if self.counter >= 2:
            raise CloseSpider  # 4. Raise CloseSpider after 2 URLs are parsed

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0', 
    })
    c.crawl(TitleSpider)
    c.start()

I am not 100% certain that it will prevent a third URL from being parsed, because I think CloseSpider stops new requests from starting but waits for already-started requests to finish.

If you want to prevent more than 2 items from being scraped, you can edit parse not to yield items when self.counter > 2.
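A minimal sketch of that guard, kept free of Scrapy so it runs anywhere. The CloseSpider class here is a local stand-in for scrapy.exceptions.CloseSpider, and TitleParser and collect are hypothetical helpers that only mimic responses which were already in flight. Unlike the spider above, the counter is checked before yielding, so late responses produce nothing.

```python
class CloseSpider(Exception):
    """Local stand-in for scrapy.exceptions.CloseSpider."""


class TitleParser:
    """Holds only the counter logic of the spider above."""

    item_limit = 2

    def __init__(self):
        self.counter = 0

    def parse(self, title):
        # Responses that arrive after the limit yield nothing at all.
        if self.counter >= self.item_limit:
            return
        self.counter += 1
        yield {'title': title}
        if self.counter >= self.item_limit:
            raise CloseSpider("item limit reached")


def collect(parser, titles):
    """Feed every title to parse, as if all responses were in flight."""
    items = []
    for title in titles:
        try:
            for item in parser.parse(title):
                items.append(item)
        except CloseSpider:
            # Closing was requested; later responses are still delivered.
            pass
    return items


if __name__ == "__main__":
    print(collect(TitleParser(), ["Google", "Yahoo", "Bing"]))
```

The third title still reaches parse, but it is dropped instead of exported, so the item count stays at two even though the spider has not finished closing.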

I tried your script but that really didn't help. I put 6 urls within `start_urls` and they all are being parsed accordingly. There is a typo in your import. Care to fix that replacing with `from scrapy.exceptions import CloseSpider`. Thanks for your input @Gallaecio. — MITHU, Apr 22 '19 at 11:52
@MITHU With 6 URLs I think that is expected, as per my paragraph after the code. After raising CloseSpider, I believe that up to [`CONCURRENT_REQUESTS`](https://docs.scrapy.org/en/latest/topics/settings.html#concurrent-requests)-1 responses may be received before the spider stops. Try setting `CONCURRENT_REQUESTS = 2`, for example. — Gallaecio, Apr 22 '19 at 12:08
I set `CONCURRENT_REQUESTS` to `2` but that still couldn't fix the issue. I'm getting all six titles against six urls @Gallaecio. — MITHU, Apr 22 '19 at 17:03

score -1 · Answer 4 · answered Apr 25 '19 at 06:26

enumerate does the job fine, with some changes in architecture:

for cnt, url in enumerate(start_urls):
    if cnt > 1:
        break
    else:
        parse(url)
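The same idea can be written without a manual counter by trimming the url list before anything is requested. This is only a sketch: first_n is a hypothetical helper, and how the shortened list is handed to the spider (for example by assigning it to start_urls) is an assumption, not something stated above.

```python
from itertools import islice

start_urls = [
    "https://www.google.com/",
    "https://www.yahoo.com/",
    "https://www.bing.com/",
]


def first_n(urls, n):
    """Return at most the first n urls, in listing order."""
    return list(islice(urls, n))


# Only these urls would be requested, so a third response cannot arrive.
limited_urls = first_n(start_urls, 2)

if __name__ == "__main__":
    print(limited_urls)
```

Note that this picks the first two urls as listed, not the first two that happen to respond.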

EvGEN Levakov
