
I have a problem. I need to stop the execution of a function for a while, but not stop parsing as a whole. That is, I need a non-blocking pause.

It looks like this:

from scrapy import Spider, Request


class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):
        for url in ['url1', 'url2', 'url3', 'more urls']:
            yield Request(url, callback=self.second_parse_function)

        # Here I need something like time.sleep(10) that pauses only this function

        yield Request('some url', callback=self.non_stop_function)  # Call itself

    def second_parse_function(self, response):
        pass

The function non_stop_function needs to pause for a while, but it should not block the rest of the output.

If I insert time.sleep(), it will stop the whole parser, which is not what I want. Is it possible to pause just one function using Twisted or something else?

Reason: I need to create a non-blocking function that parses a page of the website every n seconds. It will collect URLs from that page and then pause for 10 seconds. The URLs that have already been obtained will continue to be processed, but the main function needs to sleep.

UPDATE:

Thanks to TkTech and Viach. One answer helped me understand how to make a pending Request, and the other showed how to activate it. The two answers complement each other, and together they gave me an excellent non-blocking pause for Scrapy:

def call_after_pause(self, response):
    d = Deferred()
    reactor.callLater(10.0, d.callback, Request(
        'https://example.com/',
        callback=self.non_stop_function,
        dont_filter=True))
    return d

And I use this function for my request:

yield Request('https://example.com/', callback=self.call_after_pause, dont_filter=True)
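
Putting it together, the whole spider looks roughly like this (a sketch; the URLs are placeholders, as in the snippets above):

from scrapy import Spider, Request
from twisted.internet import reactor
from twisted.internet.defer import Deferred


class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('https://example.com/', callback=self.non_stop_function)

    def non_stop_function(self, response):
        # Schedule the found URLs for parsing; they are processed independently.
        for url in ['url1', 'url2', 'url3', 'more urls']:
            yield Request(url, callback=self.second_parse_function)

        # Instead of time.sleep(10), go through call_after_pause, which
        # re-enters non_stop_function after a 10-second non-blocking delay.
        yield Request('https://example.com/', callback=self.call_after_pause,
                      dont_filter=True)

    def call_after_pause(self, response):
        d = Deferred()
        reactor.callLater(10.0, d.callback, Request(
            'https://example.com/',
            callback=self.non_stop_function,
            dont_filter=True))
        return d

    def second_parse_function(self, response):
        pass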
JRazor
  • Would this approach help? http://stackoverflow.com/questions/37002742/calling-the-same-spider-programmatically/37007619#37007619 – Rafael Almeida May 04 '16 at 16:41
  • @RafaelAlmeida It's not a very convenient way. I want to use this pause in the future without compromising the architecture of the parser. – JRazor May 04 '16 at 17:36
  • Do you want to pause it so that it doesn't make a request, or just pause inside the method? It would be very helpful if you could specify the reason for this pause. – eLRuLL May 05 '16 at 12:58
  • @eLRuLL I added my reason to the question. Thanks – JRazor May 05 '16 at 13:12
  • So if you have a page with, say, 100 links inside, you want to send 10 at a time, right? What about sending all 100 requests and then throttling them 10 at a time? – eLRuLL May 05 '16 at 14:07
  • @eLRuLL No, you misunderstand. I want to find, for example, 100 links and send them for parsing. The pause should not stop that parsing; meanwhile, the main function needs to sleep 10 seconds and then repeat. – JRazor May 05 '16 at 14:11
  • Logically, that method will cause the spider to scrape the URL once before scraping with delay – Aminah Nuraini Feb 23 '17 at 02:09
  • Is this REALLY the only way to pause? Actively requesting a random website just for the sake of completing the functions? BTW, I'd suggest you add your full "updated code" as a reply and mark it as answer, since it takes some guessing to get it right just by following your "EDIT" line – IgorMF Dec 11 '20 at 20:47
  • @IgorMF I asked this question four years ago. At the time, it was the only way out. I'm not sure if anything has changed since then. No, it's not a random site listed there, it's just a link replaced with an example. And the code from the `update block` worked fine at that time. – JRazor Dec 16 '20 at 16:05

3 Answers


The Request object has a callback parameter; try to use it for this purpose. I mean, create a Deferred which wraps self.second_parse_function and a pause.

Here is my rough, untested example; the changed lines are marked.

from scrapy import Spider, Request
from twisted.internet.defer import Deferred


class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):

        parse_and_pause = Deferred()  # changed
        parse_and_pause.addCallback(self.second_parse_function) # changed
        parse_and_pause.addCallback(pause, seconds=10)  # changed

        for url in ['url1', 'url2', 'url3', 'more urls']:
            yield Request(url, callback=parse_and_pause)  # changed

        yield Request('some url', callback=self.non_stop_function)  # Call itself

    def second_parse_function(self, response):
        pass

If the approach works for you, you can then create a helper function which constructs such a Deferred object. It could be implemented like the following:

def get_perform_and_pause_deferred(seconds, fn, *args, **kwargs):
    d = Deferred()
    d.addCallback(fn, *args, **kwargs)
    d.addCallback(pause, seconds=seconds)
    return d
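
The pause callback used here is not imported from anywhere; it is assumed to exist. A minimal sketch of one possible implementation, using Twisted's task.deferLater so that the rest of the callback chain is held up for the given number of seconds:

from twisted.internet import reactor, task

def pause(result, seconds=10):
    # Return a Deferred that fires with the unchanged result after `seconds`;
    # returning it from a callback suspends the chain until it fires.
    return task.deferLater(reactor, seconds, lambda: result)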

And here is a possible usage:

class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):
        for url in ['url1', 'url2', 'url3', 'more urls']:
            # changed
            yield Request(url, callback=get_perform_and_pause_deferred(10, self.second_parse_function))

        yield Request('some url', callback=self.non_stop_function)  # Call itself

    def second_parse_function(self, response):
        pass
Viach Kakovskyi
  • I don't understand what the pause function is and where it comes from. – Honza Javorek Sep 22 '20 at 10:45
  • Where does the pause come from in `d.addCallback(pause, seconds=seconds)`? I can see it mentioned in the twisted docs but I can't find where to import it from. – Further Reading Mar 26 '21 at 12:04
  • This won't work, scrapy has an internal pause and unpause method. However, with this set-up, it would pause right at the start before the requests are sent. Additionally, the delay on the pause will not work, you will get a permanent pause unless you callback unpause. – Emil11 Aug 01 '22 at 12:33

If you're attempting to use this for rate limiting, you probably just want to use DOWNLOAD_DELAY instead.
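
For example, a fixed 10-second delay between requests can be configured in settings.py (or via custom_settings on the spider):

# settings.py: throttle consecutive requests to the same site by 10 seconds
DOWNLOAD_DELAY = 10
# RANDOMIZE_DOWNLOAD_DELAY = True  # optionally jitter the delay around that value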

Scrapy is just a framework on top of Twisted. For the most part, you can treat it the same as any other Twisted app. Instead of calling sleep, just return the next request to make and tell Twisted to wait a bit. Ex:

from scrapy import Request
from twisted.internet import reactor, defer

def non_stop_function(self, response):
    # Create a Deferred and have the reactor fire it with the next Request
    # after 10 seconds -- nothing blocks in the meantime.
    d = defer.Deferred()
    reactor.callLater(10.0, d.callback, Request(
        'some url',
        callback=self.non_stop_function
    ))
    return d
TkTech

The asker already provides an answer in the question's update, but I want to give a slightly better version so it's reusable for any request.

# removed...
import scrapy
from twisted.internet import reactor, defer

class MySpider(scrapy.Spider):
    # removed...

    def request_with_pause(self, response):
        d = defer.Deferred()
        reactor.callLater(response.meta['time'], d.callback, scrapy.Request(
            response.url,
            callback=response.meta['callback'],
            dont_filter=True, meta={'dont_proxy':response.meta['dont_proxy']}))
        return d

    def parse(self, response):
        # removed....
        yield scrapy.Request(the_url, meta={
                            'time': 86400, 
                            'callback': self.the_parse, 
                            'dont_proxy': True
                            }, callback=self.request_with_pause)

For explanation: Scrapy uses Twisted to manage requests asynchronously, so we need Twisted's tools to make a delayed request too.

Aminah Nuraini
  • I am getting this error message `ERROR: Spider must return Request, BaseItem, dict or None, got 'Deferred'` when trying this method. Has there been a change in recent versions possibly? – Richard Löwenström Oct 11 '17 at 21:43