
I am able to call a scrapy spider from another Python script using either CrawlerRunner or CrawlerProcess. But when I try to call the same spider-calling class from a pywikibot robot, I get a ReactorNotRestartable error. Why is this, and how can I fix it?

Here is the error:

  File ".\scripts\userscripts\ReplicationWiki\RWLoad.py", line 161, in format_new_page
    aea = AEAMetadata(url=DOI_url)
  File ".\scripts\userscripts\ReplicationWiki\GetAEAMetadata.py", line 39, in __init__
    reactor.run() # the script will block here until all crawling jobs are finished
  File "C:\Users\lextr\.conda\envs\py37\lib\site-packages\twisted\internet\base.py", line 1282, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\Users\lextr\.conda\envs\py37\lib\site-packages\twisted\internet\base.py", line 1262, in startRunning
    ReactorBase.startRunning(self)
  File "C:\Users\lextr\.conda\envs\py37\lib\site-packages\twisted\internet\base.py", line 765, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
CRITICAL: Exiting due to uncaught exception <class 'twisted.internet.error.ReactorNotRestartable'>

Here is the script which calls my scrapy spider. It runs fine if I just call the class from main.

from twisted.internet import reactor, defer
from scrapy import signals
from scrapy.crawler import Crawler, CrawlerProcess, CrawlerRunner
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings

from Scrapers.spiders.ScrapeAEA import ScrapeaeaSpider

class AEAMetadata:
    """
    Helper to run ScrapeAEA spider and return JEL codes and data links
    for a given AEA article link.
    """

    def __init__(self, *args, **kwargs):
        """Initializer"""

        url = kwargs.get('url')
        if not url:
            raise ValueError('No article url given')

        self.items = []
        def collect_items(item, response, spider):
            self.items.append(item)

        settings = get_project_settings()
        crawler = Crawler(ScrapeaeaSpider, settings)
        crawler.signals.connect(collect_items, signals.item_scraped)

        runner = CrawlerRunner(settings)
        d = runner.crawl(crawler, url=url)
        d.addBoth(lambda _: reactor.stop())
        reactor.run() # the script will block here until all crawling jobs are finished

        #process = CrawlerProcess(settings)
        #process.crawl(crawler, url=url)
        #process.start()  # the script will block here until the crawling is finished

    def get_jelcodes(self):
        jelcodes = self.items[0]['jelcodes']
        return jelcodes

def main():
    aea = AEAMetadata(url='https://doi.org/10.1257/app.20180286')
    jelcodes = aea.get_jelcodes()
    print(jelcodes)

if __name__ == '__main__':
    main()

Update: a simple test that instantiates the AEAMetadata class twice. Here is the calling code in my pywikibot bot, which fails:

from GetAEAMetadata import AEAMetadata

def main(*args):
    for _ in [1,2]:
        print('Top')
        url = 'https://doi.org/10.1257/app.20170442'
        aea = AEAMetadata(url=url)
        print('After AEAMetadata')
        jelcodes = aea.get_jelcodes()
        print(jelcodes)


if __name__ == '__main__':
    main()
  • Is `reactor.run()` being called more than once? Could you provide some minimal self-contained code that allows reproducing the issue? (https://stackoverflow.com/help/mcve) – Gallaecio Nov 06 '19 at 13:02
  • Yes, "is reactor.run() called more than once?" It must be, but where? Since my main script depends on the pywikibot framework, it's tough to extract a self-contained piece that still has the failure. I'm new to Python and find it hard to navigate the libraries, so I may have missed a use of the reactor by pywikibot or the habanero libraries which my script uses. The other thing is that AEAMetadata is called by an __iter__, which shouldn't be a problem because the reactor is created and deleted along with the class on each iteration. Also, the failure occurs on the first iteration. – Scott Nov 07 '19 at 18:34
  • Also, I will try to build a self-contained test which still fails. I think Gallaecio is right that I haven't provided enough context. – Scott Nov 07 '19 at 18:38

1 Answer


My call to AEAMetadata was embedded in a larger script, which fooled me into thinking the AEAMetadata class was instantiated only once before the failure. In fact, AEAMetadata was called twice.

I also thought that the script would block after reactor.run(), because the comment in all the scrapy examples said it would. However, the callback added with d.addBoth() calls reactor.stop(), which unblocks reactor.run() as soon as the crawl finishes.
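A tiny twisted-only sketch (my own illustration, independent of scrapy) shows both behaviours in isolation:

from twisted.internet import reactor

reactor.callLater(1, reactor.stop)  # schedule a stop one second from now
reactor.run()  # blocks until reactor.stop() fires, then returns
reactor.run()  # raises twisted.internet.error.ReactorNotRestartable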

A more basic incorrect assumption was that the reactor was deleted and recreated on each iteration. In fact, the reactor is instantiated and initialized when it is first imported. It is a global object that lives as long as the underlying process, and it was not designed to be restarted. The extremes actually needed to delete and restart a reactor are described here: http://www.blog.pythonlibrary.org/2016/09/14/restarting-a-twisted-reactor/
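If each call really does need an isolated crawl, one workaround is to run the crawl in a child process, so every call gets a fresh interpreter and therefore a fresh reactor. This is only a sketch of how AEAMetadata could be adapted; the multiprocessing and queue plumbing is my own, not part of scrapy:

from multiprocessing import Process, Queue

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from Scrapers.spiders.ScrapeAEA import ScrapeaeaSpider

def _crawl(url, queue):
    items = []
    def collect_items(item, response, spider):
        items.append(item)

    process = CrawlerProcess(get_project_settings())
    crawler = process.create_crawler(ScrapeaeaSpider)
    crawler.signals.connect(collect_items, signals.item_scraped)
    process.crawl(crawler, url=url)
    process.start()   # blocks this child process until the crawl finishes
    queue.put(items)  # hand the scraped items back to the parent

def scrape(url):
    queue = Queue()
    p = Process(target=_crawl, args=(url, queue))
    p.start()
    items = queue.get()
    p.join()
    return items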

So I guess I've answered my own question, and I'm rewriting my script so it doesn't try to use the reactor in a way it was never intended to be used.
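One way to do that, following the "running multiple spiders in the same process" example from the scrapy docs, is to chain the crawls with CrawlerRunner and defer.inlineCallbacks and stop the reactor only after everything is done, so reactor.run() is called exactly once. A sketch (the URL list is just an illustration):

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

from Scrapers.spiders.ScrapeAEA import ScrapeaeaSpider

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl_all(urls):
    for url in urls:
        yield runner.crawl(ScrapeaeaSpider, url=url)  # run the crawls one after another
    reactor.stop()  # stop only after the last crawl has finished

crawl_all(['https://doi.org/10.1257/app.20170442',
           'https://doi.org/10.1257/app.20180286'])
reactor.run()  # started exactly once for the whole process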

And thanks, Gallaecio, for getting me thinking in the right direction.
