
I am trying to run my Scrapy spider from a script. I am using CrawlerProcess and I only have one spider to run.

I've been stuck on this error for a while now, and I've tried a lot of settings changes, but every time I run the spider, I get

twisted.internet.error.ReactorNotRestartable

I've been searching for a solution, and I believe you should only get this error when you call process.start() more than once. But I don't.
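To illustrate that claim (a minimal sketch, not my actual code): the Twisted reactor stops when the first process.start() returns and cannot be started again, so a second call in the same Python process raises exactly this error.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(settings=get_project_settings())
process.crawl('spider')
process.start()  # blocks until crawling finishes; the reactor then stops

process.start()  # second call: raises twisted.internet.error.ReactorNotRestartable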

Here's my code:

import scrapy
from scrapy.utils.log import configure_logging

from scrapyprefect.items import ScrapyprefectItem
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    start_urls = ['http://www.nigeria-law.org/A.A.%20Macaulay%20v.%20NAL%20Merchant%20Bank%20Ltd..htm']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def parse(self, response):
        item = ScrapyprefectItem()
        ...

        yield item


process = CrawlerProcess(settings=get_project_settings())
process.crawl('spider')
process.start()
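For context, ScrapyprefectItem is an ordinary scrapy.Item subclass in scrapyprefect/items.py; a minimal sketch (the field names here are hypothetical, my real fields are elided above):

import scrapy

class ScrapyprefectItem(scrapy.Item):
    # hypothetical fields; the real item declares whatever parse() extracts
    title = scrapy.Field()
    body = scrapy.Field()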

Error:

Traceback (most recent call last):
  File "/Users/pluggle/Documents/Upwork/scrapyprefect/scrapyprefect/spiders/spider.py", line 59, in <module>
    process.start()
  File "/Users/pluggle/Documents/Upwork/scrapyprefect/venv/lib/python3.7/site-packages/scrapy/crawler.py", line 309, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Users/pluggle/Documents/Upwork/scrapyprefect/venv/lib/python3.7/site-packages/twisted/internet/base.py", line 1282, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Users/pluggle/Documents/Upwork/scrapyprefect/venv/lib/python3.7/site-packages/twisted/internet/base.py", line 1262, in startRunning
    ReactorBase.startRunning(self)
  File "/Users/pluggle/Documents/Upwork/scrapyprefect/venv/lib/python3.7/site-packages/twisted/internet/base.py", line 765, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

I notice that this only happens when I'm trying to save my items to MongoDB. pipeline.py:

import logging
import pymongo


class ScrapyprefectPipeline(object):
    collection_name = 'SupremeCourt'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # pull in information from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        # initializing spider
        # opening db connection
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        # clean up when spider is closed
        self.client.close()

    def process_item(self, item, spider):
        # insert each scraped item as a document
        # (insert_one replaces the deprecated Collection.insert)
        self.db[self.collection_name].insert_one(dict(item))
        logging.debug("Post added to MongoDB")
        return item
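For completeness, from_crawler pulls the connection details from settings.py; the relevant excerpt looks roughly like this (the URI and database name are placeholders):

# settings.py (excerpt; values are placeholders)
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'scrapyprefect'

ITEM_PIPELINES = {
    # module path must match where the pipeline class actually lives
    'scrapyprefect.pipeline.ScrapyprefectPipeline': 300,
}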

If I change pipeline.py to the default, which is...

import logging
import pymongo

class ScrapyprefectPipeline(object):
    def process_item(self, item, spider):
        return item

...the script runs fine. I'm thinking this has something to do with how I set up the PyCharm run configuration, so for reference I'm also including my PyCharm settings:

[screenshot of PyCharm run configuration]

I hope someone can help me. Let me know if you need more details.

3 Answers


Reynaldo,

thanks a lot - you saved my project!

And you pushed me to the idea that this possibly occurs because you have the piece of script starting the process in the same file as your spider definition. As a result, it is executed each time Scrapy imports your spider definition. I am not a big expert in Scrapy, but possibly it does this a few times internally, and thus we run into this error.

Your suggestion obviously solves the problem!

Another approach is to separate the spider class definition from the script running it. Possibly this is the approach Scrapy assumes, and that is why its "Run Scrapy from a script" documentation does not even mention this __name__ check.

So what I did is the following:

  • in the project root I have a sites folder, and in it a site_spec.py file. This is just a utility file with some target-site-specific information (URL structure, etc.). I mention it here just to show how you can import your various utility modules into your spider class definition;

  • in the project root I have a spiders folder with the my_spider.py class definition in it. In that file I import site_spec.py with the directive:

from sites import site_spec

It is important to mention that the script running the spider (the one you presented) IS REMOVED from the my_spider.py class definition file. Also note that I import site_spec.py with a path relative to run.py (see below), not relative to the class definition file where this directive is issued, as one might expect (Python relative imports, I guess); see the my_spider.py sketch after this list.

  • finally, in the project root I have a run.py file, running Scrapy from a script:
from scrapy.crawler import CrawlerProcess
from spiders.my_spider import MySpider  # the spider defined in the spiders subfolder
from scrapy.utils.project import get_project_settings

# Run that thing!

process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
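And for completeness, the stripped-down my_spider.py mentioned above looks roughly like this (the class and attribute names are illustrative; the important part is that no process.start() appears anywhere in the file):

import scrapy
from sites import site_spec  # imported with a path relative to run.py, as noted above


class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = site_spec.START_URLS  # hypothetical attribute of site_spec

    def parse(self, response):
        # extraction logic goes here; no CrawlerProcess code in this file
        ...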

With this setup I was finally able to get rid of twisted.internet.error.ReactorNotRestartable. (run.py is executed from the project root, next to scrapy.cfg, so get_project_settings() can find the project settings.)

Thank you very much!!!

Arregator

Okay. I solved it. I think that, in the pipeline, when the scraper enters open_spider, it runs spider.py again, calling process.start() a second time.

To solve the problem, I added this guard to the spider so process.start() is only executed when you run the file directly; an import never sets __name__ to '__main__', so it cannot trigger a second start:

if __name__ == '__main__':
    process = CrawlerProcess(settings=get_project_settings())
    process.crawl('spider')
    process.start()
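For clarity, here is roughly what the whole spider.py looks like with the guard in place (the parse body is elided, as in the question):

import scrapy

from scrapyprefect.items import ScrapyprefectItem
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    start_urls = ['http://www.nigeria-law.org/A.A.%20Macaulay%20v.%20NAL%20Merchant%20Bank%20Ltd..htm']

    def parse(self, response):
        item = ScrapyprefectItem()
        ...
        yield item


# only runs when the file is executed directly, never on import
if __name__ == '__main__':
    process = CrawlerProcess(settings=get_project_settings())
    process.crawl('spider')
    process.start()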

Try changing your Scrapy and Twisted versions. It isn't a real solution, but it worked for me:

pip install Twisted==22.1.0
pip install Scrapy==2.5.1