
I've defined a Crawler class for crawling multiple spiders from a script.
For the spiders, instead of using pipelines, I defined a class, CrawlerPipeline, and used signals to connect its methods.
In CrawlerPipeline, some methods need to use class variables such as __ERRORS.
I'm unable to implement this correctly. Any suggestions or ideas would be very helpful.
For reference, I'm attaching the code snippets:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from .pipeline import CrawlerPipeline


class Crawler:

    def __init__(self) -> None:
        self.process = CrawlerProcess(settings={
            'ROBOTSTXT_OBEY': False,
            'REDIRECT_ENABLED': True,
            'SPIDER_MODULES': ['engine.crawler.spiders'],
            'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
            'USER_AGENT': 'Mozilla/5.0 (Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
        })

    def spawn(self, spider: str, **kwargs) -> None:
        self.process.crawl(spider, **kwargs)
        self.__connect_signals(spider)

    def run(self) -> None:
        self.process.start()

    def __connect_signals(self, spider: str) -> None:
        pipe = CrawlerPipeline()

        for crawler in self.process.crawlers:
            _set_signal = crawler.signals.connect

            if spider == 'a':
                _set_signal(pipe.add_meta_urls, signal=signals.spider_opened)

            if spider == 'b':
                ...

            if spider == 'c':
                _set_signal(pipe.add_meta_urls, signal=signals.spider_opened)

            if spider == 'd':
                ...
            
            # These lines are not working, above two also not working
            _set_signal(pipe.process_item, signal=signals.item_scraped)
            _set_signal(pipe.spider_closed, signal=signals.spider_closed)
            _set_signal(pipe.spider_error, signal=signals.spider_error)

And the pipeline (pipeline.py):

import json
from pathlib import Path
from collections import defaultdict

from api.database import Mongo


class CrawlerPipeline:

    __ITEMS = defaultdict(list)
    __ERRORS = list

    def process_item(self, item, spider):
        self.__ITEMS[spider.name].append(item)
        return item

    def add_meta_urls(self, spider):
        spider.start_urls = ['https://www.example.com']

    def spider_error(self, failure, response, spider):
        self.__ERRORS.append({
            'spider': spider.name,
            'url': response.url,
            'status': response.status,
            'error': failure.getErrorMessage(),
            'traceback': failure.getTraceback(),
        })

    def spider_closed(self, spider, reason):
        print(self.__ERRORS)
        Path("logs").mkdir(parents=True, exist_ok=True)
        ...
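
As it turns out (see the comments below), the handlers never fire because scrapy's signal dispatcher (pydispatch) holds only weak references to its receivers, so the `CrawlerPipeline` instance created locally in `__connect_signals` is garbage-collected as soon as that method returns. A minimal stand-alone illustration of that failure mode, using plain `weakref` instead of scrapy itself:

```python
import gc
import weakref


class Pipeline:
    def process_item(self, item):
        return item


def connect():
    pipe = Pipeline()
    # Simulate scrapy's SignalManager, which (via pydispatch) keeps only
    # a weak reference to the receiver object: no strong reference to
    # `pipe` survives this function.
    return weakref.ref(pipe)


ref = connect()
gc.collect()
print(ref() is None)  # → True: the pipeline was collected, so no
                      # signal could ever reach its methods
```

Keeping a strong reference (e.g. `self.pipe = CrawlerPipeline()` on the `Crawler` instance) is enough to prevent the collection.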
rish_hyun
  • I am not really sure what you're trying to achieve here, but your assignment for `CrawlerPipeline.__ERRORS` should probably be `list()` or `[]`. – Alexander Feb 15 '23 at 21:36
  • Thanks for pointing this out! I know there are many things that need to be fixed, but before that I need to connect the methods, which I'm unable to do. Currently, not a single method of `CrawlerPipeline` receives a signal, regardless of how it is defined. – rish_hyun Feb 15 '23 at 22:01
  • @Alexander I've created multiple spiders and want to use specific functions for each spider. So, instead of creating multiple pipelines, I defined a single class and connect specific methods according to each spider. – rish_hyun Feb 15 '23 at 22:05
  • 1
    I still don't fully understand, but what I can see is that your pipeline is likely being garbage collected as soon as it leaves the `__connect_signals` method, since you are not keeping a reference to it anywhere else in your `Crawler` class. – Alexander Feb 15 '23 at 22:44
  • 1
    @Alexander Thank you for helping me once again! You're right. After keeping a reference of `CrawlerPipeline` instance inside `Crawler` class, it is listening to signals and working as I expected! – rish_hyun Feb 16 '23 at 06:18
  • 1
    @Alexander Finally, I came at the conclusion of defining `from_cralwer` method in Pipeline instead of creating mess like this, therefore, I do not need to define methods in `Crawler` class for connecting signals – rish_hyun Feb 16 '23 at 06:21
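
The `from_crawler` approach the asker settled on can be sketched roughly as below. A minimal stand-in `SignalManager` and `FakeCrawler` replace scrapy's objects so the sketch runs standalone; in real code scrapy calls `from_crawler` itself (keeping the returned instance alive, which solves the garbage-collection problem), and the signals come from `scrapy.signals` rather than strings:

```python
class SignalManager:                    # stand-in for crawler.signals
    def __init__(self):
        self.receivers = {}

    def connect(self, receiver, signal):
        self.receivers.setdefault(signal, []).append(receiver)

    def send(self, signal, **kwargs):
        for receiver in self.receivers.get(signal, []):
            receiver(**kwargs)


class FakeCrawler:                      # stand-in for scrapy.crawler.Crawler
    def __init__(self):
        self.signals = SignalManager()


class CrawlerPipeline:
    def __init__(self):
        self.items = []

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy invokes this hook and keeps the returned instance,
        # so the pipeline is never garbage-collected.
        pipe = cls()
        crawler.signals.connect(pipe.process_item, signal="item_scraped")
        return pipe

    def process_item(self, item, **kwargs):
        self.items.append(item)


crawler = FakeCrawler()
pipe = CrawlerPipeline.from_crawler(crawler)
crawler.signals.send("item_scraped", item={"title": "example"})
print(len(pipe.items))  # → 1
```

Per-spider behaviour (the `if spider == 'a': ...` branches above) could then live inside `from_crawler`, which receives the crawler and can inspect its spider class before deciding which methods to connect.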

0 Answers