
I want to set JOBDIR from the spider's __init__, or dynamically when I invoke the spider. I want a different JOBDIR for each spider, just like FEED_URI in the example below:

class QtsSpider(scrapy.Spider):
    name = 'qts'
    custom_settings = {
        'FEED_URI': 'data_files/' + '%(site_name)s.csv',
        'FEED_FORMAT': "csv",
        #'JOBDIR': 'resume/' + '%(site_name2)s'
    }
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com']

    def __init__(self, **kw):
        super(QtsSpider, self).__init__(**kw)
        self.site_name = kw.get('site_name')

    def parse(self, response):
        pass  # the rest of our code

and we are calling that spider this way:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def main_function():
    all_spiders = ['spider1','spider2','spider3'] # 3 different spiders
    process = CrawlerProcess(get_project_settings())
    for spider_name in all_spiders:
        process.crawl('qts', site_name=spider_name)

    process.start()

main_function()

How can I achieve that dynamic creation of JOBDIR for different spiders, the way FEED_URI works above? Help will be appreciated.

Laxman

2 Answers


I found myself needing the same sort of functionality, mostly due to not wanting to repetitively add a custom JOBDIR to each spider's custom_settings property. So, I created a simple extension that subclasses the original SpiderState extension that Scrapy utilizes to save the state of crawls.

from scrapy import signals
from scrapy.exceptions import NotConfigured
from scrapy.extensions.spiderstate import SpiderState
import os


class SpiderStateManager(SpiderState):
    """
    SpiderState Purpose: Store and load spider state during a scraping job
    Added Purpose: Create a unique subdirectory within JOBDIR for each spider based on spider.name property
    Reasoning: Reduces repetitive code
    Usage: Instead of needing to add subdirectory paths in each spider.custom_settings dict
        Simply specify the base JOBDIR in settings.py and the subdirectories are automatically managed
    """

    def __init__(self, jobdir=None):
        self.jobdir = jobdir
        super(SpiderStateManager, self).__init__(jobdir=self.jobdir)

    @classmethod
    def from_crawler(cls, crawler):
        base_jobdir = crawler.settings['JOBDIR']
        if not base_jobdir:
            raise NotConfigured
        spider_jobdir = os.path.join(base_jobdir, crawler.spidercls.name)
        if not os.path.exists(spider_jobdir):
            os.makedirs(spider_jobdir)

        obj = cls(spider_jobdir)
        crawler.signals.connect(obj.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(obj.spider_opened, signal=signals.spider_opened)
        return obj

To enable it, remember to add the proper settings to your settings.py, like so:

EXTENSIONS = {
    # We want to disable the original SpiderState extension and use our own
    "scrapy.extensions.spiderstate.SpiderState": None,
    "spins.extensions.SpiderStateManager": 0
}
JOBDIR = "C:/Users/CaffeinatedMike/PycharmProjects/ScrapyDapyDoo/jobs"
CaffeinatedMike

Exactly as you have set site_name, you can pass another argument:

process.crawl('qts', site_name=spider_name, jobdir='dirname that you want to keep')

The jobdir value will be available as a spider attribute, so you can write:

def __init__(self, *args, **kwargs):
    # Spider.__init__ copies keyword arguments onto the instance,
    # which is what makes self.jobdir exist
    super(QtsSpider, self).__init__(*args, **kwargs)
    jobdir = getattr(self, 'jobdir', None)

    if jobdir:
        self.custom_settings['JOBDIR'] = jobdir
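Putting this together with the question's main_function, a minimal sketch (the resume/ prefix is carried over from the question's commented-out setting):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def main_function():
    all_spiders = ['spider1', 'spider2', 'spider3']
    process = CrawlerProcess(get_project_settings())
    for spider_name in all_spiders:
        # jobdir shows up on the spider as self.jobdir, per the answer above
        process.crawl('qts', site_name=spider_name,
                      jobdir='resume/' + spider_name)
    process.start()

One caveat: Scrapy applies the class-level custom_settings when the crawler is created, so if mutating it in __init__ turns out to be too late in your version, the same per-spider effect can be had by giving each crawl its own settings copy (get_project_settings().copy(), then settings.set('JOBDIR', ...) and process.crawl(scrapy.crawler.Crawler(QtsSpider, settings), ...)).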
Yash Pokar