
I use Scrapy with Scrapyd and send some custom settings through the API (using Postman).
Screenshot of the request: (image not included)

For example, I send the value of start_urls through the API and it works correctly.
The problem is that the settings I send through the API are not applied to my crawl.
For example, I send a CONCURRENT_REQUESTS value, but it is not applied. If I could access self inside the update_settings function the problem would be solved, but doing so raises an error.
My code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from kavoush.lxmlhtml import LxmlLinkExtractor as LinkExtractor
from kavoush.items import PageLevelItem

# Module-level dict, filled from the spider arguments in __init__
my_settings = {}

class PageSpider(CrawlSpider):
    name = 'github'

    def __init__(self, *args, **kwargs):
        self.start_urls = kwargs.get('host_name')
        self.allowed_domains = [self.start_urls]
        my_settings['CONCURRENT_REQUESTS'] = int(kwargs.get('num_con_req'))
        self.logger.info(f'CONCURRENT_REQUESTS? {my_settings}')

        # Rules must be assigned before super().__init__, which compiles them
        self.rules = (
            Rule(LinkExtractor(allow=(self.start_urls,), deny=(r'\.webp',), unique=True),
                 callback='parse',
                 follow=True),
        )
        super().__init__(*args, **kwargs)

    # custom_settings = {
    #     'CONCURRENT_REQUESTS': 4,
    # }

    @classmethod
    def update_settings(cls, settings):
        cls.custom_settings.update(my_settings)
        settings.setdict(cls.custom_settings or {}, priority='spider')

    def parse(self, response):
        loader = ItemLoader(item=PageLevelItem(), response=response)
        loader.add_xpath('page_source_html_lang', '//html/@lang')
        yield loader.load_item()

    def errback_domain(self, failure):
        self.logger.error(repr(failure))

Expectation:
How can I change settings through the API and Postman?
I used CONCURRENT_REQUESTS as an example above; in some cases up to 10 settings may need to be changed through the API.

Update:
If we remove my_settings = {} and update_settings and write the code as below, a KeyError: 'CONCURRENT_REQUESTS' occurs when running scrapyd-deploy, because the class body (and therefore custom_settings) is evaluated at import time, when CONCURRENT_REQUESTS has not yet been given a value.
Part of the code for that scenario:

class PageSpider(CrawlSpider):
    name = 'github'

    def __init__(self, *args, **kwargs):
        self.start_urls = kwargs.get('host_name')
        self.allowed_domains = [self.start_urls]
        my_settings['CONCURRENT_REQUESTS'] = int(kwargs.get('num_con_req'))
        self.logger.info(f'CONCURRENT_REQUESTS? {my_settings}')

        self.rules = (
            Rule(LinkExtractor(allow=(self.start_urls,), deny=(r'\.webp',), unique=True),
                 callback='parse',
                 follow=True),
        )
        super().__init__(*args, **kwargs)

    # Evaluated at class-definition time, before __init__ runs
    custom_settings = {
        'CONCURRENT_REQUESTS': my_settings['CONCURRENT_REQUESTS'],
    }
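The timing problem can be reproduced without Scrapy at all. This is a hypothetical minimal sketch, not the project code: a class body runs at import time, before any __init__ has had a chance to fill the dict, so the lookup raises KeyError.

```python
my_settings = {}  # empty at import time; __init__ has not run yet

failed = False
try:
    class PageSpider:
        # Evaluated immediately when the class statement executes
        custom_settings = {
            'CONCURRENT_REQUESTS': my_settings['CONCURRENT_REQUESTS'],
        }
except KeyError:
    failed = True

print('KeyError raised at class-definition time:', failed)
```

This is the same reason scrapyd-deploy fails: deploying imports the spider module, which executes the class body.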


Thanks to everyone.

Sardar
  • Seems you would use `kwargs.get()` for each setting you want to change. Have you done `print(kwargs.get('num_con_req'))` to debug that outside of `CONCURRENT_REQUESTS`? – OneCricketeer Mar 02 '23 at 18:16
  • @OneCricketeer `kwargs.get()` values cannot be used in `custom_settings`, or at least I don't know how. – Sardar Mar 02 '23 at 18:18
  • Sure it can. That's only a dictionary. `self.custom_settings = { 'example': kwargs.get('value') }` – OneCricketeer Mar 02 '23 at 18:22
  • @OneCricketeer Thank you, but `kwargs` is not recognized there. Please see the screenshot. https://i.stack.imgur.com/7mT0G.png – Sardar Mar 02 '23 at 18:30
  • Don't unindent that line... Also, the `super()` call should be the very first line after `def __init__` – OneCricketeer Mar 02 '23 at 18:45

2 Answers


I can say with complete confidence that Scrapy gives the user no way to update spider settings at runtime (e.g. from spider.__init__, as attempted in the code in the question).

By the time spider.__init__ is called, the Scrapy application has already initialised the process using the settings collected earlier: the base settings, the project settings, and the spider's custom_settings (which are hardcoded in the spider's source code).

Related issues on the scrapy GitHub issue tracker:

According to the scrapyd docs, to pass Scrapy settings you add setting=DOWNLOAD_DELAY=2 to the query when calling scrapyd's schedule.json endpoint. As far as I know, this is the only supported way to pass settings through scrapyd.
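For illustration, several settings can be sent in one schedule.json request by repeating the `setting` field. The sketch below only builds and prints the form-encoded body; the project name, spider argument values, and the localhost URL in the comment are assumptions to adapt to your deployment.

```python
from urllib.parse import urlencode

# Hypothetical project/spider names and argument values.
payload = [
    ('project', 'kavoush'),
    ('spider', 'github'),
    ('host_name', 'https://example.com'),
    # Each Scrapy setting goes in its own repeated "setting" field.
    ('setting', 'CONCURRENT_REQUESTS=4'),
    ('setting', 'DOWNLOAD_DELAY=2'),
]
body = urlencode(payload)
print(body)
# POST this body to e.g. http://localhost:6800/schedule.json as
# application/x-www-form-urlencoded (urllib.request, requests, or
# Postman's "x-www-form-urlencoded" tab all produce this shape).
```

In Postman this corresponds to adding two rows both named `setting` in the form-data body, which is what the screenshot in the question is doing for the other arguments.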

Georgiy
  • For this I'd try setting `setting=DOWNLOADER_MIDDLEWARES={downloader middlewares as a JSON dict string}`. If `DOWNLOADER_MIDDLEWARES` is defined as a string, Scrapy will try to apply `json.loads` to convert it to a dict. However, I am not an expert on scrapyd; I'd also check the scrapyd GitHub tracker https://github.com/scrapy/scrapyd or other Scrapy community channels – Georgiy Mar 03 '23 at 20:23
  • You're great. For example, should it be like this? `setting=DOWNLOADER_MIDDLEWARES={'kavoush.middlewares.IgnoreQueryFragmentRequestMiddleware': 400,}` Because written this way, the setting does not take effect. – Sardar Mar 03 '23 at 20:34
  • No. Try calling `json.loads` on it separately first. At least change `'` to double quotes `"` and remove the extra `,` after 400. – Georgiy Mar 03 '23 at 20:58
  • Thanks. It used to give an error, but now I send it as `setting=DOWNLOADER_MIDDLEWARES={"kavoush.middlewares.IgnoreExternalDomainResponseMiddleWare": 300}` and it doesn't give an error, but it isn't applied either. You said to call it separately; I did not understand that part. – Sardar Mar 03 '23 at 21:34
  • Excuse me, do you have any comment on the last comment? – Sardar Mar 06 '23 at 06:29
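The "call `json.loads` on it separately" advice from the comments above can be checked locally before sending anything to scrapyd. A small sketch, reusing the middleware path from the question:

```python
import json

# Double quotes, no trailing comma: valid JSON
good = '{"kavoush.middlewares.IgnoreExternalDomainResponseMiddleWare": 300}'
# Single quotes and a trailing comma: not valid JSON
bad = "{'kavoush.middlewares.IgnoreExternalDomainResponseMiddleWare': 300,}"

parsed = json.loads(good)
print(parsed)

bad_rejected = False
try:
    json.loads(bad)
except json.JSONDecodeError as e:
    bad_rejected = True
    print('invalid JSON:', e.msg)
```

If `json.loads` rejects the string locally, Scrapy's string-to-dict conversion of the setting would fail the same way.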

Remove my_settings = {} and the update_settings class method.

Did you try this?

class PageSpider(CrawlSpider):
    name = 'github'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = kwargs.get('host_name')
        self.allowed_domains = [self.start_urls]

        self.rules = (
            Rule(LinkExtractor(allow=(self.start_urls,), deny=(r'\.webp',), unique=True),
                 callback='parse',
                 follow=True),
        )

        # Keep the values on the instance instead of a module-level dict
        self.my_settings = {
            'CONCURRENT_REQUESTS': int(kwargs.get('num_con_req'))
        }
        self.logger.info(f'CONCURRENT_REQUESTS? {self.my_settings["CONCURRENT_REQUESTS"]}')

OneCricketeer
  • `custom_settings` is a dictionary that the Scrapy framework recognizes, but it does not recognize `my_settings`. `my_settings` receives the value, but how can it be used in `custom_settings`? – Sardar Mar 02 '23 at 19:02
  • See the photo: `CONCURRENT_REQUESTS` was printed in the log, but there is no `CONCURRENT_REQUESTS` in the "Overridden settings" section. https://i.stack.imgur.com/LwRK2.png – Sardar Mar 02 '23 at 19:04
  • I haven't used scrapy, so I don't know how `custom_settings` works – OneCricketeer Mar 03 '23 at 16:49