I am trying to scrape a website using a CrawlSpider. When I run the crawl from the command line I get TypeError: start_requests() takes 1 positional argument but 3 were given. I checked the middleware settings, where def process_start_requests(self, start_requests, spider) has three arguments. I referred to this question: "scrapy project middleware - TypeError: process_start_requests() takes 2 positional arguments but 3 were given", but could not solve the issue.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy import Request


class FpSpider(CrawlSpider):
    name = 'fp'
    allowed_domains = 'foodpanda.com.bd'

    rules = (Rule(LinkExtractor(allow=('product', 'pandamart')),
             callback='parse_items', follow=True, process_request='start_requests'),)

    def start_requests(self):
        yield Request(
            url='https://www.foodpanda.com.bd/darkstore/vbpl/pandamart-gulshan-2',
            meta=dict(playwright=True),
            headers={
                'sec-ch-ua': '"Google Chrome";v="105", "Not)A;Brand";v="8", "Chromium";v="105"',
                'Accept': 'application/json, text/plain, */*',
                'Referer': 'https://www.foodpanda.com.bd/',
                'sec-ch-ua-mobile': '?0',
                'X-FP-API-KEY': 'volo',
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
                'sec-ch-ua-platform': '"macOS"',
            },
        )

    def parse_items(self, response):
        item = {}
        item['name'] = response.css('h1.name::text').get()
        item['price'] = response.css('div.price::text').get()
        item['original_price'] = response.css('div.original-price::text').get()
        yield item

The error looks like this:

TypeError: start_requests() takes 1 positional argument but 3 were given

1 Answer

The problem is this statement: process_request='start_requests'.

start_requests is a reserved spider method: Scrapy calls it with no extra arguments to generate the very first requests. By passing its name to the Rule's process_request, you make CrawlSpider call it again with (request, response), i.e. three positional arguments counting self, which is exactly the TypeError you are seeing. If you want to enable Playwright for the subsequent requests, which I assume is what you are trying to do with process_request, you need a separate function with a different name.
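To see the arity mismatch in isolation, here is a minimal sketch in plain Python (no Scrapy involved) of what happens when a method that accepts only self is called with two extra positional arguments, as CrawlSpider does when it invokes process_request with (request, response):

class Demo:
    def start_requests(self):  # accepts only self
        pass

Demo().start_requests("request", "response")
# TypeError: start_requests() takes 1 positional argument but 3 were given
# (exact wording varies slightly between Python versions)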

See the following code:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def enable_playwright(request, response):
    # Re-issue the extracted request with Playwright enabled.
    request.meta["playwright"] = True
    return request


class FpSpider(CrawlSpider):
    name = "fp"
    allowed_domains = ["foodpanda.com.bd"]

    rules = (Rule(LinkExtractor(allow=('product', 'pandamart')),
                  callback='parse_items',
                  follow=True,
                  process_request=enable_playwright  # Note a different function name
                  # process_request='start_requests'  # THIS was the problem
                  ),)
    # Rest of the code here

Also note that allowed_domains must be a list of domain strings, not a single string.
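Unrelated to the TypeError, but for playwright=True in request.meta to have any effect, the scrapy-playwright download handler must be enabled in your project settings. A minimal sketch, assuming you are using the scrapy-playwright package (the handler and reactor paths below come from that package's documented setup):

# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"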

Upendra