
Is it possible to execute a CrawlSpider using the Playwright integration for Scrapy (scrapy-playwright)? I am trying the following script to run a CrawlSpider, but it does not scrape anything, and it does not show any error either.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class GumtreeCrawlSpider(CrawlSpider):
    name = 'gumtree_crawl'
    allowed_domains = ['www.gumtree.com']
    def start_requests(self):
        yield scrapy.Request(
            url='https://www.gumtree.com/property-for-sale/london/page',
            meta={"playwright": True}
        )
        return super().start_requests()

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class='grid-col-12']/ul[1]/li/article/a"), callback='parse_item', follow=False),
    )

    async def parse_item(self, response):
        yield {
            'Title': response.xpath("//div[@class='css-w50tn5 e1pt9h6u11']/h1/text()").get(),
            'Price': response.xpath("//h3[@itemprop='price']/text()").get(),
            'Add Posted': response.xpath("//dl[@class='css-16xsajr elf7h8q4'][1]/dd/text()").get(),
            'Links': response.url
        }
Raisul Islam

2 Answers


Requests extracted from the rule do not have the playwright=True meta key; that's a problem if they need to be rendered by the browser to have useful content. You could solve that by using Rule.process_request, something like:

def set_playwright_true(request, response):
    request.meta["playwright"] = True
    return request

class MyCrawlSpider(CrawlSpider):
    ...
    rules = (
        Rule(LinkExtractor(...), callback='parse_item', follow=False, process_request=set_playwright_true),
    )

Update after comment

  1. Make sure your URL is correct; I get no results for that particular one (remove /page?).

  2. Bring back your start_requests method; it seems the first page also needs to be downloaded using the browser.

  3. Unless marked explicitly (e.g. @classmethod, @staticmethod), Python instance methods receive the calling object as an implicit first argument. The convention is to call this self (e.g. def set_playwright_true(self, request, response)). However, if you do this, you will need to change the way you create the rule, either:

    • Rule(..., process_request=self.set_playwright_true) (this form only works where self is in scope, e.g. if the rules are built inside __init__)
    • Rule(..., process_request="set_playwright_true")

    From the docs: "process_request is a callable (or a string, in which case a method from the spider object with that name will be used)"

    My original example defines the processing function outside of the spider, so it's not an instance method. A sketch combining the three points appears below.
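
Putting the three points together, here is a minimal sketch of how the adjusted spider could look. It is only a sketch: the XPaths and site URL are copied from the question, the URL without the trailing /page is an assumption based on point 1, and it assumes scrapy-playwright is already enabled in the project settings. The string form of process_request (point 3) makes Scrapy look the method up on the spider instance.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class GumtreeCrawlSpider(CrawlSpider):
    name = 'gumtree_crawl'
    allowed_domains = ['www.gumtree.com']

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths="//div[@class='grid-col-12']/ul[1]/li/article/a"),
            callback='parse_item',
            follow=False,
            # String form: resolved to the spider method of the same name (point 3).
            process_request='set_playwright_true',
        ),
    )

    def start_requests(self):
        # Point 2: the listing page itself also needs to be rendered by the browser.
        yield scrapy.Request(
            # Point 1: '/page' removed from the URL (assumed correction).
            url='https://www.gumtree.com/property-for-sale/london',
            meta={"playwright": True},
        )

    def set_playwright_true(self, request, response):
        # As an instance method this takes self and is referenced by name in the Rule.
        request.meta["playwright"] = True
        return request

    async def parse_item(self, response):
        yield {
            'Title': response.xpath("//div[@class='css-w50tn5 e1pt9h6u11']/h1/text()").get(),
            'Price': response.xpath("//h3[@itemprop='price']/text()").get(),
            'Add Posted': response.xpath("//dl[@class='css-16xsajr elf7h8q4'][1]/dd/text()").get(),
            'Links': response.url,
        }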

elacuesta
  • As you suggested, I tried something like this, but it failed to execute: https://pastebin.com/raw/pyYzwa6v – Raisul Islam Mar 14 '22 at 16:06
  • Updated the answer in response to the above comment – elacuesta Mar 15 '22 at 18:15
  • If I only need the HTML page dynamically loaded by the Chromium browser, it works perfectly. But I need to interact with the page (clicking buttons, waiting) and then take the final HTML. I tried to mix your solution with the one from the documentation that passes the page to the parse method, by adding playwright_include_page = True to the meta dict and then taking it from the response object with page = response.meta["playwright_page"]. No error, but "playwright_page" is not defined. link -> https://github.com/scrapy-plugins/scrapy-playwright#:~:text=produces%20the%20same,.close() – Danilo Marques May 12 '22 at 20:54
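
For the combination described in the last comment, a hedged sketch follows. The key assumption (based on how scrapy-playwright's playwright_include_page / playwright_page meta keys are described in the README linked above) is that the page object is only attached to a response whose own request carried playwright_include_page=True, so the process_request helper has to set it on the link-extracted requests as well. The spider below is illustrative, with the XPath and URL reused from the question:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def set_playwright_true(request, response):
    # Ask for browser rendering AND for the Playwright page object on every
    # request extracted by the rule, so the callback can click and wait on it.
    request.meta["playwright"] = True
    request.meta["playwright_include_page"] = True
    return request


class GumtreePageCrawlSpider(CrawlSpider):
    name = 'gumtree_crawl_page'  # hypothetical name for this sketch
    allowed_domains = ['www.gumtree.com']

    def start_requests(self):
        # The listing page only needs rendering, not the page object.
        yield scrapy.Request(
            url='https://www.gumtree.com/property-for-sale/london',
            meta={"playwright": True},
        )

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths="//div[@class='grid-col-12']/ul[1]/li/article/a"),
            callback='parse_item',
            follow=False,
            process_request=set_playwright_true,
        ),
    )

    async def parse_item(self, response):
        # Present only because the request behind this response carried
        # playwright_include_page=True (set in set_playwright_true above).
        page = response.meta["playwright_page"]
        # ... click buttons / wait for selectors here before reading the final HTML ...
        await page.close()  # close the page to avoid leaking browser contexts
        yield {'Links': response.url}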

Following on from elacuesta's answer, I'd only add one thing: change your parse_item from an async def to a standard def.

def parse_item(self, response):

It defies everything I've read too, but that's what got me through.
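
Applied to the question's spider, that would just mean dropping the async keyword, with everything else (XPaths copied verbatim from the question) unchanged:

def parse_item(self, response):
    yield {
        'Title': response.xpath("//div[@class='css-w50tn5 e1pt9h6u11']/h1/text()").get(),
        'Price': response.xpath("//h3[@itemprop='price']/text()").get(),
        'Add Posted': response.xpath("//dl[@class='css-16xsajr elf7h8q4'][1]/dd/text()").get(),
        'Links': response.url
    }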

CwnAnnwn