
Is it possible to execute a CrawlSpider using the Playwright integration for Scrapy (scrapy-playwright)? I am trying the following script to run a CrawlSpider, but it does not scrape anything, and it does not show any error either.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class GumtreeCrawlSpider(CrawlSpider):
    name = 'gumtree_crawl'
    allowed_domains = ['www.gumtree.com']
    def start_requests(self):
        yield scrapy.Request(
            url='https://www.gumtree.com/property-for-sale/london/page',
            meta={"playwright": True}
        )
        return super().start_requests()

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class='grid-col-12']/ul[1]/li/article/a"), callback='parse_item', follow=False),
    )

    async def parse_item(self, response):
        yield {
            'Title': response.xpath("//div[@class='css-w50tn5 e1pt9h6u11']/h1/text()").get(),
            'Price': response.xpath("//h3[@itemprop='price']/text()").get(),
            'Add Posted': response.xpath("//dl[@class='css-16xsajr elf7h8q4'][1]/dd/text()").get(),
            'Links': response.url
        }
Raisul Islam

2 Answers


Requests extracted from the rule do not have the playwright=True meta key; that's a problem if they need to be rendered by the browser to have useful content. You could solve that by using Rule.process_request, something like:

def set_playwright_true(request, response):
    request.meta["playwright"] = True
    return request

class MyCrawlSpider(CrawlSpider):
    ...
    rules = (
        Rule(LinkExtractor(...), callback='parse_item', follow=False, process_request=set_playwright_true),
    )

Update after comment

  1. Make sure your URL is correct; I get no results for that particular one (remove /page?).

  2. Bring back your start_requests method; it seems the first page also needs to be downloaded using the browser.

  3. Unless marked explicitly (e.g. @classmethod, @staticmethod), Python instance methods receive the calling object as an implicit first argument. The convention is to call this self (e.g. def set_playwright_true(self, request, response)). However, if you do this, you will need to change the way you create the rule, either:

    • Rule(..., process_request=self.set_playwright_true) (this form only works where self is in scope, e.g. if the rules are built inside __init__)
    • Rule(..., process_request="set_playwright_true")

    From the docs: "process_request is a callable (or a string, in which case a method from the spider object with that name will be used)"

    My original example defines the processing function outside of the spider, so it's not an instance method. A sketch combining the three points appears below.
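
Putting the three points together, here is a minimal sketch of how the adjusted spider could look. It is only a sketch: the XPaths and site URL are copied from the question, the URL without the trailing /page is an assumption based on point 1, and it assumes scrapy-playwright is already enabled in the project settings. The string form of process_request (point 3) makes Scrapy look the method up on the spider instance.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class GumtreeCrawlSpider(CrawlSpider):
    name = 'gumtree_crawl'
    allowed_domains = ['www.gumtree.com']

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths="//div[@class='grid-col-12']/ul[1]/li/article/a"),
            callback='parse_item',
            follow=False,
            # String form: resolved to the spider method of the same name (point 3).
            process_request='set_playwright_true',
        ),
    )

    def start_requests(self):
        # Point 2: the listing page itself also needs to be rendered by the browser.
        yield scrapy.Request(
            # Point 1: '/page' removed from the URL (assumed correction).
            url='https://www.gumtree.com/property-for-sale/london',
            meta={"playwright": True},
        )

    def set_playwright_true(self, request, response):
        # As an instance method this takes self and is referenced by name in the Rule.
        request.meta["playwright"] = True
        return request

    async def parse_item(self, response):
        yield {
            'Title': response.xpath("//div[@class='css-w50tn5 e1pt9h6u11']/h1/text()").get(),
            'Price': response.xpath("//h3[@itemprop='price']/text()").get(),
            'Add Posted': response.xpath("//dl[@class='css-16xsajr elf7h8q4'][1]/dd/text()").get(),
            'Links': response.url,
        }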

elacuesta
  • As you suggested, I tried something like this, but it failed to execute: https://pastebin.com/raw/pyYzwa6v – Raisul Islam Mar 14 '22 at 16:06
  • Updated the answer in response to the above comment – elacuesta Mar 15 '22 at 18:15
  • If I only need the HTML page dynamically loaded by the Chromium browser, it works perfectly. But I need to interact with the page (clicking buttons, waiting) and then take the final HTML. I tried to mix your solution with the one from the documentation that passes the page to the parse method, by adding playwright_include_page = True to the meta dict and then taking it from the response object with page = response.meta["playwright_page"]. No error, but "playwright_page" is not defined. link -> https://github.com/scrapy-plugins/scrapy-playwright#:~:text=produces%20the%20same,.close() – Danilo Marques May 12 '22 at 20:54
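
For the combination described in the last comment, a hedged sketch follows. The key assumption (based on how scrapy-playwright's playwright_include_page / playwright_page meta keys are described in the README linked above) is that the page object is only attached to a response whose own request carried playwright_include_page=True, so the process_request helper has to set it on the link-extracted requests as well. The spider below is illustrative, with the XPath and URL reused from the question:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def set_playwright_true(request, response):
    # Ask for browser rendering AND for the Playwright page object on every
    # request extracted by the rule, so the callback can click and wait on it.
    request.meta["playwright"] = True
    request.meta["playwright_include_page"] = True
    return request


class GumtreePageCrawlSpider(CrawlSpider):
    name = 'gumtree_crawl_page'  # hypothetical name for this sketch
    allowed_domains = ['www.gumtree.com']

    def start_requests(self):
        # The listing page only needs rendering, not the page object.
        yield scrapy.Request(
            url='https://www.gumtree.com/property-for-sale/london',
            meta={"playwright": True},
        )

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths="//div[@class='grid-col-12']/ul[1]/li/article/a"),
            callback='parse_item',
            follow=False,
            process_request=set_playwright_true,
        ),
    )

    async def parse_item(self, response):
        # Present only because the request behind this response carried
        # playwright_include_page=True (set in set_playwright_true above).
        page = response.meta["playwright_page"]
        # ... click buttons / wait for selectors here before reading the final HTML ...
        await page.close()  # close the page to avoid leaking browser contexts
        yield {'Links': response.url}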

Following on from elacuesta's answer, I'd only add one thing: change your parse_item from an async def to a standard def.

def parse_item(self, response):

It defies everything I've read too, but that's what got me through.
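
Applied to the question's spider, that would just mean dropping the async keyword, with everything else (XPaths copied verbatim from the question) unchanged:

def parse_item(self, response):
    yield {
        'Title': response.xpath("//div[@class='css-w50tn5 e1pt9h6u11']/h1/text()").get(),
        'Price': response.xpath("//h3[@itemprop='price']/text()").get(),
        'Add Posted': response.xpath("//dl[@class='css-16xsajr elf7h8q4'][1]/dd/text()").get(),
        'Links': response.url
    }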

CwnAnnwn