Integrating Playwright with Scrapy scrapes only a single item

Question

I'm practicing integrating Playwright with Scrapy, but my scraper returns only a single item. I'm not sure whether my XPath is wrong, because I get the following output:

2022-01-04 21:41:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobsite.co.uk/jobs/Degree-Accounting-and-Finance>
{'items': 'Up to £26,000 per annum'}

I'm trying to scrape salaries from a dynamic website. Here's the script I have tried:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy_playwright.page import PageCoroutine
from scrapy.item import Field
from itemloaders.processors import TakeFirst
from scrapy.loader import ItemLoader

class EtsyItem(scrapy.Item):
    items = Field(output_processor = TakeFirst())

class EtsySpider(scrapy.Spider):
    name = 'job'
    start_urls = ['https://www.jobsite.co.uk/jobs/Degree-Accounting-and-Finance']
    
    custom_settings = {
        'USER_AGENT':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15'
    }
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url = url,
                callback = self.parse,
                meta= dict(
                    playwright = True,
                    playwright_include_page = True,
                    playwright_page_coroutines = [
                        PageCoroutine('wait_for_selector', 'div.row.job-results-row')
                        ]
                )
            )
    def parse(self, response):
       stuff = response.xpath("//div[@class='ResultsSectionContainer-sc-gdhf14-0 kteggz']")
       for items in stuff:
           loaders = ItemLoader(EtsyItem(), selector = items)
           loaders.add_xpath('items', '//dl[normalize-space()]//text()')
           yield loaders.load_item()

if __name__ == "__main__":
    process = CrawlerProcess(settings={
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        }, })
    process.crawl(EtsySpider)
    process.start()

score 0 · Answer 1 · answered Jan 26 '22 at 17:43


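You need to add a leading dot to your selector. An XPath that starts with // is evaluated from the document root, even when it is called on a nested selector, so inside the loop it matches nodes from the whole page instead of the current element. Starting it with .// makes it relative to the current items selector:

loaders.add_xpath('items', './/dl[normalize-space()]//text()')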

answered Jan 26 '22 at 17:43

Tristan
