
I tried to extract some data from a dynamically loaded JavaScript website using scrapy-playwright, but I got stuck at the very beginning.

The part of my settings.py file that is causing trouble is as follows:

# playwright
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

#TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
#ASYNCIO_EVENT_LOOP = 'uvloop.Loop'

When I add the following scrapy-playwright handler:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

Then I got:

scrapy.exceptions.NotSupported: Unsupported URL scheme 'https': The installed reactor 
(twisted.internet.selectreactor.SelectReactor) does not match the requested one (twisted.internet.asyncioreactor.AsyncioSelectorReactor)

When I add TWISTED_REACTOR:

TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

Then I got:

 raise TypeError(
TypeError: SelectorEventLoop required, instead got: <ProactorEventLoop running=False closed=False debug=False>

After that, when I add ASYNCIO_EVENT_LOOP = 'uvloop.Loop':

Then I got:

ModuleNotFoundError: No module named 'uvloop'

Lastly, installing 'uvloop' also fails:

pip install uvloop

Script

import scrapy
from scrapy_playwright.page import PageCoroutine

class ProductSpider(scrapy.Spider):
    name = 'product'

    def start_requests(self):
        yield scrapy.Request(
            'https://shoppable-campaign-demo.netlify.app/#/',
            meta={
                'playwright': True,
                'playwright_include_page': True,
                'playwright_page_coroutines': [
                    PageCoroutine("wait_for_selector", "div#productListing"),
                ]
            }
        )

    async def parse(self, response):
        pass
        # parses content

3 Answers


The developers of scrapy_playwright have suggested setting DOWNLOAD_HANDLERS and TWISTED_REACTOR directly in your script.

A similar comment is provided here

Here's a working script implementing just this:

import scrapy
from scrapy_playwright.page import PageCoroutine
from scrapy.crawler import CrawlerProcess

class ProductSpider(scrapy.Spider):
    name = 'product'

    def start_requests(self):
        yield scrapy.Request(
            'https://shoppable-campaign-demo.netlify.app/#/',
            callback = self.parse,
            meta={
                'playwright': True,
                'playwright_include_page': True,
                'playwright_page_coroutines': [
                    PageCoroutine("wait_for_selector", "div#productListing"),
                ]
            }
        )

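    async def parse(self, response):
        # Select the first product column on the rendered page.
        container = response.xpath("(//div[@class='col-md-6'])[1]")
        for items in container:
            yield {
                # This XPath starts with //, so it searches the whole document
                # rather than only inside `items`.
                'products': items.xpath("(//h3[@class='card-title'])[1]//text()").get()
            }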

if __name__ == "__main__":
    process = CrawlerProcess(
        settings={
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            "DOWNLOAD_HANDLERS": {
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
                "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            },
            "CONCURRENT_REQUESTS": 32,
            "FEED_URI":'Products.jl',
            "FEED_FORMAT":'jsonlines',
        }
    )
    process.crawl(ProductSpider)
    process.start()

And we get the following output:

{'products': 'Oxford Loafers'}
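
The two scrapy-playwright settings in the script above are plain values, so one way to avoid repeating them across several scripts is to build them in a small helper. This is only a sketch: the function name `playwright_settings` and the two module-level constants are hypothetical names introduced here, not part of Scrapy or scrapy-playwright.

```python
from typing import Any, Dict, Optional

# Values copied from the answer above.
PLAYWRIGHT_HANDLER = "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler"
ASYNCIO_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def playwright_settings(extra: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return the reactor and download-handler settings, plus any extras.

    Hypothetical helper: it only assembles a dict and does not import Scrapy.
    """
    settings: Dict[str, Any] = {
        "TWISTED_REACTOR": ASYNCIO_REACTOR,
        "DOWNLOAD_HANDLERS": {
            "http": PLAYWRIGHT_HANDLER,
            "https": PLAYWRIGHT_HANDLER,
        },
    }
    # Caller-supplied settings are merged last, so they can override the defaults.
    settings.update(extra or {})
    return settings


if __name__ == "__main__":
    print(playwright_settings({"CONCURRENT_REQUESTS": 32}))
```

The resulting dict could then be passed as `CrawlerProcess(settings=playwright_settings({...}))` in place of the inline dict.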

me.limes

You can call this Scrapy function:

from scrapy.utils.reactor import install_reactor

install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
JM_

If you are using Windows, then your problem is that scrapy-playwright (not Playwright itself) didn't support Windows at the time of this issue: https://github.com/scrapy-plugins/scrapy-playwright/issues/154
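
To check whether this applies to your machine, a quick diagnostic is to print the platform and the event loop class that asyncio creates by default. The sketch below uses only the standard library; the function name `describe_default_event_loop` is a hypothetical name chosen for illustration. On Windows the default is normally a ProactorEventLoop, which is the class named in the TypeError from the question.

```python
import asyncio
import sys


def describe_default_event_loop() -> dict:
    """Report the platform and the class of asyncio's default event loop."""
    loop = asyncio.new_event_loop()
    try:
        return {
            "platform": sys.platform,
            "loop_class": type(loop).__name__,
            "is_windows": sys.platform == "win32",
        }
    finally:
        # Close the loop we created so no resources are left open.
        loop.close()


if __name__ == "__main__":
    info = describe_default_event_loop()
    print(info)
    if info["is_windows"] and info["loop_class"] == "ProactorEventLoop":
        print("Default loop is ProactorEventLoop, as in the TypeError above.")
```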

Yunnosch