I am trying to use the scrapy-playwright library to scrape JavaScript-based websites. While working on this, I learned that it isn't compatible with Windows, which is a known issue. Here is a minimal reproducible example:

import scrapy
from asyncio.windows_events import *
from scrapy.crawler import CrawlerProcess


class Play1Spider(scrapy.Spider):
    name = 'play1'
 
    def start_requests(self):
        yield scrapy.Request(
            "http://testphp.vulnweb.com/",
            callback=self.parse,
            meta={
                'playwright': True,
                'playwright_include_page': True,
            },
        )

    async def parse(self, response):
        yield {
            'text': response.text
        }

if __name__ == "__main__":
    process = CrawlerProcess(
        settings={
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            "DOWNLOAD_HANDLERS": {
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
                "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            },
            "CONCURRENT_REQUESTS": 32,
            "FEED_URI":'Products.jl',
            "FEED_FORMAT":'jsonlines',
        }
    )
    process.crawl(Play1Spider)
    process.start()

The following is the error stack trace:

2022-07-12 16:58:42 [scrapy.core.engine] INFO: Spider opened
2022-07-12 16:58:43 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-07-12 16:58:43 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-07-12 16:58:43 [scrapy-playwright] INFO: Starting download handler
2022-07-12 16:58:43 [scrapy-playwright] INFO: Starting download handler
2022-07-12 16:58:43 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-3' coro=<Connection.run() done, defined at C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_connection.py:212> exception=NotImplementedError()>
Traceback (most recent call last):
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_connection.py", line 219, in run
    await self._transport.connect()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_transport.py", line 139, in connect
    raise exc
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_transport.py", line 127, in connect
    self._proc = await asyncio.create_subprocess_exec(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\subprocess.py", line 218, in create_subprocess_exec   
    transport, protocol = await loop.subprocess_exec(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 1652, in subprocess_exec        
    transport = await self._make_subprocess_transport(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 493, in _make_subprocess_transport
    raise NotImplementedError
NotImplementedError
2022-07-12 16:58:43 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-4' coro=<Connection.run() done, defined at C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_connection.py:212> exception=NotImplementedError()>
Traceback (most recent call last):
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_connection.py", line 219, in run
    await self._transport.connect()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_transport.py", line 139, in connect
    raise exc
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_transport.py", line 127, in connect
    self._proc = await asyncio.create_subprocess_exec(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\subprocess.py", line 218, in create_subprocess_exec   
    transport, protocol = await loop.subprocess_exec(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 1652, in subprocess_exec        
    transport = await self._make_subprocess_transport(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 493, in _make_subprocess_transport
    raise NotImplementedError
NotImplementedError
2022-07-12 16:58:43 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ScrapyPlaywrightDownloadHandler._engine_started of <scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler object at 0x000001B089014970>>
Traceback (most recent call last):
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\twisted\internet\defer.py", line 1030, in adapt
    extracted = result.result()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\scrapy_playwright\handler.py", line 127, in _launch
    self.playwright = await self.playwright_context_manager.start()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\async_api\_context_manager.py", line 51, in start     
    return await self.__aenter__()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\async_api\_context_manager.py", line 46, in __aenter__
    playwright = AsyncPlaywright(next(iter(done)).result())
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_connection.py", line 219, in run
    await self._transport.connect()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_transport.py", line 139, in connect
    raise exc
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_transport.py", line 127, in connect
    self._proc = await asyncio.create_subprocess_exec(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\subprocess.py", line 218, in create_subprocess_exec   
    transport, protocol = await loop.subprocess_exec(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 1652, in subprocess_exec        
    transport = await self._make_subprocess_transport(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 493, in _make_subprocess_transport
    raise NotImplementedError
NotImplementedError
2022-07-12 16:58:43 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ScrapyPlaywrightDownloadHandler._engine_started of <scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler object at 0x000001B08964B8E0>>
Traceback (most recent call last):
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\twisted\internet\defer.py", line 1030, in adapt
    extracted = result.result()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\scrapy_playwright\handler.py", line 127, in _launch
    self.playwright = await self.playwright_context_manager.start()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\async_api\_context_manager.py", line 51, in start     
    return await self.__aenter__()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\async_api\_context_manager.py", line 46, in __aenter__
    playwright = AsyncPlaywright(next(iter(done)).result())
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_connection.py", line 219, in run
    await self._transport.connect()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_transport.py", line 139, in connect
    raise exc
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_transport.py", line 127, in connect
    self._proc = await asyncio.create_subprocess_exec(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\subprocess.py", line 218, in create_subprocess_exec   
    transport, protocol = await loop.subprocess_exec(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 1652, in subprocess_exec        
    transport = await self._make_subprocess_transport(
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 493, in _make_subprocess_transport
    raise NotImplementedError
NotImplementedError
2022-07-12 16:58:48 [scrapy.core.scraper] ERROR: Error downloading <GET http://testphp.vulnweb.com/>
Traceback (most recent call last):
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\twisted\internet\defer.py", line 1656, in _inlineCallbacks       
    result = current_context.run(
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\twisted\python\failure.py", line 514, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\scrapy\core\downloader\middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\twisted\internet\defer.py", line 1030, in adapt
    extracted = result.result()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\scrapy_playwright\handler.py", line 247, in _download_request    
    page = await self._create_page(request)
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\scrapy_playwright\handler.py", line 185, in _create_page
    context = await self._create_browser_context(
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\scrapy_playwright\handler.py", line 160, in _create_browser_context
    await self._maybe_launch_browser()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\scrapy_playwright\handler.py", line 144, in _maybe_launch_browser
    logger.info(f"Launching browser {self.browser_type.name}")
AttributeError: 'ScrapyPlaywrightDownloadHandler' object has no attribute 'browser_type'
2022-07-12 16:58:48 [scrapy.core.engine] INFO: Closing spider (finished)
2022-07-12 16:58:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/builtins.AttributeError': 1,
 'downloader/request_bytes': 229,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'elapsed_time_seconds': 5.260977,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 7, 12, 11, 28, 48, 293797),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 5,
 'log_count/INFO': 12,
 'log_count/WARNING': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 7, 12, 11, 28, 43, 32820)}
2022-07-12 16:58:48 [scrapy.core.engine] INFO: Spider closed (finished)
2022-07-12 16:58:48 [scrapy-playwright] INFO: Closing download handler
2022-07-12 16:58:48 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method DownloadHandlers._close of <scrapy.core.downloader.handlers.DownloadHandlers object at 0x000001B088FDA920>>
Traceback (most recent call last):
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\twisted\internet\defer.py", line 1656, in _inlineCallbacks       
    result = current_context.run(
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\twisted\python\failure.py", line 514, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\scrapy\core\downloader\handlers\__init__.py", line 81, in _close 
    yield dh.close()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\twisted\internet\defer.py", line 1656, in _inlineCallbacks       
    result = current_context.run(
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\scrapy_playwright\handler.py", line 229, in close
    yield deferred_from_coro(self._close())
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\twisted\internet\defer.py", line 1030, in adapt
    extracted = result.result()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\scrapy_playwright\handler.py", line 237, in _close      
    await self.playwright_context_manager.__aexit__()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\async_api\_context_manager.py", line 54, in __aexit__
    await self._connection.stop_async()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_connection.py", line 230, in stop_async
    self._transport.request_stop()
  File "C:\Users\user\Scrapy Projects\play10\lib\site-packages\playwright\_impl\_transport.py", line 107, in request_stop
    assert self._output
AttributeError: 'PipeTransport' object has no attribute '_output'

I have already gone through similar questions and their solutions but couldn't find anything conclusive. I know that it might work on WSL or macOS, but I need to build a solution for Windows at the moment. I am looking for any suggestions or workarounds from anyone who has faced a similar problem. I am also open to trying other libraries, if there are any.

PS: I have already tried Selenium, Scrapy-puppeteer (similar problem), and Scrapy-Splash.

Looking forward to hearing your suggestions and feedback. TIA

hs27
  • Did you install playwright? i.e. ```playwright install```, as from your first error it looks like the browser is not available. Perhaps specify which browser you are using in the settings.py. The second error looks to me that you cannot get an output because the browser was not opened, so no data was extracted. – joe_bill.dollar Jul 12 '22 at 20:37
  • Yes, I have already installed playwright and the browsers required, I can confirm that because both scrapy and playwright individually are working fine. Only when I try to integrate both via scrapy-playwright, the problem begins. – hs27 Jul 14 '22 at 07:24

2 Answers

The Windows implementation of asyncio can use two event loop implementations:
  • SelectorEventLoop, default before Python 3.8, required when using Twisted.
  • ProactorEventLoop, default since Python 3.8, cannot work with Twisted.

So on Python 3.8+ the event loop class needs to be changed.
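
To check which loop class a given interpreter picks by default, and whether that loop can start child processes (the call that raises NotImplementedError in the traceback in the question), a small probe like the following may help. This is only a sketch: `probe_event_loop` and `_spawn_python` are names made up for this example and are not part of Scrapy or Playwright.

```python
import asyncio
import sys
from typing import Tuple


async def _spawn_python() -> int:
    # Start a trivial child process. On a loop without subprocess
    # support, create_subprocess_exec raises NotImplementedError.
    proc = await asyncio.create_subprocess_exec(sys.executable, "-c", "pass")
    return await proc.wait()


def probe_event_loop() -> Tuple[str, bool]:
    """Return (loop class name, whether the loop can spawn subprocesses)."""
    # new_event_loop() honours whatever event loop policy is active.
    loop = asyncio.new_event_loop()
    try:
        try:
            loop.run_until_complete(_spawn_python())
            supported = True
        except NotImplementedError:
            supported = False
        return type(loop).__name__, supported
    finally:
        loop.close()


if __name__ == "__main__":
    name, supported = probe_event_loop()
    print(f"{sys.platform}: {name}, subprocess support: {supported}")
```

According to the asyncio documentation, SelectorEventLoop has no subprocess support on Windows, so the probe would be expected to report False there once the selector policy is active.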

Changed in Scrapy 2.6.0: The event loop class is changed automatically when you change the TWISTED_REACTOR setting or call install_reactor().

To change the event loop class manually, call the following code before installing the reactor:

import asyncio
asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

You can put this in the same function that installs the reactor, if you do that yourself, or in some code that runs before the reactor is installed, e.g. settings.py.
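
As a rough sketch of the second option, the policy change can be wrapped in a small helper that does nothing on other platforms, since `asyncio.WindowsSelectorEventLoopPolicy` only exists on Windows. The helper name `use_selector_event_loop_policy` is hypothetical and chosen just for this example.

```python
import asyncio
import sys


def use_selector_event_loop_policy() -> bool:
    """Switch asyncio to the selector event loop policy on Windows.

    Returns True if the policy was changed, False on other platforms.
    """
    if sys.platform != "win32":
        # WindowsSelectorEventLoopPolicy is not defined outside Windows.
        return False
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    return True


# Run this before the reactor is installed, e.g. at the top of settings.py
# or before creating the CrawlerProcess.
changed = use_selector_event_loop_policy()
```

Keep in mind that, per the asyncio documentation, the selector loop has no subprocess support on Windows, and the traceback in the question fails inside asyncio.create_subprocess_exec. So this setting on its own may not be enough for scrapy-playwright, which has to start the Playwright driver as a child process.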

Documentation: The Windows implementation of asyncio

Dmytro

I had the same issue today; Playwright does have a problem running on Windows. I don't know the specific reason why it doesn't work on Windows 10. Some say that it starts working after installing the latest version of Node.js, but in my case I ran it on WSL from VS Code instead. WSL is the Windows Subsystem for Linux. It is somewhat like running VS Code in a virtual machine, except that with a VM you have to allocate separate memory and processing, whereas WSL works more like Docker. I can't explain all the steps here, so here is a link to a video that walks through the setup: https://www.youtube.com/watch?v=oF6gLyhQDdw. scrapy-playwright works fine on WSL.