I wrote a simple program that only logs requests and responses, once with pyppeteer in Python, and (after I ran into the issues I will describe next) once with puppeteer in JavaScript. Here is the JS code:
const puppeteer = require('puppeteer');

const url = 'https://www.twitch.tv/';

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.setRequestInterception(true);
    page.on('request', request => {
        console.log("REQUEST: " + request.url());
        request.continue();
    });
    page.on('response', response => {
        console.log("RESPONSE: " + response.url());
    });
    await page.goto(url, {waitUntil: ["networkidle0", "domcontentloaded"]});
    await browser.close();
})();
And here is the Python code:
import asyncio
from pyppeteer import launch

url = "https://www.twitch.tv/"

async def handle_request(request):
    print("REQUEST: ", request.url)
    await request.continue_()

async def handle_response(response):
    print("RESPONSE: ", response.url)

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.setRequestInterception(True)
    page.on('response', handle_response)
    page.on('request', handle_request)
    await page.goto(url, waitUntil=["networkidle0", "domcontentloaded"])
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
I then compare their output using:
> python3 script.py | grep "wasm"
REQUEST: https://static.twitchcdn.net/assets/wasmworker.min-[redacted].js
REQUEST: https://static.twitchcdn.net/assets/wasmworker.min-[redacted].wasm
> node script.js | grep "wasm"
(nothing)
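Grepping for one substring only shows part of the difference. A minimal sketch of a fuller comparison, assuming the output of each script has first been saved to a file and read in as a list of lines (the helper names `parse_log` and `compare_logs` are made up for illustration and are not part of either script):

```python
def parse_log(lines):
    """Return (request_urls, response_urls) parsed from the logged lines."""
    requests, responses = set(), set()
    for line in lines:
        # Lines look like "REQUEST: <url>" or "RESPONSE: <url>";
        # split at the first colon only, since the URL contains colons too.
        kind, _, url = line.partition(":")
        url = url.strip()
        if kind == "REQUEST":
            requests.add(url)
        elif kind == "RESPONSE":
            responses.add(url)
    return requests, responses


def compare_logs(lines_a, lines_b):
    """Compare two runs: URLs requested in only one run, and
    requests in run A that never got a logged response."""
    req_a, resp_a = parse_log(lines_a)
    req_b, _ = parse_log(lines_b)
    return {
        "only_in_a": sorted(req_a - req_b),
        "only_in_b": sorted(req_b - req_a),
        "unanswered_in_a": sorted(req_a - resp_a),
    }
```

Called with the lines of the Python log as `lines_a` and the lines of the JS log as `lines_b`, this would list every URL that only one of the two runs requested, not just the ones containing "wasm".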
My issues with this:
(1) Why am I getting different results at all? Shouldn't Puppeteer and Pyppeteer use the exact same browser in the background, and (hopefully) the same default settings (such as the viewport)?
(2) Even though the Python version works better (subjectively, for my use case), as it logs the requests, why doesn't it log the corresponding responses? When running in non-headless mode, both requests show up in the developer console with a response code of 200. What could cause the responses to not be logged by pyppeteer?
I tried using different viewport sizes and enabling/disabling the cache, to no avail.
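One thing that might help narrow down (2) is to also log the `requestfailed` and `requestfinished` events, which pyppeteer emits on the page alongside `request` and `response`. A minimal sketch, assuming the `page` object from the Python script above; the helper name `attach_outcome_logging` and the `log` list are hypothetical:

```python
def attach_outcome_logging(page, log):
    """Record how each request ended, so that a missing response
    can be told apart from a request that failed outright."""

    def on_failed(request):
        # failure() returns a dict with an 'errorText' key, or None
        failure = request.failure()
        reason = failure.get("errorText") if failure else None
        log.append(("FAILED", request.url, reason))

    def on_finished(request):
        log.append(("FINISHED", request.url, None))

    page.on("requestfailed", on_failed)
    page.on("requestfinished", on_finished)
```

This would be called once before `page.goto(...)`, with `log` printed after the page has loaded. If the wasm URLs show up in neither list, that would suggest the events for those requests never reach the page at all.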
EDIT: Okay, the reason for (1) seems to be that pyppeteer is just outdated. Regarding (2): twitch.tv does not serve the file I am grepping for when running with puppeteer (the streams do not work either), even though I set up puppeteer to use the same Chrome executable and User-Agent string as when I visit the page manually, where it works. I thought it might have something to do with puppeteer disabling extensions, as the debug console shows some errors with cast_sender.js from the Chromecast extension, but starting Chrome myself with the exact same arguments that puppeteer uses does load the files of interest.