
I'm using crawlee@3.0.3 (not released yet, installed from GitHub), and I'm trying to block specific resources from loading with playwrightUtils.blockRequests (which isn't available in previous versions). When I try the code suggested in the official repo, it works as expected:

import { launchPlaywright, playwrightUtils } from 'crawlee';

const browser = await launchPlaywright();
const page = await browser.newPage();
await playwrightUtils.blockRequests(page, {
    // extraUrlPatterns: ['adsbygoogle.js'],
});
await page.goto('https://cnn.com');
await page.screenshot({ path: 'cnn_no_images.png' });
await browser.close();

I can see from the screenshot that the images aren't loaded. The problem appears when I do the same thing with PlaywrightCrawler:

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 3,
    async requestHandler({ page, request }) {
        console.log(`Processing: ${request.url}`);
        await playwrightUtils.blockRequests(page);
        await page.screenshot({ path: 'cnn_no_images2.png' });
    },
});

This way, the resources are not blocked. My guess is that blockRequests needs a page created through launchPlaywright to work, and I don't see a way to pass that to PlaywrightCrawler. blockRequests has been available for Puppeteer for a while, so maybe someone has tried this before.

I've also tried "route interception", but again, I couldn't make it work with PlaywrightCrawler.

Martin Adámek
matrs

1 Answer


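The requestHandler runs only after the page has already navigated, so calling blockRequests there is too late to affect the page load. You can set up any listeners or other code before navigation by using preNavigationHooks, like this: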


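import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 3,
    // Hooks run before page.goto(), so the blocking is active during the load.
    preNavigationHooks: [async ({ page }) => {
        await playwrightUtils.blockRequests(page);
    }],
    // By the time this runs, the page has already been loaded.
    async requestHandler({ page, request }) {
        console.log(`Processing: ${request.url}`);
        await page.screenshot({ path: 'cnn_no_images2.png' });
    },
});

// Start URL taken from the question.
await crawler.run(['https://cnn.com']);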
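
Since the question also mentions route interception, the same hook placement should work for that approach too. Below is a minimal sketch, not something taken from the Crawlee docs: the names shouldBlock and blockWithRoutes, the list of resource types, and the URL substring are illustrative assumptions, while page.route, route.request, route.abort and route.continue are standard Playwright calls.

```javascript
// Resource types to drop. This particular list is an assumption for illustration.
const BLOCKED_TYPES = new Set(['image', 'media', 'font']);

// URL substrings to drop. 'adsbygoogle.js' comes from the question's commented-out example.
const BLOCKED_URL_PARTS = ['adsbygoogle.js'];

// Hypothetical helper: decides whether a single request should be aborted.
function shouldBlock(url, resourceType) {
    return BLOCKED_TYPES.has(resourceType)
        || BLOCKED_URL_PARTS.some((part) => url.includes(part));
}

// Hypothetical pre-navigation hook: registers a catch-all Playwright route
// before the crawler navigates, so the very first page load is filtered.
async function blockWithRoutes({ page }) {
    await page.route('**/*', async (route) => {
        const req = route.request();
        if (shouldBlock(req.url(), req.resourceType())) {
            await route.abort();
        } else {
            await route.continue();
        }
    });
}
```

To use it, pass the function in the crawler options as preNavigationHooks: [blockWithRoutes] in place of the blockRequests hook. Interception goes through Node.js for every request, so it is likely slower than blockRequests, but it lets you decide per request.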
pocesar