I'm using crawlee@3.0.4, following the quick tutorial here to spin up a scraper.
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);
        await page.waitForSelector('.ActorStorePagination-pages a');
        await enqueueLinks({
            selector: '.ActorStorePagination-pages > a',
            label: 'LIST',
        });
    },
});
I now need to extend the enqueueLinks function that is passed to the requestHandler. The goal is to add some custom logic whenever I add new URLs to the queue. An example use case is keeping track of how many links of a certain type I have found, so that I can do additional logging or publish messages to other services. Is there any way to do that?
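To make the goal concrete, here is a minimal sketch of the behaviour I'm after. It is only an illustration: `withTracking`, `recordEnqueued`, and the `EnqueueFn` / `EnqueueOptions` / `EnqueueResult` types are hypothetical names I made up, and the types are simplified stand-ins rather than Crawlee's real signatures.

```typescript
// Simplified stand-ins for illustration only; these are NOT Crawlee's real types.
type EnqueueOptions = { selector?: string; label?: string };
type EnqueueResult = { processedRequests: unknown[] };
type EnqueueFn = (options?: EnqueueOptions) => Promise<EnqueueResult>;

// Hypothetical helper: wraps an enqueue function so that a callback runs
// after every call, receiving the label and the number of processed requests.
function withTracking(
    enqueue: EnqueueFn,
    onEnqueued: (label: string, count: number) => void,
): EnqueueFn {
    return async (options) => {
        const result = await enqueue(options);
        onEnqueued(options?.label ?? 'UNLABELED', result.processedRequests.length);
        return result;
    };
}

// Running totals per label, kept outside the handler so they persist across requests.
const queuedPerLabel = new Map<string, number>();

// Hypothetical callback: adds the newly processed count to the label's total.
function recordEnqueued(label: string, count: number): void {
    queuedPerLabel.set(label, (queuedPerLabel.get(label) ?? 0) + count);
}
```

Inside the requestHandler I would presumably call something like `withTracking(enqueueLinks, recordEnqueued)` and use the returned function in place of `enqueueLinks`, assuming the real function's types can be made to line up with this simplified shape. What I don't know is whether the crawler offers a cleaner place to hook this in.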
I have tried extending the PlaywrightCrawler class instead. My problem with that approach is that, because the requestHandler is passed in wrapped inside an options object, I cannot access the crawler's properties from it.
class CustomCrawler extends PlaywrightCrawler {
    categoryPagesQueued: string[];

    constructor() {
        super({
            requestHandler: async ({ page, request, enqueueLinks }) => {
                console.log(`Processing: ${request.url}`);
                // Wait for the actor cards to render,
                // otherwise enqueueLinks wouldn't enqueue anything.
                await page.waitForSelector('.ActorStorePagination-pages a');
                // Error: this does not access the CustomCrawler.categoryPagesQueued
                this.categoryPagesQueued.push("foo");
                customLogic(this.categoryPagesQueued);
                await enqueueLinks({
                    selector: '.ActorStorePagination-pages > a',
                    label: 'LIST',
                });
            },
        });
    }
}