Questions tagged [crawlee]
17 questions
4
votes
1 answer
Extending Crawlee scraper requestHandler
I'm using crawlee@3.0.4, following the quick tutorial here to spin up a scraper.
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
requestHandler: async ({ page, request, enqueueLinks }) => {
…

toanphan19
- 88
- 7
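A common way to extend a crawlee requestHandler beyond the tutorial's single function is the built-in router, which dispatches on request labels. A minimal sketch (the 'DETAIL' label and the selector are illustrative, not taken from the question):

import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// Default handler: runs for requests without a label; enqueues detail pages.
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.item', label: 'DETAIL' });
});

// Label handler: runs for every request enqueued with label 'DETAIL'.
router.addHandler('DETAIL', async ({ page, request, log }) => {
    log.info(`Scraping ${request.url}: ${await page.title()}`);
});

const crawler = new PlaywrightCrawler({ requestHandler: router });
await crawler.run(['https://example.com']);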
2
votes
2 answers
How to reset crawlee URL cache?
I'm running a crawler that is triggered via an Express.js route.
When I call the same route again, the crawler runs but reports that all requests have already been handled, even though I'm removing the './storage' folder.
I read the documentation but can't seem to…

Justin Young
- 2,393
- 3
- 36
- 62
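One likely explanation is crawlee's deduplication: the default request queue remembers handled URLs, and deleting './storage' while the process still holds the queue open doesn't reset it. A sketch of resetting the queue programmatically before each run (the route name is illustrative):

import express from 'express';
import { PlaywrightCrawler, RequestQueue, purgeDefaultStorages } from 'crawlee';

const app = express();

app.get('/crawl', async (_req, res) => {
    // Drop the default queue so previously handled URLs are forgotten.
    const queue = await RequestQueue.open();
    await queue.drop();
    await purgeDefaultStorages();

    const crawler = new PlaywrightCrawler({
        requestHandler: async ({ page, enqueueLinks }) => { /* ... */ },
    });
    await crawler.run(['https://example.com']);
    res.send('done');
});

app.listen(3000);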
1
vote
1 answer
Blocking specific resources (css, images, videos, etc) using crawlee and playwright
I'm using crawlee@3.0.3 (not yet released, installed from GitHub), and I'm trying to block specific resources from loading with playwrightUtils.blockRequests (which isn't available in earlier versions). When I try the code suggested in the official repo, it…

matrs
- 59
- 1
- 6
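For reference, the usual shape of a blockRequests call is a preNavigationHooks entry, since the patterns must be registered before navigation. A sketch (the patterns are illustrative; note that blockRequests relies on the Chrome DevTools Protocol, so as far as I know it only works with Chromium):

import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Patterns are matched as substrings of the request URL.
            await playwrightUtils.blockRequests(page, {
                extraUrlPatterns: ['.css', '.jpg', '.png', '.svg', '.mp4'],
            });
        },
    ],
    requestHandler: async ({ page }) => { /* ... */ },
});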
0
votes
0 answers
Why does my web crawler work on localhost but not in a Docker container?
I'm trying to create a web crawler using crawlee and apify with Node.js. The crawler works when I run it on localhost, but when I run it in a Docker container, I get a timeout error waiting for a locator. The Docker container was built correctly but…
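Timeouts that appear only inside Docker are often environmental rather than code bugs: the container's Chromium may lack sandbox privileges or shared memory, or simply render slower than the host. A sketch of the usual mitigations (the flag values are typical choices, not taken from the question):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    navigationTimeoutSecs: 120, // give the slower containerized browser more headroom
    launchContext: {
        launchOptions: {
            // Common flags for running Chromium inside a container.
            args: ['--no-sandbox', '--disable-dev-shm-usage'],
        },
    },
    requestHandler: async ({ page }) => { /* ... */ },
});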
0
votes
0 answers
Crawlee no such file or directory storage/request_queues/default/[id].json
I'm trying to run a fairly simple scraper, but I keep getting the error in the title. I want to scrape around 64,000 pages, but I hit the 'no such file' error every time. Setting waitForAllRequestsToBeAdded to true doesn't fix the issue. I get the…

sbrass
- 905
- 7
- 12
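With queues this large, one mitigation that sometimes helps is feeding the queue in batches rather than handing addRequests all 64,000 URLs at once, so the on-disk queue files are created at a steadier pace. A sketch under that assumption (the URL list is a placeholder):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request }) => { /* ... */ },
});

// Placeholder list standing in for the real 64,000 URLs.
const urls = Array.from({ length: 64_000 }, (_, i) => `https://example.com/page/${i}`);

const BATCH = 1000;
for (let i = 0; i < urls.length; i += BATCH) {
    await crawler.addRequests(urls.slice(i, i + BATCH), { waitForAllRequestsToBeAdded: true });
}
await crawler.run();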
0
votes
1 answer
How to get Apify to rotate proxies round-robin for every request?
I'm using Apify to crawl a job board with multiple requests running concurrently. I have an array of proxies, but my queued URLs aren't using the proxies in round-robin fashion even though I use this setup. How can I set things up so that every new URL that gets…

Justin Young
- 2,393
- 3
- 36
- 62
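One thing worth checking: crawlee's session pool pins a proxy to a session, so with sessions enabled the same proxy is reused across many requests. Disabling the pool should bring rotation closer to round-robin per request; a sketch with placeholder proxy URLs:

import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy-1:8000', 'http://proxy-2:8000'], // placeholders
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    useSessionPool: false,            // sessions would otherwise pin a proxy
    persistCookiesPerSession: false,
    requestHandler: async ({ request, proxyInfo, log }) => {
        log.info(`${request.url} via ${proxyInfo?.url}`); // verify the rotation
    },
});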
0
votes
1 answer
How to scrape a webpage with infinite scroll using crawlee/apify?
I am trying to scrape some data from Twitch. The problem I am facing is that the site uses infinite scroll, so I am only able to get data from the first page.
I have tried scrolling with the built-in utility infiniteScroll, but it scrolls after…

Syed Hasnain
- 1
- 1
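For reference, infiniteScroll takes tuning options, and on virtualized feeds (Twitch appears to be one) items are removed from the DOM as you scroll past them, so data may need to be collected during scrolling rather than after it. A sketch with illustrative option values:

import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        await playwrightUtils.infiniteScroll(page, {
            timeoutSecs: 120,      // hard stop after two minutes
            waitForSecs: 5,        // how long to wait for new content before finishing
            scrollDownAndUp: true, // some feeds only load on up-and-down movement
        });
        // ...extract whatever is in the DOM once scrolling stops
    },
});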
0
votes
0 answers
How to grab links only from sitemap?
I want to grab links only from the sitemap with Crawlee and not grab links from pages found in the sitemap. The main problem is that it starts from a sitemap and then follows all links on the newly discovered pages.
The expected workflow should be…

user3389
- 479
- 5
- 10
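One way to get this workflow is to label the sitemap request and never call enqueueLinks in the page handler, so only the URLs in the sitemap's <loc> elements are crawled. A sketch (the labels and MIME types are my assumptions):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    additionalMimeTypes: ['application/xml', 'text/xml'], // accept sitemap XML
    requestHandler: async ({ $, request, crawler }) => {
        if (request.label === 'SITEMAP') {
            // Enqueue only the URLs listed in the sitemap's <loc> elements.
            const urls = $('loc').map((_, el) => $(el).text()).get();
            await crawler.addRequests(urls.map((url) => ({ url, label: 'PAGE' })));
            return;
        }
        // 'PAGE': scrape here, and deliberately never call enqueueLinks,
        // so links found on these pages are not followed.
    },
});

await crawler.run([{ url: 'https://example.com/sitemap.xml', label: 'SITEMAP' }]);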
0
votes
0 answers
How to set data in localStorage with Crawlee and Puppeteer? In my case it returns 'Access is denied'
I'm using Crawlee with Puppeteer. I want to set data in the localStorage of the page being crawled, so I used the following code:
preNavigationHooks: [
    async (crawlingContext) => {
        const { page, request } = crawlingContext;
        const localStorage =…

Ramin Bateni
- 16,499
- 9
- 69
- 98
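"Access is denied" is what you get when the hook touches localStorage while the tab is still on about:blank, before navigating to the real origin. One workaround is to register the write so it runs inside the target page once navigation starts; a sketch with an illustrative key and value:

import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Runs in the document's own origin on every navigation,
            // before the page's scripts, so localStorage is accessible.
            await page.evaluateOnNewDocument(() => {
                localStorage.setItem('myKey', 'myValue');
            });
        },
    ],
    requestHandler: async ({ page }) => { /* ... */ },
});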
0
votes
0 answers
How to run several crawlers in parallel with Crawlee? I get some errors
I use Crawlee in my project.
I want to run two crawlers in parallel like this:
await Promise.all([
    crawler1.run(),
    crawler2.run(),
]);
But I get this error:
ENOENT: no such file or directory, open…

Ramin Bateni
- 16,499
- 9
- 69
- 98
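The ENOENT here is typically the two crawlers fighting over the same default storage directory. Giving each crawler its own named request queue avoids that; a sketch with illustrative queue names:

import { CheerioCrawler, RequestQueue } from 'crawlee';

// Separate named queues so the crawlers don't share './storage' state.
const queue1 = await RequestQueue.open('crawler-1');
const queue2 = await RequestQueue.open('crawler-2');

const crawler1 = new CheerioCrawler({ requestQueue: queue1, requestHandler: async ({ $ }) => { /* ... */ } });
const crawler2 = new CheerioCrawler({ requestQueue: queue2, requestHandler: async ({ $ }) => { /* ... */ } });

await crawler1.addRequests(['https://example.com']);
await crawler2.addRequests(['https://example.org']);

await Promise.all([crawler1.run(), crawler2.run()]);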
0
votes
0 answers
Compile error when importing Crawlee in React
I'm trying to integrate the crawlee library (https://crawlee.dev) into my React app for a social-scraping project, but as soon as I import the PlaywrightCrawler module I get the following compile error:
ERROR in…
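Crawlee is a Node.js library (it needs the filesystem and real browser binaries), so it cannot be bundled into a browser-side React app; the usual architecture is to run it behind a small server that the React app calls. A sketch of that split, with an illustrative route and port:

// server.ts — crawlee runs in Node, and the React app fetches /api/scrape
import express from 'express';
import { PlaywrightCrawler } from 'crawlee';

const app = express();

app.get('/api/scrape', async (req, res) => {
    const titles: string[] = [];
    const crawler = new PlaywrightCrawler({
        requestHandler: async ({ page }) => {
            titles.push(await page.title());
        },
    });
    await crawler.run([String(req.query.url)]);
    res.json(titles);
});

app.listen(3001);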
0
votes
1 answer
crawlee - How to add the same URL back to the requestQueue
How do I enqueue the same URL that I am currently handling the request for?
I have this code and want to scrape the same URL again (possibly with a delay). I added environment variables so that cached results will be deleted, according to this…

Jaanis
- 39
- 1
- 7
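Deduplication in crawlee is keyed on a request's uniqueKey (the URL by default), so the usual trick is to re-enqueue the URL with a fresh uniqueKey instead of fighting the cache. A sketch (the key format is arbitrary; any delay before the re-scrape would still need separate handling):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, crawler }) => {
        // ...scrape the page...
        // A fresh uniqueKey lets the same URL through deduplication.
        await crawler.addRequests([{
            url: request.url,
            uniqueKey: `${request.url}#${Date.now()}`,
        }]);
    },
});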
0
votes
0 answers
How to access page.getBy* functions inside crawlee
I'm using crawlee with PlaywrightCrawler. I get a new URL to crawl after clicking a few elements on the starting page. I'm clicking those elements with
page.getByRole().click(), which Playwright's codegen generated:
import {…

matrs
- 59
- 1
- 6
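Since the page object crawlee hands to the requestHandler is an ordinary Playwright Page, the getBy* locators should work there directly, provided the installed Playwright version ships them (1.27+). A sketch with an illustrative button name:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, enqueueLinks }) => {
        // Codegen-style locators work on the context's Playwright page.
        await page.getByRole('button', { name: 'Load more' }).click();
        await page.waitForLoadState('networkidle');
        await enqueueLinks();
    },
});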
0
votes
0 answers
Reclaiming failed request back to the list or queue. waiting for selector `ul` failed: timeout 30000ms exceeded Puppeteer Crawler
WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. waiting for selector .show-more-less-html__markup failed: timeout 30000ms exceeded…

Sachin Chillal
- 393
- 4
- 6
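This warning means the handler threw (the selector never appeared within 30 s) and crawlee will retry the request. Typical mitigations are a longer explicit timeout, more retries, and treating a genuinely missing selector as a non-error. A sketch with illustrative timeout values:

import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    maxRequestRetries: 5,
    requestHandlerTimeoutSecs: 120,
    requestHandler: async ({ page, request, log }) => {
        try {
            await page.waitForSelector('.show-more-less-html__markup', { timeout: 60_000 });
        } catch {
            // Selector genuinely absent (blocked page, different variant, etc.).
            log.warning(`Selector not found on ${request.url}`);
            return;
        }
        // ...extract data...
    },
});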
0
votes
1 answer
timeoutSecs for RequestQueue ignoring user config?
I use RequestQueue like so:
const requestQueue = await RequestQueue.open();
requestQueue.timeoutSecs = 60;
await requestQueue.addRequest(...);
but when running the scraper I still see the default timeout:
WARN CheerioCrawler: Reclaiming…

susdu
- 852
- 9
- 22
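As far as I can tell, timeoutSecs on a RequestQueue instance isn't a user-facing knob; the timeout behind that "Reclaiming..." warning is the crawler's own handler timeout, which is configured on the crawler options instead. A sketch:

import { CheerioCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open();
await requestQueue.addRequest({ url: 'https://example.com' });

const crawler = new CheerioCrawler({
    requestQueue,
    requestHandlerTimeoutSecs: 60, // the timeout the warning actually reflects
    navigationTimeoutSecs: 60,     // timeout for the HTTP request itself
    requestHandler: async ({ $ }) => { /* ... */ },
});

await crawler.run();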