Questions tagged [crawlee]

17 questions
4 votes · 1 answer

Extending Crawlee scraper requestHandler

I'm using crawlee@3.0.4, following the quick tutorial here to spin up a scraper. import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ requestHandler: async ({ page, request, enqueueLinks }) => { …
toanphan19 • 88 • 7
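
The truncated snippet follows the quick-start shape from the Crawlee docs; here is a minimal sketch of how it typically continues (the start URL https://crawlee.dev is the tutorial's, not the asker's):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        // Log the page title for the URL currently being processed.
        const title = await page.title();
        console.log(`${request.loadedUrl}: ${title}`);

        // Discover and enqueue further links found on this page.
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
```
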
2 votes · 2 answers

How to reset crawlee URL cache?

I'm running a crawler that is called via an Express.js route. When I call the same route again, my crawler runs again but shows that all routes have already finished. I'm even removing the './storage' folder. I read the documentation but can't seem to…
Justin Young • 2,393 • 3 • 36 • 62
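
One commonly suggested fix, sketched below under the assumption that the crawler lives inside an Express route as in the question: purge the default storages before each run so previously handled requests are forgotten (the /crawl route and start URL are placeholders):

```ts
import express from 'express';
import { PlaywrightCrawler, purgeDefaultStorages } from 'crawlee';

const app = express();

app.get('/crawl', async (_req, res) => {
    // Clear the default request queue and dataset so a repeat call does
    // not see every URL as already finished. Opening a uniquely named
    // RequestQueue per run is an alternative to purging.
    await purgeDefaultStorages();

    const crawler = new PlaywrightCrawler({
        requestHandler: async ({ request }) => {
            console.log(`Crawled ${request.url}`);
        },
    });
    await crawler.run(['https://example.com']);
    res.send('done');
});

app.listen(3000);
```
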
1 vote · 1 answer

Blocking specific resources (css, images, videos, etc) using crawlee and playwright

I'm using crawlee@3.0.3 (not released yet, from github), and I'm trying to block specific resources from loading with playwrightUtils.blockRequests (which isn't available in previous versions). When I try the code suggested in the official repo, it…
matrs • 59 • 1 • 6
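
For reference, a minimal sketch of how playwrightUtils.blockRequests is meant to be wired up: it has to run before navigation, so it belongs in a preNavigationHooks entry (the pattern list here is illustrative):

```ts
import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Abort any request whose URL contains one of these patterns.
            await playwrightUtils.blockRequests(page, {
                urlPatterns: ['.css', '.jpg', '.png', '.svg', '.woff', '.mp4'],
            });
        },
    ],
    requestHandler: async ({ page }) => {
        console.log(await page.title());
    },
});
```
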
0 votes · 0 answers

Why does my web crawler work on localhost but not in the docker container?

I'm trying to create a web crawler using Crawlee and Apify with Node.js. This crawler works when I run it on localhost, but when I run it in a Docker container, I receive a timeout error waiting for a locator. The Docker container was built correctly but…
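
Containers are typically slower and run headless with a restricted sandbox, so locator waits that pass locally can time out there. A hedged sketch of the usual mitigations (the h1 locator is a placeholder; the flags are common container workarounds, not a guaranteed fix):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Give navigation and the handler more headroom inside the container.
    navigationTimeoutSecs: 120,
    requestHandlerTimeoutSecs: 120,
    launchContext: {
        launchOptions: {
            headless: true,
            // Frequently required when Chromium runs inside Docker.
            args: ['--no-sandbox', '--disable-dev-shm-usage'],
        },
    },
    requestHandler: async ({ page }) => {
        await page.locator('h1').waitFor({ timeout: 60_000 });
    },
});
```
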
0 votes · 0 answers

Crawlee no such file or directory storage/request_queues/default/[id].json

I'm trying to run a fairly simple scraper, but I keep getting the error in the title. I want to scrape around 64,000 pages, but I get the no such file error every time. Setting waitForAllRequestsToBeAdded to true doesn't fix the issue. I get the…
sbrass • 905 • 7 • 12
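
With ~64,000 URLs, one workaround often suggested is to feed the file-based queue in batches rather than all at once; a sketch assuming the URL list is already in memory (urls and the batch size are placeholders):

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request }) => {
        console.log(`Processing ${request.url}`);
    },
});

const urls: string[] = [/* ~64,000 page URLs */];
const BATCH_SIZE = 1000;

// Add requests in chunks so the request_queues directory is not hit
// with tens of thousands of writes in one burst.
for (let i = 0; i < urls.length; i += BATCH_SIZE) {
    await crawler.addRequests(urls.slice(i, i + BATCH_SIZE), {
        waitForAllRequestsToBeAdded: true,
    });
}

await crawler.run();
```
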
0 votes · 1 answer

How to get Apify to make every request round-robin style?

Using Apify to crawl a job board with multiple requests running concurrently. I have an array of proxies, but my queued URLs aren't using them in round-robin fashion even though I use this setup. How can I set things up so that every new url that gets…
Justin Young • 2,393 • 3 • 36 • 62
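
Crawlee's ProxyConfiguration rotates through proxyUrls, but a session keeps its proxy for its whole lifetime; disabling the session pool is one way to get per-request rotation. A sketch with placeholder proxy URLs:

```ts
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.example.com:8000',
        'http://proxy-2.example.com:8000',
        'http://proxy-3.example.com:8000',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    // With the session pool on, each session sticks to one proxy;
    // turning it off lets the list rotate request by request.
    useSessionPool: false,
    persistCookiesPerSession: false,
    requestHandler: async ({ request, proxyInfo }) => {
        console.log(`${request.url} via ${proxyInfo?.url}`);
    },
});
```
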
0 votes · 1 answer

How to scrape a webpage with infinite scroll using crawlee/apify?

I am trying to scrape some data from Twitch. The problem I am facing is that the site uses infinite scroll, and I am only able to get data from the first page. I have tried to scroll by using the built-in utility infiniteScroll, but it scrolls after…
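
A sketch of how the infiniteScroll utility is usually tuned for feeds that load slowly; the selector below is a hypothetical placeholder, and the timing numbers are starting points rather than recommendations:

```ts
import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Keep scrolling until no new content arrives for a while.
        await playwrightUtils.infiniteScroll(page, {
            timeoutSecs: 120,      // give up after two minutes of scrolling
            waitForSecs: 4,        // wait this long for new items to load
            scrollDownAndUp: true, // some feeds only trigger on a jiggle
        });

        // Hypothetical selector; adjust to the actual card markup.
        const titles = await page.locator('[data-test="stream-title"]').allTextContents();
        console.log(`${titles.length} items collected`);
    },
});
```
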
0 votes · 0 answers

How to grab links only from the sitemap?

I want to grab links only from the sitemap with Crawlee and not grab links from pages found in the sitemap. The main problem is that it starts from a sitemap and then follows all links on the newly discovered page. The expected workflow should be…
user3389 • 479 • 5 • 10
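
One approach, assuming the sitemap is a plain XML list: pull the URLs out of the sitemap yourself and never call enqueueLinks in the handler, so nothing beyond the sitemap entries is followed (example.com is a placeholder):

```ts
import { CheerioCrawler, downloadListOfUrls } from 'crawlee';

// Extract every URL from the sitemap up front.
const urls = await downloadListOfUrls({
    url: 'https://example.com/sitemap.xml',
});

const crawler = new CheerioCrawler({
    // No enqueueLinks() call here, so links found on the fetched
    // pages are never followed.
    requestHandler: async ({ request, $ }) => {
        console.log(request.url, '-', $('title').text());
    },
});

await crawler.run(urls);
```
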
0 votes · 0 answers

How to set data in localStorage with Crawlee Puppeteer? In my case it returns 'Access is denied'

I'm using Crawlee with Puppeteer. I want to set data in the localStorage of the related page, so I used the following code: preNavigationHooks: [ async (crawlingContext) => { const { page, request } = crawlingContext; const localStorage =…
Ramin Bateni • 16,499 • 9 • 69 • 98
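
The "Access is denied" comes from touching localStorage while the page is still on about:blank; a common workaround is to register the write with Puppeteer's evaluateOnNewDocument so it runs on the target origin during navigation. A sketch with placeholder key and value:

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Runs in the page context as each new document is created,
            // i.e. on the navigated origin where localStorage is allowed.
            await page.evaluateOnNewDocument(() => {
                localStorage.setItem('myKey', 'myValue'); // placeholders
            });
        },
    ],
    requestHandler: async ({ page }) => {
        const value = await page.evaluate(() => localStorage.getItem('myKey'));
        console.log(value);
    },
});
```
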
0 votes · 0 answers

How to run several crawlers in parallel with Crawlee? I get some errors

I use Crawlee in my project. I want to run 2 crawlers in parallel this way: await Promise.all([ crawler1.run(), crawler2.run(), ]); But I get this error: ENOENT: no such file or directory, open…
Ramin Bateni • 16,499 • 9 • 69 • 98
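
The ENOENT usually points at both crawlers racing on the same default request queue; a frequently suggested workaround is to give each crawler its own named queue. A sketch with placeholder URLs:

```ts
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Separate named queues keep the two runs out of
// storage/request_queues/default entirely.
const queue1 = await RequestQueue.open('crawler-1');
const queue2 = await RequestQueue.open('crawler-2');

const crawler1 = new CheerioCrawler({
    requestQueue: queue1,
    requestHandler: async ({ request }) => console.log('c1', request.url),
});
const crawler2 = new CheerioCrawler({
    requestQueue: queue2,
    requestHandler: async ({ request }) => console.log('c2', request.url),
});

await Promise.all([
    crawler1.run(['https://example.com/a']),
    crawler2.run(['https://example.com/b']),
]);
```
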
0 votes · 0 answers

Compile error importing Crawlee in React

I'm trying to integrate the Crawlee library (https://crawlee.dev) inside my React app for a social scraping project, but as soon as I import the PlaywrightCrawler module I get the following compile error: ERROR in…
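
Crawlee is a Node.js library (it uses the filesystem, child processes, and real browsers), so it cannot be bundled into a browser app; the usual answer is to run it behind a small server that the React app calls. A sketch assuming an Express backend (route name and port are placeholders):

```ts
// server.ts: Crawlee runs in Node; the React app only calls this endpoint.
import express from 'express';
import { PlaywrightCrawler, Dataset } from 'crawlee';

const app = express();

app.get('/scrape', async (req, res) => {
    const crawler = new PlaywrightCrawler({
        requestHandler: async ({ page }) => {
            await Dataset.pushData({ title: await page.title() });
        },
    });
    await crawler.run([String(req.query.url)]);
    res.json(await (await Dataset.open()).getData());
});

app.listen(4000);
// In React: fetch('/scrape?url=...') instead of importing crawlee.
```
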
0 votes · 1 answer

crawlee - How to add the same URL back to the requestQueue

How do I enqueue the same URL that I am currently handling the request for? I have this code and want to scrape the same URL again (possibly with a delay). I added environment variables so that cached results will be deleted, according to this…
Jaanis • 39 • 1 • 7
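
The queue de-duplicates on uniqueKey, which defaults to the normalized URL, so the standard trick is to re-add the request with a fresh uniqueKey; a sketch (the Date.now() suffix is just one way to make the key unique):

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, crawler }) => {
        // Same URL, new uniqueKey: the queue accepts it as a new request.
        await crawler.addRequests([{
            url: request.url,
            uniqueKey: `${request.url}#retry-${Date.now()}`,
        }]);
    },
});
```

Note this re-enqueues immediately; the request simply waits its turn in the queue rather than being delayed by a fixed interval.
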
0 votes · 0 answers

How to access page.getBy* functions inside Crawlee

I'm using crawlee with PlaywrightCrawler. I'm getting a new URL to crawl after clicking a few elements on the starting page. The way I'm clicking those elements is with page.getByRole().click(), which Playwright's codegen generated: import {…
matrs • 59 • 1 • 6
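
The page handed to a PlaywrightCrawler handler is a regular Playwright Page, so the getBy* locators from codegen work there as long as the installed playwright is 1.27 or newer; a sketch (the role and name values are placeholders):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, enqueueLinks }) => {
        // The same call codegen emits, used directly inside the handler.
        await page.getByRole('button', { name: 'Load more' }).click();
        await enqueueLinks();
    },
});
```
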
0 votes · 0 answers

Reclaiming failed request back to the list or queue. waiting for selector `ul` failed: timeout 30000ms exceeded Puppeteer Crawler

WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. waiting for selector .show-more-less-html__markup failed: timeout 30000ms exceeded…
Sachin Chillal • 393 • 4 • 6
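
A sketch of the usual way to stop the reclaim loop: raise the selector timeout beyond the 30 s default and bail out gracefully when the element genuinely is not on the page (the retry count is a judgment call):

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    maxRequestRetries: 5,
    requestHandler: async ({ page, request }) => {
        try {
            // Wait longer than the default 30 s for slow pages.
            await page.waitForSelector('.show-more-less-html__markup', {
                timeout: 60_000,
            });
        } catch {
            // Selector never appeared: skip instead of failing the request.
            console.warn(`No description found on ${request.url}`);
            return;
        }
        // ...extract the job description here...
    },
});
```
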
0 votes · 1 answer

timeoutSecs for RequestQueue ignoring user config?

I use RequestQueue like so: const requestQueue = await RequestQueue.open(); requestQueue.timeoutSecs = 60; await requestQueue.addRequest(...); but when running the scraper I still see the default timeout: WARN CheerioCrawler: Reclaiming…
susdu • 852 • 9 • 22
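
The reclaim warning's timeout is controlled by the crawler, not by a property assigned on the queue instance, which is why the assignment appears to be ignored; a sketch of setting it where the crawler actually reads it:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // These crawler options govern the timeout behind the
    // "Reclaiming failed request" warning.
    requestHandlerTimeoutSecs: 60,
    navigationTimeoutSecs: 60,
    requestHandler: async ({ request, $ }) => {
        console.log(request.url, $('title').text());
    },
});
```
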