Questions tagged [crawlee]

17 questions
4 votes · 1 answer

Extending Crawlee scraper requestHandler

I'm using crawlee@3.0.4, following the quick tutorial here to spin up a scraper. import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ requestHandler: async ({ page, request, enqueueLinks }) => { …
toanphan19 • 88 • 7
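
The truncated snippet follows the quick-start shape from the Crawlee docs; here is a minimal sketch of how it typically continues (the start URL https://crawlee.dev is the tutorial's, not the asker's):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        // Log the page title for the URL currently being processed.
        const title = await page.title();
        console.log(`${request.loadedUrl}: ${title}`);

        // Discover and enqueue further links found on this page.
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
```
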
2 votes · 2 answers

How to reset crawlee URL cache?

I'm running a crawler that is called via an Express.js route. When I call the same route again, my crawler runs again but shows that all routes have already finished. I'm even removing the './storage' folder. I read the documentation but can't seem to…
Justin Young • 2,393 • 3 • 36 • 62
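
One commonly suggested fix, sketched below under the assumption that the crawler lives inside an Express route as in the question: purge the default storages before each run so previously handled requests are forgotten (the /crawl route and start URL are placeholders):

```ts
import express from 'express';
import { PlaywrightCrawler, purgeDefaultStorages } from 'crawlee';

const app = express();

app.get('/crawl', async (_req, res) => {
    // Clear the default request queue and dataset so a repeat call does
    // not see every URL as already finished. Opening a uniquely named
    // RequestQueue per run is an alternative to purging.
    await purgeDefaultStorages();

    const crawler = new PlaywrightCrawler({
        requestHandler: async ({ request }) => {
            console.log(`Crawled ${request.url}`);
        },
    });
    await crawler.run(['https://example.com']);
    res.send('done');
});

app.listen(3000);
```
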
1 vote · 1 answer

Blocking specific resources (css, images, videos, etc) using crawlee and playwright

I'm using crawlee@3.0.3 (not released yet, from github), and I'm trying to block specific resources from loading with playwrightUtils.blockRequests (which isn't available in previous versions). When I try the code suggested in the official repo, it…
matrs • 59 • 1 • 6
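
For reference, a minimal sketch of how playwrightUtils.blockRequests is meant to be wired up: it has to run before navigation, so it belongs in a preNavigationHooks entry (the pattern list here is illustrative):

```ts
import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Abort any request whose URL contains one of these patterns.
            await playwrightUtils.blockRequests(page, {
                urlPatterns: ['.css', '.jpg', '.png', '.svg', '.woff', '.mp4'],
            });
        },
    ],
    requestHandler: async ({ page }) => {
        console.log(await page.title());
    },
});
```
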
0 votes · 0 answers

Why does my web crawler work on localhost but not in the docker container?

I'm trying to create a web crawler using Crawlee and Apify with Node.js. This crawler works when I run it on localhost, but when I run it in a Docker container, I receive a timeout error waiting for a locator. The Docker container was built correctly but…
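
Containers are typically slower and run headless with a restricted sandbox, so locator waits that pass locally can time out there. A hedged sketch of the usual mitigations (the h1 locator is a placeholder; the flags are common container workarounds, not a guaranteed fix):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Give navigation and the handler more headroom inside the container.
    navigationTimeoutSecs: 120,
    requestHandlerTimeoutSecs: 120,
    launchContext: {
        launchOptions: {
            headless: true,
            // Frequently required when Chromium runs inside Docker.
            args: ['--no-sandbox', '--disable-dev-shm-usage'],
        },
    },
    requestHandler: async ({ page }) => {
        await page.locator('h1').waitFor({ timeout: 60_000 });
    },
});
```
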
0 votes · 0 answers

Crawlee no such file or directory storage/request_queues/default/[id].json

I'm trying to run a fairly simple scraper, but I keep getting the error in the title. I want to scrape around 64,000 pages, but I get the no such file error every time. Setting waitForAllRequestsToBeAdded to true doesn't fix the issue. I get the…
sbrass • 905 • 7 • 12
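
With ~64,000 URLs, one workaround often suggested is to feed the file-based queue in batches rather than all at once; a sketch assuming the URL list is already in memory (urls and the batch size are placeholders):

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request }) => {
        console.log(`Processing ${request.url}`);
    },
});

const urls: string[] = [/* ~64,000 page URLs */];
const BATCH_SIZE = 1000;

// Add requests in chunks so the request_queues directory is not hit
// with tens of thousands of writes in one burst.
for (let i = 0; i < urls.length; i += BATCH_SIZE) {
    await crawler.addRequests(urls.slice(i, i + BATCH_SIZE), {
        waitForAllRequestsToBeAdded: true,
    });
}

await crawler.run();
```
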
0 votes · 1 answer

How to get Apify to make every request round-robin style?

Using Apify to crawl a job board with multiple requests running concurrently. I have an array of proxies, but my queued URLs aren't using them in round-robin fashion even though I use this setup. How can I set things up so that every new url that gets…
Justin Young • 2,393 • 3 • 36 • 62
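
Crawlee's ProxyConfiguration rotates through proxyUrls, but a session keeps its proxy for its whole lifetime; disabling the session pool is one way to get per-request rotation. A sketch with placeholder proxy URLs:

```ts
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.example.com:8000',
        'http://proxy-2.example.com:8000',
        'http://proxy-3.example.com:8000',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    // With the session pool on, each session sticks to one proxy;
    // turning it off lets the list rotate request by request.
    useSessionPool: false,
    persistCookiesPerSession: false,
    requestHandler: async ({ request, proxyInfo }) => {
        console.log(`${request.url} via ${proxyInfo?.url}`);
    },
});
```
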
0 votes · 1 answer

How to scrape a webpage with infinite scroll using crawlee/apify?

I am trying to scrape some data from Twitch. The problem I am facing is that the site uses infinite scroll, and I am only able to get data from the first page. I have tried to scroll by using the built-in utility infiniteScroll, but it scrolls after…
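
A sketch of how the infiniteScroll utility is usually tuned for feeds that load slowly; the selector below is a hypothetical placeholder, and the timing numbers are starting points rather than recommendations:

```ts
import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Keep scrolling until no new content arrives for a while.
        await playwrightUtils.infiniteScroll(page, {
            timeoutSecs: 120,      // give up after two minutes of scrolling
            waitForSecs: 4,        // wait this long for new items to load
            scrollDownAndUp: true, // some feeds only trigger on a jiggle
        });

        // Hypothetical selector; adjust to the actual card markup.
        const titles = await page.locator('[data-test="stream-title"]').allTextContents();
        console.log(`${titles.length} items collected`);
    },
});
```
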
0 votes · 0 answers

How to grab links only from the sitemap?

I want to grab links only from the sitemap with Crawlee and not grab links from pages found in the sitemap. The main problem is that it starts from a sitemap and then follows all links on the newly discovered page. The expected workflow should be…
user3389 • 479 • 5 • 10
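
One approach, assuming the sitemap is a plain XML list: pull the URLs out of the sitemap yourself and never call enqueueLinks in the handler, so nothing beyond the sitemap entries is followed (example.com is a placeholder):

```ts
import { CheerioCrawler, downloadListOfUrls } from 'crawlee';

// Extract every URL from the sitemap up front.
const urls = await downloadListOfUrls({
    url: 'https://example.com/sitemap.xml',
});

const crawler = new CheerioCrawler({
    // No enqueueLinks() call here, so links found on the fetched
    // pages are never followed.
    requestHandler: async ({ request, $ }) => {
        console.log(request.url, '-', $('title').text());
    },
});

await crawler.run(urls);
```
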
0 votes · 0 answers

How to set data in localStorage with Crawlee Puppeteer? In my case it returns 'Access is denied'

I'm using Crawlee with Puppeteer. I want to set data in the localStorage of the related page, so I used the following code: preNavigationHooks: [ async (crawlingContext) => { const { page, request } = crawlingContext; const localStorage =…
Ramin Bateni • 16,499 • 9 • 69 • 98
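
The "Access is denied" comes from touching localStorage while the page is still on about:blank; a common workaround is to register the write with Puppeteer's evaluateOnNewDocument so it runs on the target origin during navigation. A sketch with placeholder key and value:

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Runs in the page context as each new document is created,
            // i.e. on the navigated origin where localStorage is allowed.
            await page.evaluateOnNewDocument(() => {
                localStorage.setItem('myKey', 'myValue'); // placeholders
            });
        },
    ],
    requestHandler: async ({ page }) => {
        const value = await page.evaluate(() => localStorage.getItem('myKey'));
        console.log(value);
    },
});
```
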
0 votes · 0 answers

How to run several crawlers in parallel with Crawlee? I get some errors

I use Crawlee in my project. I want to run 2 crawlers in parallel this way: await Promise.all([ crawler1.run(), crawler2.run(), ]); But I get this error: ENOENT: no such file or directory, open…
Ramin Bateni • 16,499 • 9 • 69 • 98
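
The ENOENT usually points at both crawlers racing on the same default request queue; a frequently suggested workaround is to give each crawler its own named queue. A sketch with placeholder URLs:

```ts
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Separate named queues keep the two runs out of
// storage/request_queues/default entirely.
const queue1 = await RequestQueue.open('crawler-1');
const queue2 = await RequestQueue.open('crawler-2');

const crawler1 = new CheerioCrawler({
    requestQueue: queue1,
    requestHandler: async ({ request }) => console.log('c1', request.url),
});
const crawler2 = new CheerioCrawler({
    requestQueue: queue2,
    requestHandler: async ({ request }) => console.log('c2', request.url),
});

await Promise.all([
    crawler1.run(['https://example.com/a']),
    crawler2.run(['https://example.com/b']),
]);
```
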
0 votes · 0 answers

Compile error importing Crawlee in React

I'm trying to integrate the Crawlee library (https://crawlee.dev) inside my React app for a social scraping project, but as soon as I import the PlaywrightCrawler module I get the following compile error: ERROR in…
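
Crawlee is a Node.js library (it uses the filesystem, child processes, and real browsers), so it cannot be bundled into a browser app; the usual answer is to run it behind a small server that the React app calls. A sketch assuming an Express backend (route name and port are placeholders):

```ts
// server.ts: Crawlee runs in Node; the React app only calls this endpoint.
import express from 'express';
import { PlaywrightCrawler, Dataset } from 'crawlee';

const app = express();

app.get('/scrape', async (req, res) => {
    const crawler = new PlaywrightCrawler({
        requestHandler: async ({ page }) => {
            await Dataset.pushData({ title: await page.title() });
        },
    });
    await crawler.run([String(req.query.url)]);
    res.json(await (await Dataset.open()).getData());
});

app.listen(4000);
// In React: fetch('/scrape?url=...') instead of importing crawlee.
```
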
0 votes · 1 answer

crawlee - How to add the same URL back to the requestQueue

How do I enqueue the same URL that I am currently handling the request for? I have this code and want to scrape the same URL again (possibly with a delay). I added environment variables so that cached results will be deleted, according to this…
Jaanis • 39 • 1 • 7
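
The queue de-duplicates on uniqueKey, which defaults to the normalized URL, so the standard trick is to re-add the request with a fresh uniqueKey; a sketch (the Date.now() suffix is just one way to make the key unique):

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, crawler }) => {
        // Same URL, new uniqueKey: the queue accepts it as a new request.
        await crawler.addRequests([{
            url: request.url,
            uniqueKey: `${request.url}#retry-${Date.now()}`,
        }]);
    },
});
```

Note this re-enqueues immediately; the request simply waits its turn in the queue rather than being delayed by a fixed interval.
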
0 votes · 0 answers

How to access page.getBy* functions inside Crawlee

I'm using crawlee with PlaywrightCrawler. I'm getting a new URL to crawl after clicking a few elements on the starting page. The way I'm clicking those elements is with page.getByRole().click(), which Playwright's codegen generated: import {…
matrs • 59 • 1 • 6
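
The page handed to a PlaywrightCrawler handler is a regular Playwright Page, so the getBy* locators from codegen work there as long as the installed playwright is 1.27 or newer; a sketch (the role and name values are placeholders):

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, enqueueLinks }) => {
        // The same call codegen emits, used directly inside the handler.
        await page.getByRole('button', { name: 'Load more' }).click();
        await enqueueLinks();
    },
});
```
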
0 votes · 0 answers

Reclaiming failed request back to the list or queue. waiting for selector `ul` failed: timeout 30000ms exceeded Puppeteer Crawler

WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. waiting for selector .show-more-less-html__markup failed: timeout 30000ms exceeded…
Sachin Chillal • 393 • 4 • 6
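
A sketch of the usual way to stop the reclaim loop: raise the selector timeout beyond the 30 s default and bail out gracefully when the element genuinely is not on the page (the retry count is a judgment call):

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    maxRequestRetries: 5,
    requestHandler: async ({ page, request }) => {
        try {
            // Wait longer than the default 30 s for slow pages.
            await page.waitForSelector('.show-more-less-html__markup', {
                timeout: 60_000,
            });
        } catch {
            // Selector never appeared: skip instead of failing the request.
            console.warn(`No description found on ${request.url}`);
            return;
        }
        // ...extract the job description here...
    },
});
```
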
0 votes · 1 answer

timeoutSecs for RequestQueue ignoring user config?

I use RequestQueue like so: const requestQueue = await RequestQueue.open(); requestQueue.timeoutSecs = 60; await requestQueue.addRequest(...); but when running the scraper I still see the default timeout: WARN CheerioCrawler: Reclaiming…
susdu • 852 • 9 • 22
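
The reclaim warning's timeout is controlled by the crawler, not by a property assigned on the queue instance, which is why the assignment appears to be ignored; a sketch of setting it where the crawler actually reads it:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // These crawler options govern the timeout behind the
    // "Reclaiming failed request" warning.
    requestHandlerTimeoutSecs: 60,
    navigationTimeoutSecs: 60,
    requestHandler: async ({ request, $ }) => {
        console.log(request.url, $('title').text());
    },
});
```
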