Is there a way to run Rcrawler without downloading all the HTMLs?

Question

I'm running Rcrawler on a very large website, so it takes a very long time (3+ days with default page depth). Is there a way to not download all the HTMLs to make the process faster?

I only need the URLs that are stored in the INDEX. Or can anyone recommend another way to make Rcrawler run faster?

I have tried running it with a smaller page depth (5), but it is still taking forever.

Can you provide a website link and the expected output? – Nad Pat Mar 06 '22 at 14:03

score 0 · Answer 1 · answered Jun 03 '19 at 10:03


I am dealing with the same issue. Depending on the source, in some cases I am even running at depth 1.

Best, Janusz


Janush


I have not tried this yet, but a co-worker suggested using LinkExtractor to only extract URLs without downloading: https://rdrr.io/cran/Rcrawler/man/LinkExtractor.html Let me know if this works for you. Best, Yannick – Yannick Jun 04 '19 at 11:22
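For anyone wanting to try the LinkExtractor suggestion, here is a rough, untested sketch of how it might look. As far as I understand, LinkExtractor still has to fetch each page over HTTP in order to parse its links, so the saving would come from not writing HTML files to disk and from keeping the depth small, not from skipping the requests. The helper names `collect_urls` and `rcrawler_links` are made up for this example, and the `InternalLinks` element name is taken from the package documentation, so please verify it against your installed version.

```r
# Breadth-first URL collection that keeps only the URLs, never the HTML.
# `extract_links` is a hypothetical hook: any function(url) that returns a
# character vector of the links found on that page.
collect_urls <- function(start_url, extract_links, max_depth = 2) {
  seen <- start_url      # every URL discovered so far, in discovery order
  frontier <- start_url  # URLs to visit at the current depth
  for (depth in seq_len(max_depth)) {
    found <- character(0)
    for (u in frontier) {
      # A failed request should not abort the whole crawl
      links <- tryCatch(extract_links(u), error = function(e) character(0))
      found <- c(found, links)
    }
    # Only URLs not seen before are queued for the next depth
    frontier <- setdiff(unique(found), seen)
    if (length(frontier) == 0) break
    seen <- c(seen, frontier)
  }
  seen
}

# Assumed wrapper around Rcrawler::LinkExtractor (element name per the docs;
# treat this as an assumption, not a confirmed recipe).
rcrawler_links <- function(u) {
  as.character(Rcrawler::LinkExtractor(url = u)$InternalLinks)
}

# Usage (needs the Rcrawler package and network access):
# urls <- collect_urls("https://www.example.com", rcrawler_links, max_depth = 2)
```

Passing the extractor in as an argument makes it possible to dry-run the loop with a fake function before pointing it at the real site.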
