I'm running Rcrawler on a very large website, so it takes a very long time (3+ days with the default page depth). Is there a way to skip downloading all the HTML pages to make the process faster?

I only need the URLs that are stored in the INDEX. Or can anyone recommend another way to make Rcrawler run faster?

I have tried running it with a smaller page depth (5), but it is still taking forever.

1 Answer

I am dealing with the same issue. Depending on the source, in some cases I am even running at depth 1.

Best, Janusz

Janush
  • I have not tried this yet, but a co-worker suggested using LinkExtractor to only extract URLs without downloading: https://rdrr.io/cran/Rcrawler/man/LinkExtractor.html Let me know if this works for you. Best, Yannick – Yannick Jun 04 '19 at 11:22
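
The LinkExtractor suggestion in the comment above could look roughly like the untested sketch below. It assumes the Rcrawler package is installed and that `LinkExtractor()` returns a list with an `InternalLinks` element, as the linked manual page describes. `extract_links` is a hypothetical helper name and the URL is a placeholder. As far as I understand, LinkExtractor still has to request the page in order to parse it, so any saving would come from not storing the HTML files and not crawling further, not from avoiding the request itself.

```r
# Hypothetical helper: collect the links found on a single page
# without starting a full Rcrawler crawl.
# Assumption: LinkExtractor() returns a list with an `InternalLinks` element.
extract_links <- function(url) {
  if (!requireNamespace("Rcrawler", quietly = TRUE)) {
    stop("The Rcrawler package is required for this sketch.")
  }
  page <- Rcrawler::LinkExtractor(url = url)
  links <- page$InternalLinks
  # Return an empty character vector if no links were found
  if (is.null(links)) character(0) else unique(as.character(links))
}

# Example call (placeholder URL):
# links <- extract_links("https://www.example.com")
```

Calling the helper on each URL you already have gives one level of links per call, so you would still need your own loop if you want more than depth 1.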