Questions tagged [heritrix]

Heritrix is a web-crawler.

Heritrix is a web crawler created by the Internet Archive for the purpose of archiving websites. It is free, open-source software written in Java.

43 questions
1
vote
0 answers

Heritrix: how to get more URIs per second on a single domain?

How can I get more URIs/sec per domain with Heritrix 3.2.0? I have already set the parallelism options, such as maxToeThreads, to the maximum, but it still stays at 5 active threads on a single-domain crawl.
GMAC
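By default Heritrix assigns each host its own queue, served by one toe thread at a time, and enforces politeness delays between requests to that host, so raising maxToeThreads alone rarely speeds up a single-domain crawl. A sketch of the politeness knobs in crawler-beans.cxml that usually govern single-host throughput (the values here are illustrative, not recommendations — aggressive settings can overload the target site):

```xml
<!-- DispositionProcessor controls per-host politeness; smaller delays
     mean more URIs/sec against a single host. Use responsibly. -->
<bean id="disposition"
      class="org.archive.crawler.postprocessor.DispositionProcessor">
  <property name="delayFactor" value="1.0"/>  <!-- multiple of last fetch duration -->
  <property name="minDelayMs" value="0"/>     <!-- floor between requests -->
  <property name="maxDelayMs" value="5000"/>  <!-- ceiling between requests -->
</bean>
```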
1
vote
1 answer

How to loop through WARC files using HeaderedArchiveRecord with Heritrix 3.1

I'm using the Heritrix 3.1 Java library. Just to be clear, I'm not interested in crawling but only in processing data from compressed WARC (*.warc.gz) files generated by another team. For each WWW document stored in the WARC file, I need some…
AdamF
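Heritrix's own I/O classes (e.g. the WARC readers in webarchive-commons) handle record iteration; as an illustration of the record layout those readers walk through, here is a minimal, self-contained sketch that parses the header block of a single WARC record. The class name WarcHeaderParser and the sample record are my own, not part of the Heritrix API:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class WarcHeaderParser {
    // Parse the header block of one WARC record from a reader positioned
    // at the "WARC/1.0" version line; returns header-name -> value.
    // Per the WARC spec, header lines end with CRLF and the block ends
    // with an empty line, after which Content-Length bytes of body follow.
    public static Map<String, String> parseHeader(BufferedReader in) throws IOException {
        String version = in.readLine();
        if (version == null || !version.startsWith("WARC/")) {
            throw new IOException("not a WARC record: " + version);
        }
        Map<String, String> headers = new LinkedHashMap<>();
        String line;
        while ((line = in.readLine()) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                headers.put(line.substring(0, colon).trim(),
                            line.substring(colon + 1).trim());
            }
        }
        return headers;
    }

    public static void main(String[] args) throws IOException {
        String record = "WARC/1.0\r\n"
            + "WARC-Type: response\r\n"
            + "WARC-Target-URI: http://example.com/\r\n"
            + "Content-Length: 0\r\n"
            + "\r\n";
        Map<String, String> h = parseHeader(new BufferedReader(new StringReader(record)));
        System.out.println(h.get("WARC-Target-URI")); // prints http://example.com/
    }
}
```

For real *.warc.gz files you would wrap the input in a GZIPInputStream per gzip member and skip Content-Length bytes of body between records; the library readers do exactly that bookkeeping for you.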
1
vote
1 answer

Use of Heritrix's HtmlFormCredential and CredentialStore

I am attempting to add authentication to my Heritrix configuration. My .cxml file has the following:
Nielsvh
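For HTML form login, two beans are typically involved: a CredentialStore holding the credentials map, and an HtmlFormCredential describing the login form. A hedged sketch of how they might look in a .cxml file — the bean ids, domain, login URI, and form field names are all placeholders that must match the target site's actual form:

```xml
<bean id="credentialStore"
      class="org.archive.modules.credential.CredentialStore">
  <property name="credentials">
    <map>
      <entry key="exampleLogin" value-ref="exampleFormCredential"/>
    </map>
  </property>
</bean>

<bean id="exampleFormCredential"
      class="org.archive.modules.credential.HtmlFormCredential">
  <!-- domain scopes the credential; loginUri is the form's action URL -->
  <property name="domain" value="example.com"/>
  <property name="loginUri" value="http://example.com/login"/>
  <property name="formItems">
    <map>
      <!-- keys must match the name attributes of the site's form inputs -->
      <entry key="username" value="myuser"/>
      <entry key="password" value="mypass"/>
    </map>
  </property>
</bean>
```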
1
vote
1 answer

Change the path of MirrorWriterProcessor in Heritrix 3.1.0

I am crawling using Heritrix 3.1.0. I am trying to save the files using the MirrorWriterProcessor. However, this option is not available in the crawler-beans.cxml. What I did was to replace the "warcWriter"…
fanchyna
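MirrorWriterProcessor ships with Heritrix but is not wired into the default crawler-beans.cxml; the usual approach is to swap it in for the WARC writer bean so the disposition chain stays intact. A sketch, assuming the stock bean id and that the path is resolved relative to the job directory:

```xml
<!-- Replace the default WARCWriterProcessor bean with a mirror writer
     that saves each fetched URI as a file under the given directory. -->
<bean id="warcWriter"
      class="org.archive.modules.writer.MirrorWriterProcessor">
  <property name="path" value="mirror"/>
</bean>
```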
1
vote
0 answers

Reading from an ARC file (Common Crawl dataset) with ARCReader

This question may sound stupid, but I researched for hours to find a solution and couldn't, so if anyone knows, that would be great! I successfully read an ARC file (from the Common Crawl dataset). With arcHeader.getUrl(); I'm getting all the URLs…
code muncher
0
votes
1 answer

Getting 401 error when trying to make a teardown request to Heritrix via Node.js http module

I'm trying to make a teardown request to Heritrix via the Node.js http module and the Heritrix REST API, but I keep getting a 401 error. I know the request works using curl, as I've tested it with the following command: curl -v -d "action=teardown" -k…
Isaac W
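Heritrix's REST API authenticates with HTTP digest over HTTPS, which curl negotiates when told to but Node's bare http module does not: the first request always gets a 401 challenge that the client must answer with hashed credentials. For reference, the curl form that completes the handshake (job name and credentials are placeholders):

```shell
# --digest makes curl answer the 401 challenge with digest credentials;
# -k accepts Heritrix's self-signed certificate.
curl -v -d "action=teardown" -k --digest -u admin:admin \
     https://localhost:8443/engine/job/myjob
```

In Node, the equivalent is to read the WWW-Authenticate header from the 401 response and retry with a computed Authorization header, or to use a client library that implements digest auth.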
0
votes
1 answer

Crawling rules in Heritrix: how to load embedded content?

I want Heritrix (currently version 3.4.0) to crawl site.domain/path and load all pages below that, but also include the things needed to show the pages, like images, scripts and such. According to https://heritrix.readthedocs.io/en/latest/glossary.html…
Erik Melkersson
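In the stock scope, embedded resources (hop type E) are admitted by a TransclusionDecideRule even when they fall outside the accepted SURT prefixes. A sketch of the relevant rule inside the scope bean's decide-rule chain, assuming the default chain layout (the hop limits shown are illustrative):

```xml
<!-- Accept URIs reached via transclusion hops (E = embed, X = speculative),
     i.e. images, stylesheets, and scripts needed to render accepted pages. -->
<bean class="org.archive.modules.deciderules.TransclusionDecideRule">
  <property name="maxTransHops" value="2"/>
  <property name="maxSpeculativeHops" value="1"/>
</bean>
```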
0
votes
0 answers

Cookies expire quickly while crawling a website with Heritrix

I'm trying to crawl a WordPress website with Heritrix, and I have provided cookies to automatically log in to the website and crawl. It works fine, but after crossing roughly 20 MB (approx. 10 minutes) of downloaded data, the website logs out and the…
0
votes
1 answer

How can I correctly configure my crawl configuration, crawl-beans.cxml?

When I started my crawl I realized that it was taking much longer than it should have and still had not finished. I checked the process PID from another terminal to see what was going on, but the outputs were not clear to me; they were all of this…
0
votes
1 answer

How to write a cron job for Heritrix3 web crawling?

I built a job to crawl web data with Heritrix 3.0, but I must run Heritrix.java as a Java application before the server is up. Then I have to open a browser, go to https://localhost:8443, build my job, and launch it. Then unpause the…
莫绮静
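Once the engine itself is kept running (e.g. started once as a service), every step done in the browser can be scripted against the REST API, so a cron entry can drive the whole cycle without opening https://localhost:8443. A sketch, assuming a prebuilt job named myjob, default admin credentials, and a self-signed certificate:

```shell
#!/bin/sh
# Drive a Heritrix job via the REST API: build, launch, unpause.
# Job name, credentials, and -k (self-signed cert) are assumptions.
JOB=https://localhost:8443/engine/job/myjob
for action in build launch unpause; do
  curl -s -k --digest -u admin:admin -d "action=$action" "$JOB"
  sleep 5   # give the engine a moment between state transitions
done
```

The script itself then becomes the crontab target, e.g. a daily entry that posts these actions in sequence.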
0
votes
3 answers

Heritrix 3.2.x: how to read content from WARC files?

Using Heritrix 3.2.x, I crawled a website. Now I want to read the HTML content from the WARC files it created. Can anyone help? I tried using a Python WARC tool and the Java-based warc-tools.jar.
0
votes
1 answer

How do we know when Heritrix completes a crawl job?

In our application, Heritrix is used as the crawl engine, and once the crawl job is finished we manually kick off an endpoint to download the PDFs from a website. We would like to automate this PDF-download task as soon as the crawl…
bking007
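One low-tech option is to poll the job's status via the REST API until the controller reports a finished state, then fire the follow-up step. A sketch, assuming the job resource rendered as XML contains a FINISHED marker when the crawl ends (worth verifying against your Heritrix version's output) and a purely hypothetical download endpoint:

```shell
#!/bin/sh
# Poll the job resource (as XML) until the crawl reports FINISHED,
# then kick off the PDF-download step. Both URLs are placeholders.
JOB=https://localhost:8443/engine/job/myjob
until curl -s -k --digest -u admin:admin -H "Accept: application/xml" "$JOB" \
      | grep -q "FINISHED"; do
  sleep 60
done
curl -s -X POST http://localhost:8080/api/download-pdfs   # hypothetical endpoint
```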
0
votes
1 answer

Is Heritrix Crawl Deterministic?

Let's say there is a website abc.com and we crawl abc.com for 100 pages as below. Day 1: create a crawl job in Heritrix, specifying maxDocumentsToDownload as 100. Day 2: clone the above job in Heritrix and run it. If the website doesn't change over two…
TechyHarry
0
votes
1 answer

Find web trace to a web list in Heritrix

I have recently been working with the web crawler Heritrix at the company I work for, and after a while of searching and testing I can't find how to meet our need. We want to run Heritrix automatically from cron every day to crawl a list of webpages…
0
votes
1 answer

Increasing number of threads

I'm trying to crawl pages from one particular domain using Heritrix. The crawl rate seems really slow, and one thing I notice is that of the 25 threads, 24 are always idle. It seems there is only one thread that is actively…
Gant