How to write a cron job for Heritrix3 web crawling?

Question

I built a job to crawl web data with Heritrix 3.0. To use it, I have to run Heritrix.java as a Java application, which starts the server. Then I have to open a browser at https://localhost:8443 to build the job, launch it, and finally unpause it. How can I set up a cron job that runs the crawl automatically? Please use Java.

Any specific reason why you use Heritrix? Why not go for StormCrawler? It is in Java, runs continuously, can generate WARC files, is modular and pluggable, etc... — Julien Nioche, May 17 '17 at 13:55

score 0 · Answer 1 · answered May 06 '23 at 03:14

I automated this for my FYP. You can use Java, but the Heritrix documentation describes its REST API in terms of curl calls, so the easiest and fastest approach is a shell script that invokes curl.

Get Current Status of Engine:

curl -v -k -u admin:admin --anyauth --location -H "Accept: application/xml" \
https://localhost:8443/engine
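When the script runs from cron, the engine may not be reachable yet, so it can help to poll the status URL before sending any actions. The following is only a sketch: the function names wait_for and engine_up, the retry count, and the delay are my own choices, not anything Heritrix provides.

```shell
#!/bin/sh
# Sketch: wait until the Heritrix engine answers on its status URL.
# Assumed values (adjust to your setup): URL and credentials as in the curl call above.
ENGINE_URL="${ENGINE_URL:-https://localhost:8443/engine}"
AUTH="${AUTH:-admin:admin}"

# Succeeds when the engine status page can be fetched (-f makes HTTP errors fail).
engine_up() {
    curl -s -f -k -u "$AUTH" --anyauth --location \
        -H "Accept: application/xml" -o /dev/null "$ENGINE_URL"
}

# wait_for TRIES DELAY CMD...: run CMD up to TRIES times, sleeping DELAY
# seconds between attempts. Returns 0 as soon as CMD succeeds, 1 otherwise.
wait_for() {
    tries="$1"; delay="$2"; shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example use in a cron script:
#   wait_for 30 2 engine_up || exit 1
```

Keeping the retry loop separate from the curl probe means the same wait_for can be reused for other checks.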

Create new job for crawling in the Engine:

curl -v -d "createpath=myjob&action=create" -k -u admin:admin --anyauth --location \
-H "Accept: application/xml" https://localhost:8443/engine

Build the Job:

curl -v -d "action=build" -k -u admin:admin --anyauth --location \
-H "Accept: application/xml" https://localhost:8443/engine/job/myjob

Launch the Job:

curl -v -d "action=launch" -k -u admin:admin --anyauth --location \
-H "Accept: application/xml" https://localhost:8443/engine/job/myjob
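To run the whole sequence from cron, the calls can be chained in one script. This is a minimal sketch, not a tested production setup. It assumes the engine is already running, that the job's configuration is valid, and that the job pauses after launch and needs an unpause action, as the question describes. The script name, the DRY_RUN switch, the "run" argument, and the 10-second delay are hypothetical choices of mine.

```shell
#!/bin/sh
# Sketch: drive a Heritrix 3 crawl through the REST API, suitable for cron.
# Assumed values (adjust to your setup): engine URL, admin:admin, job "myjob".
ENGINE_URL="${ENGINE_URL:-https://localhost:8443/engine}"
AUTH="${AUTH:-admin:admin}"
JOB="${JOB:-myjob}"

# POST one form-encoded action to a URL.
# With DRY_RUN=1 the request is only printed, which is handy for checking the script.
heritrix_post() {
    data="$1"; url="$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "curl -d $data $url"
        return 0
    fi
    curl -s -f -k -u "$AUTH" --anyauth --location \
        -H "Accept: application/xml" -d "$data" "$url"
}

run_crawl() {
    # Creating the job is only needed once, so a failure here is not fatal.
    heritrix_post "createpath=$JOB&action=create" "$ENGINE_URL" ||
        echo "create failed (the job may already exist)" >&2

    # Build, launch, wait a moment (assumed delay), then unpause.
    heritrix_post "action=build"   "$ENGINE_URL/job/$JOB" &&
    heritrix_post "action=launch"  "$ENGINE_URL/job/$JOB" &&
    sleep "${UNPAUSE_DELAY:-10}" &&
    heritrix_post "action=unpause" "$ENGINE_URL/job/$JOB"
}

# Only start the crawl when called as: heritrix-crawl.sh run
if [ "${1:-}" = "run" ]; then
    run_crawl
fi
```

A crontab entry such as `0 2 * * * /path/to/heritrix-crawl.sh run >> /tmp/heritrix-crawl.log 2>&1` (path and schedule are placeholders) would then start the crawl every night at 02:00.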
