I built a job to crawl web data with Heritrix 3.0. To run it, I have to start Heritrix.java
as a Java application, which brings up the server. Then I have to open a browser at https://localhost:8443
to build the job, launch it, and unpause it. How can I set up a cron job so the web crawl runs automatically? Please use Java.
- Can you show what you have tried so far? – friedemann_bach May 17 '17 at 09:55
- Any specific reason why you use Heritrix? Why not go for StormCrawler? It is in Java, runs continuously, can generate WARC files, is modular and pluggable, etc. – Julien Nioche May 17 '17 at 13:55
1 Answer
I automated this for my final-year project. You can use Java, but Heritrix 3 is driven through
its REST API, and the Heritrix documentation gives every call as a curl command.
The simplest and fastest approach is therefore a shell script that invokes curl
for each step.
Get Current Status of Engine:
curl -v -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine
Create new job for crawling in the Engine:
curl -v -d "createpath=myjob&action=create" -k -u admin:admin --anyauth --location \
  -H "Accept: application/xml" https://localhost:8443/engine
Build the Job:
curl -v -d "action=build" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob
Launch the Job:
curl -v -d "action=launch" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob
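Since the question asks for Java, here is a minimal sketch that wraps the same curl calls with ProcessBuilder and repeats them on a schedule. It is not taken from the Heritrix documentation. The class name HeritrixCurl, the 24-hour interval, and the 5-second pause between steps are all assumptions for illustration, and it assumes that curl is on the PATH, the engine is already running with the admin:admin credentials used above, and the job myjob has already been created. It also sends action=unpause after the launch, because the question says the job has to be unpaused by hand.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper class; names and timings are illustrative assumptions.
class HeritrixCurl {
    // Taken from the curl examples above: local engine, admin:admin.
    static final String ENGINE = "https://localhost:8443/engine";
    static final String CREDENTIALS = "admin:admin";

    /** Builds the curl argument list for one POST to the Heritrix REST API. */
    static List<String> curlCommand(String url, String formData) {
        List<String> cmd = new ArrayList<>();
        cmd.add("curl");
        cmd.add("-k");                       // accept the self-signed certificate
        cmd.add("-u"); cmd.add(CREDENTIALS);
        cmd.add("--anyauth");
        cmd.add("--location");
        cmd.add("-H"); cmd.add("Accept: application/xml");
        cmd.add("-d"); cmd.add(formData);
        cmd.add(url);
        return cmd;
    }

    /** The calls for one crawl of an existing job: build, launch, unpause. */
    static List<List<String>> crawlSteps(String jobName) {
        String jobUrl = ENGINE + "/job/" + jobName;
        List<List<String>> steps = new ArrayList<>();
        steps.add(curlCommand(jobUrl, "action=build"));
        steps.add(curlCommand(jobUrl, "action=launch"));
        steps.add(curlCommand(jobUrl, "action=unpause"));
        return steps;
    }

    /** Runs the steps in order and stops at the first curl failure. */
    static void runCrawl(String jobName) {
        try {
            for (List<String> step : crawlSteps(jobName)) {
                Process p = new ProcessBuilder(step).inheritIO().start();
                int exit = p.waitFor();
                if (exit != 0) {
                    System.err.println("curl exited with " + exit + ", stopping");
                    return;
                }
                Thread.sleep(5000); // arbitrary pause so the job can change state
            }
        } catch (IOException e) {
            System.err.println("Could not start curl: " + e.getMessage());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        String jobName = "myjob";
        if (args.length == 0) {
            // Dry run: only print the commands that would be executed.
            for (List<String> step : crawlSteps(jobName)) {
                System.out.println(String.join(" ", step));
            }
            return;
        }
        // Any argument switches to real mode: crawl now, then every 24 hours.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> runCrawl(jobName), 0, 24, TimeUnit.HOURS);
    }
}
```

Running it without arguments only prints the three curl commands, which is a safe way to check them first. With any argument it starts the schedule and keeps the JVM alive. If you would rather use the system cron, have main call runCrawl once and exit, and let a crontab entry start the program.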

Du-Lacoste