Questions tagged [heritrix]

Heritrix is a web crawler created by the Internet Archive for the purpose of archiving websites. It is free software written in Java.

43 questions
0
votes
1 answer

Scraping a Heritrix page using Python's requests module

I want to scrape a Heritrix home page using Python's requests module. When I try to open this page in Chrome, I get the error: This server could not prove that it is 10.100.121.41; its security certificate is not trusted by your computer's…
rivu
0
votes
0 answers

MirrorWriterProcessor in Heritrix 3.2.0 active threads

When I'm using the MirrorWriterProcessor class, I only ever get 1 active thread, because it won't accept the uncommented properties for, for example, increasing the maximum number of active threads. I'm no Java programmer at all, so if someone can help me I would…
GMAC
0
votes
1 answer

Heritrix 3.2.0: Writing and Adding Extensions

I am currently working with Heritrix and I have a standard installation (this one: http://builds.archive.org/maven2/org/archive/heritrix/heritrix/3.2.0/) and it works fine. But now I want to write and add my own extensions, e.g. change the priority…
0
votes
1 answer

Heritrix DecidingScope regexp URI

I'm using Heritrix to crawl a site called octetfarm.com. I would like the crawler to apply a regexp to the URI (or URL) and accept it if the string "octetfarm" is present. I made 2 rules: 1 MatchesRegExpDecideRule "ACCEPT" and…
user848106
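The kind of check described in this question can be sketched in plain Python. This is only an illustration of the matching logic, not Heritrix's API: the `decide` function and the `ACCEPT_PATTERN` name are hypothetical, and the sketch assumes the rule tests the whole URI against the pattern (as Java's `Matcher.matches()` does), which is why the pattern is wrapped in `.*`.

```python
import re

# Hypothetical pattern: accept any URI that contains "octetfarm".
# The leading and trailing ".*" are needed because the whole URI
# must match, not just a substring of it.
ACCEPT_PATTERN = re.compile(r".*octetfarm.*")


def decide(uri: str) -> str:
    """Return "ACCEPT" if the entire URI matches the pattern, else "REJECT"."""
    return "ACCEPT" if ACCEPT_PATTERN.fullmatch(uri) else "REJECT"
```

Under that assumption, a bare pattern such as `octetfarm` would reject `http://www.octetfarm.com/`, because the pattern does not cover the whole URI.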
0
votes
1 answer

Understanding the "content type" for PDFs in crawling output

Using Heritrix, I have crawled a site which contained some PDF files. The crawl log shows that the content type for the PDF link is "application/pdf", whereas the response in the .warc file (crawl output) shows that the content type is…
rivu
0
votes
1 answer

Heritrix retrieves gzip CSS + JS

When I run Heritrix, my web server gzips JS and CSS assets. This is turning out to be a problem because, when the .warc file is loaded through Wayback, the assets are still gzip-encoded. I am unable to view the .css and .js files properly in the browser.
Tim Nuwin
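For context, a stored response body that was served with `Content-Encoding: gzip` has to be decompressed before a browser can use it as CSS or JS. A minimal Python sketch of that step follows; the `decode_body` helper is hypothetical and is not part of Heritrix or Wayback, and it assumes the body and its `Content-Encoding` header value have already been read from the archived record.

```python
import gzip

# The first two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def decode_body(body: bytes, content_encoding: str = "") -> bytes:
    """Decompress the body if the header says gzip and the bytes look like gzip."""
    is_gzip_header = content_encoding.strip().lower() == "gzip"
    if is_gzip_header and body.startswith(GZIP_MAGIC):
        return gzip.decompress(body)
    # Anything else is returned unchanged.
    return body
```

Checking the magic bytes as well as the header is a defensive choice in this sketch, so that a mislabelled plain-text body is passed through rather than raising an error.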
0
votes
1 answer

How do I exclude everything but links/outlinks from a Heritrix crawl?

I'm working with Heritrix and I'm a bit stuck with managing its output. I'm studying PageRank and I need Heritrix to generate a file against which to apply the ranking algorithm. The file that I need shall have only links and outlinks for each…
Ein F
0
votes
1 answer

Java & Heritrix 3.1.x: Web Content parsing?

Since the developer documentation for Heritrix 3.x is largely out of date (most of it pertains to Heritrix 1.x, as most of the classes have been changed or code has been significantly rewritten/refactored), could anyone point me to the relevant…
9codeMan9
0
votes
1 answer

Unable to run Heritrix job

I am new to Heritrix 3.1.1. I get an error message when I run a job after starting Heritrix. My job configuration: metadata.operatorContactUrl="http://localhost" metadata.jobName=basic metadata.description=Basic crawl starting with…
peter8015
0
votes
1 answer

Heritrix: how to exclude everything but pdf from mirroring?

I found this topic: "How do I exclude everything but text/html from a Heritrix crawl?" I have changed the bean to this
hudvin
0
votes
2 answers

How to use the webUI for Heritrix remotely

Hello, I have been playing with Heritrix and would like to include it on a website and allow remote web access to it. I have a Linux-based server where I host a webpage, and I have built a version of Heritrix. The issue is I am at home now and…
Liv MacIntosh
0
votes
1 answer

Is it possible to integrate Nutch Crawler with my existing Lucene project?

I have a project that already uses Lucene 3.5. Now I need to provide a web search function, but I don't want to import the whole Nutch project. So I wonder whether I can use only the crawler part of Nutch to crawl websites and index them into Lucene…
WoooHaaaa
-1
votes
1 answer

Can't run parallel jobs in Heritrix3 Web Crawler

I created 2 jobs in Heritrix 3.2.0 and launched both after building. Both started running, but after 15 to 20 seconds one job stopped while the other continued. When a job stops, the status in the job's log is as follows: 2015-05-12T06:40:33.715Z…
Qasim Javed