Highest Voted 'heritrix' Questions

0 votes, 1 answer

Scraping a Heritrix page using Python's requests module

I want to scrape a Heritrix home page using Python's requests module. When I try to open this page in Chrome, I get the error: This server could not prove that it is 10.100.121.41; its security certificate is not trusted by your computer's…

ssl python-requests heritrix

asked Feb 20 '15 at 19:34

rivu

2,004
2
29
45

0 votes, 0 answers

MirrorWriterProcessor in Heritrix 3.2.0 active threads

When I'm using the MirrorWriterProcessor class, I get only 1 active thread all the time, because it won't accept the uncommented properties for increasing max active threads, for example. I'm no Java programmer at all, so if someone can help me I would…

java heritrix

asked Nov 10 '14 at 23:20

GMAC

9
1

0 votes, 1 answer

Heritrix 3.2.0: Writing and Adding Extensions

I am currently working with Heritrix and I have a standard installation (this one: http://builds.archive.org/maven2/org/archive/heritrix/heritrix/3.2.0/) and it works fine. But now I want to write and add my own extensions e.g. change the priority…

spring jar web-crawler heritrix

asked Nov 07 '14 at 20:59

AnswerNotKnownException

41
3

0 votes, 1 answer

Heritrix DecidingScope regexp URI

I'm using Heritrix to crawl a site called octetfarm.com. I would like the crawler to apply a regexp to the URI (or URL) and, if the string "octetfarm" is present, accept it. I made 2 rules: 1 MatchesRegExpDecideRule "ACCEPT" and…

regex heritrix

asked Oct 01 '14 at 03:05

user848106

235
1
8
18
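
For the DecidingScope regexp question above, a minimal Python sketch of the matching behaviour may help. It assumes the ACCEPT rule only fires when the regex matches the whole URI (the way Java's `Matcher.matches()` behaves), so a bare `octetfarm` pattern would not match a full URL while `.*octetfarm.*` would. The helper name `accepts` is hypothetical and is not part of Heritrix.

```python
import re


def accepts(uri: str, pattern: str = r".*octetfarm.*") -> bool:
    """Hypothetical stand-in for an ACCEPT rule that needs a whole-URI match."""
    # re.fullmatch succeeds only if the pattern covers the entire string,
    # which mirrors the assumed whole-string matching of the crawler rule.
    return re.fullmatch(pattern, uri) is not None


if __name__ == "__main__":
    for uri in ("http://octetfarm.com/about", "http://example.com/"):
        print(uri, accepts(uri))
    # A bare substring pattern does not cover the whole URI.
    print(accepts("http://octetfarm.com/about", "octetfarm"))
```

If whole-string matching is indeed what the rule does, wrapping the substring in `.*` on both sides is the usual fix.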

0 votes, 1 answer

Understanding the "content type" for PDFs in crawling output

Using Heritrix, I crawled a site that contained some PDF files. The crawl log shows that the content type for the PDF link is "application/pdf", whereas the response in the .warc file (crawl output) shows that the content type is…

http pdf web-crawler content-type heritrix

asked May 29 '14 at 11:33

rivu

2,004
2
29
45

0 votes, 1 answer

Heritrix retrieves gzip CSS + JS

When I run Heritrix, my web server gzips the JS + CSS assets. This turns out to be a problem because when the .warc file is loaded through Wayback, the content is still gzip-encoded. I am unable to view the .css + .js files properly in the browser.

java javascript css heritrix

asked Sep 17 '13 at 19:04

Tim Nuwin

2,775
2
29
63
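
For the gzip question above, here is a rough Python sketch of the decoding step that a replay tool or post-processing script would need. It assumes the stored response body is raw gzip data and that the `Content-Encoding` header value is available as a string. The function name `decode_body` is hypothetical and does not come from Heritrix or Wayback.

```python
import gzip
from typing import Optional


def decode_body(body: bytes, content_encoding: Optional[str]) -> bytes:
    """Return the body decompressed if it is labelled as gzip, else unchanged."""
    # Header values are case-insensitive, so normalise before comparing.
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(body)
    return body


if __name__ == "__main__":
    css = b"body { margin: 0; }"
    stored = gzip.compress(css)  # what a gzip-encoded capture would hold
    print(decode_body(stored, "gzip") == css)
```

This only illustrates the decoding itself; whether to decode at replay time or to request uncompressed assets during the crawl is a separate choice.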

0 votes, 1 answer

How do I exclude everything but links/outlinks from a Heritrix crawl?

I'm working with Heritrix and I'm a bit stuck with managing its output. I'm studying PageRank and I need Heritrix to generate a file to which I can apply the ranking algorithm. The file that I need should contain only links and outlinks for each…

web-crawler heritrix

asked Jul 25 '13 at 12:24

Ein F

1
1

0 votes, 1 answer

Java & Heritrix 3.1.x: Web Content parsing?

Since the developer documentation for Heritrix 3.x is largely out of date (most of it pertains to Heritrix 1.x, as most of the classes have been changed or code has been significantly rewritten/refactored), could anyone point me to the relevant…

java web-crawler html document-classification heritrix

asked Jul 19 '13 at 15:54

9codeMan9

802
6
11
25

0 votes, 1 answer

Unable to run Heritrix job

I am new to Heritrix 3.1.1. I got an error message when I ran a job after starting up Heritrix. My job configuration: metadata.operatorContactUrl="http://localhost" metadata.jobName=basic metadata.description=Basic crawl starting with…

heritrix

asked Apr 11 '13 at 10:18

peter8015

1
1

0 votes, 1 answer

Heritrix: how to exclude everything but PDF from mirroring?

I found this topic: "How do i exclude everything but text/html from a heritrix crawl?" I have changed the bean to this

cxml heritrix

asked Nov 25 '12 at 10:50

hudvin

63
1
7

0 votes, 2 answers

How to use the web UI for Heritrix remotely

Hello, I have been playing with Heritrix and would like to include it on a website and allow remote web access to it. I have a Linux-based server where I host a webpage, and I have built a version of Heritrix. The issue is I am at home now and…

linux remote-access web-crawler heritrix

asked Oct 05 '12 at 00:39

Liv MacIntosh

153
1
10

0 votes, 1 answer

Is it possible to integrate Nutch Crawler with my existing Lucene project?

I have a project that already uses Lucene 3.5. Now I need to provide a web search function, but I don't want to import the whole Nutch project. So I wonder whether I can use only the crawler part of Nutch to crawl websites and index them into Lucene…

java lucene web-crawler nutch heritrix

asked Apr 06 '12 at 07:30

WoooHaaaa

19,732
32
90
138

-1 votes, 1 answer

Can't run parallel jobs in Heritrix 3 web crawler

I created 2 jobs in Heritrix 3.2.0 and launched both after building. Both started running, but after 15 to 20 seconds one job stopped while the other continued. When a job stops, the status in the job log is as follows: 2015-05-12T06:40:33.715Z…

linux bash web-crawler heritrix

asked May 12 '15 at 06:51

Qasim Javed

27
1
7


Questions tagged [heritrix]