Heritrix 3.2.x: how to read content from WARC files?

Question

Using Heritrix 3.2.x, I crawled a website. Now I want to read the HTML content from the WARC files it created. Can anyone help? I have tried the Python warc tool and the Java-based warc-tools.jar.

score 0 · Answer 1 · answered Aug 26 '16 at 15:28

To get an idea of what a WARC file consists of, open it in a text editor. For a graphical view, you need a replay tool such as webarchiveplayer, pywb, or OpenWayback.
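
What the text editor shows is a sequence of records: a `WARC/1.0` version line, some `Name: value` headers, a blank line, and then a block of exactly `Content-Length` bytes. For `response` records that block is the raw HTTP response. A minimal sketch of reading that layout with only the Python standard library follows. The function names (`open_warc`, `iter_records`, `html_pages`) are made up for this illustration and do not come from any WARC library; the sketch assumes un-folded header lines and does not decode chunked or compressed HTTP bodies.

```python
import gzip


def open_warc(path):
    """Open a WARC file in binary mode, handling .warc.gz transparently."""
    # gzip.open reads a file made of several concatenated gzip members
    # (commonly one per record) as a single continuous stream.
    if str(path).endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")


def iter_records(stream):
    """Yield (headers, block) for each WARC record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line[:40])
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC headers
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        block = stream.read(int(headers["content-length"]))
        yield headers, block


def html_pages(stream):
    """Yield (url, body_bytes) for response records served as text/html."""
    for headers, block in iter_records(stream):
        if headers.get("warc-type") != "response":
            continue
        # The block is the raw HTTP message: status line and headers,
        # a blank line, then the body.
        http_head, sep, body = block.partition(b"\r\n\r\n")
        if not sep:
            continue
        is_html = any(
            h.lower().startswith(b"content-type:") and b"text/html" in h.lower()
            for h in http_head.split(b"\r\n")
        )
        if is_html:
            yield headers.get("warc-target-uri"), body
```

Iterating over `html_pages(open_warc(path))` then gives each captured URL together with its HTML bytes, which can be written to disk or passed to an HTML parser.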

zuups

YMomb · Answer 2 · 2017-01-07T08:05:36.650

0

Have you tried writing a reader with JWAT, or using the JWAT-Tools command line?

jwattools.cmd extract path.to.warc(.gz)

edited Jan 07 '17 at 08:05

answered Jan 05 '17 at 21:29

YMomb

score 0 · Answer 3 · answered May 06 '23 at 03:18

I am using the same version of Heritrix as you. For playback, I use OpenWayback.

OpenWayback is bundled with a CDX indexer, which writes an index of the WARC contents to a CDX file. From that index you can look up the captured URLs, such as the HTML pages, and locate their records in the WARC files.
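
A CDX file is plain text with one line per captured record, so it can be filtered with a few lines of code. The sketch below assumes the common space-delimited layout whose first line begins with ` CDX` and lists one-letter field codes (for example `a` for the original URL, `b` for the timestamp, `m` for the MIME type, `V` for the compressed record offset and `g` for the file name). The layout actually produced depends on how the indexer was run, so check the header of your own file. `parse_cdx` and `html_captures` are hypothetical helper names, not part of OpenWayback.

```python
def parse_cdx(lines):
    """Yield one dict per CDX entry, keyed by the header's field codes."""
    lines = iter(lines)
    header = next(lines, "").split()
    if not header or header[0] != "CDX":
        raise ValueError("expected a header line starting with ' CDX'")
    codes = header[1:]
    for line in lines:
        fields = line.split()
        if len(fields) != len(codes):
            continue  # skip blank or malformed lines
        yield dict(zip(codes, fields))


def html_captures(lines):
    """Yield (url, timestamp, warc_file, offset) for text/html captures."""
    for entry in parse_cdx(lines):
        # Assumes the index includes the 'a', 'b', 'g' and 'V' fields.
        if entry.get("m", "").startswith("text/html"):
            yield entry["a"], entry["b"], entry["g"], int(entry["V"])
```

Given a file name and offset from the index, one way to fetch a single page is to seek to that offset in the `.warc.gz` file and read the record that starts there, rather than scanning the whole archive.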

Du-Lacoste
