Using Heritrix 3.2.x, I crawled a website. Now I want to read the HTML content from the WARC files it created. Can anyone help? I tried the Python warc tool and the Java-based warc-tools.jar.
3 Answers
0
To get an idea of what a WARC file contains, open it in a text editor (decompress it first if it is a .warc.gz). For a graphical view, you need a tool such as webarchiveplayer, pywb, or OpenWayback.
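What a text editor shows is a sequence of records: a `WARC/1.0` line, named headers, a blank line, then a block of `Content-Length` bytes. For `response` records, that block is the raw HTTP response. The following is a minimal standard-library sketch of walking that structure to pull out HTML, not a full WARC parser. The function names (`open_warc`, `read_headers`, `iter_records`, `iter_html`) are made up for illustration, and the sketch assumes the HTTP bodies are stored without chunked transfer encoding or content compression, which real crawls do not always satisfy.

```python
import gzip
import io


def open_warc(path):
    """Open a WARC file in binary mode, transparently handling .gz files."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


def read_headers(stream):
    """Read 'Name: value' lines up to a blank line; return a dict keyed by lowercase name."""
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            return headers
        name, _, value = line.decode("utf-8", "replace").partition(":")
        headers[name.strip().lower()] = value.strip()


def iter_records(stream):
    """Yield (warc_headers, content_block) for each record in the stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        headers = read_headers(stream)
        length = int(headers.get("content-length", 0))
        yield headers, stream.read(length)


def iter_html(stream):
    """Yield (url, html_bytes) for response records whose HTTP Content-Type mentions HTML."""
    for headers, block in iter_records(stream):
        if headers.get("warc-type") != "response":
            continue  # ignore warcinfo, request, metadata records
        http = io.BytesIO(block)
        http.readline()  # HTTP status line, e.g. "HTTP/1.1 200 OK"
        http_headers = read_headers(http)
        if "html" in http_headers.get("content-type", "").lower():
            yield headers.get("warc-target-uri"), http.read()
```

A typical call would be `for url, html in iter_html(open_warc("crawl.warc.gz")): ...`, where `crawl.warc.gz` is a placeholder file name. For anything beyond a quick look, a maintained WARC library is the safer choice, since it also handles the encodings this sketch ignores.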

zuups
0
Have you tried writing a reader with JWAT, or using the JWAT Tools command line?
jwattools.cmd extract path.to.warc(.gz)

YMomb
0
I am using the same version of Heritrix as you. For playback, OpenWayback is used. OpenWayback is bundled with a CDX indexer, which can be used to extract the contents into a CDX file, from which you can obtain the HTML links, etc.

Du-Lacoste