Using Heritrix 3.2.x, I crawled a website. Now I want to read the HTML content from the WARC files it created. Can anyone help? I have tried the Python warc tool and the Java-based warc-tools.jar.

3 Answers

To get an idea of what a WARC file consists of, just open it in a text editor (decompress it first if it is a .warc.gz). For a graphical view, you need a tool like webarchiveplayer, pywb, or OpenWayback.
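
What you would see in the editor is a sequence of records: a `WARC/1.0` line, some `Name: value` headers, a blank line, then `Content-Length` bytes of payload. For `response` records the payload is the raw HTTP response. As a rough illustration of that layout, here is a minimal stdlib-only Python sketch that pulls the HTML out. The function names are my own, not part of any library. It assumes the HTTP bodies are neither chunked nor compressed, and the `text/html` check is a crude substring match, so treat it as a starting point rather than a complete reader.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, block) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separating two records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        block = stream.read(int(headers["content-length"]))
        yield headers, block


def iter_html(stream):
    """Yield (url, html_bytes) for response records that look like HTML."""
    for headers, block in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue
        # The block is a raw HTTP response: status line and headers, blank line, body.
        http_head, _, body = block.partition(b"\r\n\r\n")
        if b"text/html" in http_head.lower():  # crude content-type check
            yield headers.get("warc-target-uri"), body


def open_warc(path):
    """Open a .warc or .warc.gz file for binary reading."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")
```

Usage would look something like `with open_warc("crawl.warc.gz") as f: for url, html in iter_html(f): ...`, where the file name is a placeholder for one of the files Heritrix wrote.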

zuups

Have you tried writing a reader with JWAT, or using the JWAT-Tools command line?

jwattools.cmd extract path.to.warc(.gz)
YMomb

I am using the same version of Heritrix as you. For playback, I use OpenWayback.

OpenWayback is bundled with a CDX indexer (cdx-indexer), which can be used to index the contents of the WARC files into a CDX file. From that file you can obtain the links to the HTML pages, etc.
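
Once you have the CDX file, picking out the HTML URLs is a matter of reading its columns. A small Python sketch follows, under these assumptions: the file is space-delimited, and it starts with a header line such as ` CDX N b a m s k r M S V g`, where `a` is the original URL and `m` is the MIME type. Check the header of your own file, since the field list can differ. The function name is hypothetical.

```python
def html_urls_from_cdx(lines):
    """Return the original URLs of all CDX entries whose MIME type is text/html."""
    fields = None
    urls = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip empty lines
        if parts[0] == "CDX":
            fields = parts[1:]  # field letters, e.g. ['N', 'b', 'a', 'm', ...]
            continue
        if fields is None:
            raise ValueError("no CDX header line found before the first entry")
        entry = dict(zip(fields, parts))
        # 'm' = MIME type, 'a' = original URL (assumed field letters, see above)
        if entry.get("m") == "text/html":
            urls.append(entry.get("a"))
    return urls
```

You could call it as `html_urls_from_cdx(open("index.cdx", encoding="utf-8"))`, with `index.cdx` standing in for whatever file the indexer produced.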

Du-Lacoste