I am a fresher and I am about to participate in a contest this weekend. The problem is about archiving and retrieving a large HTML dataset, and I have no idea how to approach it. My friend suggested that I use a web archive and Common Crawl. Please suggest a way to convert the HTML dataset into a web archive and how to index it. Thanks in advance.
2 Answers
The WARC format is a widely used standard and definitely a good decision for archiving web pages. Note that the HTTP headers are also stored in the WARC file; as a consequence, you normally need a crawler to create one. If the HTML pages are provided as a collection of files, you would need to crawl the file system (e.g. via a local HTTP server) to get the content into a WARC file.
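If the pages are already on disk and you do not want to set up a local server, a library such as warcio (not mentioned above, just one common Python option) can write the files into a WARC directly. A minimal sketch, assuming an `html_dataset/` directory and made-up URLs:

```python
# Sketch only: wrap a directory of local HTML files into a WARC file.
# Assumes the third-party 'warcio' library (pip install warcio); the
# directory name, output filename and synthetic URLs are made up.
from io import BytesIO
from pathlib import Path

from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

with open('dataset.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)
    for path in Path('html_dataset').rglob('*.html'):
        body = path.read_bytes()
        # Fabricate a URL and minimal HTTP headers, since these files
        # were never actually served over HTTP.
        url = 'http://local.example/' + path.as_posix()
        http_headers = StatusAndHeaders(
            '200 OK',
            [('Content-Type', 'text/html; charset=utf-8'),
             ('Content-Length', str(len(body)))],
            protocol='HTTP/1.1')
        record = writer.create_warc_record(
            url, 'response', payload=BytesIO(body),
            http_headers=http_headers)
        writer.write_record(record)
```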
Everything else depends on the concrete task. There are many tools and libraries to crawl and export content as WARC; the simplest is
wget --warc-file
but there are many more. The same goes for reading WARC files and processing their content.
See The WARC Ecosystem for a collection of tools. If you just need a serious WARC file to start with, fetch one from Common Crawl, e.g., https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-30/segments/1469257824853.47/warc/CC-MAIN-20160723071024-00101-ip-10-185-27-174.ec2.internal.warc.gz
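To get a feel for the reading side, here is a minimal sketch, again assuming the warcio library, that iterates over the response records of a WARC file (e.g. the Common Crawl sample linked above) and prints each page's URL and size:

```python
# Sketch only: iterate over the response records of a (gzipped) WARC
# and print the URL and payload size of each capture. Assumes 'warcio'
# is installed; the filename refers to the Common Crawl file above.
from warcio.archiveiterator import ArchiveIterator

with open('CC-MAIN-20160723071024-00101-ip-10-185-27-174.ec2.internal.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'response':
            continue
        url = record.rec_headers.get_header('WARC-Target-URI')
        html = record.content_stream().read()
        print(url, len(html))
```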

You could use the Heritrix crawler to crawl the websites you require. This can be automated via cURL requests to Heritrix's REST API from a simple shell script.
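If you prefer Python over a shell script, the same automation can be done against the REST API. This is only a sketch: it assumes a local Heritrix 3 instance started with `-a admin:admin`, an already configured job named `myjob`, and endpoint/action names taken from the Heritrix 3 REST API documentation, which may differ between versions:

```python
# Sketch only: drive a locally running Heritrix 3 instance through its
# REST API. Assumes digest auth with admin:admin, a pre-configured job
# called "myjob", and Heritrix's default self-signed HTTPS certificate.
import requests
from requests.auth import HTTPDigestAuth

BASE = 'https://localhost:8443/engine/job/myjob'
AUTH = HTTPDigestAuth('admin', 'admin')

# Build the job, launch it (it starts paused), then unpause to crawl.
for action in ('build', 'launch', 'unpause'):
    resp = requests.post(BASE, data={'action': action},
                         auth=AUTH, verify=False)  # self-signed cert
    resp.raise_for_status()
```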
Once you have crawled the websites, you can install OpenWayback to 'play back' the archived websites in your browser. OpenWayback comes with a tool, the CDX-Indexer, which can be used to index the crawled websites.
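To make the indexing step less abstract, here is a heavily simplified sketch of the idea behind a CDX index: map each captured URL to the byte offset of its record in the WARC, so a replay tool can seek straight to it. It assumes warcio and the `dataset.warc.gz` from the first sketch; a real CDX/CDXJ index (as produced by the CDX-Indexer or pywb's cdxj-indexer) also stores timestamps, digests and record lengths:

```python
# Sketch only: a toy URL -> offset index over a WARC file, illustrating
# what a CDX index provides. Assumes 'warcio' and a local dataset.warc.gz.
from warcio.archiveiterator import ArchiveIterator

index = {}
with open('dataset.warc.gz', 'rb') as stream:
    records = ArchiveIterator(stream)
    for record in records:
        if record.rec_type != 'response':
            continue
        url = record.rec_headers.get_header('WARC-Target-URI')
        record.content_stream().read()             # consume the payload
        index[url] = records.get_record_offset()   # offset of this record

# A CDX file is essentially these entries, sorted by URL key.
for url in sorted(index):
    print(index[url], url)
```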
OpenWayback is no longer under active development, so for replaying the WARCs you may want to use pywb instead.
