
I would like to download a subset of a WAT archive segment from Amazon S3.

Background:

Searching the Common Crawl index at http://index.commoncrawl.org yields results with information about the location of WARC files on AWS S3. For example, searching for url=www.celebuzz.com/2017-01-04/*&output=json yields JSON-formatted results, one of which is

{ "urlkey":"com,celebuzz)/2017-01-04/watch-james-corden-george-michael-tribute", ... "filename":"crawl-data/CC-MAIN-2017-34/segments/1502886104631.25/warc/CC-MAIN-20170818082911-20170818102911-00023.warc.gz", ... "offset":"504411150", "length":"14169", ... }

The filename entry indicates which archive segment contains the WARC file for this particular page. This archive file is huge; but fortunately the entry also contains offset and length fields, which can be used to request the range of bytes containing the relevant subset of the archive segment (see, e.g., lines 22-30 in this gist).

My question:

Given the location of a WARC file segment, I know how to construct the name of the corresponding WAT archive segment (see, e.g., this tutorial). I only need a subset of the WAT file, so I would like to request a range of bytes. But how do I find the corresponding offset and length for the WAT archive segment?
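For reference, the name mapping mentioned above can be sketched as follows. This is a minimal sketch that assumes the usual Common Crawl layout, where the `/warc/` directory becomes `/wat/` and the `.warc.gz` suffix becomes `.warc.wat.gz`; the helper name `warc_to_wat_path` is hypothetical.

```python
def warc_to_wat_path(warc_path):
    """Derive the WAT path from a WARC path (assumed Common Crawl naming convention)."""
    suffix = ".warc.gz"
    if "/warc/" not in warc_path or not warc_path.endswith(suffix):
        raise ValueError("not a recognised WARC path: %r" % warc_path)
    # Split on the last "/warc/" directory component
    head, _, name = warc_path.rpartition("/warc/")
    # Swap the directory and the file suffix
    return head + "/wat/" + name[: -len(suffix)] + ".warc.wat.gz"


warc = ("crawl-data/CC-MAIN-2017-34/segments/1502886104631.25/warc/"
        "CC-MAIN-20170818082911-20170818102911-00023.warc.gz")
print(warc_to_wat_path(warc))
```

This only gives the name of the WAT segment; it says nothing about where a given record sits inside it.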

I have checked the API documentation for the Common Crawl index server, and it isn't clear to me that this is even possible. But in case it is, I'm posting this question.

jmtroos

2 Answers


The Common Crawl index does not contain offsets into WAT and WET files, so the only reliable way is to scan the whole WAT/WET file for the desired record/URL. It may be possible to estimate the offset, because the record order in WARC and WAT/WET files is the same.
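A rough sketch of such a scan, using only the Python standard library, might look like this. It assumes the WAT file is a gzip file of concatenated WARC-style records and that each WAT record carries the original page URL in its `WARC-Target-URI` header; the function names `iter_warc_records` and `find_wat_record` are hypothetical.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in a decompressed WARC/WAT stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", 0))
        payload = stream.read(length)
        yield headers, payload


def find_wat_record(fileobj, target_uri):
    """Scan a gzipped WAT file object; return (headers, payload) or None."""
    with gzip.open(fileobj, "rb") as stream:
        for headers, payload in iter_warc_records(stream):
            if headers.get("warc-target-uri") == target_uri:
                return headers, payload
    return None
```

`fileobj` can be any binary file object, for example a locally downloaded WAT file opened in `'rb'` mode. The scan is linear in the size of the file, which is the cost of having no index for WAT records.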

Sebastian Nagel

After much trial and error, I managed to fetch a byte range from a WARC file with Python and boto3 in the following way:

# You have these from the index
offset, length, filename = 2161478, 12350, "crawl-data/[...].warc.gz"

import boto3
from botocore import UNSIGNED
from botocore.client import Config

# Boto3 anonymous login to common crawl
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

# Compute the inclusive byte range
offset_end = offset + length - 1
byte_range = 'bytes={offset}-{end}'.format(offset=offset, end=offset_end)
gzipped_text = s3.get_object(Bucket='commoncrawl', Key=filename, Range=byte_range)['Body'].read()

# Write the gzipped record to disk (binary mode, since the data is bytes)
with open("file.gz", 'wb') as f:
  f.write(gzipped_text)

The rest is optimisation... Hope it helps! :)
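As a small addition, here is a sketch of the range arithmetic and of decompressing the fetched bytes in memory instead of writing them to disk. It assumes the requested range covers exactly one gzip member, which is what the index `offset` and `length` describe for WARC files; the helper names are hypothetical, and `int()` is applied because the index returns both values as strings.

```python
import gzip


def byte_range_header(offset, length):
    """Build an inclusive HTTP Range header value from index offset/length."""
    start = int(offset)
    end = start + int(length) - 1  # Range end is inclusive
    return "bytes={}-{}".format(start, end)


def decompress_record(gzipped_bytes):
    """Decompress one gzip member (assumed to hold a single WARC record)."""
    return gzip.decompress(gzipped_bytes)


print(byte_range_header("504411150", "14169"))
```

The returned string can be passed as the `Range` argument of `get_object`, and the bytes read from the response body can be passed to `decompress_record`.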

dlazesz
  • This is useful for getting the offset of a WARC archive, but my original question was about doing the same for a WAT file... – jmtroos Oct 05 '17 at 13:36
  • Consider using [this Python tool](https://github.com/lxucs/commoncrawl-warc-retrieval) or the [Java WARCReaderFactory.get](https://github.com/iipc/webarchive-commons/blob/master/src/main/java/org/archive/io/warc/WARCReaderFactory.java) version. – Alex Moore-Niemi Nov 27 '19 at 16:36