Retrieving records from WARC file based on url

Question

I have to retrieve records from a *.warc.gz file based on Target-URI. The documentation says that this requires external CDXJ index files to be created.

I've tried opening the file with gzip.open() and calling seek(offset), but the seek operation takes quite some time (seconds).

Is there a proper way to retrieve the records?

Edit: I'm using the warc Python library, and it doesn't seem to provide a direct f.seek() on the WARC file.

Sebastian Nagel · Accepted Answer · 2018-03-20T08:29:10.463

3

You should do the seek on the file before decompressing. Usually, WARC files are compressed record by record, and the offset and length in the CDXJ index let you clip a single WARC record out of the compressed file and then run gzip.open() on that single record. When in doubt, better use a library. Warcio even provides a command-line tool to extract a single record by offset: warcio extract xyz.warc.gz offset.
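A minimal sketch of the clipping step described above, using only the Python standard library. The function name `read_record` is hypothetical, and the sketch assumes the file is compressed record by record and that `offset` and `length` are the values taken from the CDXJ index (positions in the compressed file):

```python
import gzip


def read_record(path, offset, length):
    """Return the decompressed bytes of the gzip member at offset/length.

    Assumes offset and length refer to the *compressed* file, as stored
    in a CDXJ index, and that each WARC record is its own gzip member.
    """
    # Open the file as plain binary (not gzip.open), so that seek() is a
    # cheap jump in the compressed file and nothing before it is inflated.
    with open(path, "rb") as f:
        f.seek(offset)
        member = f.read(length)
    # Decompress only the clipped record.
    return gzip.decompress(member)
```

The returned bytes hold the raw WARC record (headers plus payload); parsing them is left to a WARC library.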

edited Mar 20 '18 at 08:29

answered Mar 20 '18 at 07:42


I'm not able to clip out a single record directly using the offset, as the Internet Archive's warc library does not provide any function to do so. Also, can you explain a little about how WARC files are compressed record by record? – kartheek7895 Mar 20 '18 at 08:58
Also, I'm not sure whether Warcio can be directly integrated into the custom-built crawler. – kartheek7895 Mar 20 '18 at 09:05
The result of concatenating two gzipped files is also a valid gzipped file; in other words, `zcat file1.gz file2.gz` and `cat file1.gz file2.gz | gzip -dc` give the same output. But if you know the offset of the beginning of file2.gz, it's possible to extract the content of file2.gz separately. This procedure is recommended in the [WARC standard](http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#record-at-time-compression). – Sebastian Nagel Mar 20 '18 at 14:52
How do I get the offset of each record? I'm calling f.tell() while iterating, but when I try to extract a record based on that offset, I'm not able to retrieve it. Upon inspection, my offsets differ from the offsets generated by warcio index. Can someone shed some light on how warcio index gets the offset? – kartheek7895 Mar 21 '18 at 06:52
The CDX index contains offsets into the compressed *.warc.gz file. The offsets reported by the decompressed file object returned by gzip.open(...) will differ. Have a look at warcio's archiveiterator.py to see how to read the uncompressed content while keeping track of offsets in the compressed file. Caveat: it's not trivial, cf. [unresolved issue in the warc module](https://github.com/internetarchive/warc/issues/21). – Sebastian Nagel Mar 21 '18 at 08:51
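
To illustrate the difference between compressed and decompressed offsets, here is a rough sketch that walks the gzip members of a record-compressed file and reports each member's offset and length in the compressed file. It is not how warcio does it (warcio streams the file, while this reads it into memory for simplicity); the function name `iter_members` is hypothetical, and the sketch assumes the file contains nothing but complete, concatenated gzip members:

```python
import zlib


def iter_members(path):
    """Yield (offset, length, data) for each gzip member in the file.

    offset and length are positions in the *compressed* file, i.e. the
    kind of values a CDX/CDXJ index stores.
    """
    with open(path, "rb") as f:
        buf = f.read()  # whole-file read; fine for a sketch, not for huge files
    pos = 0
    while pos < len(buf):
        # wbits=31 (16 + 15) tells zlib to expect a gzip header and trailer.
        d = zlib.decompressobj(wbits=31)
        data = d.decompress(buf[pos:])
        # Bytes after the end of this member are left in unused_data,
        # which gives the compressed length of the member just read.
        length = len(buf) - pos - len(d.unused_data)
        yield pos, length, data
        pos += length
```

Calling f.tell() on a gzip.open() file object reports positions in the decompressed stream instead, which is why those values cannot be used to seek in the *.warc.gz file.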
