I need to retrieve records from a *.warc.gz file by their Target-URI (the `WARC-Target-URI` header). The documentation says this requires creating external CDXJ index files.

I've tried opening the file with gzip.open() and calling seek(offset), but the seek operation takes quite some time (seconds).

Is there a better way to retrieve the records?

Edit: I'm using the warc Python library, and it doesn't seem to provide a direct f.seek() on the WARC file.

kartheek7895

1 Answer

You should do the seek on the file before decompressing it. WARC files are usually compressed record by record, so the offset and length in the CDXJ index let you clip a single WARC record out of the compressed file and then run gzip.open() on that single record only. When in doubt, it's better to use a library. warcio even provides a command-line tool to extract a single record by offset: `warcio extract xyz.warc.gz offset`.
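
A minimal sketch of the clipping approach, assuming the index gives you the byte offset and the compressed length of the record and that the file really is compressed record by record. The function name `read_record_at` and its parameters are made up for illustration; only the Python standard library is used.

```python
import gzip


def read_record_at(path, offset, length):
    """Return the decompressed bytes of the single gzip member
    that starts at `offset` and is `length` compressed bytes long.

    `offset` and `length` refer to the compressed *.warc.gz file,
    e.g. as taken from a CDXJ index entry (assumption).
    """
    # Open the raw file, not gzip.open(): seeking here is a plain
    # file seek and does not decompress anything before the record.
    with open(path, "rb") as f:
        f.seek(offset)
        compressed = f.read(length)
    # Decompress only the clipped record.
    return gzip.decompress(compressed)
```

The returned bytes are the raw WARC record (headers plus payload); parsing them is left to a WARC library.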

Sebastian Nagel
  • I'm not able to clip out a single record directly using the offset, because the Internet Archive's warc library doesn't provide any function to do so. Also, can you explain a little about how WARC files are compressed record by record? – kartheek7895 Mar 20 '18 at 08:58
  • Also, I'm not sure whether warcio can be integrated directly into a custom-built crawler. – kartheek7895 Mar 20 '18 at 09:05
  • The concatenation of two gzipped files is also a valid gzipped file; in other words, `zcat file1.gz file2.gz` and `cat file1.gz file2.gz | gzip -dc` give the same output. But if you know the offset of the beginning of file2.gz, it's possible to extract the content of file2.gz separately. This procedure is recommended in the [WARC standard](http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#record-at-time-compression). – Sebastian Nagel Mar 20 '18 at 14:52
  • How do I get the offset of each record? I'm calling f.tell() while iterating, but when I try to extract a record using that offset, I can't retrieve it. On inspection, my offsets differ from the ones generated by warcio index. Can someone shed some light on how warcio index computes the offset? – kartheek7895 Mar 21 '18 at 06:52
  • The CDX index contains offsets into the compressed *.warc.gz file. The offsets reported by the decompressed file object returned by gzip.open(...) will differ. Have a look at warcio's archiveiterator.py to see how to read the uncompressed content while keeping track of offsets in the compressed file. Caveat: it's not trivial, cf. the [unresolved issue in the warc module](https://github.com/internetarchive/warc/issues/21). – Sebastian Nagel Mar 21 '18 at 08:51
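
To illustrate the difference between the two kinds of offsets discussed in the comments, here is a rough sketch that walks a record-compressed file and reports where each gzip member starts in the compressed file. This is not how warcio does it; it is a simplified stand-in built on the standard `zlib` module, and the name `member_offsets` is made up for illustration. It reads the whole file into memory, so treat it as a demonstration only.

```python
import zlib


def member_offsets(path):
    """Yield (offset, length, data) for each gzip member in the file.

    `offset` and `length` are positions in the COMPRESSED file,
    which is what a CDX/CDXJ index stores.
    """
    with open(path, "rb") as f:
        raw = f.read()  # whole file in memory: for simplicity only

    pos = 0
    while pos < len(raw):
        # wbits=31 tells zlib to expect a gzip header and trailer.
        d = zlib.decompressobj(31)
        data = d.decompress(raw[pos:])
        # Bytes after the end of this member are left in unused_data,
        # so the member's compressed length is what was consumed.
        consumed = len(raw) - pos - len(d.unused_data)
        yield pos, consumed, data
        pos += consumed
```

Each yielded `offset` can be passed to a plain `seek()` on the file opened in binary mode, whereas `tell()` on a `gzip.open()` object reports positions in the decompressed stream.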