I have to retrieve records from a *.warc.gz file based on Target-URI. The documentation says that this requires external CDXJ index files to be created.
I've tried opening the file with `gzip.open()` and calling `seek(offset)`, but the seek operation takes a long time (several seconds).
Is there another, correct way to retrieve the records?
Edit: I'm using the `warc` Python library, and it doesn't seem to provide a direct `f.seek()` on the WARC file.
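
For reference, here is a minimal sketch of the two kinds of seek involved, using only the standard library. The function names are hypothetical, and the second function assumes the file was written with one gzip member per record and that the byte offset of that member in the compressed file is already known (for example from an index). `gzip.open()` emulates `seek()` by decompressing and discarding everything before the target position, which would explain the delay on a large file.

```python
import gzip
import zlib


def read_via_gzip_seek(path, uncompressed_offset, length):
    """Seek inside the decompressed stream, as in the attempt described above."""
    with gzip.open(path, "rb") as f:
        # GzipFile has to decompress all data before the target position,
        # so the cost grows with the offset.
        f.seek(uncompressed_offset)
        return f.read(length)


def read_member_at(path, compressed_offset):
    """Decompress the single gzip member starting at compressed_offset.

    Assumption: one gzip member per record, offset known in advance.
    """
    with open(path, "rb") as f:
        f.seek(compressed_offset)  # plain file seek on the raw bytes
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
        chunks = []
        while not decompressor.eof:
            block = f.read(64 * 1024)
            if not block:
                break
            chunks.append(decompressor.decompress(block))
        return b"".join(chunks)
```

The second function returns the raw bytes of one record (headers and payload), which still need to be parsed.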