
For my work, I scrape websites and write them to gzipped web archives (with the extension "warc.gz"). I use Python 2.7.11 and the warc 0.2.1 library.

I noticed that I cannot read the majority of these files completely with the warc library. For example, if a warc.gz file has 517 records, I can read only about 200 of them.

After some research I found out that this problem happens only with the gzipped files. The files with extension "warc" do not have this problem.

Other people have this problem as well (https://github.com/internetarchive/warc/issues/21), but no solution has been found so far.

I guess that there might be a bug in "gzip" in Python 2.7.11. Does anyone have experience with this and know what can be done about it?

Thanks in advance!

Example:

I create new warc.gz files like this:

import warc
warc_path = r"\\some_path\file_name.warc.gz"  # raw string, so "\f" is not read as a form feed
warc_file = warc.open(warc_path, "wb")

To write records I use:

record = warc.WARCRecord(payload=value, headers=headers)
warc_file.write_record(record)

This creates perfect "warc.gz" files. There are no problems with them. Everything, including the "\r\n" line endings, is correct. The problem starts when I read these files.

To read files I use:

warc_file = warc.open(warc_path, "rb")

To loop through records I use:

for record in warc_file:
    ...

The problem is that not all records are found when looping over a "warc.gz" file, while all of them are found for "warc" files. The warc library itself is meant to handle both types of files.
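One way to check how many records a "warc.gz" file really contains, without going through the warc library, is to decompress it with the standard `gzip` module and count the record version lines. The following is only a rough sketch: the helper names `count_warc_records` and `make_member` and the sample record are made up for illustration, and it assumes that every record starts with a `WARC/1.0` line and that no payload contains a line starting with that text.

```python
import gzip
import io


def count_warc_records(fileobj):
    """Count lines that look like a WARC version line in a gzipped stream."""
    count = 0
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
        for line in gz:
            # Assumption: each record begins with a "WARC/1.0" version line.
            if line.startswith(b"WARC/1.0"):
                count += 1
    return count


def make_member(data):
    """Compress data into one standalone gzip member (hypothetical helper)."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(data)
    return buf.getvalue()


# Made-up sample: three tiny records, each compressed as its own gzip member.
fake_record = (b"WARC/1.0\r\nWARC-Type: resource\r\n"
               b"Content-Length: 5\r\n\r\nhello\r\n\r\n")
sample = b"".join(make_member(fake_record) for _ in range(3))

total = count_warc_records(io.BytesIO(sample))
```

For a file on disk, the same function can be called with `open(warc_path, "rb")`, and the result compared with the number of records the loop above yields.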

Ilja Everilä
  • Please add an [MCVE](http://stackoverflow.com/help/mcve), a minimal, complete and verifiable example. Even the linked issue in github is quite vague. – Ilja Everilä Mar 23 '16 at 09:08
  • Ilja, the warc library is very small, there is really not much code to give as an example. To create a warc.gz file I use `self.warc_file = warc.open(self.warc_path, "wb")`. To write records I use `record = warc.WARCRecord(payload=value, headers=headers)`. To read records I use `self.warc_file = warc.open(self.warc_path, "rb")` and `for record in self.warc_file:`. The problem is that not all records are found. – Ekaterina Ermilova Mar 23 '16 at 09:25
  • I can attach an example warc.gz file if I find out how to attach it here... – Ekaterina Ermilova Mar 23 '16 at 09:31
  • Please add the code examples to your question to make them easier to find (and that's where they belong anyway). Make the examples minimal, but complete in the sense that they readily display what you want to happen and what actually happens. A small example dataset that will reproduce your observed behaviour will also make it easier for people to answer. – Ilja Everilä Mar 23 '16 at 09:32
  • Ilja, the above are the code examples. There are no more relevant code examples. – Ekaterina Ermilova Mar 23 '16 at 09:35
  • The warc library documentation is here: http://warc.readthedocs.org/en/latest/. I use those small commands exactly as shown in the documentation. It works well with "warc" files, but exactly the same code does not work with "warc.gz" files, which get read only partially. Since other complaints about the same problem started exactly after the release of Python 2.7.11, I guess that it is related to an update of "gzip" in this release. – Ekaterina Ermilova Mar 23 '16 at 09:44
  • Please edit the question and add all this there. And your code examples are not complete. For example you do not include the actual writing part. – Ilja Everilä Mar 23 '16 at 09:47
  • Reproduced your results with python 2.7.9, 2.7.10 and 2.7.11. Checked that raw stdlib `gzip` reads the file just fine and so does the bastardation [`warc.gzip2`](https://github.com/internetarchive/warc/blob/master/warc/gzip2.py). It would seem that the logic in `WARCReader` is broken. – Ilja Everilä Mar 23 '16 at 12:06

1 Answer


It seems that the custom gzip handling in warc.gzip2.GzipFile, file splitting with warc.utils.FilePart and reading in warc.warc.WARCReader is broken as a whole (tested with python 2.7.9, 2.7.10 and 2.7.11). It stops short when it receives no data instead of a new header.

The basic stdlib gzip seems to handle the concatenated gzip files just fine, so this should work as well:

import gzip
import warc

with gzip.open('my_test_file.warc.gz', mode='rb') as gzf:
    for record in warc.WARCFile(fileobj=gzf):
        print record.payload.read()
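
The claim that the stdlib `gzip` reads concatenated gzip members in one pass can be checked without the warc library. This is a minimal sketch using in-memory data; the helper `gzip_member` and the record contents are hypothetical stand-ins, not real WARC records.

```python
import gzip
import io


def gzip_member(data):
    """Compress data into one standalone gzip member."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(data)
    return buf.getvalue()


# Hypothetical stand-ins for records, one gzip member per record.
records = [("record %d\r\n" % i).encode("ascii") for i in range(5)]
concatenated = b"".join(gzip_member(r) for r in records)

# A single GzipFile reads across all member boundaries.
with gzip.GzipFile(fileobj=io.BytesIO(concatenated), mode="rb") as gz:
    decompressed = gz.read()

# Split back into lines, keeping the "\r\n" endings.
recovered = decompressed.splitlines(True)
```

If `recovered` matches `records`, the decompression step is not where records get lost, which is consistent with the problem being in the reader logic on top of it.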
Ilja Everilä
  • Thanks! :) I have implemented a similar workaround. The question is mainly whether this is a bug in the gzip library. Should it be reported to the Python developers? – Ekaterina Ermilova Mar 23 '16 at 12:52
  • As your own workaround and mine show, the stdlib `gzip` works fine. I'm pretty sure the problem lies in the custom `warc.gzip2` library. It did not work with 3 different python 2.7 versions for me. – Ilja Everilä Mar 23 '16 at 12:55