How to read a subset of records from a warc file

Question

I'm trying to parse .warc files from Common Crawl in Python.

Since the files are huge, I want to start with a sample/subset of the first few records.

How do I truncate the file to only include the first X lines while preserving the newlines/carriage returns that are in place?

Here's what I tried already:

1. `head -n 250 oldfile > newfile`. This removes some of the carriage returns that are needed to parse the file. Here's the error I get if I try to use this file in my Hadoop job (reading it with the warc package):

  Traceback (most recent call last):
      File "test.py", line 46, in <module>
        TagGrabber.run()
      File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/job.py", line 461, in run
        mr_job.execute()
      File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/job.py", line 479, in execute
        super(MRJob, self).execute()
      File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/launch.py", line 151, in execute
        self.run_job()
      File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/launch.py", line 214, in run_job
        runner.run()
      File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/runner.py", line 464, in run
        self._run()
      File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/sim.py", line 173, in _run
        self._invoke_step(step_num, 'mapper')
      File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/sim.py", line 264, in _invoke_step
        self.per_step_runner_finish(step_num)
      File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/local.py", line 152, in per_step_runner_finish
        self._wait_for_process(proc_dict, step_num)
      File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/local.py", line 268, in _wait_for_process
        (proc_dict['args'], returncode, ''.join(tb_lines)))
    Exception: Command ['sh', '-ex', 'setup-wrapper.sh', '/var/cc-mrjob/venv/bin/python', 'test.py', '--step-num=0', '--mapper', '/tmp/test.root.20150520.071726.549519/input_part-00000'] returned non-zero exit status 1:
    Traceback (most recent call last):
      File "test.py", line 46, in <module>
        TagGrabber.run()
      File "/tmp/test.root.20150520.071726.549519/job_local_dir/0/mapper/0/mrjob.tar.gz/mrjob/job.py", line 461, in run
        mr_job.execute()
      File "/tmp/test.root.20150520.071726.549519/job_local_dir/0/mapper/0/mrjob.tar.gz/mrjob/job.py", line 470, in execute
        self.run_mapper(self.options.step_num)
      File "/tmp/test.root.20150520.071726.549519/job_local_dir/0/mapper/0/mrjob.tar.gz/mrjob/job.py", line 535, in run_mapper
        for out_key, out_value in mapper(key, value) or ():
      File "/var/cc-mrjob/mrcc.py", line 33, in mapper
        for i, record in enumerate(f):
      File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/warc/warc.py", line 390, in __iter__
        record = self.read_record()
      File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/warc/warc.py", line 373, in read_record
        header = self.read_header(fileobj)
      File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/warc/warc.py", line 331, in read_header
        raise IOError("Bad version line: %r" % version_line)
    IOError: Bad version line: 'WARC/1.0\n'

2. Same as #1 but with the `tail` command.
3. Same as #1 but using `tr` or `sed` afterwards to replace any lost newline or ^M (carriage return) characters. The warc package still complains that the expected carriage returns or newlines are not in place.
4. `unix2dos oldfile`

Looking at the warc python lib it doesn't read the entire .warc file at once, but a record at a time. What do you require the truncation for? An honest question, perhaps moving over a network or some such? — Ilja Everilä, May 20 '15 at 12:57
Adding to previous "it doesn't read the entire .warc" it's quite trivial to implement a "read N first records" using the warc lib only: `islice(warc_file, N)`, if that's what you are looking for. — Ilja Everilä, May 20 '15 at 13:17
@Ilja thanks - that's exactly what I was looking for. Can you add that as an answer? — okoboko, May 20 '15 at 15:26

score 1 · Accepted Answer · answered May 21 '15 at 10:07

It would be difficult to handle newlines correctly, because .warc files may contain binary data as well. Truncating by line count would also probably produce broken .warc files, since the Python library, for example, trusts that the Content-Length headers are valid.
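To make the Content-Length point concrete, here is a minimal sketch that copies the first few records byte for byte instead of counting lines. It uses only the standard library, not the warc package, and it assumes an uncompressed WARC/1.0 stream in which every header line ends in CRLF and every record is followed by two CRLFs. The names `copy_first_records` and `make_record` are hypothetical helpers invented for this illustration.

```python
import io


def copy_first_records(src, dst, n):
    """Copy up to n records from binary stream src to dst, unchanged.

    Assumption: uncompressed WARC/1.0 layout (CRLF header lines, a blank
    CRLF line, Content-Length payload bytes, then CRLF CRLF).
    """
    copied = 0
    while copied < n:
        version = src.readline()
        if not version:
            break  # clean end of input
        if not (version.startswith(b"WARC/") and version.endswith(b"\r\n")):
            raise ValueError("bad version line: %r" % version)
        header = [version]
        length = None
        while True:
            line = src.readline()
            if not line.endswith(b"\r\n"):
                raise ValueError("header line without CRLF: %r" % line)
            header.append(line)
            if line == b"\r\n":
                break  # blank line ends the header
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("record has no Content-Length header")
        # Read the payload by byte count, never by line, plus CRLF CRLF.
        body = src.read(length + 4)
        if len(body) < length + 4:
            raise ValueError("record is truncated")
        dst.write(b"".join(header) + body)
        copied += 1
    return copied


def make_record(payload):
    """Build a tiny synthetic record for the demonstration below."""
    size = str(len(payload)).encode("ascii")
    return (b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: " + size
            + b"\r\n\r\n" + payload + b"\r\n\r\n")


# Payloads deliberately contain bare newlines and binary bytes.
payloads = [b"a\nb\r\nc", b"\x00\x01\n\n", b"third"]
sample = b"".join(make_record(p) for p in payloads)
out = io.BytesIO()
count = copy_first_records(io.BytesIO(sample), out, 2)
```

Because the payload is read by length, newlines inside it are never interpreted, which is exactly what a line-based tool cannot guarantee. Compressed `.warc.gz` input would need decompressing first, which this sketch does not attempt.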

The warc Python library reads only one record at a time from the .warc file (it avoids reading the entire file into memory at once), so it is possible to handle subsets using Python alone. For example:

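import warc
from itertools import islice

N = 10  # number of records to take from the start of the file
warc_file = warc.open('/path/to/file.warc')
# islice stops after N records, so the rest of the file is never read
for record in islice(warc_file, N):
    do_stuff_with(record)  # placeholder for your own processing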

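The reason this is cheap can be sketched without the warc package at all: `islice` pulls items from the underlying iterator only on demand and stops once it has yielded N of them. In the illustration below, `fake_records` is a hypothetical stand-in for an opened WARC file object, and `read_log` simply notes which records were actually produced.

```python
from itertools import islice


def fake_records(log):
    """Hypothetical stand-in for a WARC file: yields records lazily, forever."""
    i = 0
    while True:
        log.append(i)  # remember that record i was actually "read"
        yield "record-%d" % i
        i += 1


read_log = []
# Take only the first three records from an endless source.
first_three = list(islice(fake_records(read_log), 3))
```

After this runs, `read_log` holds only the three records that were requested, so nothing beyond the subset is touched even though the source never ends.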