I'm trying to parse .warc files from Common Crawl in Python.

Since the files are huge, I want to start with a sample/subset of the first few records.

How do I truncate the file to only include the first X lines while preserving the newlines/carriage returns that are in place?

Here's what I tried already:

  1. `head -n 250 oldfile > newfile`. This removes some of the carriage returns that are needed to parse the file. Here's the error I get if I try to use this file in my Hadoop job (reading it with the warc package):

      Traceback (most recent call last):
          File "test.py", line 46, in <module>
            TagGrabber.run()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/job.py", line 461, in run
            mr_job.execute()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/job.py", line 479, in execute
            super(MRJob, self).execute()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/launch.py", line 151, in execute
            self.run_job()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/launch.py", line 214, in run_job
            runner.run()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/runner.py", line 464, in run
            self._run()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/sim.py", line 173, in _run
            self._invoke_step(step_num, 'mapper')
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/sim.py", line 264, in _invoke_step
            self.per_step_runner_finish(step_num)
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/local.py", line 152, in per_step_runner_finish
            self._wait_for_process(proc_dict, step_num)
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/mrjob/local.py", line 268, in _wait_for_process
            (proc_dict['args'], returncode, ''.join(tb_lines)))
        Exception: Command ['sh', '-ex', 'setup-wrapper.sh', '/var/cc-mrjob/venv/bin/python', 'test.py', '--step-num=0', '--mapper', '/tmp/test.root.20150520.071726.549519/input_part-00000'] returned non-zero exit status 1:
        Traceback (most recent call last):
          File "test.py", line 46, in <module>
            TagGrabber.run()
          File "/tmp/test.root.20150520.071726.549519/job_local_dir/0/mapper/0/mrjob.tar.gz/mrjob/job.py", line 461, in run
            mr_job.execute()
          File "/tmp/test.root.20150520.071726.549519/job_local_dir/0/mapper/0/mrjob.tar.gz/mrjob/job.py", line 470, in execute
            self.run_mapper(self.options.step_num)
          File "/tmp/test.root.20150520.071726.549519/job_local_dir/0/mapper/0/mrjob.tar.gz/mrjob/job.py", line 535, in run_mapper
            for out_key, out_value in mapper(key, value) or ():
          File "/var/cc-mrjob/mrcc.py", line 33, in mapper
            for i, record in enumerate(f):
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/warc/warc.py", line 390, in __iter__
            record = self.read_record()
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/warc/warc.py", line 373, in read_record
            header = self.read_header(fileobj)
          File "/var/cc-mrjob/venv/local/lib/python2.7/site-packages/warc/warc.py", line 331, in read_header
            raise IOError("Bad version line: %r" % version_line)
        IOError: Bad version line: 'WARC/1.0\n'
    
  2. Same as #1, but with the `tail` command.

  3. Same as #1, but followed by `tr` or `sed` to restore any lost newline or ^M (carriage return) characters. The warc package still complains that the expected carriage returns or newlines are not in place.

  4. `unix2dos oldfile`
Ilja Everilä
okoboko
  • Looking at the warc python lib it doesn't read the entire .warc file at once, but a record at a time. What do you require the truncation for? An honest question, perhaps moving over a network or some such? – Ilja Everilä May 20 '15 at 12:57
  • Adding to previous "it doesn't read the entire .warc" it's quite trivial to implement a "read N first records" using the warc lib only: `islice(warc_file, N)`, if that's what you are looking for. – Ilja Everilä May 20 '15 at 13:17
  • @Ilja thanks - that's exactly what I was looking for. Can you add that as an answer? – okoboko May 20 '15 at 15:26

1 Answer

Handling newlines correctly with line-based tools would be difficult, because .warc files may contain binary data as well. Truncating at an arbitrary line would also probably produce a broken .warc file, since parsers such as the Python warc library trust that each record's Content-Length header is valid.
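
To illustrate the Content-Length point, here is a minimal sketch using only the standard library. The one-record sample and the helpers `head_lines` and `declared_and_actual` are made up for this illustration (they are not part of the warc package), and real records carry more headers. It shows that cutting by line count can leave a record whose declared length no longer matches the bytes present; it does not try to reproduce the exact "Bad version line" error from the question.

```python
# Hypothetical one-record sample; the payload mixes bare "\n", "\r\n" and binary bytes.
payload = b"line one\nline two\r\nline three\n\x00\x01binary"
header = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"Content-Length: " + str(len(payload)).encode("ascii") + b"\r\n"
    b"\r\n"
)
record = header + payload + b"\r\n\r\n"


def head_lines(data, n):
    """Mimic `head -n`: keep the first n lines, splitting on "\\n" only."""
    parts = data.split(b"\n")
    kept = b"\n".join(parts[:n])
    return kept + (b"\n" if len(parts) > n else b"")


def declared_and_actual(data):
    """Return (declared Content-Length, payload bytes actually available)."""
    head, _, rest = data.partition(b"\r\n\r\n")
    declared = None
    for line in head.split(b"\r\n")[1:]:  # skip the version line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            declared = int(value.strip())
    return declared, len(rest[:declared])


print(declared_and_actual(record))                 # lengths agree
print(declared_and_actual(head_lines(record, 5)))  # payload cut short
```

After the five-line cut the header still announces the full payload length, but only the first payload line is left, so a parser that trusts Content-Length would read past the end of the record.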

The warc Python library reads only one record at a time from the .warc file (it avoids loading the entire file into memory at once), so it is possible to handle a subset using Python alone. For example:

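import warc
from itertools import islice

N = 10  # number of records to process
warc_file = warc.open('/path/to/file.warc')
# islice stops after N records, so the rest of the file is never read
for record in islice(warc_file, N):
    do_stuff_with(record)  # placeholder for your own processing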
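
If you really do need a smaller file on disk, one option is to copy whole records rather than lines. The following is a rough sketch under some assumptions, not a tested tool: it uses only the standard library (not the warc package), expects an uncompressed .warc opened in binary mode, and relies on each record having a valid Content-Length header followed by the usual two CRLFs. `copy_first_records` and `make_record` are names invented here for illustration.

```python
import io


def copy_first_records(src, dst, n):
    """Copy up to n records from binary stream src to dst; return the count copied."""
    copied = 0
    while copied < n:
        version = src.readline()
        if not version:
            break  # end of input
        if not version.strip():
            continue  # blank separator line between records
        chunks = [version]
        length = None
        while True:
            line = src.readline()
            if not line:
                raise ValueError("unexpected end of file inside a record header")
            chunks.append(line)
            if line in (b"\r\n", b"\n"):
                break  # blank line ends the header block
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("record has no Content-Length header")
        body = src.read(length)  # read by byte count, never by line
        if len(body) != length:
            raise ValueError("record body is shorter than Content-Length")
        dst.write(b"".join(chunks) + body + b"\r\n\r\n")
        copied += 1
    return copied


def make_record(payload):
    """Build a tiny, hypothetical record for the demonstration below."""
    return (b"WARC/1.0\r\nContent-Length: "
            + str(len(payload)).encode("ascii")
            + b"\r\n\r\n" + payload + b"\r\n\r\n")


records = [make_record(p) for p in (b"a\nb", b"\x00\r\n\x01", b"third")]
source = io.BytesIO(b"".join(records))
target = io.BytesIO()
count = copy_first_records(source, target, 2)
print(count, target.getvalue() == records[0] + records[1])
```

With real files you would pass `open(path, 'rb')` and `open(out_path, 'wb')` instead of the in-memory streams. Because the body is read by byte count, embedded newlines and binary data pass through unchanged.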
Ilja Everilä