1

I want to parse a warc.gz file downloaded from Common Crawl. I have a requirement where I have to parse the news warc.gz file manually. What is the delimiter between two records?

Ravi Ranjan
  • 353
  • 1
  • 6
  • 22

2 Answers

2

I don't think you can parse the gzipped file manually. Your best option is to use the index to find the offset and length of each record. See the API documentation and the guides for more info.
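To illustrate the index-based approach: the Common Crawl index returns, for each capture, the WARC filename plus the byte offset and compressed length of the record, and each record in a .warc.gz is stored as its own gzip member, so the slice at that offset decompresses to exactly one record. A minimal sketch (the helper name is mine; in practice you would send an HTTP Range request for those bytes instead of holding the whole file in memory):

```python
import gzip

def extract_record(warc_gz_bytes, offset, length):
    """Extract and decompress one record from .warc.gz content.

    `offset` and `length` are the values the Common Crawl index
    reports for the record. Because every record is its own gzip
    member, decompressing just that slice yields one full record.
    """
    return gzip.decompress(warc_gz_bytes[offset:offset + length])
```

With a remote file you would request `Range: bytes=offset-(offset+length-1)` and decompress the response body the same way.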

If you do want to parse the WARC files manually, unzip the .gz file first.

WARC records are separated by two newlines:

A WARC format file is the simple concatenation of one or more WARC records. A record consists of a record header followed by a record content block and two newlines. (Newlines are CRLF as per other Internet standards.)
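For the unzip step, a small Python sketch (the function name is illustrative) that stream-decompresses a .warc.gz to a plain .warc file without loading it all into memory:

```python
import gzip
import shutil

def gunzip_warc(src_path, dst_path):
    """Decompress a .warc.gz file to a plain .warc file, streaming
    the bytes through so large archives don't exhaust memory."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```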

Mark Lapierre
  • 1,067
  • 9
  • 15
  • even if I unzip the .gz file, I have no way of getting each records separately. Is there a way? – Ravi Ranjan Aug 30 '17 at 04:36
  • As I just noted, *each record is separated by two newlines*. If you need more help, then I need more info about what you're trying to do. And why do you have to do it manually? – Mark Lapierre Aug 30 '17 at 09:33
  • thanks. I have to create an RDD of that file. The default delimiter in Spark is not what WARC uses, hence I am getting way more records than the file has. – Ravi Ranjan Aug 30 '17 at 10:42
  • i tried file2 = sc.newAPIHadoopFile('hdfs://master:54310/CC-NEWS-20170803215756-00005.warc', 'org.apache.hadoop.mapreduce.lib.input.TextInputFormat', 'org.apache.hadoop.io.LongWritable', 'org.apache.hadoop.io.Text', conf={'textinputformat.record.delimiter': '\n\n'}); this still gives more records than expected. – Ravi Ranjan Aug 30 '17 at 10:46
  • Ah. Spark. Unfortunately I can't help with that. I suggest you ask another question with **all** the details of exactly what you're trying to do, what you've tried, and what problem you're having. Then someone familiar with Spark may be able to help. – Mark Lapierre Aug 30 '17 at 12:23
0

There is no unambiguous record separator in a WARC file. A record always ends with '\r\n\r\n', but this sequence also separates the record header from the record body and may occur anywhere inside the HTML documents. The length of a WARC record is instead defined by the Content-Length field in the record header.
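A Content-Length-driven reader over an uncompressed WARC can be sketched as follows (a simplified illustration rather than a full WARC parser; it assumes well-formed records):

```python
def read_warc_records(stream):
    """Yield (headers, body) pairs from an uncompressed WARC byte stream.

    Each record starts with a version line such as "WARC/1.0", followed
    by header lines, a blank line, then exactly Content-Length bytes of
    body, then a trailing CRLF CRLF. Reading the body by length is what
    avoids the ambiguity of splitting on '\r\n\r\n'.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the record header
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body
```

This is why a fixed text delimiter (as in Spark's TextInputFormat) over-counts records: the parser has to consume Content-Length bytes blindly instead of scanning for a separator.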

To process Common Crawl WARC files with PySpark, see cc-pyspark.

Sebastian Nagel
  • 2,049
  • 10
  • 10