I'm using the Heritrix 3.1 Java library. Just to be clear, I'm not interested in crawling but only in processing data from compressed WARC (*.warc.gz) files generated by another team. For each WWW document stored in the WARC file, I need some information from the record header, some from the HTTP headers, and the full content of the HTTP payload/body, so I think I need to use the HeaderedArchiveRecord class.

WARCReader warcReader = WARCReaderFactory.get(warcFile);
int inputSequence = -1;

ArchiveRecord record = warcReader.get();
while(record != null){
  inputSequence++;

  // Skip the 0th record, which is just the archive guff.
  if (inputSequence == 0) {
    // print some info but do not process this record
  }
  else if (! record.hasContentHeaders()) {
    // print some info but do not process this record
  }
  else  {
    HeaderedArchiveRecord hRecord = new HeaderedArchiveRecord(record);
    ArchiveRecordHeader archiveHeader = hRecord.getHeader();
    gate.Document document = makeDocumentHeritrix(archiveHeader,
       inputSequence,  hRecord);
    //...
  }
  record.close();
  record = warcReader.get();  // line 754
}

warcReader.close();

When I run this, I get an exception with the following cause:

Caused by: java.io.IOException: Failed to read WARC_MAGIC
    at org.archive.io.warc.WARCRecord.parseHeaders(WARCRecord.java:116)
    at org.archive.io.warc.WARCRecord.<init>(WARCRecord.java:90)
    at org.archive.io.warc.WARCReader.createArchiveRecord(WARCReader.java:94)
    at org.archive.io.warc.WARCReader.createArchiveRecord(WARCReader.java:44)
    at org.archive.io.ArchiveReader.get(ArchiveReader.java:159)
    at gate.arcomem.batch.Enrichment.makeCorpusWithHeritrix(Enrichment.java:754)

Line 754 of my code is the one marked above. My makeDocumentHeritrix(...) method used to throw a similar exception, but with the message Failed to find WARC_MAGIC, until I moved the line hrecord.skipHttpHeader(); to before Header[] httpHeader = record.getContentHeaders(); inside it.

I have searched the web for example code that loops through the records in a WARC file, but haven't found any. I recall that when I used Heritrix 1.14 several years ago to do something similar, I had to do some weird things to manipulate the offsets in the files. The related methods in WARCReader are now all private or protected, so I would not expect to have to do that with the newer library.

AdamF

1 Answer

I had success with the following code:

Iterator<ArchiveRecord> archIt = WARCReaderFactory.get(new File(args[0])).iterator();
while (archIt.hasNext()) {
     handleRecord(archIt.next());
}
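For background on the error message in the question: every WARC record begins with a version line such as WARC/1.0, so a "WARC_MAGIC" failure presumably means the reader was not positioned at a record boundary when it tried to parse the next record. The sketch below illustrates that check with the standard library only. It is not Heritrix code: the class WarcMagicSketch and its methods are hypothetical names made up for this example, the sample records are deliberately minimal (real records carry more mandatory fields), and the parser ignores folded header lines and treats field names as case-sensitive.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration class; not part of Heritrix. */
class WarcMagicSketch {

    /** Reads one line (CRLF or LF terminated) as UTF-8; returns null at end of stream. */
    static String readLine(InputStream in) throws IOException {
        int b = in.read();
        if (b == -1) {
            return null;
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        while (b != -1 && b != '\n') {
            buf.write(b);
            b = in.read();
        }
        String line = new String(buf.toByteArray(), StandardCharsets.UTF_8);
        return line.endsWith("\r") ? line.substring(0, line.length() - 1) : line;
    }

    /**
     * Reads the header fields of the next record from an uncompressed stream
     * and skips its content block. Returns null when the stream is exhausted.
     */
    static Map<String, String> nextRecord(InputStream in) throws IOException {
        String line = readLine(in);
        while (line != null && line.isEmpty()) {
            line = readLine(in); // skip the blank lines that separate records
        }
        if (line == null) {
            return null;
        }
        if (!line.startsWith("WARC/")) {
            // The kind of situation a "WARC_MAGIC" failure points to.
            throw new IOException("Not at a record boundary, got: " + line);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("version", line);
        while ((line = readLine(in)) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                fields.put(line.substring(0, colon).trim(),
                           line.substring(colon + 1).trim());
            }
        }
        // Consume exactly Content-Length bytes so the next call starts on a boundary.
        long length = Long.parseLong(fields.getOrDefault("Content-Length", "0"));
        for (long i = 0; i < length; i++) {
            if (in.read() == -1) {
                throw new EOFException("Truncated record");
            }
        }
        return fields;
    }

    public static void main(String[] args) throws IOException {
        String rec = "WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n";
        byte[] data = (rec + rec).getBytes(StandardCharsets.UTF_8);

        // Starting at offset 0, each call lands on the next record boundary.
        InputStream in = new ByteArrayInputStream(data);
        Map<String, String> fields;
        while ((fields = nextRecord(in)) != null) {
            System.out.println(fields.get("WARC-Type"));
        }

        // Starting mid-record (just past the version line) fails the magic check.
        try {
            nextRecord(new ByteArrayInputStream(data, 10, data.length - 10));
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is that a record can only be parsed when the previous one has been consumed to its end. The iterator above appears to take care of that positioning, which would explain why it works where repeated get() calls did not.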
Hannes Mühleisen