Questions tagged [warc]

Use this tag for questions related to Web ARChive files.

WARC (Web ARChive) is an extension of the ARC file format.

The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format.
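
To make "combining multiple digital resources into an aggregate archive file together with related information" concrete, here is a minimal, hand-rolled sketch using only the Python standard library. It is not the API of the `warc` or `warcio` libraries: the function names `build_warc_record` and `read_warc_records` and the `example.com` URIs are hypothetical, chosen for illustration. It assumes the basic record layout of a version line, named header fields, a blank line, the content block, and two trailing CRLF pairs.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_warc_record(target_uri: str, payload: bytes) -> bytes:
    """Serialize one 'resource' record (hypothetical helper, not a library API)."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(payload))),
    ]
    head = b"WARC/1.0" + CRLF
    for name, value in headers:
        head += f"{name}: {value}".encode("utf-8") + CRLF
    # A blank line ends the header; two CRLF pairs end the record.
    return head + CRLF + payload + CRLF + CRLF


def read_warc_records(data: bytes):
    """Yield (header_fields, payload) for each record in an uncompressed archive."""
    pos = 0
    while pos < len(data):
        head_end = data.index(CRLF + CRLF, pos)
        lines = data[pos:head_end].decode("utf-8").split("\r\n")
        # lines[0] is the version line; the rest are "Name: value" fields.
        fields = dict(line.split(": ", 1) for line in lines[1:])
        start = head_end + 4
        length = int(fields["Content-Length"])
        yield fields, data[start:start + length]
        pos = start + length + 4  # skip the two trailing CRLF pairs


# Records are simply concatenated to form the aggregate archive.
archive = (
    build_warc_record("http://example.com/a.txt", b"first resource")
    + build_warc_record("http://example.com/b.txt", b"second resource")
)
records = list(read_warc_records(archive))
```

Real archives are usually gzip-compressed record by record and contain other record types (for example `request` and `response`), so a maintained library is the safer choice for anything beyond illustration.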

WIKI: https://en.wikipedia.org/wiki/Web_ARChive

Python: https://warc.readthedocs.io/en/latest/

58 questions
1
vote
0 answers

Interaction of spark configurations with input structure

Spark has many configurable options. Here, I would like to know what the optimal configuration is under certain constraints. I have seen many such posts and do not think an approach that neglects the structure of the data can yield a…
1
vote
0 answers

How to reconstruct an org.archive.io.warc.WARCRecordInfo from an org.archive.io.ArchiveRecord?

Using Java, I need to read a WARC archive file, filter it depending on the content of the HTML page, and write a new archive file. The following code reads the archive. How to reconstruct an org.archive.io.warc.WARCRecordInfo from an…
David Portabella
  • 12,390
  • 27
  • 101
  • 182
1
vote
1 answer

Creating a warc record with requests.get() response using warcio

I'm using the warcio library to read and write warc files. When trying to write a record of a response object from requests.get(URL,stream=False), warcio is writing only HTTP headers to the record but not the payload. However, when stream mode is…
kartheek7895
  • 341
  • 1
  • 12
1
vote
0 answers

requests.get() not crawling entire common crawl records for a given warc path

I have implemented the approach described at https://dmorgan.info/posts/common-crawl-python/. However, I want to crawl the entire data rather than the partial data described in that post. So, in this code chunk, def get_partial_warc_file(url,…
Ravi Ranjan
  • 353
  • 1
  • 6
  • 22
1
vote
1 answer

How to compress warc records with lzma (*.warc.xz) in python3?

I have a list of warc records. Every item in the list is created like this: header = warc.WARCHeader({ "WARC-Type": "response", "WARC-Target-URI": "www.somelink.com", }, defaults=True) data = "Some string" record =…
Tehryn
  • 57
  • 1
  • 10
1
vote
0 answers

Changing delimiter for reading the file in pyspark

I am trying to read a .warc.gz file to an RDD with PySpark. I would like the delimiter to be three newline characters so I can read each record as an element of the RDD in order to parse them and use the information. Primarily, I am interested in…
1
vote
2 answers

Read warc file with python

I want to read a warc file and I wrote the following code based on this page, but nothing was printed! >>import warc >>f = warc.open("01.warc.gz") >>for record in f: print record['WARC-Target-URI'], record['Content-Length'] However, when I…
user3487667
  • 519
  • 5
  • 22
1
vote
1 answer

Scrapy Spider which reads from Warc file

I am looking for a Scrapy Spider that, instead of getting URLs and crawling them, takes a WARC file as input (preferably from S3) and sends the content to the parse method. I actually need to skip the entire download phase, which means that from…
Udy
  • 2,492
  • 4
  • 23
  • 33
1
vote
1 answer

How to loop through WARC files using HeaderedArchiveRecord with Heritrix 3.1

I'm using the Heritrix 3.1 Java library. Just to be clear, I'm not interested in crawling but only in processing data from compressed WARC (*.warc.gz) files generated by another team. For each WWW document stored in the WARC file, I need some…
AdamF
  • 519
  • 4
  • 11
0
votes
0 answers

Generate a WARC from local site files

I have downloaded a Twitter archive export. It's a few folders of JavaScript, HTML, etc. and a top-level "Your archive.html". This is all viewable via a local browser, but I would like to generate a WARC from these local site files. I have looked at…
wxs
  • 288
  • 3
  • 18
0
votes
1 answer

wget --warc-file gets only main page and robot pages?

I am trying to do a little project on a small-ish WARC file. I used this command: [ ! -f course.warc.gz ] && wget -r -l 3 "https://www.ru.nl/datascience/" --delete-after --no-directories --warc-file="course" || echo Most likely, course.warc.gz…
Spiridon
  • 23
  • 5
0
votes
2 answers

Converting warc.gz to .warc

My attempt to extract a warc.gz file, using gzip, resulted in a WARC, but it won't load in http://replayweb.page. Extracting it using The Unarchiver gave me all the expanded html and other files. What is the latest recommended method for converting…
Jack P
  • 1
  • 1
0
votes
1 answer

Number of records in WARC file

I am currently parsing WARC files from the CommonCrawl corpus and I would like to know upfront, without iterating through all WARC records, how many records there are. Does the WARC 1.1 standard define such information?
dzieciou
  • 4,049
  • 8
  • 41
  • 85
0
votes
0 answers

Open Clueweb warc file with python 3

I would like to open the ClueWeb09 warc file in Python 3. I was able to open it in Python 2 using this library, but I need to open it in Python 3 since I need other libraries that are only available there. I have tried to adapt this…
0
votes
0 answers

How to parse warc data for robots.txt info

I have the following code that I am writing to get values from a warc file. My goal is to find sites that have: User-Agent: * Disallow: / I would like it to only print URLs that have the above robots.txt rules ^ My Python code that currently only…
Trey Copeland
  • 3,387
  • 7
  • 29
  • 46