Questions tagged [warc]

Use this tag for questions related to Web ARChive files.

WARC (Web ARChive) is an extension of the ARC file format.

The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format.

WIKI: https://en.wikipedia.org/wiki/Web_ARChive

Python: https://warc.readthedocs.io/en/latest/
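
To make the description above concrete, here is a minimal sketch of how several resources can sit in one aggregate file, each with its own header block. It uses only the Python standard library and is deliberately simplified: the helper names `build_record` and `iter_records` are hypothetical, the records omit header fields that real WARC files carry (such as WARC-Record-ID and WARC-Date), and gzip-compressed `.warc.gz` files are not handled. For real archives, a dedicated WARC library is the better choice.

```python
import io


def build_record(uri: str, payload: bytes) -> bytes:
    """Serialize one simplified WARC-style record (hypothetical helper).

    Layout: version line, named header fields, blank line, payload,
    then two CRLF pairs that close the record.
    """
    header_lines = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {uri}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


def iter_records(stream):
    """Yield (headers, payload) for each record in an uncompressed stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of the archive
        if not line.strip():
            continue  # blank lines between records
        headers = {"version": line.strip().decode("utf-8")}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is read by length, so it may itself contain CRLFs.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


if __name__ == "__main__":
    archive = build_record("http://example.com/a", b"hello") + build_record(
        "http://example.com/b", b"<html>hi</html>"
    )
    for headers, payload in iter_records(io.BytesIO(archive)):
        print(headers["WARC-Target-URI"], len(payload))
```

Reading the payload by its declared Content-Length rather than line by line is why line-oriented readers tend to give surprising results on WARC input.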

58 questions

votes

1 answer

Mapreduce carriage return

I want to process CommonCrawl WARC files in MapReduce using the input format s3a. The problem is that the carriage return character at the end of each input line is removed and replaced with a tab (the default delimiter). Why does this…

python mapreduce warc

asked Jan 18 '19 at 20:53

Afe

votes

1 answer

Variable not set in function nodejs

I want to assign JSON data to a variable by parsing a WARC file in a function. The variable is inaccessible outside the function and prints as an empty array on the console. var metadataObj = { metadata: [] }; fs …

node.js file variables warc

asked Jan 09 '19 at 07:16

Elliot

votes

1 answer

spark parallelise on iterator with a function

I have an iterator which operates on a sequence of WARC documents and yields modified lists of tokens for each document: class MyCorpus(object): def __init__(self, warc_file_instance): self.warc_file = warc_file_instance def clean_text(self,…

apache-spark pyspark warc

asked Aug 25 '18 at 13:09

Akshansh Gupta

votes

2 answers

Reading WARC Files Efficiently

I am reading a WARC file with Python's 'warc' library. The file I am currently using is around 4.50 GB. The thing is: file = warc.open("random.warc") html_lists = [line for line in file] Executing these 2 lines takes up to 40 seconds. Since there…

python byte common-crawl warc

asked Aug 10 '18 at 12:19

MeteHan

votes

1 answer

Converting a warc.gz file downloaded from Common Crawl to an RDD

I have downloaded a warc.gz file from Common Crawl and I have to process it using Spark. How can I convert the file into an RDD? sc.textFile("filepath") does not seem to help. When rdd.take(1) is printed, it gives me [u'WARC/1.0'] whereas it should…

apache-spark pyspark rdd common-crawl warc

asked Aug 23 '17 at 12:33

Ravi Ranjan

votes

1 answer

cannot find url from a warc file crawled from common crawl

I have crawled data from Common Crawl and I want to find the URL corresponding to each of the records. for record in files: print record['WARC-Target-URI'] This outputs an empty list. I am referring to the following…

python record common-crawl warc

asked Jul 17 '17 at 11:56

Ravi Ranjan

votes

1 answer

How to dump Nutch 2.3 data into WARC file?

I need to dump data from Nutch 2.3 into a WARC file. However, I couldn't find the necessary module. Nutch 1.x had this capability. I would like to know the proper way to do it.

nutch warc

asked Jan 26 '17 at 10:16

Abdullah Khan

votes

1 answer

Confused about Kibana import

I would like to know how I can import data using Kibana. I am confused about it. First, I have tried to load a JSON file using Kibana, but it is not importing it. Second, if I want to work with a WARC file, then do I need to convert it into…

json elasticsearch kibana-4 bitnami warc

asked Nov 19 '16 at 09:02

Jaffer Wilson


votes

2 answers

How do I archive and retrieve a large HTML dataset?

I am a fresher and I am about to participate in a contest this weekend. The problem is about archiving and retrieving a large HTML dataset and I have no idea how to approach it. My friend suggested that I use a web archive and Common Crawl. Please suggest…

war common-crawl warc bigdata

asked Aug 18 '16 at 13:06

Sriram S

votes

1 answer

How to read a subset of records from a warc file

I'm trying to parse .warc files from Common Crawl in Python. Since the files are huge, I want to start with a sample/subset of the first few records. How do I truncate the file to only include the first X lines while preserving the…

python webarchive warc

asked May 20 '15 at 07:37

okoboko


-1

votes

1 answer

Unknown archive format! How can I extract URLs from the WARC file by Jupyter?

I'm trying to extract website URLs from a .WARC (Web ARChive) file from a Common Crawl dataset (commoncrawl.org). After decompressing the file, I wrote the following code to read it: import pandas as pd from warcio.archiveiterator…

url jupyter-notebook python-3.10 common-crawl warc

asked Jun 04 '23 at 15:49

Jawaher

-1

votes

1 answer

Optimize WARC generation in order to save space and time

I am trying to create a WARC file from a very large list of links across several domains, like this: wget --no-check-certificate \ --no-verbose \ --execute robots=off \ --delete-after \ --no-directories \ --page-requisites \ …

wget warc

asked Mar 06 '22 at 17:40

santos82h

-1

votes

1 answer

Half of read buffer is corrupt when using ReadFile

Half of the buffer used with ReadFile is corrupt. Regardless of the size of the buffer, half of it is filled with the same corrupted character. I have looked for anything that could be causing the read to stop early, etc. If I increase the size of the buffer, I…

c++ winapi readfile warc

asked Dec 03 '20 at 16:51

kbaud
