Questions tagged [warc]

Use this tag for questions related to Web ARChive files.

WARC (Web ARChive) is an extension of the ARC file format.

The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format.
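As a rough illustration of how resources are aggregated, the sketch below builds two records in memory and reads them back using only the Python standard library. The helper names `build_record` and `read_records` are hypothetical, and the sketch assumes uncompressed WARC/1.0 records: a version line, named header fields, a blank line, a content block of `Content-Length` bytes, and two CRLFs closing the record. It is not a full implementation of the format.

```python
import io
import uuid
from datetime import datetime, timezone


def build_record(rec_type, target_uri, payload):
    """Serialize one WARC/1.0 record (hypothetical helper, minimal header set)."""
    headers = [
        "WARC/1.0",
        f"WARC-Type: {rec_type}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF, a blank line separates headers from the
    # content block, and two CRLFs terminate the record.
    return "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n" + payload + b"\r\n\r\n"


def read_records(stream):
    """Yield (version, headers, payload) for each record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of archive
        if line in (b"\r\n", b"\n"):
            continue  # separator lines left over from the previous record
        version = line.strip().decode("utf-8")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The content block is delimited by Content-Length, not by line breaks.
        payload = stream.read(int(headers["Content-Length"]))
        yield version, headers, payload
```

Concatenating the output of several `build_record` calls gives one aggregate archive, and `read_records(io.BytesIO(archive))` walks it record by record.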

WIKI: https://en.wikipedia.org/wiki/Web_ARChive

Python: https://warc.readthedocs.io/en/latest/

58 questions
0
votes
1 answer

Mapreduce carriage return

I want to process CommonCrawl WARC files in MapReduce using the input format s3a. The problem is that the carriage return character at the end of the input lines is removed and a tab is inserted instead (as it is the default delimiter). Why does this…
Afe
  • 167
  • 1
  • 10
0
votes
1 answer

Variable not set in function nodejs

I want to assign JSON data to a variable by parsing a WARC file in a function. The variable is inaccessible outside the function and returns an empty array on the console. var metadataObj = { metadata: [] }; fs …
Elliot
  • 1
  • 2
0
votes
1 answer

spark parallelise on iterator with a function

I have an iterator which operates on a sequence of WARC documents and yields modified lists of tokens for each document: class MyCorpus(object): def __init__(self, warc_file_instance): self.warc_file = warc_file_instance def clean_text(self,…
0
votes
2 answers

Reading WARC Files Efficiently

I am reading a WARC file with Python's 'warc' library. The file I am currently using is around 4.50 GB. The thing is: file = warc.open("random.warc") html_lists = [line for line in file] Executing these 2 lines takes up to 40 seconds. Since there…
MeteHan
  • 289
  • 2
  • 16
0
votes
1 answer

Converting a warc.gz file downloaded from Common Crawl to an RDD

I have downloaded a warc.gz file from Common Crawl and I have to process it using Spark. How can I convert the file into an RDD? sc.textFile("filepath") does not seem to help. When rdd.take(1) is printed, it gives me [u'WARC/1.0'] whereas it should…
Ravi Ranjan
  • 353
  • 1
  • 6
  • 22
0
votes
1 answer

cannot find url from a warc file crawled from common crawl

I have crawled data from Common Crawl and I want to find the URL corresponding to each of the records. for record in files: print record['WARC-Target-URI'] This outputs an empty list. I am referring to the following…
Ravi Ranjan
  • 353
  • 1
  • 6
  • 22
0
votes
1 answer

How to dump Nutch 2.3 data into WARC file?

I need to dump data from Nutch 2.3 into a WARC file. However, I couldn't find the necessary module. Nutch 1.x had this capability. I would like to know the proper way to do it.
Abdullah Khan
  • 551
  • 1
  • 4
  • 17
0
votes
1 answer

Confused about Kibana import

I would like to know how I can import data using Kibana. Actually, it's confusing for me. I have tried to load a JSON file using Kibana, but it is not importing it. Second, if I want to work with a WARC file, then do I need to convert it into…
Jaffer Wilson
  • 7,029
  • 10
  • 62
  • 139
0
votes
2 answers

How do I archive and retrieve a large HTML dataset?

I am a fresher and I am about to participate in a contest this weekend. The problem is about archiving and retrieving a large HTML dataset and I have no idea about it. My friend suggested to me to use a web archive and common crawl. Please suggest…
Sriram S
  • 1
  • 1
0
votes
1 answer

How to read a subset of records from a warc file

I'm trying to parse .warc files from Common Crawl in Python. Since the files are huge, I want to start with a sample/subset of the first few records. How do I truncate the file to only include the first X lines while preserving the…
okoboko
  • 4,332
  • 8
  • 40
  • 67
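For the subset question above, one point worth illustrating is that a WARC file is delimited by records rather than lines, so cutting at a line count can split a content block. The sketch below copies the first N whole records from one binary stream to another using only the standard library. The function name `copy_first_records` is hypothetical, and the sketch assumes an uncompressed stream; for a `.warc.gz` file, opening it with `gzip.open(path, "rb")` first should yield such a stream.

```python
import io


def copy_first_records(src, dst, limit):
    """Copy at most `limit` complete WARC records from src to dst.

    Both arguments are binary file-like objects. Returns the number of
    records copied. Hypothetical helper, not part of any WARC library.
    """
    copied = 0
    while copied < limit:
        line = src.readline()
        if not line:
            break  # source exhausted before reaching the limit
        if line in (b"\r\n", b"\n"):
            continue  # separator lines that follow the previous record
        header_lines = [line]  # the "WARC/1.0" version line
        length = 0
        while True:
            line = src.readline()
            header_lines.append(line)
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        dst.write(b"".join(header_lines))
        # Copy exactly Content-Length bytes so the block is never cut short.
        dst.write(src.read(length))
        # Each record is closed by two CRLFs.
        dst.write(b"\r\n\r\n")
        copied += 1
    return copied
```

A call such as `copy_first_records(open("in.warc", "rb"), open("sample.warc", "wb"), 100)` would write a smaller file that still parses record by record; the file names here are placeholders.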
-1
votes
1 answer

Unknown archive format! How can I extract URLs from the WARC file by Jupyter?

I'm trying to extract website URLs from a .WARC (Web ARChive) file from a Common Crawl dataset (commoncrawl.org). After decompressing the file and writing the code to read this file, I attached the code: import pandas as pd from warcio.archiveiterator…
Jawaher
  • 3
  • 2
-1
votes
1 answer

Optimize WARC generation in order to save space and time

I am trying to create a WARC file from a very large list of links across several domains, like this: wget --no-check-certificate \ --no-verbose \ --execute robots=off \ --delete-after \ --no-directories \ --page-requisites \ …
santos82h
  • 452
  • 5
  • 15
-1
votes
1 answer

Half of read buffer is corrupt when using ReadFile

Half of the buffer used with ReadFile is corrupt. Regardless of the size of the buffer, half of it has the same corrupted character. I have looked for anything that could be causing the read to stop early, etc. If I increase the size of the buffer, I…
kbaud
  • 25
  • 7