Questions tagged [warc]

Use this tag for questions related to Web ARChive files.

WARC (Web ARChive) is an extension of the ARC file format.

The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format.
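To make the "aggregate archive file" idea concrete, here is a minimal sketch of how records sit back to back in an uncompressed WARC stream: a version line, named header fields, an empty line, then exactly Content-Length bytes of content. The helper names `make_record` and `iter_warc_records` are hypothetical and not part of any library, and the sample records leave out mandatory fields such as WARC-Record-ID and WARC-Date for brevity. For real archives a maintained WARC library is likely the safer choice.

```python
import io


def make_record(record_type, uri, body):
    """Build one simplified WARC record (hypothetical helper, illustration only)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Each record ends with two CRLF pairs after its content block.
    return head + body + b"\r\n\r\n"


def iter_warc_records(stream):
    """Yield (headers, payload) pairs from an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines that separate records
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"expected a WARC version line, got {version!r}")
        headers = {"version": version}
        # Named fields run until the first empty line.
        for raw in iter(stream.readline, b""):
            if raw in (b"\r\n", b"\n"):
                break
            name, _, value = raw.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The content block is read by length, so it may itself contain blank lines.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


if __name__ == "__main__":
    data = make_record("resource", "http://example.com/a", b"hello")
    data += make_record("resource", "http://example.com/b", b"world!")
    for headers, payload in iter_warc_records(io.BytesIO(data)):
        print(headers["WARC-Target-URI"], len(payload))
```

Reading the content block by Content-Length rather than scanning for a delimiter is what lets several resources share one file without ambiguity.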

WIKI: https://en.wikipedia.org/wiki/Web_ARChive

Python: https://warc.readthedocs.io/en/latest/

58 questions

1 answer

Retrieving records from WARC file based on url

I have to retrieve records from a *.warc.gz file based on Target-URI. The documentation says that this requires external CDXJ index files to be created. I've tried opening the file with gzip.open() and calling seek(offset), but the seek operation is…

python python-3.x warc

asked Mar 20 '18 at 06:46

kartheek7895

1 answer

Common Crawl Keyword Lookup

I want to find a list of all the websites that contain specific keywords. For example, if I search for the keyword "Sports" or "Football", only the related website URLs, title, description and image need to be extracted from Common Crawl WARC…

python-2.7 python-3.x elasticsearch common-crawl warc

asked Oct 02 '17 at 08:10

Dinesh Manne

0 answers

Fetch Common Crawl data using Apache Nutch

I found my data on the Common Crawl website and downloaded it from there, and now I have to fetch that data using Apache Nutch but don't know how. The file is in WARC file format.

nutch warc common-crawl

asked Jan 17 '17 at 07:44

Sahil Rohila

2 answers

Dump data from a Nutch crawl into multiple warc files

I have crawled a list of websites using Nutch 1.12. I can dump the crawl data into separate HTML files by using: ./bin/nutch dump -segment crawl/segments/ -o outputDir nameOfDir And into a single WARC file by using: ./bin/nutch warc crawl/warcs…

web-crawler nutch warc

asked Oct 24 '16 at 14:41

Chronus

0 answers

How to find the number of records in the warc.gz file in Java

I am extracting the required content of the HTML files that are stored in the warc.gz file, but I am not sure how many HTML files are in the .gz archive.

java warc

asked Oct 06 '16 at 18:56

Bhavana

1 answer

Python cannot read "warc.gz" file completely

For my work, I scrape websites and write them to gzipped web archives (with extension "warc.gz"). I use Python 2.7.11 and the warc 0.2.1 library. I noticed that for the majority of files I cannot read them completely with the warc library. For example…

python gzip warc

asked Mar 23 '16 at 09:05

Ekaterina Ermilova

2 answers

Can we index the WARC files directly into Solr?

Can we index the WARC files directly into Solr without first extracting and storing intermediate files (e.g. in HTML format) from the WARC files on the hard disk? In other words, can we index those files without storing anything on disk?

solr indexing warc

asked Aug 31 '14 at 19:06

Tarek

0 answers

Is there a way to create a warc from a dynamic website with chromote and another R library?

I would like to save a page rendered with headless chrome using chromote to a warc file. Rendering the page works fine, but I am a bit stuck at saving it as a warc file. First I wanted to use jwatr but due to some policies on our laptops this is not…

r headless warc chromote

asked Jun 14 '23 at 08:36

Lod

1 answer

How to decompress a warc.zst file?

I am trying to decompress a WARC ZST file that I downloaded from here: https://archive.org/details/archiveteam_yahooanswers_20210422220546_c4fac540 I tried the command zstd -d yahooanswers_20210422220546_c4fac540.1619026173.megawarc.warc.zst but I…

archive webarchive warc zstd

asked Jul 12 '21 at 15:25

Arundhati

1 answer

Error "No module named 'builtin'" when importing warc

How do I use the warc package in Python 3? I installed warc with no problem, but when I call import warc I get the error: Exception has occurred: ModuleNotFoundError No module named 'builtin'

python python-3.x windows warc

asked Mar 25 '21 at 06:29

Andrey

0 answers

How should I parse a 5 GB WARC file using C++?

The WARC files are from the Common Crawl. A sample: WARC-Type: response WARC-Date: 2018-12-09T20:26:32Z WARC-Record-ID:

c++ xml winapi warc

asked Nov 25 '20 at 22:33

kbaud

1 answer

Python: How to split a WARC file?

My goal is to split and sort a WARC file from Common Crawl into its individual records. Example file: WARC/1.0 WARC-Type: warcinfo WARC-Date: 2020-08-04T01:43:40Z WARC-Record-ID: Content-Length:…

python split warc

asked Oct 22 '20 at 04:24

user14233932

2 answers

Python: Reading a file and adding keys and values to dictionaries from different lines

I'm very new to Python and I'm having trouble working on an assignment which basically is like this: #Read line by line a WARC file to identify string1. #When string1 found, add part of the string as a key to a dictionary. #Then continue reading…

python dictionary warc

asked Sep 30 '20 at 12:44

geo47

1 answer

Why does my Apache Nutch warc and commoncrawldump fail after crawl?

I have successfully crawled a website using Nutch and now I want to create a WARC from the results. However, both the warc and commoncrawldump commands fail. Also, running bin/nutch dump -segement .... works successfully on the same segment…

java nutch common-crawl warc

asked Sep 15 '20 at 09:43

cc100

0 answers

OpenWayback search does not work with Arabic website in URL

I have installed OpenWayback and completed the basic setup, and am now trying to make it work with the following resource https://moj.gov.ae/documents/21128/102233/قرار+مجلس+الوزراء+رقم+18+لسنة+2017+بشأن+اعتماد+قائمة+الاشخاص+والتنظيمات+الارهابية.pdf Setup: I…

arabic webarchive warc

asked Nov 06 '18 at 10:24

Loredra L
