Questions tagged [warc]

Use this tag for questions related to Web ARChive files.

WARC (Web ARChive) is an extension of the ARC file format.

The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format.

WIKI: https://en.wikipedia.org/wiki/Web_ARChive

Python: https://warc.readthedocs.io/en/latest/
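To make the record structure described above concrete, here is a minimal sketch of parsing one WARC record header block using only the Python standard library. The function name `parse_warc_header` and the sample record are illustrative assumptions, not part of any WARC library.

```python
def parse_warc_header(block: bytes):
    """Parse one WARC record header block into (version, fields)."""
    lines = block.decode("utf-8").split("\r\n")
    version = lines[0]  # first line is the version, e.g. "WARC/1.0"
    fields = {}
    for line in lines[1:]:
        if not line:
            break  # a blank line ends the header block
        # Split on the first colon only, so values may contain colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields


# Illustrative sample record header (not taken from a real archive).
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: warcinfo\r\n"
    b"WARC-Date: 2020-08-04T01:43:40Z\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
)
version, fields = parse_warc_header(sample)
```

Each record carries named header fields such as `WARC-Type` and `Content-Length`, followed by a blank line and then the record's content block.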

58 questions
2
votes
1 answer

Retrieving records from WARC file based on url

I have to retrieve records from a *.warc.gz file based on Target-URI. The documentation says that this requires external CDXJ index files to be created. I've tried opening the file with gzip.open() and doing a seek(offset), but the seek operation is…
kartheek7895
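A hedged sketch of the offset-based lookup this question is after, assuming each record was written as its own gzip member (a common convention for *.warc.gz files) and that the index offsets refer to positions in the compressed file. In that case the seek happens on the raw file rather than through gzip.open(). The `index` dictionary and `read_member` helper are hypothetical stand-ins for a real CDXJ lookup.

```python
import gzip
import io


def read_member(fileobj, offset, length):
    """Return the decompressed bytes of the gzip member at [offset, offset + length)."""
    fileobj.seek(offset)  # seek in the compressed file itself
    return gzip.decompress(fileobj.read(length))


# Build a tiny two-member archive in memory to stand in for a *.warc.gz file.
rec_a = gzip.compress(b"WARC/1.0\r\nWARC-Target-URI: http://example.com/a\r\n\r\n")
rec_b = gzip.compress(b"WARC/1.0\r\nWARC-Target-URI: http://example.com/b\r\n\r\n")
archive = io.BytesIO(rec_a + rec_b)

# Hypothetical index: Target-URI -> (compressed offset, compressed length).
index = {
    "http://example.com/a": (0, len(rec_a)),
    "http://example.com/b": (len(rec_a), len(rec_b)),
}

offset, length = index["http://example.com/b"]
record = read_member(archive, offset, length)
```

With a real file, `archive` would be `open(path, "rb")`, and the offset and length would come from the index entry for the requested URL.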
2
votes
1 answer

Common Crawl Keyword Lookup

I want to find a list of all the websites that contain specific keywords. For example, if I search for a keyword such as "Sports" or "Football", only the related website URLs, title, description, and image need to be extracted from Common Crawl WARC…
2
votes
0 answers

Fetch Common Crawl data using Apache Nutch

I found my data on the Common Crawl website and downloaded it from there. Now I have to fetch that data using Apache Nutch, but I don't know how. The file is in WARC format.
Sahil Rohila
2
votes
2 answers

Dump data from a Nutch crawl into multiple warc files

I have crawled a list of websites using Nutch 1.12. I can dump the crawl data into separate HTML files by using: ./bin/nutch dump -segment crawl/segments/ -o outputDir nameOfDir And into a single WARC file by using: ./bin/nutch warc crawl/warcs…
Chronus
2
votes
0 answers

How to find the number of records in the warc.gz file in Java

I am extracting the required content of the HTML files that are stored in the warc.gz file, but I am not sure how many HTML files are in the .gz archive record.
Bhavana
2
votes
1 answer

Python cannot read "warc.gz" file completely

For my work, I scrape websites and write them to gzipped web archives (with extension "warc.gz"). I use Python 2.7.11 and the warc 0.2.1 library. I noticed that for the majority of files I cannot read them completely with the warc library. For example…
2
votes
2 answers

Can we index the WARC files directly into Solr?

Can we index the WARC files directly into Solr without first extracting and storing intermediate files (e.g. in HTML format) from the WARC files on the hard disk? In other words, can we index those files without storing anything on the hard disk?
Tarek
1
vote
0 answers

Is there a way to create a warc from a dynamic website with chromote and another R library?

I would like to save a page rendered with headless chrome using chromote to a warc file. Rendering the page works fine, but I am a bit stuck at saving it as a warc file. First I wanted to use jwatr but due to some policies on our laptops this is not…
Lod
1
vote
1 answer

How to decompress a warc.zst file?

I am trying to decompress a WARC ZST file that I downloaded from here: https://archive.org/details/archiveteam_yahooanswers_20210422220546_c4fac540 I tried the command zstd -d yahooanswers_20210422220546_c4fac540.1619026173.megawarc.warc.zst but I…
Arundhati
1
vote
1 answer

Error "No module named '__builtin__'" when importing warc

How do I use the warc package in Python 3? I installed warc with no problem, but when I call import warc I get the error: Exception has occurred: ModuleNotFoundError No module named '__builtin__'
Andrey
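For context on the error in this question: the Python 2 module `__builtin__` was renamed to `builtins` in Python 3, so code written for Python 2 that imports the old name fails this way. A minimal sketch of a compatibility import follows. It illustrates the rename only and is not a patch to the warc package; the name `open_func` is just for demonstration.

```python
try:
    import __builtin__ as builtins  # Python 2 module name
except ImportError:
    import builtins  # Python 3 module name

# Either way, the module exposes the same built-in functions.
open_func = builtins.open
```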
1
vote
0 answers

How should I parse a 5 GB WARC file using C++?

The WARC files are from the Common Crawl. A sample: WARC-Type: response WARC-Date: 2018-12-09T20:26:32Z WARC-Record-ID:
kbaud
1
vote
1 answer

Python: How to split WARC file?

My goal is to split and sort a WARC file from Common Crawl into its individual records. Example file: WARC/1.0 WARC-Type: warcinfo WARC-Date: 2020-08-04T01:43:40Z WARC-Record-ID: Content-Length:…
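One possible way to split an uncompressed WARC stream into records is sketched below. It assumes well-formed records: a `WARC/` version line, header fields, a blank line, then exactly `Content-Length` bytes of content. The helper name `iter_warc_records` and the sample data are made up for illustration.

```python
import io


def iter_warc_records(stream):
    """Yield (header_fields, content_bytes) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # skip the blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        fields = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            fields[name.strip()] = value.strip()
        # Content-Length says how many bytes belong to this record's content block.
        length = int(fields["Content-Length"])
        yield fields, stream.read(length)


# Illustrative two-record stream (not real Common Crawl data).
data = (
    b"WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: response\r\nContent-Length: 3\r\n\r\nabc\r\n\r\n"
)
records = list(iter_warc_records(io.BytesIO(data)))
```

Once split, the records can be sorted on any header field, for example `sorted(records, key=lambda r: r[0].get("WARC-Type", ""))`.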
1
vote
2 answers

Python: Reading a file and adding keys and values to dictionaries from different lines

I'm very new to Python and I'm having trouble working on an assignment which basically is like this: #Read line by line a WARC file to identify string1. #When string1 found, add part of the string as a key to a dictionary. #Then continue reading…
geo47
1
vote
1 answer

Why does my Apache Nutch warc and commoncrawldump fail after crawl?

I have successfully crawled a website using Nutch and now I want to create a WARC from the results. However, running either the warc or the commoncrawldump command fails. Also, running bin/nutch dump -segment .... works successfully on the same segment…
cc100
1
vote
0 answers

OpenWayback search does not work with Arabic text in the website URL

I have installed and set up the basics of OpenWayback and am now trying to make it work with the following resource: https://moj.gov.ae/documents/21128/102233/قرار+مجلس+الوزراء+رقم+18+لسنة+2017+بشأن+اعتماد+قائمة+الاشخاص+والتنظيمات+الارهابية.pdf Setup: I…
Loredra L