Unknown archive format! How can I extract URLs from a WARC file in Jupyter?

Question

I'm trying to extract website URLs from a .WARC (Web ARChive) file from the Common Crawl dataset (commoncrawl.org). I decompressed the file and wrote the following code to read it:

import pandas as pd
from warcio.archiveiterator import ArchiveIterator

# Function to parse WARC file and extract URLs
def extract_urls_from_warc(file_path):
    urls = []
    with open(file_path, 'rb') as file:
        for record in ArchiveIterator(file):
            if record.rec_type == 'response':
                # The target URL is stored in the WARC record header,
                # not in the HTTP response headers of the payload
                url = record.rec_headers.get_header('WARC-Target-URI')
                urls.append(url)
    
    # Create DataFrame with extracted URLs
    df = pd.DataFrame(urls, columns=['URL'])
    return df

# Provide the path to WARC file
warc_file_path = r"./commoncrawl.warc/commoncrawl.warc"

# Call the function to extract URLs from the WARC file and create a DataFrame
df = extract_urls_from_warc(warc_file_path)

# Display the DataFrame with URLs
print(df)

After running this code, I received this error message:

ArchiveLoadFailed: Unknown archive format, first line: ['crawl-data/CC-MAIN-2023-14/segments/1679296943471.24/warc/CC-MAIN-20230320083513-20230320113513-00000.warc.gz']

I am using Python 3.10.9 in Jupyter.

I want to read a .WARC file and extract the page URLs from it using Jupyter.

Sebastian Nagel · Answer 1 · 2023-06-05T11:31:54.050

1

The error message indicates that the input file is not a WARC file but a listing of WARC file locations. A single Common Crawl main dataset consists of several tens of thousands of WARC files, and the listing references all of them. To process the WARC files:

Select one or a few of the WARC files in the listing (processing all of them is not feasible on a laptop, a desktop computer or in a Jupyter notebook).
Prepend https://data.commoncrawl.org/ to each WARC file path to get the download URL(s). For further details, please see https://commoncrawl.org/access-the-data/
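
The two steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the helper names `is_path_listing`, `build_download_urls` and `extract_urls` are hypothetical, the listing check is a simple heuristic based on the path shown in the error message, and `extract_urls` assumes that one WARC file has already been downloaded from its URL and that `warcio` is installed.

```python
# Prefix given in the answer; everything else below is an illustrative assumption
DOWNLOAD_PREFIX = "https://data.commoncrawl.org/"


def is_path_listing(first_line):
    """Heuristic: True if the line looks like a WARC path from the listing.

    Real (uncompressed) WARC data starts with a version line such as "WARC/1.0".
    """
    line = first_line.strip()
    return line.startswith("crawl-data/") and line.endswith(".warc.gz")


def build_download_urls(listing_lines, limit=1):
    """Turn the first `limit` non-empty paths of the listing into download URLs."""
    paths = [line.strip() for line in listing_lines if line.strip()]
    return [DOWNLOAD_PREFIX + path for path in paths[:limit]]


def extract_urls(stream, max_records=None):
    """Collect WARC-Target-URI values of response records from a binary stream.

    `stream` is any binary file-like object holding a .warc or .warc.gz file.
    """
    # Imported here so the two helpers above also work without warcio installed
    from warcio.archiveiterator import ArchiveIterator

    urls = []
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            urls.append(record.rec_headers.get_header("WARC-Target-URI"))
            if max_records is not None and len(urls) >= max_records:
                break
    return urls


# Hypothetical usage, assuming the listing was saved as "warc.paths" and one
# WARC file was downloaded to "sample.warc.gz" (both file names are made up):
#
#   with open("warc.paths") as listing:
#       print(build_download_urls(listing, limit=1))
#   with open("sample.warc.gz", "rb") as warc_file:
#       print(extract_urls(warc_file, max_records=10))
```

A single WARC file is large, so the optional `max_records` limit is there to let a notebook stop early instead of reading the whole file.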

edited Jun 05 '23 at 11:31

answered Jun 04 '23 at 20:16


1

Note: if you only want a list of the URLs of all pages crawled by Common Crawl, the [URL index](https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/) is the most efficient option. – Sebastian Nagel Jun 04 '23 at 20:17
