I'm trying to extract website URLs from a .WARC (Web ARChive) file from the Common Crawl dataset (commoncrawl.org). I decompressed the file and wrote the following code to read it:
import pandas as pd
from warcio.archiveiterator import ArchiveIterator

# Function to parse WARC file and extract URLs
def extract_urls_from_warc(file_path):
    urls = []
    with open(file_path, 'rb') as file:
        for record in ArchiveIterator(file):
            if record.rec_type == 'response':
                # WARC-Target-URI is a WARC record header, not an HTTP header,
                # so it is read from the record headers rather than the payload
                url = record.rec_headers.get_header('WARC-Target-URI')
                urls.append(url)
    # Create DataFrame with extracted URLs
    df = pd.DataFrame(urls, columns=['URL'])
    return df

# Provide the path to WARC file
warc_file_path = r"./commoncrawl.warc/commoncrawl.warc"
# Call the function to extract URLs from the WARC file and create a DataFrame
df = extract_urls_from_warc(warc_file_path)
# Display the DataFrame with URLs
print(df)
After running this code, I received this error message:
ArchiveLoadFailed: Unknown archive format, first line: ['crawl-data/CC-MAIN-2023-14/segments/1679296943471.24/warc/CC-MAIN-20230320083513-20230320113513-00000.warc.gz']
I am using Python 3.10.9 in Jupyter.
I want to read a .WARC file and extract the page URLs from it in Jupyter.
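The first line quoted in the error looks like a relative path to a .warc.gz file rather than the start of a WARC record, so the file I decompressed may be a listing of paths and not an archive. That is only my guess from the message. Below is a minimal sketch for checking what a file starts with; `sniff_archive_kind` is a hypothetical helper name of my own, and the `crawl-data/` prefix test is an assumption based solely on the line shown in the error.

```python
def sniff_archive_kind(file_path):
    """Guess what a file contains from its first bytes (hypothetical helper)."""
    with open(file_path, "rb") as f:
        head = f.read(512)

    # gzip streams start with the magic bytes 0x1f 0x8b
    if head.startswith(b"\x1f\x8b"):
        return "gzip"

    # An uncompressed WARC record starts with a version line such as "WARC/1.0"
    if head.startswith(b"WARC/"):
        return "warc"

    # Assumption: a paths listing has one relative path per line, shaped like
    # the line quoted in the error message above
    lines = head.splitlines()
    first_line = lines[0].strip() if lines else b""
    if first_line.startswith(b"crawl-data/") and first_line.endswith(b".warc.gz"):
        return "paths-listing"

    return "unknown"
```

Calling `sniff_archive_kind(warc_file_path)` on my file would tell me whether it is worth passing to `ArchiveIterator` at all: a result of "gzip" or "warc" suggests a real archive, while "paths-listing" would suggest the file only names the archives.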