Questions tagged [common-crawl]

Open crawl of the web that can be accessed and analyzed by everyone.

Common Crawl is a non-profit organization that builds and maintains an open crawl of the web, which anyone can access and analyze. The organization states that its crawler respects nofollow and robots.txt policies.

Common Crawl makes available a roughly 100 TB web archive of about 6 billion web pages crawled between 2008 and 2012. The crawl data is kept in an Amazon Public Datasets S3 bucket and is freely downloadable. Common Crawl also publishes an open-source library for processing the data with Hadoop, as well as its crawler.

Web site: http://commoncrawl.org/

71 questions
1
vote
2 answers

Delimiter between two records of a warc.gz file of common crawl

I want to parse a warc.gz file downloaded from Common Crawl. I have a requirement to parse the news warc.gz file manually. What is the delimiter between two records?
Ravi Ranjan
  • 353
  • 1
  • 6
  • 22
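Per the WARC specification, a record ends with two CRLF sequences, so consecutive records are separated by two blank lines; splitting on that manually is fragile, though, and iterating with the warcio library is usually simpler. A minimal sketch, assuming a locally downloaded news WARC file (the file name below is hypothetical):

```python
# Sketch: iterate over the records of a Common Crawl news WARC file with warcio.
# The file name is a placeholder for whatever warc.gz was downloaded.
from warcio.archiveiterator import ArchiveIterator

with open("CC-NEWS-example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            payload = record.content_stream().read()   # raw HTTP payload (HTML bytes)
            print(url, len(payload))
```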
1
vote
0 answers

requests.get() not crawling entire common crawl records for a given warc path

I have implemented https://dmorgan.info/posts/common-crawl-python/ as described in that link. However, I want to crawl the entire data rather than the partial data described in the post. So, in this code chunk, def get_partial_warc_file(url,…
Ravi Ranjan
  • 353
  • 1
  • 6
  • 22
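The partial fetch in that post works by sending an HTTP Range header; to pull a whole WARC file instead, the range can simply be dropped and the response streamed to disk. A minimal sketch, where the WARC path is a placeholder taken from a crawl's paths listing:

```python
# Sketch: download an entire WARC file instead of a byte range.
# The path below is a placeholder, not a real file name.
import requests

warc_path = "crawl-data/CC-MAIN-2017-09/segments/.../warc/....warc.gz"  # placeholder
url = "https://data.commoncrawl.org/" + warc_path

with requests.get(url, stream=True) as resp:
    resp.raise_for_status()
    with open("full_file.warc.gz", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            out.write(chunk)
```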
1
vote
0 answers

Fixing broken punctuation in CommonCrawl Text

I'm processing the text from Common Crawl (the WET format) and from what I see, there's a lot of broken punctuation, most likely caused when line breaks were removed from the original data. For example, in "This Massive Rally?The 52", the question…
Alexey Grigorev
  • 2,415
  • 28
  • 47
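A common workaround is to re-insert a space wherever sentence-ending punctuation is immediately followed by a capital letter. This is only a heuristic and will occasionally split legitimate tokens such as abbreviations; a minimal sketch:

```python
# Sketch: heuristically repair punctuation that lost its trailing space
# when line breaks were stripped, e.g. "Rally?The 52" -> "Rally? The 52".
# Note: this can also split abbreviations like "U.S.A" -> "U. S. A".
import re

def fix_punctuation(text: str) -> str:
    # Insert a space after . ! ? when directly followed by an uppercase letter.
    return re.sub(r'([.!?])(?=[A-Z])', r'\1 ', text)

print(fix_punctuation("This Massive Rally?The 52"))
# -> "This Massive Rally? The 52"
```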
1
vote
1 answer

How to download a subset of Amazon CommonCrawl (only the text (WET files?) is needed)

For research purposes, I want a large (~100K) set of web pages, though I am only interested in their text. I plan to use them for a gensim LDA topic model. Common Crawl seems like a good place to start, but I am not sure how to do it. Could someone…
UriCS
  • 175
  • 2
  • 11
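One way to get a manageable text-only subset is to download a single crawl's wet.paths.gz listing, take the first few WET files, and read the plain-text records with warcio. A sketch, where the crawl ID is an assumption and any crawl listed on commoncrawl.org would work:

```python
# Sketch: fetch a handful of WET (extracted text) files from one crawl.
# The crawl ID "CC-MAIN-2017-09" is an example; substitute any published crawl.
import gzip
import requests
from warcio.archiveiterator import ArchiveIterator

BASE = "https://data.commoncrawl.org/"
paths_url = BASE + "crawl-data/CC-MAIN-2017-09/wet.paths.gz"
paths = gzip.decompress(requests.get(paths_url).content).decode().splitlines()

for path in paths[:3]:                                   # first 3 WET files only
    resp = requests.get(BASE + path, stream=True)
    for record in ArchiveIterator(resp.raw):
        if record.rec_type == "conversion":              # WET plain-text records
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", "replace")
            print(url, len(text))
```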
1
vote
2 answers

Reading the first 100 lines

Please have a look at the following code: wcmapper.php (mapper for a Hadoop streaming job) #!/usr/bin/php
Dongle
  • 602
  • 1
  • 8
  • 18
1
vote
0 answers

Copying HDFS-format files from S3 to local

We are using Amazon EMR and Common Crawl to perform crawling. EMR writes the output to Amazon S3 in a binary-like format. We'd like to copy that to our local machine in raw-text format. How can we achieve that? What's the best way? Normally we could hadoop…
aladagemre
  • 592
  • 5
  • 16
0
votes
1 answer

How to access Columnar URL INDEX using Amazon Athena

I am new to AWS and I'm following this tutorial to access the columnar dataset in Common Crawl. I executed this query: SELECT COUNT(*) AS count, url_host_registered_domain FROM "ccindex"."ccindex" WHERE crawl = 'CC-MAIN-2018-05' AND subset =…
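Once the ccindex table has been created as in the tutorial, such a query can also be submitted programmatically. A sketch using boto3, where the completed WHERE clause and GROUP BY are assumptions about what the truncated query intends, and the results bucket is a placeholder you must replace with your own:

```python
# Sketch: run a query against the Common Crawl columnar URL index via Athena.
# "s3://my-athena-results/" is a placeholder for your own output bucket.
import boto3

QUERY = """
SELECT COUNT(*) AS count, url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
GROUP BY url_host_registered_domain
ORDER BY count DESC
LIMIT 20
"""

athena = boto3.client("athena", region_name="us-east-1")  # the data lives in us-east-1
resp = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "ccindex"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])  # poll get_query_execution() until the query completes
```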
0
votes
1 answer

Extracting the payload of a single Common Crawl WARC

I can query all occurrences of a certain base URL within a given Common Crawl index, save them all to a file, and get a specific article (test_article_num) using the code below. However, I have not come across a way to extract the raw HTML for that…
js16
  • 43
  • 5
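Given the warc_filename, offset, and length returned by an index lookup, the usual pattern is to fetch just that byte range over HTTP and let warcio parse the single record. A sketch with placeholder values (the filename, offset, and length must come from your own index query):

```python
# Sketch: fetch one record's byte range and extract the raw HTML payload.
# filename, offset, and length are placeholders; use values from the index lookup.
import io
import requests
from warcio.archiveiterator import ArchiveIterator

filename = "crawl-data/CC-MAIN-2018-05/segments/.../warc/....warc.gz"  # placeholder
offset, length = 994879995, 27549                                      # placeholders

resp = requests.get(
    "https://data.commoncrawl.org/" + filename,
    headers={"Range": f"bytes={offset}-{offset + length - 1}"},
)

for record in ArchiveIterator(io.BytesIO(resp.content)):
    html = record.content_stream().read()   # raw HTML bytes of the response record
    print(html[:200])
```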
0
votes
1 answer

Common crawl request with node-fetch, axios or got

I am trying to port my C# Common Crawl code to Node.js and I'm getting errors with all HTTP libraries (node-fetch, axios, or got) when fetching a single page's HTML from the Common Crawl S3 archive. const offset = 994879995; const length = 27549; const…
Vikash Rathee
  • 1,776
  • 2
  • 25
  • 43
0
votes
1 answer

How to get a listing of WARC files using HTTP for Common Crawl News Dataset?

I can obtain a listing for Common Crawl via https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-09/wet.paths.gz. How can I do this with the Common Crawl News Dataset? I tried different options, but I always get…
Andrey
  • 5,932
  • 3
  • 17
  • 35
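The news dataset has no single paths file covering the whole collection, so one reliable way to enumerate its WARC files is to list the bucket by prefix using anonymous S3 access. A sketch for one month, where the year/month prefix is an assumption:

```python
# Sketch: list CC-NEWS WARC files for one month via anonymous (unsigned) S3 listing.
# The prefix (year/month) is an example; adjust as needed.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

paginator = s3.get_paginator("list_objects_v2")
pages = paginator.paginate(Bucket="commoncrawl", Prefix="crawl-data/CC-NEWS/2017/09/")

for page in pages:
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".warc.gz"):
            print(obj["Key"])
```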
0
votes
1 answer

Getting date of first crawl of URL by Common Crawl?

In Common Crawl, the same URL can be harvested multiple times. For instance, a Reddit blog post can be crawled when it was created and then again when subsequent comments were added. Is there a way to find when a given URL was crawled for the first time by…
dzieciou
  • 4,049
  • 8
  • 41
  • 85
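Each monthly crawl has its own CDX index, so finding the first capture of a URL means querying the oldest indexes first and reading the timestamp field of the earliest hit. A sketch using the public index server, where the lookup URL is only an example:

```python
# Sketch: find the earliest capture of a URL across Common Crawl's CDX indexes.
# The lookup URL is just an example pattern.
import json
import requests

indexes = requests.get("https://index.commoncrawl.org/collinfo.json").json()

url = "reddit.com/r/MachineLearning/*"                   # example lookup pattern
earliest = None

for idx in sorted(indexes, key=lambda c: c["id"]):       # oldest crawls first
    resp = requests.get(idx["cdx-api"], params={"url": url, "output": "json", "limit": 1})
    if resp.status_code != 200 or not resp.text.strip():
        continue                                          # no captures in this crawl
    capture = json.loads(resp.text.splitlines()[0])
    earliest = capture["timestamp"]                       # e.g. '20170920120000'
    break

print(earliest)
```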
0
votes
1 answer

Streaming in a gzipped file from s3 in python

Hi, I'm working on a fun project with the Common Crawl data. I have a subset of the most recent crawl's WARC file paths from here, so basically I have a URL like…
Tyler
  • 2,346
  • 6
  • 33
  • 59
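For gzipped WARC/WET files the whole archive never needs to sit on disk or in memory at once: requests can stream the body and warcio decompresses record by record as it reads. A short sketch with a placeholder path:

```python
# Sketch: stream a gzipped WARC file from Common Crawl without downloading it fully.
# The path is a placeholder taken from a warc.paths / wet.paths listing.
import requests
from warcio.archiveiterator import ArchiveIterator

path = "crawl-data/CC-MAIN-2017-09/segments/.../warc/....warc.gz"  # placeholder
resp = requests.get("https://data.commoncrawl.org/" + path, stream=True)

for record in ArchiveIterator(resp.raw):     # decompresses as it reads
    if record.rec_type == "response":
        print(record.rec_headers.get_header("WARC-Target-URI"))
```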
0
votes
1 answer

How to retrieve the HTML of a page from CommonCrawl?

Assuming I have: the link to the CC*.warc file (and the file itself, if it helps); the offset; and the length. How can I get the HTML content of that page? Thanks for your time and attention.
Lucas Azevedo
  • 1,867
  • 22
  • 39
0
votes
0 answers

Deploying pyspark CommonCrawl repo to EMR

I'm trying to extract WET files from the public CommonCrawl data hosted on S3 from my EMR cluster. To do this, CommonCrawl has a cc-pyspark repo where they provide examples and instructions; however, I don't understand the instructions for getting things…
willwrighteng
  • 1,411
  • 11
  • 25
0
votes
1 answer

AWS credentials required for Common Crawl S3 buckets

I'm trying to access the Common Crawl news S3 bucket, but I keep getting a "fatal error: Unable to locate credentials" message. Any suggestions for how to get around this? As far as I was aware, Common Crawl doesn't even require credentials.
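The buckets are public, so the usual fix is to make unsigned (anonymous) requests rather than configuring credentials, e.g. `--no-sign-request` with the AWS CLI or an unsigned boto3 client. A sketch, where the object key is a placeholder:

```python
# Sketch: anonymous (unsigned) access to the public Common Crawl news bucket,
# avoiding the "Unable to locate credentials" error. The key is a placeholder.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
s3.download_file(
    "commoncrawl",
    "crawl-data/CC-NEWS/2017/09/CC-NEWS-example.warc.gz",  # placeholder key
    "local.warc.gz",
)
```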