Questions tagged [common-crawl]

Open crawl of the web that can be accessed and analyzed by everyone.

Common Crawl is a non-profit organization that builds and maintains an open crawl of the web, which anyone can access and analyze. The organization states that its crawler respects nofollow and robots.txt policies.

Common Crawl makes available a roughly 100 TB web archive of about 6 billion web pages crawled between 2008 and 2012. The crawl data is kept in an Amazon Public Datasets S3 bucket and is freely downloadable. Common Crawl also publishes an open-source library for processing the data with Hadoop, as well as its crawler.

Web site: http://commoncrawl.org/

71 questions
1
vote
2 answers

Delimiter between two records of a warc.gz file of common crawl

I want to parse a warc.gz file downloaded from Common Crawl. I have a requirement to parse the news warc.gz file manually. What is the delimiter between two records?
Ravi Ranjan
  • 353
  • 1
  • 6
  • 22
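Per the WARC specification, a record ends with two CRLF sequences, so consecutive records are separated by two blank lines; splitting on that manually is fragile, though, and iterating with the warcio library is usually simpler. A minimal sketch, assuming a locally downloaded news WARC file (the file name below is hypothetical):

```python
# Sketch: iterate over the records of a Common Crawl news WARC file with warcio.
# The file name is a placeholder for whatever warc.gz was downloaded.
from warcio.archiveiterator import ArchiveIterator

with open("CC-NEWS-example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            payload = record.content_stream().read()   # raw HTTP payload (HTML bytes)
            print(url, len(payload))
```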
1
vote
0 answers

requests.get() not crawling entire common crawl records for a given warc path

I have implemented https://dmorgan.info/posts/common-crawl-python/ as described in that link. However, I want to crawl the entire data rather than the partial data described in the post. So, in this code chunk, def get_partial_warc_file(url,…
Ravi Ranjan
  • 353
  • 1
  • 6
  • 22
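The partial fetch in that post works by sending an HTTP Range header; to pull a whole WARC file instead, the range can simply be dropped and the response streamed to disk. A minimal sketch, where the WARC path is a placeholder taken from a crawl's paths listing:

```python
# Sketch: download an entire WARC file instead of a byte range.
# The path below is a placeholder, not a real file name.
import requests

warc_path = "crawl-data/CC-MAIN-2017-09/segments/.../warc/....warc.gz"  # placeholder
url = "https://data.commoncrawl.org/" + warc_path

with requests.get(url, stream=True) as resp:
    resp.raise_for_status()
    with open("full_file.warc.gz", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            out.write(chunk)
```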
1
vote
0 answers

Fixing broken punctuation in CommonCrawl Text

I'm processing the text from Common Crawl (the WET format) and from what I see, there's a lot of broken punctuation, most likely caused when line breaks were removed from the original data. For example, in "This Massive Rally?The 52", the question…
Alexey Grigorev
  • 2,415
  • 28
  • 47
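A common workaround is to re-insert a space wherever sentence-ending punctuation is immediately followed by a capital letter. This is only a heuristic and will occasionally split legitimate tokens such as abbreviations; a minimal sketch:

```python
# Sketch: heuristically repair punctuation that lost its trailing space
# when line breaks were stripped, e.g. "Rally?The 52" -> "Rally? The 52".
# Note: this can also split abbreviations like "U.S.A" -> "U. S. A".
import re

def fix_punctuation(text: str) -> str:
    # Insert a space after . ! ? when directly followed by an uppercase letter.
    return re.sub(r'([.!?])(?=[A-Z])', r'\1 ', text)

print(fix_punctuation("This Massive Rally?The 52"))
# -> "This Massive Rally? The 52"
```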
1
vote
1 answer

How to download a subset of Amazon CommonCrawl (only the text (WET files?) is needed)

For research purposes, I want a large (~100K) set of web pages, though I am only interested in their text. I plan to use them for a gensim LDA topic model. Common Crawl seems like a good place to start, but I am not sure how to do it. Could someone…
UriCS
  • 175
  • 2
  • 11
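One way to get a manageable text-only subset is to download a single crawl's wet.paths.gz listing, take the first few WET files, and read the plain-text records with warcio. A sketch, where the crawl ID is an assumption and any crawl listed on commoncrawl.org would work:

```python
# Sketch: fetch a handful of WET (extracted text) files from one crawl.
# The crawl ID "CC-MAIN-2017-09" is an example; substitute any published crawl.
import gzip
import requests
from warcio.archiveiterator import ArchiveIterator

BASE = "https://data.commoncrawl.org/"
paths_url = BASE + "crawl-data/CC-MAIN-2017-09/wet.paths.gz"
paths = gzip.decompress(requests.get(paths_url).content).decode().splitlines()

for path in paths[:3]:                                   # first 3 WET files only
    resp = requests.get(BASE + path, stream=True)
    for record in ArchiveIterator(resp.raw):
        if record.rec_type == "conversion":              # WET plain-text records
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", "replace")
            print(url, len(text))
```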
1
vote
2 answers

Reading the first 100 lines

Please have a look at the following code: wcmapper.php (mapper for a Hadoop streaming job) #!/usr/bin/php
Dongle
  • 602
  • 1
  • 8
  • 18
1
vote
0 answers

Copying HDFS-format files from S3 to local

We are using Amazon EMR and Common Crawl to perform crawling. EMR writes the output to Amazon S3 in a binary-like format. We'd like to copy that to our local machine in raw-text format. How can we achieve that? What's the best way? Normally we could hadoop…
aladagemre
  • 592
  • 5
  • 16
0
votes
1 answer

How to access Columnar URL INDEX using Amazon Athena

I am new to AWS and I'm following this tutorial to access the columnar dataset in Common Crawl. I executed this query: SELECT COUNT(*) AS count, url_host_registered_domain FROM "ccindex"."ccindex" WHERE crawl = 'CC-MAIN-2018-05' AND subset =…
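Once the ccindex table has been created as in the tutorial, such a query can also be submitted programmatically. A sketch using boto3, where the completed WHERE clause and GROUP BY are assumptions about what the truncated query intends, and the results bucket is a placeholder you must replace with your own:

```python
# Sketch: run a query against the Common Crawl columnar URL index via Athena.
# "s3://my-athena-results/" is a placeholder for your own output bucket.
import boto3

QUERY = """
SELECT COUNT(*) AS count, url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
GROUP BY url_host_registered_domain
ORDER BY count DESC
LIMIT 20
"""

athena = boto3.client("athena", region_name="us-east-1")  # the data lives in us-east-1
resp = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "ccindex"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])  # poll get_query_execution() until the query completes
```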
0
votes
1 answer

Extracting the payload of a single Common Crawl WARC

I can query all occurrences of a certain base URL within a given Common Crawl index, save them all to a file, and get a specific article (test_article_num) using the code below. However, I have not come across a way to extract the raw HTML for that…
js16
  • 43
  • 5
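Given the warc_filename, offset, and length returned by an index lookup, the usual pattern is to fetch just that byte range over HTTP and let warcio parse the single record. A sketch with placeholder values (the filename, offset, and length must come from your own index query):

```python
# Sketch: fetch one record's byte range and extract the raw HTML payload.
# filename, offset, and length are placeholders; use values from the index lookup.
import io
import requests
from warcio.archiveiterator import ArchiveIterator

filename = "crawl-data/CC-MAIN-2018-05/segments/.../warc/....warc.gz"  # placeholder
offset, length = 994879995, 27549                                      # placeholders

resp = requests.get(
    "https://data.commoncrawl.org/" + filename,
    headers={"Range": f"bytes={offset}-{offset + length - 1}"},
)

for record in ArchiveIterator(io.BytesIO(resp.content)):
    html = record.content_stream().read()   # raw HTML bytes of the response record
    print(html[:200])
```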
0
votes
1 answer

Common crawl request with node-fetch, axios or got

I am trying to port my C# Common Crawl code to Node.js and I'm getting errors with all HTTP libraries (node-fetch, axios, or got) when fetching a single page's HTML from the Common Crawl S3 archive. const offset = 994879995; const length = 27549; const…
Vikash Rathee
  • 1,776
  • 2
  • 25
  • 43
0
votes
1 answer

How to get a listing of WARC files using HTTP for Common Crawl News Dataset?

I can obtain a listing for Common Crawl via https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-09/wet.paths.gz. How can I do this with the Common Crawl News Dataset? I tried different options, but I always get…
Andrey
  • 5,932
  • 3
  • 17
  • 35
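The news dataset has no single paths file covering the whole collection, so one reliable way to enumerate its WARC files is to list the bucket by prefix using anonymous S3 access. A sketch for one month, where the year/month prefix is an assumption:

```python
# Sketch: list CC-NEWS WARC files for one month via anonymous (unsigned) S3 listing.
# The prefix (year/month) is an example; adjust as needed.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

paginator = s3.get_paginator("list_objects_v2")
pages = paginator.paginate(Bucket="commoncrawl", Prefix="crawl-data/CC-NEWS/2017/09/")

for page in pages:
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".warc.gz"):
            print(obj["Key"])
```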
0
votes
1 answer

Getting date of first crawl of URL by Common Crawl?

In Common Crawl, the same URL can be harvested multiple times. For instance, a Reddit blog post can be crawled when it was created and then again when subsequent comments were added. Is there a way to find when a given URL was crawled for the first time by…
dzieciou
  • 4,049
  • 8
  • 41
  • 85
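Each monthly crawl has its own CDX index, so finding the first capture of a URL means querying the oldest indexes first and reading the timestamp field of the earliest hit. A sketch using the public index server, where the lookup URL is only an example:

```python
# Sketch: find the earliest capture of a URL across Common Crawl's CDX indexes.
# The lookup URL is just an example pattern.
import json
import requests

indexes = requests.get("https://index.commoncrawl.org/collinfo.json").json()

url = "reddit.com/r/MachineLearning/*"                   # example lookup pattern
earliest = None

for idx in sorted(indexes, key=lambda c: c["id"]):       # oldest crawls first
    resp = requests.get(idx["cdx-api"], params={"url": url, "output": "json", "limit": 1})
    if resp.status_code != 200 or not resp.text.strip():
        continue                                          # no captures in this crawl
    capture = json.loads(resp.text.splitlines()[0])
    earliest = capture["timestamp"]                       # e.g. '20170920120000'
    break

print(earliest)
```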
0
votes
1 answer

Streaming in a gzipped file from s3 in python

Hi, I'm working on a fun project with the Common Crawl data. I have a subset of the most recent crawl's WARC file paths from here, so basically I have a URL like…
Tyler
  • 2,346
  • 6
  • 33
  • 59
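For gzipped WARC/WET files the whole archive never needs to sit on disk or in memory at once: requests can stream the body and warcio decompresses record by record as it reads. A short sketch with a placeholder path:

```python
# Sketch: stream a gzipped WARC file from Common Crawl without downloading it fully.
# The path is a placeholder taken from a warc.paths / wet.paths listing.
import requests
from warcio.archiveiterator import ArchiveIterator

path = "crawl-data/CC-MAIN-2017-09/segments/.../warc/....warc.gz"  # placeholder
resp = requests.get("https://data.commoncrawl.org/" + path, stream=True)

for record in ArchiveIterator(resp.raw):     # decompresses as it reads
    if record.rec_type == "response":
        print(record.rec_headers.get_header("WARC-Target-URI"))
```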
0
votes
1 answer

How to retrieve the HTML of a page from CommonCrawl?

Assuming I have: the link to the CC*.warc file (and the file itself, if it helps); the offset; and the length. How can I get the HTML content of that page? Thanks for your time and attention.
Lucas Azevedo
  • 1,867
  • 22
  • 39
0
votes
0 answers

Deploying pyspark CommonCrawl repo to EMR

I'm trying to extract WET files from the public CommonCrawl data hosted on S3 from my EMR cluster. To do this, CommonCrawl has a cc-pyspark repo where they provide examples and instructions; however, I don't understand the instructions for getting things…
willwrighteng
  • 1,411
  • 11
  • 25
0
votes
1 answer

AWS credentials required for Common Crawl S3 buckets

I'm trying to access the Common Crawl news S3 bucket, but I keep getting a "fatal error: Unable to locate credentials" message. Any suggestions for how to get around this? As far as I was aware, Common Crawl doesn't even require credentials.
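The buckets are public, so the usual fix is to make unsigned (anonymous) requests rather than configuring credentials, e.g. `--no-sign-request` with the AWS CLI or an unsigned boto3 client. A sketch, where the object key is a placeholder:

```python
# Sketch: anonymous (unsigned) access to the public Common Crawl news bucket,
# avoiding the "Unable to locate credentials" error. The key is a placeholder.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
s3.download_file(
    "commoncrawl",
    "crawl-data/CC-NEWS/2017/09/CC-NEWS-example.warc.gz",  # placeholder key
    "local.warc.gz",
)
```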